Pandora Sourcecodes - pandora-kernel.git/log

[SCSI] sd: fix array cache flushing bug causing performance problems

Some arrays synchronize their full non volatile cache when the sd driver sends
a SYNCHRONIZE CACHE command. Unfortunately, they can have Terrabytes of this
and we send a SYNCHRONIZE CACHE for every barrier if an array reports it has a
writeback cache. This leads to massive slowdowns on journalled filesystems.

The fix is to allow userspace to turn off the writeback cache setting as a
temporary measure (i.e. without doing the MODE SELECT to write it back to the
device), so even though the device reported it has a writeback cache, the
user, knowing that the cache is non volatile and all they care about is
filesystem correctness, can turn that bit off in the kernel and avoid the
performance ruinous (and safety irrelevant) SYNCHRONIZE CACHE commands.

The way you do this is add a 'temporary' prefix when performing the usual
cache setting operations, so

echo temporary write through > /sys/class/scsi_disk/<disk>/cache_type

Reported-by: Ric Wheeler <rwheeler@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] qla2xxx: qla2x00_sp_compl can be static.

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] qla2xxx: fix sparse warning "large integer implicitly truncated to unsigned type"

Found by 0 day test project

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Armen Baloyan <armen.baloyan@qlogic.com>
Signed-off-by: Saurav Kashyap <saurav.kashyap@qlogic.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] scsi_debug: fix logical block provisioning support

provisioning map (map_storep) is a bitmap accessed by bitops.

So the allocation size should be a multiple of sizeof(unsigned long) and
also the bitmap should be cleared by using bitmap_clear() instead of
memset().

Otherwise it will cause problem on big-endian architecture if the number of
bits is not a multiple of BITS_PER_LONG.

I tried testing the logical block provisioning support in scsi_debug,
but it didn't work as I expected.

For example, load scsi_debug module with UNMAP command supported
and fill the storage with random data.

        # modprobe scsi_debug lbpu=1
        # dd if=/dev/urandom of=/dev/sdb

Then, try to unmap LBA 0, but Get LBA status reports:

        # sg_unmap --lba=0 --num=1 /dev/sdb
        # sg_get_lba_status --lba=0 /dev/sdb
        descriptor LBA: 0x0000000000000000  blocks: 16384  mapped

This is unexpected result.  Because UNMAP command to LBA 0 finished
without any errors, but Get LBA status shows that LBA 0 is still mapped.

This problem is due to the wrong translation between LBA and index of
provisioning map.  Fix it by using correct translation functions.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] scsi_debug: clear correct memory region when LBPRZ is enabled

The function unmap_region() clears memory region specified as the logical
block address and the number of logical blocks in ramdisk storage
(fake_storep) if lbpu and lbprz module parameters are enabled.

In the while loop of unmap_region(), it advances optimal unmap granularity
in logical blocks. But it only clears one logical block at LBA 'block' per
loop iteration. And furthermore, the 'block' is not pointing to a logical
block address which should be cleared, it is a index of probisioning map
(map_storep).

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] scsi_debug: prohibit scsi_debug_unmap_granularity == scsi_debug_unmap_alignment

scsi_debug prohibits setting scsi_debug_unmap_alignment to be greater
than scsi_debug_unmap_granularity.  But setting them to be the same value
is not prohibited.  In this case, the only difference with
scsi_debug_unmap_alignment == 0 is the logical blocks from 0 to
scsi_debug_unmap_alignment - 1 cannot be unmapped.  But the difference is
not properly handled in the current code.

So this prohibits such unusual setting.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] scsi_debug: call map_region() and unmap_region() only when needed

If the logical block provisioning is not enabled, map_region() and
unmap_region() have no effect and they don't need to be called.

So this makes map_region() and unmap_region() to be called only
when scsi_debug_lbp() returns true, i.e. logical block provisioning is
enabled.

While I'm at it, this also removes meaningless non-zero check for
scsi_debug_unmap_granularity.

Because scsi_debug_unmap_granularity cannot be zero with usual setting:
scsi_debug_unmap_granularity is 1 by default, and it can be changed to
zero with explicit module parameter setting only when the logical block
provisioning is disabled. But it is only meaningful module parameter
when the logical block provisioning is enabled.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] sd_dif: problem with verify of type 1 protection information (PI)

It appears to me that there is a problem with handling of type 1 protection
information.

It is considering a logical block reference tag of 0xffffffff to be an error,
but it is actually valid any time ((lba & 0xffffffff) == 0xffffffff) [for
example, 2TiB-1, 4TiB-1, 6TiB-1, etc.].

I'm going by what's written in 4.18.3 of SBC3, where there doesn't appear
to be any invalid value for the reference tag.

Signed-off-by: Jeremy Higdon <jeremy@sgi.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] ipr: SATA DVD probing failed with 64bit adapter

Driver passed the wrong IOADL address to IOA adapter. The patch
fixes the issue.

Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] scsi_transport_iscsi: fix error return code in iscsi_transport_init()

Fix to return -ENOMEM in the create workqueue error case
instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Update lpfc version for 8.3.39 driver release

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed driver handling of CLEAR_LA with NPIV enabled causing SID=0 frames out

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Reduced tmo value set to FLOGI WQE for quick recovery from FLOGI sequence timeout

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Add log message when completes with clean address bit set to zero

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed driver vector mapping to CPU affinity

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed iocb flags not being reset for scsi commands

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed system panic during EEH recovery due to midlayer acting on outstanding I/O

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed not returning FAILED status when SCSI invoking host reset handler failed

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed bad book keeping in posting els sgls to port

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed deadlock between hbalock and nlp_lock use

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed BlockGuard to take advantage of rdprotect/wrprotect info when available

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Reduced spinlock contention on SCSI buffer list

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed crash when processing bsg's sg list with high memory pages

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Remove lpfc_fcp_look_ahead module parameter

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fix driver issues with SCSI Host reset

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Doorbell formation information logged in dual-chute mode WQ and RQ setup

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fix driver issues with large s/g lists for BlockGuard

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fix driver issues with large lpfc_sg_seg_cnt values

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed pt2pt and loop discovery problems on topology changes.

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Remove driver dependency on HZ

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed BlockGuard error reporting

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] lpfc 8.3.39: Fixed VPI allocation issues after firmware dump is performed

Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] ipr: Need to reset adapter after the 6th EEH error

Add reset adapter after the 6th EEH errors in ipr driver. This triggers
the adapter reset via the PCI config space even when the slot is frozen.

Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] qla4xxx: Update driver version to 5.03.00-k9

Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] qla4xxx: Assign values using correct datatype

Assign values using correct datatype in function qla4xxx_copy_to_fwddb_param()

Signed-off-by: Adheer Chandravanshi <adheer.chandravanshi@qlogic.com>
Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] qla4xxx: Fix smatch warnings

Fix following smatch warnings:-
drivers/scsi/qla4xxx/ql4_os.c:6573
qla4xxx_sysfs_ddb_set_param() warn: possible memory leak of 'fw_ddb_entry'
drivers/scsi/qla4xxx/ql4_os.c:6596
qla4xxx_sysfs_ddb_delete() warn: variable dereferenced before check 'fnode_sess'
(see line 6584)
drivers/scsi/qla4xxx/ql4_os.c:6632
qla4xxx_sysfs_ddb_delete() error: potential NULL dereference 'fw_ddb_entry'.

Signed-off-by: Adheer Chandravanshi <adheer.chandravanshi@qlogic.com>
Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] qla4xxx: Fix sparse warning for qla4xxx_sysfs_ddb_tgt_create

Fix following warning:
drivers/scsi/qla4xxx/ql4_os.c:5507:5:
warning: symbol 'qla4xxx_sysfs_ddb_tgt_create' was not declared. Should it be
static?

Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] qla4xxx: Silence gcc warning

Fix followig gcc warning:-
drivers/scsi/qla4xxx/ql4_os.c: In function ‘qla4xxx_sysfs_ddb_get_param’:
drivers/scsi/qla4xxx/ql4_os.c:6279:
warning: comparison is always true due to limited range of data type
drivers/scsi/qla4xxx/ql4_os.c:6290:
warning: comparison is always true due to limited range of data type
drivers/scsi/qla4xxx/ql4_os.c: In function ‘qla4xxx_sysfs_ddb_delete’:
drivers/scsi/qla4xxx/ql4_os.c:6593:
warning: ‘ddb_size’ may be used uninitialized in this function

Signed-off-by: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] libosd: remover duplicate __bitwise annotation

__be32 is already a __bitwise type so we don't need the second __bitwise
here. It causes a Sparse error:
include/scsi/osd_protocol.h:110:26: error: invalid modifier

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] megaraid_sas: release lock on error path

We should unlock here before returning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Adam Radford <aradford@gmail.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] ibmvfc: Driver version 1.0.11

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] ibmvfc: Suppress ABTS if target gone

Adds support for a new VIOS feature that allows ibmvfc to
optimize terminate_rport_io by telling the VIOS the target
is no longer accessible on the fabric and that it should
not send an ABTS out on the fabric to the device.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] ibmvfc: Send cancel when link is down

If attempting to abort requests due to a fail fail timeout
or error handling while the link is down, we cannot send
an abort out on the fabric. We can, however, send a cancel
to the VIOS. This fixes ibmvfc to send a cancel in this
case to prevent error handling from failing and/or
escalating.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] ibmvfc: Support FAST_IO_FAIL in EH handlers

Adds support for receiving FAST_IO_FAIL from fc_block_scsi_eh
when in error recovery. This fixes cases of devices being
taken offline when they are no longer accessible on the fabric,
preventing them from coming back online when the fabric recovers.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] ibmvfc: Properly set cancel flags when cancelling abort

The flags on a cancel operation are intended to indicate what,
if any, TMF will follow the cancel request. This fixes a case
where we were incorrectly setting the abort task set flag on
the cancel flag when we were cancelling an abort task set.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Bump the driver version to 10.0.467.0

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix issue in passing the exp_cmdsn and max_cmdsn

Command Window value from the CQE was used to calculate the
max_cmdsn for that session.The command window value extracted
for SKH-R adapter was not proper. The value was extracted from
BE adapter completion event. Fixed the issue by getting the
cmd_wnd value from SKH-R CQE.

The exp_cmdsn and max_cmdsn values were not converted to BE format
before calling the __iscsi_complete_pdu(). Fixed the issue of converting
to BE format.

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix possible reentrancy issue in be_iopoll

The driver creates "NAPI" context per core which is fine,
however the above routine declares the ret variable as static!
Thus there is only one instance of this variable!
When this routine is called from more than one thread of execution,
than the result is unpredictable.

         static unsigned int ret;
         .....

         ret = beiscsi_process_cq(pbe_eq);
                 <--------Another thread can enter here and change "ret".
         if (ret < budget) {
                ....
         }
                 <--------Another thread can enter here and change "ret".
         return ret;

Fix - remove the "static"

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2scsi: Update copyright dates to 2013

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix checking Adapter state while establishing CXN

Before tyring to establish a CXN with the target, check if the adapter is in
a stable state

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix dynamic CID allocation Mechanism in driver

Number of CID assigned to a function from adapter can be dynamic. The CID count
for each function was fixed number before. Code Fix done so that adapters with
fixed/dynamic CID count will work with the driver.

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi : Fix the NOP-In handling code path

When target send a NOP-IN with valid TTT, driver issues a NOP-OUT
and the task was not freed from driver. The task list available for
the session used to run out, and as no more task list were available
no more iSCSI commands were exchanged on that session.
This patches fixed the issue, by calling iscsi_put_task.

Signed-off-by: Minh Tran <minhduc.tran@emulex.com>
Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix the Port Link Status issue

Check the Logical Link status also as part of the port link status.

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix displaying the Active Session Count from driver

This patch fixes the displaying of number of active
sessions in use.

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix displaying the FW Version from driver.

The mgmt_hba_attributes structure declared was not proper and
because of that the FW response returned for the MBX_CMD was not
matching. This issue went unnoticed as mgmt_hba_attribs structure
members were never used in the code path.

This fix of displaying the FW version had to change the mgmt_hba_attrib
structure also. The latest driver will also work with the older FW as
the issue was in the driver declaration.

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix support for DEFQ extension

Fix support for DEFQ extension which will be used by latest
adapters

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix MACRO for checking the adapter type

Fixed the code flow based on the MACRO defined to check for
adapter.

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix freeing CXN specific driver resources.

Free CXN specific resource held by driver when login redirection
or connection retry happens. Login redirection was failing
because WRB/SGL were not allocated from the CID on which
doorbell was rung.

Fixed the issue raised by MikeC

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix MSIX support in SKH-R to 32

This patch limits the max number of msix vectors to 32.

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix MBX Command issues

- Check Ready Bit before posting the BMBX Hi Address
- Fix the parameters passed to beiscsi_mccq_compl
in beiscsi_open_conn()
- Fix tag value check in beiscsi_ep_connect.

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix returning Failure when MBX fails with Insufficient buffer error

When MBX command fails with insufficent buffer, check for the
response lenght returned. Return success if response length
is non-zero value which indicates valid data.

Signed-off-by: Minh Tran <minhduc.tran@emulex.com>
Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] be2iscsi: Fix lack of uninitialize pattern to FW

This patch sends uninitialize pattern to FW during driver unload
which is expected by FW for cleanup

Signed-off-by: John Soni Jose <sony.john-n@emulex.com>
Signed-off-by: Jayamohan Kallickal <jayamohan.kallickal@emulex.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] csiostor: off by one error

We need to store PROTO_ERR_IMPL_LOGO (26) things here, but the
first element isn't used so the array should have 27 elements.
This matches fwevt_to_rnevt[] which has 27 elements.

The patch solves a Smatch static checker warning on my system:
drivers/scsi/csiostor/csio_rnode.c:880 csio_rnode_fwevt_handler()
error: buffer overflow '(rn)->stats.n_evt_fw' 26 <= 26

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Naresh Kumar Inna <naresh@chelsio.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] bnx2fc: Bumped version to 1.0.14

Signed-off-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] bnx2fc: Update copyright dates

Signed-off-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] bnx2fc: Fix race condition between IO completion and abort

When IO is successfully completed while an abort is pending, eh_abort
incorrectly assumes that abort failed and performes recovery by issuing
cleanup. Howerver, cleanup timesout as the firmware has no clue about
this IO. Fix this by checking if the IO has already completed.

Signed-off-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] bnx2fc: Include chip number in the symbolic name

[jejb: move PCI_DEVICE_ID definitions to include/pci_ids.h]
Signed-off-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] bnx2fc: Enable cached tasks to improve performance

Set perf_config to 3 during firmware initialization to enable both
cached connections as well as cached tasks.

Signed-off-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] fnic: Incremented driver version

Signed-off-by: Brian Uchino <buchino@cisco.com>
Signed-off-by: Hiral Patel <hiralpat@cisco.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] fnic: Kernel panic due to FIP mode misconfiguration

If switch configured in FIP and adapter configured in non-fip mode, driver
panics while queueing FIP frame in non-existing fip_frame_queue. Added config
check before queueing FIP frame in misconfiguration case to avoid kernel panic.

Signed-off-by: Hiral Patel <hiralpat@cisco.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

[SCSI] fnic: FIP VLAN Discovery Feature Support

FIP VLAN discovery discovers the FCoE VLAN that will be used by all other FIP
protocols as well as by the FCoE encapsulation for Fibre Channel payloads on
the established virtual link. One of the goals of FC-BB-5 was to be as
nonintrusive as possible on initiators and targets, and therefore FIP VLAN
discovery occurs in the native VLAN used by the initiator or target to
exchange Ethernet traffic. The FIP VLAN discovery protocol is the only FIP
protocol running on the native VLAN; all other FIP protocols run on the
discovered FCoE VLANs.

If an administrator has manually configured FCoE VLANs on ENodes and FCFs,
there is no need to use this protocol. FIP and FCoE will run over the
configured VLANs.

An ENode without FCoE VLANs configuration would use this automated discovery
protocol to discover over which VLANs FCoE is running.

The ENode sends a FIP VLAN discovery request to a multicast MAC address called
All-FCF-MACs, which is a multicast MAC address to which all FCFs listen.

All FCFs that can be reached in the native VLAN of the ENode are expected to
respond on the same VLAN with a response that lists one or more FCoE VLANs
that are available for the ENode's VN_Port login. This protocol has the sole
purpose of allowing the ENode to discover all the available FCoE VLANs.

Now the ENode may enable a subset of these VLANs for FCoE Running the FIP
protocol in these VLANs on a per VLAN basis. And FCoE data transactions also
would occur on this VLAN. Hence, Except for FIP VLAN discovery, all other FIP
and FCoE traffic runs on the selected FCoE VLAN. Its only the FIP VLAN
Discovery protocol that is permitted to run on the Default native VLAN of the
system.

[**** NOTE ****]
We are working on moving this feature definitions and functionality to libfcoe
module. We need this patch to be approved, as Suse is looking forward to merge
this feature in SLES 11 SP3 release. Once this patch is approved, we will
submit patch which should move vlan discovery feature to libfoce.

[Fengguang Wu <fengguang.wu@intel.com>: kmalloc cast removal]
Signed-off-by: Anantha Prakash T <atungara@cisco.com>
Signed-off-by: Hiral Patel <hiralpat@cisco.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>

Merge git://git./linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Highlights (1721 non-merge commits, this has to be a record of some
  sort):

   1) Add 'random' mode to team driver, from Jiri Pirko and Eric
      Dumazet.

   2) Make it so that any driver that supports configuration of multiple
      MAC addresses can provide the forwarding database add and del
      calls by providing a default implementation and hooking that up if
      the driver doesn't have an explicit set of handlers.  From Vlad
      Yasevich.

   3) Support GSO segmentation over tunnels and other encapsulating
      devices such as VXLAN, from Pravin B Shelar.

   4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

   5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
      Dukkipati.

   6) In the PHY layer, allow supporting wake-on-lan in situations where
      the PHY registers have to be written for it to be configured.

      Use it to support wake-on-lan in mv643xx_eth.

      From Michael Stapelberg.

   7) Significantly improve firewire IPV6 support, from YOSHIFUJI
      Hideaki.

   8) Allow multiple packets to be sent in a single transmission using
      network coding in batman-adv, from Martin Hundebøll.

   9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

  10) Generalize the VXLAN forwarding tables so that there is more
      flexibility in configurating various aspects of the endpoints.
      From David Stevens.

  11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
      from Dmitry Kravkov.

  12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
      Neira Ayuso.

  13) Start adding networking selftests.

  14) In situations of overload on the same AF_PACKET fanout socket, or
      per-cpu packet receive queue, minimize drop by distributing the
      load to other cpus/fanouts.  From Willem de Bruijn and Eric
      Dumazet.

  15) Add support for new payload offset BPF instruction, from Daniel
      Borkmann.

  16) Convert several drivers over to mdoule_platform_driver(), from
      Sachin Kamat.

  17) Provide a minimal BPF JIT image disassembler userspace tool, from
      Daniel Borkmann.

  18) Rewrite F-RTO implementation in TCP to match the final
      specification of it in RFC4138 and RFC5682.  From Yuchung Cheng.

  19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
      you like netlink, so I implemented netlink dumping of netlink
      sockets.") From Andrey Vagin.

  20) Remove ugly passing of rtnetlink attributes into rtnl_doit
      functions, from Thomas Graf.

  21) Allow userspace to be able to see if a configuration change occurs
      in the middle of an address or device list dump, from Nicolas
      Dichtel.

  22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
      Frederic Sowa.

  23) Increase accuracy of packet length used by packet scheduler, from
      Jason Wang.

  24) Beginning set of changes to make ipv4/ipv6 fragment handling more
      scalable and less susceptible to overload and locking contention,
      from Jesper Dangaard Brouer.

  25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
      instead.  From Hong Zhiguo.

  26) Optimize route usage in IPVS by avoiding reference counting where
      possible, from Julian Anastasov.

  27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

  28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
      Eitzenberger.

  29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
      nfnetlink_log, and nfnetlink_queue.  From Gao feng.

  30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

  31) Support several new r8169 chips, from Hayes Wang.

  32) Support tokenized interface identifiers in ipv6, from Daniel
      Borkmann.

  33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

  34) Add 802.1ad vlan offload support, from Patrick McHardy.

  35) Support mmap() based netlink communication, also from Patrick
      McHardy.

  36) Support HW timestamping in mlx4 driver, from Amir Vadai.

  37) Rationalize AF_PACKET packet timestamping when transmitting, from
      Willem de Bruijn and Daniel Borkmann.

  38) Bring parity to what's provided by /proc/net/packet socket dumping
      and the info provided by netlink socket dumping of AF_PACKET
      sockets.  From Nicolas Dichtel.

  39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
      Poirier"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  filter: fix va_list build error
  af_unix: fix a fatal race with bit fields
  bnx2x: Prevent memory leak when cnic is absent
  bnx2x: correct reading of speed capabilities
  net: sctp: attribute printl with __printf for gcc fmt checks
  netlink: kconfig: move mmap i/o into netlink kconfig
  netpoll: convert mutex into a semaphore
  netlink: Fix skb ref counting.
  net_sched: act_ipt forward compat with xtables
  mlx4_en: fix a build error on 32bit arches
  Revert "bnx2x: allow nvram test to run when device is down"
  bridge: avoid OOPS if root port not found
  drivers: net: cpsw: fix kernel warn on cpsw irq enable
  sh_eth: use random MAC address if no valid one supplied
  3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
  tg3: fix to append hardware time stamping flags
  unix/stream: fix peeking with an offset larger than data in queue
  unix/dgram: fix peeking with an offset larger than data in queue
  unix/dgram: peek beyond 0-sized skbs
  openvswitch: Remove unneeded ovs_netdev_get_ifindex()
  ...

filter: fix va_list build error

This patch fixes the following build error.

In file included from include/linux/filter.h:52:0,
from arch/arm/net/bpf_jit_32.c:14:
include/linux/printk.h:54:2: error: unknown type name ‘va_list’
include/linux/printk.h:105:21: error: unknown type name ‘va_list’
include/linux/printk.h:108:30: error: unknown type name ‘va_list’

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:
"Assorted fixes and cleanups to the existing drivers plus a new driver
  for IMS Passenger Control Unit device they use for ther in-flight
  entertainment system."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (44 commits)
  Input: trackpoint - Optimize trackpoint init to use power-on reset
  Input: apbps2 - convert to devm_ioremap_resource()
  Input: ALPS - use %ph to print buffers
  ARM - shmobile: Armadillo800EVA: Move st1232 reset pin handling
  Input: st1232 - add reset pin handling
  Input: st1232 - convert to devm_* infrastructure
  Input: MT - handle semi-mt devices in core
  Input: adxl34x - use spi_get_drvdata()
  Input: ad7877 - use spi_get_drvdata() and spi_set_drvdata()
  Input: ads7846 - use spi_get_drvdata() and spi_set_drvdata()
  Input: ims-pcu - fix a memory leak on error
  Input: sysrq - supplement reset sequence with timeout functionality
  Input: tegra-kbc - support for defining row/columns based on SoC
  Input: imx_keypad - switch to using managed resources
  Input: arc_ps2 - add support for device tree
  Input: mma8450 - fix signed 12bits to 32bits conversion
  Input: eeti_ts - remove redundant null check
  Input: edt-ft5x06 - remove redundant null check before kfree
  Input: ad714x - add CONFIG_PM_SLEEP to suspend/resume functions
  Input: adxl34x - add CONFIG_PM_SLEEP to suspend/resume functions
  ...

af_unix: fix a fatal race with bit fields

Using bit fields is dangerous on ppc64/sparc64, as the compiler [1]
uses 64bit instructions to manipulate them.
If the 64bit word includes any atomic_t or spinlock_t, we can lose
critical concurrent changes.

This is happening in af_unix, where unix_sk(sk)->gc_candidate/
gc_maybe_cycle/lock share the same 64bit word.

This leads to fatal deadlock, as one/several cpus spin forever
on a spinlock that will never be available again.

A safer way would be to use a long to store flags.
This way we are sure compiler/arch wont do bad things.

As we own unix_gc_lock spinlock when clearing or setting bits,
we can use the non atomic __set_bit()/__clear_bit().

recursion_level can share the same 64bit location with the spinlock,
as it is set only with this spinlock held.

[1] bug fixed in gcc-4.8.0 :
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

Reported-by: Ambrose Feinstein <ambrose@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnx2x'

Yuval Mintz says:

====================
This fixes 2 small bugs - one which may cause an unnecessary link flap,
and the other is a small memory leak when unloading while cnic is not
loaded.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Prevent memory leak when cnic is absent

bnx2x driver allocates searcher T2 tables, but it releases that memory
during unload only released if the cnic is loaded.

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: correct reading of speed capabilities

When the bnx2x driver reads the port configuration - mask irrelevant bits.

Without this change, the unintended bits may cause the driver to needlessly
toggle the link, as a comparison in the link flap avoidance flow will show
that the old link did not advertise the same capabilities and thus cannot
be retained.

Signed-off-by: Yaniv Rosner <yanivr@broadcom.com>
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sctp: attribute printl with __printf for gcc fmt checks

Let GCC check for format string errors in sctp's probe printl
function. This patch fixes the warning when compiled with W=1:

net/sctp/probe.c:73:2: warning: function might be possible candidate
for 'gnu_printf' format attribute [-Wmissing-format-attribute]

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: kconfig: move mmap i/o into netlink kconfig

Currently, in menuconfig, Netlink's new mmaped IO is the very first
entry under the ``Networking support'' item and comes even before
``Networking options'':

  [ ]   Netlink: mmaped IO
  Networking options  --->
  ...

Lets move this into ``Networking options'' under netlink's Kconfig,
since this might be more appropriate. Introduced by commit ccdfcc398
(``netlink: mmaped netlink: ring setup'').

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netpoll: convert mutex into a semaphore

Bart Van Assche recently reported a warning to me:

<IRQ>  [<ffffffff8103d79f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8103d7fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff814761dd>] mutex_trylock+0x16d/0x180
[<ffffffff813968c9>] netpoll_poll_dev+0x49/0xc30
[<ffffffff8136a2d2>] ? __alloc_skb+0x82/0x2a0
[<ffffffff81397715>] netpoll_send_skb_on_dev+0x265/0x410
[<ffffffff81397c5a>] netpoll_send_udp+0x28a/0x3a0
[<ffffffffa0541843>] ? write_msg+0x53/0x110 [netconsole]
[<ffffffffa05418bf>] write_msg+0xcf/0x110 [netconsole]
[<ffffffff8103eba1>] call_console_drivers.constprop.17+0xa1/0x1c0
[<ffffffff8103fb76>] console_unlock+0x2d6/0x450
[<ffffffff8104011e>] vprintk_emit+0x1ee/0x510
[<ffffffff8146f9f6>] printk+0x4d/0x4f
[<ffffffffa0004f1d>] scsi_print_command+0x7d/0xe0 [scsi_mod]

This resulted from my commit ca99ca14c which introduced a mutex_trylock
operation in a path that could execute in interrupt context.  When mutex
debugging is enabled, the above warns the user when we are in fact
exectuting in interrupt context
interrupt context.

After some discussion, It seems that a semaphore is the proper mechanism to use
here.  While mutexes are defined to be unusable in interrupt context, no such
condition exists for semaphores (save for the fact that the non blocking api
calls, like up and down_trylock must be used when in irq context).

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
CC: Bart Van Assche <bvanassche@acm.org>
CC: David Miller <davem@davemloft.net>
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: Fix skb ref counting.

Commit f9c2288837ba072b21dba955f04a4c97eaa77b1e (netlink:
implement memory mapped recvmsg) increamented skb->users
ref count twice for a dump op which does not look right.

Following patch fixes that.

CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/cmetcalf/linux-tile

Pull tile arch changes from Chris Metcalf:
"These are some minor new feature work and other changes that didn't
  merit getting pushed up after the 3.9 merge window closed.

  There should be a lot more activity in the 3.11 merge window"

* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  arch/tile: Fix syscall return value passed to tracepoint
  tile: comment assumption about __insn_mtspr for <asm/irqflags.h>
  tile: ns2cycles should use __raw_get_cpu_var
  arch: remove KCORE_ELF again [tile]
  tile: remove two outdated Kconfig entries
  tile: support atomic64_dec_if_positive()
  tile: support TIF_SYSCALL_TRACEPOINT; select HAVE_SYSCALL_TRACEPOINTS
  tile: Add definition of NR_syscalls
  tile: move declaration of sys_call_table to <asm/syscall.h>
  arch/tile: Enable HAVE_ARCH_TRACEHOOK
  arch/tile: Call tracehook_report_syscall_{entry,exit} in syscall trace

init: Do not warn on non-zero initcall return

Commit f91eb62f71b3 ("init: scream bloody murder if interrupts are
enabled too early") added three new warnings.  The first two seemed
reasonable, but the third included a warning when an initcall returned
non-zero.  Although, the third WARN() does include an imbalanced preempt
disabled, or irqs disable, it shouldn't warn if it only had an initcall
that just returns non-zero.

In fact, according to Linus, it shouldn't print at all.  As it only
prints with initcall_debug set, and that already shows enough
information to fix things.

Link: http://lkml.kernel.org/r/CA+55aFzaBC5SFi7=F2mfm+KWY5qTsBmOqgbbs8E+LUS8JK-sBg@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

net_sched: act_ipt forward compat with xtables

Deal with changes in newer xtables while maintaining backward
compatibility. Thanks to Jan Engelhardt for suggestions.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'topic/omap3isp' of git://git./linux/kernel/git/mchehab/linux-media

Pull omap3isp clk support from Mauro Carvalho Chehab:
"This patch were sent in separate as it depends on a merge from clock
framework, that you merged in commit 362ed48dee50"

* 'topic/omap3isp' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] omap3isp: Use the common clock framework

Merge branch 'next' into for-linus

Prepare first set of updates for 3.10 merge window.

Merge branch 'ipc-scalability'

Merge IPC cleanup and scalability patches from Andrew Morton.

This cleans up many of the oddities in the IPC code, uses the list
iterator helpers, splits out locking and adds per-semaphore locks for
greater scalability of the IPC semaphore code.

Most normal user-level locking by now uses futexes (ie pthreads, but
also a lot of specialized locks), but SysV IPC semaphores are apparently
still used in some big applications, either for portability reasons, or
because they offer tracking and undo (and you don't need to have a
special shared memory area for them).

Our IPC semaphore scalability was pitiful.  We used to lock much too big
ranges, and we used to have a single ipc lock per ipc semaphore array.
Most loads never cared, but some do.  There are some numbers in the
individual commits.

* ipc-scalability:
  ipc: sysv shared memory limited to 8TiB
  ipc/msg.c: use list_for_each_entry_[safe] for list traversing
  ipc,sem: fine grained locking for semtimedop
  ipc,sem: have only one list in struct sem_queue
  ipc,sem: open code and rename sem_lock
  ipc,sem: do not hold ipc lock more than necessary
  ipc: introduce lockless pre_down ipcctl
  ipc: introduce obtaining a lockless ipc object
  ipc: remove bogus lock comment for ipc_checkid
  ipc/msgutil.c: use linux/uaccess.h
  ipc: refactor msg list search into separate function
  ipc: simplify msg list search
  ipc: implement MSG_COPY as a new receive mode
  ipc: remove msg handling from queue scan
  ipc: set EFAULT as default error in load_msg()
  ipc: tighten msg copy loops
  ipc: separate msg allocation from userspace copy
  ipc: clamp with min()

ipc: sysv shared memory limited to 8TiB

Trying to run an application which was trying to put data into half of
memory using shmget(), we found that having a shmall value below 8EiB-8TiB
would prevent us from using anything more than 8TiB.  By setting
kernel.shmall greater than 8EiB-8TiB would make the job work.

In the newseg() function, ns->shm_tot which, at 8TiB is INT_MAX.

ipc/shm.c:
458 static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
459 {
...
465         int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
...
474         if (ns->shm_tot + numpages > ns->shm_ctlall)
475                 return -ENOSPC;

[akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int]
Signed-off-by: Robin Holt <holt@sgi.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc/msg.c: use list_for_each_entry_[safe] for list traversing

The ipc/msg.c code does its list operations by hand and it open-codes the
accesses, instead of using for_each_entry_[safe].

Signed-off-by: Nikola Pajkovsky <npajkovs@redhat.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc,sem: fine grained locking for semtimedop

Introduce finer grained locking for semtimedop, to handle the common case
of a program wanting to manipulate one semaphore from an array with
multiple semaphores.

If the call is a semop manipulating just one semaphore in an array with
multiple semaphores, only take the lock for that semaphore itself.

If the call needs to manipulate multiple semaphores, or another caller is
in a transaction that manipulates multiple semaphores, the sem_array lock
is taken, as well as all the locks for the individual semaphores.

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:

vanilla Davidlohr's Davidlohr's + Davidlohr's +
threads patches rwlock patches v3 patches
10 610652 726325 1783589 2142206
20 341570 365699 1520453 1977878
30 288102 307037 1498167 2037995
40 290714 305955 1612665 2256484
50 288620 312890 1733453 2650292
60 289987 306043 1649360 2388008
70 291298 306347 1723167 2717486
80 290948 305662 1729545 2763582
90 290996 306680 1736021 2757524
100 292243 306700 1773700 3059159

[davidlohr.bueso@hp.com: do not call sem_lock when bogus sma]
[davidlohr.bueso@hp.com: make refcounter atomic]
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Emmanuel Benisty <benisty.e@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc,sem: have only one list in struct sem_queue

Having only one list in struct sem_queue, and only queueing simple
semaphore operations on the list for the semaphore involved, allows us to
introduce finer grained locking for semtimedop.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc,sem: open code and rename sem_lock

Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock()
later that only locks the sem_array and does nothing else.

Open code the locking from ipc_lock() in sem_obtain_lock() so we can
introduce finer grained locking for the sem_array in the next patch.

[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc,sem: do not hold ipc lock more than necessary

Instead of holding the ipc lock for permissions and security checks, among
others, only acquire it when necessary.

Some numbers....

1) With Rik's semop-multi.c microbenchmark we can see the following
   results:

Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409

+  59.40%            a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+   6.14%            a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   3.84%            a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   3.64%            a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   2.06%            a.out  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+   1.86%            a.out  [kernel.kallsyms]  [k] ipc_lock

With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213

+  18.54%            a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+  11.72%            a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   7.70%            a.out  [kernel.kallsyms]  [k] ipc_has_perm.isra.21
+   6.58%            a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   6.54%            a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   4.71%            a.out  [kernel.kallsyms]  [k] ipc_obtain_object_check

2) While on an Oracle swingbench DSS (data mining) workload the
   improvements are not as exciting as with Rik's benchmark, we can see
   some positive numbers.  For an 8 socket machine the following are the
   percentages of %sys time incurred in the ipc lock:

Baseline (3.9-rc1):
100 swingbench users: 8,74%
400 swingbench users: 21,86%
800 swingbench users: 84,35%

With this patchset:
100 swingbench users: 8,11%
400 swingbench users: 19,93%
800 swingbench users: 77,69%

[riel@redhat.com: fix two locking bugs]
[sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc: introduce lockless pre_down ipcctl

Various forms of ipc use ipcctl_pre_down() to retrieve an ipc object and
check permissions, mostly for IPC_RMID and IPC_SET commands.

Introduce ipcctl_pre_down_nolock(), a lockless version of this function.
The locking version is retained, yet modified to call the nolock version
without affecting its semantics, thus transparent to all ipc callers.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc: introduce obtaining a lockless ipc object

Through ipc_lock() and therefore ipc_lock_check() we currently return the
locked ipc object. This is not necessary for all situations and can,
therefore, cause unnecessary ipc lock contention.

Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
functions that only lookup and return the ipc object.

Both these functions must be called within the RCU read critical section.

[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc: remove bogus lock comment for ipc_checkid

This series makes the sysv semaphore code more scalable, by reducing the
time the semaphore lock is held, and making the locking more scalable for
semaphore arrays with multiple semaphores.

The first four patches were written by Davidlohr Buesso, and reduce the
hold time of the semaphore lock.

The last three patches change the sysv semaphore code locking to be more
fine grained, providing a performance boost when multiple semaphores in a
semaphore array are being manipulated simultaneously.

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:

vanilla Davidlohr's Davidlohr's + Davidlohr's +
threads patches rwlock patches v3 patches
10 610652 726325 1783589 2142206
20 341570 365699 1520453 1977878
30 288102 307037 1498167 2037995
40 290714 305955 1612665 2256484
50 288620 312890 1733453 2650292
60 289987 306043 1649360 2388008
70 291298 306347 1723167 2717486
80 290948 305662 1729545 2763582
90 290996 306680 1736021 2757524
100 292243 306700 1773700 3059159

This patch:

There is no reason to be holding the ipc lock while reading ipcp->seq,
hence remove misleading comment.

Also simplify the return value for the function.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc/msgutil.c: use linux/uaccess.h

Signed-off-by: HoSung Jung <rain6557@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc: refactor msg list search into separate function

[fengguang.wu@intel.com: find_msg can be static]
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc: simplify msg list search

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>