cascardo/linux.git
8 years agoi40iw: pass hw_stats by reference rather than by value
Colin Ian King [Mon, 28 Mar 2016 11:14:46 +0000 (12:14 +0100)]
i40iw: pass hw_stats by reference rather than by value

passing hw_stats by value requires a 280 byte copy so instead
pass it by reference is much more efficient.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Chien Tin Tung <chien.tin.tung@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoi40iw: Remove unnecessary synchronize_irq() before free_irq()
Lars-Peter Clausen [Mon, 28 Mar 2016 09:31:26 +0000 (11:31 +0200)]
i40iw: Remove unnecessary synchronize_irq() before free_irq()

Calling synchronize_irq() right before free_irq() is quite useless. On one
hand the IRQ can easily fire again before free_irq() is entered, on the
other hand free_irq() itself calls synchronize_irq() internally (in a race
condition free way), before any state associated with the IRQ is freed.

Patch was generated using the following semantic patch:
// <smpl>
@@
expression irq;
@@
-synchronize_irq(irq);
 free_irq(irq, ...);
// </smpl>

Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Faisal Latif <faisal.latif#intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoi40iw: constify i40iw_vf_cqp_ops structure
Julia Lawall [Sun, 1 May 2016 12:22:21 +0000 (14:22 +0200)]
i40iw: constify i40iw_vf_cqp_ops structure

The i40iw_vf_cqp_ops structure is never modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoi40e: constify i40e_client_ops structure
Julia Lawall [Sun, 1 May 2016 12:07:23 +0000 (14:07 +0200)]
i40e: constify i40e_client_ops structure

The i40e_client_ops structure is never modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Do not register memory if never_register has been set
Bart Van Assche [Thu, 12 May 2016 17:51:01 +0000 (10:51 -0700)]
IB/srp: Do not register memory if never_register has been set

This makes it easier to test the code path that does not use
memory registration (srp_map_sg_dma()).

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Prevent mapping failures
Bart Van Assche [Thu, 12 May 2016 17:50:35 +0000 (10:50 -0700)]
IB/srp: Prevent mapping failures

If both max_sectors and the queue_depth are high enough it can
happen that the MR pool is depleted temporarily. This causes
the SRP initiator to report mapping failures. Although the SRP
initiator recovers from such mapping failures, prevent that
this can happen by allocating more memory regions.

Additionally, only enable memory registration if at least two
pages can be registered per memory region.

Reported-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Swap two code blocks in srp_add_one()
Bart Van Assche [Thu, 12 May 2016 17:49:39 +0000 (10:49 -0700)]
IB/srp: Swap two code blocks in srp_add_one()

This patch does not change any functionality but makes the next
patch in this series easier to read.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/core: Enhance ib_map_mr_sg()
Bart Van Assche [Thu, 12 May 2016 17:49:15 +0000 (10:49 -0700)]
IB/core: Enhance ib_map_mr_sg()

The SRP initiator allows to set max_sectors to a value that exceeds
the largest amount of data that can be mapped at once with an mlx4
HCA using fast registration and a page size of 4 KB. Hence modify
ib_map_mr_sg() such that it can map partial sg-elements. If an
sg-element has been mapped partially, let the caller know
which fraction has been mapped by adjusting *sg_offset.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Fix srp_create_target() error handling
Bart Van Assche [Thu, 12 May 2016 17:48:48 +0000 (10:48 -0700)]
IB/srp: Fix srp_create_target() error handling

Avoid that the following kernel oops occurs if memory pool
allocation fails:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa048d0a0>] ib_drain_rq+0x0/0x20 [ib_core]
Call Trace:
 [<ffffffffa04af386>] srp_create_target+0xca6/0x13a9 [ib_srp]
 [<ffffffff813cc863>] dev_attr_store+0x13/0x20
 [<ffffffff81214b50>] sysfs_kf_write+0x40/0x50
 [<ffffffff81213f1c>] kernfs_fop_write+0x13c/0x180
 [<ffffffff81197683>] __vfs_write+0x23/0xf0
 [<ffffffff81198744>] vfs_write+0xa4/0x1a0
 [<ffffffff81199a44>] SyS_write+0x44/0xa0
 [<ffffffff8159e3e9>] entry_SYSCALL_64_fastpath+0x1c/0xac

Fixes: 1dc7b1f10dcb ("IB/srp: use the new CQ API")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org> # v4.5+
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Fix a memory descriptor leak in an error path
Bart Van Assche [Thu, 12 May 2016 17:48:13 +0000 (10:48 -0700)]
IB/srp: Fix a memory descriptor leak in an error path

If an error occurs after srp_fr_pool_get() succeeded and before the
descriptor is stored in srp_map_state (*state->fr.next++ = desc)
then srp_unmap_data() won't free the newly allocated memory
descriptor. Hence free the descriptor explicitly.

Fixes: f7f7aab1a5c0 ("IB/srp: Convert to new registration API")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Sagi Grimberg <sai@grimberg.me>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Print "ib_srp: " prefix once
Bart Van Assche [Thu, 12 May 2016 17:47:38 +0000 (10:47 -0700)]
IB/srp: Print "ib_srp: " prefix once

pr_debug() already prints prefix PFX. Avoid that PFX is printed
twice if the debug statement in srp_add_target() is enabled.

Fixes: 34aa654ecb8e ("IB/srp: Avoid that I/O hangs due to a cable pull during LUN scanning")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/isert: convert to the generic RDMA READ/WRITE API
Christoph Hellwig [Tue, 3 May 2016 16:01:13 +0000 (18:01 +0200)]
IB/isert: convert to the generic RDMA READ/WRITE API

Replace the homegrown RDMA READ/WRITE code in isert with the generic API,
which also adds iWarp support to the I/O path as a side effect.  Note
that full iWarp operation will need a few additional patches from Steve.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/core: add RW API support for signature MRs
Christoph Hellwig [Tue, 3 May 2016 16:01:12 +0000 (18:01 +0200)]
IB/core: add RW API support for signature MRs

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srpt: convert to the generic RDMA READ/WRITE API
Christoph Hellwig [Tue, 3 May 2016 16:01:11 +0000 (18:01 +0200)]
IB/srpt: convert to the generic RDMA READ/WRITE API

Replace the homegrown RDMA READ/WRITE code in srpt with the generic API.
The only real twist here is that we need to allocate one Linux scatterlist
per direct buffer in the SRP command, and chain them before handing them
off to the target core.

As a side-effect of the conversion the driver will also chain the SEND
of the SRP response to the RDMA WRITE WRs for a DATA OUT command, and
properly account for RDMA WRITE WRs instead of just for RDMA READ WRs
like the driver previously did.

We now allocate half of the SQ size to RDMA READ/WRITE contexts, assuming
by default one RDMA READ or WRITE operation per command.  If a command
has multiple operations it will eat into the budget but will still succeed,
possible after waiting for WQEs to be available.

Also ensure the QPs request the maximum allowed SGEs so that RDMA R/W API
works correctly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agotarget: enhance and export target_alloc_sgl/target_free_sgl
Christoph Hellwig [Tue, 3 May 2016 16:01:10 +0000 (18:01 +0200)]
target: enhance and export target_alloc_sgl/target_free_sgl

The SRP target driver will need to allocate and chain it's own SGLs soon.
For this export target_alloc_sgl, and add a new argument to it so that it
can allocate an additional chain entry that doesn't point to a page.  Also
export transport_free_sgl after renaming it to target_free_sgl to free
these SGLs again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/core: generic RDMA READ/WRITE API
Christoph Hellwig [Tue, 3 May 2016 16:01:09 +0000 (18:01 +0200)]
IB/core: generic RDMA READ/WRITE API

This supports both manual mapping of lots of SGEs, as well as using MRs
from the QP's MR pool, for iWarp or other cases where it's more optimal.
For now, MRs are only used for iWARP transports.  The user of the RDMA-RW
API must allocate the QP MR pool as well as size the SQ accordingly.

Thanks to Steve Wise for testing, fixing and rewriting the iWarp support,
and to Sagi Grimberg for ideas, reviews and fixes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/core: add a need_inval flag to struct ib_mr
Steve Wise [Tue, 3 May 2016 16:01:08 +0000 (18:01 +0200)]
IB/core: add a need_inval flag to struct ib_mr

This is the first step toward moving MR invalidation decisions
to the core.  It will be needed by the upcoming RW API.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/core: add a simple MR pool
Christoph Hellwig [Tue, 3 May 2016 16:01:07 +0000 (18:01 +0200)]
IB/core: add a simple MR pool

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/core: refactor ib_create_qp
Christoph Hellwig [Tue, 3 May 2016 16:01:06 +0000 (18:01 +0200)]
IB/core: refactor ib_create_qp

Split the XRC magic into a separate function, and return early on failure
to make the initialization code readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/core: add a helper to check for READ WITH INVALIDATE support
Christoph Hellwig [Tue, 3 May 2016 16:01:05 +0000 (18:01 +0200)]
IB/core: add a helper to check for READ WITH INVALIDATE support

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/core: Add passing an offset into the SG to ib_map_mr_sg
Christoph Hellwig [Tue, 3 May 2016 16:01:04 +0000 (18:01 +0200)]
IB/core: Add passing an offset into the SG to ib_map_mr_sg

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/cma: pass the port number to ib_create_qp
Christoph Hellwig [Tue, 3 May 2016 16:01:03 +0000 (18:01 +0200)]
IB/cma: pass the port number to ib_create_qp

The new RW API will need this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoMerge branches 'mlx5-1' and 'srp-1' into k.o/for-4.7
Doug Ledford [Thu, 12 May 2016 18:21:02 +0000 (14:21 -0400)]
Merge branches 'mlx5-1' and 'srp-1' into k.o/for-4.7

8 years agonet/mlx5: Update mlx5_ifc hardware features
Saeed Mahameed [Wed, 13 Apr 2016 16:11:04 +0000 (19:11 +0300)]
net/mlx5: Update mlx5_ifc hardware features

Adding the needed mlx5_ifc hardware bits and structs
for the following features:

* Add vport to steering commands for SRIOV ACL support
* Add mlcr, pcmr and mcia registers for dump module EEPROM
* Add support for FCS, beacon led and disable_link bits to
  hca caps
* Add CQE period mode bit in CQ context for CQE based CQ
  moderation support
* Add umr SQ bit for fragmented memory registration
* Add needed bits and caps for Striding RQ support

In-order to avoid possible future conflicts between rdma and
net-next we added all expected updates to this file for this release.
If more changes will be submitted, we plan to do it only through
one of the subsystems, probably net-next.

All updated bits in this patch will be later used in
the up-coming submissions to net-next and rdma trees.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agonet/mlx5: Fix mlx5 ifc cmd_hca_cap bad offsets
Tariq Toukan [Wed, 13 Apr 2016 16:11:03 +0000 (19:11 +0300)]
net/mlx5: Fix mlx5 ifc cmd_hca_cap bad offsets

All reserved fields after early_vf_enable are off by 1, since
early_vf_enable was not explicitly declared as array of size 1.

Reserved field before cqe_zip had a wrong size, it should
be 0x80 + 0x3f.

Fixes: b0844444590e ("net/mlx5_core: Introduce access function to read internal timer ")
Fixes: b4ff3a36d3e4 ("net/mlx5: Use offset based reserved field names in the IFC header file")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Move common code into the caller
Bart Van Assche [Fri, 22 Apr 2016 21:15:04 +0000 (14:15 -0700)]
IB/srp: Move common code into the caller

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Sagi Grimberg <sai@grimberg.m>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Move code out of a loop
Bart Van Assche [Fri, 22 Apr 2016 21:14:43 +0000 (14:14 -0700)]
IB/srp: Move code out of a loop

Since all srp_map_finish_fr() callers pass a non-zero value as
the fourth argument (sg_nents), the sg_nents == 0 check in that
function can be removed. Add a count == 0 check in the caller
of that function.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Avoid that mapping failure triggers an infinite loop
Bart Van Assche [Fri, 22 Apr 2016 21:14:15 +0000 (14:14 -0700)]
IB/srp: Avoid that mapping failure triggers an infinite loop

The srp_queuecommand() function translates ENOMEM into QUEUE_FULL
which causes the SCSI mid-layer to retry the command. All other
error codes are translated into DID_ERROR which causes the SCSI
command to fail. Return E2BIG if mapping will always fail to
prevent that the SCSI mid-layer keeps resubmitting a command
forever.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Introduce target->mr_pool_size
Bart Van Assche [Fri, 22 Apr 2016 21:13:57 +0000 (14:13 -0700)]
IB/srp: Introduce target->mr_pool_size

This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Fix srp_map_data() error paths
Bart Van Assche [Fri, 22 Apr 2016 21:13:35 +0000 (14:13 -0700)]
IB/srp: Fix srp_map_data() error paths

Ensure that req->nmdesc is set correctly in srp_map_sg() if mapping
fails. Avoid that mapping failure causes a memory descriptor leak.
Report srp_map_sg() failure to the caller.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Document srp_map_data() return value
Bart Van Assche [Fri, 22 Apr 2016 21:13:09 +0000 (14:13 -0700)]
IB/srp: Document srp_map_data() return value

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Fix a comment
Bart Van Assche [Fri, 22 Apr 2016 21:12:47 +0000 (14:12 -0700)]
IB/srp: Fix a comment

The free request list was removed through patch "IB/srp: Use block layer tags".
Hence update a comment that refers to that free request list.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/srp: Fix a spelling error in a source code comment
Bart Van Assche [Fri, 22 Apr 2016 21:12:10 +0000 (14:12 -0700)]
IB/srp: Fix a spelling error in a source code comment

Change one occurrence of "boundries" into "boundaries".

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoMerge branches 'hfi1' and 'iw_cxgb4' into k.o/for-4.7
Doug Ledford [Thu, 5 May 2016 20:42:09 +0000 (16:42 -0400)]
Merge branches 'hfi1' and 'iw_cxgb4' into k.o/for-4.7

8 years agoRDMA/iw_cxgb4: remove abort_connection() usage from ep_timeout()
Hariprasad S [Wed, 4 May 2016 19:57:37 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: remove abort_connection() usage from ep_timeout()

Use c4iw_ep_disconnect() instead.  This is part of getting rid of
abort_connection() altogether so we properly clean up on send_abort()
failures.

This is the last user of abort_connection(), so remove it too.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/iw_cxgb4: move QP -> ERROR on fatal disconnect errors
Hariprasad S [Wed, 4 May 2016 19:57:36 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: move QP -> ERROR on fatal disconnect errors

In c4iw_ep_disconnect(), if we fail to initiate a close operation, then
move the qp to ERROR to disassociate the ep from the qp.  Failure to do
this will leak the ep resources.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/iw_cxgb4: don't use abort_connection in process_mpa_request()
Hariprasad S [Wed, 4 May 2016 19:57:35 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: don't use abort_connection in process_mpa_request()

Instead return whether the caller needs to disconnect. This is part of
getting rid of abort_connection() altogether so we properly clean up on
send_abort() failures.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/iw_cxgb4: remove abort_connection() usage from accept/reject
Hariprasad S [Wed, 4 May 2016 19:57:34 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: remove abort_connection() usage from accept/reject

Use c4iw_ep_disconnect() instead. This is part of getting rid of
abort_connection() altogether so we properly clean up on send_abort()
failures.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/iw_cxgb4: free resources when send_flowc() fails
Hariprasad S [Wed, 4 May 2016 19:57:33 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: free resources when send_flowc() fails

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/iw_cxgb4: remove connection abort from process_mpa_reply
Hariprasad S [Wed, 4 May 2016 19:57:32 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: remove connection abort from process_mpa_reply

Instead, have the caller, rx_data() handle the close/abort like
it does for process_mpa_request(). This is part of getting rid of
abort_connection() altogether so we properly clean up on send_abort()
failures.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/iw_cxgb4: ensure eps don't get freed while the mutex is held
Hariprasad S [Wed, 4 May 2016 19:57:31 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: ensure eps don't get freed while the mutex is held

In rx_data(), with the ep in FPDU_MODE, refcnt=2, if we get unexpected
streaming data, we call c4iw_modify_rc_qp() and move the qp from
RTS -> TERMINATE.  In c4iw_modify_rc_qp(), if rdma_fini() returns
an error, the ep will be dereferenced (refcnt=1).  Then rx_data()
calls c4iw_ep_disconnect() which starts the close operation.
But if send_halfclose() fails in c4iw_ep_disconnect(), we  will call
release_ep_resources() derefing the ep which reduces the refcnt to 0 and
and frees the ep. However we still has the ep mutex at that point, so we
have a touch-after-free bug.  There is a similar issue where
peer_close() calls c4iw_ep_disconnect().

The solution is to add a reference to the ep in c4iw_ep_disconnect()
after acquiring  the mutex, and release it after releasing the mutex.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/iw_cxgb4: stop ep timer on close failure
Hariprasad S [Wed, 4 May 2016 19:57:30 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: stop ep timer on close failure

In c4iw_ep_disconnect(), if we start the ep timer to begin a close,
but send_halfclose() fails, we need to stop the timer and send a CLOSE
event up to the IWCM before releasing the resources. Otherwise, we can
crash when the ep timer fires if the ep is referencing a previous instance
of the device. This can happen as part of adapter reset/recovery, for
instance.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/iw_cxgb4: release ep resources on accept arp failure
Hariprasad S [Wed, 4 May 2016 19:57:29 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: release ep resources on accept arp failure

If ARP fails before the CPL_PASS_ACCEPT_RPL is seen by hardware, the tid
will be stuck in SYN_PEND and never released.  So create an arp failure
handler specifically for this message to release the endpoint resources.

In pass_accept_rpl_arp_failure(), put the parent endpoint so it will
be freed when destroyed.  Also we don't need to call release_tid() here
because _c4iw_free_ep() calls cxgb4_remove_tid() which releases the
hwtid.

If we get an ABORT_REQ_RSS instead of a PASS_ESTABLISH (because the
peer's ACK to our SYN is never received), then put the parent as well
in peer_abort().

Treat accept_cr() failures just like arp failures: put the parent ep
and release the ep resources destroying the tid

The ARP failure handlers are called in an atomic context, so we need to
schedule some of the processing which might block.  Namely _c4iw_free_ep()
which needs a mutex.  So create a "special" CPL opcode and handler and
schedule it via sched() to be run by process_work() in a blockable context.

Also rework the active open arp failure handler to make use of
release_ep_resources().  This allows both the active and passive arp
failure handlers to use the same deferred cleanup function.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/iser: Fix max_sectors calculation
Christoph Hellwig [Mon, 18 Apr 2016 21:06:28 +0000 (17:06 -0400)]
IB/iser: Fix max_sectors calculation

iSER currently has a couple places that set max_sectors in either the host
template or SCSI host, and all of them get it wrong.

This patch instead uses a single assignment that (hopefully) gets it right:
the max_sectors value must be derived from the number of segments in the
FR or FMR structure, but actually be one lower than the page size multiplied
by the number of sectors, as it has to handle the case of non-aligned I/O.

Without this I get trivial to reproduce hangs when running xfstests
(on XFS) over iSER to Linux targets.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/nes: don't leak skb if carrier down
Florian Westphal [Sun, 24 Apr 2016 20:18:59 +0000 (22:18 +0200)]
RDMA/nes: don't leak skb if carrier down

Alternatively one could free the skb, OTOH I don't think this test is
useful so just remove it.

Cc: <linux-rdma@vger.kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix for removing quad hash entries
Tatyana Nikolova [Fri, 22 Apr 2016 19:14:29 +0000 (14:14 -0500)]
RDMA/i40iw: Fix for removing quad hash entries

Fix for removing a quad hash entry when the
corresponding quad hash entry hasn't been added,
which is the case in loopback connections

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix for checking if the QP is destroyed
Tatyana Nikolova [Fri, 22 Apr 2016 19:14:28 +0000 (14:14 -0500)]
RDMA/i40iw: Fix for checking if the QP is destroyed

Fix for checking if the QP associated with a completion
has been destroyed while processing CQ elements.
If that is the case, move the CQ head to the next element
and continue completion processing.

Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix for using one sge for RDMA READ
Shiraz Saleem [Fri, 22 Apr 2016 19:14:27 +0000 (14:14 -0500)]
RDMA/i40iw: Fix for using one sge for RDMA READ

A check is added to validate the requested sge number.
iWARP doesn't support multiple sg elements for
RDMA READ work requests.

Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix for the size of kernel mode SQ
Shiraz Saleem [Fri, 22 Apr 2016 19:14:26 +0000 (14:14 -0500)]
RDMA/i40iw: Fix for the size of kernel mode SQ

Fix to calculate the SQ size based on the max
frag_count, requested by the application instead
of overwriting it with the max supported frag_count

Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix for a NOP WQE size
Mohammad Khan [Fri, 22 Apr 2016 19:14:25 +0000 (14:14 -0500)]
RDMA/i40iw: Fix for a NOP WQE size

Fix for filling in the WQE size for NOP

Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Correct STag mask to min of 14 bits
Chien Tin Tung [Fri, 22 Apr 2016 19:14:24 +0000 (14:14 -0500)]
RDMA/i40iw: Correct STag mask to min of 14 bits

STag index mask is calculated incorrectly, missing
the 14 bits minimum requirement. Add max macro to use
either # of MRs or 14 bits in the mask size calculation.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fixes for WQE alignment
Shiraz Saleem [Fri, 22 Apr 2016 19:14:23 +0000 (14:14 -0500)]
RDMA/i40iw: Fixes for WQE alignment

Invalidation after every WQE write is changed to invalidate
only if required. NOPs are padded so that WQE writes are
aligned to 64B boundary.

Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Adding queue drain functions
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:09 +0000 (10:33 -0500)]
RDMA/i40iw: Adding queue drain functions

Adding sq and rq drain functions, which block until all
previously posted wr-s in the specified queue have completed.
A completion object is signaled to unblock the thread,
when the last cqe for the corresponding queue is processed.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix SD calculation for initial HMC creation
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:08 +0000 (10:33 -0500)]
RDMA/i40iw: Fix SD calculation for initial HMC creation

Correct SD calculation by using base address returned from commit FPM.
This alleviates any assumptions on resource ordering and alignment
requirement. Also consolidate SD estimation code into i40iw_est_sd().

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix endian issues and warnings
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:07 +0000 (10:33 -0500)]
RDMA/i40iw: Fix endian issues and warnings

Fix endian warnings and errors due to u32 stored to u16.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Add base memory management extensions
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:06 +0000 (10:33 -0500)]
RDMA/i40iw: Add base memory management extensions

Implement fast register mr, Local invalidate, send with
invalidate and RDMA read with invalidate.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Initialize max enabled vfs variable
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:05 +0000 (10:33 -0500)]
RDMA/i40iw: Initialize max enabled vfs variable

Initialize max enabled vfs to max rdma vfs instead of 0.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Correct return code check in add_pble_pool
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:04 +0000 (10:33 -0500)]
RDMA/i40iw: Correct return code check in add_pble_pool

Move return code check to immediately after i40iw_hmc_sd_one call
where it is set instead of outside the then statement.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Add virtual channel message queue
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:03 +0000 (10:33 -0500)]
RDMA/i40iw: Add virtual channel message queue

Queue users of virtual channel on a waitqueue until the channel is
clear instead of failing the call when the channel is occupied.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Remove unused code and fix warning
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:02 +0000 (10:33 -0500)]
RDMA/i40iw: Remove unused code and fix warning

Remove unused code and fix warning.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Populate vendor_id and vendor_part_id fields
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:01 +0000 (10:33 -0500)]
RDMA/i40iw: Populate vendor_id and vendor_part_id fields

Populate PCI info fields from PCI device structure.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Set vendor_err only if there is an actual error
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:00 +0000 (10:33 -0500)]
RDMA/i40iw: Set vendor_err only if there is an actual error

Add a check for cq_poll_info.error before setting vendor_err
instead of always setting it.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Add qp table lock around AE processing
Ismail, Mustafa [Mon, 18 Apr 2016 15:32:59 +0000 (10:32 -0500)]
RDMA/i40iw: Add qp table lock around AE processing

QP may be freed during Async Event processing.
Add a lock around QP table to prevent it.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Do not set self-referencing pointer to NULL after free
Ismail, Mustafa [Mon, 18 Apr 2016 15:32:58 +0000 (10:32 -0500)]
RDMA/i40iw: Do not set self-referencing pointer to NULL after free

iwqp->allocated_buffer is a self-referencing pointer to iwqp.
Do not set iwqp->allocated_buffer to NULL after freeing it.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Correct max message size in query port
Ismail, Mustafa [Mon, 18 Apr 2016 15:32:57 +0000 (10:32 -0500)]
RDMA/i40iw: Correct max message size in query port

Fix to correct max reported message size in query port.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix refused connections
Ismail, Mustafa [Mon, 18 Apr 2016 15:32:56 +0000 (10:32 -0500)]
RDMA/i40iw: Fix refused connections

Make sure cm_node is setup before sending SYN packet and
ORD/IRD negotiation.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Correct QP size calculation
Ismail, Mustafa [Mon, 18 Apr 2016 15:32:55 +0000 (10:32 -0500)]
RDMA/i40iw: Correct QP size calculation

Include inline data size as part of SQ size calculation.
RQ size calculation uses only number of SGEs and does not
support 96 byte WQE size.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoRDMA/i40iw: Fix overflow of region length
Ismail, Mustafa [Mon, 18 Apr 2016 15:32:54 +0000 (10:32 -0500)]
RDMA/i40iw: Fix overflow of region length

Change region_length to u64 as a region can be > 4GB.

Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Serialize hrtimer function calls
Jubin John [Thu, 14 Apr 2016 15:31:53 +0000 (08:31 -0700)]
IB/hfi1: Serialize hrtimer function calls

hrtimer functions do not guarantee serialization, so we extend the
cca_timer_lock to cover the hrtimer_forward_now() in the hrtimer
callback handler and the hrtimer_start() in process_becn(). This
prevents races between these 2 functions to update the hrtimer state
leading to problems such as:
kernel BUG at kernel/hrtimer.c:1282!
encountered during validation of the CCA feature.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix MAD port poll for active cables
Dean Luick [Thu, 14 Apr 2016 15:31:48 +0000 (08:31 -0700)]
IB/hfi1: Fix MAD port poll for active cables

A MAD directive to start polling must go through the normal
link tuning and start steps in order to correctly handle
active cables.

Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Correctly report neighbor link down reason
Dean Luick [Thu, 14 Apr 2016 15:31:42 +0000 (08:31 -0700)]
IB/hfi1: Correctly report neighbor link down reason

The code to save the link down reason for reporting to the SMA
was in a location before the actual reason was read.  Move the
SMA link down reason assignment to a better location.

Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Use the neighbor link down reason only when valid
Dean Luick [Thu, 14 Apr 2016 15:31:36 +0000 (08:31 -0700)]
IB/hfi1: Use the neighbor link down reason only when valid

The 8051 uses a link down reason to inform the driver why the
link went down.  The neighbor planned link down reason code is
only valid when a link down idle message is received by the 8051.
Enhance the explanation on why the link went down.

Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Ignore link downgrade with 0 lanes
Dean Luick [Thu, 14 Apr 2016 15:31:30 +0000 (08:31 -0700)]
IB/hfi1: Ignore link downgrade with 0 lanes

Versions of the 8051 firmware < 0.38 may report a link failure
as a link downgrade with a width of 0 followed by a link down
notification.  Ignore the zero width downgrade notification -
the driver should follow the link down path.

Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Add RSM rule for user FECN handling
Dean Luick [Tue, 12 Apr 2016 18:32:06 +0000 (11:32 -0700)]
IB/hfi1: Add RSM rule for user FECN handling

Add a receive side mapping rule to extract expected user packets with
the FECN bit set and place them in an eager buffer.  This will allow
user libraries to recognize that a FECN was sent when using header
suppression and respond appropriately.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Create a routine to set a receive side mapping rule
Dean Luick [Tue, 12 Apr 2016 18:31:33 +0000 (11:31 -0700)]
IB/hfi1: Create a routine to set a receive side mapping rule

Move the rule setting code into its own routine for improved
searchability and reuse.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Move QOS decision logic into its own function
Dean Luick [Tue, 12 Apr 2016 18:31:11 +0000 (11:31 -0700)]
IB/hfi1: Move QOS decision logic into its own function

The decision to use QOS affects other resource allocation.
Move the QOS decision logic into its own function so it can
be called by other interested parties.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Extract RSM map table init from QOS
Dean Luick [Tue, 12 Apr 2016 18:30:51 +0000 (11:30 -0700)]
IB/hfi1: Extract RSM map table init from QOS

Refactor the allocation, tracking, and writing of the RSM map table
into its own set of routines.  This will allow the map table to be
passed to multiple users to fill in as needed.  Start with the original
user, QOS.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Reduce kernel context pio buffer allocation
Jianxin Xiong [Tue, 12 Apr 2016 18:30:28 +0000 (11:30 -0700)]
IB/hfi1: Reduce kernel context pio buffer allocation

The pio buffers were pooled evenly among all kernel contexts and
user contexts. However, the demand from kernel contexts is much
lower than user contexts. This patch reduces the allocation for
kernel contexts and thus makes more credits available for PSM,
helping performance. This is especially useful on high core-count
systems where large numbers of contexts are used.

A new context type SC_VL15 is added to distinguish the context used
for VL15 from other kernel contexts. The reason is that VL15 needs
to support 2KB sized packet while other kernel contexts need only
support packets up to the size determined by "piothreshold", which
has a default value of 256.

The new allocation method allows triple buffering of largest pio
packets configured for these contexts. This is sufficient to maintain
verbs performance. The largest pio packet size is 2048B for VL15
and "piothreshold" for other kernel contexts. A cap is applied to
"piothreshold" to avoid excessive buffer allocation.

The special case that SDMA is disable is handled differently. In
that case, the original pooling allocation is used to better
support the much higher pio traffic.

Notice that if adaptive pio is disabled (piothreshold==0), the pio
buffer size doesn't matter for non-VL15 kernel send contexts when
SDMA is enabled because pio is not used at all on these contexts
and thus the new allocation is still valid. If SDMA is disabled then
pooling allocation is used as mentioned in previous paragraph.

Adjustment is also made to the calculation of the credit return
threshold for the kernel contexts. Instead of purely based on
the MTU size, a percentage based threshold is also considered and
the smaller one of the two is chosen. This is necessary to ensure
that with the reduced buffer allocation credits are returned in
time to avoid unnecessary stall in the send path.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mark Debbage <mark.debbage@intel.com>
Reviewed-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Change default number of user contexts
Jubin John [Tue, 12 Apr 2016 18:30:08 +0000 (11:30 -0700)]
IB/hfi1: Change default number of user contexts

Change the default number of user contexts to the number of real
(non-HT) cpu cores in order to reduce the division of hfi1 hardware
contexts in the case of high core counts with hyper-threading enabled.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Use global defines for upper bits in opcode
Mike Marciniszyn [Tue, 12 Apr 2016 18:29:20 +0000 (11:29 -0700)]
IB/hfi1: Use global defines for upper bits in opcode

The awkward coding for setting the allowed_ops field
was tripping an smatch warning.

This patch uses the more appropriate defines from include/rdma
to avoid the issue.

As part of the patch remove a mask that was duplicated
in rdmavt include files and use that mask as appropriate.

Fixes: 8bea6b1cfe6f ("IB/rdmavt: Add create queue pair functionality")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove unreachable code
Mike Marciniszyn [Tue, 12 Apr 2016 18:28:56 +0000 (11:28 -0700)]
IB/hfi1: Remove unreachable code

Remove unreachable code from RC ack handling to fix an
smatch error.

Fixes: 633d27399514 ("staging/rdma/hfi1: use mod_timer when appropriate")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix double QSFP resource acquire on cache refresh
Dean Luick [Tue, 12 Apr 2016 18:28:36 +0000 (11:28 -0700)]
IB/hfi1: Fix double QSFP resource acquire on cache refresh

The function refresh_qsfp_cache() acquires the i2c chain resource,
but one caller already holds the resource.  Change the acquire so
all calls to refresh_qsfp_cache() are covered by the acquire and
remove the acquire within refresh_qsfp_cache().

Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Guard against concurrent I2C access across all chains
Dean Luick [Tue, 12 Apr 2016 18:26:21 +0000 (11:26 -0700)]
IB/hfi1: Guard against concurrent I2C access across all chains

The discrete ASIC board design makes the two I2C chains not
independent of each other.  That is, only one chain can safely
be accessed at a time.  For discrete ASIC devices, adjust the
resource locking so that access to one I2C chain will lock both
of the chains.

Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove module presence check outside pre-LNI checks
Easwar Hariharan [Tue, 12 Apr 2016 18:25:57 +0000 (11:25 -0700)]
IB/hfi1: Remove module presence check outside pre-LNI checks

The pre-LNI SerDes and channel tuning algorithm already checks for
module presence assertion for the relevant port types. The extraneous
check removed in this patch blocks link up for port types for which
the module presence assertion is not relevant.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Always turn on CDRs for low power QSFP modules
Easwar Hariharan [Tue, 12 Apr 2016 18:25:31 +0000 (11:25 -0700)]
IB/hfi1: Always turn on CDRs for low power QSFP modules

Clock and data recovery mechanisms (CDRs) in active QSFP modules
can be turned on or off to improve the bit error rate observed on
the channel. Signal integrity and bit error rate requirements require
us to always turn on any CDRs present in low power cables (power
dissipation 2.5W or lower). However, we adhere to the platform
designer's settings (provided in the platform configuration) for
higher power cables (dissipation 3.5W or higher) if the platform
designer has determined that the platform requires the CDRs to be
turned on (or off) and is capable of supplying and cooling the higher
power modules.

This patch also introduces the get_qsfp_power_class function to
centralize the bit twiddling required to determine the QSFP power class
across the code. Reusing this function improves the readability of code
that depends on knowing the power class of the cable, such as the
active and optical channel tuning algorithm.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Check P_KEY for all sent packets from user mode
Sebastian Sanchez [Tue, 12 Apr 2016 18:22:21 +0000 (11:22 -0700)]
IB/hfi1: Check P_KEY for all sent packets from user mode

Add the P_KEY check for user-context mechanism for
both PIO and SDMA. For PIO, the
SendCtxtCheckEnable.DisallowKDETHPackets is set by
default. When the P_KEY is set,
SendCtxtCheckEnable.DisallowKDETHPackets is cleared.
For SDMA, a software check was included. This change
requires user processes to set the P_KEY before sending
any packets, otherwise, the sent packet will fail. The
original submission didn't have this check but it's
required.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mikto Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Adjust default MTU to be 10KB
Sebastian Sanchez [Tue, 12 Apr 2016 18:17:09 +0000 (11:17 -0700)]
IB/hfi1: Adjust default MTU to be 10KB

Increasing the default MTU size to 10KB improves performance
for PSM. Change the default MTU to 10KB but constrain
Verbs MTU to 8KB. Also update default MTU module parameter
description to be HFI1_DEFAULT_MAX_MTU.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Simplify init_qpmap_table()
Dean Luick [Tue, 12 Apr 2016 17:50:35 +0000 (10:50 -0700)]
IB/hfi1: Simplify init_qpmap_table()

Make init_qpmap_table() easier to understand by simplifying
the loop indexing and writing each register when it is "full",
removing the need for a follow-on register write.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Correctly obtain the full service class
Dean Luick [Tue, 12 Apr 2016 17:50:28 +0000 (10:50 -0700)]
IB/hfi1: Correctly obtain the full service class

The function hdr2sc was using an unshifted mask to obtain
the 5th bit of the service class.  Correct the issue by using
the shifted mask.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix QOS rule mappings
Dean Luick [Tue, 12 Apr 2016 17:50:22 +0000 (10:50 -0700)]
IB/hfi1: Fix QOS rule mappings

The QOS RSM rule mappings are off by one, referencing a kernel receive
context that does not exist.

Correctly start the QOS RSM map entries at FIRST_KERNEL_CONTEXT rather
than MIN_KERNEL_KCTXTS.  Remove the cruft that hid this.

Change the QP map table so all traffic not caught by QOS RSM goes to
the control context rather than the first QOS context.

Correct comments to match the actual code operation and intent.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Remove invalid QOS check
Dean Luick [Tue, 12 Apr 2016 17:50:16 +0000 (10:50 -0700)]
IB/hfi1: Remove invalid QOS check

Remove an invalid compare of the number of QOS RSM map table entries
against the number of physical receive contexts.  The RSM map table
has its own size and has no relation to the number of physical receive
contexts.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix QOS num_vl bit width
Dean Luick [Tue, 12 Apr 2016 17:50:10 +0000 (10:50 -0700)]
IB/hfi1: Fix QOS num_vl bit width

The bit width for num_vls, n, needs to be calculated based on
the pow2 rounded up of the number of vls.  Otherwise num_vls of 3,
5, 6, and 7 will have misplaced QOS RSM map entries.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix i2c resource reservation checks
Dean Luick [Tue, 12 Apr 2016 17:50:04 +0000 (10:50 -0700)]
IB/hfi1: Fix i2c resource reservation checks

The i2c and qsfp read/write routines should check for the resource
reservation of the incoming argument target rather than the implicit
target of the hardware HFI.

Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix sysfs file offset usage
Dean Luick [Tue, 12 Apr 2016 17:49:58 +0000 (10:49 -0700)]
IB/hfi1: Fix sysfs file offset usage

Two sysfs files do not pay attention to the file offset when
reading data. Fix that.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/rdmavt,hfi1,qib: Fix memory leak
Jubin John [Wed, 20 Apr 2016 13:05:24 +0000 (06:05 -0700)]
IB/rdmavt,hfi1,qib: Fix memory leak

rdi->ports has memory allocated in rvt_alloc_device(), but does not get
freed because the hfi1 and qib drivers drivers call ib_dealloc_device()
directly instead of going through rdmavt. Add a rvt_dealloc_device()
that frees rdi->ports and then calls ib_dealloc_device(). Switch hfi1
and qib drivers to calling rvt_dealloc_device() instead of
ib_dealloc_device() directly.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix buffer cache races which may cause corruption
Mitko Haralanov [Tue, 12 Apr 2016 17:46:53 +0000 (10:46 -0700)]
IB/hfi1: Fix buffer cache races which may cause corruption

There are two possible causes for node/memory corruption both
of which are related to the cache eviction algorithm. One way
to cause corruption is due to the asynchronous nature of the
MMU invalidation and the locking used when invalidating node.

The MMU invalidation routine would temporarily release the
RB tree lock to avoid a deadlock. However, this would allow
the eviction function to take the lock resulting in the removal
of cache nodes.

If the node being removed by the eviction code is the same as
the node being invalidated, the result is use after free.

The same is true in the other direction due to the temporary
release of the eviction list lock in the eviction loop.

Another corner case exists when dealing with the SDMA buffer
cache that could cause memory corruption of kernel memory.
The most common way, in which this corruption exhibits itself
is a linked list node corruption. In that case, the kernel will
complain that a node with poisoned pointers is being removed.
The fact that the pointers are already poisoned means that the
node has already been removed from the list.

To root cause of this corruption was a mishandling of the
eviction list maintained by the driver. In order for this
to happen four conditions need to be satisfied:

   1. A node describing a user buffer already exists in the
      interval RB tree,
   2. The beginning of the current user buffer matches that
      node but is bigger. This will cause the node to be
      extended.
   3. The amount of cached buffers is close or at the limit
      of the buffer cache size.
   4. The node has dropped close to the end of the eviction
      list. This will cause the node to be considered for
      eviction.

If all of the above conditions have been satisfied, it is
possible for the eviction algorithm to evict the current node,
which will free the node without the driver knowing.

To solve both issues described above:
   - the locking around the MMU invalidation loop and cache
     eviction loop has been improved so locks are not released in
     the loop body,
   - a new RB function is introduced which will "atomically" find
     and remove the matching node from the RB tree, preventing the
     MMU invalidation loop from touching it, and
   - the node being extended by the pin_vector_pages() function is
     removed from the eviction list prior to calling the eviction
     function.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Extract and reinsert MMU RB node on lookup
Mitko Haralanov [Tue, 12 Apr 2016 17:46:47 +0000 (10:46 -0700)]
IB/hfi1: Extract and reinsert MMU RB node on lookup

The page pinning function, which also maintains the pin cache,
behaves one of two ways when an exact buffer match is not found:
  1. If no node is not found (a buffer with the same starting address
     is not found in the cache), a new node is created, the buffer
     pages are pinned, and the node is inserted into the RB tree, or
  2. If a node is found but the buffer in that node is a subset of
     the new user buffer, the node is extended with the new buffer
     pages.

Both modes of operation require (re-)insertion into the interval RB
tree.

When the node being inserted is a new node, the operations are pretty
simple. However, when the node is already existing and is being
extended, special care must be taken.

First, we want to guard against an asynchronous attempt to
delete the node by the MMU invalidation notifier. The simplest way to
do this is to remove the node from the RB tree, preventing the search
algorithm from finding it.

Second, the node needs to be re-inserted so it lands in the proper place
in the tree and the tree is correctly re-balanced. This also requires
the node to be removed from the RB tree.

This commit adds the hfi1_mmu_rb_extract() function, which will search
for a node in the interval RB tree matching an address and length and
remove it from the RB tree if found. This allows for both of the above
special cases be handled in a single step.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Correctly compute node interval
Mitko Haralanov [Tue, 12 Apr 2016 17:46:41 +0000 (10:46 -0700)]
IB/hfi1: Correctly compute node interval

The computation of the interval of an interval RB node
was incorrect leading to data corruption due to the RB
search algorithm not properly finding the all RB nodes
in an MMU invalidation interval.

The problem stemmed from the fact that the beginning
address of the node's range was being aligned to a page
boundary. For certain buffer sizes, this would lead to
a end address calculation that was off by 1 page.

An important aspect of keeping the RB same is also
updating the node's range in the case it's being extended.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Protect the interval RB tree when cleaning up
Mitko Haralanov [Tue, 12 Apr 2016 17:46:35 +0000 (10:46 -0700)]
IB/hfi1: Protect the interval RB tree when cleaning up

The current implementation of the clean up function for
the interval RB trees has two flaws which may cause
problems in cases of concurrent executing of the function
and MMU notifier.

The flaws were due to the fact that deregistration of the
MMU callbacks was done after the tree was emptied and,
furthermore, the tree was not being locked.

This commit fixes both of these flaws by, first, switch the
order of operations, and, second, locking the tree while
traversing it to prevent any other operations.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
8 years agoIB/hfi1: Fix memory leak in user ExpRcv and SDMA
Mitko Haralanov [Tue, 12 Apr 2016 17:46:29 +0000 (10:46 -0700)]
IB/hfi1: Fix memory leak in user ExpRcv and SDMA

The driver had two memory leaks - one in the user
expected receive code and one in SDMA buffer cache.

The leak in the expected receive code only showed up
when the user/admin had set ulimit sufficiently low
and the driver did not have enough room in the cache
before hitting the limit of allowed cachable memory.

When this condition occurred, the driver returned
early signaling userland that it needed to free some
buffers to free up room in the cache.

The bug was that the driver was not cleaning up
allocated memory prior to returning early.

The leak in the SDMA buffer cache could occur (even
though it never did), when the insertion of a buffer
node in the interval RB tree failed. In this case, the
driver failed to unpin the pages of the node instead
erroneously returning success.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>