cascardo/linux.git
9 years agoSUNRPC: Fix stupid typo in xs_sock_set_reuseport
Trond Myklebust [Mon, 9 Feb 2015 22:20:14 +0000 (17:20 -0500)]
SUNRPC: Fix stupid typo in xs_sock_set_reuseport

Yes, kernel_setsockopt() hates you for using a char argument.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
Trond Myklebust [Mon, 9 Feb 2015 16:01:02 +0000 (11:01 -0500)]
SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG

Now that the linger code is gone, the xs_tcp_fin_timeout variable has
no real function. Keep it for now, since it is part of the /proc
interface, but only define it if that /proc interface is enabled.

Suggested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Handle connection reset more efficiently.
Trond Myklebust [Mon, 9 Feb 2015 14:41:32 +0000 (09:41 -0500)]
SUNRPC: Handle connection reset more efficiently.

If the connection reset is due to an active call on our side, then
the state change is sometimes not reported. Catch those instances
using xs_error_report() instead.
Also remove the xs_tcp_shutdown() call in xs_tcp_send_request() as
the change in behaviour makes it redundant.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
Trond Myklebust [Mon, 9 Feb 2015 00:21:27 +0000 (19:21 -0500)]
SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
Trond Myklebust [Mon, 9 Feb 2015 14:23:34 +0000 (09:23 -0500)]
SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release

Use of socket shutdown() means that we monitor the shutdown process
through the xs_tcp_state_change() callback, so it is preferable to
a full close in all cases unless we're destroying the transport.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
Trond Myklebust [Mon, 9 Feb 2015 14:26:39 +0000 (09:26 -0500)]
SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection

The previous behaviour left the connection half-open in order to try
to scrape the last replies from the socket. Now that we have more reliable
reconnection, change the behaviour to close down the socket faster.
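
The difference between a half-open and a fully closed connection can be
sketched in userspace. This is only an analogue, assuming a Linux host
and an AF_UNIX socketpair rather than the kernel's TCP transport code;
peer_data_still_readable() is a hypothetical helper.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Shut down fds[0] with the given mode, then check whether one more byte
 * sent by the peer (fds[1]) can still be read on fds[0].
 * Returns 1 if the byte arrives, 0 if it does not, -1 on setup failure.
 */
static int peer_data_still_readable(int how)
{
	int fds[2];
	char byte = 'x';
	ssize_t sent, got;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
		return -1;

	shutdown(fds[0], how);

	/* The peer tries to deliver one last reply. */
	sent = send(fds[1], &byte, 1, MSG_NOSIGNAL);
	got = (sent == 1) ? recv(fds[0], &byte, 1, MSG_DONTWAIT) : 0;

	close(fds[0]);
	close(fds[1]);
	return got == 1;
}
```

With SHUT_WR the local end can still collect late replies (the old
half-open behaviour); with SHUT_RDWR both directions are closed at once.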

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
Trond Myklebust [Sun, 8 Feb 2015 21:00:01 +0000 (16:00 -0500)]
SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Remove TCP socket linger code
Trond Myklebust [Sun, 8 Feb 2015 20:50:27 +0000 (15:50 -0500)]
SUNRPC: Remove TCP socket linger code

Now that we no longer use the partial shutdown code when closing the
socket, we no longer need to worry about the TCP linger2 state.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Remove TCP client connection reset hack
Trond Myklebust [Sun, 8 Feb 2015 20:34:28 +0000 (15:34 -0500)]
SUNRPC: Remove TCP client connection reset hack

Instead we rely on SO_REUSEPORT to provide the reconnection semantics
that we need for NFSv2/v3.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: TCP/UDP always close the old socket before reconnecting
Trond Myklebust [Sun, 8 Feb 2015 21:49:48 +0000 (16:49 -0500)]
SUNRPC: TCP/UDP always close the old socket before reconnecting

It is not safe to call xs_reset_transport() from inside xs_udp_setup_socket()
or xs_tcp_setup_socket(), since they do not own the correct locks. Instead,
do it in xs_connect().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Add helpers to prevent socket create from racing
Trond Myklebust [Sun, 8 Feb 2015 23:19:25 +0000 (18:19 -0500)]
SUNRPC: Add helpers to prevent socket create from racing

The socket lock is currently held by the task that is requesting the
connection be established. While that is efficient in the case where
the connection happens quickly, it is racy in the case where it doesn't.
What we really want is for the connect helper to be able to block access
to the socket while it is being set up.

This patch does so by arranging to transfer the socket lock from the
task that is requesting the connect attempt, and then releasing that
lock once everything is done.
This scheme also gives us automatic protection against collisions with
the RPC close code, so we can kill the cancel_delayed_work_sync()
call in xs_close().
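
The hand-off described above can be pictured with a toy, single-threaded
model. Everything here is hypothetical (struct xprt_model and the
model_* helpers are not kernel interfaces) and it ignores the atomicity
that real locking needs; it only shows that the lock is never free
between the requesting task and the connect helper.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a transport; lock_owner == NULL means unlocked. */
struct xprt_model {
	const void *lock_owner;
};

/* Take the lock for 'who' if it is free. Returns 1 on success, 0 otherwise. */
static int model_lock(struct xprt_model *x, const void *who)
{
	if (x->lock_owner != NULL)
		return 0;
	x->lock_owner = who;
	return 1;
}

/* Hand the lock from 'from' to 'to' without ever leaving it unlocked. */
static int model_transfer(struct xprt_model *x, const void *from,
			  const void *to)
{
	if (x->lock_owner != from)
		return 0;
	x->lock_owner = to;
	return 1;
}

/* Release the lock; only the current owner may do so. */
static int model_unlock(struct xprt_model *x, const void *who)
{
	if (x->lock_owner != who)
		return 0;
	x->lock_owner = NULL;
	return 1;
}
```

In this model a close request that arrives while the connect helper
holds the lock simply fails to acquire it, which is the kind of
collision protection the commit message refers to.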

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Ensure xs_reset_transport() resets the close connection flags
Trond Myklebust [Sun, 8 Feb 2015 23:35:25 +0000 (18:35 -0500)]
SUNRPC: Ensure xs_reset_transport() resets the close connection flags

Otherwise, we may end up looping.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Do not clear the source port in xs_reset_transport
Trond Myklebust [Sun, 8 Feb 2015 21:28:58 +0000 (16:28 -0500)]
SUNRPC: Do not clear the source port in xs_reset_transport

Now that we can reuse bound ports after a close, we never really want to
clear the transport's source port after it has been set. Doing so really
messes up the NFSv3 DRC on the server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Handle EADDRINUSE on connect
Trond Myklebust [Mon, 9 Feb 2015 02:44:04 +0000 (21:44 -0500)]
SUNRPC: Handle EADDRINUSE on connect

Now that we're setting SO_REUSEPORT, we still need to handle the
case where a connect() is attempted, but the old socket is still
lingering.
Essentially, all we want to do here is handle the error by waiting
a few seconds and then retrying.
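
A minimal sketch of the wait-and-retry idea, not the actual xprtsock
code. The callback types, connect_with_retry() and the fake_* test
doubles are all hypothetical; the delay is injected so the logic can be
exercised without sleeping.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback types; neither is a kernel interface. */
typedef int (*connect_fn)(void *ctx);		/* returns 0 or -errno */
typedef void (*delay_fn)(unsigned int seconds);

/* Retry only on -EADDRINUSE, pausing between attempts. */
static int connect_with_retry(connect_fn try_connect, void *ctx,
			      delay_fn wait, unsigned int delay_secs,
			      unsigned int max_attempts)
{
	int err = -EADDRINUSE;
	unsigned int i;

	for (i = 0; i < max_attempts; i++) {
		err = try_connect(ctx);
		if (err != -EADDRINUSE)
			return err;	/* success, or a different error */
		if (i + 1 < max_attempts)
			wait(delay_secs); /* let the old socket finish lingering */
	}
	return err;
}

/* Test double: reports "address in use" for the first *ctx attempts. */
static int fake_connect(void *ctx)
{
	int *busy_left = ctx;

	if (*busy_left > 0) {
		(*busy_left)--;
		return -EADDRINUSE;
	}
	return 0;
}

/* Test double: records the requested delay instead of sleeping. */
static unsigned int total_wait;
static void fake_delay(unsigned int seconds)
{
	total_wait += seconds;
}
```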

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Set SO_REUSEPORT socket option for TCP connections
Trond Myklebust [Sun, 8 Feb 2015 20:00:06 +0000 (15:00 -0500)]
SUNRPC: Set SO_REUSEPORT socket option for TCP connections

When using TCP, we need the ability to reuse port numbers after
a disconnection, so that the NFSv3 server knows that we're the same
client. Currently we use a hack to work around the TCP socket's
TIME_WAIT: we send an RST instead of closing, which doesn't
always work...
The SO_REUSEPORT option added in Linux 3.9 allows us to bind multiple
TCP connections to the same source address+port combination, and thus
to use ordinary TCP close() instead of the current hack.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoMerge tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs...
Trond Myklebust [Sun, 8 Feb 2015 15:37:34 +0000 (10:37 -0500)]
Merge tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: RDMA Client Sparse Fixes

This patch fixes a sparse warning in the initial submission.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  xprtrdma: Address sparse complaint in rpcr_to_rdmar()

9 years agoNFSv4.1: Fix pnfs_put_lseg races
Trond Myklebust [Thu, 5 Feb 2015 22:27:39 +0000 (17:27 -0500)]
NFSv4.1: Fix pnfs_put_lseg races

pnfs_layoutreturn_free_lseg_async() can also race with inode put in
the general case. We can now fix this, and also simplify the code.

Cc: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS
Trond Myklebust [Thu, 5 Feb 2015 22:05:08 +0000 (17:05 -0500)]
NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS

If we want to be able to call pnfs_send_layoutreturn() from within the
writeback path, we really want it to use GFP_NOFS in order to prevent
recursion.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4.1: Pin the inode and super block in asynchronous layoutreturns
Trond Myklebust [Thu, 5 Feb 2015 21:35:16 +0000 (16:35 -0500)]
NFSv4.1: Pin the inode and super block in asynchronous layoutreturns

If we're sending an asynchronous layoutreturn, then we need to ensure
that the inode and the super block remain pinned.

Cc: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
9 years agoNFSv4.1: Pin the inode and super block in asynchronous layoutcommit
Trond Myklebust [Thu, 5 Feb 2015 21:50:30 +0000 (16:50 -0500)]
NFSv4.1: Pin the inode and super block in asynchronous layoutcommit

If we're sending an asynchronous layoutcommit, then we need to ensure
that the inode and the super block remain pinned.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
9 years agoNFSv4: Ensure we reference the inode for return-on-close in delegreturn
Trond Myklebust [Thu, 5 Feb 2015 20:13:24 +0000 (15:13 -0500)]
NFSv4: Ensure we reference the inode for return-on-close in delegreturn

If we have to do a return-on-close in the delegreturn code, then
we must ensure that the inode and super block remain referenced.

Cc: Peng Tao <tao.peng@primarydata.com>
Cc: stable@vger.kernel.org # 3.17.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
9 years agoxprtrdma: Address sparse complaint in rpcr_to_rdmar()
Chuck Lever [Wed, 4 Feb 2015 21:59:32 +0000 (16:59 -0500)]
xprtrdma: Address sparse complaint in rpcr_to_rdmar()

With "make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__":

linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: warning: incorrect
  type in initializer (different base types)
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: expected restricted
  __be32 [usertype] *buffer
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30:    got unsigned int
  [usertype] *rq_buffer

As far as I can tell this is a false positive.

Reported-by: kbuild-all@01.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoNFSv4.1: Ask for no delegation on OPEN if using O_DIRECT
Trond Myklebust [Fri, 30 Jan 2015 19:21:14 +0000 (14:21 -0500)]
NFSv4.1: Ask for no delegation on OPEN if using O_DIRECT

If we're using NFSv4.1, then we have the ability to let the server know
whether or not we believe that returning a delegation as part of our OPEN
request would be useful.
The feature needs to be used with care, since the client sending the request
doesn't necessarily know how other clients are using that file, and how
they may be affected by the delegation.
For this reason, our initial use of the feature will be to let the server
know when the client believes that handing out a delegation would not be
useful.
The first application for this function is when opening the file using
O_DIRECT.
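
A hedged sketch of the decision, not the client's actual code path:
deleg_want_for_open() is a hypothetical helper, and the two WANT_*
constants mirror the OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE and
OPEN4_SHARE_ACCESS_WANT_NO_DELEG values from RFC 5661.

```c
#include <assert.h>
#include <fcntl.h>

#ifndef O_DIRECT
#define O_DIRECT 040000	/* illustration-only fallback; the value is arch-specific */
#endif

/* Delegation "want" hints, values as listed in RFC 5661. */
#define WANT_NO_PREFERENCE	0x0000u
#define WANT_NO_DELEG		0x0400u

/* Ask for no delegation only when the file is opened with O_DIRECT. */
static unsigned int deleg_want_for_open(int open_flags)
{
	if (open_flags & O_DIRECT)
		return WANT_NO_DELEG;
	return WANT_NO_PREFERENCE;
}
```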

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Add Anna Schumaker as co-maintainer for the NFS client
Trond Myklebust [Tue, 3 Feb 2015 22:13:58 +0000 (17:13 -0500)]
NFS: Add Anna Schumaker as co-maintainer for the NFS client

Anna has essentially been performing the duties of co-maintainer for
the past several years. In recognition of those efforts, I'd like to
add her to the maintainers file.

Cc: Anna Schumaker <anna.schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: NULL utsname dereference on NFS umount during namespace cleanup
Trond Myklebust [Fri, 30 Jan 2015 23:12:28 +0000 (18:12 -0500)]
SUNRPC: NULL utsname dereference on NFS umount during namespace cleanup

Fix an Oopsable condition when nsm_mon_unmon is called as part of the
namespace cleanup, which now apparently happens after the utsname
has been freed.

Link: http://lkml.kernel.org/r/20150125220604.090121ae@neptune.home
Reported-by: Bruno Prémont <bonbons@linux-vserver.org>
Cc: stable@vger.kernel.org # 3.18
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoMerge branch 'flexfiles'
Trond Myklebust [Tue, 3 Feb 2015 21:01:27 +0000 (16:01 -0500)]
Merge branch 'flexfiles'

* flexfiles: (53 commits)
  pnfs: lookup new lseg at lseg boundary
  nfs41: .init_read and .init_write can be called with valid pg_lseg
  pnfs: Update documentation on the Layout Drivers
  pnfs/flexfiles: Add the FlexFile Layout Driver
  nfs: count DIO good bytes correctly with mirroring
  nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET
  nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes
  nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags
  nfs/flexfiles: send layoutreturn before freeing lseg
  nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE
  nfs41: allow async version layoutreturn
  nfs41: add range to layoutreturn args
  pnfs: allow LD to ask to resend read through pnfs
  nfs: add nfs_pgio_current_mirror helper
  nfs: only reset desc->pg_mirror_idx when mirroring is supported
  nfs41: add a debug warning if we destroy an unempty layout
  pnfs: fail comparison when bucket verifier not set
  nfs: mirroring support for direct io
  nfs: add mirroring support to pgio layer
  pnfs: pass ds_commit_idx through the commit path
  ...

Conflicts:
fs/nfs/pnfs.c
fs/nfs/pnfs.h

9 years agopnfs: lookup new lseg at lseg boundary
Weston Andros Adamson [Fri, 30 Jan 2015 16:01:02 +0000 (11:01 -0500)]
pnfs: lookup new lseg at lseg boundary

Before mirroring support was added, the pageio descriptor's pg_lseg was
set to null when an RPC was sent. Because of this, pg_init was called
at lseg boundaries with pg_lseg = NULL, and it could be set to the new
lseg.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
9 years agonfs41: .init_read and .init_write can be called with valid pg_lseg
Peng Tao [Sat, 24 Jan 2015 14:14:52 +0000 (22:14 +0800)]
nfs41: .init_read and .init_write can be called with valid pg_lseg

With pgio refactoring in v3.15, .init_read and .init_write can be
called with valid pgio->pg_lseg. file layout was fixed at that time
by commit c6194271f (pnfs: filelayout: support non page aligned
layouts). But the generic helper still needs to be fixed.

Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
9 years agopnfs: Update documentation on the Layout Drivers
Tom Haynes [Mon, 12 Jan 2015 19:51:45 +0000 (11:51 -0800)]
pnfs: Update documentation on the Layout Drivers

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agopnfs/flexfiles: Add the FlexFile Layout Driver
Tom Haynes [Thu, 11 Dec 2014 22:02:04 +0000 (17:02 -0500)]
pnfs/flexfiles: Add the FlexFile Layout Driver

The flexfile layout is a new layout that extends the
file layout. It is currently being drafted as a specification at
https://datatracker.ietf.org/doc/draft-ietf-nfsv4-layout-types/

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Tao Peng <bergwolf@primarydata.com>
9 years agonfs: count DIO good bytes correctly with mirroring
Peng Tao [Mon, 19 Jan 2015 04:41:16 +0000 (12:41 +0800)]
nfs: count DIO good bytes correctly with mirroring

When resending to the MDS, we might resend multiple mirrored
requests to the MDS. As a result, nfs_direct_good_bytes() ends
up counting bytes multiple times, causing the application to
get wrong return values from the read/write syscalls.

Fix it by tracking the start of a dreq and checking the range of
the pgio header.
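
One way to picture range-based accounting is the simplified model
below. The struct and function names are hypothetical, and it assumes
headers complete in order over a contiguous range; the point is only
that counting the same header twice leaves the total unchanged.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the direct request and pgio header. */
struct dreq_model {
	unsigned long long io_start;	/* file offset where the request begins */
	unsigned long long count;	/* good bytes counted so far */
};

struct hdr_model {
	unsigned long long io_start;	/* file offset covered by this header */
	unsigned long long good_bytes;	/* bytes this header completed */
};

/* Only advance the count if this header extends past what is already counted. */
static void account_good_bytes(struct dreq_model *dreq,
			       const struct hdr_model *hdr)
{
	unsigned long long end = hdr->io_start + hdr->good_bytes;

	if (dreq->io_start + dreq->count < end)
		dreq->count = end - dreq->io_start;
}
```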

Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
9 years agonfs41: wait for LAYOUTRETURN before retrying LAYOUTGET
Peng Tao [Mon, 1 Dec 2014 00:22:23 +0000 (08:22 +0800)]
nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET

Also take care to stop waiting if someone clears the retry bit.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
9 years agonfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes
Peng Tao [Mon, 1 Dec 2014 00:22:21 +0000 (08:22 +0800)]
nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes

To allow the pnfs LD to ask for direct writes to be resent.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
9 years agonfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags
Peng Tao [Mon, 1 Dec 2014 00:22:18 +0000 (08:22 +0800)]
nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags

Use it to indicate that LD wants to retry layoutget. LD can set
it whenever it wants the common pnfs code to return and retry
pnfs path through a new layout.

The bit gets cleared when client does a new layoutget, when client
closes the file (ROC case), or when kernel needs to evict the inode
(non-ROC case).

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
9 years agonfs/flexfiles: send layoutreturn before freeing lseg
Peng Tao [Mon, 20 Oct 2014 06:44:38 +0000 (14:44 +0800)]
nfs/flexfiles: send layoutreturn before freeing lseg

Otherwise we'll lose error tracking information when
encoding layoutreturn.

pnfs_put_lseg may be called from rpc callbacks. So we should not
call pnfs_send_layoutreturn directly because it can deadlock in
the rpc layer.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agonfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE
Peng Tao [Mon, 17 Nov 2014 01:30:41 +0000 (09:30 +0800)]
nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE

When it is set, generic pnfs will try to send layoutreturn right
before the last close/delegation_return, regardless of whether
NFS_LAYOUT_ROC is set. LD can then make sure layoutreturn is always
sent rather than being omitted.

The difference from NFS_LAYOUT_RETURN is that
NFS_LAYOUT_RETURN_BEFORE_CLOSE does not block usage of the layout, so
LD can set it and expect the generic layer to try the pnfs path at the
same time.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agonfs41: allow async version layoutreturn
Peng Tao [Mon, 17 Nov 2014 01:30:40 +0000 (09:30 +0800)]
nfs41: allow async version layoutreturn

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agonfs41: add range to layoutreturn args
Peng Tao [Mon, 17 Nov 2014 01:30:36 +0000 (09:30 +0800)]
nfs41: add range to layoutreturn args

So that callers can specify which range to return.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agopnfs: allow LD to ask to resend read through pnfs
Peng Tao [Mon, 10 Nov 2014 00:35:38 +0000 (08:35 +0800)]
pnfs: allow LD to ask to resend read through pnfs

If current IO cannot be completed due to some transient errors,
LD may want to ask generic layer to resend the request through
pnfs again.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agonfs: add nfs_pgio_current_mirror helper
Peng Tao [Mon, 10 Nov 2014 00:35:35 +0000 (08:35 +0800)]
nfs: add nfs_pgio_current_mirror helper

Let it return the current nfs_pgio_mirror in use, depending on pg_mirror_count.
For reads, we always use pg_mirrors[0], so this effectively gives us the freedom
to use pg_mirror_idx to track the actual mirror to read from throughout the
IO stack.
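
A sketch of what such a helper might look like, using simplified
stand-in types. The struct layouts, MODEL_MAX_MIRRORS and
current_mirror() are assumptions for illustration; only the field names
pg_mirror_count, pg_mirror_idx and pg_mirrors come from the text above.

```c
#include <assert.h>

#define MODEL_MAX_MIRRORS 16	/* assumed bound for the illustration */

struct mirror_model {
	unsigned long bytes_queued;
};

struct desc_model {
	unsigned int pg_mirror_count;
	unsigned int pg_mirror_idx;
	struct mirror_model pg_mirrors[MODEL_MAX_MIRRORS];
};

/*
 * With more than one mirror, use the slot selected by pg_mirror_idx;
 * otherwise always use slot 0, whatever pg_mirror_idx holds.
 */
static struct mirror_model *current_mirror(struct desc_model *desc)
{
	if (desc->pg_mirror_count > 1)
		return &desc->pg_mirrors[desc->pg_mirror_idx];
	return &desc->pg_mirrors[0];
}
```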

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agonfs: only reset desc->pg_mirror_idx when mirroring is supported
Peng Tao [Mon, 10 Nov 2014 00:35:34 +0000 (08:35 +0800)]
nfs: only reset desc->pg_mirror_idx when mirroring is supported

so that we don't reset desc->pg_mirror_idx for read unnecessarily.
Remove WARN_ON_ONCE from __nfs_pageio_add_request to allow LD to
set pg_mirror_idx for read where pg_mirror_count is always 1.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agonfs41: add a debug warning if we destroy an unempty layout
Peng Tao [Fri, 10 Oct 2014 15:25:46 +0000 (23:25 +0800)]
nfs41: add a debug warning if we destroy an unempty layout

So that we can detect the case where some layout segments are still
pinned, which is surely a bug that we need to fix.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
9 years agopnfs: fail comparison when bucket verifier not set
Weston Andros Adamson [Wed, 1 Oct 2014 16:58:25 +0000 (12:58 -0400)]
pnfs: fail comparison when bucket verifier not set

This skips the WARN_ON_ONCE, but doesn't change behavior (the memcmp would
fail).

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs: mirroring support for direct io
Weston Andros Adamson [Fri, 19 Sep 2014 16:48:33 +0000 (12:48 -0400)]
nfs: mirroring support for direct io

The current mirroring code only notices short writes to the first
mirror. This patch keeps per-mirror byte counts and only considers
a byte to be written once all mirrors report so.
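
The rule "written only once all mirrors report so" amounts to taking
the minimum of the per-mirror byte counts. A minimal sketch, with a
hypothetical helper name and a plain array standing in for the
per-mirror state:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes that every mirror has reported as written: the per-mirror minimum. */
static unsigned long long bytes_written_by_all(const unsigned long long *per_mirror,
					       size_t n_mirrors)
{
	unsigned long long min;
	size_t i;

	if (n_mirrors == 0)
		return 0;

	min = per_mirror[0];
	for (i = 1; i < n_mirrors; i++) {
		if (per_mirror[i] < min)
			min = per_mirror[i];
	}
	return min;
}
```

A short write on any single mirror therefore shortens the overall
result, not just a short write on the first mirror.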

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
9 years agonfs: add mirroring support to pgio layer
Weston Andros Adamson [Fri, 19 Sep 2014 14:55:07 +0000 (10:55 -0400)]
nfs: add mirroring support to pgio layer

This patch adds mirrored write support to the pgio layer. The default
is to use one mirror, but pgio callers may define callbacks to change
this to any value up to the (arbitrarily selected) limit of 16.

The basic idea is to break out members of nfs_pageio_descriptor that cannot
be shared between mirrored DSes and put them in a new structure.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
9 years agopnfs: pass ds_commit_idx through the commit path
Weston Andros Adamson [Fri, 5 Sep 2014 22:20:21 +0000 (18:20 -0400)]
pnfs: pass ds_commit_idx through the commit path

Pass ds_commit_idx through the nfs commit path. It's used to select
the commit bucket when using pnfs and is ignored when not using pnfs.
Several functions had to be changed: nfs_retry_commit,
nfs_mark_request_commit, pnfs_mark_request_commit and the pnfs layout
driver .mark_request_commit functions.

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agonfs: rename pgio header ds_idx to ds_commit_idx
Weston Andros Adamson [Tue, 16 Sep 2014 21:35:51 +0000 (17:35 -0400)]
nfs: rename pgio header ds_idx to ds_commit_idx

'ds_commit_idx' is a better name - it is used to select the right
commit bucket for pnfs.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
9 years agonfs: handle overlapping reqs in lock_and_join
Weston Andros Adamson [Fri, 5 Sep 2014 20:34:29 +0000 (16:34 -0400)]
nfs: handle overlapping reqs in lock_and_join

This is needed for mirrored DS support, where multiple requests
cover the same range.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
9 years agopnfs: release lseg in pnfs_generic_pg_cleanup
Weston Andros Adamson [Wed, 10 Sep 2014 19:48:01 +0000 (15:48 -0400)]
pnfs: release lseg in pnfs_generic_pg_cleanup

This is needed to support mirrored writes - the first write can't just
trash the lseg, we need to keep it around until all mirrors have
written.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
9 years agonfs: introduce pg_cleanup op for pgio descriptors
Weston Andros Adamson [Wed, 10 Sep 2014 19:44:18 +0000 (15:44 -0400)]
nfs: introduce pg_cleanup op for pgio descriptors

Add a new operation to nfs_pageio_ops that is called on nfs_pageio_complete.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
9 years agonfs/filelayout: use pnfs_error_mark_layout_for_return
Peng Tao [Fri, 5 Sep 2014 16:53:29 +0000 (00:53 +0800)]
nfs/filelayout: use pnfs_error_mark_layout_for_return

Instead of calling layoutreturn directly, call pnfs_error_mark_layout_for_return
to mark layouts for return and let generic code return layout when
layout segments are freed.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Conflicts:
fs/nfs/filelayout/filelayout.c

9 years agonfs41: clear NFS_LAYOUT_RETURN if layoutreturn is sent or failed to send
Peng Tao [Fri, 5 Sep 2014 16:53:26 +0000 (00:53 +0800)]
nfs41: clear NFS_LAYOUT_RETURN if layoutreturn is sent or failed to send

So that the pnfs path is not disabled forever.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: send layoutreturn in last put_lseg
Peng Tao [Fri, 5 Sep 2014 16:53:25 +0000 (00:53 +0800)]
nfs41: send layoutreturn in last put_lseg

If current lseg is the last lseg marked with NFS_LSEG_LAYOUTRETURN,
send layoutreturn.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: don't use a layout if it is marked for returning
Peng Tao [Fri, 5 Sep 2014 16:53:24 +0000 (00:53 +0800)]
nfs41: don't use a layout if it is marked for returning

And if we are to return the same type of layouts, don't bother
sending more layoutgets.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: add a helper to mark layout for return
Peng Tao [Fri, 5 Sep 2014 16:53:23 +0000 (00:53 +0800)]
nfs41: add a helper to mark layout for return

It marks all matching layout segments as NFS_LSEG_LAYOUTRETURN,
which is an indicator for pnfs_put_lseg() to send layoutreturn,
and also prevents pnfs_update_layout() from using the returning
segments. Once it is set, it never gets cleared.

It also sets the proper io failure bit so that the pnfs path can be
retried once PNFS_LAYOUTGET_RETRY_TIMEOUT has elapsed.
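
The marking step can be sketched with a toy model. The struct and both
helpers are hypothetical; the iomode values follow layoutiomode4 in
RFC 5661 (READ = 1, RW = 2, ANY = 3), and treating ANY as matching
every segment is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };

/* Simplified stand-in for a layout segment. */
struct lseg_model {
	int iomode;
	int marked_for_return;	/* once set, never cleared */
};

/* Mark every segment matching 'iomode'. Returns how many segments matched. */
static size_t mark_for_return(struct lseg_model *segs, size_t n, int iomode)
{
	size_t i, marked = 0;

	for (i = 0; i < n; i++) {
		if (iomode == IOMODE_ANY || segs[i].iomode == iomode) {
			segs[i].marked_for_return = 1;
			marked++;
		}
	}
	return marked;
}

/* A segment that is being returned must not be picked up for new I/O. */
static int lseg_usable(const struct lseg_model *seg)
{
	return !seg->marked_for_return;
}
```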

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: make a helper function to send layoutreturn
Peng Tao [Fri, 5 Sep 2014 16:53:22 +0000 (00:53 +0800)]
nfs41: make a helper function to send layoutreturn

It allows the caller to specify a different iomode to return.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: pass iomode through layoutreturn args
Peng Tao [Fri, 5 Sep 2014 16:53:21 +0000 (00:53 +0800)]
nfs41: pass iomode through layoutreturn args

So that it is possible to return layouts of a specific iomode.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs: save server READ/WRITE/COMMIT status
Peng Tao [Wed, 27 Aug 2014 02:47:14 +0000 (10:47 +0800)]
nfs: save server READ/WRITE/COMMIT status

The flexfiles layout will want to use them to report DS IO status.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: serialize first layoutget of a file
Peng Tao [Fri, 22 Aug 2014 09:37:41 +0000 (17:37 +0800)]
nfs41: serialize first layoutget of a file

Per RFC 5661 Errata 3208:
| A client MAY always forget its layout state and associated
| layout stateid at any time (See also section 12.5.5.1).
| In such case, the client MUST use a non-layout stateid for the next
| LAYOUTGET operation. This will signal the server that the client has
| no more layouts on the file and its respective layout state can be
| released before issuing a new layout in response to LAYOUTGET.

In order to make such a signal unique to the server, the client needs to
serialize all layoutgets that use a non-layout stateid. We implement this by
serializing layoutgets when the client has no layout segments at hand.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: close a small race window when adding new layout to global list
Peng Tao [Fri, 22 Aug 2014 09:37:40 +0000 (17:37 +0800)]
nfs41: close a small race window when adding new layout to global list

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs/flexclient: export pnfs_layoutcommit_inode
Peng Tao [Thu, 7 Aug 2014 02:12:38 +0000 (10:12 +0800)]
nfs/flexclient: export pnfs_layoutcommit_inode

flexfiles needs to start layoutcommit when necessary

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
9 years agonfs: set hostname when creating nfsv3 ds connection
Peng Tao [Mon, 7 Jul 2014 22:21:10 +0000 (06:21 +0800)]
nfs: set hostname when creating nfsv3 ds connection

lockd assumes that the hostname exists; otherwise the kernel oopses.
It can be reproduced with the following steps:
1. mount flexfile MDS
2. write some files
3. mount DS via nfsv3

BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffff8134f332>] strlen+0x2/0x20
 PGD 0
 Oops: 0000 [#1] SMP
 Modules linked in: nfsd(F) nfs_layout_flexfiles(F) rpcsec_gss_krb5(F) auth_rpcgss(F) nfsv4(F) dns_resolver(F) nfsv3(F) nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) ebtable_nat(F) nf_conntrack_netbios_ns(F) nf_conntrack_broadcast(F) ipt_MASQUERADE(F) ip6table_nat(F) nf_nat_ipv6(F) ip6table_mangle(F) ip6t_REJECT(F) nf_conntrack_ipv6(F) nf_defrag_ipv6(F) iptable_nat(F) nf_nat_ipv4(F) nf_nat(F) iptable_mangle(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) xt_conntrack(F) nf_conntrack(F) ebtable_filter(F) ebtables(F) ip6table_filter(F) ip6_tables(F) bnep(F) snd_ens1371(F) snd_rawmidi(F) snd_ac97_codec(F) btusb(F) ac97_bus(F) snd_seq(F) snd_seq_device(F) snd_pcm(F) ppdev(F) bluetooth(F) 6lowpan_iphc(F) rfkill(F) vmw_balloon(F) snd_timer(F) snd(F) soundcore(F) gameport(F) i2c_piix4(F) e1000(F) vmw_vmci(F) parport_pc(F) parport(F) shpchp(F) uinput(F) xfs(F) libcrc32c(F) vmwgfx(F) ttm(F) drm(F) mptspi(F) scsi_transport_spi(F) mptscsih(F) mptbase(F) i2c_core(F)
 CPU: 0 PID: 10397 Comm: mount.nfs Tainted: GF            3.14.7-100.pd_client.001.fc16.x86_64 #1
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
 task: ffff880008942600 ti: ffff880007990000 task.ti: ffff880007990000
 RIP: 0010:[<ffffffff8134f332>]  [<ffffffff8134f332>] strlen+0x2/0x20
 RSP: 0018:ffff880007991aa0  EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff880038d39c20 RCX: 0000000000000004
 RDX: 0000000000000006 RSI: 0000000000000010 RDI: 0000000000000000
 RBP: ffff880007991b38 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000014600 R11: 0000000000000400 R12: ffffffff81cc8580
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000004
 FS:  00007f90cd2ef880(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000000001710000 CR4: 00000000001407f0
 Stack:
  ffffffffa045f52c ffff880001782230 ffff880004141e28 0006880007991ac8
  ffffffff816dc14b ffff880000000000 ffff880038d39c20 0000000000000010
  0000000481cc0006 0000000000000000 ffffffffa0410be8 000000000000c014
 Call Trace:
  [<ffffffffa045f52c>] ? nlmclnt_lookup_host+0x4c/0x2c0 [lockd]
  [<ffffffff816dc14b>] ? _raw_spin_unlock_bh+0x1b/0x20
  [<ffffffffa0410be8>] ? svc_destroy+0xb8/0x140 [sunrpc]
  [<ffffffffa045c323>] nlmclnt_init+0x53/0xc0 [lockd]
  [<ffffffffa047d2dc>] ? nfs_get_client+0x1cc/0x340 [nfs]
  [<ffffffffa047c2e7>] nfs_start_lockd+0xa7/0xd0 [nfs]
  [<ffffffffa047df71>] nfs_create_server+0x181/0x5c0 [nfs]
  [<ffffffffa04460f3>] nfs3_create_server+0x13/0x30 [nfsv3]
  [<ffffffffa048a0bc>] nfs_try_mount+0x21c/0x300 [nfs]
  [<ffffffff811ca32d>] ? __kmalloc_track_caller+0x1ad/0x240
  [<ffffffffa048b677>] ? nfs_fs_mount+0xc37/0xd80 [nfs]
  [<ffffffffa048ad05>] nfs_fs_mount+0x2c5/0xd80 [nfs]
  [<ffffffffa048a830>] ? nfs_clone_super+0x140/0x140 [nfs]
  [<ffffffffa048a240>] ? nfs_clone_sb_security+0x40/0x40 [nfs]
  [<ffffffff811e7e43>] mount_fs+0x43/0x1b0
  [<ffffffff81193100>] ? __alloc_percpu+0x10/0x20
  [<ffffffff812026e6>] vfs_kern_mount+0x76/0x120
  [<ffffffff81204917>] do_mount+0x237/0xa80

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agosunrpc: add rpc_count_iostats_idx
Weston Andros Adamson [Tue, 24 Jun 2014 14:59:52 +0000 (10:59 -0400)]
sunrpc: add rpc_count_iostats_idx

Add a call to tally stats for a task under a different statsidx than
what's contained in the task structure.

This is needed to properly account for pnfs reads/writes when the
DS nfs version != the MDS version.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
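
The idea in "sunrpc: add rpc_count_iostats_idx" above, tallying a task under an index other than the one stored in the task, can be sketched in user space as follows. This is only an illustration: struct demo_task, struct demo_iostats and both demo_* functions are hypothetical names, not the sunrpc definitions.

```c
#define DEMO_NPROCS 4	/* hypothetical number of per-procedure stat slots */

struct demo_iostats {
	unsigned long ops;	/* completed operations */
	unsigned long bytes;	/* bytes moved */
};

struct demo_task {
	int statidx;		/* index the task would normally be counted under */
	unsigned long bytes;	/* bytes moved by this task */
};

/* Tally the task under an explicitly chosen index. */
static void demo_count_iostats_idx(const struct demo_task *task,
				   struct demo_iostats *stats, int idx)
{
	stats[idx].ops++;
	stats[idx].bytes += task->bytes;
}

/* Default behaviour: use the index stored in the task itself. */
static void demo_count_iostats(const struct demo_task *task,
			       struct demo_iostats *stats)
{
	demo_count_iostats_idx(task, stats, task->statidx);
}
```

Keeping the original entry point as a thin wrapper means existing callers need no change, while a caller that knows the data server speaks a different protocol version can pass its own index.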
9 years agoNFSv4.1/NFSv3: Add pNFS callbacks for nfs3_(read|write|commit)_done()
Trond Myklebust [Sun, 22 Jun 2014 16:55:11 +0000 (12:55 -0400)]
NFSv4.1/NFSv3: Add pNFS callbacks for nfs3_(read|write|commit)_done()

Enable pNFS callbacks to allow flex files to work correctly with an
NFSv3-enabled data server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: allow to specify cred in nfs_initiate_pgio
Peng Tao [Fri, 13 Jun 2014 15:02:25 +0000 (23:02 +0800)]
nfs: allow to specify cred in nfs_initiate_pgio

so that flexfile layout client can pass in DS credential instead of
using user cred, which will be done in the next patch.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs4: export nfs4_sequence_done
Peng Tao [Tue, 10 Jun 2014 21:24:16 +0000 (05:24 +0800)]
nfs4: export nfs4_sequence_done

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs4: pass slot table to nfs40_setup_sequence
Peng Tao [Tue, 10 Jun 2014 21:24:15 +0000 (05:24 +0800)]
nfs4: pass slot table to nfs40_setup_sequence

flexclient needs this as there is no nfs_server to DS connection.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs: allow different protocol in nfs_initiate_commit
Peng Tao [Sun, 8 Jun 2014 23:10:14 +0000 (07:10 +0800)]
nfs: allow different protocol in nfs_initiate_commit

pnfs flexfile layout client may want to use NFSv3 ops rather
than the default MDS v4 ops.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agopnfs: Add nfs_rpc_ops in calls to nfs_initiate_pgio
Tom Haynes [Mon, 9 Jun 2014 20:12:20 +0000 (13:12 -0700)]
pnfs: Add nfs_rpc_ops in calls to nfs_initiate_pgio

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agonfs41: create NFSv3 DS connection if specified
Peng Tao [Fri, 30 May 2014 10:15:59 +0000 (18:15 +0800)]
nfs41: create NFSv3 DS connection if specified

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: allow LD to choose DS connection version/minor_version
Peng Tao [Fri, 30 May 2014 10:15:58 +0000 (18:15 +0800)]
nfs41: allow LD to choose DS connection version/minor_version

The flexfile layout may need to set these when making DS connections.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfsv3: introduce nfs3_set_ds_client
Peng Tao [Fri, 30 May 2014 10:15:57 +0000 (18:15 +0800)]
nfsv3: introduce nfs3_set_ds_client

The flexfiles layout wants to create DS connections over NFSv3.
Add nfs3_set_ds_client to allow that to happen.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: move file layout macros to generic pnfs
Peng Tao [Fri, 30 May 2014 10:15:55 +0000 (18:15 +0800)]
nfs41: move file layout macros to generic pnfs

They can be reused by flexfile layout as well.

Also add code so that if a read fails on one DS and
other DSes are available, the read is not resent
through the MDS but through pNFS, so that the client can read
from the other DSes.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: allow LD to choose DS connection auth flavor
Peng Tao [Thu, 29 May 2014 13:07:00 +0000 (21:07 +0800)]
nfs41: allow LD to choose DS connection auth flavor

flexfile layout may use a different auth flavor as specified by MDS.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: pull nfs4_ds_connect from file layout to generic pnfs
Peng Tao [Thu, 29 May 2014 13:06:58 +0000 (21:06 +0800)]
nfs41: pull nfs4_ds_connect from file layout to generic pnfs

It can be reused by flexfiles layout client.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: pull decode_ds_addr from file layout to generic pnfs
Peng Tao [Thu, 29 May 2014 13:06:59 +0000 (21:06 +0800)]
nfs41: pull decode_ds_addr from file layout to generic pnfs

It can be reused by flexfile layout.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agonfs41: pull data server cache from file layout to generic pnfs
Peng Tao [Thu, 29 May 2014 13:06:57 +0000 (21:06 +0800)]
nfs41: pull data server cache from file layout to generic pnfs

Also pull nfs4_pnfs_ds_addr and nfs4_pnfs_ds to generic pnfs.

They can all be reused by flexfile layout as well.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
9 years agopnfs: Do not grab the commit_info lock twice when rescheduling writes
Tom Haynes [Thu, 11 Dec 2014 18:04:55 +0000 (13:04 -0500)]
pnfs: Do not grab the commit_info lock twice when rescheduling writes

Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agopnfs: Prepare for flexfiles by pulling out common code
Tom Haynes [Thu, 11 Dec 2014 20:34:59 +0000 (15:34 -0500)]
pnfs: Prepare for flexfiles by pulling out common code

The flexfilelayout driver will share some common code
with the filelayout driver. This set of changes refactors
that common code out to avoid any module dependencies.

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
9 years agoMerge tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma
Trond Myklebust [Tue, 3 Feb 2015 16:53:18 +0000 (11:53 -0500)]
Merge tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: Client side changes for RDMA

These patches improve the scalability of the NFSoRDMA client and take large
variables off of the stack.  Additionally, the GFP_* flags are updated to
match what TCP uses.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (21 commits)
  xprtrdma: Update the GFP flags used in xprt_rdma_allocate()
  xprtrdma: Clean up after adding regbuf management
  xprtrdma: Allocate zero pad separately from rpcrdma_buffer
  xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep
  xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req
  xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req
  xprtrdma: Add struct rpcrdma_regbuf and helpers
  xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()
  xprtrdma: Simplify synopsis of rpcrdma_buffer_create()
  xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack
  xprtrdma: Take struct ib_device_attr off the stack
  xprtrdma: Free the pd if ib_query_qp() fails
  xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt
  xprtrdma: Move credit update to RPC reply handler
  xprtrdma: Remove rl_mr field, and the mr_chunk union
  xprtrdma: Remove rpcrdma_ep::rep_ia
  xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt
  xprtrdma: Clean up hdrlen
  xprtrdma: Display XIDs in host byte order
  xprtrdma: Modernize htonl and ntohl
  ...

9 years agoNFS: a couple off by ones
Dan Carpenter [Tue, 16 Dec 2014 23:52:26 +0000 (02:52 +0300)]
NFS: a couple off by ones

These tests are off by one because if len == sizeof(nfs_export_path)
then we have truncated the name.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
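
A minimal user-space sketch of the truncation test described in "NFS: a couple off by ones" above, assuming the name is copied with snprintf(). The buffer size EXPORT_PATH_MAX and the helper copy_export_path() are hypothetical and are not taken from the NFS source. snprintf() returns the length the full string would have had, so a result equal to the buffer size already means one character was dropped to make room for the terminating NUL.

```c
#include <stdio.h>
#include <string.h>

#define EXPORT_PATH_MAX 8	/* hypothetical, deliberately small buffer size */

/*
 * Copy src into dst, which holds EXPORT_PATH_MAX bytes.
 * Returns 0 on success, -1 if the name did not fit.
 */
static int copy_export_path(char dst[EXPORT_PATH_MAX], const char *src)
{
	int len = snprintf(dst, EXPORT_PATH_MAX, "%s", src);

	/*
	 * Testing "len > EXPORT_PATH_MAX" would be off by one: when len
	 * equals the buffer size, the last byte was lost to the NUL.
	 */
	if (len < 0 || (size_t)len >= EXPORT_PATH_MAX)
		return -1;
	return 0;
}
```

With an 8-byte buffer, "/export" (7 characters) is accepted and "/exports" (8 characters) is rejected, which is the boundary case the commit message calls out.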
9 years agonfs: prevent truncate on active swapfile
Omar Sandoval [Thu, 8 Jan 2015 09:18:30 +0000 (01:18 -0800)]
nfs: prevent truncate on active swapfile

Most filesystems prevent truncation of an active swapfile by way of
inode_newsize_ok, called from inode_change_ok. NFS doesn't call either
from nfs_setattr, presumably because most of these checks are expected
to be done server-side. However, the IS_SWAPFILE check can only be done
client-side, and truncating a swapfile can't possibly be good.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: don't call blocking operations while !TASK_RUNNING
Jeff Layton [Wed, 14 Jan 2015 18:08:57 +0000 (13:08 -0500)]
nfs: don't call blocking operations while !TASK_RUNNING

Bruce reported seeing this warning pop when mounting using v4.1:

     ------------[ cut here ]------------
     WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0()
    do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810ff58f>] prepare_to_wait+0x2f/0x90
    Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi
    CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ #25
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
     0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78
     0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da
     ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000
    Call Trace:
     [<ffffffff8186ac78>] dump_stack+0x4c/0x65
     [<ffffffff810ac9da>] warn_slowpath_common+0x8a/0xc0
     [<ffffffff810aca65>] warn_slowpath_fmt+0x55/0x70
     [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
     [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
     [<ffffffff810dd2ad>] __might_sleep+0xbd/0xd0
     [<ffffffff8124c973>] kmem_cache_alloc_trace+0x243/0x430
     [<ffffffff810d941e>] ? groups_alloc+0x3e/0x130
     [<ffffffff810d941e>] groups_alloc+0x3e/0x130
     [<ffffffffa0301b1e>] svcauth_unix_accept+0x16e/0x290 [sunrpc]
     [<ffffffffa0300571>] svc_authenticate+0xe1/0xf0 [sunrpc]
     [<ffffffffa02fc564>] svc_process_common+0x244/0x6a0 [sunrpc]
     [<ffffffffa02fd044>] bc_svc_process+0x1c4/0x260 [sunrpc]
     [<ffffffffa03d5478>] nfs41_callback_svc+0x128/0x1f0 [nfsv4]
     [<ffffffff810ff970>] ? wait_woken+0xc0/0xc0
     [<ffffffffa03d5350>] ? nfs4_callback_svc+0x60/0x60 [nfsv4]
     [<ffffffff810d45bf>] kthread+0x11f/0x140
     [<ffffffff810ea815>] ? local_clock+0x15/0x30
     [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
     [<ffffffff81874bfc>] ret_from_fork+0x7c/0xb0
     [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
    ---[ end trace 675220a11e30f4f2 ]---

nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE,
which is just wrong. Fix that by finishing the wait immediately if we've
found that the list has something on it.

Also, we don't expect this kthread to accept signals, so we should be
using a TASK_UNINTERRUPTIBLE sleep instead. That, however, opens us up
to hung task warnings from the watchdog, so have the schedule_timeout
wake up every 60s if there's no callback activity.

Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoxprtrdma: Update the GFP flags used in xprt_rdma_allocate()
Chuck Lever [Mon, 26 Jan 2015 22:11:47 +0000 (17:11 -0500)]
xprtrdma: Update the GFP flags used in xprt_rdma_allocate()

Reflect the more conservative approach used in the socket transport's
version of this transport method. An RPC buffer allocation should
avoid forcing not just FS activity, but any I/O.

In particular, two recent changes missed updating xprtrdma:

 - Commit c6c8fe79a83e ("net, sunrpc: suppress allocation warning ...")
 - Commit a564b8f03986 ("nfs: enable swap on NFS")

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Clean up after adding regbuf management
Chuck Lever [Wed, 21 Jan 2015 16:04:41 +0000 (11:04 -0500)]
xprtrdma: Clean up after adding regbuf management

rpcrdma_{de}register_internal() are used only in verbs.c now.

MAX_RPCRDMAHDR is no longer used and can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Allocate zero pad separately from rpcrdma_buffer
Chuck Lever [Wed, 21 Jan 2015 16:04:33 +0000 (11:04 -0500)]
xprtrdma: Allocate zero pad separately from rpcrdma_buffer

Use the new rpcrdma_alloc_regbuf() API to shrink the amount of
contiguous memory needed for a buffer pool by moving the zero
pad buffer into a regbuf.

This is for consistency with the other uses of internally
registered memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep
Chuck Lever [Wed, 21 Jan 2015 16:04:25 +0000 (11:04 -0500)]
xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep

The rr_base field is currently the buffer where RPC replies land.

An RPC/RDMA reply header lands in this buffer. In some cases an RPC
reply header also lands in this buffer, just after the RPC/RDMA
header.

The inline threshold is an agreed-on size limit for RDMA SEND
operations that pass between server and client. The sum of the
RPC/RDMA reply header size and the RPC reply header size must be
less than this threshold.

The largest RDMA RECV that the client should have to handle is the
size of the inline threshold. The receive buffer should thus be the
size of the inline threshold, and not related to RPCRDMA_MAX_SEGS.

RPC replies received via RDMA WRITE (long replies) are caught in
rq_rcv_buf, which is the second half of the RPC send buffer. I.e.,
such replies are not involved in any way with rr_base.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req
Chuck Lever [Wed, 21 Jan 2015 16:04:16 +0000 (11:04 -0500)]
xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req

The rl_base field is currently the buffer where each RPC/RDMA call
header is built.

The inline threshold is an agreed-on size limit for RDMA SEND
operations that pass between client and server. The sum of the
RPC/RDMA header size and the RPC header size must be less than or
equal to this threshold.

Increasing the r/wsize maximum will require MAX_SEGS to grow
significantly, but the inline threshold size won't change (both
sides agree on it). The server's inline threshold doesn't change.

Since an RPC/RDMA header can never be larger than the inline
threshold, make all RPC/RDMA header buffers the size of the
inline threshold.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
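
The size rule stated in the message above (RPC/RDMA header size plus RPC header size must be less than or equal to the inline threshold) can be written as a small check. This is a sketch only: demo_fits_inline() is a hypothetical helper, and any concrete threshold value passed to it is an assumption made by the caller, not a value taken from xprtrdma.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if both headers fit in one inline send of `threshold` bytes. */
static bool demo_fits_inline(size_t rdma_hdr_len, size_t rpc_hdr_len,
			     size_t threshold)
{
	/* Compare without adding, so a huge length cannot wrap around. */
	if (rdma_hdr_len > threshold)
		return false;
	return rpc_hdr_len <= threshold - rdma_hdr_len;
}
```

Because the bound depends only on the negotiated threshold, a buffer sized to that threshold is always large enough for the headers, regardless of how many segments a request carries.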
9 years agoxprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req
Chuck Lever [Wed, 21 Jan 2015 16:04:08 +0000 (11:04 -0500)]
xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req

Because internal memory registration is an expensive and synchronous
operation, xprtrdma pre-registers send and receive buffers at mount
time, and then re-uses them for each RPC.

A "hardway" allocation is a memory allocation and registration that
replaces a send buffer during the processing of an RPC. Hardway must
be done if the RPC send buffer is too small to accommodate an RPC's
call and reply headers.

For xprtrdma, each RPC send buffer is currently part of struct
rpcrdma_req so that xprt_rdma_free(), which is passed nothing but
the address of an RPC send buffer, can find its matching struct
rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof.

That means that hardway currently has to replace a whole rpcrdma_req
when it replaces an RPC send buffer. This is often a fairly hefty
chunk of contiguous memory due to the size of the rl_segments array
and the fact that both the send and receive buffers are part of
struct rpcrdma_req.

Some obscure re-use of fields in rpcrdma_req is done so that
xprt_rdma_free() can detect replaced rpcrdma_req structs, and
restore the original.

This commit breaks apart the RPC send buffer and struct rpcrdma_req
so that increasing the size of the rl_segments array does not change
the alignment of each RPC send buffer. (Increasing rl_segments is
needed to bump up the maximum r/wsize for NFS/RDMA).

This change opens up some interesting possibilities for improving
the design of xprt_rdma_allocate().

xprt_rdma_allocate() is now the one place where RPC send buffers
are allocated or re-allocated, and they are now always left in place
by xprt_rdma_free().

A large re-allocation that includes both the rl_segments array and
the RPC send buffer is no longer needed. Send buffer re-allocation
becomes quite rare. Good send buffer alignment is guaranteed no
matter what the size of the rl_segments array is.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
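
The container_of / offsetof lookup that the message above describes for the old layout, where the send buffer was embedded in the request structure, looks roughly like this in user space. The macro is a simplified stand-in for the kernel's container_of(), and struct demo_req and req_from_buffer() are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical request layout with the send buffer embedded in it. */
struct demo_req {
	unsigned int id;
	char send_buf[64];
};

/* Given only the buffer address, recover the enclosing request. */
static struct demo_req *req_from_buffer(void *buffer)
{
	return container_of(buffer, struct demo_req, send_buf);
}
```

The lookup is a single pointer subtraction, which is why embedding was attractive; the cost is that the buffer and the structure must be allocated, and replaced, as one contiguous piece.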
9 years agoxprtrdma: Add struct rpcrdma_regbuf and helpers
Chuck Lever [Wed, 21 Jan 2015 16:04:00 +0000 (11:04 -0500)]
xprtrdma: Add struct rpcrdma_regbuf and helpers

There are several spots that allocate a buffer via kmalloc (usually
contiguously with another data structure) and then register that
buffer internally. I'd like to split the buffers out of these data
structures to allow the data structures to scale.

Start by adding functions that can kmalloc and register a buffer,
and can manage/preserve the buffer's associated ib_sge and ib_mr
fields.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
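
One way to picture a helper that allocates a buffer and keeps its registration state next to it is the user-space sketch below. It is not the rpcrdma_regbuf definition: struct demo_regbuf, its fields and both helpers are hypothetical, malloc() stands in for kmalloc(), and the "registration" is reduced to setting a flag-like key.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical registered-buffer wrapper; field names are illustrative. */
struct demo_regbuf {
	size_t size;		/* usable bytes in base[] */
	unsigned int lkey;	/* stand-in for registration state */
	char base[];		/* the buffer itself */
};

/* Allocate wrapper and buffer together, then pretend to register it. */
static struct demo_regbuf *demo_alloc_regbuf(size_t size)
{
	struct demo_regbuf *rb = malloc(sizeof(*rb) + size);

	if (!rb)
		return NULL;
	rb->size = size;
	rb->lkey = 1;		/* pretend registration succeeded */
	memset(rb->base, 0, size);
	return rb;
}

/* Pretend to deregister, then free; safe to call with NULL. */
static void demo_free_regbuf(struct demo_regbuf *rb)
{
	if (!rb)
		return;
	rb->lkey = 0;
	free(rb);
}
```

Because each buffer carries its own size and registration state, a larger structure can hold just a pointer to it and grow without dragging the buffer along.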
9 years agoxprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()
Chuck Lever [Wed, 21 Jan 2015 16:03:52 +0000 (11:03 -0500)]
xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()

Move the details of how to create and destroy rpcrdma_req and
rpcrdma_rep structures into helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Simplify synopsis of rpcrdma_buffer_create()
Chuck Lever [Wed, 21 Jan 2015 16:03:44 +0000 (11:03 -0500)]
xprtrdma: Simplify synopsis of rpcrdma_buffer_create()

Clean up: There is one call site for rpcrdma_buffer_create(). All of
the arguments there are fields of an rpcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack
Chuck Lever [Wed, 21 Jan 2015 16:03:35 +0000 (11:03 -0500)]
xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack

Reduce stack footprint of the connection upcall handler function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Take struct ib_device_attr off the stack
Chuck Lever [Wed, 21 Jan 2015 16:03:27 +0000 (11:03 -0500)]
xprtrdma: Take struct ib_device_attr off the stack

Device attributes are large, and are used in more than one place.
Stash a copy in dynamically allocated memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Free the pd if ib_query_qp() fails
Chuck Lever [Wed, 21 Jan 2015 16:03:19 +0000 (11:03 -0500)]
xprtrdma: Free the pd if ib_query_qp() fails

If ib_query_qp() fails or the memory registration mode isn't
supported, don't leak the PD. An orphaned IB/core resource will
cause IB module removal to hang.

Fixes: bd7ed1d13304 ("RPC/RDMA: check selected memory registration ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
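
The leak fixed by "xprtrdma: Free the pd if ib_query_qp() fails" above is an instance of a general error-path rule: anything acquired before a failing step must be released before returning. A user-space sketch of that rule follows. All names here (struct demo_pd, demo_alloc_pd(), demo_dealloc_pd(), demo_open()) are hypothetical stand-ins and do not reflect the IB/core API.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a protection domain. */
struct demo_pd { int handle; };

static int live_pds;	/* PDs allocated but not yet freed */

static struct demo_pd *demo_alloc_pd(void)
{
	struct demo_pd *pd = malloc(sizeof(*pd));

	if (pd)
		live_pds++;
	return pd;
}

static void demo_dealloc_pd(struct demo_pd *pd)
{
	free(pd);
	live_pds--;
}

/*
 * Allocate a PD, then run a later check that may fail (query_ok == 0).
 * On failure the PD is released before returning, so nothing is orphaned.
 */
static int demo_open(int query_ok, struct demo_pd **out)
{
	struct demo_pd *pd = demo_alloc_pd();
	int rc;

	if (!pd)
		return -1;

	if (!query_ok) {
		rc = -2;
		goto out_free_pd;
	}

	*out = pd;
	return 0;

out_free_pd:
	demo_dealloc_pd(pd);
	return rc;
}
```

The live_pds counter exists only so the sketch can show that the failure path leaves no PD behind.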
9 years agoxprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt
Chuck Lever [Wed, 21 Jan 2015 16:03:11 +0000 (11:03 -0500)]
xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt

Clean up: The rep_func field always refers to rpcrdma_conn_func().
rep_func should have been removed by commit b45ccfd25d50 ("xprtrdma:
Remove MEMWINDOWS registration modes").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Move credit update to RPC reply handler
Chuck Lever [Wed, 21 Jan 2015 16:03:02 +0000 (11:03 -0500)]
xprtrdma: Move credit update to RPC reply handler

Reduce work in the receive CQ handler, which can be run at hardware
interrupt level, by moving the RPC/RDMA credit update logic to the
RPC reply handler.

This has some additional benefits: More header sanity checking is
done before trusting the incoming credit value, and the receive CQ
handler no longer touches the RPC/RDMA header (the CPU stalls while
waiting for the header contents to be brought into the cache).

This further extends work begun by commit e7ce710a8802 ("xprtrdma:
Avoid deadlock when credit window is reset").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Remove rl_mr field, and the mr_chunk union
Chuck Lever [Wed, 21 Jan 2015 16:02:54 +0000 (11:02 -0500)]
xprtrdma: Remove rl_mr field, and the mr_chunk union

Clean up: Since commit 0ac531c18323 ("xprtrdma: Remove REGISTER
memory registration mode"), the rl_mr pointer is no longer used
anywhere.

After removal, there's only a single member of the mr_chunk union,
so mr_chunk can be removed as well, in favor of a single pointer
field.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Remove rpcrdma_ep::rep_ia
Chuck Lever [Wed, 21 Jan 2015 16:02:46 +0000 (11:02 -0500)]
xprtrdma: Remove rpcrdma_ep::rep_ia

Clean up: This field is not used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt
Chuck Lever [Wed, 21 Jan 2015 16:02:37 +0000 (11:02 -0500)]
xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt

Clean up: Use consistent field names in struct rpcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>