cascardo/linux.git
9 years agodirect-io: only inc/dec inode->i_dio_count for file systems
Jens Axboe [Wed, 15 Apr 2015 23:05:48 +0000 (17:05 -0600)]
direct-io: only inc/dec inode->i_dio_count for file systems

do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agofs/9p: fix readdir()
Johannes Berg [Wed, 22 Apr 2015 09:55:14 +0000 (11:55 +0200)]
fs/9p: fix readdir()

Al Viro's IOV changes broke 9p readdir() because the new code
didn't abort the read when it returned nothing. The original
code checked if the combined error/length was <= 0 but in the
new code that accidentally got changed to just an error check.

Add back the return from the function when nothing is read.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: e1200fe68f20 ("9p: switch p9_client_read() to passing struct iov_iter *")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: assorted d_backing_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:21 +0000 (22:26 +0000)]
VFS: assorted d_backing_inode() annotations

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: fs/inode.c helpers: d_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:15 +0000 (22:26 +0000)]
VFS: fs/inode.c helpers: d_inode() annotations

these should be used on objects already in top layer

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: fs/cachefiles: d_backing_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:21 +0000 (22:26 +0000)]
VFS: fs/cachefiles: d_backing_inode() annotations

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: fs library helpers: d_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:15 +0000 (22:26 +0000)]
VFS: fs library helpers: d_inode() annotations

library helpers called by filesystem drivers on their own inodes

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: assorted weird filesystems: d_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:12 +0000 (22:26 +0000)]
VFS: assorted weird filesystems: d_inode() annotations

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: normal filesystems (and lustre): d_inode() annotations
David Howells [Tue, 17 Mar 2015 22:25:59 +0000 (22:25 +0000)]
VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: security/: d_inode() annotations
David Howells [Thu, 19 Feb 2015 10:47:02 +0000 (10:47 +0000)]
VFS: security/: d_inode() annotations

... except where that code acts as a filesystem driver, rather than
working with dentries given to it.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: security/: d_backing_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:22 +0000 (22:26 +0000)]
VFS: security/: d_backing_inode() annotations

most of the ->d_inode uses there refer to the same inode IO would
go to, i.e. d_backing_inode()

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: net/: d_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:16 +0000 (22:26 +0000)]
VFS: net/: d_inode() annotations

socket inodes and sunrpc filesystems - inodes owned by that code

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: net/unix: d_backing_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:21 +0000 (22:26 +0000)]
VFS: net/unix: d_backing_inode() annotations

places where we are dealing with S_ISSOCK file creation/lookups.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: kernel/: d_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:16 +0000 (22:26 +0000)]
VFS: kernel/: d_inode() annotations

relayfs and tracefs are dealing with inodes of their own;
those two act as filesystem drivers

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: audit: d_backing_inode() annotations
David Howells [Tue, 17 Mar 2015 22:26:21 +0000 (22:26 +0000)]
VFS: audit: d_backing_inode() annotations

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: Fix up some ->d_inode accesses in the chelsio driver
David Howells [Fri, 6 Mar 2015 14:24:37 +0000 (14:24 +0000)]
VFS: Fix up some ->d_inode accesses in the chelsio driver

Fix up some ->d_inode accesses in the chelsio driver.

 (1) FILE_DATA() should just be replaced with file_inode().

 (2) set_debugfs_file_size() should be removed and debugfs_create_file_size()
     should be used to create the file.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: Cachefiles should perform fs modifications on the top layer only
David Howells [Fri, 6 Mar 2015 14:08:58 +0000 (14:08 +0000)]
VFS: Cachefiles should perform fs modifications on the top layer only

Cachefiles should perform fs modifications (eg. vfs_unlink()) on the top layer
only and should not attempt to alter the lower layer.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: AF_UNIX sockets should call mknod on the top layer only
David Howells [Fri, 6 Mar 2015 14:05:26 +0000 (14:05 +0000)]
VFS: AF_UNIX sockets should call mknod on the top layer only

AF_UNIX sockets should call mknod on the top layer only and should not attempt
to modify the lower layer in a layered filesystem such as overlayfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoblock: loop: switch to VFS ITER_BVEC
Christoph Hellwig [Tue, 7 Apr 2015 16:23:29 +0000 (18:23 +0200)]
block: loop: switch to VFS ITER_BVEC

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoconfigfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
David Howells [Mon, 2 Mar 2015 16:40:32 +0000 (16:40 +0000)]
configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode

Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: Make pathwalk use d_is_reg() rather than S_ISREG()
David Howells [Tue, 17 Mar 2015 22:16:40 +0000 (22:16 +0000)]
VFS: Make pathwalk use d_is_reg() rather than S_ISREG()

Make pathwalk use d_is_reg() rather than S_ISREG() to determine whether to
honour O_TRUNC.  Since this occurs after complete_walk(), the dentry type
field cannot change and the inode pointer cannot change as we hold a ref on
the dentry, so this should be safe.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
David Howells [Thu, 5 Mar 2015 12:46:49 +0000 (12:46 +0000)]
VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()

Fix up debugfs to use d_is_dir(dentry) in place of
S_ISDIR(dentry->d_inode->i_mode).

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
David Howells [Tue, 17 Mar 2015 17:33:52 +0000 (17:33 +0000)]
VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk

Where we have:

     if (!dentry->d_inode || d_is_negative(dentry)) {

type constructions in pathwalk we should be able to eliminate the check of
d_inode and rely solely on the result of d_is_negative() or d_is_positive().

What we do have to take care to do is to read d_inode after calling a
d_is_xxx() typecheck function to get the barriering right.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoNFS: Don't use d_inode as a variable name
David Howells [Wed, 4 Mar 2015 16:38:26 +0000 (16:38 +0000)]
NFS: Don't use d_inode as a variable name

Don't use d_inode as a variable name as it now masks a function name.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: Impose ordering on accesses of d_inode and d_flags
David Howells [Thu, 5 Mar 2015 14:09:22 +0000 (14:09 +0000)]
VFS: Impose ordering on accesses of d_inode and d_flags

Impose ordering on accesses of d_inode and d_flags to avoid the need to do
this:

if (!dentry->d_inode || d_is_negative(dentry)) {

when this:

if (d_is_negative(dentry)) {

should suffice.

This check is especially problematic if a dentry can have its type field set
to something other than DENTRY_MISS_TYPE when d_inode is NULL (as in
unionmount).

What we really need to do is stick a write barrier between setting d_inode and
setting d_flags and a read barrier between reading d_flags and reading
d_inode.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoVFS: Add owner-filesystem positive/negative dentry checks
David Howells [Wed, 11 Feb 2015 13:40:17 +0000 (13:40 +0000)]
VFS: Add owner-filesystem positive/negative dentry checks

Supply two functions to test whether a filesystem's own dentries are positive
or negative (d_really_is_positive() and d_really_is_negative()).

The problem is that the DCACHE_ENTRY_TYPE field of dentry->d_flags may be
overridden by the union part of a layered filesystem and isn't thus
necessarily indicative of the type of dentry.

Normally, this would involve a negative dentry (ie. ->d_inode == NULL) having
->d_layer.lower pointed to a lower layer dentry, DCACHE_PINNING_LOWER set and
the DCACHE_ENTRY_TYPE field set to something other than DCACHE_MISS_TYPE - but
it could also involve, say, a DCACHE_SPECIAL_TYPE being overridden to
DCACHE_WHITEOUT_TYPE if a 0,0 chardev is detected in the top layer.

However, inside a filesystem, when that fs is looking at its own dentries, it
probably wants to know if they are really negative or not - and doesn't care
about the fallthrough bits used by the union.

To this end, a filesystem should normally use d_really_is_positive/negative()
when looking at its own dentries rather than d_is_positive/negative() and
should use d_inode() to get at the inode.

Anyone looking at someone else's dentries (this includes pathwalk) should use
d_is_xxx() and d_backing_inode().

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonfs: generic_write_checks() shouldn't be done on swapout...
Al Viro [Thu, 9 Apr 2015 18:11:08 +0000 (14:11 -0400)]
nfs: generic_write_checks() shouldn't be done on swapout...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoocfs2: use __generic_file_write_iter()
Al Viro [Thu, 9 Apr 2015 18:01:33 +0000 (14:01 -0400)]
ocfs2: use __generic_file_write_iter()

we can do that now - all we need is to clear IOCB_DIRECT from ->ki_flags in
"can't do dio" case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agomirror O_APPEND and O_DIRECT into iocb->ki_flags
Al Viro [Thu, 9 Apr 2015 17:52:01 +0000 (13:52 -0400)]
mirror O_APPEND and O_DIRECT into iocb->ki_flags

... avoiding write_iter/fcntl races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch generic_write_checks() to iocb and iter
Al Viro [Thu, 9 Apr 2015 16:55:47 +0000 (12:55 -0400)]
switch generic_write_checks() to iocb and iter

... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success.  Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
fs/ext4/file.c

9 years agoocfs2: move generic_write_checks() before the alignment checks
Al Viro [Thu, 9 Apr 2015 15:14:45 +0000 (11:14 -0400)]
ocfs2: move generic_write_checks() before the alignment checks

Alignment checks for dio depend upon the range truncation done by
generic_write_checks().  They can be done as soon as we got ocfs2_rw_lock()
and that actually makes ocfs2_prepare_inode_for_write() simpler.

The only thing to watch out for is restoring the original count
in "unlock and redo without dio" case.  Position doesn't need to be
restored, since we change it only in O_APPEND case and in that case it
will be reassigned anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoocfs2_file_write_iter: stop messing with ppos
Al Viro [Thu, 9 Apr 2015 11:25:03 +0000 (07:25 -0400)]
ocfs2_file_write_iter: stop messing with ppos

it's &iocb->ki_pos; no need to obfuscate.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch 'for-linus' into for-next
Al Viro [Sun, 12 Apr 2015 02:29:51 +0000 (22:29 -0400)]
Merge branch 'for-linus' into for-next

9 years agoudf_file_write_iter: reorder and simplify
Al Viro [Tue, 7 Apr 2015 19:26:36 +0000 (15:26 -0400)]
udf_file_write_iter: reorder and simplify

it's easier to do generic_write_checks() first

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agofuse: ->direct_IO() doesn't need generic_write_checks()
Al Viro [Tue, 7 Apr 2015 19:06:19 +0000 (15:06 -0400)]
fuse: ->direct_IO() doesn't need generic_write_checks()

already done by caller.  We used to call __fuse_direct_write(), which
called generic_write_checks(); now the former got expanded, bringing
the latter to the surface.  It used to be called all along and calling
it from there had been wrong all along...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoext4_file_write_iter: move generic_write_checks() up
Al Viro [Tue, 7 Apr 2015 18:48:22 +0000 (14:48 -0400)]
ext4_file_write_iter: move generic_write_checks() up

simpler that way...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoxfs_file_aio_write_checks: switch to iocb/iov_iter
Al Viro [Tue, 7 Apr 2015 18:25:18 +0000 (14:25 -0400)]
xfs_file_aio_write_checks: switch to iocb/iov_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agogeneric_write_checks(): drop isblk argument
Al Viro [Sat, 4 Apr 2015 08:05:48 +0000 (04:05 -0400)]
generic_write_checks(): drop isblk argument

all remaining callers are passing 0; some just obscure that fact.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoblkdev_write_iter: expand generic_file_checks() call in there
Al Viro [Tue, 7 Apr 2015 15:35:14 +0000 (11:35 -0400)]
blkdev_write_iter: expand generic_file_checks() call in there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agolift generic_write_checks() into callers of __generic_file_write_iter()
Al Viro [Tue, 7 Apr 2015 15:28:12 +0000 (11:28 -0400)]
lift generic_write_checks() into callers of __generic_file_write_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago__generic_file_write_iter: keep ->ki_pos and return value consistent
Al Viro [Tue, 7 Apr 2015 14:22:53 +0000 (10:22 -0400)]
__generic_file_write_iter: keep ->ki_pos and return value consistent

A side effect worth noting: in O_APPEND case we set ->ki_pos early,
so if it turns out to be an error or a zero-length write, we'll
end up with ->ki_pos modified.  Safe, since all callers never
look at the ->ki_pos after the call of __generic_file_write_iter()
returning non-positive, all the way to caller of ->write_iter() and
those discard ->ki_pos when getting that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agocifs: fold cifs_iovec_write() into the only caller
Al Viro [Tue, 7 Apr 2015 02:44:11 +0000 (22:44 -0400)]
cifs: fold cifs_iovec_write() into the only caller

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agontfs: move iov_iter_truncate() closer to generic_write_checks()
Al Viro [Sun, 5 Apr 2015 18:06:24 +0000 (14:06 -0400)]
ntfs: move iov_iter_truncate() closer to generic_write_checks()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonew_sync_write(): discard ->ki_pos unless the return value is positive
Al Viro [Tue, 7 Apr 2015 00:50:38 +0000 (20:50 -0400)]
new_sync_write(): discard ->ki_pos unless the return value is positive

That allows ->write_iter() instances much more convenient life wrt
iocb->ki_pos (and fixes several filesystems with borderline POSIX
violations when zero-length write succeeds and changes the current
position).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agodirect_IO: remove rw from a_ops->direct_IO()
Omar Sandoval [Mon, 16 Mar 2015 11:33:53 +0000 (04:33 -0700)]
direct_IO: remove rw from a_ops->direct_IO()

Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agodirect_IO: use iov_iter_rw() instead of rw everywhere
Omar Sandoval [Mon, 16 Mar 2015 11:33:52 +0000 (04:33 -0700)]
direct_IO: use iov_iter_rw() instead of rw everywhere

The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoRemove rw from dax_{do_,}io()
Omar Sandoval [Mon, 16 Mar 2015 11:33:51 +0000 (04:33 -0700)]
Remove rw from dax_{do_,}io()

And use iov_iter_rw() instead.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoRemove rw from {,__,do_}blockdev_direct_IO()
Omar Sandoval [Mon, 16 Mar 2015 11:33:50 +0000 (04:33 -0700)]
Remove rw from {,__,do_}blockdev_direct_IO()

Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonew helper: iov_iter_rw()
Omar Sandoval [Tue, 17 Mar 2015 21:04:02 +0000 (14:04 -0700)]
new helper: iov_iter_rw()

Get either READ or WRITE out of iter->type.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago->aio_read and ->aio_write removed
Al Viro [Sat, 4 Apr 2015 05:14:53 +0000 (01:14 -0400)]
->aio_read and ->aio_write removed

no remaining users

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agopcm: another weird API abuse
Al Viro [Sat, 4 Apr 2015 04:19:32 +0000 (00:19 -0400)]
pcm: another weird API abuse

readv() and writev() should _not_ ignore all but the first ->iov_len,
among other things.  Really weird abuse of those syscalls - it
expects a vector element per channel, with identical lengths (it
actually assumes them to be identical - no checking is done).
readv() and writev() are really bad match for that.  Unfortunately,
userland API is userland API and we can't do anything about them.

Converted to ->read_iter/->write_iter.  Please, _please_ don't do
anything of that kind when designing new interfaces.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoinfinibad: weird APIs switched to ->write_iter()
Al Viro [Sat, 4 Apr 2015 04:11:32 +0000 (00:11 -0400)]
infinibad: weird APIs switched to ->write_iter()

Things Not To Do When Writing A Driver, part 1001st:
have writev() and write() on the same file doing completely
different things.  As in, "interpret very different sets of
commands".

We _can_ handle that, but it's a bloody bad idea.
Don't do that in new drivers.  Ever.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agokill do_sync_read/do_sync_write
Al Viro [Sat, 4 Apr 2015 02:10:20 +0000 (22:10 -0400)]
kill do_sync_read/do_sync_write

all remaining instances of aio_{read,write} (all 4 of them) have explicit
->read and ->write resp.; do_sync_read/do_sync_write is never called by
__vfs_read/__vfs_write anymore and no other users had been left.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agofuse: use iov_iter_get_pages() for non-splice path
Al Viro [Sat, 4 Apr 2015 02:06:08 +0000 (22:06 -0400)]
fuse: use iov_iter_get_pages() for non-splice path

store reference to iter instead of that to iovec

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agofuse: switch to ->read_iter/->write_iter
Al Viro [Sat, 4 Apr 2015 01:53:39 +0000 (21:53 -0400)]
fuse: switch to ->read_iter/->write_iter

we just change the calling conventions here; more work to follow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch drivers/char/mem.c to ->read_iter/->write_iter
Al Viro [Fri, 3 Apr 2015 19:57:04 +0000 (15:57 -0400)]
switch drivers/char/mem.c to ->read_iter/->write_iter

Note that _these_ guys have ->read() and ->write() left in place - they are
eqiuvalent to what we'd get if we replaced those with NULL, but we are
talking about hot paths here.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agomake new_sync_{read,write}() static
Al Viro [Fri, 3 Apr 2015 19:41:18 +0000 (15:41 -0400)]
make new_sync_{read,write}() static

All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agocoredump: accept any write method
Al Viro [Fri, 3 Apr 2015 19:23:17 +0000 (15:23 -0400)]
coredump: accept any write method

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch /dev/loop to vfs_iter_write()
Al Viro [Fri, 3 Apr 2015 19:21:59 +0000 (15:21 -0400)]
switch /dev/loop to vfs_iter_write()

all writable files that might be used as backing store for /dev/loop
already support ->write_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoserial2002: switch to __vfs_read/__vfs_write
Al Viro [Fri, 3 Apr 2015 19:14:42 +0000 (15:14 -0400)]
serial2002: switch to __vfs_read/__vfs_write

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoashmem: use __vfs_read()
Al Viro [Fri, 3 Apr 2015 19:09:38 +0000 (15:09 -0400)]
ashmem: use __vfs_read()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoexport __vfs_read()
Al Viro [Fri, 3 Apr 2015 19:09:18 +0000 (15:09 -0400)]
export __vfs_read()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoautofs: switch to __vfs_write()
Al Viro [Fri, 3 Apr 2015 19:07:48 +0000 (15:07 -0400)]
autofs: switch to __vfs_write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonew helper: __vfs_write()
Al Viro [Fri, 3 Apr 2015 19:06:43 +0000 (15:06 -0400)]
new helper: __vfs_write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch '9p-iov_iter' into for-next
Al Viro [Sun, 12 Apr 2015 02:28:58 +0000 (22:28 -0400)]
Merge branch '9p-iov_iter' into for-next

9 years agoswitch hugetlbfs to ->read_iter()
Al Viro [Fri, 3 Apr 2015 15:31:35 +0000 (11:31 -0400)]
switch hugetlbfs to ->read_iter()

... and fix the case when the area we are asked to read crosses
a hugepage boundary

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agocoda: switch to ->read_iter/->write_iter
Al Viro [Fri, 3 Apr 2015 14:58:11 +0000 (10:58 -0400)]
coda: switch to ->read_iter/->write_iter

... and request the same from the local cache - all filesystems with
anything usable for that support those already.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoncpfs: switch to ->read_iter/->write_iter
Al Viro [Fri, 3 Apr 2015 03:30:18 +0000 (23:30 -0400)]
ncpfs: switch to ->read_iter/->write_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonet/9p: remove (now-)unused helpers
Al Viro [Fri, 3 Apr 2015 03:11:36 +0000 (23:11 -0400)]
net/9p: remove (now-)unused helpers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agop9_client_attach(): set fid->uid correctly
Al Viro [Fri, 3 Apr 2015 01:47:49 +0000 (21:47 -0400)]
p9_client_attach(): set fid->uid correctly

it's almost always equal to current_fsuid(), but there's an exception -
if the first writeback fid is opened by non-root *and* that happens before
root has done any lookups in /, we end up doing attach for root.  The
current code leaves the resulting FID owned by root from the server POV
and by non-root from the client one.  Unfortunately, it means that e.g.
massive dcache eviction will leave that user buggered - they'll end
up redoing walks from / *and* picking that FID every time.  As soon as
they try to create something, the things will get nasty.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: we are leaking glock.client_id in v9fs_file_getlock()
Al Viro [Thu, 2 Apr 2015 16:02:03 +0000 (12:02 -0400)]
9p: we are leaking glock.client_id in v9fs_file_getlock()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: switch to ->read_iter/->write_iter
Al Viro [Thu, 2 Apr 2015 03:59:57 +0000 (23:59 -0400)]
9p: switch to ->read_iter/->write_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: get rid of v9fs_direct_file_read()
Al Viro [Thu, 2 Apr 2015 03:49:24 +0000 (23:49 -0400)]
9p: get rid of v9fs_direct_file_read()

do it in ->direct_IO()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: switch p9_client_read() to passing struct iov_iter *
Al Viro [Thu, 2 Apr 2015 03:42:28 +0000 (23:42 -0400)]
9p: switch p9_client_read() to passing struct iov_iter *

... and make it loop

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: get rid of v9fs_direct_file_write()
Al Viro [Thu, 2 Apr 2015 02:32:23 +0000 (22:32 -0400)]
9p: get rid of v9fs_direct_file_write()

just handle it in ->direct_IO()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: fold v9fs_file_write_internal() into the caller
Al Viro [Thu, 2 Apr 2015 02:04:46 +0000 (22:04 -0400)]
9p: fold v9fs_file_write_internal() into the caller

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: switch ->writepage() to direct use of p9_client_write()
Al Viro [Thu, 2 Apr 2015 01:54:42 +0000 (21:54 -0400)]
9p: switch ->writepage() to direct use of p9_client_write()

Don't mess with kmap() - just use ITER_BVEC.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: switch p9_client_write() to passing it struct iov_iter *
Al Viro [Thu, 2 Apr 2015 00:17:51 +0000 (20:17 -0400)]
9p: switch p9_client_write() to passing it struct iov_iter *

... and make it loop until it's done

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonet/9p: switch the guts of p9_client_{read,write}() to iov_iter
Al Viro [Wed, 1 Apr 2015 23:57:53 +0000 (19:57 -0400)]
net/9p: switch the guts of p9_client_{read,write}() to iov_iter

... and have get_user_pages_fast() mapping fewer pages than requested
to generate a short read/write.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonommu: use __vfs_read()
Al Viro [Tue, 31 Mar 2015 16:35:13 +0000 (12:35 -0400)]
nommu: use __vfs_read()

... instead of open-coding the call of ->read()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoacct: check FMODE_CAN_WRITE
Al Viro [Tue, 31 Mar 2015 16:30:48 +0000 (12:30 -0400)]
acct: check FMODE_CAN_WRITE

it's not calling ->write() directly anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoaio_run_iocb(): kill dead check
Al Viro [Tue, 31 Mar 2015 15:54:59 +0000 (11:54 -0400)]
aio_run_iocb(): kill dead check

We check if ->ki_pos is positive.  However, by that point we have
already done rw_verify_area(), which would have rejected such
unless the file had been one of /dev/mem, /dev/kmem and /proc/kcore.
All of which do not have vectored rw methods, so we would've bailed
out even earlier.

This check had been introduced before rw_verify_area() had been added there
- in fact, it was a subset of checks done on sync paths by rw_verify_area()
(back then the /dev/mem exception didn't exist at all).  The rest of checks
(mandatory locking, etc.) hadn't been added until later.  Unfortunately,
by the time the call of rw_verify_area() got added, the /dev/mem exception
had already appeared, so it wasn't obvious that the older explicit check
downstream had become dead code.  It *is* a dead code, though, since the few
files for which the exception applies do not have ->aio_{read,write}() or
->{read,write}_iter() and for them we won't reach that check anyway.

What's more, even if we ever introduce vectored methods for /dev/mem
and friends, they'll have to cope with negative positions anyway, since
readv(2) and writev(2) are using the same checks as read(2) and write(2) -
i.e. rw_verify_area().

Let's bury it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoioctx_alloc(): remove pointless check
Al Viro [Tue, 31 Mar 2015 15:43:52 +0000 (11:43 -0400)]
ioctx_alloc(): remove pointless check

Way, way back kiocb used to be picked from arrays, so ioctx_alloc()
checked for multiplication overflow when calculating the size of
such array.  By the time fs/aio.c went into the tree (in 2002) they
were already allocated one-by-one by kmem_cache_alloc(), so that
check had already become pointless.  Let's bury it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agolustre: kill unused members of struct vvp_thread_info
Al Viro [Tue, 31 Mar 2015 03:39:16 +0000 (23:39 -0400)]
lustre: kill unused members of struct vvp_thread_info

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoexpand __fuse_direct_write() in both callers
Al Viro [Tue, 31 Mar 2015 02:15:58 +0000 (22:15 -0400)]
expand __fuse_direct_write() in both callers

it's actually shorter that way *and* later we'll want iocb in scope
of generic_write_check() caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agofuse: switch fuse_direct_io_file_operations to ->{read,write}_iter()
Al Viro [Tue, 31 Mar 2015 02:08:36 +0000 (22:08 -0400)]
fuse: switch fuse_direct_io_file_operations to ->{read,write}_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agocuse: switch to iov_iter
Al Viro [Sat, 21 Mar 2015 13:01:45 +0000 (09:01 -0400)]
cuse: switch to iov_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch 'for-davem' into for-next
Al Viro [Sun, 12 Apr 2015 02:27:19 +0000 (22:27 -0400)]
Merge branch 'for-davem' into for-next

9 years agosg_start_req(): use import_iovec()
Al Viro [Sun, 22 Mar 2015 00:25:30 +0000 (20:25 -0400)]
sg_start_req(): use import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agosg_start_req(): make sure that there's not too many elements in iovec
Al Viro [Sun, 22 Mar 2015 00:08:18 +0000 (20:08 -0400)]
sg_start_req(): make sure that there's not too many elements in iovec

unfortunately, allowing an arbitrary 16bit value means a possibility of
overflow in the calculation of total number of pages in bio_map_user_iov() -
we rely on there being no more than PAGE_SIZE members of sum in the
first loop there.  If that sum wraps around, we end up allocating
too small array of pointers to pages and it's easy to overflow it in
the second loop.

X-Coverup: TINC (and there's no lumber cartel either)
Cc: stable@vger.kernel.org # way, way back
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoblk_rq_map_user(): use import_single_range()
Al Viro [Sun, 22 Mar 2015 00:06:04 +0000 (20:06 -0400)]
blk_rq_map_user(): use import_single_range()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agosg_io(): use import_iovec()
Al Viro [Sun, 22 Mar 2015 00:02:55 +0000 (20:02 -0400)]
sg_io(): use import_iovec()

... and don't skip access_ok() validation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoprocess_vm_access: switch to {compat_,}import_iovec()
Al Viro [Sat, 21 Mar 2015 18:47:11 +0000 (14:47 -0400)]
process_vm_access: switch to {compat_,}import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch keyctl_instantiate_key_common() to iov_iter
Al Viro [Tue, 17 Mar 2015 13:59:38 +0000 (09:59 -0400)]
switch keyctl_instantiate_key_common() to iov_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch {compat_,}do_readv_writev() to {compat_,}import_iovec()
Al Viro [Sat, 21 Mar 2015 23:40:11 +0000 (19:40 -0400)]
switch {compat_,}do_readv_writev() to {compat_,}import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoaio_setup_vectored_rw(): switch to {compat_,}import_iovec()
Al Viro [Sat, 21 Mar 2015 23:34:53 +0000 (19:34 -0400)]
aio_setup_vectored_rw(): switch to {compat_,}import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agovmsplice_to_user(): switch to import_iovec()
Al Viro [Sat, 21 Mar 2015 23:17:55 +0000 (19:17 -0400)]
vmsplice_to_user(): switch to import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agokill aio_setup_single_vector()
Al Viro [Sat, 21 Mar 2015 23:11:55 +0000 (19:11 -0400)]
kill aio_setup_single_vector()

identical to import_single_range()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch 'iov_iter' into for-next
Al Viro [Sun, 12 Apr 2015 02:26:51 +0000 (22:26 -0400)]
Merge branch 'iov_iter' into for-next

9 years agoaio: simplify arguments of aio_setup_..._rw()
Al Viro [Sat, 21 Mar 2015 00:40:18 +0000 (20:40 -0400)]
aio: simplify arguments of aio_setup_..._rw()

We don't need req in either of those.  We don't need nr_segs in caller.
We don't really need len in caller either - iov_iter_count(&iter) will do.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoaio: lift iov_iter_init() into aio_setup_..._rw()
Al Viro [Sat, 21 Mar 2015 00:17:32 +0000 (20:17 -0400)]
aio: lift iov_iter_init() into aio_setup_..._rw()

the only non-trivial detail is that we do it before rw_verify_area(),
so we'd better cap the length ourselves in aio_setup_single_rw()
case (for vectored case rw_copy_check_uvector() will do that for us).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>