cascardo/linux.git
10 years agoBtrfs: check if leaf's parent exists before pushing items around
Liu Bo [Wed, 22 May 2013 12:07:06 +0000 (12:07 +0000)]
Btrfs: check if leaf's parent exists before pushing items around

During splitting a leaf, pushing items around to hopefully get some space only
works when we have a parent, ie. we have at least one sibling leaf.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: dont do log_removal in insert_new_root
Liu Bo [Wed, 22 May 2013 12:06:51 +0000 (12:06 +0000)]
Btrfs: dont do log_removal in insert_new_root

As for splitting a leaf, root is just the leaf, and tree mod log does not apply
on leaf, so in this case, we don't do log_removal.

As for splitting a node, the old root is kept as a normal node and we have nicely
put records in tree mod log for moving keys and items, so in this case we don't do
that either.

As above, insert_new_root can get rid of log_removal.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: return error code in btrfs_check_trunc_cache_free_space()
Wei Yongjun [Tue, 21 May 2013 02:39:21 +0000 (02:39 +0000)]
Btrfs: return error code in btrfs_check_trunc_cache_free_space()

Fix to return error code instead always return 0 from function
btrfs_check_trunc_cache_free_space().
Introduced by commit 7b61cd92242542944fc27024900c495a6a7b3396
(Btrfs: don't use global block reservation for inode cache truncation)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix estale with btrfs send
Josef Bacik [Mon, 20 May 2013 15:26:50 +0000 (11:26 -0400)]
Btrfs: fix estale with btrfs send

This fixes bugzilla 57491.  If we take a snapshot of a fs with a unlink ongoing
and then try to send that root we will run into problems.  When comparing with a
parent root we will search the parents and the send roots commit_root, which if
we've just created the snapshot will include the file that needs to be evicted
by the orphan cleanup.  So when we find a changed extent we will try and copy
that info into the send stream, but when we lookup the inode we use the normal
root, which no longer has the inode because the orphan cleanup deleted it.  The
best solution I have for this is to check our otransid with the generation of
the commit root and if they match just commit the transaction again, that way we
get the changes from the orphan cleanup.  With this patch the reproducer I made
for this bugzilla no longer returns ESTALE when trying to do the send.  Thanks,

Cc: stable@vger.kernel.org
Reported-by: Chris Wilson <jakdaw@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: device delete to get errors from the kernel
Anand Jain [Fri, 17 May 2013 10:52:45 +0000 (10:52 +0000)]
btrfs: device delete to get errors from the kernel

when user runs command btrfs dev del the raid requisite error if any
goes to the /var/log/messages, its not good idea to clutter messages
with these user (knowledge) errors, further user don't have to review
the system messages to know problem with the cli it should be dropped
to the user as part of the cli return.

to bring this feature created a set of the ERROR defined
BTRFS_ERROR_DEV* error codes and created their error string.

I expect this enum to be added with other error which we might
want to communicate to the user land

v3:
moved the code with in the file no logical change

v1->v2:
introduce error codes for the device mgmt usage

v1:
adds a parameter in the ioctl arg struct to carry the error string

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: do delay iput in sync_fs
Josef Bacik [Thu, 16 May 2013 15:14:33 +0000 (11:14 -0400)]
Btrfs: do delay iput in sync_fs

We get lock inversion with umount if we allow iputs from sync_fs, so use the
delay iput flag to keep this from happening.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: make the state of the transaction more readable
Miao Xie [Fri, 17 May 2013 03:53:43 +0000 (03:53 +0000)]
Btrfs: make the state of the transaction more readable

We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.

This patch improved the above problem. In this patch, we define 6 states
for the transaction,
  enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
  }
and just use 1 variant to track those state.

In order to make the blocked handle types for each state more clear,
we introduce a array:
  unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
   __TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
   __TRANS_START |
   __TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
   __TRANS_START |
   __TRANS_ATTACH |
   __TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
   __TRANS_START |
   __TRANS_ATTACH |
   __TRANS_JOIN |
   __TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
   __TRANS_START |
   __TRANS_ATTACH |
   __TRANS_JOIN |
   __TRANS_JOIN_NOLOCK),
  }
it is very intuitionistic.

Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove the time check in btrfs_commit_transaction()
Miao Xie [Wed, 15 May 2013 07:48:30 +0000 (07:48 +0000)]
Btrfs: remove the time check in btrfs_commit_transaction()

We checked the commit time to avoid committing the transaction
frequently, but it is unnecessary because:
- It made the transaction commit spend more time, and delayed the
  operation of the external writers(TRANS_START/TRANS_USERSPACE).
- Except the space that we have to commit transaction, such as
  snapshot creation, btrfs doesn't commit the transaction on its
  own initiative.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure
Miao Xie [Wed, 15 May 2013 07:48:29 +0000 (07:48 +0000)]
Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure

We used ->num_joined track if there were some writers which join the current
transaction when the committer was sleeping. If some writers joined the current
transaction, we has to continue the while loop to do some necessary stuff, such
as flush the ordered operations. But it is unnecessary because we will do it
after the while loop.

Besides that, tracking ->num_joined would make the committer drop into the while
loop when there are lots of internal writers(TRANS_JOIN).

So we remove ->num_joined and don't track if there are some writers which join
the current transaction when the committer is sleeping.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set
Miao Xie [Wed, 15 May 2013 07:48:28 +0000 (07:48 +0000)]
Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set

It is unnecessary to flush the delalloc inodes again and again because
we don't care the dirty pages which are introduced after the flush, and
they will be flush in the transaction commit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: don't wait for all the writers circularly during the transaction commit
Miao Xie [Wed, 15 May 2013 07:48:27 +0000 (07:48 +0000)]
Btrfs: don't wait for all the writers circularly during the transaction commit

btrfs_commit_transaction has the following loop before we commit the
transaction.

do {
    // attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
 (should_grow && cur_trans->num_joined != joined));

This is used to prevent from the TRANS_START to get in the way of a
committing transaction. But it does not prevent from TRANS_JOIN, that
is we would do this loop for a long time if some writers JOIN the
current transaction endlessly.

Because we need join the current transaction to do some useful stuff,
we can not block TRANS_JOIN here. So we introduce a external writer
counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
If the external writer counter is zero, we can break the above loop.

In order to make the code more clear, we don't use enum variant
to define the type of the transaction handle, use bitmask instead.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove the code for the impossible case in cleanup_transaction()
Miao Xie [Wed, 15 May 2013 07:48:26 +0000 (07:48 +0000)]
Btrfs: remove the code for the impossible case in cleanup_transaction()

If the transaction is removed from the transaction list, it means the
transaction has been committed successfully. So it is impossible to
call cleanup_transaction(), otherwise there is something wrong with
the code logic. Thus, we use BUG_ON() instead of the original handle.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup unnecessary assignment when cleaning up all the residual transaction
Miao Xie [Wed, 15 May 2013 07:48:25 +0000 (07:48 +0000)]
Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction

When we umount a fs with serious errors, we will invoke btrfs_cleanup_transactions()
to clean up the residual transaction. At this time, It is impossible to start a new
transaction, so we needn't assign trans_no_join to 1, and also needn't clear running
transaction every time we destroy a residual transaction.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: just flush the delalloc inodes in the source tree before snapshot creation
Miao Xie [Wed, 15 May 2013 07:48:24 +0000 (07:48 +0000)]
Btrfs: just flush the delalloc inodes in the source tree before snapshot creation

Before applying this patch, we need flush all the delalloc inodes in
the fs when we want to create a snapshot, it wastes time, and make
the transaction commit be blocked for a long time. It means some other
user operation would also be blocked for a long time.

This patch improves this problem, we just flush the delalloc inodes that
in the source trees before snapshot creation, so the transaction commit
will complete quickly.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: introduce per-subvolume ordered extent list
Miao Xie [Wed, 15 May 2013 07:48:23 +0000 (07:48 +0000)]
Btrfs: introduce per-subvolume ordered extent list

The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: introduce per-subvolume delalloc inode list
Miao Xie [Wed, 15 May 2013 07:48:22 +0000 (07:48 +0000)]
Btrfs: introduce per-subvolume delalloc inode list

When we create a snapshot, we need flush all delalloc inodes in the
fs, just flushing the inodes in the source tree is OK. So we introduce
per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: introduce grab/put functions for the root of the fs/file tree
Miao Xie [Wed, 15 May 2013 07:48:20 +0000 (07:48 +0000)]
Btrfs: introduce grab/put functions for the root of the fs/file tree

The grab/put funtions will be used in the next patch, which need grab
the root object and ensure it is not freed. We use reference counter
instead of the srcu lock is to aovid blocking the memory reclaim task,
which invokes synchronize_srcu().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup the similar code of the fs root read
Miao Xie [Wed, 15 May 2013 07:48:19 +0000 (07:48 +0000)]
Btrfs: cleanup the similar code of the fs root read

There are several functions whose code is similar, such as
  btrfs_find_last_root()
  btrfs_read_fs_root_no_radix()

Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
  btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.

So cleanup it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: make the snap/subv deletion end more early when the fs is R/O
Miao Xie [Tue, 14 May 2013 10:20:43 +0000 (10:20 +0000)]
Btrfs: make the snap/subv deletion end more early when the fs is R/O

The snapshot/subvolume deletion might spend lots of time, it would make
the remount task wait for a long time. This patch improve this problem,
we will break the deletion if the fs is remounted to be R/O. It will make
the users happy.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot()
Miao Xie [Tue, 14 May 2013 10:20:42 +0000 (10:20 +0000)]
Btrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot()

If the fs is remounted to be R/O, it is unnecessary to call
btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
this function. And besides that, it can make the check logic in the
caller more clear.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: make the cleaner complete early when the fs is going to be umounted
Miao Xie [Tue, 14 May 2013 10:20:41 +0000 (10:20 +0000)]
Btrfs: make the cleaner complete early when the fs is going to be umounted

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove unnecessary ->s_umount in cleaner_kthread()
Miao Xie [Tue, 14 May 2013 10:20:40 +0000 (10:20 +0000)]
Btrfs: remove unnecessary ->s_umount in cleaner_kthread()

In order to avoid the R/O remount, we acquired ->s_umount lock during
we deleted the dead snapshots and subvolumes. But it is unnecessary,
because we have cleaner_mutex.

We use cleaner_mutex to protect the process of the dead snapshots/subvolumes
deletion. And when we remount the fs to be R/O, we also acquire this mutex to
do cleanup after we change the status of the fs. That is this lock can serialize
the above operations, the cleaner can be aware of the status of the fs, and if
the cleaner is deleting the dead snapshots/subvolumes, the remount task will
wait for it. So it is safe to remove ->s_umount in cleaner_kthread().

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup: don't check the same thing twice
Stefan Behrens [Mon, 13 May 2013 13:53:35 +0000 (13:53 +0000)]
Btrfs: cleanup: don't check the same thing twice

btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in six places.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup, btrfs_read_fs_root_no_name() doesn't return NULL
Stefan Behrens [Mon, 13 May 2013 14:42:57 +0000 (14:42 +0000)]
Btrfs: cleanup, btrfs_read_fs_root_no_name() doesn't return NULL

No need to check for NULL in send.c and disk-io.c.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: delete unused function
Stefan Behrens [Mon, 13 May 2013 13:53:33 +0000 (13:53 +0000)]
Btrfs: delete unused function

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: remove useless copy in quota_ctl
Liu Bo [Mon, 13 May 2013 11:10:44 +0000 (11:10 +0000)]
Btrfs: remove useless copy in quota_ctl

We don't need to copy it back to user side as it remains unchanged.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoMinor format cleanup.
Andreas Philipp [Sat, 11 May 2013 11:12:54 +0000 (11:12 +0000)]
Minor format cleanup.

Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and
BTRFS_BLOCK_GROUP_RAID6.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: cleanup unused arguments in send.c
Tsutomu Itoh [Wed, 8 May 2013 07:51:52 +0000 (07:51 +0000)]
Btrfs: cleanup unused arguments in send.c

sctx is removed from the argument of the function that
doesn't use sctx.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix a comment
Stefan Behrens [Tue, 7 May 2013 10:23:30 +0000 (10:23 +0000)]
Btrfs: fix a comment

The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: add ioctl to wait for qgroup rescan completion
Jan Schmidt [Mon, 6 May 2013 19:14:17 +0000 (19:14 +0000)]
Btrfs: add ioctl to wait for qgroup rescan completion

btrfs_qgroup_wait_for_completion waits until the currently running qgroup
operation completes. It returns immediately when no rescan process is in
progress. This is useful to automate things around the rescan process (e.g.
testing).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist
Wang Shilong [Mon, 6 May 2013 11:03:27 +0000 (11:03 +0000)]
Btrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist

When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
when we want to walk qgroup tree.

By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
once. This reduce some sys time to allocate memory, see the measurements below

fsstress -p 4 -n 10000 -d $dir

With this patch:

real    0m50.153s
user    0m0.081s
sys     0m6.294s

real    0m51.113s
user    0m0.092s
sys     0m6.220s

real    0m52.610s
user    0m0.096s
sys     0m6.125s avg 6.213
-----------------------------------------------------
Without the patch:

real    0m54.825s
user    0m0.061s
sys     0m10.665s

real    1m6.401s
user    0m0.089s
sys     0m11.218s

real    1m13.768s
user    0m0.087s
sys     0m10.665s       avg 10.849

we can see the sys time reduce ~43%.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: show compiled-in config features at module load time
David Sterba [Tue, 30 Apr 2013 16:51:59 +0000 (16:51 +0000)]
btrfs: show compiled-in config features at module load time

We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
Also kill version.h file that serves no purpose, we don't use any
version tag for kernel module.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: move ifdef around sanity checks out of init_btrfs_fs
David Sterba [Tue, 30 Apr 2013 16:51:58 +0000 (16:51 +0000)]
btrfs: move ifdef around sanity checks out of init_btrfs_fs

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: add prefix to sanity tests messages
David Sterba [Tue, 30 Apr 2013 16:51:57 +0000 (16:51 +0000)]
btrfs: add prefix to sanity tests messages

And change the message level to KERN_INFO.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agobtrfs: add debug check for extent_io range alignment
David Sterba [Tue, 30 Apr 2013 15:22:23 +0000 (15:22 +0000)]
btrfs: add debug check for extent_io range alignment

The 'end' value must exactly cover the end of the interval, which means
one byte less than the expected block alignment, or in case of a file
smaller than one block, one byte less than the inode size.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix check on same raid type flag twice
Henrik Nordvik [Mon, 29 Apr 2013 18:09:23 +0000 (18:09 +0000)]
Btrfs: fix check on same raid type flag twice

Code checked for raid 5 flag in two else-if branches, so code would never be reached. Probably a copy-paste bug.

Signed-off-by: Henrik Nordvik <henrikno@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: stop all workers before cleaning up roots
Josef Bacik [Thu, 30 May 2013 20:55:44 +0000 (16:55 -0400)]
Btrfs: stop all workers before cleaning up roots

Dave reported a panic because the extent_root->commit_root was NULL in the
caching kthread.  That is because we just unset it in free_root_pointers, which
is not the correct thing to do, we have to either wait for the caching kthread
to complete or hold the extent_commit_sem lock so we know the thread has exited.
This patch makes the kthreads all stop first and then we do our cleanup.  This
should fix the race.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
10 years agoBtrfs: fix use-after-free bug during umount
Liu Bo [Sun, 26 May 2013 13:50:27 +0000 (13:50 +0000)]
Btrfs: fix use-after-free bug during umount

Commit be283b2e674a09457d4563729015adb637ce7cc1
(    Btrfs: use helper to cleanup tree roots) introduced the following bug,

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
 IP: [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
[...]
 Pid: 2463, comm: btrfs-cache-1 Tainted: G           O 3.9.0+ #4 innotek GmbH VirtualBox/VirtualBox
 RIP: 0010:[<ffffffffa039368c>]  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
 Process btrfs-cache-1 (pid: 2463, threadinfo ffff880112d60000, task ffff880117679730)
[...]
 Call Trace:
  [<ffffffffa0398a99>] btrfs_search_slot+0x104/0x64d [btrfs]
  [<ffffffffa039aea4>] btrfs_next_old_leaf+0xa7/0x334 [btrfs]
  [<ffffffffa039b141>] btrfs_next_leaf+0x10/0x12 [btrfs]
  [<ffffffffa039ea13>] caching_thread+0x1a3/0x2e0 [btrfs]
  [<ffffffffa03d8811>] worker_loop+0x14b/0x48e [btrfs]
  [<ffffffffa03d86c6>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
  [<ffffffff81068d3d>] kthread+0x8d/0x95
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
  [<ffffffff8151e5ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
RIP  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]

We've free'ed commit_root before actually getting to free block groups where
caching thread needs valid extent_root->commit_root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
10 years agoBtrfs: init relocate extent_io_tree with a mapping
Josef Bacik [Fri, 31 May 2013 17:04:36 +0000 (13:04 -0400)]
Btrfs: init relocate extent_io_tree with a mapping

Dave reported a NULL pointer deref.  This is caused because he thought he'd be
smart and add sanity checks to the extent_io bit operations, but he didn't
expect a tree to have a NULL mapping.  To fix this we just need to init the
relocation's processed_blocks with the btree_inode->i_mapping.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
10 years agobtrfs: Drop inode if inode root is NULL
Naohiro Aota [Thu, 6 Jun 2013 09:56:34 +0000 (09:56 +0000)]
btrfs: Drop inode if inode root is NULL

There is a path where btrfs_drop_inode() is called with its inode's root
is NULL: In btrfs_new_inode(), when btrfs_set_inode_index() fails,
iput() is called. We should handle this case before taking look at the
root->root_item.

Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
10 years agoBtrfs: don't delete fs_roots until after we cleanup the transaction
Josef Bacik [Thu, 6 Jun 2013 14:29:40 +0000 (10:29 -0400)]
Btrfs: don't delete fs_roots until after we cleanup the transaction

We get a use after free if we had a transaction to cleanup since there could be
delayed inodes which refer to their respective fs_root.  Thanks

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
11 years agoMerge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs...
Chris Mason [Sat, 18 May 2013 01:53:17 +0000 (21:53 -0400)]
Merge branch 'for-chris' of git://git./linux/kernel/git/josef/btrfs-next

11 years agoBtrfs: use a btrfs bioset instead of abusing bio internals
Chris Mason [Fri, 17 May 2013 22:30:14 +0000 (18:30 -0400)]
Btrfs: use a btrfs bioset instead of abusing bio internals

Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.

As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs.  They are also used
to count IO failures on a per device basis.

Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.

This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index.  The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
11 years agoBtrfs: make sure roots are assigned before freeing their nodes
Josef Bacik [Fri, 17 May 2013 18:06:51 +0000 (14:06 -0400)]
Btrfs: make sure roots are assigned before freeing their nodes

If we fail to load the chunk tree we'll call free_root_pointers, except we may
not have assigned the roots for the dev_root/extent_root/csum_root yet, so we
could NULL pointer deref at this point.  Just add checks to make sure these
roots are set to keep us from panicing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: explicitly use global_block_rsv for quota_tree
Stefan Behrens [Thu, 16 May 2013 14:48:19 +0000 (14:48 +0000)]
Btrfs: explicitly use global_block_rsv for quota_tree

The quota_tree was set up to use the empty_block_rsv before
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items) are written. In fact, use_block_rsv() which is used
in btrfs_cow_block() falls back to the global_block_rsv in
this case. But just in order to make it more clear what is
happening, change it to explicitly use the global_block_rsv.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: do away with non-whole_page extent I/O
Alexandre Oliva [Wed, 15 May 2013 15:38:55 +0000 (11:38 -0400)]
btrfs: do away with non-whole_page extent I/O

end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error.  This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.

It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case.  Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!

I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes.  The
warnings should probably just go away some time.

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
Miao Xie [Wed, 15 May 2013 07:48:21 +0000 (07:48 +0000)]
Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context

btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
Miao Xie [Wed, 15 May 2013 07:48:18 +0000 (07:48 +0000)]
Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()

We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: pause the space balance when remounting to R/O
Miao Xie [Wed, 15 May 2013 07:48:17 +0000 (07:48 +0000)]
Btrfs: pause the space balance when remounting to R/O

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix unprotected root node of the subvolume's inode rb-tree
Miao Xie [Wed, 15 May 2013 07:48:16 +0000 (07:48 +0000)]
Btrfs: fix unprotected root node of the subvolume's inode rb-tree

The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix accessing a freed tree root
Miao Xie [Wed, 15 May 2013 07:48:15 +0000 (07:48 +0000)]
Btrfs: fix accessing a freed tree root

inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: return errno if possible when we fail to allocate memory
Liu Bo [Tue, 14 May 2013 02:12:15 +0000 (02:12 +0000)]
Btrfs: return errno if possible when we fail to allocate memory

We need to set return value explicitly, otherwise we'll lose the error
value.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: update the global reserve if it is empty
Miao Xie [Mon, 13 May 2013 13:55:12 +0000 (13:55 +0000)]
Btrfs: update the global reserve if it is empty

Before applying this patch, we reserved the space for the global reserve
by the minimum unit if we found it is empty, it was unreasonable and
inefficient, because if the global reserve space was depleted, it implied
that the size of the global reserve was too small. In this case, we shoud
update the global reserve and fill it.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't steal the reserved space from the global reserve if their space type...
Miao Xie [Mon, 13 May 2013 13:55:11 +0000 (13:55 +0000)]
Btrfs: don't steal the reserved space from the global reserve if their space type is different

If the type of the space we need is different with the global reserve, we
can not steal the space from the global reserve, because we can not allocate
the space from the free space cache that the global reserve points to.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: optimize the error handle of use_block_rsv()
Miao Xie [Mon, 13 May 2013 13:55:10 +0000 (13:55 +0000)]
Btrfs: optimize the error handle of use_block_rsv()

cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't use global block reservation for inode cache truncation
Miao Xie [Mon, 13 May 2013 13:55:09 +0000 (13:55 +0000)]
Btrfs: don't use global block reservation for inode cache truncation

It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't abort the current transaction if there is no enough space for inode...
Miao Xie [Mon, 13 May 2013 13:55:08 +0000 (13:55 +0000)]
Btrfs: don't abort the current transaction if there is no enough space for inode cache

The filesystem with inode cache was forced to be read-only when we umounted it.

Steps to reproduce:
 # mkfs.btrfs -f ${DEV}
 # mount -o inode_cache ${DEV} ${MNT}
 # dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192
 # btrfs fi syn ${MNT}
 # dd if=${MNT}/file1 of=/dev/null bs=1M
 # rm -f ${MNT}/file1
 # btrfs fi syn ${MNT}
 # umount ${MNT}

It is because there was no enough space to do inode cache truncation, and then
we aborted the current transaction.

But no space error is not a serious problem when we write out the inode cache,
and it is safe that we just skip this step if we meet this problem. So we need
not abort the current transaction.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoCorrect allowed raid levels on balance.
Andreas Philipp [Sat, 11 May 2013 11:13:03 +0000 (11:13 +0000)]
Correct allowed raid levels on balance.

Raid5 with 3 devices is well defined while the old logic allowed
raid5 only with a minimum of 4 devices when converting the block group
profile via btrfs balance. Creating a raid5 with just three devices
using mkfs.btrfs worked always as expected. This is now fixed and the
whole logic is rewritten.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix possible memory leak in replace_path()
Stefan Behrens [Wed, 8 May 2013 08:56:09 +0000 (08:56 +0000)]
Btrfs: fix possible memory leak in replace_path()

In replace_path(), if read_tree_block() fails, we cannot return
directly, we should free some allocated memory otherwise memory
leak happens.

Similar to Wang's "Btrfs: fix possible memory leak in the
find_parent_nodes()" patch, the current commit fixes an issue that
is related to the "Btrfs: fix all callers of read_tree_block"
commit.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix possible memory leak in the find_parent_nodes()
Wang Shilong [Wed, 8 May 2013 08:10:25 +0000 (08:10 +0000)]
Btrfs: fix possible memory leak in the find_parent_nodes()

In the find_parent_nodes(), if read_tree_block() fails, we can
not return directly, we should free some allocated memory otherwise
memory leak happens.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't allow device replace on RAID5/RAID6
Stefan Behrens [Tue, 7 May 2013 17:28:03 +0000 (17:28 +0000)]
Btrfs: don't allow device replace on RAID5/RAID6

This is not yet supported and causes crashes. One sad user reported
that it destroyed his filesystem.

One failure is in __btrfs_map_block+0xc1f calling kmalloc(0).

0x5f21f is in __btrfs_map_block (fs/btrfs/volumes.c:4923).
4918                            num_stripes = map->num_stripes;
4919                            max_errors = nr_parity_stripes(map);
4920
4921                            raid_map = kmalloc(sizeof(u64) * num_stripes,
4922                                               GFP_NOFS);
4923                            if (!raid_map) {
4924                                    ret = -ENOMEM;
4925                                    goto out;
4926                            }
4927

There might be more issues. Until this is really tested, don't allow
users to start the procedure on RAID5/RAID6 filesystems.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: handle running extent ops with skinny metadata
Josef Bacik [Thu, 9 May 2013 17:49:30 +0000 (13:49 -0400)]
Btrfs: handle running extent ops with skinny metadata

Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata.  We also lose
the level of the metadata we are working on.  So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: remove warn on in free space cache writeout
Josef Bacik [Wed, 8 May 2013 20:44:57 +0000 (16:44 -0400)]
Btrfs: remove warn on in free space cache writeout

This catches block groups that are too large to properly cache.  We deal with
this case fine, so the warning just confuses users.  Remove the warning.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't null pointer deref on abort
Josef Bacik [Wed, 8 May 2013 17:30:11 +0000 (13:30 -0400)]
Btrfs: don't null pointer deref on abort

I'm sorry, theres no excuse for this sort of work.  We need to use
root->leafsize since eb may be NULL.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: don't stop searching after encountering the wrong item
Gabriel de Perthuis [Mon, 6 May 2013 17:40:18 +0000 (17:40 +0000)]
btrfs: don't stop searching after encountering the wrong item

The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occuring before any search result is
found would trigger an overflow and stop the search entirely.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Cc: stable@vger.kernel.org
Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix off-by-one in fiemap
Liu Bo [Wed, 1 May 2013 16:23:41 +0000 (16:23 +0000)]
Btrfs: fix off-by-one in fiemap

lock_extent/unlock_extent expect an exclusive end.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: annotate quota tree for lockdep
David Sterba [Tue, 30 Apr 2013 17:29:29 +0000 (17:29 +0000)]
btrfs: annotate quota tree for lockdep

Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.

There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID.  No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: allow superblock mismatch from older mkfs
Chris Mason [Tue, 7 May 2013 15:00:13 +0000 (11:00 -0400)]
Btrfs: allow superblock mismatch from older mkfs

We've added new checks to make sure the super block crc is correct
during mount.  A fresh filesystem from an older mkfs won't have the
crc set.  This adds a warning when it finds a newly created filesystem
but doesn't fail the mount.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
11 years agobtrfs: enhance superblock checks
David Sterba [Wed, 6 Mar 2013 14:57:46 +0000 (15:57 +0100)]
btrfs: enhance superblock checks

The superblock checksum is not verified upon mount. <awkward silence>

Add that check and also reorder existing checks to a more logical
order.

Current mkfs.btrfs does not calculate the correct checksum of
super_block and thus a freshly created filesytem will fail to mount when
this patch is applied.

First transaction commit calculates correct superblock checksum and
saves it to disk.

Reproducer:
$ mfks.btrfs /dev/sda
$ mount /dev/sda /mnt
$ btrfs scrub start /mnt
$ sleep 5
$ btrfs scrub status /mnt
... super:2 ...

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
11 years agobtrfs: fix misleading variable name for flags
David Sterba [Mon, 29 Apr 2013 13:39:40 +0000 (13:39 +0000)]
btrfs: fix misleading variable name for flags

The variable was named 'data' in btrfs_reserve_extent and that's the
only function that actually uses it to let btrfs_get_alloc_profile know
what profile we want. Then it's passed down as u64 flags.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: use unsigned long type for extent state bits
David Sterba [Mon, 29 Apr 2013 13:38:46 +0000 (13:38 +0000)]
btrfs: use unsigned long type for extent state bits

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: improve the loop of scrub_stripe
Liu Bo [Sat, 27 Apr 2013 02:56:57 +0000 (02:56 +0000)]
Btrfs: improve the loop of scrub_stripe

1) Right now scrub_stripe() is looping in some unnecessary cases:
* when the found extent item's objectid has been out of the dev extent's range
  but we haven't finish scanning all the range within the dev extent
* when all the items has been processed but we haven't finish scanning all the
  range within the dev extent

In both cases, we can just finish the loop to save costs.

2) Besides, when the found extent item's length is larger than the stripe
len(64k), we don't have to release the path and search again as it'll get at the
same key used in the last loop, we can instead increase the logical cursor in
place till all space of the extent is scanned.

3) And we use 0 as the key's offset to search btree, then get to previous item
to find a smaller item, and again have to move to the next one to get the right
item.  Setting offset=-1 and previous_item() is the correct way.

4) As we won't find any checksum at offset unless this 'offset' is in a data
extent, we can just find checksum when we're really going to scrub an extent.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: read entire device info under lock
David Sterba [Fri, 26 Apr 2013 15:20:23 +0000 (15:20 +0000)]
btrfs: read entire device info under lock

There's a theoretical possibility of reading stale (or even more
theoretically, freed) data from DEV_INFO ioctl when the device would
disappear between an early mutex unlock and data being copied from the
device structure.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: remove unused gfp mask parameter from release_extent_buffer callchain
David Sterba [Fri, 26 Apr 2013 14:56:29 +0000 (14:56 +0000)]
btrfs: remove unused gfp mask parameter from release_extent_buffer callchain

It's unused since 0b32f4bbb423f02ac.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: handle errors returned from get_tree_block_key
David Sterba [Fri, 26 Apr 2013 12:56:04 +0000 (12:56 +0000)]
btrfs: handle errors returned from get_tree_block_key

Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: make static code static & remove dead code
Eric Sandeen [Thu, 25 Apr 2013 20:41:01 +0000 (20:41 +0000)]
btrfs: make static code static & remove dead code

Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: deal with errors in write_dev_supers
Josef Bacik [Mon, 29 Apr 2013 14:05:57 +0000 (10:05 -0400)]
Btrfs: deal with errors in write_dev_supers

If you try to mount -o loop a restored file system it will panic if the file
ends up being smaller than the original disk.  This is because we go to try and
get a block for a super that may be past the EOF which makes __getblk return
NULL for a buffer head when we aren't expecting it to.  Fix this by dealing with
this case and just jacking up the errors count.  With this patch we no longer
panic when mounting a restored file system loopback.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: remove almost all of the BUG()'s from tree-log.c
Josef Bacik [Thu, 25 Apr 2013 20:23:32 +0000 (16:23 -0400)]
Btrfs: remove almost all of the BUG()'s from tree-log.c

There were a whole bunch and I was doing it for other things.  I haven't tested
these error paths but at the very least this is better than panicing.  I've only
left 2 BUG_ON()'s since they are logic errors and I want to replace them with a
ASSERT framework that we can compile out for production users.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: deal with free space cache errors while replaying log
Josef Bacik [Thu, 25 Apr 2013 19:55:30 +0000 (15:55 -0400)]
Btrfs: deal with free space cache errors while replaying log

So everybody who got hit by my fsync bug will still continue to hit this
BUG_ON() in the free space cache, which is pretty heavy handed.  So I took a
file system that had this bug and fixed up all the BUG_ON()'s and leaks that
popped up when I tried to mount a broken file system like this.  With this patch
we just fail to mount instead of panicing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: automatic rescan after "quota enable" command
Jan Schmidt [Thu, 25 Apr 2013 16:04:52 +0000 (16:04 +0000)]
Btrfs: automatic rescan after "quota enable" command

When qgroup tracking is enabled, we do an automatic cycle of the new rescan
mechanism.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: rescan for qgroups
Jan Schmidt [Thu, 25 Apr 2013 16:04:51 +0000 (16:04 +0000)]
Btrfs: rescan for qgroups

If qgroup tracking is out of sync, a rescan operation can be started. It
iterates the complete extent tree and recalculates all qgroup tracking data.
This is an expensive operation and should not be used unless required.

A filesystem under rescan can still be umounted. The rescan continues on the
next mount.  Status information is provided with a separate ioctl while a
rescan operation is in progress.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: split btrfs_qgroup_account_ref into four functions
Jan Schmidt [Thu, 25 Apr 2013 16:04:50 +0000 (16:04 +0000)]
Btrfs: split btrfs_qgroup_account_ref into four functions

The function is separated into a preparation part and the three accounting
steps mentioned in the qgroups documentation. The goal is to make steps two
and three usable by the rescan functionality. A side effect is that the
function is restructured into readable subunits.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: allocate new chunks if the space is not enough for global rsv
Miao Xie [Thu, 25 Apr 2013 10:12:38 +0000 (10:12 +0000)]
Btrfs: allocate new chunks if the space is not enough for global rsv

When running the 208th of xfstests, the fs returned the enospc
error when there was lots of free space in the disk.

By bisect debug, we found it was introduced by commit 96f1bb5777.
This commit makes the space check for the global reservation in
can_overcommit() be inconsistent with should_alloc_chunk().
can_overcommit() requires that the free space is 2 times the size
of the global reservation, or we can't do overcommit. And instead,
we need reclaim some reserved space, and if we still don't have
enough free space, we need allocate a new chunk. But unfortunately,
should_alloc_chunk() just requires that the free space is 1 time
the size of the global reservation, that is we would not try to
allocate a new chunk if the free space size is in the middle of
these two requires, and just return the enospc error. Fix it.

Cc: Jim Schutt <jaschut@sandia.gov>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: separate sequence numbers for delayed ref tracking and tree mod log
Jan Schmidt [Wed, 24 Apr 2013 16:57:33 +0000 (16:57 +0000)]
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log

Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.

However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.

To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.

The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.

Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agobtrfs: move leak debug code to functions
Eric Sandeen [Mon, 22 Apr 2013 16:12:31 +0000 (16:12 +0000)]
btrfs: move leak debug code to functions

Clean up the leak debugging in extent_io.c by moving
the debug code into functions.  This also removes the
list_heads used for debugging from the extent_buffer
and extent_state structures when debug is not enabled.

Since we need a global debug config to do that last
part, implement CONFIG_BTRFS_DEBUG to accommodate.

Thanks to Dave Sterba for the Kconfig bit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: return free space in cow error path
Liu Bo [Mon, 22 Apr 2013 10:53:47 +0000 (10:53 +0000)]
Btrfs: return free space in cow error path

Replace some BUG_ONs with proper handling and take allocated space back to
free space cache for later use.

We don't have to worry about extent maps since they'd be freed in releasepage
path.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: set UUID in root_item for created trees
Stefan Behrens [Fri, 19 Apr 2013 15:08:05 +0000 (15:08 +0000)]
Btrfs: set UUID in root_item for created trees

It is a rare exception that a new tree is created, like the qgroups
tree. So far these new trees have an all-zero UUID in their root
items. All trees that mkfs.btrfs has created get an UUID during the
first mount when btrfs_read_root_item() rewrites the root_item to
the v2 structure style. These UUID are never used so far, but
anyway, since it is better to have it uniform for all trees, this
commit adds some lines that generate and write an UUID for newly
created trees.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: delete unused parameter to btrfs_read_root_item()
Stefan Behrens [Fri, 19 Apr 2013 15:08:04 +0000 (15:08 +0000)]
Btrfs: delete unused parameter to btrfs_read_root_item()

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix error handling in btrfs_ioctl_send()
Tsutomu Itoh [Fri, 19 Apr 2013 01:04:46 +0000 (01:04 +0000)]
Btrfs: fix error handling in btrfs_ioctl_send()

fget() returns NULL if error. So, we should check NULL or not.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: remove unused variable in __process_changed_new_xattr()
Tsutomu Itoh [Thu, 18 Apr 2013 07:10:44 +0000 (07:10 +0000)]
Btrfs: remove unused variable in __process_changed_new_xattr()

Variable 'p' is not used any more. So, remove it.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: various abort cleanups
Josef Bacik [Thu, 25 Apr 2013 17:44:38 +0000 (13:44 -0400)]
Btrfs: various abort cleanups

I have a broken file system that when it aborts leaves all sorts of accounting
things wrong and gives you lots of WARN_ON()'s other than the abort.  This is
because we're not cleaning up various parts of the file system when we abort.
The first chunks are specific to mount failures, we weren't cleaning up the
block group cached inodes and we weren't cleaning up any transactions that had
been aborted, which leaves a bunch of things laying around.

The second half of this are related to the cleanup parts.  First we don't need
to release space for the dirty pages from the trans_block_rsv, that's all
handled by the trans handles so this is just plain wrong.  The other thing is we
need to pin down extents that were set ->must_insert_reserved for delayed refs.
This isn't so much for the pinning but more for the cleaning up the
cache->reserved counter since we are no longer going to use those reserved
bytes.  With this patch I no longer see a bunch of WARN_ON()'s when I try to
mount this broken file system, just the initial one from the abort.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: cleanup destroy_marked_extents
Josef Bacik [Wed, 24 Apr 2013 20:41:19 +0000 (16:41 -0400)]
Btrfs: cleanup destroy_marked_extents

We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: check return value of commit when recovering log
Josef Bacik [Wed, 24 Apr 2013 20:40:05 +0000 (16:40 -0400)]
Btrfs: check return value of commit when recovering log

We need to check the return value of the commit in case something goes wrong,
otherwise we could end up going down the line and doing more stuff (like orphan
cleanup) before we notice we should have errored out.  We need to do this before
we free up the log_tree_root since the caller will handle all of that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't panic if we're trying to drop too many refs
Josef Bacik [Wed, 24 Apr 2013 20:38:50 +0000 (16:38 -0400)]
Btrfs: don't panic if we're trying to drop too many refs

This is just obnoxious.  Just print a message, abort the transaction, and return
an error.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: cleanup fs roots if we fail to mount
Josef Bacik [Wed, 24 Apr 2013 20:35:41 +0000 (16:35 -0400)]
Btrfs: cleanup fs roots if we fail to mount

We can run the tree logging recovery or the orphan cleanup on mount, so we'll
end up looking up a random fs tree in the meantime.  So we need to clean this up
so we don't leave extent buffers hanging around on the cache.  With this patch
we no longer leak extent buffers on failure to mount.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix extent logging with O_DIRECT into prealloc
Josef Bacik [Wed, 24 Apr 2013 20:32:55 +0000 (16:32 -0400)]
Btrfs: fix extent logging with O_DIRECT into prealloc

This is the same as the fix from commit

Btrfs: fix bad extent logging

but for O_DIRECT.  I missed this when I fixed the problem originally, we were
still using the em for the orig_start and orig_block_len, which would be the
merged extent.  We need to use the actual extent from the on disk file extent
item, which we have to lookup to make sure it's ok to nocow anyway so just pass
in some pointers to hold this info.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: fix all callers of read_tree_block
Josef Bacik [Tue, 23 Apr 2013 18:17:42 +0000 (14:17 -0400)]
Btrfs: fix all callers of read_tree_block

We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly.  You need to check
and make sure the extent_buffer is uptodate before you use it.  This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not.  With this we no longer
leak EB's when things go horribly wrong.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: only exclude supers in the range of our block group
Josef Bacik [Tue, 23 Apr 2013 16:55:21 +0000 (12:55 -0400)]
Btrfs: only exclude supers in the range of our block group

If we fail to load block groups halfway through we can leave extent_state's on
the excluded tree.  This is because we just lookup the supers and add them to
the excluded tree regardless of which block group we are looking at currently.
This is a problem because we remove the excluded extents for the range of the
block group only, so if we don't ever load a block group for one of the excluded
extents we won't ever free it.  This fixes the problem by only adding excluded
extents if it falls in the block group range we care about.  With this patch
we're no longer leaking space when we fail to read all of the block groups.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: add tree block level sanity check
Josef Bacik [Tue, 23 Apr 2013 15:30:14 +0000 (11:30 -0400)]
Btrfs: add tree block level sanity check

With a users corrupted fs I was getting weird behavior and panics and it turns
out it was because one of his tree blocks had a bogus header level.  So add this
to the sanity checks in the endio handler for tree blocks.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
11 years agoBtrfs: don't try and free ebs twice in log replay
Josef Bacik [Tue, 23 Apr 2013 15:08:33 +0000 (11:08 -0400)]
Btrfs: don't try and free ebs twice in log replay

This work is done by btrfs_free_path() anyway so there's no need for this
duplicate work.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>