- Mar 01, 2013
-
-
Wang Shilong authored
The original code is a little confusing and not clear. The right way to handle errors in kernel code looks like this: [...] if (ret) goto out; [...] So I moved the common cleanup code to the place labeled out_fail; this will be easier to maintain. Signed-off-by:
Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
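For readers unfamiliar with the pattern this commit message refers to, here is a minimal sketch of the goto-based cleanup style common in kernel code (function names are hypothetical, not the actual btrfs code):
------
#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical helpers, standing in for real work that can fail. */
static int step_one(void *buf) { return 0; }
static int step_two(void *buf) { return -EIO; }

static int do_something(void)
{
        void *buf;
        int ret;

        buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;

        ret = step_one(buf);
        if (ret)
                goto out;       /* jump to the single cleanup point */

        ret = step_two(buf);
        if (ret)
                goto out;

        /* ...more steps, each bailing out via the same label... */
out:
        kfree(buf);             /* common cleanup lives in exactly one place */
        return ret;
}
------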
-
Wang Shilong authored
commit eb6b88d9 introduces another bug. If it is only qgroup_reserve that fails, the function btrfs_qgroup_free should not be called; otherwise it will cause wrong quota accounting. Signed-off-by:
Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
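A hedged sketch of the pairing rule the fix enforces (the btrfs_qgroup_reserve/btrfs_qgroup_free signatures here are approximations of the 3.x-era API, and do_more_work() is a made-up placeholder):
------
/* Only a reservation that actually succeeded may later be freed. */
static int reserve_and_work(struct btrfs_root *root, u64 num_bytes)
{
        int ret;

        ret = btrfs_qgroup_reserve(root, num_bytes);
        if (ret)
                return ret;     /* nothing was reserved, so nothing to free */

        ret = do_more_work();   /* hypothetical follow-up step */
        if (ret)
                btrfs_qgroup_free(root, num_bytes);     /* undo only a successful reserve */

        return ret;
}
------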
-
- Feb 28, 2013
-
-
Miao Xie authored
There are two problems in the space reservation for snapshot/subvolume creation:
- the space for the root item insertion is not reserved
- the space reserved in the qgroup differs from the free space reservation: we need to reserve free space for 7 items, but in the qgroup reservation we only need space for 3 items.
So we implement new metadata reservation functions for the snapshot/subvolume creation. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
- Feb 26, 2013
-
-
Qu Wenruo authored
Though most of the btrfs code uses the ALIGN macro for page alignment, there is still some code using open-coded alignment like the following:
------
u64 mask = ((u64)root->stripesize - 1);
u64 ret = (val + mask) & ~mask;
------
Or even hidden ones:
------
num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------
Sometimes this open-coded alignment is not so easy to understand for a newbie like me. This commit changes the open-coded alignment to the ALIGN macro for better readability. There is also a previous patch from David Sterba with similar changes, but that patch is for the 3.2 kernel and seems not to have been merged. http://www.spinics.net/lists/linux-btrfs/msg12747.html Cc: David Sterba <dave@jikos.cz> Signed-off-by:
Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
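A small user-space demonstration of the transformation (ALIGN is defined here essentially the way the kernel defines it for power-of-two alignments; the values are made up):
------
#include <stdint.h>
#include <stdio.h>

/* Essentially the kernel's definition for power-of-two alignment. */
#define ALIGN(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

int main(void)
{
        uint64_t stripesize = 65536, val = 100000;

        /* open-coded form ... */
        uint64_t mask = stripesize - 1;
        uint64_t old_way = (val + mask) & ~mask;

        /* ... and the equivalent ALIGN() form */
        uint64_t new_way = ALIGN(val, stripesize);

        printf("%llu %llu\n", (unsigned long long)old_way,
               (unsigned long long)new_way);    /* both print 131072 */
        return 0;
}
------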
-
Alexandre Oliva authored
I've experienced filesystem freezes with permanent spikes in the active process count for quite a while, particularly on filesystems whose available raw space has already been fully allocated to chunks. While looking into this, I found a pretty obvious error in do_chunk_alloc: it sets space_info->chunk_alloc, but if btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving that flag set, which causes any other threads waiting for space_info->chunk_alloc to become zero to spin indefinitely.
I haven't double-checked that this patch fully fixes the failure I've observed (it's not exactly trivial to trigger), but it surely is a bug and the fix is trivial, so... please put it in :-)
What I saw in that function also happens to explain why in some cases I see filesystems allocate a huge number of chunks that remain unused (leading to the scenario above, of not having more chunks to allocate). It happens for data and metadata, but not necessarily both. I'm guessing some thread sets the force_alloc flag on the corresponding space_info, and then several threads trying to get disk space end up attempting to allocate a new chunk concurrently. All of them will see the force_alloc flag and bump their local copy of force up to the level they see first, and they won't clear it even if another thread succeeds in allocating a chunk, thus clearing the force flag. Then each thread that observed the force flag will, in its turn, force the allocation of a new chunk. And any threads that come in while it does that will see the force flag still set and pick it up, and so on. This sounds like a problem to me, but... what should the correct behavior be? Clear force_flag once we copy it to a local force? Reset force to the incoming value on every loop? Set the flag to our incoming force if we have it at first, clear our local flag, and move it from the space_info when we determine that we are the thread that's going to perform the allocation?
btrfs: clear chunk_alloc flag on retryable failure
From: Alexandre Oliva <oliva@gnu.org>
If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc without clearing chunk_alloc in space_info. As a result, any further calls to do_chunk_alloc on that filesystem will start busy-waiting for chunk_alloc to be cleared, but it never will be. This patch adjusts do_chunk_alloc so that it clears this flag in case of an error. Signed-off-by:
Alexandre Oliva <oliva@gnu.org> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
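A hedged sketch of the shape of the fix: clear the in-progress flag on every exit path, not only on success or -ENOSPC. The structure and names are trimmed-down stand-ins, not the real do_chunk_alloc code:
------
#include <linux/spinlock.h>
#include <linux/errno.h>

/* Trimmed-down stand-in for btrfs_space_info: only what the sketch needs. */
struct space_info_sketch {
        spinlock_t lock;
        int chunk_alloc;        /* nonzero while a chunk allocation runs */
};

static int chunk_alloc_sketch(struct space_info_sketch *info,
                              int (*alloc_chunk)(void))
{
        int ret;

        spin_lock(&info->lock);
        info->chunk_alloc = 1;
        spin_unlock(&info->lock);

        ret = alloc_chunk();    /* may fail with e.g. -ENOMEM */

        spin_lock(&info->lock);
        info->chunk_alloc = 0;  /* always let waiting threads proceed */
        spin_unlock(&info->lock);
        return ret;
}
------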
-
- Feb 20, 2013
-
-
Zach Brown authored
Very large fallocate requests are cpu bound and result in extents with a repeating pattern of ever decreasing size:
$ time fallocate -l 1T file
real 0m13.039s
(an excerpt of the extents from btrfs-debug-tree:)
prealloc data disk byte 1536292564992 nr 397312
prealloc data disk byte 1536292962304 nr 196608
prealloc data disk byte 1536293158912 nr 98304
prealloc data disk byte 1536293257216 nr 49152
prealloc data disk byte 1536293306368 nr 24576
prealloc data disk byte 1536293330944 nr 12288
prealloc data disk byte 1536293343232 nr 8192
prealloc data disk byte 1536293351424 nr 4096
prealloc data disk byte 1536293355520 nr 4096
prealloc data disk byte 1536293359616 nr 4096
The excessive cpu use comes from __btrfs_prealloc_file_range() trying to allocate the entire remaining size after each extent is allocated. btrfs_reserve_extent() repeatedly cuts this requested size in half until it gets down to the size that the allocators can return. We limit the problem for now by capping each reservation at 256 meg.
The small extents come from a masking bug when decreasing the requested reservation size. The high 32bits are cleared and the remaining low bits might happen to reserve a small size. Fix this by using round_down() which properly casts the mask.
After these fixes huge fallocate requests are fast and result in nice large extents:
$ time fallocate -l 1T file
real 0m0.082s
prealloc data disk byte 1112425889792 nr 268435456
prealloc data disk byte 1112694325248 nr 268435456
prealloc data disk byte 1112962760704 nr 268435456
Reported-by:
Eric Sandeen <sandeen@redhat.com> Signed-off-by:
Zach Brown <zab@redhat.com> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
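A small user-space illustration of the masking bug described above: when the mask is a 32-bit type, ~mask zero-extends to 64 bits with the high half cleared, so the AND wipes out the upper 32 bits of the request. Keeping the mask in the full 64-bit type (what round_down() effectively does through its typeof cast) avoids that. The values here are made up:
------
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t num_bytes = (uint64_t)1 << 40;         /* 1 TiB request */
        unsigned int mask32 = (256 * 1024 * 1024) - 1;  /* 256 MiB - 1, 32-bit */

        /* Buggy: ~mask32 is a 32-bit value; zero-extension to 64 bits
         * clears the high 32 bits of num_bytes, leaving a tiny request. */
        uint64_t buggy = num_bytes & ~mask32;

        /* Fixed: invert the mask in the same 64-bit type. */
        uint64_t fixed = num_bytes & ~((uint64_t)mask32);

        printf("buggy=%llu fixed=%llu\n",
               (unsigned long long)buggy, (unsigned long long)fixed);
        /* buggy=0, fixed=1099511627776 */
        return 0;
}
------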
-
David Sterba authored
The warning in use_block_rsv is not useful for users and may fill the logs unnecessarily. Signed-off-by:
David Sterba <dsterba@suse.cz> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
The deadlock problem happened when running fsstress (a test program in LTP). Steps to reproduce:
# mkfs.btrfs -b 100M <partition>
# mount <partition> <mnt>
# <Path>/fsstress -p 3 -n 10000000 -d <mnt>
The reason is:
btrfs_direct_IO()
 |->do_direct_IO()
    |->get_page()
    |->get_blocks()
    |   |->btrfs_delalloc_reserve_space()
    |   |->btrfs_add_ordered_extent() ------- Add a new ordered extent
    |->dio_send_cur_page(page0) -------------- We didn't submit bio here
    |->get_page()
    |->get_blocks()
        |->btrfs_delalloc_reserve_space()
            |->flush_space()
                |->btrfs_start_ordered_extent()
                    |->wait_event() ---------- Wait for the completion of the ordered extent that is mentioned above
But because we didn't submit the bio mentioned above, the ordered extent cannot complete, so we would wait for its completion forever. There are two methods which can fix this deadlock problem:
1. submit the bio before we invoke get_blocks()
2. reserve the space before we do dio
Though the 1st is the simplest way, we would need to modify the VFS code, it is likely to break contiguous requests, and it would introduce a performance regression for the other filesystems. So we have to choose the 2nd way. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Sometimes xfstest 83 will fail to remount the scratch device because we've gotten ourselves so full that we cannot clean up the orphan items. In this case, check to see if we're doing the orphan cleanup and, if we are, allow us to steal our reservation from the global block rsv. With this patch I've not been able to reproduce the failed mount problem. Thanks, Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
People have been complaining about random ENOSPC errors that will clear up after a umount or just a given amount of time. Chris was able to reproduce this with stress.sh and lots of processes and so was I. Basically the overcommit stuff would really let us get out of hand, in my tests I saw up to 30 gigs of outstanding reservations with only 2 gigs total of metadata space. This usually worked out fine but with so much outstanding reservation the flushing stuff short circuits to make sure we don't hang forever flushing when we really need ENOSPC. Plus we allocate chunks in order to alleviate the pressure, but this doesn't actually help us since we only use the non-allocated area in our over commit logic. So instead of basing overcommit on the amount of non-allocated space, instead just do it based on how much total space we have, and then limit it to the non-allocated space in case we are short on space to spill over into. This allows us to have the same performance as well as no longer giving random ENOSPC. Thanks, Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
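An illustrative sketch of the revised overcommit bound described above; the slack factor and the exact accounting terms are assumptions for the example, not btrfs's real can_overcommit() policy:
------
#include <linux/types.h>

static bool can_overcommit_sketch(u64 total_bytes, u64 used,
                                  u64 unallocated, u64 request)
{
        u64 slack = total_bytes / 2;    /* window derived from total space */

        if (slack > unallocated)
                slack = unallocated;    /* nothing left to spill over into */

        return used + request <= total_bytes + slack;
}
------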
-
Eric Sandeen authored
No point in DEFINE_WAIT(wait) if it's not used! Signed-off-by:
Eric Sandeen <sandeen@redhat.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Because of how rarely we allocate chunks now, we can get really tight on metadata space before we will allocate a new chunk. This resulted in being unable to add device extents when allocating a new metadata chunk, as we did not have enough space. This is because we were allowed to overcommit too much metadata without actually making sure we had enough space to make allocations. The idea behind overcommit is that we are allowed to say "sure you can have that reservation" when most of the free space is occupied by reservations, not actual allocations. But in this case, where a majority of the total space is in use by actual allocations, we can screw ourselves by not being able to make real allocations when it matters. So make sure we have enough real space for our global reserve, and if not then don't allow overcommitting. Thanks, Reported-and-tested-by:
Jim Schutt <jaschut@sandia.gov> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
There is no lock to protect fs_info->avail_{data, metadata, system}_alloc_bits, which may introduce problems such as wrong profile information, so we add a seqlock to protect them. Signed-off-by:
Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
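For readers unfamiliar with seqlocks, a minimal sketch of the protection pattern being introduced (the container and field layout are illustrative, not the actual fs_info; it assumes seqlock_init() was called on the lock):
------
#include <linux/seqlock.h>
#include <linux/types.h>

struct alloc_bits_sketch {
        seqlock_t lock;
        u64 avail_data_alloc_bits;
};

/* Writer: updates happen under write_seqlock(). */
static void set_bits_sketch(struct alloc_bits_sketch *s, u64 bits)
{
        write_seqlock(&s->lock);
        s->avail_data_alloc_bits |= bits;
        write_sequnlock(&s->lock);
}

/* Reader: retries if a writer raced with us, so we never see a torn value. */
static u64 get_bits_sketch(struct alloc_bits_sketch *s)
{
        unsigned int seq;
        u64 bits;

        do {
                seq = read_seqbegin(&s->lock);
                bits = s->avail_data_alloc_bits;
        } while (read_seqretry(&s->lock, seq));

        return bits;
}
------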
-
Miao Xie authored
fs_info->delalloc_bytes is accessed very frequently, so use a percpu counter instead of a plain u64 for it to reduce lock contention. This patch also fixes the problem that we accessed the variable without lock protection. At worst, we would not flush the delalloc inodes and would just return an ENOSPC error when we still had some free space in the fs. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
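A minimal sketch of the percpu counter pattern (the counter name is illustrative; note that in kernels of this vintage percpu_counter_init() took two arguments, while later kernels add a GFP flag):
------
#include <linux/percpu_counter.h>

static struct percpu_counter delalloc_bytes_sketch;

static int sketch_init(void)
{
        /* Later kernels: percpu_counter_init(&c, 0, GFP_KERNEL). */
        return percpu_counter_init(&delalloc_bytes_sketch, 0);
}

static void sketch_add(s64 bytes)
{
        /* Cheap update that mostly touches only the local CPU's counter. */
        percpu_counter_add(&delalloc_bytes_sketch, bytes);
}

static s64 sketch_read(void)
{
        /* Accurate sum across all CPUs; more expensive, used when it matters. */
        return percpu_counter_sum(&delalloc_bytes_sketch);
}
------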
-
Miao Xie authored
The current code of the raid attr array is hard to understand, and it is easy to introduce problems if we modify the array. So I changed it and made it more readable. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
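A hedged sketch of the table-driven approach: the enum, field set, and values below are simplified guesses for illustration, not the real btrfs_raid_attr array.
------
#include <linux/types.h>

enum raid_index_sketch { RAID_RAID10, RAID_RAID1, RAID_DUP, RAID_RAID0,
                         RAID_SINGLE, RAID_NR };

struct raid_attr_sketch {
        int devs_min;   /* minimum number of devices */
        int ncopies;    /* copies of each block */
};

static const struct raid_attr_sketch raid_attr[RAID_NR] = {
        [RAID_RAID10]   = { .devs_min = 4, .ncopies = 2 },
        [RAID_RAID1]    = { .devs_min = 2, .ncopies = 2 },
        [RAID_DUP]      = { .devs_min = 1, .ncopies = 2 },
        [RAID_RAID0]    = { .devs_min = 2, .ncopies = 1 },
        [RAID_SINGLE]   = { .devs_min = 1, .ncopies = 1 },
};

/* Lookups become a simple index instead of a chain of if/else checks. */
static int min_devs_sketch(enum raid_index_sketch type)
{
        return raid_attr[type].devs_min;
}
------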
-
Liu Bo authored
This'd save us an rbtree search, which may become expensive in a large filesystem. Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Liu Bo authored
Argument 'trans' is not used any more. Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Liu Bo authored
Argument 'trans' is not used any more. Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Liu Bo authored
Argument 'trans' and 'root' are not used any more. Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Liu Bo authored
commit d53ba474 (Btrfs: use commit root when loading free space cache) has removed the deadlock check, and the related comments can be removed as well. Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Reviewed-by:
David Sterba <dsterba@suse.cz> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
If we start running low on metadata space we will try to allocate a chunk, which could then try to allocate a chunk to add the device entry. The thing is we allocate a chunk before we try really hard to make the allocation, so we should be able to find space for the device entry. Add a flag to the trans handle so we know we're currently allocating a chunk so we can just bail out if we try to allocate another chunk. Thanks, Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
We may try to flush some dirty pages when there is not enough space to reserve. But it is possible that this operation fails; in order to get enough space to reserve successfully, we will sync all the delalloc files. This operation is safe; we needn't worry about the case where the filesystem goes from r/w to r/o, because the filesystem should guarantee that all the dirty pages have been written to disk after it becomes read-only, so the sync operation will do nothing if the filesystem is already read-only. Though it may waste lots of time, as a corner case, we needn't care. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
Locking and unlocking of the delayed ref mutex happen in different functions, and the names of the lock functions are not uniform, so readability is not so good. This patch optimizes the lock logic and makes it more readable. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
The delayed reference allocation is in the fast path of the IO, so use slabs to improve the speed of the allocation. Besides that, it allows checking for leaked objects when the module is removed. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com>
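A minimal sketch of the dedicated slab-cache pattern (the structure, cache name, and flags are illustrative, not the btrfs definitions):
------
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/errno.h>

struct delayed_ref_sketch {
        u64 bytenr;
        int ref_mod;
};

static struct kmem_cache *delayed_ref_cache;

static int sketch_init(void)
{
        delayed_ref_cache = kmem_cache_create("delayed_ref_sketch",
                                              sizeof(struct delayed_ref_sketch),
                                              0, SLAB_MEM_SPREAD, NULL);
        return delayed_ref_cache ? 0 : -ENOMEM;
}

/* Fast-path allocation and free from the dedicated slab. */
static struct delayed_ref_sketch *alloc_ref(void)
{
        return kmem_cache_alloc(delayed_ref_cache, GFP_NOFS);
}

static void free_ref(struct delayed_ref_sketch *ref)
{
        kmem_cache_free(delayed_ref_cache, ref);
}

static void sketch_exit(void)
{
        /* Destroying a cache that still holds live objects warns loudly,
         * which is the leak check mentioned above. */
        kmem_cache_destroy(delayed_ref_cache);
}
------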
-
- Feb 06, 2013
-
-
Jan Schmidt authored
When btrfs_qgroup_reserve returned a failure, we were missing a counter operation for BTRFS_I(inode)->outstanding_extents++, leading to warning messages about outstanding extents and space_info->bytes_may_use != 0. Additionally, the error handling code didn't take into account that we dropped the inode lock which might require more cleanup. Luckily, all the cleanup code we need is already there and can be shared with reserve_metadata_bytes, which is exactly what this patch does. Reported-by:
Lev Vainblat <lev@zadarastorage.com> Signed-off-by:
Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
- Feb 01, 2013
-
-
Chris Mason authored
We batch up operations to the extent allocation tree, which allows us to deal with the recursive nature of using the extent allocation tree to allocate extents to the extent allocation tree. It also provides a mechanism to sort and collect extent operations, which makes it much more efficient to record extents that are close together. The delayed extent operations must all be finished before the running transaction commits, so we have code to make sure and run a few of the batched operations when closing our transaction handles. This creates a great deal of contention for the locks in the delayed extent operation tree, and also contention for the lock on the extent allocation tree itself. All the extra contention just slows down the operations and doesn't get things done any faster. This commit changes things to use a wait queue instead. As procs want to run the delayed operations, one of them races in and gets permission to hit the tree, and the others step back and wait for progress to be made. Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
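An illustrative sketch of the wait-queue handoff described above; the names are made up and the flag would be protected by a lock in real code, so treat this as the general pattern rather than the btrfs implementation:
------
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(ref_run_wait);
static int ref_run_in_progress; /* protected by a lock in real code */

static void run_delayed_ops_sketch(void)
{
        if (ref_run_in_progress) {
                /* Someone else already won the race: wait for them to make
                 * progress instead of piling onto the same tree locks. */
                wait_event(ref_run_wait, !ref_run_in_progress);
                return;
        }

        ref_run_in_progress = 1;
        /* ... process a batch of delayed extent operations ... */
        ref_run_in_progress = 0;
        wake_up(&ref_run_wait);
}
------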
-
Chris Mason authored
With the new raid56 code, we want to make sure we're properly aligning our allocation clusters with -o ssd Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
David Woodhouse authored
This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
- Jan 14, 2013
-
-
Liu Bo authored
We forgot to reset the path lock state to zero after we unlock the path block, and this can trigger the ASSERT checker in the tree unlock API. Reported-by:
Slava Barinov <rayslava@gmail.com> Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Liu Bo authored
This avoids empty looping. Say we have only one disk: the metadata raid type will default to DUP, and we do not need to start from index=0 (RAID10) and go through two empty loops to reach index=2 (DUP). Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
We still need to say we're flushing if we're limit flushing to keep somebody from coming in and stealing our reservation. Thanks, Signed-off-by:
Josef Bacik <jbacik@fusionio.com>
-
- Jan 12, 2013
-
-
Miao Xie authored
writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read() with down_read_trylock() because:
- If ->s_umount is write locked, then the sb is not idle. That is, writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.
- writeback_inodes_sb(_nr)_if_idle() grabs the s_umount lock when it wants to start writeback, which may cause a deadlock when doing umount.
In order to fix the problem, ext4 and btrfs implemented their own writeback functions instead of writeback_inodes_sb(_nr)_if_idle(), but that introduced redundant code; it is better to implement a new writeback_inodes_sb(_nr)_if_idle(). The names of these two functions are cumbersome, so rename them to try_to_writeback_inodes_sb(_nr). This idea came from Christoph Hellwig. Some code is from the patch of Kamal Mostafa. Reviewed-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Fengguang Wu <fengguang.wu@intel.com>
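A hedged sketch of the reworked helper's shape (the real try_to_writeback_inodes_sb() lives in fs/fs-writeback.c; this is a simplified illustration, not its exact body or return convention):
------
#include <linux/fs.h>
#include <linux/writeback.h>

/* Skip writeback entirely if s_umount is write-locked (e.g. during umount)
 * instead of sleeping on it. */
static bool try_to_writeback_sketch(struct super_block *sb,
                                    unsigned long nr,
                                    enum wb_reason reason)
{
        if (!down_read_trylock(&sb->s_umount))
                return false;   /* sb busy or being unmounted: don't block */

        writeback_inodes_sb_nr(sb, nr, reason);
        up_read(&sb->s_umount);
        return true;
}
------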
-
- Jan 09, 2013
-
-
Liu Bo authored
Convert 'hepler' to 'helper'. Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Jiri Kosina <jkosina@suse.cz>
-
- Dec 17, 2012
-
-
Josef Bacik authored
This confuses and angers lockdep even though it's ok. We don't really need the lock for free space inodes since only the transaction committer will be reserving space. Thanks, Signed-off-by:
Josef Bacik <jbacik@fusionio.com> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
Josef Bacik authored
This happens because writeback_inodes_sb_nr_if_idle does down_read. This doesn't work for us and it has not been fixed upstream yet, so do it ourselves and use that instead so we can stop having this stupid long standing lockup. Thanks, Signed-off-by:
Josef Bacik <jbacik@fusionio.com> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
Liu Bo authored
Raid properties can be shared among raid calculation code, we can put them into a global table to keep it simple. Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
Miao Xie authored
We forgot to release the reserved space in the error path of the delalloc reservation; fix it. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
- Dec 12, 2012
-
-
Stefan Behrens authored
This patch adds some code to disallow operations on the device that is used as the target for the device replace operation. Signed-off-by:
Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
Stefan Behrens authored
This is required for the device replace procedure in a later step. Two calling functions also had to be changed to have the fs_info pointer: repair_io_failure() and scrub_setup_recheck_block(). Signed-off-by:
Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-
Liu Bo authored
When committing a transaction, we may bail out of running delayed refs due to ENOSPC, and then abort the current transaction to flip into readonly. But we'll hit a deadlock on the ref head's lock since we forgot to release its lock and do other cleanup. Signed-off-by:
Liu Bo <bo.li.liu@oracle.com> Signed-off-by:
Chris Mason <chris.mason@fusionio.com>
-