  1. Feb 26, 2013
    • btrfs: cleanup for open-coded alignment · fda2832f
      Qu Wenruo authored
      Though most of the btrfs code uses the ALIGN macro for page alignment,
      there is still some code using open-coded alignment like the
      following:
      ------
              u64 mask = ((u64)root->stripesize - 1);
              u64 ret = (val + mask) & ~mask;
      ------
      Or even a hidden one:
      ------
              num_bytes = (end - start + blocksize) & ~(blocksize - 1);
      ------
      
      These open-coded alignments are not always easy to understand for
      newcomers like me.
      
      This commit changes the open-coded alignment to the ALIGN macro for
      better readability.
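
      For illustration, a sketch of the conversion (ALIGN(x, a) rounds x
      up to the next multiple of a, where a must be a power of two):
      ------
              /* before: open-coded round-up to a power-of-two boundary */
              u64 mask = ((u64)root->stripesize - 1);
              u64 ret = (val + mask) & ~mask;

              /* after: the same result, with the intent stated */
              u64 ret = ALIGN(val, root->stripesize);
      ------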
      
      There is also a previous patch from David Sterba with similar changes,
      but that patch was against the 3.2 kernel and seems not to have been
      merged.
      http://www.spinics.net/lists/linux-btrfs/msg12747.html
      
      
      
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • clear chunk_alloc flag on retryable failure · a81cb9a2
      Alexandre Oliva authored
      
      I've experienced filesystem freezes with permanent spikes in the active
      process count for quite a while, particularly on filesystems whose
      available raw space has already been fully allocated to chunks.
      
      While looking into this, I found a pretty obvious error in
      do_chunk_alloc: it sets space_info->chunk_alloc, but if
      btrfs_alloc_chunk returns an error other than ENOSPC, it returns with
      that flag still set, which causes any other threads waiting for
      space_info->chunk_alloc to become zero to spin indefinitely.
      
      I haven't fully verified that this patch fixes the failure I've
      observed (it's not exactly trivial to trigger), but it surely is a bug
      and the fix is trivial, so...  Please put it in :-)
      
      What I saw in that function also happens to explain why in some cases I
      see filesystems allocate a huge number of chunks that remain unused
      (leading to the scenario above, of not having more chunks to allocate).
      It happens for data and metadata, but not necessarily both.  I'm
      guessing some thread sets the force_alloc flag on the corresponding
      space_info, and then several threads trying to get disk space end up
      attempting to allocate a new chunk concurrently.  All of them will see
      the force_alloc flag and bump their local copy of force up to the level
      they see first, and they won't clear it even if another thread succeeds
      in allocating a chunk, thus clearing the force flag.  Then each thread
      that observed the force flag will, on its turn, force the allocation of
      a new chunk.  And any threads that come in while it does that will see
      the force flag still set and pick it up, and so on.  This sounds like a
      problem to me, but...  what should the correct behavior be?  Clear
      force_alloc once we copy it to a local force?  Reset force to the
      incoming value on every loop?  Set the flag from our incoming force if
      we have it at first, clear our local flag, and take it from the
      space_info once we have determined that we are the thread that's going
      to perform the allocation?
      
      btrfs: clear chunk_alloc flag on retryable failure
      
      From: Alexandre Oliva <oliva@gnu.org>
      
      If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
      without clearing chunk_alloc in space_info.  As a result, any further
      calls to do_chunk_alloc on that filesystem will start busy-waiting for
      chunk_alloc to be cleared, but it never will be.  This patch adjusts
      do_chunk_alloc so that it clears this flag in case of an error.
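
      The shape of the fix, roughly (a sketch close to the actual diff:
      route every exit through a label that clears the flag, instead of
      returning early with it still set):

              spin_lock(&space_info->lock);
              if (ret < 0 && ret != -ENOSPC)
                      goto out;       /* previously returned here, leaving
                                         chunk_alloc set and waiters
                                         spinning */
              if (ret)
                      space_info->full = 1;
              else
                      ret = 1;
              space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
      out:
              space_info->chunk_alloc = 0;
              spin_unlock(&space_info->lock);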
      
      Signed-off-by: Alexandre Oliva <oliva@gnu.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  2. Feb 20, 2013
    • btrfs: limit fallocate extent reservation to 256MB · 24542bf7
      Zach Brown authored
      
      Very large fallocate requests are CPU bound and result in extents with
      a repeating pattern of ever-decreasing size:
      
      $ time fallocate -l 1T file
      real	0m13.039s
      
      ( an excerpt of the extents from btrfs-debug-tree: )
        prealloc data disk byte 1536292564992 nr 397312
        prealloc data disk byte 1536292962304 nr 196608
        prealloc data disk byte 1536293158912 nr 98304
        prealloc data disk byte 1536293257216 nr 49152
        prealloc data disk byte 1536293306368 nr 24576
        prealloc data disk byte 1536293330944 nr 12288
        prealloc data disk byte 1536293343232 nr 8192
        prealloc data disk byte 1536293351424 nr 4096
        prealloc data disk byte 1536293355520 nr 4096
        prealloc data disk byte 1536293359616 nr 4096
      
      The excessive CPU use comes from __btrfs_prealloc_file_range() trying
      to allocate the entire remaining size after each extent is allocated.
      btrfs_reserve_extent() repeatedly cuts this requested size in half
      until it gets down to a size the allocators can return.  We limit the
      problem for now by capping each reservation at 256MB.
      
      The small extents come from a masking bug when decreasing the
      requested reservation size.  The high 32 bits are cleared and the
      remaining low bits might happen to reserve a small size.  Fix this by
      using round_down(), which properly casts the mask.
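
      The pitfall is easy to reproduce in isolation (a self-contained
      sketch; the kernel's round_down() casts the mask to the type of its
      first argument before complementing it):

              u64 size = 0x100003000ULL;      /* a >4GiB request */
              u32 blocksize = 4096;

              /* buggy: ~(blocksize - 1) is a 32-bit value, so the & also
               * clears the high 32 bits of size -- result is 0x3000 */
              u64 bad  = size & ~(blocksize - 1);

              /* round_down() widens the mask first -- result is
               * 0x100003000 */
              u64 good = round_down(size, blocksize);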
      
      After these fixes huge fallocate requests are fast and result in nice
      large extents:
      
      $ time fallocate -l 1T file
      real	0m0.082s
      
        prealloc data disk byte 1112425889792 nr 268435456
        prealloc data disk byte 1112694325248 nr 268435456
        prealloc data disk byte 1112962760704 nr 268435456
      
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • btrfs: put some enospc messages under enospc_debug · b069e0c3
      David Sterba authored
      
      The warning in use_block_rsv is not useful for users and may fill
      the logs unnecessarily.
      
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix deadlock due to unsubmitted · 0934856d
      Miao Xie authored
      
      The deadlock problem happened when running fsstress (a test program in
      LTP).
      
      Steps to reproduce:
       # mkfs.btrfs -b 100M <partition>
       # mount <partition> <mnt>
       # <Path>/fsstress -p 3 -n 10000000 -d <mnt>
      
      The reason is:
      btrfs_direct_IO()
       |->do_direct_IO()
           |->get_page()
           |->get_blocks()
           |    |->btrfs_delalloc_reserve_space()
           |    |->btrfs_add_ordered_extent() -------  Add a new ordered extent
           |->dio_send_cur_page(page0) --------------  We didn't submit bio here
           |->get_page()
           |->get_blocks()
                |->btrfs_delalloc_reserve_space()
                    |->flush_space()
                        |->btrfs_start_ordered_extent()
                            |->wait_event() ----------  Wait for the completion
                                                        of the ordered extent
                                                        mentioned above
      
      But because we didn't submit the bio mentioned above, the ordered
      extent cannot complete, so we would wait for its completion forever.
      
      There are two methods which can fix this deadlock problem:
      1. submit the bio before we invoke get_blocks()
      2. reserve the space before we do dio
      
      Though the 1st is the simplest way, it requires modifying VFS code, is
      likely to break contiguous requests, and would introduce performance
      regressions for the other filesystems.
      
      So we have to choose the 2nd way.
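
      In outline, the 2nd way moves the reservation in front of the dio
      machinery (a sketch of the shape, not the exact diff):

              /* in btrfs_direct_IO(): reserve the whole range up front, so
               * get_blocks() never has to flush and then wait on an
               * ordered extent whose bio we are still holding back */
              ret = btrfs_delalloc_reserve_space(inode, count);
              if (ret)
                      return ret;
              ret = __blockdev_direct_IO(...);
              if (ret < 0)
                      btrfs_delalloc_release_space(inode, count);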
      
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: steal from global reserve if we are cleaning up orphans · 5d80366e
      Josef Bacik authored
      
      Sometimes xfstest 83 will fail to remount the scratch device because
      we've gotten ourselves so full that we cannot clean up the orphan
      items.  In this case, check whether we're doing the orphan cleanup
      and, if we are, allow the reservation to be stolen from the global
      block rsv.  With this patch I've not been able to reproduce the
      failed-mount problem.  Thanks,
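
      The check amounts to something like this (an illustrative sketch,
      not the exact diff):

              /* while the orphan cleanup is running, let the reservation
               * fall back to the global block rsv instead of failing */
              if (root->orphan_cleanup_state == ORPHAN_CLEANUP_STARTED)
                      block_rsv = &root->fs_info->global_block_rsv;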
      
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: rework the overcommit logic to be based on the total size · 70afa399
      Josef Bacik authored
      
      People have been complaining about random ENOSPC errors that clear up
      after a umount or just a given amount of time.  Chris was able to
      reproduce this with stress.sh and lots of processes, and so was I.
      Basically the overcommit stuff would really let us get out of hand;
      in my tests I saw up to 30 gigs of outstanding reservations with only
      2 gigs total of metadata space.  This usually worked out fine, but
      with so much outstanding reservation the flushing stuff
      short-circuits to make sure we don't hang forever flushing when we
      really need ENOSPC.  Plus we allocate chunks in order to alleviate
      the pressure, but this doesn't actually help us since we only use the
      non-allocated area in our overcommit logic.
      
      So instead of basing overcommit on the amount of non-allocated space,
      just base it on how much total space we have, and then limit it to
      the non-allocated space in case we are short on space to spill over
      into.  This gives us the same performance while no longer producing
      random ENOSPC errors.  Thanks,
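
      A sketch of the reworked check (illustrative names and limits, not
      the exact kernel code):

              /* size the allowance from the total space, but cap it at
               * what is still unallocated, since that is all a new chunk
               * could ever add */
              u64 allowance = min(total_bytes >> 1, unallocated_bytes);
              if (used + bytes_wanted <= total_bytes + allowance)
                      return 1;       /* reservation permitted */
              return 0;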
      
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs: remove unnecessary DEFINE_WAIT() declarations · 1971e917
      Eric Sandeen authored
      
      No point in DEFINE_WAIT(wait) if it's not used!
      
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: do not overcommit if we don't have enough space for global rsv · 96f1bb57
      Josef Bacik authored
      
      Because of how rarely we allocate chunks now, we can get really tight
      on metadata space before we will allocate a new chunk.  This resulted
      in being unable to add device extents when allocating a new metadata
      chunk, as we did not have enough space.  This is because we were
      allowed to overcommit too much metadata without actually making sure
      we had enough space to make allocations.  The idea behind overcommit
      is that we are allowed to say "sure, you can have that reservation"
      when most of the free space is occupied by reservations, not actual
      allocations.  But in this case, where a majority of the total space
      is in use by actual allocations, we can screw ourselves by not being
      able to make real allocations when it matters.  So make sure we have
      enough real space for our global reserve, and if not, don't allow
      overcommitting.  Thanks,
      
      Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits · de98ced9
      Miao Xie authored
      
      There is no lock protecting
        fs_info->avail_{data, metadata, system}_alloc_bits;
      this may introduce problems, such as reporting the wrong profile
      information, so we add a seqlock to protect them.
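
      The usual seqlock pattern applies (an illustrative sketch, assuming
      the new seqlock is named profiles_lock): the writer publishes a
      consistent snapshot, and readers retry if a write raced in.

              write_seqlock(&fs_info->profiles_lock);
              fs_info->avail_data_alloc_bits |= extra_flags;
              write_sequnlock(&fs_info->profiles_lock);

              unsigned int seq;
              u64 flags;
              do {
                      seq = read_seqbegin(&fs_info->profiles_lock);
                      flags = fs_info->avail_data_alloc_bits;
              } while (read_seqretry(&fs_info->profiles_lock, seq));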
      
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: use percpu counter for fs_info->delalloc_bytes · 963d678b
      Miao Xie authored
      
      fs_info->delalloc_bytes is accessed very frequently, so use a percpu
      counter instead of a plain u64 for it to reduce the lock contention.

      This patch also fixes a problem where we accessed the variable
      without lock protection.  At worst, we would fail to flush the
      delalloc inodes and just return an ENOSPC error even though we still
      had some free space in the fs.
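
      Usage follows the standard percpu_counter API (an illustrative
      sketch; writeback_delalloc_inodes() is a hypothetical helper):

              /* hot path: cheap, per-cpu update */
              percpu_counter_add(&fs_info->delalloc_bytes, len);

              /* slow path: fold the per-cpu parts only where an accurate
               * total matters */
              if (percpu_counter_sum_positive(&fs_info->delalloc_bytes))
                      writeback_delalloc_inodes();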
      
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: make raid attr array more readable · e6ec716f
      Miao Xie authored
      
      The current raid attr array code is hard to understand, and it is
      easy to introduce problems when modifying the array.  So I changed it
      and made it more readable.
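
      The readable form is a struct array indexed by raid type, using
      designated initializers (a sketch with illustrative fields):

              struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
                      [BTRFS_RAID_RAID10] = { .sub_stripes = 2, .devs_min = 4,
                                              .devs_increment = 2, .ncopies = 2 },
                      [BTRFS_RAID_RAID1]  = { .sub_stripes = 1, .devs_min = 2,
                                              .devs_increment = 2, .ncopies = 2 },
                      [BTRFS_RAID_SINGLE] = { .sub_stripes = 1, .devs_min = 1,
                                              .devs_increment = 1, .ncopies = 1 },
              };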
      
      Cc: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: record first logical byte in memory · a1897fdd
      Liu Bo authored
      
      This saves us an rbtree search, which may become expensive on a large
      filesystem.
      
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay · dcfac415
      Liu Bo authored
      
      Argument 'trans' is not used any more.
      
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: kill unused argument of update_block_group · c53d613e
      Liu Bo authored
      
      Argument 'trans' is not used any more.
      
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: kill unused arguments of cache_block_group · f6373bf3
      Liu Bo authored
      
      Arguments 'trans' and 'root' are not used any more.
      
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove deprecated comments · 17b85495
      Liu Bo authored
      
      commit d53ba474
      (Btrfs: use commit root when loading free space cache) removed the
      deadlock check, so the related comments can be removed as well.
      
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: don't re-enter when allocating a chunk · c6b305a8
      Josef Bacik authored
      
      If we start running low on metadata space we will try to allocate a
      chunk, which could in turn try to allocate a chunk to add the device
      entry.  The thing is, we allocate a chunk before we try really hard
      to make the allocation, so we should be able to find space for the
      device entry.  Add a flag to the trans handle so we know we're
      currently allocating a chunk and can just bail out if we try to
      allocate another one.  Thanks,
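
      The guard is just a flag on the transaction handle, roughly (a
      sketch, not the exact diff):

              /* in do_chunk_alloc(): refuse to recurse into a second
               * chunk allocation from the same transaction handle */
              if (trans->allocating_chunk)
                      return -ENOSPC;
              trans->allocating_chunk = true;

              ret = btrfs_alloc_chunk(trans, extent_root, flags);
              trans->allocating_chunk = false;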
      
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: flush all dirty inodes if writeback can not start · da633a42
      Miao Xie authored
      
      We may try to flush some dirty pages when there is not enough space
      to reserve.  It is possible that this operation fails; in order to
      reserve space successfully, we then sync all the delalloc files.
      This operation is safe: we needn't worry about the filesystem going
      from r/w to r/o, because the filesystem guarantees that all dirty
      pages are written to disk before it becomes read-only, so the sync
      operation will do nothing if the filesystem is already read-only.
      Though it may waste lots of time, as a corner case we needn't care.
      
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: make delayed ref lock logic more readable · 093486c4
      Miao Xie authored
      
      Locking and unlocking of the delayed ref mutex happen in different
      functions, and the naming of the lock functions is not uniform, so
      the readability is not good.  This patch optimizes the lock logic and
      makes it more readable.
      
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: use slabs for delayed reference allocation · 78a6184a
      Miao Xie authored
      
      Delayed reference allocation is in the fast path of the IO, so use
      slabs to improve the speed of the allocation.

      Besides that, the slab can check for leaked objects when the module
      is removed.
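
      Setup follows the usual kmem_cache pattern (an illustrative sketch
      for one of the object types):

              btrfs_delayed_ref_head_cachep = kmem_cache_create(
                              "btrfs_delayed_ref_head",
                              sizeof(struct btrfs_delayed_ref_head), 0,
                              SLAB_MEM_SPREAD, NULL);

              head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep,
                                      GFP_NOFS);
              /* ... use the delayed ref head ... */
              kmem_cache_free(btrfs_delayed_ref_head_cachep, head);

              /* kmem_cache_destroy() at module unload warns about any
               * objects that were leaked */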
      
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
  3. Feb 01, 2013
    • Btrfs: reduce CPU contention while waiting for delayed extent operations · bb721703
      Chris Mason authored
      
      We batch up operations to the extent allocation tree, which allows
      us to deal with the recursive nature of using the extent allocation
      tree to allocate extents to the extent allocation tree.
      
      It also provides a mechanism to sort and collect extent
      operations, which makes it much more efficient to record extents
      that are close together.
      
      The delayed extent operations must all be finished before the running
      transaction commits, so we have code to make sure we run a few of the
      batched operations when closing our transaction handles.
      
      This creates a great deal of contention for the locks in the
      delayed extent operation tree, and also contention for the lock on the
      extent allocation tree itself.  All the extra contention just slows
      down the operations and doesn't get things done any faster.
      
      This commit changes things to use a wait queue instead.  As procs
      want to run the delayed operations, one of them races in and gets
      permission to hit the tree, and the others step back and wait for
      progress to be made.
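
      The pattern is roughly the following (a sketch with illustrative
      names): one caller wins the right to run the batch, and the rest
      sleep on the wait queue instead of fighting over the tree locks.

              if (atomic_xchg(&delayed_refs->ref_runner, 1)) {
                      /* someone else is running the batch; wait for
                       * progress instead of contending for the locks */
                      wait_event(delayed_refs->wait,
                                 !atomic_read(&delayed_refs->ref_runner));
                      return 0;
              }
              run_batched_extent_ops(delayed_refs);   /* hypothetical */
              atomic_set(&delayed_refs->ref_runner, 0);
              wake_up(&delayed_refs->wait);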
      
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix cluster alignment for mount -o ssd · 8de972b4
      Chris Mason authored
      
      With the new raid56 code, we want to make sure we're properly
      aligning our allocation clusters with -o ssd.
      
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: RAID5 and RAID6 · 53b381b3
      David Woodhouse authored
      
      This builds on David Woodhouse's original Btrfs raid5/6 implementation.
      The code has changed quite a bit, blame Chris Mason for any bugs.
      
      Read/modify/write is done after the higher levels of the filesystem have
      prepared a given bio.  This means the higher layers are not responsible
      for building full stripes, and they don't need to query for the topology
      of the extents that may get allocated during delayed allocation runs.
      It also means different files can easily share the same stripe.
      
      But, it does expose us to incorrect parity if we crash or lose power
      while doing a read/modify/write cycle.  This will be addressed in a
      later commit.
      
      Scrub is unable to repair crc errors on raid5/6 chunks.
      
      Discard does not work on raid5/6 (yet).
      
      The stripe size is fixed at 64KiB per disk.  This will be tunable
      in a later commit.
      
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
  4. Jan 12, 2013
    • vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them · 10ee27a0
      Miao Xie authored
      
      writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
      with down_read_trylock() because:

      - If ->s_umount is write-locked, then the sb is not idle.  That is,
        writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.

      - writeback_inodes_sb(_nr)_if_idle() grabs the s_umount lock when it wants to
        start writeback, which may lead to a deadlock when doing umount.  In order
        to fix the problem, ext4 and btrfs implemented their own writeback functions
        instead of writeback_inodes_sb(_nr)_if_idle(), but that introduced redundant
        code; it is better to re-implement writeback_inodes_sb(_nr)_if_idle().
      
      The names of these two functions are cumbersome, so rename them to
      try_to_writeback_inodes_sb(_nr).
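
      The re-implementation reduces to a trylock (a sketch close to the
      actual helper):

              bool try_to_writeback_inodes_sb(struct super_block *sb,
                                              enum wb_reason reason)
              {
                      if (!down_read_trylock(&sb->s_umount))
                              return false;   /* write-locked: not idle */
                      writeback_inodes_sb(sb, reason);
                      up_read(&sb->s_umount);
                      return true;
              }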
      
      This idea came from Christoph Hellwig.
      Some code comes from a patch by Kamal Mostafa.
      
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>