  1. Feb 19, 2014
  2. Feb 18, 2014
  3. Feb 17, 2014
  4. Feb 16, 2014
    • ext4: fix online resize with a non-standard blocks per group setting · 3d2660d0
      Theodore Ts'o authored
      
      The set_flexbg_block_bitmap() function assumed that the number of
      blocks in a blockgroup was sb->blocksize * 8, which is normally true,
      but not always!  Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block
      bitmap corruption after:
      
      mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G
      mount -t ext4 /dev/vdd /vdd
      resize2fs /dev/vdd 8G
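
      To see why the old assumption breaks here, a toy userspace illustration
      (not kernel code; the block number is arbitrary): with 4 KiB blocks a
      block bitmap can describe 4096 * 8 = 32768 blocks, but "-g 3072" makes a
      group hold only 3072, so deriving the group number from the bitmap
      capacity lands in the wrong group.

        #include <stdio.h>

        int main(void)
        {
                unsigned long long block = 100000;        /* arbitrary fs block */
                unsigned long blocksize  = 4096;
                unsigned long real_bpg   = 3072;          /* mke2fs -g 3072     */
                unsigned long wrong_bpg  = blocksize * 8; /* the old assumption */

                printf("group (EXT4_BLOCKS_PER_GROUP): %llu\n", block / real_bpg);  /* 32 */
                printf("group (blocksize * 8):         %llu\n", block / wrong_bpg); /*  3 */
                return 0;
        }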
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: Jon Bernard <jbernard@tuxion.com>
      Cc: stable@vger.kernel.org
      3d2660d0
    • ext4: fix online resize with very large inode tables · b93c9535
      Theodore Ts'o authored
      
      If a file system has a large number of inodes per block group, all of
      the metadata blocks in a flex_bg may be larger than what can fit in a
      single block group.  Unfortunately, ext4_alloc_group_tables() in
      resize.c was never tested to see if it would handle this case
      correctly, and there were a large number of bugs which caused the
      following sequence to result in a BUG_ON:
      
      kernel bug at fs/ext4/resize.c:409!
         ...
      call trace:
       [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830
       [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80
       [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00
       [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0
       [<ffffffff811b9df2>] ? final_putname+0x22/0x50
       [<ffffffff811c1371>] sys_ioctl+0x81/0xa0
       [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b
      code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0
      rip  [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180
      
      
      This can be reproduced with the following command sequence:
      
         mke2fs -t ext4 -i 4096 /dev/vdd 1G
         mount -t ext4 /dev/vdd /vdd
         resize2fs /dev/vdd 8G
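
      A back-of-the-envelope check of the numbers for this reproduction
      (userspace illustration; assumes the mke2fs defaults of 4 KiB blocks,
      256-byte inodes and a flex_bg size of 16):

        #include <stdio.h>

        int main(void)
        {
                unsigned long block_size       = 4096;
                unsigned long blocks_per_group = block_size * 8;     /* 32768   */
                unsigned long inode_ratio      = 4096;               /* -i 4096 */
                unsigned long inode_size       = 256;
                unsigned long flex_groups      = 16;

                unsigned long inodes_per_group =
                        blocks_per_group * block_size / inode_ratio; /* 32768 */
                unsigned long itable_blocks =
                        inodes_per_group * inode_size / block_size;  /* 2048  */

                /* the flex group's inode tables alone fill an entire block
                 * group, so the group tables must straddle into the next one */
                printf("inode table blocks per flex_bg: %lu\n",
                       itable_blocks * flex_groups);                 /* 32768 */
                printf("blocks per group:               %lu\n",
                       blocks_per_group);                            /* 32768 */
                return 0;
        }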
      
      To fix this, we need to make sure the right thing happens when a block
      group's inode table straddles two block groups, which means the
      following bugs had to be fixed:
      
      1) Not clearing the BLOCK_UNINIT flag in the second block group in
         ext4_alloc_group_tables() --- this was the proximate cause of the BUG_ON.
      
      2) Incorrectly determining how many block groups contained contiguous
         free blocks in ext4_alloc_group_tables().
      
      3) Incorrectly setting the start of the next block range to be marked
         in use after a discontinuity in setup_new_flex_group_blocks().
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      b93c9535
  5. Feb 15, 2014
    • Btrfs: use right clone root offset for compressed extents · 93de4ba8
      Filipe David Borba Manana authored
      
      For non compressed extents, iterate_extent_inodes() gives us offsets
      that take into account the data offset from the file extent items, while
      for compressed extents it doesn't. Therefore we have to adjust them before
      placing them in a send clone instruction.  Not doing this adjustment leads to
      the receiving end requesting a wrong file range in the clone ioctl,
      which results in different file content from the one in the original send
      root.
      
      Issue reproducible with the following excerpt from the test I made for
      xfstests:
      
        _scratch_mkfs
        _scratch_mount "-o compress-force=lzo"
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
      
        $XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT
        $XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo
      
        # will be used for incremental send to be able to issue clone operations
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap
      
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
      
        $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
        $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
            -x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2
        $FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \
            -x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2
      
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap
        $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \
            -c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap
      
        _scratch_unmount
        _scratch_mkfs
        _scratch_mount
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
        $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap
        $FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
        $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full
      
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      93de4ba8
    • btrfs: fix null pointer dereference at btrfs_sysfs_add_one+0x105 · f085381e
      Anand Jain authored
      
      bdev is NULL when the disk has disappeared and the filesystem is mounted
      with the degrade option.
      
      stack trace
      ---------
      btrfs_sysfs_add_one+0x105/0x1c0 [btrfs]
      open_ctree+0x15f3/0x1fe0 [btrfs]
      btrfs_mount+0x5db/0x790 [btrfs]
      ? alloc_pages_current+0xa4/0x160
      mount_fs+0x34/0x1b0
      vfs_kern_mount+0x62/0xf0
      do_mount+0x22e/0xa80
      ? __get_free_pages+0x9/0x40
      ? copy_mount_options+0x31/0x170
      SyS_mount+0x7e/0xc0
      system_call_fastpath+0x16/0x1b
      ---------
      
      reproducer:
      -------
      mkfs.btrfs -draid1 -mraid1 /dev/sdc /dev/sdd
      (detach a disk)
      devmgt detach /dev/sdc [1]
      mount -o degrade /dev/sdd /btrfs
      -------
      
      [1] github.com/anajain/devmgt.git
      
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      f085381e
  6. Feb 14, 2014
    • CIFS: Fix too big maxBuf size for SMB3 mounts · 2365c4ea
      Pavel Shilovsky authored
      
      SMB3 servers can respond with a MaxTransactSize of more than 4M,
      which can cause a memory allocation error returned from kmalloc
      in a lock codepath.  Also, the client does not currently support
      multicredit requests and only allows buffer sizes of 65536 bytes.
      Set MaxTransactSize to this maximum supported value.
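
      A minimal sketch of the cap (illustrative; the field and struct names
      follow the usual cifs negotiate-response code and are assumptions here,
      not the exact hunk):

        /* 65536: the largest buffer the client supports without multicredit */
        server->maxBuf = min_t(unsigned int,
                               le32_to_cpu(rsp->MaxTransactSize),
                               65536);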
      
      Cc: stable@vger.kernel.org # 3.7+
      Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      2365c4ea
    • cifs: ensure that uncached writes handle unmapped areas correctly · 5d81de8e
      Jeff Layton authored
      
      It's possible for userland to pass down an iovec via writev() that has a
      bogus user pointer in it. If that happens and we're doing an uncached
      write, then we can end up getting fewer bytes than we expect from the
      call to iov_iter_copy_from_user.  This is CVE-2014-0069.
      
      cifs_iovec_write isn't set up to handle that situation however. It'll
      blindly keep chugging through the page array and not filling those pages
      with anything useful. Worse yet, we'll later end up with a negative
      number in wdata->tailsz, which will confuse the sending routines and
      cause an oops at the very least.
      
      Fix this by having the copy phase of cifs_iovec_write stop copying data
      in this situation and send the last write as a short one. At the same
      time, we want to avoid sending a zero-length write to the server, so
      break out of the loop and set rc to -EFAULT if that happens. This also
      allows us to handle the case where no address in the iovec is valid.
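
      An illustrative sketch of the copy-loop behaviour described above (the
      variable names are placeholders, not the actual cifs_iovec_write hunk):

        for (i = 0; i < nr_pages; i++) {
                size_t bytes  = min_t(size_t, PAGE_SIZE, remaining);
                size_t copied = iov_iter_copy_from_user(wdata->pages[i], &it,
                                                        0, bytes);

                iov_iter_advance(&it, copied);
                total_copied += copied;
                remaining    -= copied;
                if (copied < bytes) {
                        /* bogus user pointer: stop here, send a short write */
                        nr_pages = i + 1;
                        break;
                }
        }
        if (total_copied == 0) {
                /* no address in the iovec was valid at all */
                rc = -EFAULT;
                goto out;
        }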
      
      [Note: Marking this for stable on v3.4+ kernels, but kernels as old as
             v2.6.38 may have a similar problem and may need a similar fix]
      
      Cc: <stable@vger.kernel.org> # v3.4+
      Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      5d81de8e
    • Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol · 3a0dfa6a
      Josef Bacik authored
      
      A user was running into errors from an NFS export of a subvolume that had a
      default subvol set.  When we mount a default subvol we will use d_obtain_alias()
      to find an existing dentry for the subvolume in the case that the root subvol
      has already been mounted, or a dummy one is allocated in the case that the root
      subvol has not already been mounted.  This allows us to connect the dentry later
      on if we wander into the path.  However if we don't ever wander into the path we
      will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It doesn't
      appear to cause any problems but it is annoying nonetheless, so simply unset
      DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
      use d_materialise_unique() instead, which will make everything play nicely
      together and reconnect things if we wander into the default subvol path from a
      different direction.  With this patch I'm no longer getting the NFS errors when
      exporting a volume that has been mounted with a default subvol set.  Thanks,
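
      An illustrative sketch of the get_default_root() change (not the exact
      hunk; the locking detail is an assumption):

        dentry = d_obtain_alias(inode);
        if (!IS_ERR(dentry)) {
                spin_lock(&dentry->d_lock);
                dentry->d_flags &= ~DCACHE_DISCONNECTED;
                spin_unlock(&dentry->d_lock);
        }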
      
      cc: bfields@fieldses.org
      cc: ebiederm@xmission.com
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3a0dfa6a
    • Btrfs: fix max_inline mount option · feb5f965
      Mitch Harder authored
      
      Currently, the only mount option for max_inline that has any effect is
      max_inline=0.  Any other value that is supplied to max_inline will be
      adjusted to a minimum of 4k.  Since max_inline has an effective maximum
      of ~3900 bytes due to page size limitations, the current behaviour
      only has meaning for max_inline=0.
      
      This patch will allow the max_inline mount option to accept non-zero
      values as indicated in the documentation.
      
      Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Chris Mason <clm@fb.com>
      feb5f965
    • Btrfs: fix a lockdep warning when cleaning up aborted transaction · a9d2d4ad
      Liu Bo authored
      
      Now that we have two spinlocks for the management of delayed refs,
      CONFIG_DEBUG_SPINLOCK=y helped me find this:
      
      [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
      [ 4723.414882]  lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2
      [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G        W  O 3.12.0+ #4
      [ 4723.421321] Call Trace:
      [ 4723.421872]  [<ffffffff81680fe7>] dump_stack+0x54/0x74
      [ 4723.422753]  [<ffffffff81681093>] spin_dump+0x8c/0x91
      [ 4723.424979]  [<ffffffff816810b9>] spin_bug+0x21/0x26
      [ 4723.425846]  [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90
      [ 4723.434424]  [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40
      [ 4723.438747]  [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs]
      [ 4723.443321]  [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs]
      [ 4723.444692]  [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
      [ 4723.450336]  [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10
      [ 4723.451332]  [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs]
      [ 4723.452543]  [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
      [ 4723.457833]  [<ffffffff81079efa>] kthread+0xea/0xf0
      [ 4723.458990]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
      [ 4723.460133]  [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
      [ 4723.460865]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
      [ 4723.496521] ------------[ cut here ]------------
      
      ----------------------------------------------------------------------
      
      The reason is that we get to call cond_resched_lock(&head_ref->lock) while
      still holding @delayed_refs->lock.
      
      This is different from __btrfs_run_delayed_refs(), where we do a drop-acquire
      dance before and after actually processing delayed refs.
      
      Here we don't drop the lock, others are not able to add new delayed refs to
      head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here.
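
      An illustrative sketch of the problematic pattern (not the btrfs code
      itself):

        spin_lock(&delayed_refs->lock);      /* outer lock, still held ...  */
        spin_lock(&head_ref->lock);          /* ... when the inner is taken */
        cond_resched_lock(&head_ref->lock);  /* may drop/retake the inner
                                                lock and reschedule: bad    */
        spin_unlock(&head_ref->lock);
        spin_unlock(&delayed_refs->lock);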
      
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a9d2d4ad
    • Revert "btrfs: add ioctl to export size of global metadata reservation" · 11bcac89
      Chris Mason authored
      
      This reverts commit 01e219e8.
      
      David Sterba found a different way to provide these features without adding a new
      ioctl.  We haven't released any progs with this ioctl yet, so I'm taking this out
      for now until we finalize things.
      
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      CC: Jeff Mahoney <jeffm@suse.com>
      11bcac89
  7. Feb 13, 2014
    • jfs: set i_ctime when setting ACL · 844fa1b5
      Dave Kleikamp authored
      
      This fixes a regression in 3.14-rc1 where xfstests generic/307 fails.
      
      jfs sets the ctime on the inode when writing an xattr. Previously,
      jfs went ahead and stored an acl that can be completely represented
      in the traditional permission bits, so the ctime was always set in
      the xattr code. The new code doesn't bother storing the acl in that
      case, thus the ctime isn't getting set.
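
      A minimal sketch of the idea (illustrative, not the exact jfs_set_acl()
      hunk):

        /* stamp ctime even when no xattr ends up being written */
        inode->i_ctime = CURRENT_TIME;
        mark_inode_dirty(inode);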
      
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      844fa1b5
    • lockd: send correct lock when granting a delayed lock. · 2ec197db
      NeilBrown authored
      
      If an NFS client attempts to get a lock (using NLM) and the lock is
      not available, the server will remember the request and when the lock
      becomes available it will send a GRANT request to the client to
      provide the lock.
      
      If the client already held an adjacent lock, the GRANT callback will
      report the union of the existing and new locks, which can confuse the
      client.
      
      This happens because __posix_lock_file (called by vfs_lock_file)
      updates the passed-in file_lock structure when adjacent or
      over-lapping locks are found.
      
      To avoid this problem we take a copy of the two fields that can
      be changed (fl_start and fl_end) before the call and restore them
      afterwards.
      An alternative would be to allocate a 'struct file_lock', initialise it,
      use locks_copy_lock() to take a copy, then locks_release_private()
      after the vfs_lock_file() call.  But that is a lot more work.
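
      A minimal sketch of the save/restore described above (illustrative; the
      variable names are placeholders for the lockd code, not the exact hunk):

        loff_t fl_start = lock->fl.fl_start;
        loff_t fl_end   = lock->fl.fl_end;
        int error;

        error = vfs_lock_file(file->f_file, F_SETLK, &lock->fl, NULL);

        /* __posix_lock_file() may have widened the range to cover an
         * adjacent lock; undo that before the GRANT callback reports it */
        lock->fl.fl_start = fl_start;
        lock->fl.fl_end   = fl_end;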
      
      Reported-by: Olaf Kirch <okir@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      
      --
      v1 had a couple of issues (large on-stack struct and didn't really work properly).
      This version is much better tested.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      2ec197db
  8. Feb 12, 2014
    • ext4: don't try to modify s_flags if the file system is read-only · 23301410
      Theodore Ts'o authored
      
      If an ext4 file system is created by some tool other than mke2fs
      (perhaps by someone who has a pathological fear of the GPL) that
      doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags,
      and that file system is then mounted read-only, don't try to modify
      the s_flags field.  Otherwise, if dm_verity is in use, the superblock
      will change, causing a dm_verity failure.
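
      An illustrative sketch of the guard (not the exact ext4_fill_super()
      hunk):

        if (!(sb->s_flags & MS_RDONLY)) {
        #ifdef __CHAR_UNSIGNED__
                es->s_flags |= cpu_to_le32(EXT2_FLAGS_UNSIGNED_HASH);
        #else
                es->s_flags |= cpu_to_le32(EXT2_FLAGS_SIGNED_HASH);
        #endif
        }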
      
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      23301410
    • ext4: fix error paths in swap_inode_boot_loader() · 30d29b11
      Zheng Liu authored
      
      In swap_inode_boot_loader() we forgot to release ->i_mutex and resume
      unlocked dio for inode and inode_bl if there is an error starting the
      journal handle.  This commit fixes this issue.
      
      Reported-by: Ahmed Tamrawi <ahmedtamrawi@gmail.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dr. Tilmann Bubeck <t.bubeck@reinform.de>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org  # v3.10+
      30d29b11
    • ext4: fix xfstest generic/299 block validity failures · 15cc1767
      Eric Whitney authored
      
      Commit a115f749 (ext4: remove wait for unwritten extent conversion from
      ext4_truncate) exposed a bug in ext4_ext_handle_uninitialized_extents().
      It can be triggered by xfstest generic/299 when run on a test file
      system created without a journal.  This test continuously fallocates and
      truncates files to which random dio/aio writes are simultaneously
      performed by a separate process.  The test completes successfully, but
      if the test filesystem is mounted with the block_validity option, a
      warning message stating that a logical block has been mapped to an
      illegal physical block is posted in the kernel log.
      
      The bug occurs when an extent is being converted to the written state
      by ext4_end_io_dio() and ext4_ext_handle_uninitialized_extents()
      discovers a mapping for an existing uninitialized extent. Although it
      sets EXT4_MAP_MAPPED in map->m_flags, it fails to set map->m_pblk to
      the discovered physical block number.  Because map->m_pblk is not
      otherwise initialized or set by this function or its callers, its
      uninitialized value is returned to ext4_map_blocks(), where it is
      stored as a bogus mapping in the extent status tree.
      
      Since map->m_pblk can accidentally contain illegal values that are
      larger than the physical size of the file system, calls to
      check_block_validity() in ext4_map_blocks() that are enabled when the
      block_validity mount option is used can fail, resulting in the logged
      warning message.
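
      An illustrative sketch of the fix (not the exact hunk; "newblock" stands
      for the physical block found for the existing uninitialized extent):

        map->m_flags |= EXT4_MAP_MAPPED;
        map->m_pblk = newblock;      /* previously left uninitialized */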
      
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org  # 3.11+
      15cc1767
  9. Feb 11, 2014
    • nfsd4: fix acl buffer overrun · 09bdc2d7
      J. Bruce Fields authored
      
      Commit 4ac7249e ("nfsd: use get_acl and
      ->set_acl") forgets to set the size in the case where get_acl() succeeds, so
      _posix_to_nfsv4_one() can then write past the end of its allocation.
      Symptoms were slab corruption warnings.
      
      Also, some minor cleanup while we're here.  (Among other things, note
      that the first few lines guarantee that pacl is non-NULL.)
      
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      09bdc2d7
    • block: Fix cloning of discard/write same bios · 8423ae3d
      Kent Overstreet authored
      
      Immutable biovecs changed the way bio segments are treated in such a way that
      bio_for_each_segment() can no longer do what we want for discard/write same bios,
      since bi_size means something completely different for them.
      
      Fortunately discard and write same bios never have more than a single biovec, so
      bio_for_each_segment() is unnecessary and not terribly meaningful for them, but
      we still have to special case them in a few places.
      
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Tested-by: Richard W.M. Jones <rjones@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8423ae3d
    • ocfs2: check existence of old dentry in ocfs2_link() · 0e048316
      Xue jiufei authored
      
      The linkat system call first calls user_path_at() to check the existence
      of the old dentry, and then calls vfs_link()->ocfs2_link() to do the
      actual work.  There is a race when node A creates a hard link for a file
      while node B removes it.
      
               Node A                          Node B
      user_path_at()
        ->ocfs2_lookup(),
      find old dentry exist
                                      rm file, add inode say inodeA
                                      to orphan_dir
      
      call ocfs2_link(),create a
      hard link for inodeA.
      
                                      rm the link, add inodeA to orphan_dir
                                      again
      
      When the orphan_scan work starts, it calls ocfs2_queue_orphans() to do the
      main work.  It first traverses the entries in the orphan_dir, linking all
      inodes in this orphan_dir into a list that looks like this:

      	inodeA->inodeB->...->inodeA

      When traversing this list, it falls into a loop, calling iput() again
      and again, and finally triggers BUG_ON(inode->i_state & I_CLEAR).
      
      Signed-off-by: joyce <xuejiufei@huawei.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e048316
    • ocfs2: update inode size after zeroing the hole · c7d2cbc3
      Junxiao Bi authored
      
      fs-writeback will release dirty pages whose offsets are beyond the inode
      size without taking the page lock; the release happens at
      block_write_full_page_endio().  If the inode size is not updated, dirty
      pages in file holes may be released before being flushed to disk, so the
      holes will contain non-zero data, causing sparse file md5sum errors.
      
      To reproduce the bug, find a big sparse file with many holes, like a vm
      image file; its actual size should be bigger than the available memory so
      that writeback runs more frequently.  Tar it with the -S option, then keep
      untarring it and checking its md5sum again and again until you get a wrong
      md5sum.
      
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Younger Liu <younger.liu@huawei.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7d2cbc3
    • ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size · d62e74be
      Younger Liu authored
      
      The issue scenario is as following:
      
      - Create a small file and fallocate a large disk space for a file with
        FALLOC_FL_KEEP_SIZE option.
      
      - ftruncate the file back to the original size again, but the disk free
        space is not changed back.  This is a real bug that is fixed in this
        patch.
      
      In order to solve the issue above, we modified ocfs2_setattr(): if
      attr->ia_size != i_size_read(inode), it calls ocfs2_truncate_file() and
      truncates the disk space to attr->ia_size.
      
      Signed-off-by: Younger Liu <younger.liu@huawei.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Tested-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Reviewed-by: Jensen <shencanquan@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d62e74be
    • mm: fix page leak at nfs_symlink() · a0b54add
      Rafael Aquini authored
      
      Changes in commit a0b8cab3 ("mm: remove lru parameter from
      __pagevec_lru_add and remove parts of pagevec API") have introduced a
      call to add_to_page_cache_lru() which causes a leak in nfs_symlink() as
      now the page gets an extra refcount that is not dropped.
      
      Jan Stancek observed and reported the leak effect while running test8
      from Connectathon Testsuite.  After several iterations over the test
      case, which creates several symlinks on a NFS mountpoint, the test
      system was quickly getting into an out-of-memory scenario.
      
      This patch fixes the page leak by dropping that extra refcount
      add_to_page_cache_lru() is grabbing.
      
      Signed-off-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: <stable@vger.kernel.org>	[3.11.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0b54add
    • ocfs2: fix ocfs2_sync_file() if filesystem is readonly · a987c7ca
      Younger Liu authored
      
      If filesystem is readonly, there is no need to flush drive's caches or
      force any uncommitted transactions.
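
      A minimal sketch of the early bail-out (illustrative; the two helpers
      are the usual ocfs2 read-only checks and "osb" is a placeholder for the
      super pointer):

        if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
                return -EROFS;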
      
      [akpm@linux-foundation.org: return -EROFS, not 0]
      Signed-off-by: Younger Liu <younger.liucn@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a987c7ca
    • fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem · 96c7a2ff
      Eric W. Biederman authored
      
      Recently, due to a spike in connections per second, memcached on 3
      separate boxes triggered the OOM killer from accept.  At the time the
      OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
      problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
      hold a bitmap, and there was sufficient fragmentation that the largest
      page available was 8KiB.
      
      I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
      but I do agree that order 3 allocations are very likely to succeed.
      
      There are always pathologies where order > 0 allocations can fail when
      there are copious amounts of free memory available.  Using the pigeon
      hole principle it is easy to show that it requires 1 page more than 50%
      of the pages being free to guarantee an order 1 (8KiB) allocation will
      succeed, 1 page more than 75% of the pages being free to guarantee an
      order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
      the pages being free to guarantee an order 3 allocate will succeed.
      
      A server churning memory with a lot of small requests and replies, like
      memcached, is a common case that, if anything, will skew the odds
      against large pages being available.
      
      Therefore let's not give external applications a practical way to kill
      linux server applications, and specify __GFP_NORETRY for the kmalloc in
      alloc_fdmem.  Unless I am misreading the code, by the time the code
      reaches should_alloc_retry in __alloc_pages_slowpath (where
      __GFP_NORETRY becomes significant), we have already tried everything
      reasonable to allocate a page and the only thing left to do is wait.  So
      not waiting and falling back to vmalloc immediately seems like the
      reasonable thing to do, even if there wasn't a chance of triggering the
      OOM killer.
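
      A sketch of the resulting helper (close to, but not necessarily
      identical to, the final fs/file.c code):

        static void *alloc_fdmem(size_t size)
        {
                /* try a cheap kmalloc first; don't warn and don't retry */
                if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                        void *data = kmalloc(size, GFP_KERNEL |
                                                   __GFP_NOWARN |
                                                   __GFP_NORETRY);
                        if (data != NULL)
                                return data;
                }
                /* otherwise fall back to vmalloc immediately */
                return vmalloc(size);
        }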
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Cong Wang <cwang@twopensource.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96c7a2ff
    • vmcore: prevent PT_NOTE p_memsz overflow during header update · 38dfac84
      Greg Pearson authored
      
      Currently, update_note_header_size_elf64() and
      update_note_header_size_elf32() will add the size of a PT_NOTE entry to
      real_sz even if that causes real_sz to exceed max_sz.  This patch
      corrects the while loop logic in those routines to ensure that this does
      not happen, and prints a warning if a PT_NOTE entry is dropped.  If zero
      PT_NOTE entries are found or this condition is encountered because the
      only entry was dropped, a warning is printed and an error is returned.
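
      An illustrative sketch of the corrected loop shape (placeholder names
      and helper layout, not the exact hunk):

        while (nhdr_ptr->n_namesz != 0) {
                u64 sz = sizeof(Elf64_Nhdr) +
                         roundup(nhdr_ptr->n_namesz, 4) +
                         roundup(nhdr_ptr->n_descsz, 4);

                if (real_sz + sz > max_sz) {
                        pr_warn("Dropping PT_NOTE entry exceeding p_memsz\n");
                        break;
                }
                real_sz += sz;
                nhdr_ptr = (Elf64_Nhdr *)((char *)nhdr_ptr + sz);
        }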
      
      One possible negative side effect of exceeding the max_sz limit is an
      allocation failure in merge_note_headers_elf64() or
      merge_note_headers_elf32() which would produce console output such as
      the following while booting the crash kernel.
      
        vmalloc: allocation failure: 14076997632 bytes
        swapper/0: page allocation failure: order:0, mode:0x80d2
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-gbp1 #7
        Call Trace:
          dump_stack+0x19/0x1b
          warn_alloc_failed+0xf0/0x160
          __vmalloc_node_range+0x19e/0x250
          vmalloc_user+0x4c/0x70
          merge_note_headers_elf64.constprop.9+0x116/0x24a
          vmcore_init+0x2d4/0x76c
          do_one_initcall+0xe2/0x190
          kernel_init_freeable+0x17c/0x207
          kernel_init+0xe/0x180
          ret_from_fork+0x7c/0xb0
      
        Kdump: vmcore not initialized
      
        kdump: dump target is /dev/sda4
        kdump: saving to /sysroot//var/crash/127.0.0.1-2014.01.28-13:58:52/
        kdump: saving vmcore-dmesg.txt
        Cannot open /proc/vmcore: No such file or directory
        kdump: saving vmcore-dmesg.txt failed
        kdump: saving vmcore
        kdump: saving vmcore failed
      
      This type of failure has been seen on a four socket prototype system
      with certain memory configurations.  Most PT_NOTE sections have a single
      entry similar to:
      
        n_namesz = 0x5
        n_descsz = 0x150
        n_type   = 0x1
      
      Occasionally, a second entry is encountered with very large n_namesz and
      n_descsz sizes:
      
        n_namesz = 0x80000008
        n_descsz = 0x510ae163
        n_type   = 0x80000008
      
      The source of these extra entries is not yet clear; they seem bogus, but
      they shouldn't cause the crash dump to fail.
      
      Signed-off-by: Greg Pearson <greg.pearson@hp.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38dfac84
  10. Feb 10, 2014
    • [CIFS] Fix cifsacl mounts over smb2 to not call cifs · 42eacf9e
      Steve French authored
      
      When mounting with smb2/smb3 (e.g. vers=2.1) and the cifsacl mount option,
      the client was trying to get the mode by querying the acl over the cifs
      protocol rather than smb2.  This patch makes that protocol
      independent and makes cifsacl smb2 mounts return a more intuitive
      "operation not supported" error (until we add a worker function
      for smb2_get_acl).
      
      Note that a previous patch fixed getxattr/setxattr for the CIFSACL xattr,
      which would unconditionally call cifs_get_acl and cifs_set_acl (even when
      mounted smb2).  I made those protocol independent last week (new protocol
      version operations "get_acl" and "set_acl"), but did not add an
      smb2_get_acl and smb2_set_acl yet, so those now simply return EOPNOTSUPP,
      which at least is better than sending cifs requests on an smb2 mount.
      
      The previous patches did not fix the one remaining case though: mounting
      with "cifsacl", where getting the mode from the acl would unconditionally
      end up calling "cifs_get_acl_from_fid" even for smb2.  This patch makes
      that protocol independent as well; to do so, the callers had to be changed
      to pass the protocol-independent handle structure (cifs_fid) instead of the
      cifs-specific __u16 network file handle (i.e. cifs_fid instead of
      cifs_fid->fid).

      Now mount with smb2 and the cifsacl mount option will return EOPNOTSUPP
      (instead of timing out), and a future patch will add smb2 operations
      (e.g. get_smb2_acl) to enable this.
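
      An illustrative sketch of the protocol-independent call path (the names
      approximate the cifs ops-table pattern, not the exact hunk):

        if (tcon->ses->server->ops->get_acl)
                pntsd = tcon->ses->server->ops->get_acl(cifs_sb, inode,
                                                        path, &acllen);
        else
                rc = -EOPNOTSUPP;    /* until get_smb2_acl is implemented */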
      
      Signed-off-by: Steve French <smfrench@gmail.com>
      42eacf9e
    • NFS: Do not set NFS_INO_INVALID_LABEL unless server supports labeled NFS · fd1defc2
      Trond Myklebust authored
      
      Commit aa9c2669 (NFS: Client implementation of Labeled-NFS) introduces
      a performance regression. When nfs_zap_caches_locked is called, it sets
      the NFS_INO_INVALID_LABEL flag irrespective of whether or not the
      NFS server supports security labels. Since that flag is never cleared,
      it means that all calls to nfs_revalidate_inode() will now trigger
      an on-the-wire GETATTR call.
      
      This patch ensures that we never set the NFS_INO_INVALID_LABEL unless the
      server advertises support for labeled NFS.
      It also causes nfs_setsecurity() to clear NFS_INO_INVALID_LABEL when it
      has successfully set the security label for the inode.
      Finally it gets rid of the NFS_INO_INVALID_LABEL cruft from nfs_update_inode,
      which has nothing to do with labeled NFS.
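
      An illustrative sketch of the capability check (not the exact hunk):

        if (nfs_server_capable(inode, NFS_CAP_SECURITY_LABEL))
                NFS_I(inode)->cache_validity |= NFS_INO_INVALID_LABEL;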
      
      Reported-by: Neil Brown <neilb@suse.de>
      Cc: stable@vger.kernel.org # 3.11+
      Tested-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      fd1defc2