  1. Jun 15, 2009
    • Sunil Mushran's avatar
      ocfs2/net: Use wait_event() in o2net_send_message_vec() · 9af0b38f
      Sunil Mushran authored
      Replace wait_event_interruptible() with wait_event() in o2net_send_message_vec().
This is because the function is called by the dlm, which expects signals to be
blocked.
      
      Fixes oss bugzilla#1126
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1126
      
      
      
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      9af0b38f
    • Tao Ma's avatar
      ocfs2: Adjust rightmost path in ocfs2_add_branch. · 6b791bcc
      Tao Ma authored
      
In ocfs2_add_branch, we use the rightmost rec of the leaf extent block
to generate the e_cpos for the newly added branch. In most cases this
is OK, but if the parent extent block's rightmost rec covers more clusters
than the leaf does, inserting clusters into the gap will cause a kernel
panic. The message is something like:
      (7445,1):ocfs2_insert_at_leaf:3775 ERROR: bug expression:
      le16_to_cpu(el->l_next_free_rec) >= le16_to_cpu(el->l_count)
      (7445,1):ocfs2_insert_at_leaf:3775 ERROR: inode 66053, depth 0, count 28,
      next free 28, rec.cpos 270, rec.clusters 1, insert.cpos 275, insert.clusters 1
       [<fa7ad565>] ? ocfs2_do_insert_extent+0xb58/0xda0 [ocfs2]
       [<fa7b08f2>] ? ocfs2_insert_extent+0x5bd/0x6ba [ocfs2]
       [<fa7b1b8b>] ? ocfs2_add_clusters_in_btree+0x37f/0x564 [ocfs2]
      ...
      
The panic can be easily reproduced by the following small test case
(with bs=512 and cs=4K; all error handling has been removed so it reads
clearly).
      
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, i;
	char buf[5] = "test";

	fd = open(argv[1], O_RDWR|O_CREAT, 0644);

	for (i = 0; i < 30; i++) {
		lseek(fd, 40960 * i, SEEK_SET);
		write(fd, buf, 5);
	}

	ftruncate(fd, 1146880);

	lseek(fd, 1126400, SEEK_SET);
	write(fd, buf, 5);

	close(fd);

	return 0;
}
      
The reason for the panic is that the 30 writes and the ftruncate make
the file's extent list look like:
      
      	Tree Depth: 1   Count: 19   Next Free Rec: 1
      	## Offset        Clusters       Block#
      	0  0             280            86183
      	SubAlloc Bit: 7   SubAlloc Slot: 0
      	Blknum: 86183   Next Leaf: 0
      	CRC32: 00000000   ECC: 0000
      	Tree Depth: 0   Count: 28   Next Free Rec: 28
      	## Offset        Clusters       Block#          Flags
      	0  0             1              143368          0x0
      	1  10            1              143376          0x0
      	...
      	26 260           1              143576          0x0
      	27 270           1              143584          0x0
      
Now another write at 1126400 (cluster 275), which lands in the gap
between 271 and 280, will trigger ocfs2_add_branch, but after the
function the tree looks like:
      	Tree Depth: 1   Count: 19   Next Free Rec: 2
      	## Offset        Clusters       Block#
      	0  0             280            86183
      	1  271           0             143592
So the extent records intersect, and the following operation bugs out.
      
This patch removes the gap before we add the new branch, so that the
rightmost recs of the root and the branch cover the same end position.
So in the above case, before adding the branch the tree will be changed to:
      	Tree Depth: 1   Count: 19   Next Free Rec: 1
      	## Offset        Clusters       Block#
      	0  0             271            86183
      	SubAlloc Bit: 7   SubAlloc Slot: 0
      	Blknum: 86183   Next Leaf: 0
      	CRC32: 00000000   ECC: 0000
      	Tree Depth: 0   Count: 28   Next Free Rec: 28
      	## Offset        Clusters       Block#          Flags
      	0  0             1              143368          0x0
      	1  10            1              143376          0x0
      	...
      	26 260           1              143576          0x0
      	27 270           1              143584          0x0
And after the branch is added, the tree looks like:
      	Tree Depth: 1   Count: 19   Next Free Rec: 2
      	## Offset        Clusters       Block#
      	0  0             271            86183
      	1  271           0             143592
      
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      6b791bcc
  2. Jun 09, 2009
    • Hisashi Hifumi's avatar
      ocfs2: fdatasync should skip unimportant metadata writeout · e04cc15f
      Hisashi Hifumi authored
      
In ocfs2, fdatasync and fsync are identical. fdatasync should skip
committing the transaction when inode->i_state has only I_DIRTY_SYNC
set, since that indicates just atime and/or mtime updates.
The following patch improves fdatasync throughput.
      
      #sysbench --num-threads=16 --max-requests=300000 --test=fileio
      --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
      --file-fsync-mode=fdatasync run
      
      Results:
      -2.6.30-rc8
      Test execution summary:
          total time:                          107.1445s
          total number of events:              119559
          total time taken by event execution: 116.1050
          per-request statistics:
               min:                            0.0000s
               avg:                            0.0010s
               max:                            0.1220s
               approx.  95 percentile:         0.0016s
      
      Threads fairness:
          events (avg/stddev):           7472.4375/303.60
          execution time (avg/stddev):   7.2566/0.64
      
      -2.6.30-rc8-patched
      Test execution summary:
          total time:                          86.8529s
          total number of events:              300016
          total time taken by event execution: 24.3077
          per-request statistics:
               min:                            0.0000s
               avg:                            0.0001s
               max:                            0.0336s
               approx.  95 percentile:         0.0001s
      
      Threads fairness:
          events (avg/stddev):           18751.0000/718.75
          execution time (avg/stddev):   1.5192/0.05
      
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      e04cc15f
  3. Jun 04, 2009
    • Tao Ma's avatar
      06c59bb8
    • Joel Becker's avatar
      ocfs2: Add statistics for the checksum and ecc operations. · 73be192b
      Joel Becker authored
      
      It would be nice to know how often we get checksum failures.  Even
      better, how many of them we can fix with the single bit ecc.  So, we add
      a statistics structure.  The structure can be installed into debugfs
      wherever the user wants.
      
      For ocfs2, we'll put it in the superblock-specific debugfs directory and
      pass it down from our higher-level functions.  The stats are only
      registered with debugfs when the filesystem supports metadata ecc.
      
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      73be192b
    • Srinivas Eeda's avatar
      ocfs2 patch to track delayed orphan scan timer statistics · 15633a22
      Srinivas Eeda authored
      
      Patch to track delayed orphan scan timer statistics.
      
      Modifies ocfs2_osb_dump to print the following:
        Orphan Scan=> Local: 10  Global: 21  Last Scan: 67 seconds ago
      
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      15633a22
    • Srinivas Eeda's avatar
      ocfs2: timer to queue scan of all orphan slots · 83273932
      Srinivas Eeda authored
      
      When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
      before moving the dentry to the orphan directory. Other nodes that have
      this dentry in cache have a PR on the same dentry lock.  When the EX is
      requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
      during downconvert.  The inode is finally deleted when the last node to iput
      the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.
      
      A problem arises if a node is forced to free dentry locks because of memory
      pressure. If this happens, the node will no longer get downconvert
      notifications for the dentries that have been unlinked on another node.
If that node also happens to be actively using the corresponding inode
and performs the last iput on it, it will fail to delete the inode
because the MAYBE_ORPHANED flag is not set.
      
      This patch fixes this shortcoming by introducing a periodic scan of the
      orphan directories to delete such inodes. Care has been taken to distribute
      the workload across the cluster so that no one node has to perform the task
      all the time.
      
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      83273932
    • Jan Kara's avatar
      ocfs2: Correct ordering of ip_alloc_sem and localloc locks for directories · edd45c08
      Jan Kara authored
      
      We use ordering ip_alloc_sem -> local alloc locks in ocfs2_write_begin().
      So change lock ordering in ocfs2_extend_dir() and ocfs2_expand_inline_dir()
      to also use this lock ordering.
      
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      edd45c08
    • Jan Kara's avatar
      ocfs2: Fix possible deadlock in quota recovery · 80d73f15
      Jan Kara authored
      
      In ocfs2_finish_quota_recovery() we acquired global quota file lock and started
      recovering local quota file. During this process we need to get quota
      structures, which calls ocfs2_dquot_acquire() which gets global quota file lock
again. This second lock can block if some other node has requested the quota
file lock in the meantime. Fix the problem by moving the quota file locking
down into the function where it is really needed.  Then dqget() or dqput()
      won't be called with the lock held.
      
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      80d73f15
    • Jan Kara's avatar
      ocfs2: Fix possible deadlock with quotas in ocfs2_setattr() · 65bac575
      Jan Kara authored
      
We called vfs_dq_transfer() with the global quota file lock held. This can
lead to deadlock: if vfs_dq_transfer() has to allocate a new quota structure,
it calls ocfs2_dquot_acquire(), which tries to get the quota file lock again,
and this can block if another node requested the lock in the meantime.
      
      Since we have to call vfs_dq_transfer() with transaction already started
      and quota file lock ranks above the transaction start, we cannot just rely
      on ocfs2_dquot_acquire() or ocfs2_dquot_release() on getting the lock
      if they need it. We fix the problem by acquiring pointers to all quota
      structures needed by vfs_dq_transfer() already before calling the function.
      By this we are sure that all quota structures are properly allocated and
      they can be freed only after we drop references to them. Thus we don't need
      quota file lock anywhere inside vfs_dq_transfer().
      
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      65bac575
    • Jan Kara's avatar
      ocfs2: Fix lock inversion in ocfs2_local_read_info() · b4c30de3
      Jan Kara authored
      
This function is called with dqio_mutex held, but it has to acquire a lock
from the global quota file, which ranks above dqio_mutex. This lock inversion
cannot actually deadlock, since this code path is taken only during mount,
when no one else can race with us, but let's clean it up to silence lockdep.

We just drop dqio_mutex at the beginning of the function and reacquire it
at the end, since we don't need it; no one can race with us at this point.
      
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      b4c30de3
    • Jan Kara's avatar
      ocfs2: Fix possible deadlock in ocfs2_global_read_dquot() · 4e8a3019
      Jan Kara authored
      
It is not possible to take a read lock and then try to take the same write
lock in one thread, as that can block on a downconvert requested by another
node, leading to deadlock. So first drop the quota lock for reading, and only
after that take it for writing.
      
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      4e8a3019
  4. May 05, 2009
    • Coly Li's avatar
      ocfs2: update comments in masklog.h · 2b53bc7b
      Coly Li authored
      
      In the mainline ocfs2 code, the interface for masklog is in files under
      /sys/fs/o2cb/masklog, but the comments in fs/ocfs2/cluster/masklog.h
      reference the old /proc interface.  They are out of date.
      
      This patch modifies the comments in cluster/masklog.h, which also provides
      a bash script example on how to change the log mask bits.
      
Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      2b53bc7b
    • Tao Ma's avatar
      ocfs2: Don't printk the error when listing too many xattrs. · a46fa684
      Tao Ma authored
      
      Currently the kernel defines XATTR_LIST_MAX as 65536
      in include/linux/limits.h.  This is the largest buffer that is used for
      listing xattrs.
      
But with the ocfs2 xattr tree, we actually have no limit on the number. If
the filesystem has more names than fit in the buffer, the kernel logs will
be polluted with something like this when listing:
      
      (27738,0):ocfs2_iterate_xattr_buckets:3158 ERROR: status = -34
      (27738,0):ocfs2_xattr_tree_list_index_block:3264 ERROR: status = -34
      
So don't print an "ERROR" message, as this is not an ocfs2 error.
      
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      a46fa684
  5. Apr 30, 2009
    • Joel Becker's avatar
      ocfs2: Fix a missing credit when deleting from indexed directories. · dfa13f39
      Joel Becker authored
      
      The ocfs2 directory index updates two blocks when we remove an entry -
      the dx root and the dx leaf.  OCFS2_DELETE_INODE_CREDITS was only
      accounting for the dx leaf.  This shows up when ocfs2_delete_inode()
      runs out of credits in jbd2_journal_dirty_metadata() at
      "J_ASSERT_JH(jh, handle->h_buffer_credits > 0);".
      
      The test that caught this was running dirop_file_racer from the
      ocfs2-test suite with a 250-character filename PREFIX.  Run on a 512B
      blocksize, it forces the orphan dir index to grow large enough to
      trigger.
      
Signed-off-by: Joel Becker <joel.becker@oracle.com>
      dfa13f39
  6. Apr 29, 2009
  7. Apr 23, 2009
  8. Apr 22, 2009
  9. Apr 15, 2009
  10. Apr 07, 2009
    • Tao Ma's avatar
      ocfs2: Reserve 1 more cluster in expanding_inline_dir for indexed dir. · 035a5711
      Tao Ma authored
      
In ocfs2_expand_inline_dir, we calculate whether we need 1 extra
cluster if we can't store the dx inline in the root, and save it in
dx_alloc. So add it when we call ocfs2_reserve_clusters.
      
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      035a5711
    • Miklos Szeredi's avatar
      splice: fix deadlock in splicing to file · 7bfac9ec
      Miklos Szeredi authored
      
      There's a possible deadlock in generic_file_splice_write(),
      splice_from_pipe() and ocfs2_file_splice_write():
      
       - task A calls generic_file_splice_write()
       - this calls inode_double_lock(), which locks i_mutex on both
         pipe->inode and target inode
 - the ordering depends on the inode pointers, so it can happen that
   pipe->inode is locked first
       - __splice_from_pipe() needs more data, calls pipe_wait()
       - this releases lock on pipe->inode, goes to interruptible sleep
       - task B calls generic_file_splice_write(), similarly to the first
       - this locks pipe->inode, then tries to lock inode, but that is
         already held by task A
       - task A is interrupted, it tries to lock pipe->inode, but fails, as
         it is already held by task B
       - ABBA deadlock
      
      Fix this by explicitly ordering locks: the outer lock must be on
      target inode and the inner lock (which is later unlocked and relocked)
      must be on pipe->inode.  This is OK, pipe inodes and target inodes
      form two nonoverlapping sets, generic_file_splice_write() and friends
      are not called with a target which is a pipe.
      
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7bfac9ec
  11. Apr 03, 2009