- Apr 08, 2014
-
-
Oleg Nesterov authored
get_task_state() uses the most significant bit to report the state to user-space; this means that the EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition can be noticed via /proc as a Z -> X -> Z change. Note that this was possible even before EXIT_TRACE was introduced. This is not really bad, but imho it makes sense to hide EXIT_TRACE from user-space completely. So the patch simply swaps EXIT_ZOMBIE and EXIT_DEAD; this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space. Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
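A minimal userspace illustration of the "most significant bit wins" reporting the entry above relies on. The STATE_* values and letters below are hypothetical stand-ins, not the kernel's constants; the point is only that a composite state built from two bits is reported as whichever bit is higher, so swapping the two values changes what user-space sees.

```c
#include <stdio.h>

#define STATE_A   0x10                    /* hypothetical bit value */
#define STATE_B   0x20                    /* hypothetical bit value */
#define STATE_AB  (STATE_A | STATE_B)     /* composite, like EXIT_TRACE */

/* Report the letter of the most significant set bit. */
static char report(unsigned int state)
{
	if (state & STATE_B)
		return 'B';
	if (state & STATE_A)
		return 'A';
	return '?';
}

int main(void)
{
	printf("A  -> %c\n", report(STATE_A));
	printf("AB -> %c\n", report(STATE_AB)); /* aliases to B, the higher bit */
	printf("B  -> %c\n", report(STATE_B));
	return 0;
}
```

Swapping the numeric values of STATE_A and STATE_B flips which letter the composite state AB is shown as, which is exactly the trick the patch uses to keep EXIT_TRACE hidden behind EXIT_ZOMBIE.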
-
Oleg Nesterov authored
Starting from commit c4ad8f98 ("execve: use 'struct filename *' for executable name passing"), bprm->filename cannot go away after flush_old_exec(), so we do not need to save the binary name in bprm->tcomm[], added by 96e02d15 ("exec: fix use-after-free bug in setup_new_exec()"). And there was never a need for filename_to_taskname-like code; we can simply do set_task_comm(kbasename(filename)). This patch has to change set_task_comm() and trace_task_rename() to accept "const char *", but I think this change is also good. Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
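A hedged kernel-style sketch of the simplification described above; the wrapper function name is illustrative, but kbasename() (<linux/string.h>) and the const-char* set_task_comm() are the interfaces the entry refers to.

```c
#include <linux/string.h>
#include <linux/sched.h>
#include <linux/binfmts.h>

/*
 * Sketch only: once bprm->filename is guaranteed to stay alive across
 * flush_old_exec(), the task name can be derived from it in place instead
 * of being copied into a separate bprm->tcomm[] buffer earlier.
 */
static void example_set_comm_from_filename(struct linux_binprm *bprm)
{
	/* kbasename() points at the component after the last '/' */
	set_task_comm(current, kbasename(bprm->filename));
}
```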
-
Djalal Harouni authored
The /proc/*/pagemap files contain sensitive information and currently their mode is 0444. Change this to 0400, so the VFS will prevent unprivileged processes from getting file descriptors on arbitrary privileged /proc/*/pagemap files. This reduces the scope of address space leaking and bypasses by protecting already running processes. Signed-off-by:
Djalal Harouni <tixxdz@opendz.org> Acked-by:
Kees Cook <keescook@chromium.org> Acked-by:
Andy Lutomirski <luto@amacapital.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Djalal Harouni authored
These procfs files contain sensitive information and currently their mode is 0444. Change this to 0400, so the VFS will be able to block unprivileged processes from getting file descriptors on arbitrary privileged /proc/*/{stack,syscall,personality} files. This reduces the scope of ASLR leaking and bypasses by protecting already running processes. Signed-off-by:
Djalal Harouni <tixxdz@opendz.org> Acked-by:
Kees Cook <keescook@chromium.org> Acked-by:
Andy Lutomirski <luto@amacapital.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Monam Agarwal authored
Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL). The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. In the case of the NULL pointer, there is no structure to initialize, so rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL). Signed-off-by:
Monam Agarwal <monamagarwal123@gmail.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
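A hedged sketch of the conversion on a made-up structure: rcu_assign_pointer() is kept where a fully initialized object is published, while RCU_INIT_POINTER() suffices when the stored value is NULL and no ordering is needed. The struct and function names are illustrative; the two macros are the real <linux/rcupdate.h> interfaces.

```c
#include <linux/rcupdate.h>

struct example_cfg {
	int value;
};

struct example_holder {
	struct example_cfg __rcu *cfg;
};

static void example_publish(struct example_holder *h, struct example_cfg *new)
{
	new->value = 42;                  /* initialize first ...            */
	rcu_assign_pointer(h->cfg, new);  /* ... then publish (with barrier) */
}

static void example_clear(struct example_holder *h)
{
	/* NULL carries no data, so no memory ordering is required */
	RCU_INIT_POINTER(h->cfg, NULL);
}
```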
-
Andrey Vagin authored
Currently we don't have a way to determine from which mount point a file has been opened. This information is required for properly dumping and restoring file descriptors in the presence of mount namespaces. It's possible that two file descriptors are opened using the same path, but one fd references a mount point from one namespace while the other fd references one from another namespace.

$ ls -l /proc/1/fd/1
lrwx------ 1 root root 64 Mar 19 23:54 /proc/1/fd/1 -> /dev/null
$ cat /proc/1/fdinfo/1
pos:    0
flags:  0100002
mnt_id: 16
$ cat /proc/1/mountinfo | grep ^16
16 32 0:4 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,size=1013356k,nr_inodes=253339,mode=755

Signed-off-by:
Andrey Vagin <avagin@openvz.org> Acked-by:
Pavel Emelyanov <xemul@parallels.com> Acked-by:
Cyrill Gorcunov <gorcunov@openvz.org> Cc: Rob Landley <rob@landley.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
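A small userspace sketch of how a checkpoint/restore tool could pick up the new mnt_id field for one of its own descriptors; the parsing is a plain scan of /proc/self/fdinfo, nothing kernel-specific is assumed beyond the field shown in the example above.

```c
#include <stdio.h>

/* Return the mnt_id reported for @fd, or -1 on error. */
static int fd_mnt_id(int fd)
{
	char path[64], line[256];
	int mnt_id = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "mnt_id: %d", &mnt_id) == 1)
			break;
	}
	fclose(f);
	return mnt_id;
}

int main(void)
{
	printf("stdin comes from mount id %d\n", fd_mnt_id(0));
	return 0;
}
```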
-
Luiz Capitulino authored
It should read "reclaimable slab" and not "reclaimable swap". Signed-off-by:
Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by:
Rik van Riel <riel@redhat.com> Acked-by:
Rafael Aquini <aquini@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Davidlohr Bueso authored
This patch is a continuation of efforts trying to optimize find_vma(), avoiding potentially expensive rbtree walks to locate a vma upon faults. The original approach (https://lkml.org/lkml/2013/11/1/410 ), where the largest vma was also cached, ended up being too specific and random, so further comparison with other approaches was needed. There are two things to consider when dealing with this: the cache hit rate and the latency of find_vma(). Improving the hit rate does not necessarily translate into finding the vma any faster, as the overhead of any fancy caching scheme can be too high to consider. We currently cache the last used vma for the whole address space, which provides a nice optimization, reducing the total cycles in find_vma() by up to 250% for workloads with good locality. On the other hand, this simple scheme is pretty much useless for workloads with poor locality. Analyzing ebizzy runs shows that, no matter how many threads are running, the mmap_cache hit rate is less than 2%, and in many situations below 1%. The proposed approach is to replace this scheme with a small per-thread cache, maximizing hit rates at a very low maintenance cost. Invalidations are performed by simply bumping up a 32-bit sequence number. The only expensive operation is in the rare case of a seq number overflow, where all caches that share the same address space are flushed. Upon a miss, the proposed replacement policy is based on the page number that contains the virtual address in question. Concretely, the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread scheme does improve ~50% hit rate by just adding a few more slots to the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   |  9.31            |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this approach always shows nearly perfect hit rates, while the baseline is just about non-existent. The cycle counts can fluctuate anywhere from ~60 to ~116 for the baseline scheme, but this approach reduces them considerably. For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by:
Davidlohr Bueso <davidlohr@hp.com> Reviewed-by:
Rik van Riel <riel@redhat.com> Acked-by:
Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by:
Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Tested-by:
Hugh Dickins <hughd@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
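A hedged, userspace-style model of the per-thread cache described above. The slot count, hash, and structure names are illustrative rather than the kernel's exact definitions, but the sequence-number invalidation and the page-number-indexed replacement policy follow the description in the entry.

```c
#include <stdint.h>
#include <stddef.h>

#define EX_PAGE_SHIFT     12
#define EX_VMACACHE_SIZE  4	/* must be a power of two */

struct ex_vma { uintptr_t start, end; };

struct ex_mm { uint32_t seqnum; };	/* per address space */

struct ex_vmacache {			/* per thread */
	uint32_t seqnum;		/* mm->seqnum when last refreshed */
	struct ex_vma *slots[EX_VMACACHE_SIZE];
};

/* Any mapping change invalidates every thread's cache with one store. */
static void ex_invalidate(struct ex_mm *mm)
{
	mm->seqnum++;	/* a 32-bit wrap would require flushing all caches */
}

static unsigned int ex_slot(uintptr_t addr)
{
	return (addr >> EX_PAGE_SHIFT) & (EX_VMACACHE_SIZE - 1);
}

static struct ex_vma *ex_find(struct ex_vmacache *vc, struct ex_mm *mm,
			      uintptr_t addr)
{
	struct ex_vma *v;

	if (vc->seqnum != mm->seqnum)	/* stale: fall back to the rbtree */
		return NULL;
	v = vc->slots[ex_slot(addr)];
	if (v && v->start <= addr && addr < v->end)
		return v;
	return NULL;
}

/* Called after the slow lookup found @v for @addr. */
static void ex_update(struct ex_vmacache *vc, struct ex_mm *mm,
		      uintptr_t addr, struct ex_vma *v)
{
	if (vc->seqnum != mm->seqnum) {
		/* revalidating: drop every stale slot first */
		for (size_t i = 0; i < EX_VMACACHE_SIZE; i++)
			vc->slots[i] = NULL;
		vc->seqnum = mm->seqnum;
	}
	vc->slots[ex_slot(addr)] = v;
}
```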
-
Kirill A. Shutemov authored
filemap_map_pages() is a generic implementation of ->map_pages() for filesystems that use the page cache. It should be safe to use filemap_map_pages() for ->map_pages() if the filesystem uses filemap_fault() for ->fault(). Signed-off-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by:
Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
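A minimal sketch of what that rule looks like from a filesystem's point of view, assuming a hypothetical "examplefs" whose mmap path already relies on filemap_fault(); the helper names are illustrative, the two callbacks are the real generic implementations.

```c
#include <linux/fs.h>
#include <linux/mm.h>

static const struct vm_operations_struct examplefs_file_vm_ops = {
	.fault		= filemap_fault,	/* already page-cache based */
	.map_pages	= filemap_map_pages,	/* so the generic map_pages fits */
};

static int examplefs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	file_accessed(file);
	vma->vm_ops = &examplefs_file_vm_ops;
	return 0;
}
```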
-
Alex Thorlton authored
load_elf_binary() sets current->mm->def_flags = def_flags and def_flags is always zero. Not only does this look strange, it is unnecessary because mm_init() has already set ->def_flags = 0. Signed-off-by:
Alex Thorlton <athorlton@sgi.com> Suggested-by:
Oleg Nesterov <oleg@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by:
Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
- Convert spinlock/static array to va_format (inspired by Joe Perches' help on previous logging patches).
- Convert printk(KERN_ERR to pr_warn in __ntfs_warning.
- Convert printk(KERN_ERR to pr_err in __ntfs_error.
- Convert printk(KERN_DEBUG to pr_debug in __ntfs_debug. (Note that __ntfs_debug is still guarded by #if DEBUG.)
- Improve !DEBUG to parse all arguments (Joe Perches).
- Sparse pr_foo() conversions in super.c.
The NTFS and NTFS-fs prefixes as well as 'warning' and 'error' were removed: pr_foo() automatically adds the module name, and the error level is already specified. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Joe Perches <joe@perches.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
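A hedged sketch of the struct va_format / %pV pattern this converts to; the __example_warning() helper is illustrative, not the NTFS code, but %pV and struct va_format are the real printk facilities that make the static buffer and its spinlock unnecessary.

```c
#include <linux/printk.h>
#include <stdarg.h>

static void __example_warning(const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;
	/* one printk call, formatted in one piece, no static buffer needed */
	pr_warn("%pV\n", &vaf);
	va_end(args);
}
```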
-
- Apr 05, 2014
-
-
Jeff Layton authored
There is no guarantee that the strings in the nfs_cache_array will be NUL-terminated. In the event that we end up hitting a readdir loop, we need to ensure that we pass the length of the string to the warning message. Reported-by:
Lachlan McIlroy <lmcilroy@redhat.com> Signed-off-by:
Jeff Layton <jlayton@redhat.com> Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
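A tiny userspace illustration of the idea behind the fix: a possibly non-NUL-terminated buffer is printed safely by passing its length with a "%.*s" conversion instead of relying on a terminator.

```c
#include <stdio.h>

int main(void)
{
	/* name[] is deliberately NOT NUL-terminated */
	char name[4] = { 'f', 'i', 'l', 'e' };
	unsigned int len = sizeof(name);

	/* "%.*s" stops after len bytes even without a terminator */
	printf("entry: %.*s\n", (int)len, name);
	return 0;
}
```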
-
- Apr 04, 2014
-
-
Fabian Frederick authored
init_inodecache is only called by __init init_reiserfs_fs. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Acked-by:
Jeff Mahoney <jeffm@suse.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Rashika Kheria authored
Move the prototype declaration to the header file reiserfs/reiserfs.h from reiserfs/super.c because it is used by more than one file. This eliminates the following warning in reiserfs/bitmap.c: fs/reiserfs/bitmap.c:647:6: warning: no previous prototype for `show_alloc_options' [-Wmissing-prototypes] Signed-off-by:
Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by:
Josh Triplett <josh@joshtriplett.org> Acked-by:
Jeff Mahoney <jeffm@suse.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
hfsplus_create_attr_tree_cache is only called by __init init_hfsplus_fs. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Reviewed-by:
Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Sougata Santra authored
Concurrent access to alloc_blocks in struct hfsplus_inode_info is protected by the extents_lock mutex. This patch fixes two instances where alloc_blocks modification was not protected with this lock. This fixes possible allocation bitmap corruption in race conditions while extending and truncating files. [akpm@linux-foundation.org: take extents_lock before taking a copy of ->alloc_blocks] [akpm@linux-foundation.org: remove now-unused label `out'] Signed-off-by:
Sougata Santra <sougata@tuxera.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Sougata Santra authored
The variable is defined but not used. Generally it is compiled away with -O2 optimization, hence it does not produce a warning. Signed-off-by:
Sougata Santra <sougata@tuxera.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Ryusuke Konishi authored
Add code to check sizes of on-disk data of metadata files such as inode size, segment usage size, DAT entry size, and checkpoint size. Although these sizes are read from disk, the current implementation doesn't check them. If these sizes are not sane on disk, it can cause out-of-range access to metadata or memory access overrun on metadata block buffers due to overflow in sundry calculations. Both lower limit and upper limit of metadata sizes are verified to prevent these issues. Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andreas Rohner authored
Add support for the FITRIM ioctl, which enables user space tools to issue TRIM/DISCARD requests to the underlying device. Every clean segment within the specified range will be discarded. Signed-off-by:
Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
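A hedged userspace sketch of issuing FITRIM against a mounted filesystem such as the one above; the mount point path is illustrative, while struct fstrim_range and the FITRIM ioctl number come from <linux/fs.h>.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	struct fstrim_range range;
	int fd = open("/mnt/nilfs", O_RDONLY);	/* hypothetical mount point */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&range, 0, sizeof(range));
	range.len = (unsigned long long)-1;	/* cover the whole filesystem */
	range.minlen = 0;			/* let the fs pick its minimum */
	if (ioctl(fd, FITRIM, &range) < 0)
		perror("FITRIM");
	else
		printf("trimmed %llu bytes\n",
		       (unsigned long long)range.len);
	close(fd);
	return 0;
}
```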
-
Andreas Rohner authored
Add nilfs_sufile_trim_fs(), which takes an fstrim_range structure and calls blkdev_issue_discard for every clean segment in the specified range. The range is truncated to file system block boundaries. Signed-off-by:
Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andreas Rohner authored
With this ioctl the segment usage entries in the SUFILE can be updated from userspace. This is useful, because it allows the userspace GC to modify and update segment usage entries for specific segments, which enables it to avoid unnecessary write operations. If a segment needs to be cleaned, but there is no or very little reclaimable space in it, the cleaning operation basically degrades to a useless moving operation. In the end the only thing that changes is the location of the data and a timestamp in the segment usage information. With this ioctl the GC can skip the cleaning and update the segment usage entries directly instead. This is basically a shortcut to cleaning the segment. It is still necessary to read the segment summary information, but the writing of the live blocks can be skipped if it's not worth it. [konishi.ryusuke@lab.ntt.co.jp: add description of NILFS_IOCTL_SET_SUINFO ioctl] Signed-off-by:
Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andreas Rohner authored
Introduce nilfs_sufile_set_suinfo(), which expects an array of nilfs_suinfo_update structures and updates the segment usage information accordingly. This is basically a helper function for the newly introduced NILFS_IOCTL_SET_SUINFO ioctl. [konishi.ryusuke@lab.ntt.co.jp: use put_bh() instead of brelse() because we know bh != NULL] Signed-off-by:
Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
init_inodecache is only called by __init init_coda. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
Summary:
- all printk(KERN_foo converted to pr_foo()
- add pr_fmt and remove redundant prefixes
- convert befs_() to va_format (based on a patch by Joe Perches)
- remove non-standard %Lu
- use __func__ for all debugging
[akpm@linux-foundation.org: fix printk warnings, reported by Fengguang] Signed-off-by:
Fabian Frederick <fabf@skynet.be> Cc: Joe Perches <joe@perches.com> Cc: Fengguang Wu <fengguang.wu@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
init_inodecache is only called by __init init_befs_fs. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
Use kzalloc for clean fs_info allocation like other filesystems. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
init_inodecache is only called by __init init_minix_fs. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Luis Henriques authored
A missing 'break' statement in bm_status_write() results in a user program receiving '3' when doing the following: write(fd, "-1", 2); Signed-off-by:
Luis Henriques <luis.henriques@canonical.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
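A self-contained illustration of the bug class rather than the binfmt_misc code itself (the return values here are made up): without the break, the first case falls through and the caller observes the next case's result.

```c
#include <stdio.h>

/* Correct version: each case ends before the next one starts. */
static int parse_fixed(int val)
{
	switch (val) {
	case -1:
		return 2;	/* explicit return, no fallthrough possible */
	case 1:
		return 3;
	}
	return 0;
}

/* Buggy version: the missing break lets case -1 fall through. */
static int parse_buggy(int val)
{
	int res = 0;

	switch (val) {
	case -1:
		res = 2;
		/* missing break: execution continues into the next case */
	case 1:
		res = 3;
		break;
	}
	return res;
}

int main(void)
{
	printf("fixed: %d\n", parse_fixed(-1));	/* prints 2 */
	printf("buggy: %d\n", parse_buggy(-1));	/* prints 3 */
	return 0;
}
```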
-
Fabian Frederick authored
init_inodecache is only called by __init init_efs_fs. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Josh Triplett authored
uselib hasn't been used since libc5; glibc does not use it. Support turning it off. When disabled, also omit the load_elf_library implementation from binfmt_elf.c, which only uselib invokes. bloat-o-meter:

add/remove: 0/4 grow/shrink: 0/1 up/down: 0/-785 (-785)
function            old  new  delta
padzero              39   36     -3
uselib_flags         20    -    -20
sys_uselib          168    -   -168
SyS_uselib          168    -   -168
load_elf_library    426    -   -426

The new CONFIG_USELIB defaults to `y'. Signed-off-by:
Josh Triplett <josh@joshtriplett.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Wang YanQing authored
After commit 6307f8fe ("security: remove dead hook task_setgroups"), set_groups will always return zero, so we can just remove the return value of set_groups. This patch reduces code size and simplifies code that uses set_groups, because we don't need to check its return value any more. [akpm@linux-foundation.org: remove obsolete claims from set_groups() comment] Signed-off-by:
Wang YanQing <udknight@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Fabian Frederick authored
sys_sysfs is an obsolete system call no longer supported by libc.
- This patch adds a default CONFIG_SYSFS_SYSCALL=y
- The option can be turned off in expert mode.
- cond_syscall added to kernel/sys_ni.c
[akpm@linux-foundation.org: tweak Kconfig help text] Signed-off-by:
Fabian Frederick <fabf@skynet.be> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Dave Hansen authored
There is plenty of anecdotal evidence and a load of blog posts suggesting that using "drop_caches" periodically keeps your system running in "tip top shape". Perhaps adding some kernel documentation will increase the amount of accurate data on its use. If we are not shrinking caches effectively, then we have real bugs. Using drop_caches will simply mask the bugs and make them harder to find, but certainly does not fix them, nor is it an appropriate "workaround" to limit the size of the caches. On the contrary, there have been bug reports on issues that turned out to be misguided use of cache dropping. Dropping caches is a very drastic and disruptive operation that is good for debugging and running tests, but if it creates bug reports from production use, kernel developers should be aware of its use. Add a bit more documentation about it, a syslog message to track down abusers, and vmstat drop counters to help analyze problem reports. [akpm@linux-foundation.org: checkpatch fixes] [hannes@cmpxchg.org: add runtime suppression control] Signed-off-by:
Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by:
Michal Hocko <mhocko@suse.cz> Acked-by:
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by:
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
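A minimal sketch of how the interface is typically poked for debugging (requires root); the file and the meaning of "1", "2", and "3" are the documented /proc/sys/vm/drop_caches semantics, and since this commit such writes are also logged unless suppressed.

```c
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/drop_caches", "w");

	if (!f) {
		perror("drop_caches");
		return 1;
	}
	/* "1" drops the page cache, "2" reclaimable slab, "3" both */
	fputs("3\n", f);
	fclose(f);
	return 0;
}
```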
-
Sasha Levin authored
This patch removes read_cache_page_async(), which wasn't really needed anywhere, and simplifies the code around it a bit. read_cache_page_async() is useful when we want to read a page into the cache without waiting for it to complete. This happens when the appropriate callback 'filler' doesn't complete its read operation and releases the page lock immediately, and instead queues a different completion routine to do that. This never actually happened anywhere in the code. read_cache_page_async() had 3 different callers:
- read_cache_page(), which is the sync version; it would just wait for the requested read to complete using wait_on_page_read().
- JFFS2 would call it from jffs2_gc_fetch_page(), but the filler function it supplied doesn't do any async reads and would complete before the filler function returns - making it actually a sync read.
- CRAMFS would call it using the read_mapping_page_async() wrapper, with a similar story to JFFS2 - the filler function doesn't do anything that resembles async reads and would always complete before the filler function returns.
To sum it up, the code in mm/filemap.c never took advantage of having read_cache_page_async(). While there are filler callbacks that do async reads (such as the block one), we always called them through read_cache_page(). This patch adds a mandatory wait for the read to complete when adding a new page to the cache, and removes read_cache_page_async() and its wrappers. Signed-off-by:
Sasha Levin <sasha.levin@oracle.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Reviewed-by:
Rik van Riel <riel@redhat.com> Reviewed-by:
Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Reviewed-by:
Rik van Riel <riel@redhat.com> Reviewed-by:
Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
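A hedged sketch of the filtering the common lookup functions keep doing, written in the style of that era's page cache code. The function name and the missing locking/refcount handling are illustrative; radix_tree_exceptional_entry() is the real test for non-page (shadow/swap) entries.

```c
#include <linux/fs.h>
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

/*
 * Return a real page at @index, or NULL for both holes and exceptional
 * entries -- mirroring how the filtered lookup API hides non-page
 * entries from ordinary callers.
 */
static struct page *example_lookup_page(struct address_space *mapping,
					pgoff_t index)
{
	void *entry;

	rcu_read_lock();
	entry = radix_tree_lookup(&mapping->page_tree, index);
	rcu_read_unlock();

	if (!entry || radix_tree_exceptional_entry(entry))
		return NULL;	/* hole, or a shadow/swap entry */
	return entry;
}
```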
-
Johannes Weiner authored
The radix tree hole searching code is only used for page cache, for example the readahead code trying to get a picture of the area surrounding a fault. It sufficed to rely on the radix tree definition of holes, which is "empty tree slot". But this is about to change, as shadow page descriptors will be stored in the page cache after the actual pages get evicted from memory. Move the functions over to mm/filemap.c and make them native page cache operations, where they can later be adapted to handle the new definition of "page cache hole". Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Reviewed-by:
Rik van Riel <riel@redhat.com> Reviewed-by:
Minchan Kim <minchan@kernel.org> Acked-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
This code used to have its own lru cache pagevec up until a0b8cab3 ("mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API"). Now it's just add_to_page_cache() followed by lru_cache_add(), might as well use add_to_page_cache_lru() directly. Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Reviewed-by:
Rik van Riel <riel@redhat.com> Reviewed-by:
Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Howells <dhowells@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
Currently, to track reserved and allocated regions, we use two different ways, depending on the mapping. For MAP_SHARED, we use the address_mapping's private_list, while for MAP_PRIVATE, we use a resv_map. Now, we are preparing to change a coarse-grained lock which protects a region structure to a fine-grained lock, and this difference hinders it. So, before changing it, unify region structure handling, consistently using a resv_map regardless of the kind of mapping. Signed-off-by:
Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by:
Davidlohr Bueso <davidlohr@hp.com> Reviewed-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by:
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Dan Carpenter authored
We know that "ret > 0" is true here. These tests were left over from commit 02afc27f ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-