Skip to content
Snippets Groups Projects
  1. Apr 11, 2014
    • David S. Miller's avatar
      net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller authored
      
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->s_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But at the moment we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access is potentially
      to freed up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback actually uses
      the length argument.  And since nobody actually cared about it's
      value, lots of call sites pass arbitrary values in such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      676d2369
  2. Apr 08, 2014
    • Heiko Carstens's avatar
    • Mark Salter's avatar
      mm: create generic early_ioremap() support · 9e5c33d7
      Mark Salter authored
      
      This patch creates a generic implementation of early_ioremap() support
      based on the existing x86 implementation.  early_ioremp() is useful for
      early boot code which needs to temporarily map I/O or memory regions
      before normal mapping functions such as ioremap() are available.
      
      Some architectures have optional MMU.  In the no-MMU case, the remap
      functions simply return the passed in physical address and the unmap
      functions do nothing.
      
      Signed-off-by: default avatarMark Salter <msalter@redhat.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9e5c33d7
    • Josh Triplett's avatar
      lglock: map to spinlock when !CONFIG_SMP · 64b47e8f
      Josh Triplett authored
      
      When the system has only one CPU, lglock is effectively a spinlock; map
      it directly to spinlock to eliminate the indirection and duplicate code.
      
      In addition to removing overhead, this drops 1.6k of code with a
      defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config.
      
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      64b47e8f
    • Christoph Lameter's avatar
      percpu: add preemption checks to __this_cpu ops · 188a8140
      Christoph Lameter authored
      
      We define a check function in order to avoid trouble with the include
      files.  Then the higher level __this_cpu macros are modified to invoke
      the preemption check.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Tested-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      188a8140
    • Christoph Lameter's avatar
      vmstat: use raw_cpu_ops to avoid false positives on preemption checks · 293b6a4c
      Christoph Lameter authored
      
      vm counters are allowed to be racy.  Use raw_cpu_ops to avoid the
      local_irq_disable overhead and to avoid preemption checks which will be
      added to the __this_cpu operations.
      
      [akpm@linux-foundation.org: Add comment.  Again.]
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Reported-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      293b6a4c
    • Christoph Lameter's avatar
      mm: use raw_cpu ops for determining current NUMA node · dc322a99
      Christoph Lameter authored
      With the preempt checking logic for __this_cpu_ops we will get false
      positives from locations in the code that use numa_node_id.
      
      Before the __this_cpu ops where introduced there were no checks for
      preemption present either.  smp_raw_processor_id() was used.  See
      
        http://www.spinics.net/lists/linux-numa/msg00641.html
      
      
      
      Therefore we need to use raw_cpu_read here to avoid false postives.
      
      Note that this issue has been discussed in prior years.  If the process
      changes nodes after retrieving the current numa node then that is
      acceptable since most uses of numa_node etc are for optimization and not
      for correctness.
      
      There were suggestions to implement a raw_numa_node_id in order to do
      preempt checks for numa_node_id as well.  But I think we better defer
      that to another patch since that would mean investigating how
      numa_node_id() is used throughout the kernel which would increase the
      scope of this patchset significantly.  After all preemption was never
      checked before when numa_node_id() was used.
      
      Some sample traces:
      
      __this_cpu_read operation in preemptible [00000000] code: login/1456
      caller is __this_cpu_preempt_check+0x2b/0x2d
      CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3-dirty #185
      Call Trace:
        dump_stack+0x4e/0x82
        check_preemption_disabled+0xc5/0xe0
        __this_cpu_preempt_check+0x2b/0x2d
        get_task_policy+0x1d/0x49
        get_vma_policy+0x14/0x76
        alloc_pages_vma+0x35/0xff
        handle_mm_fault+0x290/0x73b
        __do_page_fault+0x3fe/0x44d
        do_page_fault+0x9/0xc
        page_fault+0x22/0x30
        generic_file_aio_read+0x38e/0x624
        do_sync_read+0x54/0x73
        vfs_read+0x9d/0x12a
        SyS_read+0x47/0x7e
        cstar_dispatch+0x7/0x23
      
      caller is __this_cpu_preempt_check+0x2b/0x2d
      CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3-dirty #185
      Call Trace:
        dump_stack+0x4e/0x82
        check_preemption_disabled+0xc5/0xe0
        __this_cpu_preempt_check+0x2b/0x2d
        alloc_pages_current+0x8f/0xbc
        __page_cache_alloc+0xb/0xd
        __do_page_cache_readahead+0xf4/0x219
        ra_submit+0x1c/0x20
        ondemand_readahead+0x28c/0x2b4
        page_cache_sync_readahead+0x38/0x3a
        generic_file_aio_read+0x261/0x624
        do_sync_read+0x54/0x73
        vfs_read+0x9d/0x12a
        SyS_read+0x47/0x7e
        cstar_dispatch+0x7/0x23
      
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc322a99
    • Christoph Lameter's avatar
      percpu: add raw_cpu_ops · b3ca1c10
      Christoph Lameter authored
      
      The kernel has never been audited to ensure that this_cpu operations are
      consistently used throughout the kernel.  The code generated in many
      places can be improved through the use of this_cpu operations (which
      uses a segment register for relocation of per cpu offsets instead of
      performing address calculations).
      
      The patch set also addresses various consistency issues in general with
      the per cpu macros.
      
      A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
         because checks are skipped. This is typically shown through a raw_
         prefix. So this patch set changes the places where __this_cpu_ptr()
         is used to raw_cpu_ptr().
      
      B. There has been the long term wish by some that __this_cpu operations
         would check for preemption. However, there are cases where preemption
         checks need to be skipped. This patch set adds raw_cpu operations that
         do not check for preemption and then adds preemption checks to the
         __this_cpu operations.
      
      C. The use of __get_cpu_var is always a reference to a percpu variable
         that can also be handled via a this_cpu operation. This patch set
         replaces all uses of __get_cpu_var with this_cpu operations.
      
      D. We can then use this_cpu RMW operations in various places replacing
         sequences of instructions by a single one.
      
      E. The use of this_cpu operations throughout will allow other arches than
         x86 to implement optimized references and RMV operations to work with
         per cpu local data.
      
      F. The use of this_cpu operations opens up the possibility to
         further optimize code that relies on synchronization through
         per cpu data.
      
      The patch set works in a couple of stages:
      
      I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
          Also converts the existing __this_cpu_xx_# primitive in the x86
          code to raw_cpu_xx_#.
      
      II. Patch 2-4 use the raw_cpu operations in places that would give
           us false positives once they are enabled.
      
      III. Patch 5 adds preemption checks to __this_cpu operations to allow
          checking if preemption is properly disabled when these functions
          are used.
      
      IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
         with this_cpu_ptr. They do not depend on any changes to the percpu
         code. No preemption tests are skipped if they are applied.
      
      V. Patches 21-46 are conversion patches that use this_cpu operations
         in various kernel subsystems/drivers or arch code.
      
      VI.  Patches 47/48 (not included in this series) remove no longer used
          functions (__this_cpu_ptr and __get_cpu_var).  These should only be
          applied after all the conversion patches have made it and after we
          have done additional passes through the kernel to ensure that none of
          the uses of these functions remain.
      
      This patch (of 46):
      
      The patches following this one will add preemption checks to __this_cpu
      ops so we need to have an alternative way to use this_cpu operations
      without preemption checks.
      
      raw_cpu_ops will be the basis for all other ops since these will be the
      operations that do not implement any checks.
      
      Primitive operations are renamed by this patch from __this_cpu_xxx to
      raw_cpu_xxxx.
      
      Also change the uses of the x86 percpu primitives in preempt.h.
      These depend directly on asm/percpu.h (header #include nesting issue).
      
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bryan Wu <cooloney@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wim Van Sebroeck <wim@iguana.be>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3ca1c10
    • Vladimir Davydov's avatar
      slub: rework sysfs layout for memcg caches · 9a41707b
      Vladimir Davydov authored
      
      Currently, we try to arrange sysfs entries for memcg caches in the same
      manner as for global caches.  Apart from turning /sys/kernel/slab into a
      mess when there are a lot of kmem-active memcgs created, it actually
      does not work properly - we won't create more than one link to a memcg
      cache in case its parent is merged with another cache.  For instance, if
      A is a root cache merged with another root cache B, we will have the
      following sysfs setup:
      
        X
        A -> X
        B -> X
      
      where X is some unique id (see create_unique_id()).  Now if memcgs M and
      N start to allocate from cache A (or B, which is the same), we will get:
      
        X
        X:M
        X:N
        A -> X
        B -> X
        A:M -> X:M
        A:N -> X:N
      
      Since B is an alias for A, we won't get entries B:M and B:N, which is
      confusing.
      
      It is more logical to have entries for memcg caches under the
      corresponding root cache's sysfs directory.  This would allow us to keep
      sysfs layout clean, and avoid such inconsistencies like one described
      above.
      
      This patch does the trick.  It creates a "cgroup" kset in each root
      cache kobject to keep its children caches there.
      
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a41707b
    • Vladimir Davydov's avatar
      memcg, slab: do not destroy children caches if parent has aliases · b8529907
      Vladimir Davydov authored
      
      Currently we destroy children caches at the very beginning of
      kmem_cache_destroy().  This is wrong, because the root cache will not
      necessarily be destroyed in the end - if it has aliases (refcount > 0),
      kmem_cache_destroy() will simply decrement its refcount and return.  In
      this case, at best we will get a bunch of warnings in dmesg, like this
      one:
      
        kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
        CPU: 1 PID: 7139 Comm: modprobe Tainted: G    B   W    3.13.0+ #117
        Call Trace:
          dump_stack+0x49/0x5b
          kmem_cache_destroy+0xdf/0xf0
          kmem_cache_destroy_memcg_children+0x97/0xc0
          kmem_cache_destroy+0xf/0xf0
          xfs_mru_cache_uninit+0x21/0x30 [xfs]
          exit_xfs_fs+0x2e/0xc44 [xfs]
          SyS_delete_module+0x198/0x1f0
          system_call_fastpath+0x16/0x1b
      
      At worst - if kmem_cache_destroy() will race with an allocation from a
      memcg cache - the kernel will panic.
      
      This patch fixes this by moving children caches destruction after the
      check if the cache has aliases.  Plus, it forbids destroying a root
      cache if it still has children caches, because each children cache keeps
      a reference to its parent.
      
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8529907
    • Vladimir Davydov's avatar
      memcg, slab: separate memcg vs root cache creation paths · 794b1248
      Vladimir Davydov authored
      
      Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
      memcg-only and except-for-memcg calls.  To clean this up, let's move the
      code responsible for memcg cache creation to a separate function.
      
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      794b1248
    • Vladimir Davydov's avatar
      memcg, slab: cleanup memcg cache creation · 5722d094
      Vladimir Davydov authored
      
      This patch cleans up the memcg cache creation path as follows:
      
      - Move memcg cache name creation to a separate function to be called
        from kmem_cache_create_memcg().  This allows us to get rid of the mutex
        protecting the temporary buffer used for the name formatting, because
        the whole cache creation path is protected by the slab_mutex.
      
      - Get rid of memcg_create_kmem_cache().  This function serves as a proxy
        to kmem_cache_create_memcg().  After separating the cache name creation
        path, it would be reduced to a function call, so let's inline it.
      
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5722d094
    • Uwe Kleine-König's avatar
      Kconfig: rename HAS_IOPORT to HAS_IOPORT_MAP · ce816fa8
      Uwe Kleine-König authored
      
      If the renamed symbol is defined lib/iomap.c implements ioport_map and
      ioport_unmap and currently (nearly) all platforms define the port
      accessor functions outb/inb and friend unconditionally.  So
      HAS_IOPORT_MAP is the better name for this.
      
      Consequently NO_IOPORT is renamed to NO_IOPORT_MAP.
      
      The motivation for this change is to reintroduce a symbol HAS_IOPORT
      that signals if outb/int et al are available.  I will address that at
      least one merge window later though to keep surprises to a minimum and
      catch new introductions of (HAS|NO)_IOPORT.
      
      The changes in this commit were done using:
      
      	$ git grep -l -E '(NO|HAS)_IOPORT' | xargs perl -p -i -e 's/\b((?:CONFIG_)?(?:NO|HAS)_IOPORT)\b/$1_MAP/'
      
      Signed-off-by: default avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce816fa8
    • Josh Triplett's avatar
      bug: Make BUG() always stop the machine · a4b5d580
      Josh Triplett authored
      
      When !CONFIG_BUG and !HAVE_ARCH_BUG, define the generic BUG() as an
      infinite loop rather than a no-op.  This avoids undefined behavior if
      execution ever actually reaches BUG(), and avoids warnings about code
      after BUG() (such as on non-void functions calling BUG() and then not
      returning).
      
      bloat-o-meter results:
      
        add/remove: 0/0 grow/shrink: 43/10 up/down: 235/-98 (137)
        function                             old     new   delta
        umount_collect                       119     138     +19
        notify_change                        306     324     +18
        xstate_enable_boot_cpu               252     269     +17
        kunmap                                54      70     +16
        balloon_page_dequeue                 112     126     +14
        mm_take_all_locks                    223     233     +10
        list_lru_walk_node                   143     152      +9
        vma_adjust                          1059    1067      +8
        pcpu_setup_first_chunk              1130    1138      +8
        mm_drop_all_locks                    143     151      +8
        ns_capable                            55      62      +7
        anon_transport_class_unregister        8      15      +7
        srcu_init_notifier_head               35      41      +6
        shrink_dcache_for_umount             174     180      +6
        kunmap_high                           99     105      +6
        end_page_writeback                    43      49      +6
        do_exit                             1339    1345      +6
        __kfifo_dma_out_prepare_r             86      92      +6
        __kfifo_dma_in_prepare_r              90      96      +6
        fixup_user_fault                     120     125      +5
        repair_env_string                     73      77      +4
        read_cache_pages_invalidate_page      56      60      +4
        isolate_lru_pages.isra               142     146      +4
        do_notify_parent_cldstop             255     259      +4
        cpu_init                             370     374      +4
        utimes_common                        270     272      +2
        tasklet_hi_action                     91      93      +2
        tasklet_action                        91      93      +2
        set_pte_vaddr                         46      48      +2
        find_get_pages_tag                   202     204      +2
        early_iounmap                        185     187      +2
        __native_set_fixmap                   36      38      +2
        __get_user_pages                     822     824      +2
        __early_ioremap                      299     301      +2
        yield_task_stop                        1       2      +1
        tick_resume                           37      38      +1
        switched_to_stop                       1       2      +1
        switched_to_idle                       1       2      +1
        prio_changed_stop                      1       2      +1
        prio_changed_idle                      1       2      +1
        pm_qos_power_read                    111     112      +1
        arch_cpu_idle_dead                     1       2      +1
        __insert_vmap_area                   140     141      +1
        sys_renameat                         614     612      -2
        mm_fault_error                       297     295      -2
        SyS_renameat                         614     612      -2
        sys_linkat                           416     413      -3
        SyS_linkat                           416     413      -3
        chmod_common                         129     122      -7
        proc_cap_handler                     240     225     -15
        __schedule                           849     831     -18
        sys_madvise                         1077    1054     -23
        SyS_madvise                         1077    1054     -23
      
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Reported-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4b5d580
    • Josh Triplett's avatar
      bug: when !CONFIG_BUG, make WARN call no_printk to check format and args · 4e50ebde
      Josh Triplett authored
      
      The stub version of WARN for !CONFIG_BUG completely ignored its format
      string and subsequent arguments; make it check them instead, using
      no_printk.
      
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Reported-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e50ebde
    • Josh Triplett's avatar
    • Josh Triplett's avatar
      bug: when !CONFIG_BUG, simplify WARN_ON_ONCE and family · b607e70e
      Josh Triplett authored
      
      When !CONFIG_BUG, WARN_ON and family become simple passthroughs of their
      condition argument; however, WARN_ON_ONCE and family still have conditions
      and a boolean to detect one-time invocation, even though the warning
      they'd emit doesn't exist.  Make the existing definitions conditional on
      CONFIG_BUG, and add definitions for !CONFIG_BUG that map to the
      passthrough versions of WARN and WARN_ON.
      
      This saves 4.4k on a minimized configuration (smaller than allnoconfig),
      and 20.6k with defconfig plus CONFIG_BUG=n.
      
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b607e70e
    • Alexandre Bounine's avatar
      rapidio: rework device hierarchy and introduce mport class of devices · 2aaf308b
      Alexandre Bounine authored
      
      This patch removes an artificial RapidIO bus root device and establishes
      actual device hierarchy by providing reference to real parent devices.
      It also introduces device class for RapidIO controller devices (on-chip
      or an eternal bridge, known as "mport").
      
      Existing implementation was sufficient for SoC-based platforms that have
      a single RapidIO controller.  With introduction of devices using
      multiple RapidIO controllers and PCIe-to-RapidIO bridges the old scheme
      is very limiting or does not work at all.  The implemented changes allow
      to properly reference platform's local RapidIO mport devices and provide
      device details needed for upper layers.
      
      This change to RapidIO device hierarchy does not break any known
      existing kernel or user space interfaces.
      
      Signed-off-by: default avatarAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
      Cc: Jerry Jacobs <jerry.jacobs@prodrive-technologies.com>
      Cc: Arno Tiemersma <arno.tiemersma@prodrive-technologies.com>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2aaf308b
    • Stephen Hemminger's avatar
      idr: remove dead code · 90ae3ae5
      Stephen Hemminger authored
      
      Remove no longer used deprecated code, and make local functions
      static.
      
      Signed-off-by: default avatarStephen Hemminger <stephen@networkplumber.org>
      Acked-by: default avatarJean Delvare <jdelvare@suse.de>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: George Spelvin <linux@horizon.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90ae3ae5
    • Rashika Kheria's avatar
      include/linux/crash_dump.h: add vmcore_cleanup() prototype · 82e0703b
      Rashika Kheria authored
      
      Eliminate the following warning in proc/vmcore.c:
      
        fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]
      
      [akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL]
      Signed-off-by: default avatarRashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      82e0703b
    • Oleg Nesterov's avatar
      wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space · ad86622b
      Oleg Nesterov authored
      
      get_task_state() uses the most significant bit to report the state to
      user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
      can be noticed via /proc as Z -> X -> Z change.  Note that this was
      possible even before EXIT_TRACE was introduced.
      
      This is not really bad but imho it make sense to hide EXIT_TRACE from
      user-space completely.  So the patch simply swaps EXIT_ZOMBIE and
      EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.
      
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad86622b
    • Oleg Nesterov's avatar
      wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition · abd50b39
      Oleg Nesterov authored
      
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which were missed before: this transition
      can also race with reparent_leader() which doesn't reset >exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.  This was fixed by
      the previous commit, but it was the temporary hack.
      
      1. Add the new exit_state, EXIT_TRACE. It means that the task is the
         traced zombie, debugger is going to detach and notify its natural
         parent.
      
         This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
         can avoid the changes in proc/kgdb code, get_task_state() still
         reports "X (dead)" in this case.
      
         Note: with or without this change userspace can see Z -> X -> Z
         transition. Not really bad, but probably makes sense to fix.
      
      2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
         if we need to notify the ->real_parent.
      
      3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
         is always the final state we can safely ignore such a task.
      
      4. Change wait_consider_task() to check EXIT_TRACE separately and kill
         the racy and no longer needed ptrace_reparented() case.
      
         If ptrace == T an EXIT_TRACE thread should be simply ignored, the
         owner of this state is going to ptrace_unlink() this task. We can
         pretend that it was already removed from ->ptraced list.
      
         Otherwise we should skip this thread too but clear ->notask_error,
         we must be the natural parent and debugger is going to untrace and
         notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
         even if the task was already untraced.
      
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarJan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Tested-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      abd50b39
    • Oleg Nesterov's avatar
      exec: kill bprm->tcomm[], simplify the "basename" logic · 23aebe16
      Oleg Nesterov authored
      
      Starting from commit c4ad8f98 ("execve: use 'struct filename *' for
      executable name passing") bprm->filename can not go away after
      flush_old_exec(), so we do not need to save the binary name in
      bprm->tcomm[] added by 96e02d15 ("exec: fix use-after-free bug in
      setup_new_exec()").
      
      And there was never need for filename_to_taskname-like code, we can
      simply do set_task_comm(kbasename(filename).
      
      This patch has to change set_task_comm() and trace_task_rename() to
      accept "const char *", but I think this change is also good.
      
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      23aebe16
    • Srikar Dronamraju's avatar
      numa: use LAST_CPUPID_SHIFT to calculate LAST_CPUPID_MASK · 834a964a
      Srikar Dronamraju authored
      
      LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH.  However
      LAST_CPUPID_WIDTH itself can be 0.  (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is
      set).  In such a case LAST_CPUPID_MASK turns out to be 0.
      
      But with recent commit 1ae71d03: (mm: numa: bugfix for
      LAST_CPUPID_NOT_IN_PAGE_FLAGS) if LAST_CPUPID_MASK is 0,
      page_cpupid_xchg_last() and page_cpupid_reset_last() causes
      page->_last_cpupid to be set to 0.
      
      This causes performance regression. Its almost as if numa_balancing is
      off.
      
      Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
      LAST_CPUPID_WIDTH.
      
      Some performance numbers and perf stats with and without the fix.
      
      (3.14-rc6)
      ----------
      numa01
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
      
               12,27,462 cs                                                           [100.00%]
                2,41,957 migrations                                                   [100.00%]
             1,68,01,713 faults                                                       [100.00%]
          7,99,35,29,041 cache-misses
                  98,808 migrate:mm_migrate_pages                                     [100.00%]
      
          1407.690148814 seconds time elapsed
      
      numa02
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
      
                  63,065 cs                                                           [100.00%]
                  14,364 migrations                                                   [100.00%]
                2,08,118 faults                                                       [100.00%]
            25,32,59,404 cache-misses
                      12 migrate:mm_migrate_pages                                     [100.00%]
      
            63.840827219 seconds time elapsed
      
      (3.14-rc6 with fix)
      -------------------
      numa01
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
      
                9,68,911 cs                                                           [100.00%]
                1,01,414 migrations                                                   [100.00%]
               88,38,697 faults                                                       [100.00%]
          4,42,92,51,042 cache-misses
                4,25,060 migrate:mm_migrate_pages                                     [100.00%]
      
           685.965331189 seconds time elapsed
      
      numa02
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
      
                  17,543 cs                                                           [100.00%]
                   2,962 migrations                                                   [100.00%]
                1,17,843 faults                                                       [100.00%]
            11,80,61,644 cache-misses
                  12,358 migrate:mm_migrate_pages                                     [100.00%]
      
            20.380132343 seconds time elapsed
      
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      834a964a
    • Zhang Yanfei's avatar
      madvise: correct the comment of MADV_DODUMP flag · 85892f19
      Zhang Yanfei authored
      
      s/MADV_NODUMP/MADV_DONTDUMP/
      
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85892f19
    • Fabian Frederick's avatar
      mm/readahead.c: inline ra_submit · 29f175d1
      Fabian Frederick authored
      
      Commit f9acc8c7 ("readahead: sanify file_ra_state names") left
      ra_submit with a single function call.
      
      Move ra_submit to internal.h and inline it to save some stack.  Thanks
      to Andrew Morton for commenting different versions.
      
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29f175d1
    • Miklos Szeredi's avatar
      mm: remove unused arg of set_page_dirty_balance() · ed6d7c8e
      Miklos Szeredi authored
      
      There's only one caller of set_page_dirty_balance() and that will call it
      with page_mkwrite == 0.
      
      The page_mkwrite argument was unused since commit b827e496 "mm: close
      page_mkwrite races".
      
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed6d7c8e
    • Michal Hocko's avatar
      memcg: rename high level charging functions · d715ae08
      Michal Hocko authored
      
      mem_cgroup_newpage_charge is used only for charging anonymous memory so
      it is better to rename it to mem_cgroup_charge_anon.
      
      mem_cgroup_cache_charge is used for file backed memory so rename it to
      mem_cgroup_charge_file.
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d715ae08
    • Johannes Weiner's avatar
      memcg: get_mem_cgroup_from_mm() · df381975
      Johannes Weiner authored
      
      Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
      owner is exiting, just return root_mem_cgroup.  This makes sense for all
      callsites and gets rid of some of them having to fallback manually.
      
      [fengguang.wu@intel.com: fix warnings]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df381975
    • Kirill A. Shutemov's avatar
      mm: use 'const char *' insted of 'char *' for reason in dump_page() · d230dec1
      Kirill A. Shutemov authored
      
      I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
      warning:
      
        warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]
      
      Let's convert 'reason' to 'const char *' in dump_page() and friends: we
      shouldn't modify it anyway.
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d230dec1
    • David Rientjes's avatar
      res_counter: remove interface for locked charging and uncharging · 539a13b4
      David Rientjes authored
      
      The res_counter_{charge,uncharge}_locked() variants are not used in the
      kernel outside of the resource counter code itself, so remove the
      interface.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      539a13b4
    • David Rientjes's avatar
      mm, mempolicy: remove per-process flag · f0432d15
      David Rientjes authored
      
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There's no significant performance degradation to checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since this is considered unlikely().
      
      Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
      64GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource so we should free them up whenever
      possible and make them available.  We'll be using it shortly for memcg oom
      reserves.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0432d15
    • David Rientjes's avatar
      mm, mempolicy: rename slab_node for clarity · 2a389610
      David Rientjes authored
      
      slab_node() is actually a mempolicy function, so rename it to
      mempolicy_slab_node() to make it clearer that it used for processes with
      mempolicies.
      
      At the same time, cleanup its code by saving numa_mem_id() in a local
      variable (since we require a node with memory, not just any node) and
      remove an obsolete comment that assumes the mempolicy is actually passed
      into the function.
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a389610
    • Davidlohr Bueso's avatar
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso authored
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410
      
      ), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      615d6e87
    • Kirill A. Shutemov's avatar
      mm: implement ->map_pages for page cache · f1820361
      Kirill A. Shutemov authored
      
      filemap_map_pages() is generic implementation of ->map_pages() for
      filesystems who uses page cache.
      
      It should be safe to use filemap_map_pages() for ->map_pages() if
      filesystem use filemap_fault() for ->fault().
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1820361
    • Kirill A. Shutemov's avatar
      mm: introduce vm_ops->map_pages() · 8c6e50b0
      Kirill A. Shutemov authored
      
      Here's new version of faultaround patchset.  It took a while to tune it
      and collect performance data.
      
      First patch adds new callback ->map_pages to vm_operations_struct.
      
      ->map_pages() is called when VM asks to map easy accessible pages.
      Filesystem should find and map pages associated with offsets from
      "pgoff" till "max_pgoff".  ->map_pages() is called with page table
      locked and must not block.  If it's not possible to reach a page without
      blocking, filesystem should skip it.  Filesystem should use do_set_pte()
      to setup page table entry.  Pointer to entry associated with offset
      "pgoff" is passed in "pte" field in vm_fault structure.  Pointers to
      entries for other offsets should be calculated relative to "pte".
      
      Currently VM use ->map_pages only on read page fault path.  We try to
      map FAULT_AROUND_PAGES a time.  FAULT_AROUND_PAGES is 16 for now.
      Performance data for different FAULT_AROUND_ORDER is below.
      
      TODO:
       - implement ->map_pages() for shmem/tmpfs;
       - modify get_user_pages() to be able to use ->map_pages() and implement
         mmap(MAP_POPULATE|MAP_NONBLOCK) on top.
      
      =========================================================================
      Tested on 4-socket machine (120 threads) with 128GiB of RAM.
      
      Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
      somewhere between 3 and 5. Let's say 4 :)
      
      Linux build (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		283,301,572	247,151,987	212,215,789	204,772,882	199,568,944	194,703,779	193,381,485
      	time, seconds		151.227629483	153.920996480	151.356125472	150.863792049	150.879207877	151.150764954	151.450962358
      Linux rebuild (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		5,396,854	4,148,444	2,855,286	2,577,282	2,361,957	2,169,573	2,112,643
      	time, seconds		27.404543757	27.559725591	27.030057426	26.855045126	26.678618635	26.974523490	26.761320095
      Git test suite (make -j60 test)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
      	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413
      
      Two synthetic tests: access every word in file in sequential/random order.
      It doesn't improve much after FAULT_AROUND_ORDER == 4.
      
      Sequential access 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		4,195,437	2,098,275	525,068		262,251		131,170		32,856		8,282
      	time, seconds		7.250461742	6.461711074	5.493859139	5.488488147	5.707213983	5.898510832	5.109232856
       8 threads
      	minor-faults		33,557,540	16,892,728	4,515,848	2,366,999	1,423,382	442,732		142,339
      	time, seconds		16.649304881	9.312555263	6.612490639	6.394316732	6.669827501	6.75078944	6.371900528
       32 threads
      	minor-faults		134,228,222	67,526,810	17,725,386	9,716,537	4,763,731	1,668,921	537,200
      	time, seconds		49.164430543	29.712060103	12.938649729	10.175151004	11.840094583	9.594081325	9.928461797
       60 threads
      	minor-faults		251,687,988	126,146,952	32,919,406	18,208,804	10,458,947	2,733,907	928,217
      	time, seconds		86.260656897	49.626551828	22.335007632	17.608243696	16.523119035	16.339489186	16.326390902
       120 threads
      	minor-faults		503,352,863	252,939,677	67,039,168	35,191,827	19,170,091	4,688,357	1,471,862
      	time, seconds		124.589206333	79.757867787	39.508707872	32.167281632	29.972989292	28.729834575	28.042251622
      Random access 1GiB file
       1 thread
      	minor-faults		262,636		132,743		34,369		17,299		8,527		3,451		1,222
      	time, seconds		15.351890914	16.613802482	16.569227308	15.179220992	16.557356122	16.578247824	15.365266994
       8 threads
      	minor-faults		2,098,948	1,061,871	273,690		154,501		87,110		25,663		7,384
      	time, seconds		15.040026343	15.096933500	14.474757288	14.289129964	14.411537468	14.296316837	14.395635804
       32 threads
      	minor-faults		8,390,734	4,231,023	1,054,432	528,847		269,242		97,746		26,881
      	time, seconds		20.430433109	21.585235358	22.115062928	14.872878951	14.880856305	14.883370649	14.821261690
       60 threads
      	minor-faults		15,733,258	7,892,809	1,973,393	988,266		594,789		164,994		51,691
      	time, seconds		26.577302548	25.692397770	18.728863715	20.153026398	21.619101933	17.745086260	17.613215273
       120 threads
      	minor-faults		31,471,111	15,816,616	3,959,209	1,978,685	1,008,299	264,635		96,010
      	time, seconds		41.835322703	40.459786095	36.085306105	35.313894834	35.814445675	36.552633793	34.289210594
      
      Touch only one page in page table in 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		8,372		8,324		8,270		8,260		8,249		8,239		8,237
      	time, seconds		0.039892712	0.045369149	0.051846126	0.063681685	0.079095975	0.17652406	0.541213386
       8 threads
      	minor-faults		65,731		65,681		65,628		65,620		65,608		65,599		65,596
      	time, seconds		0.124159196	0.488600638	0.156854426	0.191901957	0.242631486	0.543569456	1.677303984
       32 threads
      	minor-faults		262,388		262,341		262,285		262,276		262,266		262,257		263,183
      	time, seconds		0.452421421	0.488600638	0.565020946	0.648229739	0.789850823	1.651584361	5.000361559
       60 threads
      	minor-faults		491,822		491,792		491,723		491,711		491,701		491,691		491,825
      	time, seconds		0.763288616	0.869620515	0.980727360	1.161732354	1.466915814	3.04041448	9.308612938
       120 threads
      	minor-faults		983,466		983,655		983,366		983,372		983,363		984,083		984,164
      	time, seconds		1.595846553	1.667902182	2.008959376	2.425380942	2.941368804	5.977807890	18.401846125
      
      This patch (of 2):
      
      Introduce new vm_ops callback ->map_pages() and uses it for mapping easy
      accessible pages around fault address.
      
      On read page fault, if filesystem provides ->map_pages(), we try to map up
      to FAULT_AROUND_PAGES pages around page fault address in hope to reduce
      number of minor page faults.
      
      We call ->map_pages first and use ->fault() as fallback if page by the
      offset is not ready to be mapped (cold page cache or something).
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c6e50b0
    • Alex Thorlton's avatar
      mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE · a0715cc2
      Alex Thorlton authored
      
      Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs.  It
      also adds a prctl control which allows us to set the THP disable bit in
      mm->def_flags so that VMs will pick up the setting as they are created.
      
      Signed-off-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a0715cc2
  3. Apr 07, 2014
    • Arnd Bergmann's avatar
      sched: remove sleep_on() and friends · b8780c36
      Arnd Bergmann authored
      This is the final piece in the puzzle, as all patches to remove the
      last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
      into the 3.15 merge window. The work was long overdue, and this
      interface in particular should not have survived the BKL removal
      that was done a couple of years ago.
      
      Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3
      
      ":
      
       "[...] it was suggested that the janitors look for and fix all code
        that calls sleep_on() [...] since (1) almost all such code is
        incorrect, and (2) Linus has agreed that those functions should
        be removed in the 2.5 development series".
      
      We haven't quite made it for 2.5, but maybe we can merge this for 3.15.
      
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8780c36
  4. Apr 05, 2014
Loading