  1. Apr 08, 2014
    • wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space · ad86622b
      Oleg Nesterov authored
      
      get_task_state() uses the most significant bit to report the state to
      user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
      can be noticed via /proc as Z -> X -> Z change.  Note that this was
      possible even before EXIT_TRACE was introduced.
      
      This is not really bad but imho it makes sense to hide EXIT_TRACE from
      user-space completely.  So the patch simply swaps EXIT_ZOMBIE and
      EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.
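
      The effect of the swap can be illustrated with a small Python model. It
      assumes, as the message says, that the reported letter is picked by the
      most significant set bit, and that EXIT_TRACE is the OR of EXIT_ZOMBIE
      and EXIT_DEAD. The concrete bit values (16 and 32) and the helper names
      are illustrative assumptions, not a copy of the kernel source.

```python
def reported_letter(state, zombie_bit, dead_bit):
    """Return the letter selected by the most significant set bit."""
    letters = {zombie_bit: "Z", dead_bit: "X"}
    top_bit = 1 << (state.bit_length() - 1)  # isolate the highest set bit
    return letters[top_bit]


def trace_letter(zombie_bit, dead_bit):
    """Letter shown for EXIT_TRACE, assumed to be zombie_bit | dead_bit."""
    return reported_letter(zombie_bit | dead_bit, zombie_bit, dead_bit)


print(trace_letter(zombie_bit=16, dead_bit=32))  # old ordering: X
print(trace_letter(zombie_bit=32, dead_bit=16))  # swapped ordering: Z
```

      With the swapped ordering the zombie bit is the higher of the two, so
      the combined state is reported with the same letter as a plain zombie.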
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Jan 24, 2014
    • fs/proc/array.c: change do_task_stat() to use while_each_thread() · 185ee40e
      Oleg Nesterov authored
      
      Change the remaining next_thread (ab)users to use while_each_thread().
      
      The last user which should be changed is next_tid(), but we can't do this
      now.
      
      __exit_signal() and complete_signal() are fine, they actually need
      next_thread() logic.
      
      This patch (of 3):
      
      do_task_stat() can use while_each_thread(), no changes in
      the compiled code.
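
      For readers unfamiliar with the idiom, the following Python sketch models
      the traversal that a do { ... } while_each_thread(g, t) loop performs:
      start at one thread, advance with next_thread(), and stop on returning to
      the start. The Thread class and the link() helper are hypothetical
      stand-ins for the kernel's circular thread list.

```python
class Thread:
    """Node in a circular, singly linked thread list (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.next = self  # a lone thread points at itself


def link(threads):
    """Chain the threads into a ring and return the first one."""
    for cur, nxt in zip(threads, threads[1:] + threads[:1]):
        cur.next = nxt
    return threads[0]


def each_thread(leader):
    """Mimic: t = g; do { ... } while_each_thread(g, t)."""
    t = leader
    while True:
        yield t
        t = t.next        # corresponds to next_thread(t)
        if t is leader:   # back at the starting thread: stop
            break


group = link([Thread("main"), Thread("worker-1"), Thread("worker-2")])
print([t.name for t in each_thread(group)])
```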
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Reviewed-by: Sameer Nanda <snanda@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: cleanup/simplify get_task_state/task_state_array · 74e37200
      Oleg Nesterov authored
      
      get_task_state() and task_state_array[] look confusing and suboptimal, it
      is not clear what it can actually report to user-space and
      task_state_array[] blows .data for no reason.
      
      1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not
         clear. TASK_REPORT is self-documenting but it is not clear
         what ->exit_state can add.
      
         Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD)
         into TASK_REPORT and use it to calculate the final result.
      
      2. With the change above it is obvious that task_state_array[]
         has the unused entries just to make BUILD_BUG_ON() happy.
      
         Change this BUILD_BUG_ON() to use TASK_REPORT rather than
         TASK_STATE_MAX and shrink task_state_array[].
      
      3. Turn the "while (state)" loop into fls(state).
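
      Item 3 relies on the shift-counting loop and fls() producing the same
      table index. A minimal Python check of that equivalence follows; it
      assumes the old loop advanced the index once per right shift, and uses
      int.bit_length() as a stand-in for fls().

```python
def index_by_loop(state):
    """Old form: step the index once per shift until state is zero."""
    index = 0
    while state:
        index += 1
        state >>= 1
    return index


def index_by_fls(state):
    """fls(): 1-based position of the highest set bit, 0 if none is set."""
    return state.bit_length()


# The two forms agree for every state value in a small test range.
print(all(index_by_loop(s) == index_by_fls(s) for s in range(1024)))
```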
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Oct 09, 2013
  4. Apr 12, 2013
    • kthread: Prevent unpark race which puts threads on the wrong cpu · f2530dc7
      Thomas Gleixner authored
      
      The smpboot threads rely on the park/unpark mechanism which binds per
      cpu threads on a particular core. However, the functionality is racy:
      
      CPU0                    CPU1                    CPU2
      unpark(T)                                       wake_up_process(T)
        clear(SHOULD_PARK)    T runs
                              leave parkme() due to !SHOULD_PARK
        bind_to(CPU2)         BUG_ON(wrong CPU)
      
      We cannot let the tasks move themselves to the target CPU as one of
      those tasks is actually the migration thread itself, which requires
      that it starts running on the target cpu right away.
      
      The solution to this problem is to prevent wakeups in park mode which
      are not from unpark(). That way we can guarantee that the association
      of the task to the target cpu is working correctly.
      
      Add a new task state (TASK_PARKED) which prevents other wakeups and
      use this state explicitly for the unpark wakeup.
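
      The idea of a state that ordinary wakeups cannot match can be sketched
      in Python. This is a simplified model under stated assumptions: the
      numeric state values, the Task class, and the choice of mask for
      wake_up_process() are illustrative and not taken from the patch.

```python
TASK_RUNNING = 0
TASK_INTERRUPTIBLE = 1
TASK_UNINTERRUPTIBLE = 2
TASK_PARKED = 512  # illustrative value for the new state
TASK_NORMAL = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE


class Task:
    def __init__(self, state):
        self.state = state


def wake_up_state(task, mask):
    """Wake the task only if its current state is selected by mask."""
    if not (task.state & mask):
        return False
    task.state = TASK_RUNNING
    return True


def wake_up_process(task):
    """Ordinary wakeup: matches the normal sleep states only."""
    return wake_up_state(task, TASK_NORMAL)


def unpark(task):
    """Unpark wakeup: the only caller that names TASK_PARKED."""
    return wake_up_state(task, TASK_PARKED)


parked = Task(TASK_PARKED)
print(wake_up_process(parked))  # False: a stray wakeup is ignored
print(unpark(parked))           # True: only unpark releases the task
```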
      
      Peter noticed: Also, since the task state is visible to userspace and
      all the parked tasks are still in the PID space, it's a good hint in ps
      and friends that these tasks aren't really there for the moment.
      
      The migration thread has another related issue.
      
      CPU0                     CPU1
      Bring up CPU2
      create_thread(T)
      park(T)
       wait_for_completion()
                               parkme()
                               complete()
      sched_set_stop_task()
                               schedule(TASK_PARKED)
      
      The sched_set_stop_task() call is issued while the task is on the
      runqueue of CPU1 and that confuses the hell out of the stop_task class
      on that cpu. So we need the same synchronization before
      sched_set_stop_task().
      
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-and-tested-by: Dave Hansen <dave@sr71.net>
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Acked-by: Peter Ziljstra <peterz@infradead.org>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: dhillf@gmail.com
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
      
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  5. Jan 27, 2013
    • cputime: Use accessors to read task cputime stats · 6fac4829
      Frederic Weisbecker authored
      
      This is in preparation for the full dynticks feature. While
      remotely reading the cputime of a task running on a full
      dynticks CPU, we'll need to do some extra computation. This
      way we can account the time it spent tickless in userspace
      since its last cputime snapshot.
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  6. Dec 18, 2012
  7. Nov 28, 2012
    • cputime: Rename thread_group_times to thread_group_cputime_adjusted · e80d0a1a
      Frederic Weisbecker authored
      
      We have thread_group_cputime() and thread_group_times(). The naming
      doesn't provide enough information about the difference between
      these two APIs.
      
      To lower the confusion, rename thread_group_times() to
      thread_group_cputime_adjusted(). This name better suggests that
      it's a version of thread_group_cputime() that does some stabilization
      on the raw cputime values, i.e. it scales them on top of CFS runtime
      stats and bounds the lower value for monotonicity.
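
      A rough Python sketch of what such an "adjusted" value might look like
      is given below. It is a simplified model, not the kernel code: the
      function name, the prev dictionary and the integer units are
      assumptions made for illustration.

```python
def cputime_adjusted(utime, stime, rtime, prev):
    """Scale raw tick counts to rtime and never report a smaller value.

    utime/stime are raw tick-based counts, rtime is the precise runtime,
    and prev holds the values reported last time.
    """
    total = utime + stime
    # Split the precise runtime in the same ratio as the raw tick counts.
    scaled_utime = rtime * utime // total if total else rtime
    # Lower-bound each value by the previous report to stay monotonic.
    prev["utime"] = max(prev["utime"], scaled_utime)
    prev["stime"] = max(prev["stime"], rtime - prev["utime"])
    return prev["utime"], prev["stime"]


last = {"utime": 0, "stime": 0}
print(cputime_adjusted(30, 10, 80, last))  # (60, 20)
print(cputime_adjusted(30, 30, 90, last))  # (60, 30): utime does not drop
```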
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
  8. Nov 20, 2012
  9. Jun 01, 2012
  10. May 15, 2012
  11. May 03, 2012
  12. Mar 29, 2012
  13. Mar 24, 2012
    • procfs: speed up /proc/pid/stat, statm · bda7bad6
      KAMEZAWA Hiroyuki authored
      
      Process accounting applications such as top and ps visit some files under
      /proc/<pid>.  With seq_put_decimal_ull(), we can optimize the
      /proc/<pid>/stat and /proc/<pid>/statm files.

      This patch
        - adds seq_put_decimal_ll() for signed values.
        - allows delimiter == 0.
        - converts seq_printf() to seq_put_decimal_ull/ll in /proc/<pid>/stat, statm.
      
      Test result on a system with 2000+ procs.
      
      Before patch:
        [kamezawa@bluextal test]$ top -b -n 1 | wc -l
        2223
        [kamezawa@bluextal test]$ time top -b -n 1 > /dev/null
      
        real    0m0.675s
        user    0m0.044s
        sys     0m0.121s
      
        [kamezawa@bluextal test]$ time ps -elf > /dev/null
      
        real    0m0.236s
        user    0m0.056s
        sys     0m0.176s
      
      After patch:
        [kamezawa@bluextal ~]$ time top -b -n 1 > /dev/null
      
        real    0m0.657s
        user    0m0.052s
        sys     0m0.100s
      
        [kamezawa@bluextal ~]$ time ps -elf > /dev/null
      
        real    0m0.198s
        user    0m0.050s
        sys     0m0.145s
      
      Considering that top and ps tend to scan /proc periodically, this will
      reduce cpu consumption by top/ps to some extent.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. Jan 13, 2012
  15. Jan 06, 2012
    • ptrace: do not audit capability check when outputing /proc/pid/stat · 69f594a3
      Eric Paris authored
      
      Reading /proc/pid/stat of another process checks if one has ptrace permissions
      on that process.  If one does have permissions it outputs some data about the
      process which might have security and attack implications.  If the current
      task does not have ptrace permissions the read still works, but those fields
      are filled with innocuous (0) values.  Since this check and a subsequent denial
      are not a violation of the security policy we should not audit such denials.

      This can be quite useful when removing ptrace access broadly across a system
      without flooding the logs when ps, or something else which harmlessly walks
      proc, is run.
      
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
  16. Dec 15, 2011
  17. Jun 22, 2011
  18. May 27, 2011
  19. Mar 24, 2011
    • proc: protect mm start_code/end_code in /proc/pid/stat · 5883f57c
      Kees Cook authored
      
      While mm->start_stack was protected from cross-uid viewing (commit
      f83ce3e6 ("proc: avoid information leaks to non-privileged
      processes")), the start_code and end_code values were not.  This would
      allow the text location of a PIE binary to leak, defeating ASLR.
      
      Note that the value "1" is used instead of "0" for a protected value since
      "ps", "killall", and likely other readers of /proc/pid/stat, take
      start_code of "0" to mean a kernel thread and will misbehave.  Thanks to
      Brad Spengler for pointing this out.
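
      A reader of /proc/pid/stat along the lines described above might look
      like the following Python sketch. The field position (start_code as
      field 26, counting from 1) follows the proc(5) layout; the function
      names and the sample line are made up for illustration, and the
      "0 means kernel thread" test is the heuristic the message attributes to
      ps and killall, not a recommendation.

```python
def parse_stat(line):
    """Split a /proc/<pid>/stat line into pid, comm and remaining fields."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    rest = line[close_paren + 1:].split()  # rest[0] is field 3 (state)
    return pid, comm, rest


def start_code(rest):
    """Field 26 (start_code), counting fields from 1."""
    return int(rest[26 - 3])


def looks_like_kernel_thread(rest):
    """Heuristic described in the message: start_code == 0."""
    return start_code(rest) == 0


# Made-up line: state "S", zeros elsewhere, protected start_code of 1.
sample = "1234 (my prog) S " + " ".join(["0"] * 22 + ["1"] + ["0"] * 10)
pid, comm, rest = parse_stat(sample)
print(pid, comm, looks_like_kernel_thread(rest))
```

      Because a protected value is reported as 1 rather than 0, a reader
      using this heuristic does not mistake a protected user process for a
      kernel thread.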
      
      Addresses CVE-2011-0726
      
      Signed-off-by: Kees Cook <kees.cook@canonical.com>
      Cc: <stable@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Eugene Teo <eugeneteo@kernel.sg>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. Feb 15, 2011
  21. Jan 13, 2011
  22. Jul 30, 2010
    • CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials · de09a977
      David Howells authored
      
      It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
      credentials by incrementing their usage count after their replacement by the
      task being accessed.
      
      What happens is that get_task_cred() can race with commit_creds():
      
      	TASK_1			TASK_2			RCU_CLEANER
      	-->get_task_cred(TASK_2)
      	rcu_read_lock()
      	__cred = __task_cred(TASK_2)
      				-->commit_creds()
      				old_cred = TASK_2->real_cred
      				TASK_2->real_cred = ...
      				put_cred(old_cred)
      				  call_rcu(old_cred)
      		[__cred->usage == 0]
      	get_cred(__cred)
      		[__cred->usage == 1]
      	rcu_read_unlock()
      							-->put_cred_rcu()
      							[__cred->usage == 1]
      							panic()
      
      However, since a task's credentials are generally not changed very often, we can
      reasonably make use of a loop involving reading the creds pointer and using
      atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.
      
      If successful, we can safely return the credentials in the knowledge that, even
      if the task we're accessing has released them, they haven't gone to the RCU
      cleanup code.
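
      The retry loop described above can be modelled in Python as follows.
      This is a sketch under stated assumptions: a lock stands in for the
      atomic operation, RCU is not modelled at all, and the Cred and Task
      classes are hypothetical.

```python
import threading


class Cred:
    def __init__(self):
        self.usage = 1
        self._lock = threading.Lock()  # stands in for atomic operations

    def inc_not_zero(self):
        """Take a reference unless the count has already hit zero."""
        with self._lock:
            if self.usage == 0:
                return False
            self.usage += 1
            return True


class Task:
    def __init__(self):
        self.real_cred = Cred()


def get_task_cred(task):
    """Retry until a reference is taken on a cred that is still live."""
    while True:
        cred = task.real_cred      # read the current creds pointer
        if cred.inc_not_zero():    # refuses to resurrect a dead cred
            return cred


task = Task()
cred = get_task_cred(task)
print(cred.usage)  # 2: the task's own reference plus ours
```

      The point of the conditional increment is that a count of zero is
      never raised back to one, which is the "resurrection" the race in the
      diagram produces.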
      
      We then change task_state() in procfs to use get_task_cred() rather than
      calling get_cred() on the result of __task_cred(), as that suffers from the
      same problem.
      
      Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
      tripped when it is noticed that the usage count is not zero as it ought to be,
      for example:
      
      kernel BUG at kernel/cred.c:168!
      invalid opcode: 0000 [#1] SMP
      last sysfs file: /sys/kernel/mm/ksm/run
      CPU 0
      Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
      745
      RIP: 0010:[<ffffffff81069881>]  [<ffffffff81069881>] __put_cred+0xc/0x45
      RSP: 0018:ffff88019e7e9eb8  EFLAGS: 00010202
      RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
      RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
      RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
      R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
      FS:  00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
      Stack:
       ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
      <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
      <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
      Call Trace:
       [<ffffffff810698cd>] put_cred+0x13/0x15
       [<ffffffff81069b45>] commit_creds+0x16b/0x175
       [<ffffffff8106aace>] set_current_groups+0x47/0x4e
       [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
       [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
      Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
      48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
      04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
      RIP  [<ffffffff81069881>] __put_cred+0xc/0x45
       RSP <ffff88019e7e9eb8>
      ---[ end trace df391256a100ebdd ]---
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. May 27, 2010
  24. May 12, 2010
    • revert "procfs: provide stack information for threads" and its fixup commits · 34441427
      Robin Holt authored
      
      Originally, commit d899bf7b ("procfs: provide stack information for
      threads") attempted to introduce a new feature for showing where the
      thread stack was located and how many pages are being utilized by the
      stack.
      
      Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
      applied to fix the NO_MMU case.
      
      Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
      64-bit") was applied to fix a bug in ia32 executables being loaded.
      
      Commit 9ebd4eba ("procfs: fix /proc/<pid>/stat stack pointer for kernel
      threads") was applied to fix a bug which had kernel threads printing a
      userland stack address.
      
      Commit 1306d603 ('proc: partially revert "procfs: provide stack
      information for threads"') was then applied to revert the stack pages
      being used to solve a significant performance regression.
      
      This patch nearly undoes the effect of all these patches.
      
      The reason for reverting these is that they provide an unusable value in
      field 28.  For x86_64, a fork will result in the task->stack_start
      value being updated to the current user top of stack and not the stack
      start address.  This unpredictability of the stack_start value makes
      it worthless.  That includes the intended use of showing how much stack
      space a thread has.
      
      Other architectures will get different values.  As an example, ia64
      gets 0.  The do_fork() and copy_process() functions appear to treat the
      stack_start and stack_size parameters as architecture specific.
      
      I only partially reverted c44972f1 ("procfs: disable per-task stack usage
      on NOMMU").  If I had completely reverted it, I would have had to change
      mm/Makefile to only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
      configured.  Since I could not test the builds without significant effort,
      I decided to not change mm/Makefile.

      I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
      information for threads on 64-bit").  I left the KSTK_ESP() change in
      place as that seemed worthwhile.
      
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. Mar 30, 2010
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Tejun Heo authored
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch a large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
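
      The ordering rule in the second bullet can be illustrated with a short
      Python sketch. It is not the slabh-sweep.py script: the function name
      is hypothetical, and only the alphabetical case and the "append at the
      end" fallback are handled, not the Christmas-tree orderings.

```python
def insert_include(includes, new):
    """Insert new into a block of include lines.

    If the block is alphabetically sorted, keep it sorted; otherwise
    append at the end. Returns a new list; an include that is already
    present is not added again.
    """
    if new in includes:
        return list(includes)
    if includes == sorted(includes):
        # Count the existing lines that sort before the new one.
        position = sum(1 for line in includes if line < new)
        return includes[:position] + [new] + includes[position:]
    return list(includes) + [new]


block = [
    "#include <linux/fs.h>",
    "#include <linux/sched.h>",
    "#include <linux/time.h>",
]
for line in insert_include(block, "#include <linux/slab.h>"):
    print(line)
```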
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  26. Mar 06, 2010
  27. Feb 25, 2010
    • vfs: Apply lockdep-based checking to rcu_dereference() uses · 7dc52157
      Paul E. McKenney authored
      
      Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
      and fcheck_files().
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  28. Jan 11, 2010
    • proc: partially revert "procfs: provide stack information for threads" · 1306d603
      KOSAKI Motohiro authored
      
      Commit d899bf7b (procfs: provide stack information for threads) introduced
      stack information in /proc/{pid}/status.  But it causes a large
      performance regression, because both taking mmap_sem and walking the
      page tables are heavy operations.  Unfortunately /proc/{pid}/status is
      also used by the ps command, and ps is one of the most important
      components.

      If many processes run, the ps performance is:
      
      [before d899bf7b]
      
      % perf stat ps >/dev/null
      
       Performance counter stats for 'ps':
      
           4090.435806  task-clock-msecs         #      0.032 CPUs
                   229  context-switches         #      0.000 M/sec
                     0  CPU-migrations           #      0.000 M/sec
                   234  page-faults              #      0.000 M/sec
            8587565207  cycles                   #   2099.425 M/sec
            9866662403  instructions             #      1.149 IPC
            3789415411  cache-references         #    926.409 M/sec
              30419509  cache-misses             #      7.437 M/sec
      
         128.859521955  seconds time elapsed
      
      [after d899bf7b]
      
      % perf stat  ps  > /dev/null
      
       Performance counter stats for 'ps':
      
           4305.081146  task-clock-msecs         #      0.028 CPUs
                   480  context-switches         #      0.000 M/sec
                     2  CPU-migrations           #      0.000 M/sec
                   237  page-faults              #      0.000 M/sec
            9021211334  cycles                   #   2095.480 M/sec
           10605887536  instructions             #      1.176 IPC
            3612650999  cache-references         #    839.160 M/sec
              23917502  cache-misses             #      5.556 M/sec
      
         152.277819582  seconds time elapsed
      
      Thus, this patch reverts it.  Fortunately /proc/{pid}/task/{tid}/smaps
      provides almost the same information, so we can use that instead.

      Commit d899bf7b introduced two features:

       1) Add the annotation of [thread stack: xxxx] mark to
          /proc/{pid}/task/{tid}/maps.
       2) Add StackUsage field to /proc/{pid}/status.

      I only revert (2), because I haven't seen (1) cause a regression.
      
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  29. Dec 17, 2009
  30. Dec 02, 2009
    • sched, cputime: Introduce thread_group_times() · 0cf55e1e
      Hidetoshi Seto authored
      This is a real fix for the problem of utime/stime values decreasing
      described in the thread:
      
         http://lkml.org/lkml/2009/11/3/522
      
      
      
      Now cputime is accounted in the following way:
      
       - {u,s}time in task_struct are increased every time the thread
         is interrupted by a tick (timer interrupt).

       - When a thread exits, its {u,s}time are added to signal->{u,s}time,
         after being adjusted by task_times().

       - When all threads in a thread_group exit, the accumulated {u,s}time
         (and also c{u,s}time) in signal struct are added to c{u,s}time
         in signal struct of the group's parent.
      
      So {u,s}time in task struct are "raw" tick count, while
      {u,s}time and c{u,s}time in signal struct are "adjusted" values.
      
      And accounted values are used by:
      
       - task_times(), to get cputime of a thread:
         This function returns adjusted values that originate from raw
         {u,s}time and are scaled by sum_exec_runtime as accounted by CFS.

       - thread_group_cputime(), to get cputime of a thread group:
         This function returns the sum of all {u,s}time of living threads in
         the group, plus {u,s}time in the signal struct, which is the sum of
         adjusted cputimes of all exited threads that belonged to the group.
      
      The problem is the return value of thread_group_cputime(),
      because it is a mixed sum of "raw" values and "adjusted" values:
      
        group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
      
      This misbehavior can break {u,s}time monotonicity.
      Assume there is a thread whose raw values are greater than its
      adjusted values (e.g. interrupted by 1000Hz ticks 50 times
      but only runs 45ms).  If it exits, cputime will decrease (e.g.
      -5ms).
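
      The arithmetic of this example can be written out in a few lines of
      Python. The helper name and the millisecond units are assumptions for
      illustration; the numbers are the 50-tick, 45 ms case from the text.

```python
def group_time(live_raw, exited_sum):
    """Mixed sum: raw values of live threads plus the exited total."""
    return sum(live_raw) + exited_sum


raw_ms = 50       # 50 ticks at 1000 Hz, counted as 1 ms each
adjusted_ms = 45  # the thread actually ran for 45 ms

before_exit = group_time(live_raw=[raw_ms], exited_sum=0)
after_exit = group_time(live_raw=[], exited_sum=adjusted_ms)
print(after_exit - before_exit)  # -5: the reported group time went backwards
```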
      
      To fix this, we could do:
      
        group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
      
      But task_times() contains hard divisions, so applying it for
      every thread should be avoided.
      
      This patch fixes the above problem in the following way:
      
       - Modify thread's exit (= __exit_signal()) not to use task_times().
         It means {u,s}time in signal struct accumulates raw values instead
         of adjusted values.  As a result, thread_group_cputime()
         returns a pure sum of "raw" values.

       - Introduce a new function thread_group_times(*task, *utime, *stime)
         that converts "raw" values of thread_group_cputime() to "adjusted"
         values, using the same calculation procedure as task_times().

       - Modify group's exit (= wait_task_zombie()) to use the newly introduced
         thread_group_times().  This makes c{u,s}time in signal struct
         hold adjusted values, as before this patch.

       - Replace some thread_group_cputime() calls by thread_group_times().
         These replacements are only applied where the "adjusted" cputime
         is conveyed to users, and where task_times() is already used nearby
         (i.e. sys_times(), getrusage(), and /proc/<PID>/stat).
      
      This patch has a positive side effect:

       - Before this patch, if a group contains many short-lived threads
         (e.g. each runs 0.9ms and is not interrupted by ticks), the group's
         cputime could be invisible since each thread's cputime was accumulated
         after being adjusted: imagine the adjustment function as adj(ticks, runtime),
           {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
         After this patch it will not happen because the adjustment is
         applied after accumulation.
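
      The difference between adjusting before and after accumulation can be
      seen in a toy Python model. The adj() below is an assumption: it simply
      truncates the runtime to whole 1 ms ticks, which is one plausible
      reading of why adj(0, 0.9) yields 0; the real adjustment also uses the
      tick counts to split user and system time.

```python
NSEC_PER_TICK = 1_000_000  # 1 ms tick (1000 Hz), an assumed configuration


def adj(ticks, runtime_ns):
    """Toy adjustment: runtime truncated to whole ticks.

    The tick count would only decide the user/system split, which this
    single-value model ignores.
    """
    return runtime_ns // NSEC_PER_TICK


# 1000 short-lived threads: 0 ticks seen, 0.9 ms of runtime each.
threads = [(0, 900_000)] * 1000

adjust_then_sum = sum(adj(t, r) for t, r in threads)
sum_then_adjust = adj(sum(t for t, _ in threads),
                      sum(r for _, r in threads))
print(adjust_then_sum, sum_then_adjust)  # 0 900
```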
      
      v2:
       - remove if()s, put new variables into signal_struct.
      
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  31. Nov 26, 2009