Skip to content
Snippets Groups Projects
  1. Dec 21, 2012
    • Kees Cook's avatar
      exec: do not leave bprm->interp on stack · b66c5984
      Kees Cook authored
      If a series of scripts are executed, each triggering module loading via
      unprintable bytes in the script header, kernel stack contents can leak
      into the command line.
      
      Normally execution of binfmt_script and binfmt_misc happens recursively.
      However, when modules are enabled, and unprintable bytes exist in the
      bprm->buf, execution will restart after attempting to load matching
      binfmt modules.  Unfortunately, the logic in binfmt_script and
      binfmt_misc does not expect to get restarted.  They leave bprm->interp
      pointing to their local stack.  This means on restart bprm->interp is
      left pointing into unused stack memory which can then be copied into the
      userspace argv areas.
      
      After additional study, it seems that both recursion and restart remains
      the desirable way to handle exec with scripts, misc, and modules.  As
      such, we need to protect the changes to interp.
      
      This changes the logic to require allocation for any changes to the
      bprm->interp.  To avoid adding a new kmalloc to every exec, the default
      value is left as-is.  Only when passing through binfmt_script or
      binfmt_misc does an allocation take place.
      
      For a proof of concept, see DoTest.sh from:
      
         http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/
      
      
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: halfdog <me@halfdog.net>
      Cc: P J P <ppandit@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b66c5984
  2. Dec 18, 2012
    • Kees Cook's avatar
      exec: use -ELOOP for max recursion depth · d7402698
      Kees Cook authored
      
      To avoid an explosion of request_module calls on a chain of abusive
      scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
      as maximum recursion depth is hit, the error will fail all the way back
      up the chain, aborting immediately.
      
      This also has the side-effect of stopping the user's shell from attempting
      to reexecute the top-level file as a shell script. As seen in the
      dash source:
      
              if (cmd != path_bshell && errno == ENOEXEC) {
                      *argv-- = cmd;
                      *argv = cmd = path_bshell;
                      goto repeat;
              }
      
      The above logic was designed for running scripts automatically that lacked
      the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
      things continue to behave as the shell expects.
      
      Additionally, when tracking recursion, the binfmt handlers should not be
      involved. The recursion being tracked is the depth of calls through
      search_binary_handler(), so that function should be exclusively responsible
      for tracking the depth.
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: halfdog <me@halfdog.net>
      Cc: P J P <ppandit@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7402698
  3. Nov 29, 2012
  4. Nov 19, 2012
  5. Oct 25, 2012
  6. Oct 13, 2012
    • Jeff Layton's avatar
      vfs: make path_openat take a struct filename pointer · 669abf4e
      Jeff Layton authored
      
      ...and fix up the callers. For do_file_open_root, just declare a
      struct filename on the stack and fill out the .name field. For
      do_filp_open, make it also take a struct filename pointer, and fix up its
      callers to call it appropriately.
      
      For filp_open, add a variant that takes a struct filename pointer and turn
      filp_open into a wrapper around it.
      
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      669abf4e
    • Jeff Layton's avatar
      vfs: define struct filename and have getname() return it · 91a27b2a
      Jeff Layton authored
      
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
      
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      91a27b2a
  7. Oct 09, 2012
    • Michel Lespinasse's avatar
      mm: avoid taking rmap locks in move_ptes() · 38a76013
      Michel Lespinasse authored
      
      During mremap(), the destination VMA is generally placed after the
      original vma in rmap traversal order: in move_vma(), we always have
      new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
      vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.
      
      When the destination VMA is placed after the original in rmap traversal
      order, we can avoid taking the rmap locks in move_ptes().
      
      Essentially, this reintroduces the optimization that had been disabled in
      "mm anon rmap: remove anon_vma_moveto_tail".  The difference is that we
      don't try to impose the rmap traversal order; instead we just rely on
      things being in the desired order in the common case and fall back to
      taking locks in the uncommon case.  Also we skip the i_mmap_mutex in
      addition to the anon_vma lock: in both cases, the vmas are traversed in
      increasing vm_pgoff order with ties resolved in tree insertion order.
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38a76013
  8. Oct 08, 2012
    • Oleg Nesterov's avatar
      exec: make de_thread() killable · d5bbd43d
      Oleg Nesterov authored
      
      Change de_thread() to use KILLABLE rather than UNINTERRUPTIBLE while
      waiting for other threads.  The only complication is that we should
      clear ->group_exit_task and ->notify_count before we return, and we
      should do this under tasklist_lock.  -EAGAIN is used to match the
      initial signal_group_exit() check/return, it doesn't really matter.
      
      This fixes the (unlikely) race with coredump.  de_thread() checks
      signal_group_exit() before it starts to kill the subthreads, but this
      can't help if another CLONE_VM (but non CLONE_THREAD) task starts the
      coredumping after de_thread() unlocks ->siglock.  In this case the
      killed sub-thread can block in exit_mm() waiting for coredump_finish(),
      execing thread waits for that sub-thead, and the coredumping thread
      waits for execing thread.  Deadlock.
      
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5bbd43d
  9. Oct 05, 2012
  10. Oct 03, 2012
  11. Oct 01, 2012
    • Al Viro's avatar
      generic sys_execve() · 38b983b3
      Al Viro authored
      
      Selected by __ARCH_WANT_SYS_EXECVE in unistd.h.  Requires
      	* working current_pt_regs()
      	* *NOT* doing a syscall-in-kernel kind of kernel_execve()
      implementation.  Using generic kernel_execve() is fine.
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      38b983b3
  12. Sep 30, 2012
    • Al Viro's avatar
      generic kernel_execve() · 282124d1
      Al Viro authored
      
      based mostly on arm and alpha versions.  Architectures can define
      __ARCH_WANT_KERNEL_EXECVE and use it, provided that
      	* they have working current_pt_regs(), even for kernel threads.
      	* kernel_thread-spawned threads do have space for pt_regs
      in the normal location.  Normally that's as simple as switching to
      generic kernel_thread() and making sure that kernel threads do *not*
      go through return from syscall path; call the payload from equivalent
      of ret_from_fork if we are in a kernel thread (or just have separate
      ret_from_kernel_thread and make copy_thread() use it instead of
      ret_from_fork in kernel thread case).
      	* they have ret_from_kernel_execve(); it is called after
      successful do_execve() done by kernel_execve() and gets normal
      pt_regs location passed to it as argument.  It's essentially
      a longjmp() analog - it should set sp, etc. to the situation
      expected at the return for syscall and go there.  Eventually
      the need for that sucker will disappear, but that'll take some
      surgery on kernel_thread() payloads.
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      282124d1
  13. Sep 27, 2012
  14. Sep 20, 2012
  15. Jul 31, 2012
    • Jovi Zhang's avatar
      coredump: fix wrong comments on core limits of pipe coredump case · 108ceeb0
      Jovi Zhang authored
      
      In commit 898b374a ("exec: replace call_usermodehelper_pipe with use
      of umh init function and resolve limit"), the core limits recursive
      check value was changed from 0 to 1, but the corresponding comments were
      not updated.
      
      Signed-off-by: default avatarJovi Zhang <bookjovi@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      108ceeb0
    • Kees Cook's avatar
      coredump: warn about unsafe suid_dumpable / core_pattern combo · 54b50199
      Kees Cook authored
      
      When suid_dumpable=2, detect unsafe core_pattern settings and warn when
      they are seen.
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54b50199
    • Kees Cook's avatar
      fs: make dumpable=2 require fully qualified path · 9520628e
      Kees Cook authored
      
      When the suid_dumpable sysctl is set to "2", and there is no core dump
      pipe defined in the core_pattern sysctl, a local user can cause core files
      to be written to root-writable directories, potentially with
      user-controlled content.
      
      This means an admin can unknowningly reintroduce a variation of
      CVE-2006-2451, allowing local users to gain root privileges.
      
        $ cat /proc/sys/fs/suid_dumpable
        2
        $ cat /proc/sys/kernel/core_pattern
        core
        $ ulimit -c unlimited
        $ cd /
        $ ls -l core
        ls: cannot access core: No such file or directory
        $ touch core
        touch: cannot touch `core': Permission denied
        $ OHAI="evil-string-here" ping localhost >/dev/null 2>&1 &
        $ pid=$!
        $ sleep 1
        $ kill -SEGV $pid
        $ ls -l core
        -rw------- 1 root kees 458752 Jun 21 11:35 core
        $ sudo strings core | grep evil
        OHAI=evil-string-here
      
      While cron has been fixed to abort reading a file when there is any
      parse error, there are still other sensitive directories that will read
      any file present and skip unparsable lines.
      
      Instead of introducing a suid_dumpable=3 mode and breaking all users of
      mode 2, this only disables the unsafe portion of mode 2 (writing to disk
      via relative path).  Most users of mode 2 (e.g.  Chrome OS) already use
      a core dump pipe handler, so this change will not break them.  For the
      situations where a pipe handler is not defined but mode 2 is still
      active, crash dumps will only be written to fully qualified paths.  If a
      relative path is defined (e.g.  the default "core" pattern), dump
      attempts will trigger a printk yelling about the lack of a fully
      qualified path.
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9520628e
  16. Jul 29, 2012
  17. Jul 26, 2012
    • Josh Boyer's avatar
      posix_types.h: Cleanup stale __NFDBITS and related definitions · 8ded2bbc
      Josh Boyer authored
      
      Recently, glibc made a change to suppress sign-conversion warnings in
      FD_SET (glibc commit ceb9e56b3d1).  This uncovered an issue with the
      kernel's definition of __NFDBITS if applications #include
      <linux/types.h> after including <sys/select.h>.  A build failure would
      be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
      flags to gcc.
      
      It was suggested that the kernel should either match the glibc
      definition of __NFDBITS or remove that entirely.  The current in-kernel
      uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
      uses of the related __FDELT and __FDMASK defines.  Given that, we'll
      continue the cleanup that was started with commit 8b3d1cda
      ("posix_types: Remove fd_set macros") and drop the remaining unused
      macros.
      
      Additionally, linux/time.h has similar macros defined that expand to
      nothing so we'll remove those at the same time.
      
      Reported-by: default avatarJeff Law <law@redhat.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarJosh Boyer <jwboyer@redhat.com>
      [ .. and fix up whitespace as per akpm ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8ded2bbc
  18. Jun 20, 2012
  19. Jun 08, 2012
    • Linus Torvalds's avatar
      Revert "mm: correctly synchronize rss-counters at exit/exec" · 48d212a2
      Linus Torvalds authored
      
      This reverts commit 40af1bbd.
      
      It's horribly and utterly broken for at least the following reasons:
      
       - calling sync_mm_rss() from mmput() is fundamentally wrong, because
         there's absolutely no reason to believe that the task that does the
         mmput() always does it on its own VM.  Example: fork, ptrace, /proc -
         you name it.
      
       - calling it *after* having done mmdrop() on it is doubly insane, since
         the mm struct may well be gone now.
      
       - testing mm against NULL before you call it is insane too, since a
      NULL mm there would have caused oopses long before.
      
      .. and those are just the three bugs I found before I decided to give up
      looking for me and revert it asap.  I should have caught it before I
      even took it, but I trusted Andrew too much.
      
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48d212a2
  20. Jun 07, 2012
    • Konstantin Khlebnikov's avatar
      mm: correctly synchronize rss-counters at exit/exec · 40af1bbd
      Konstantin Khlebnikov authored
      
      mm->rss_stat counters have per-task delta: task->rss_stat.  Before
      changing task->mm pointer the kernel must flush this delta with
      sync_mm_rss().
      
      do_exit() already calls sync_mm_rss() to flush the rss-counters before
      committing the rss statistics into task->signal->maxrss, taskstats,
      audit and other stuff.  Unfortunately the kernel does this before
      calling mm_release(), which can call put_user() for processing
      task->clear_child_tid.  So at this point we can trigger page-faults and
      task->rss_stat becomes non-zero again.  As a result mm->rss_stat becomes
      inconsistent and check_mm() will print something like this:
      
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
      
      This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
      out of do_exit() and calls it earlier.  After mm_release() there should
      be no pagefaults.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Reported-by: default avatarMarkus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>		[3.4.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      40af1bbd
  21. May 31, 2012
  22. May 17, 2012
    • Suresh Siddha's avatar
      coredump: ensure the fpu state is flushed for proper multi-threaded core dump · 11aeca0b
      Suresh Siddha authored
      
      Nalluru reported hitting the BUG_ON(__thread_has_fpu(tsk)) in
      arch/x86/kernel/xsave.c:__sanitize_i387_state() during the coredump
      of a multi-threaded application.
      
      A look at the exit seqeuence shows that other threads can still be on the
      runqueue potentially at the below shown exit_mm() code snippet:
      
      		if (atomic_dec_and_test(&core_state->nr_threads))
      			complete(&core_state->startup);
      
      ===> other threads can still be active here, but we notify the thread
      ===> dumping core to wakeup from the coredump_wait() after the last thread
      ===> joins this point. Core dumping thread will continue dumping
      ===> all the threads state to the core file.
      
      		for (;;) {
      			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
      			if (!self.task) /* see coredump_finish() */
      				break;
      			schedule();
      		}
      
      As some of those threads are on the runqueue and didn't call schedule() yet,
      their fpu state is still active in the live registers and the thread
      proceeding with the coredump will hit the above mentioned BUG_ON while
      trying to dump other threads fpustate to the coredump file.
      
      BUG_ON() in arch/x86/kernel/xsave.c:__sanitize_i387_state() is
      in the code paths for processors supporting xsaveopt. With or without
      xsaveopt, multi-threaded coredump is broken and maynot contain
      the correct fpustate at the time of exit.
      
      In coredump_wait(), wait for all the threads to be come inactive, so
      that we are sure all the extended register state is flushed to
      the memory, so that it can be reliably copied to the core file.
      
      Reported-by: default avatarSuresh Nalluru <suresh@aristanetworks.com>
      Signed-off-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
      Link: http://lkml.kernel.org/r/1336692811-30576-2-git-send-email-suresh.b.siddha@intel.com
      
      
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      11aeca0b
  23. May 15, 2012
  24. May 03, 2012
  25. Apr 14, 2012
    • Andy Lutomirski's avatar
      Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs · 259e5e6c
      Andy Lutomirski authored
      
      With this change, calling
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
      disables privilege granting operations at execve-time.  For example, a
      process will not be able to execute a setuid binary to change their uid
      or gid if this bit is set.  The same is true for file capabilities.
      
      Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
      LSMs respect the requested behavior.
      
      To determine if the NO_NEW_PRIVS bit is set, a task may call
        prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
      It returns 1 if set and 0 if it is not set. If any of the arguments are
      non-zero, it will return -1 and set errno to -EINVAL.
      (PR_SET_NO_NEW_PRIVS behaves similarly.)
      
      This functionality is desired for the proposed seccomp filter patch
      series.  By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
      system call behavior for itself and its child tasks without being
      able to impact the behavior of a more privileged task.
      
      Another potential use is making certain privileged operations
      unprivileged.  For example, chroot may be considered "safe" if it cannot
      affect privileged tasks.
      
      Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
      set and AppArmor is in use.  It is fixed in a subsequent patch.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarWill Drewry <wad@chromium.org>
      Acked-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      
      v18: updated change desc
      v17: using new define values as per 3.4
      Signed-off-by: default avatarJames Morris <james.l.morris@oracle.com>
      259e5e6c
  26. Mar 31, 2012
  27. Mar 28, 2012
    • David Howells's avatar
      Add #includes needed to permit the removal of asm/system.h · 96f951ed
      David Howells authored
      
      asm/system.h is a cause of circular dependency problems because it contains
      commonly used primitive stuff like barrier definitions and uncommonly used
      stuff like switch_to() that might require MMU definitions.
      
      asm/system.h has been disintegrated by this point on all arches into the
      following common segments:
      
       (1) asm/barrier.h
      
           Moved memory barrier definitions here.
      
       (2) asm/cmpxchg.h
      
           Moved xchg() and cmpxchg() here.  #included in asm/atomic.h.
      
       (3) asm/bug.h
      
           Moved die() and similar here.
      
       (4) asm/exec.h
      
           Moved arch_align_stack() here.
      
       (5) asm/elf.h
      
           Moved AT_VECTOR_SIZE_ARCH here.
      
       (6) asm/switch_to.h
      
           Moved switch_to() here.
      
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      96f951ed
  28. Mar 22, 2012
  29. Mar 21, 2012
Loading