Skip to content
Snippets Groups Projects
  1. Sep 05, 2013
  2. Aug 22, 2013
  3. Aug 20, 2013
    • Chuck Anderson's avatar
      xen/smp: initialize IPI vectors before marking CPU online · fc78d343
      Chuck Anderson authored
      
      An older PVHVM guest (v3.0 based) crashed during vCPU hot-plug with:
      
      	kernel BUG at drivers/xen/events.c:1328!
      
      RCU has detected that a CPU has not entered a quiescent state within the
      grace period.  It needs to send the CPU a reschedule IPI if it is not
      offline.  rcu_implicit_offline_qs() does this check:
      
      	/*
      	 * If the CPU is offline, it is in a quiescent state.  We can
      	 * trust its state not to change because interrupts are disabled.
      	 */
      	if (cpu_is_offline(rdp->cpu)) {
      		rdp->offline_fqs++;
      		return 1;
      	}
      
      	Else the CPU is online.  Send it a reschedule IPI.
      
      The CPU is in the middle of being hot-plugged and has been marked online
      (!cpu_is_offline()).  See start_secondary():
      
      	set_cpu_online(smp_processor_id(), true);
      	...
      	per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
      
      start_secondary() then waits for the CPU bringing up the hot-plugged CPU to
      mark it as active:
      
      	/*
      	 * Wait until the cpu which brought this one up marked it
      	 * online before enabling interrupts. If we don't do that then
      	 * we can end up waking up the softirq thread before this cpu
      	 * reached the active state, which makes the scheduler unhappy
      	 * and schedule the softirq thread on the wrong cpu. This is
      	 * only observable with forced threaded interrupts, but in
      	 * theory it could also happen w/o them. It's just way harder
      	 * to achieve.
      	 */
      	while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
      		cpu_relax();
      
      	/* enable local interrupts */
      	local_irq_enable();
      
      The CPU being hot-plugged will be marked active after it has been fully
      initialized by the CPU managing the hot-plug.  In the Xen PVHVM case
      xen_smp_intr_init() is called to set up the hot-plugged vCPU's
      XEN_RESCHEDULE_VECTOR.
      
      The hot-plugging CPU is marked online, not marked active and does not have
      its IPI vectors set up.  rcu_implicit_offline_qs() sees the hot-plugging
      cpu is !cpu_is_offline() and tries to send it a reschedule IPI:
      This will lead to:
      
      	kernel BUG at drivers/xen/events.c:1328!
      
      	xen_send_IPI_one()
      	xen_smp_send_reschedule()
      	rcu_implicit_offline_qs()
      	rcu_implicit_dynticks_qs()
      	force_qs_rnp()
      	force_quiescent_state()
      	__rcu_process_callbacks()
      	rcu_process_callbacks()
      	__do_softirq()
      	call_softirq()
      	do_softirq()
      	irq_exit()
      	xen_evtchn_do_upcall()
      
      because xen_send_IPI_one() will attempt to use an uninitialized IRQ for
      the XEN_RESCHEDULE_VECTOR.
      
      There is at least one other place that has caused the same crash:
      
      	xen_smp_send_reschedule()
      	wake_up_idle_cpu()
      	add_timer_on()
      	clocksource_watchdog()
      	call_timer_fn()
      	run_timer_softirq()
      	__do_softirq()
      	call_softirq()
      	do_softirq()
      	irq_exit()
      	xen_evtchn_do_upcall()
      	xen_hvm_callback_vector()
      
      clocksource_watchdog() uses cpu_online_mask to pick the next CPU to handle
      a watchdog timer:
      
      	/*
      	 * Cycle through CPUs to check if the CPUs stay synchronized
      	 * to each other.
      	 */
      	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
      	if (next_cpu >= nr_cpu_ids)
      		next_cpu = cpumask_first(cpu_online_mask);
      	watchdog_timer.expires += WATCHDOG_INTERVAL;
      	add_timer_on(&watchdog_timer, next_cpu);
      
      This resulted in an attempt to send an IPI to a hot-plugging CPU that
      had not initialized its reschedule vector. One option would be to make
      the RCU code check to not check for CPU offline but for CPU active.
      As becoming active is done after a CPU is online (in older kernels).
      
      But Srivatsa pointed out that "the cpu_active vs cpu_online ordering has been
      completely reworked - in the online path, cpu_active is set *before* cpu_online,
      and also, in the cpu offline path, the cpu_active bit is reset in the CPU_DYING
      notification instead of CPU_DOWN_PREPARE." Drilling in this the bring-up
      path: "[brought up CPU].. send out a CPU_STARTING notification, and in response
      to that, the scheduler sets the CPU in the cpu_active_mask. Again, this mask
      is better left to the scheduler alone, since it has the intelligence to use it
      judiciously."
      
      The conclusion was that:
      "
      1. At the IPI sender side:
      
         It is incorrect to send an IPI to an offline CPU (cpu not present in
         the cpu_online_mask). There are numerous places where we check this
         and warn/complain.
      
      2. At the IPI receiver side:
      
         It is incorrect to let the world know of our presence (by setting
         ourselves in global bitmasks) until our initialization steps are complete
         to such an extent that we can handle the consequences (such as
         receiving interrupts without crashing the sender etc.)
      " (from Srivatsa)
      
      As the native code enables the interrupts at some point we need to be
      able to service them. In other words a CPU must have valid IPI vectors
      if it has been marked online.
      
      It doesn't need to handle the IPI (interrupts may be disabled) but needs
      to have valid IPI vectors because another CPU may find it in cpu_online_mask
      and attempt to send it an IPI.
      
      This patch will change the order of the Xen vCPU bring-up functions so that
      Xen vectors have been set up before start_secondary() is called.
      It also will not continue to bring up a Xen vCPU if xen_smp_intr_init() fails
      to initialize it.
      
      Orabug 13823853
      Signed-off-by Chuck Anderson <chuck.anderson@oracle.com>
      Acked-by: default avatarSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      fc78d343
    • David Vrabel's avatar
      x86/xen: do not identity map UNUSABLE regions in the machine E820 · 3bc38cbc
      David Vrabel authored
      
      If there are UNUSABLE regions in the machine memory map, dom0 will
      attempt to map them 1:1 which is not permitted by Xen and the kernel
      will crash.
      
      There isn't anything interesting in the UNUSABLE region that the dom0
      kernel needs access to so we can avoid making the 1:1 mapping and
      treat it as RAM.
      
      We only do this for dom0, as that is where tboot case shows up.
      A PV domU could have an UNUSABLE region in its pseudo-physical map
      and would need to be handled in another patch.
      
      This fixes a boot failure on hosts with tboot.
      
      tboot marks a region in the e820 map as unusable and the dom0 kernel
      would attempt to map this region and Xen does not permit unusable
      regions to be mapped by guests.
      
        (XEN)  0000000000000000 - 0000000000060000 (usable)
        (XEN)  0000000000060000 - 0000000000068000 (reserved)
        (XEN)  0000000000068000 - 000000000009e000 (usable)
        (XEN)  0000000000100000 - 0000000000800000 (usable)
        (XEN)  0000000000800000 - 0000000000972000 (unusable)
      
      tboot marked this region as unusable.
      
        (XEN)  0000000000972000 - 00000000cf200000 (usable)
        (XEN)  00000000cf200000 - 00000000cf38f000 (reserved)
        (XEN)  00000000cf38f000 - 00000000cf3ce000 (ACPI data)
        (XEN)  00000000cf3ce000 - 00000000d0000000 (reserved)
        (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
        (XEN)  00000000fe000000 - 0000000100000000 (reserved)
        (XEN)  0000000100000000 - 0000000630000000 (usable)
      
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      [v1: Altered the patch and description with domU's with UNUSABLE regions]
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      3bc38cbc
    • Yinghai Lu's avatar
      x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM · 527bf129
      Yinghai Lu authored
      
      Dave Hansen reported that systems between 500G and 600G RAM
      crash early if DEBUG_PAGEALLOC is selected.
      
       > [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
       > [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
       > [    0.000000] BRK [0x02086000, 0x02086fff] PGTABLE
       > [    0.000000] BRK [0x02087000, 0x02087fff] PGTABLE
       > [    0.000000] BRK [0x02088000, 0x02088fff] PGTABLE
       > [    0.000000] init_memory_mapping: [mem 0xe80ee00000-0xe80effffff]
       > [    0.000000]  [mem 0xe80ee00000-0xe80effffff] page 4k
       > [    0.000000] BRK [0x02089000, 0x02089fff] PGTABLE
       > [    0.000000] BRK [0x0208a000, 0x0208afff] PGTABLE
       > [    0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory
      
      It turns out that we missed increasing needed pages in BRK to
      mapping initial 2M and [0,1M) when we switched to use the #PF
      handler to set memory mappings:
      
       > commit 8170e6be
       > Author: H. Peter Anvin <hpa@zytor.com>
       > Date:   Thu Jan 24 12:19:52 2013 -0800
       >
       >     x86, 64bit: Use a #PF handler to materialize early mappings on demand
      
      Before that, we had the maping from [0,512M) in head_64.S, and we
      can spare two pages [0-1M).  After that change, we can not reuse
      pages anymore.
      
      When we have more than 512M ram, we need an extra page for pgd page
      with [512G, 1024g).
      
      Increase pages in BRK for page table to solve the boot crash.
      
      Reported-by: default avatarDave Hansen <dave.hansen@intel.com>
      Bisected-by: default avatarDave Hansen <dave.hansen@intel.com>
      Tested-by: default avatarDave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      Cc: <stable@vger.kernel.org> # v3.9 and later
      Link: http://lkml.kernel.org/r/1376351004-4015-1-git-send-email-yinghai@kernel.org
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      527bf129
  4. Aug 14, 2013
  5. Aug 13, 2013
    • Oleg Nesterov's avatar
      sched: fix the theoretical signal_wake_up() vs schedule() race · e0acd0a6
      Oleg Nesterov authored
      
      This is only theoretical, but after try_to_wake_up(p) was changed
      to check p->state under p->pi_lock the code like
      
      	__set_current_state(TASK_INTERRUPTIBLE);
      	schedule();
      
      can miss a signal. This is the special case of wait-for-condition,
      it relies on try_to_wake_up/schedule interaction and thus it does
      not need mb() between __set_current_state() and if(signal_pending).
      
      However, this __set_current_state() can move into the critical
      section protected by rq->lock, now that try_to_wake_up() takes
      another lock we need to ensure that it can't be reordered with
      "if (signal_pending(current))" check inside that section.
      
      The patch is actually one-liner, it simply adds smp_wmb() before
      spin_lock_irq(rq->lock). This is what try_to_wake_up() already
      does by the same reason.
      
      We turn this wmb() into the new helper, smp_mb__before_spinlock(),
      for better documentation and to allow the architectures to change
      the default implementation.
      
      While at it, kill smp_mb__after_lock(), it has no callers.
      
      Perhaps we can also add smp_mb__before/after_spinunlock() for
      prepare_to_wait().
      
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0acd0a6
  6. Aug 12, 2013
  7. Aug 10, 2013
    • Daniel Drake's avatar
      x86: Don't clear olpc_ofw_header when sentinel is detected · d55e37bb
      Daniel Drake authored
      
      OpenFirmware wasn't quite following the protocol described in boot.txt
      and the kernel has detected this through use of the sentinel value
      in boot_params. OFW does zero out almost all of the stuff that it should
      do, but not the sentinel.
      
      This causes the kernel to clear olpc_ofw_header, which breaks x86 OLPC
      support.
      
      OpenFirmware has now been fixed. However, it would be nice if we could
      maintain Linux compatibility with old firmware versions. To do that, we just
      have to avoid zeroing out olpc_ofw_header.
      
      OFW does not write to any other parts of the header that are being zapped
      by the sentinel-detection code, and all users of olpc_ofw_header are
      somewhat protected through checking for the OLPC_OFW_SIG magic value
      before using it. So this should not cause any problems for anyone.
      
      Signed-off-by: default avatarDaniel Drake <dsd@laptop.org>
      Link: http://lkml.kernel.org/r/20130809221420.618E6FAB03@dev.laptop.org
      
      
      Acked-by: default avatarYinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@vger.kernel.org> # v3.9+
      d55e37bb
  8. Aug 05, 2013
    • Vince Weaver's avatar
      perf/x86: Fix intel QPI uncore event definitions · c9601247
      Vince Weaver authored
      
      John McCalpin reports that the "drs_data" and "ncb_data" QPI
      uncore events are missing the "extra bit" and always return zero
      values unless the bit is properly set.
      
      More details from him:
      
       According to the Xeon E5-2600 Product Family Uncore Performance
       Monitoring Guide, Table 2-94, about 1/2 of the QPI Link Layer events
       (including the ones that "perf" calls "drs_data" and "ncb_data") require
       that the "extra bit" be set.
      
       This was confusing for a while -- a note at the bottom of page 94 says
       that the "extra bit" is bit 16 of the control register.
       Unfortunately, Table 2-86 clearly says that bit 16 is reserved and must
       be zero.  Looking around a bit, I found that bit 21 appears to be the
       correct "extra bit", and further investigation shows that "perf" actually
       agrees with me:
      	[root@c560-003.stampede]# cat /sys/bus/event_source/devices/uncore_qpi_0/format/event
      	config:0-7,21
      
       So the command
      	# perf -e "uncore_qpi_0/event=drs_data/"
       Is the same as
      	# perf -e "uncore_qpi_0/event=0x02,umask=0x08/"
       While it should be
      	# perf -e "uncore_qpi_0/event=0x102,umask=0x08/"
      
       I confirmed that this last version gives results that agree with the
       amount of data that I expected the STREAM benchmark to move across the QPI
       link in the second (cross-chip) test of the original script.
      
      Reported-by: default avatarJohn McCalpin <mccalpin@tacc.utexas.edu>
      Signed-off-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Cc: zheng.z.yan@intel.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1308021037280.26119@vincent-weaver-1.um.maine.edu
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      c9601247
  9. Jul 31, 2013
  10. Jul 29, 2013
  11. Jul 26, 2013
  12. Jul 24, 2013
  13. Jul 23, 2013
  14. Jul 19, 2013
  15. Jul 18, 2013
  16. Jul 17, 2013
    • Kees Cook's avatar
      x86: Make sure IDT is page aligned · 4df05f36
      Kees Cook authored
      
      Since the IDT is referenced from a fixmap, make sure it is page aligned.
      Merge with 32-bit one, since it was already aligned to deal with F00F
      bug. Since bss is cleared before IDT setup, it can live there. This also
      moves the other *_idt_table variables into common locations.
      
      This avoids the risk of the IDT ever being moved in the bss and having
      the mapping be offset, resulting in calling incorrect handlers. In the
      current upstream kernel this is not a manifested bug, but heavily patched
      kernels (such as those using the PaX patch series) did encounter this bug.
      
      The tables other than idt_table technically do not need to be page
      aligned, at least not at the current time, but using a common
      declaration avoids mistakes.  On 64 bits the table is exactly one page
      long, anyway.
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Link: http://lkml.kernel.org/r/20130716183441.GA14232@www.outflux.net
      
      
      Reported-by: default avatarPaX Team <pageexec@gmail.com>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      4df05f36
  17. Jul 15, 2013
    • H. Peter Anvin's avatar
      x86, suspend: Handle CPUs which fail to #GP on RDMSR · 5ff560fd
      H. Peter Anvin authored
      
      There are CPUs which have errata causing RDMSR of a nonexistent MSR to
      not fault.  We would then try to WRMSR to restore the value of that
      MSR, causing a crash.  Specifically, some Pentium M variants would
      have this problem trying to save and restore the non-existent EFER,
      causing a crash on resume.
      
      Work around this by making sure we can write back the result at
      suspend time.
      
      Huge thanks to Christian Sünkenberg for finding the offending erratum
      that finally deciphered the mystery.
      
      Reported-and-tested-by: default avatarJohan Heinrich <onny@project-insanity.org>
      Debugged-by: default avatarChristian Sünkenberg <christian.suenkenberg@student.kit.edu>
      Acked-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      Link: http://lkml.kernel.org/r/51DDC972.3010005@student.kit.edu
      Cc: <stable@vger.kernel.org> # v3.7+
      5ff560fd
    • Paul Gortmaker's avatar
      x86: delete __cpuinit usage from all x86 files · 148f9bb8
      Paul Gortmaker authored
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      Note that some harmless section mismatch warnings may result, since
      notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
      are flagged as __cpuinit  -- so if we remove the __cpuinit from
      arch specific callers, we will also get section mismatch warnings.
      As an intermediate step, we intend to turn the linux/init.h cpuinit
      content into no-ops as early as possible, since that will get rid
      of these warnings.  In any case, they are temporary and harmless.
      
      This removes all the arch/x86 uses of the __cpuinit macros from
      all C files.  x86 only had the one __CPUINIT used in assembly files,
      and it wasn't paired off with a .previous or a __FINIT, so we can
      delete it directly w/o any corresponding additional change there.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      148f9bb8
  18. Jul 12, 2013
  19. Jul 11, 2013
  20. Jul 09, 2013
Loading