- Nov 13, 2013
-
-
Anthoine Bourgeois authored
If a nested guest does a NM fault but its CR0 doesn't contain the TS flag (because it was already cleared by the guest with L1 aid) then we have to activate FPU ourselves in L0 and then continue to L2. If TS flag is set then we fallback on the previous behavior, forward the fault to L1 if it asked for. Signed-off-by:
Anthoine Bourgeois <bourgeois@bertin.fr> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 07, 2013
-
-
Borislav Petkov authored
We need to copy padding to kernel space first before looking at it. Reported-by:
kbuild test robot <fengguang.wu@intel.com> Signed-off-by:
Borislav Petkov <bp@suse.de> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
- Nov 06, 2013
-
-
Josh Triplett authored
The prototype for kvm_check_iopl appeared in commit f850e2e6 ("KVM: x86 emulator: Check IOPL level during io instruction emulation"), but the function never actually existed. Remove the prototype. Signed-off-by:
Josh Triplett <josh@joshtriplett.org> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
Josh Triplett authored
complete_pio ceased to exist in commit 7972995b ("KVM: x86 emulator: Move string pio emulation into emulator.c"), but the prototype remained. Remove its prototype. Signed-off-by:
Josh Triplett <josh@joshtriplett.org> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
Marcelo Tosatti authored
In certain occasions it is possible for a hung task detector positive to be false: continuation from a paused VM, for example. Add a method to reset detection, similar as is done with other kernel watchdogs. Acked-by:
Don Zickus <dzickus@redhat.com> Acked-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
Marcelo Tosatti authored
Implement reset of kernel watchdogs at pvclock read time. This avoids adding special code to every watchdog. This is possible for watchdogs which measure time based on sched_clock() or ktime_get() variants. Suggested by Don Zickus. Acked-by:
Don Zickus <dzickus@redhat.com> Acked-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
Michael S. Tsirkin authored
I noticed that srcu_read_lock/unlock both have a memory barrier, so just by moving srcu_read_unlock earlier we can get rid of one call to smp_mb() using smp_mb__after_srcu_read_unlock instead. Unsurprisingly, the gain is small but measureable using the unit test microbenchmark: before vmcall in the ballpark of 1410 cycles after vmcall in the ballpark of 1360 cycles Signed-off-by:
Michael S. Tsirkin <mst@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
- Nov 05, 2013
-
-
Gleb Natapov authored
Currently cpuid emulation is traced only when executed by intercept. Move trace point so that emulator invocation is traced too. Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
Gleb Natapov authored
Make code shorter. Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
Gleb Natapov authored
All decode_register() callers check if instruction has rex prefix to properly decode one byte operand. It make sense to move the check inside. Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
- Nov 03, 2013
-
-
Paolo Bonzini authored
When I was looking at RHEL5.9's failure to start with unrestricted_guest=0/emulate_invalid_guest_state=1, I got it working with a slightly older tree than kvm.git. I now debugged the remaining failure, which was introduced by commit 660696d1 (KVM: X86 emulator: fix source operand decoding for 8bit mov[zs]x instructions, 2013-04-24) introduced a similar mis-emulation to the one in commit 8acb4207 (KVM: fix sil/dil/bpl/spl in the mod/rm fields, 2013-05-30). The incorrect decoding occurs in 8-bit movzx/movsx instructions whose 8-bit operand is sil/dil/bpl/spl. Needless to say, "movzbl %bpl, %eax" does occur in RHEL5.9's decompression prolog, just a handful of instructions before finally giving control to the decompressed vmlinux and getting out of the invalid guest state. Because OpMem8 bypasses decode_modrm, the same handling of the REX prefix must be applied to OpMem8. Reported-by:
Michele Baldessari <michele@redhat.com> Cc: stable@vger.kernel.org Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
- Oct 31, 2013
-
-
Paolo Bonzini authored
Yet another instruction that we fail to emulate, this time found in Windows 2008R2 32-bit. Reviewed-by:
Gleb Natapov <gleb@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Michael S. Tsirkin authored
mst can't be blamed for lack of switch entries: the issue is with msrs actually. Signed-off-by:
Michael S. Tsirkin <mst@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The loop was always using 0 as the index. This means that any rubbish after the first element of the array went undetected. It seems reasonable to assume that no KVM userspace did that. Reviewed-by:
Gleb Natapov <gleb@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The KVM_SET_XCRS ioctl must accept anything that KVM_GET_XCRS could return. XCR0's bit 0 is always 1 in real processors with XSAVE, and KVM_GET_XCRS will always leave bit 0 set even if the emulated processor does not have XSAVE. So, KVM_SET_XCRS must ignore that bit when checking for attempts to enable unsupported save states. Reviewed-by:
Gleb Natapov <gleb@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Oct 30, 2013
-
-
Alex Williamson authored
We currently use some ad-hoc arch variables tied to legacy KVM device assignment to manage emulation of instructions that depend on whether non-coherent DMA is present. Create an interface for this, adapting legacy KVM device assignment and adding VFIO via the KVM-VFIO device. For now we assume that non-coherent DMA is possible any time we have a VFIO group. Eventually an interface can be developed as part of the VFIO external user interface to query the coherency of a group. Signed-off-by:
Alex Williamson <alex.williamson@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Alex Williamson authored
Default to operating in coherent mode. This simplifies the logic when we switch to a model of registering and unregistering noncoherent I/O with KVM. Signed-off-by:
Alex Williamson <alex.williamson@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Alex Williamson authored
So far we've succeeded at making KVM and VFIO mostly unaware of each other, but areas are cropping up where a connection beyond eventfds and irqfds needs to be made. This patch introduces a KVM-VFIO device that is meant to be a gateway for such interaction. The user creates the device and can add and remove VFIO groups to it via file descriptors. When a group is added, KVM verifies the group is valid and gets a reference to it via the VFIO external user interface. Signed-off-by:
Alex Williamson <alex.williamson@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Borislav Petkov authored
This basically came from the need to be able to boot 32-bit Atom SMP guests on an AMD host, i.e. a host which doesn't support MOVBE. As a matter of fact, qemu has since recently received MOVBE support but we cannot share that with kvm emulation and thus we have to do this in the host. We're waay faster in kvm anyway. :-) So, we piggyback on the #UD path and emulate the MOVBE functionality. With it, an 8-core SMP guest boots in under 6 seconds. Also, requesting MOVBE emulation needs to happen explicitly to work, i.e. qemu -cpu n270,+movbe... Just FYI, a fairly straight-forward boot of a MOVBE-enabled 3.9-rc6+ kernel in kvm executes MOVBE ~60K times. Signed-off-by:
Andre Przywara <andre@andrep.de> Signed-off-by:
Borislav Petkov <bp@suse.de> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Borislav Petkov authored
Add initial support for handling three-byte instructions in the emulator. Signed-off-by:
Borislav Petkov <bp@suse.de> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Borislav Petkov authored
Call it EmulateOnUD which is exactly what we're trying to do with vendor-specific instructions. Rename ->only_vendor_specific_insn to something shorter, while at it. Signed-off-by:
Borislav Petkov <bp@suse.de> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Borislav Petkov authored
Add a field to the current emulation context which contains the instruction opcode length. This will streamline handling of opcodes of different length. Signed-off-by:
Borislav Petkov <bp@suse.de> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Borislav Petkov authored
Add a kvm ioctl which states which system functionality kvm emulates. The format used is that of CPUID and we return the corresponding CPUID bits set for which we do emulate functionality. Make sure ->padding is being passed on clean from userspace so that we can use it for something in the future, after the ioctl gets cast in stone. s/kvm_dev_ioctl_get_supported_cpuid/kvm_dev_ioctl_get_cpuid/ while at it. Signed-off-by:
Borislav Petkov <bp@suse.de> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Oct 28, 2013
-
-
Jan Kiszka authored
If the host supports it, we can and should expose it to the guest as well, just like we already do with PIN_BASED_VIRTUAL_NMIS. Signed-off-by:
Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Jan Kiszka authored
__vmx_complete_interrupts stored uninjected NMIs in arch.nmi_injected, not arch.nmi_pending. So we actually need to check the former field in vmcs12_save_pending_event. This fixes the eventinj unit test when run in nested KVM. Signed-off-by:
Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Jan Kiszka authored
As long as the hardware provides us 2MB EPT pages, we can also expose them to the guest because our shadow EPT code already supports this feature. Signed-off-by:
Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Oct 17, 2013
-
-
Aneesh Kumar K.V authored
We will use that in the later patch to find the kvm ops handler Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by:
Alexander Graf <agraf@suse.de>
-
- Oct 15, 2013
-
-
Raghavendra K T authored
We use jump label to enable pv-spinlock. With the changes in (442e0973 Merge branch 'x86/jumplabel'), the jump label behaviour has changed that would result in eventual hang of the VM since we would end up in a situation where slow path locks would halt the vcpus but we will not be able to wakeup the vcpu by lock releaser using unlock kick. Similar problem in Xen and more detailed description is available in a945928e (xen: Do not enable spinlocks before jump_label_init() has executed) This patch splits kvm_spinlock_init to separate jump label changes with pvops patching and also make jump label enabling after jump_label_init(). Signed-off-by:
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Steven Rostedt <rostedt@goodmis.org> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
chai wen authored
Page pinning is not mandatory in kvm async page fault processing since after async page fault event is delivered to a guest it accesses page once again and does its own GUP. Drop the FOLL_GET flag in GUP in async_pf code, and do some simplifying in check/clear processing. Suggested-by:
Gleb Natapov <gleb@redhat.com> Signed-off-by:
Gu zheng <guz.fnst@cn.fujitsu.com> Signed-off-by:
chai wen <chaiw.fnst@cn.fujitsu.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
- Oct 14, 2013
-
-
Christoffer Dall authored
The gfn_to_index function relies on huge page defines which either may not make sense on systems that don't support huge pages or are defined in an unconvenient way for other architectures. Since this is x86-specific, move the function to arch/x86/include/asm/kvm_host.h. Signed-off-by:
Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
- Oct 11, 2013
-
-
Ingo Molnar authored
Fengguang Wu, Oleg Nesterov and Peter Zijlstra tracked down a kernel crash to a GCC bug: GCC miscompiles certain 'asm goto' constructs, as outlined here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 Implement a workaround suggested by Jakub Jelinek. Reported-and-tested-by:
Fengguang Wu <fengguang.wu@intel.com> Reported-by:
Oleg Nesterov <oleg@redhat.com> Reported-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Suggested-by:
Jakub Jelinek <jakub@redhat.com> Reviewed-by:
Richard Henderson <rth@twiddle.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@kernel.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- Oct 10, 2013
-
-
Arthur Chunqi Li authored
This patch contains the following two changes: 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 with some reasons not emulated by L1, preemption timer value should be save in such exits. 2. Add support of "Save VMX-preemption timer value" VM-Exit controls to nVMX. With this patch, nested VMX preemption timer features are fully supported. Signed-off-by:
Arthur Chunqi Li <yzt356@gmail.com> Reviewed-by:
Gleb Natapov <gleb@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Frediano Ziglio authored
Due to the way kernel is initialized under Xen is possible that the ring1 selector used by the kernel for the boot cpu end up to be copied to userspace leading to segmentation fault in the userspace. Xen code in the kernel initialize no-boot cpus with correct selectors (ds and es set to __USER_DS) but the boot one keep the ring1 (passed by Xen). On task context switch (switch_to) we assume that ds, es and cs already point to __USER_DS and __KERNEL_CSso these selector are not changed. If processor is an Intel that support sysenter instruction sysenter/sysexit is used so ds and es are not restored switching back from kernel to userspace. In the case the selectors point to a ring1 instead of __USER_DS the userspace code will crash on first memory access attempt (to be precise Xen on the emulated iret used to do sysexit will detect and set ds and es to zero which lead to GPF anyway). Now if an userspace process call kernel using sysenter and get rescheduled (for me it happen on a specific init calling wait4) could happen that the ring1 selector is set to ds and es. This is quite hard to detect cause after a while these selectors are fixed (__USER_DS seems sticky). Bisecting the code commit 7076aada appears to be the first one that have this issue. Signed-off-by:
Frediano Ziglio <frediano.ziglio@citrix.com> Signed-off-by:
Stefano Stabellini <stefano.stabellini@eu.citrix.com> Reviewed-by:
Andrew Cooper <andrew.cooper3@citrix.com>
-
Gleb Natapov authored
72f85795 broke shadow on EPT. This patch reverts it and fixes PAE on nEPT (which reverted commit fixed) in other way. Shadow on EPT is now broken because while L1 builds shadow page table for L2 (which is PAE while L2 is in real mode) it never loads L2's GUEST_PDPTR[0-3]. They do not need to be loaded because without nested virtualization HW does this during guest entry if EPT is disabled, but in our case L0 emulates L2's vmentry while EPT is enables, so we cannot rely on vmcs12->guest_pdptr[0-3] to contain up-to-date values and need to re-read PDPTEs from L2 memory. This is what kvm_set_cr3() is doing, but by clearing cache bits during L2 vmentry we drop values that kvm_set_cr3() read from memory. So why the same code does not work for PAE on nEPT? kvm_set_cr3() reads pdptes into vcpu->arch.walk_mmu->pdptrs[]. walk_mmu points to vcpu->arch.nested_mmu while nested guest is running, but ept_load_pdptrs() uses vcpu->arch.mmu which contain incorrect values. Fix that by using walk_mmu in ept_(load|save)_pdptrs. Signed-off-by:
Gleb Natapov <gleb@redhat.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Tested-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Oct 06, 2013
-
-
Ville Syrjälä authored
Dell Latitude E5410 needs reboot=pci to actually reboot. Signed-off-by:
Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://lkml.kernel.org/r/1380888964-14517-1-git-send-email-ville.syrjala@linux.intel.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- Oct 05, 2013
-
-
Bjorn Helgaas authored
This reverts commit 07f9b61c. 07f9b61c was intended to be a cleanup that didn't change anything, but in fact, for systems without _CBA (which is almost everything), it broke extended config space for domain 0 and all config space for other domains. Reference: http://lkml.kernel.org/r/20131004011806.GE20450@dangermouse.emea.sgi.com Reported-by:
Hedi Berriche <hedi@sgi.com> Signed-off-by:
Bjorn Helgaas <bhelgaas@google.com>
-
- Oct 04, 2013
-
-
Thomas Petazzoni authored
Commit ebd97be6 ('PCI: remove ARCH_SUPPORTS_MSI kconfig option') removed the ARCH_SUPPORTS_MSI option which architectures could select to indicate that they support MSI. Now, all architectures are supposed to build fine when MSI support is enabled: instead of having the architecture tell *when* MSI support can be used, it's up to the architecture code to ensure that MSI support can be enabled. On x86, commit ebd97be6 removed the following line: select ARCH_SUPPORTS_MSI if (X86_LOCAL_APIC && X86_IO_APIC) Which meant that MSI support was only available when the local APIC and I/O APIC were enabled. While this is always true on SMP or x86-64, it is not necessarily the case on i386 !SMP. The below patch makes sure that the local APIC and I/O APIC support is always enabled when MSI support is enabled. To do so, it: * Ensures the X86_UP_APIC option is not visible when PCI_MSI is enabled. This is the option that allows, on UP machines, to enable or not the APIC support. It is already not visible on SMP systems, or x86-64 systems, for example. We're simply also making it invisible on i386 MSI systems. * Ensures that the X86_LOCAL_APIC and X86_IO_APIC options are 'y' when PCI_MSI is enabled. Notice that this change requires a change in drivers/iommu/Kconfig to avoid a recursive Kconfig dependencey. The AMD_IOMMU option selects PCI_MSI, but was depending on X86_IO_APIC. This dependency is no longer needed: as soon as PCI_MSI is selected, the presence of X86_IO_APIC is guaranteed. Moreover, the AMD_IOMMU already depended on X86_64, which already guaranteed that X86_IO_APIC was enabled, so this dependency was anyway redundant. Signed-off-by:
Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Link: http://lkml.kernel.org/r/1380794354-9079-1-git-send-email-thomas.petazzoni@free-electrons.com Reported-by:
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by:
Bjorn Helgaas <bhelgaas@google.com> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Peter Zijlstra authored
Currently the cap_user_time_zero capability has different tests than cap_user_time; even though they expose the exact same data. Switch from CONSTANT && NONSTOP to sched_clock_stable to also deal with multi cabinet machines and drop the tsc_disabled() check.. non of this will work sanely without tsc anyway. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-nmgn0j0muo1r4c94vlfh23xy@git.kernel.org Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- Oct 03, 2013
-
-
Paolo Bonzini authored
kvm_mmu initialization is mostly filling in function pointers, there is no way for it to fail. Clean up unused return values. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-
Paolo Bonzini authored
They do the same thing, and destroy_kvm_mmu can be confused with kvm_mmu_destroy. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Gleb Natapov <gleb@redhat.com>
-