- Oct 10, 2013
-
-
Bharat Bhushan authored
When the MM code is invalidating a range of pages, it calls the KVM kvm_mmu_notifier_invalidate_range_start() notifier function, which calls kvm_unmap_hva_range(), which arranges to flush all the TLBs for guest pages. However, the Linux PTEs for the range being flushed are still valid at that point. We are not supposed to establish any new references to pages in the range until the ...range_end() notifier gets called. The PPC-specific KVM code doesn't get any explicit notification of that; instead, we are supposed to use mmu_notifier_retry() to test whether we are or have been inside a range-flush notifier pair while we have been referencing a page. This patch calls mmu_notifier_retry() while mapping the guest page to ensure we are not referencing a page while a range invalidation is in progress. The call is inside a region locked with kvm->mmu_lock, which is the same lock that is taken by the KVM MMU notifier functions, thus ensuring that no new notification can proceed while we are in the locked region.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Alexander Graf <agraf@suse.de>
[Backported to 3.12 - Paolo]
Reviewed-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
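A minimal sketch of the retry pattern this commit relies on, assuming the 3.x-era core KVM API (kvm->mmu_notifier_seq, mmu_notifier_retry(), kvm->mmu_lock); the helper name and control flow are illustrative, not the patch itself:

```c
#include <linux/kvm_host.h>

/* Hypothetical helper illustrating the mmu_notifier_retry() pattern. */
static int example_map_guest_page(struct kvm *kvm, gfn_t gfn)
{
    unsigned long mmu_seq;

    /* Sample the notifier sequence before translating the gfn. */
    mmu_seq = kvm->mmu_notifier_seq;
    smp_rmb();

    /* ... translate gfn to a host pfn, possibly faulting the page in ... */

    spin_lock(&kvm->mmu_lock);
    if (mmu_notifier_retry(kvm, mmu_seq)) {
        /* A range invalidation ran (or is running); drop the mapping and retry. */
        spin_unlock(&kvm->mmu_lock);
        return -EAGAIN;
    }
    /* Safe to install the shadow TLB entry / HPTE here. */
    spin_unlock(&kvm->mmu_lock);
    return 0;
}
```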
-
Paul Mackerras authored
This fixes a typo in the code that saves the guest DSCR (Data Stream Control Register) into the kvm_vcpu_arch struct on guest exit. The effect of the typo was that the DSCR value was saved in the wrong place, so changes to the DSCR by the guest didn't persist across guest exit and entry, and some host kernel memory got corrupted.
Cc: stable@vger.kernel.org [v3.1+]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- Sep 04, 2013
-
-
Al Viro authored
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- Aug 29, 2013
-
-
Paul Mackerras authored
This reworks kvmppc_mmu_book3s_64_xlate() to make it check the large page bit in the hashed page table entries (HPTEs) it looks at, and to simplify and streamline the code. The checking of the first dword of each HPTE is now done with a single mask and compare operation, and all the code dealing with the matching HPTE, if we find one, is consolidated in one place in the main line of the function flow.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
- Aug 28, 2013
-
-
Paul Mackerras authored
It turns out that if we exit the guest due to an hcall instruction (sc 1), and the loading of the instruction in the guest exit path fails for any reason, the call to kvmppc_ld() in kvmppc_get_last_inst() fetches the instruction after the hcall instruction rather than the hcall itself. This in turn means that the instruction doesn't get recognized as an hcall in kvmppc_handle_exit_pr() but gets passed to the guest kernel as a sc instruction. That usually results in the guest kernel getting a return code of 38 (ENOSYS) from an hcall, which often triggers a BUG_ON() or other failure. This fixes the problem by adding a new variant of kvmppc_get_last_inst() called kvmppc_get_last_sc(), which fetches the instruction if necessary from pc - 4 rather than pc.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
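A hedged sketch of what a pc - 4 fetch variant could look like; kvmppc_get_pc(), kvmppc_ld() and KVM_INST_FETCH_FAILED are existing PPC KVM facilities, while the helper name and its exact behaviour are assumptions, not the actual kvmppc_get_last_sc():

```c
#include <linux/kvm_host.h>
#include <asm/kvm_ppc.h>

/* Illustrative only: refetch the "sc 1" that caused the exit. */
static u32 example_get_last_sc(struct kvm_vcpu *vcpu)
{
    ulong pc = kvmppc_get_pc(vcpu) - 4;
    u32 inst = KVM_INST_FETCH_FAILED;

    /* The exit advanced the guest PC past the sc, so reload from pc - 4.
     * A real implementation would also handle a failed load. */
    kvmppc_ld(vcpu, &pc, sizeof(u32), &inst, false);
    return inst;
}
```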
-
Paul Mackerras authored
Currently the code assumes that once we load up guest FP/VSX or VMX state into the CPU, it stays valid in the CPU registers until we explicitly flush it to the thread_struct. However, on POWER7, copy_page() and memcpy() can use VMX. These functions do flush the VMX state to the thread_struct before using VMX instructions, but if this happens while we have guest state in the VMX registers, and we then re-enter the guest, we don't reload the VMX state from the thread_struct, leading to guest corruption. This has been observed to cause guest processes to segfault. To fix this, we check before re-entering the guest that all of the bits corresponding to facilities owned by the guest, as expressed in vcpu->arch.guest_owned_ext, are set in current->thread.regs->msr. Any bits that have been cleared correspond to facilities that have been used by kernel code and thus flushed to the thread_struct, so for them we reload the state from the thread_struct. We also need to check current->thread.regs->msr before calling giveup_fpu() or giveup_altivec(), since if the relevant bit is clear, the state has already been flushed to the thread_struct and to flush it again would corrupt it.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
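A hedged C sketch of the giveup guard described above; MSR_FP/MSR_VEC, giveup_fpu() and giveup_altivec() are the real kernel facilities, while the helper name and structure are assumed:

```c
#include <linux/kvm_host.h>
#include <linux/sched.h>
#include <asm/reg.h>
#include <asm/switch_to.h>

/* Illustrative only: flush guest-owned facility state without corrupting it. */
static void example_giveup_guest_ext(struct kvm_vcpu *vcpu)
{
    unsigned long live = current->thread.regs->msr;

    if (vcpu->arch.guest_owned_ext & MSR_FP) {
        if (live & MSR_FP)
            giveup_fpu(current);      /* still in registers: flush to thread_struct */
        /* else: already flushed by kernel code (e.g. copy_page); flushing again
         * would overwrite the saved state with stale register contents. */
    }
    if (vcpu->arch.guest_owned_ext & MSR_VEC) {
        if (live & MSR_VEC)
            giveup_altivec(current);
    }
    /* Before re-entry, any guest-owned facility whose MSR bit is now clear
     * would be reloaded from current->thread (not shown here). */
}
```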
-
Paul Mackerras authored
Commit 8e44ddc3 ("powerpc/kvm/book3s: Add support for H_IPOLL and H_XIRR_X in XICS emulation") added a call to get_tb() but didn't include the header that defines it, and on some configs this means book3s_xics.c fails to compile: arch/powerpc/kvm/book3s_xics.c: In function ‘kvmppc_xics_hcall’: arch/powerpc/kvm/book3s_xics.c:812:3: error: implicit declaration of function ‘get_tb’ [-Werror=implicit-function-declaration]
Cc: stable@vger.kernel.org [v3.10, v3.11]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
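The fix is presumably a one-line include, since get_tb() is declared in asm/time.h; a sketch of the addition (not the verified diff):

```c
#include <asm/time.h>   /* for get_tb() */
```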
-
Thadeu Lima de Souza Cascardo authored
err was overwritten by a previous function call and then checked to be 0. If the following page allocation fails, 0 is returned instead of -ENOMEM.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
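A generic sketch of the error-path shape being fixed; the function, allocation call and variable names are illustrative, not the code touched by the patch:

```c
#include <linux/mm.h>
#include <linux/gfp.h>

/* Illustrative only: the allocation failure must report -ENOMEM, not a stale 0. */
static long example_alloc(struct page **out)
{
    long err = 0;               /* left over from an earlier, successful call */
    struct page *page;

    page = alloc_page(GFP_KERNEL | __GFP_ZERO);
    if (!page)
        return -ENOMEM;         /* before the fix: "return err;" returned 0 here */

    *out = page;
    return err;
}
```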
-
Chen Gang authored
'rmls' is 'unsigned long'; lpcr_rmls() returns a negative number when a failure occurs, so a type cast is needed for the comparison. 'lpid' is 'unsigned long'; kvmppc_alloc_lpid() returns a negative number when a failure occurs, so a type cast is needed for the comparison.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
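A self-contained illustration of the signedness pitfall; the stand-in function below is hypothetical and merely mimics lpcr_rmls()/kvmppc_alloc_lpid() returning a negative errno:

```c
#include <stdio.h>

/* Hypothetical stand-in for a callee that returns a size or a negative errno. */
static long returns_negative_errno(void)
{
    return -22;
}

int main(void)
{
    unsigned long rmls = returns_negative_errno();

    if (rmls < 0)               /* always false: rmls is unsigned */
        puts("never reached");
    if ((long)rmls < 0)         /* the fix: cast before comparing */
        puts("failure detected");
    return 0;
}
```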
-
- Aug 26, 2013
-
-
Yann Droneaud authored
KVM uses anon_inode_getfd() to allocate file descriptors as part of some of its ioctls. But those ioctls are lacking a flag argument allowing userspace to choose options for the newly opened file descriptor. In such a case it's advised to use O_CLOEXEC by default, so that userspace is allowed to choose, without race, whether the file descriptor is going to be inherited across exec(). This patch sets the O_CLOEXEC flag on all file descriptors created with anon_inode_getfd() to not leak file descriptors across exec().
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
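A sketch of the resulting call-site pattern; anon_inode_getfd() and its flags argument are the real API, while the helper, the "kvm-vcpu" name and the fops parameter are illustrative:

```c
#include <linux/kvm_host.h>
#include <linux/anon_inodes.h>
#include <linux/fcntl.h>

/* Illustrative only: pass O_CLOEXEC so the new fd is not inherited across exec(). */
static int example_create_vcpu_fd(struct kvm_vcpu *vcpu,
                                  const struct file_operations *fops)
{
    return anon_inode_getfd("kvm-vcpu", fops, vcpu, O_RDWR | O_CLOEXEC);
}
```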
-
- Aug 22, 2013
-
-
Aneesh Kumar K.V authored
Otherwise we would clear the pvr value.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
- Aug 14, 2013
-
-
Anton Blanchard authored
Our ppc64 spinlocks and rwlocks use a trick where a lock token and the paca index are placed in the lock with a single store. Since we are using two u16s, they need adjusting for little endian.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Anton Blanchard authored
The lppaca, slb_shadow and dtl_entry hypervisor structures are big endian, so we have to byte swap them in little endian builds. LE KVM hosts will also need to be fixed but for now add an #error to remind us.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Anton Blanchard authored
Although the shared_proc field in the lppaca works today, it is not architected. A shared processor partition will always have a non-zero yield_count, so use that instead. Create a wrapper so users don't have to know about the details. In order for older kernels to continue to work on KVM we need to set the shared_proc bit. While here, remove the ugly bitfield.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
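A hedged guess at what such a wrapper could look like; the helper name, the exact field handling and the endian conversion are assumptions, since the log doesn't show the real implementation:

```c
#include <linux/kernel.h>
#include <asm/lppaca.h>

/* Illustrative only: detect a shared-processor partition via yield_count. */
static inline bool example_lppaca_shared_proc(struct lppaca *l)
{
    /* The hypervisor only maintains yield_count for shared-processor LPARs. */
    return be32_to_cpu(l->yield_count) != 0;
}
```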
-
- Aug 09, 2013
-
-
Thadeu Lima de Souza Cascardo authored
err was overwritten by a previous function call and then checked to be 0. If the following page allocation fails, 0 is returned instead of -ENOMEM.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Chen Gang authored
'rmls' is 'unsigned long'; lpcr_rmls() returns a negative number when a failure occurs, so a type cast is needed for the comparison. 'lpid' is 'unsigned long'; kvmppc_alloc_lpid() returns a negative number when a failure occurs, so a type cast is needed for the comparison.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- Jul 30, 2013
-
-
Hongtao Jia authored
Opcode and xopcode are useful definitions not just for KVM. Move these definitions to asm/ppc-opcode.h for public use. Also add the opcodes for LHAUX and LWZUX.
Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
[scottwood@freescale.com: update commit message and rebase]
Signed-off-by: Scott Wood <scottwood@freescale.com>
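For reference, a sketch of the kind of definitions involved; the extended-opcode values come from the Power ISA (lhaux = 31/375, lwzux = 31/55), but the macro names are assumed and should be checked against asm/ppc-opcode.h:

```c
/* Illustrative only: primary opcode 31 indexed-with-update load forms. */
#define OP_31_XOP_LWZUX     55
#define OP_31_XOP_LHAUX     375
```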
-
- Jul 25, 2013
-
-
Paul Mackerras authored
Unlike the other general-purpose SPRs, SPRG3 can be read by usermode code, and is used in recent kernels to store the CPU and NUMA node numbers so that they can be read by VDSO functions. Thus we need to load the guest's SPRG3 value into the real SPRG3 register when entering the guest, and restore the host's value when exiting the guest. We don't need to save the guest SPRG3 value when exiting the guest as usermode code can't modify SPRG3.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
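A C-level sketch of the save/restore idea using the kernel's mfspr()/mtspr() accessors and SPRN_SPRG3; the real work happens in the guest entry/exit assembly, and the function here is purely illustrative:

```c
#include <asm/reg.h>

/* Illustrative only: swap SPRG3 around guest execution. */
static void example_switch_sprg3(unsigned long guest_sprg3)
{
    unsigned long host_sprg3 = mfspr(SPRN_SPRG3);   /* save host value */

    mtspr(SPRN_SPRG3, guest_sprg3);                 /* load guest value */
    /* ... run the guest ... */
    mtspr(SPRN_SPRG3, host_sprg3);                  /* restore host value on exit */
    /* No need to read the guest value back: usermode cannot write SPRG3. */
}
```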
-
- Jul 18, 2013
-
-
Takuya Yoshikawa authored
This is called right after the memslots are updated, i.e. when the result of update_memslots() gets installed in install_new_memslots(). Since the memslots need to be updated twice when we delete or move a memslot, kvm_arch_commit_memory_region() does not correspond to this exactly. In the following patch, x86 will use this new API to check if the mmio generation has reached its maximum value, in which case mmio sptes need to be flushed out.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Acked-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
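A sketch of the hook's likely shape for an architecture that needs no action; the signature is inferred from the description (later kernels changed it), so treat it as an assumption:

```c
#include <linux/kvm_host.h>

/* Illustrative only: a no-op implementation of the new arch hook. */
void kvm_arch_memslots_updated(struct kvm *kvm)
{
    /*
     * Called once the new memslots array is visible (install_new_memslots()).
     * x86 uses this point to detect mmio-generation wraparound and flush
     * mmio sptes; other architectures can leave it empty.
     */
}
```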
-
- Jul 11, 2013
-
-
Scott Wood authored
kvm_guest_enter() was already called by kvmppc_prepare_to_enter(). Don't call it again.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Scott Wood authored
Currently this is only being done on 64-bit. Rather than just move it out of the 64-bit ifdef, move it to kvmppc_lazy_ee_enable() so that it is consistent with lazy ee state, and so that we don't track more host code as interrupts-enabled than necessary. Rename kvmppc_lazy_ee_enable() to kvmppc_fix_ee_before_entry() to reflect that this function now has a role on 32-bit as well.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
- Jul 10, 2013
-
-
Paul Mackerras authored
The table of offsets to real-mode hcall handlers in book3s_hv_rmhandlers.S can contain negative values, if some of the handlers end up before the table in the vmlinux binary. Thus we need to use a sign-extending load to read the values in the table rather than a zero-extending load. Without this, the host crashes when the guest does one of the hcalls with negative offsets, due to jumping to a bogus address.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Paul Mackerras authored
This corrects the usage of the tlbie (TLB invalidate entry) instruction in HV KVM. The tlbie instruction changed between PPC970 and POWER7. On the PPC970, the bit to select large vs. small page is in the instruction, not in the RB register value. This changes the code to use the correct form on PPC970. On POWER7 we were calculating the AVAL (Abbreviated Virtual Address, Lower) field of the RB value incorrectly for 64k pages. This fixes it. Since we now have several cases to handle for the tlbie instruction, this factors out the code to do a sequence of tlbies into a new function, do_tlbies(), and calls that from the various places where the code was doing tlbie instructions inline. It also makes kvmppc_h_bulk_remove() use the same global_invalidates() function for determining whether to do local or global TLB invalidations as is used in other places, for consistency, and also to make sure that kvm->arch.need_tlb_flush gets updated properly.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
- Jul 08, 2013
-
-
Aneesh Kumar K.V authored
Both RMA and hash page table requests will be a multiple of 256K. We can use a chunk size of 256K to track the free/used 256K chunks in the bitmap. This should help to reduce the bitmap size.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Aneesh Kumar K.V authored
Older versions of the Power architecture use the Real Mode Offset register and the Real Mode Limit Selector for mapping the guest Real Mode Area. The guest RMA should be physically contiguous since we use the range when address translation is not enabled. This patch switches the RMA allocation code to use the contiguous memory allocator. It also removes the linear allocator, which is not used any more.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Aneesh Kumar K.V authored
The Powerpc architecture uses a hash-based page table mechanism for mapping virtual addresses to physical addresses, and requires this hash page table to be physically contiguous. With KVM on Powerpc we currently use an early reservation mechanism for allocating the guest hash page table. This implies that we need to reserve a big memory region to ensure we can create a large number of guests simultaneously with KVM on Power. Another disadvantage is that the reserved memory is not available to the rest of the subsystems, which implies we limit the total available memory in the host. This patch series switches the guest hash page table allocation to use the contiguous memory allocator.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Alexander Graf authored
We don't emulate breakpoints yet, so just ignore reads and writes to / from DABR. This fixes booting of more recent Linux guest kernels for me.
Reported-by: Nello Martuscielli <ppc.addon@gmail.com>
Tested-by: Nello Martuscielli <ppc.addon@gmail.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
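A hedged sketch of ignoring DABR in SPR emulation; SPRN_DABR is the real constant, but the surrounding function is illustrative rather than the actual mtspr emulation code:

```c
#include <asm/reg.h>

/* Illustrative only: treat DABR writes as no-ops instead of faulting. */
static void example_emulate_mtspr(int sprn, unsigned long val)
{
    switch (sprn) {
    case SPRN_DABR:
        /* Breakpoints are not emulated yet: accept and ignore the write. */
        break;
    default:
        /* Other SPRs are handled elsewhere in the real emulation code. */
        break;
    }
    (void)val;
}
```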
-
- Jun 30, 2013
-
-
Alexander Graf authored
While technically it's legal to write to PIR and have the identifier changed, we don't implement logic to do so because we simply expose vcpu_id to the guest. So instead, let's ignore writes to PIR. This ensures that we don't inject faults into the guest for something the guest is allowed to do. While at it, we cross our fingers hoping that it also doesn't mind that we broke its PIR read values.
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Paul Mackerras authored
At present, if the guest creates a valid SLB (segment lookaside buffer) entry with the slbmte instruction, then invalidates it with the slbie instruction, then reads the entry with the slbmfee/slbmfev instructions, the result of the slbmfee will have the valid bit set, even though the entry is not actually considered valid by the host. This is confusing, if not worse. This fixes it by zeroing out the orige and origv fields of the SLB entry structure when the entry is invalidated.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Paul Mackerras authored
With this, the guest can use 1TB segments as well as 256MB segments. Since we now have the situation where a single emulated guest segment could correspond to multiple shadow segments (as the shadow segments are still 256MB segments), this adds a new kvmppc_mmu_flush_segment() to scan for all shadow segments that need to be removed. This restructures the guest HPT (hashed page table) lookup code to use the correct hashing and matching functions for HPTEs within a 1TB segment. We use the standard hpt_hash() function instead of open-coding the hash calculation, and we use HPTE_V_COMPARE() with an AVPN value that has the B (segment size) field included. The calculation of avpn is done a little earlier since it doesn't change in the loop starting at the do_second label. The computation in kvmppc_mmu_book3s_64_esid_to_vsid() changes so that it returns a 256MB VSID even if the guest SLB entry is a 1TB entry. This is because the users of this function are creating 256MB SLB entries. We set a new VSID_1T flag so that entries created from 1T segments don't collide with entries from 256MB segments.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Paul Mackerras authored
The loop in kvmppc_mmu_book3s_64_xlate() that looks up a translation in the guest hashed page table (HPT) keeps going if it finds an HPTE that matches but doesn't allow access. This is incorrect; it is different from what the hardware does, and there should never be more than one matching HPTE anyway. This fixes it to stop when any matching HPTE is found.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Paul Mackerras authored
On entering a PR KVM guest, we invalidate the whole SLB before loading up the guest entries. We do this using an slbia instruction, which invalidates all entries except entry 0, followed by an slbie to invalidate entry 0. However, the slbie turns out to be ineffective in some circumstances (specifically when the host linear mapping uses 64k pages) because of errors in computing the parameter to the slbie. The result is that the guest kernel hangs very early in boot because it takes a DSI the first time it tries to access kernel data using a linear mapping address in real mode. Currently we construct bits 36 - 43 (big-endian numbering) of the slbie parameter by taking bits 56 - 63 of the SLB VSID doubleword. These bits for the slbie are C (class, 1 bit), B (segment size, 2 bits) and 5 reserved bits. For the SLB VSID doubleword these are C (class, 1 bit), reserved (1 bit), LP (large page size, 2 bits), and 4 reserved bits. Thus we are not setting the B field correctly, and when LP = 01 as it is for 64k pages, we are setting a reserved bit. Rather than add more instructions to calculate the slbie parameter correctly, this takes a simpler approach, which is to set entry 0 to zeroes explicitly. Normally slbmte should not be used to invalidate an entry, since it doesn't invalidate the ERATs, but it is OK to use it to invalidate an entry if it is immediately followed by slbia, which does invalidate the ERATs. (This has been confirmed with the Power architects.) This approach takes fewer instructions and will work whatever the contents of entry 0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
-
Paul Mackerras authored
This makes sure the calculation of the proto-VSIDs used by PR KVM is done with 64-bit arithmetic. Since vcpu3s->context_id[] is int, when we do vcpu3s->context_id[0] << ESID_BITS the shift will be done with 32-bit instructions, possibly leading to significant bits getting lost, as the context id can be up to 524283 and ESID_BITS is 18. To fix this we cast the context id to u64 before shifting.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
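A self-contained demonstration of the widening issue, reusing the ESID_BITS value (18) and maximum context id (524283) quoted above; everything else is illustrative:

```c
#include <stdio.h>
#include <stdint.h>

#define ESID_BITS 18    /* per the commit message above */

int main(void)
{
    int context_id = 524283;    /* the maximum mentioned above */

    /* Widening only *after* a 32-bit shift loses the high bits. */
    uint64_t truncated = (uint32_t)context_id << ESID_BITS;
    /* The fix: widen to 64 bits before shifting. */
    uint64_t correct   = (uint64_t)context_id << ESID_BITS;

    printf("truncated=%#llx correct=%#llx\n",
           (unsigned long long)truncated, (unsigned long long)correct);
    return 0;
}
```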
-
Tiejun Chen authored
Availability of the doorbell_exception function is guarded by CONFIG_PPC_DOORBELL. Use the same define to guard our caller of it.
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
[agraf: improve patch description]
Signed-off-by: Alexander Graf <agraf@suse.de>
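The shape of such a guard, as a sketch; doorbell_exception() is the guarded function named above, while the wrapper and the assumption that its declaration is already visible are mine:

```c
#include <asm/ptrace.h>

/* Illustrative only: compile the call out when the kernel has no doorbell support.
 * (The header declaring doorbell_exception() is assumed to be included already.) */
static void example_handle_doorbell(struct pt_regs *regs)
{
#ifdef CONFIG_PPC_DOORBELL
    doorbell_exception(regs);
#else
    (void)regs;     /* nothing to forward without doorbell support */
#endif
}
```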
-
- Jun 21, 2013
-
-
Aneesh Kumar K.V authored
We can find ptes that are splitting while walking page tables. Return a none pte in that case.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Aneesh Kumar K.V authored
Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly document why we don't need to handle transparent hugepages at callsites.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Aneesh Kumar K.V authored
If a hash bucket gets full, we "evict" a more or less random entry from it. When we do that we don't invalidate the TLB (hpte_remove) because we assume the old translation is still technically "valid". This implies that when we are invalidating or updating a pte, we should do a tlb invalidate even if the HPTE entry is not valid. With hugepages, we need to pass the correct actual page size value for the tlb invalidation. This change updates patch 0608d692 ("powerpc/mm: Always invalidate tlb on hpte invalidate and update") to handle transparent hugepages correctly.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- Jun 19, 2013
-
-
Scott Wood authored
kvmppc_lazy_ee_enable() should be called as late as possible, or else we get things like WARN_ON(preemptible()) in enable_kernel_fp() in configurations where preemptible() works. Note that book3s_pr already waits until just before __kvmppc_vcpu_run to call kvmppc_lazy_ee_enable().
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- Jun 11, 2013
-
-
Scott Wood authored
EE is hard-disabled on entry to kvmppc_handle_exit(), so call hard_irq_disable() so that PACA_IRQ_HARD_DIS is set and soft_enabled is unset. Without this, we get warnings such as the one at arch/powerpc/kernel/time.c:300, and sometimes the host kernel hangs.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
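A sketch of the intended shape of the exit handler after this change; hard_irq_disable() is the real helper, while the handler itself is illustrative:

```c
#include <linux/kvm_host.h>
#include <asm/hw_irq.h>

/* Illustrative only: record that EE is already hard-disabled on guest exit. */
static int example_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
                               unsigned int exit_nr)
{
    /*
     * The exception path enters with EE physically off; tell the lazy-EE
     * bookkeeping about it so PACA_IRQ_HARD_DIS and soft_enabled agree.
     */
    hard_irq_disable();

    /* ... normal exit handling for run/vcpu/exit_nr goes here ... */
    return 0;   /* the real code returns a RESUME_* value */
}
```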
-
Scott Wood authored
KVM core expects arch code to acquire the srcu lock when calling gfn_to_memslot and similar functions.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
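A hedged sketch of the locking pattern the KVM core expects around memslot lookups; srcu_read_lock()/srcu_read_unlock() and gfn_to_memslot() are the real core APIs, while the helper and the KVM_MEMSLOT_INVALID check are illustrative:

```c
#include <linux/kvm_host.h>

/* Illustrative only: take kvm->srcu around gfn_to_memslot() and friends. */
static bool example_gfn_is_backed(struct kvm_vcpu *vcpu, gfn_t gfn)
{
    struct kvm_memory_slot *slot;
    bool backed;
    int idx;

    idx = srcu_read_lock(&vcpu->kvm->srcu);
    slot = gfn_to_memslot(vcpu->kvm, gfn);
    backed = slot && !(slot->flags & KVM_MEMSLOT_INVALID);
    srcu_read_unlock(&vcpu->kvm->srcu, idx);

    return backed;
}
```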
-