Pandora Sourcecodes - pandora-kernel.git/log

KVM: PPC: Book3S HV: Fix updates of vcpu->cpu

This removes the powerpc "generic" updates of vcpu->cpu in load and
put, and moves them to the various backends.

The reason is that "HV" KVM does its own sauce with that field
and the generic updates might corrupt it. The field contains the
CPU# of the -first- HW CPU of the core always for all the VCPU
threads of a core (the one that's online from a host Linux
perspective).

However, the preempt notifiers are going to be called on the
threads VCPUs when they are running (due to them sleeping on our
private waitqueue) causing unload to be called, potentially
clobbering the value.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: Move some PPC ioctl definitions to the correct place

This moves the definitions of KVM_CREATE_SPAPR_TCE and
KVM_ALLOCATE_RMA in include/linux/kvm.h from the section listing the
vcpu ioctls to the section listing VM ioctls, as these are both
implemented and documented as VM ioctls.

Fortunately there is no actual collision of ioctl numbers at this
point. Moving these to the correct section will reduce the
probability of a future collision. This does not change the
user/kernel ABI at all.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

This adds an implementation of kvm_arch_flush_shadow_memslot for
Book3S HV, and arranges for kvmppc_core_commit_memory_region to
flush the dirty log when modifying an existing slot.  With this,
we can handle deletion and modification of memory slots.

kvm_arch_flush_shadow_memslot calls kvmppc_core_flush_memslot, which
on Book3S HV now traverses the reverse map chains to remove any HPT
(hashed page table) entries referring to pages in the memslot.  This
gets called by generic code whenever deleting a memslot or changing
the guest physical address for a memslot.

We flush the dirty log in kvmppc_core_commit_memory_region for
consistency with what x86 does.  We only need to flush when an
existing memslot is being modified, because for a new memslot the
rmap array (which stores the dirty bits) is all zero, meaning that
every page is considered clean already, and when deleting a memslot
we obviously don't care about the dirty bits any more.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Move kvm->arch.slot_phys into memslot.arch

Now that we have an architecture-specific field in the kvm_memory_slot
structure, we can use it to store the array of page physical addresses
that we need for Book3S HV KVM on PPC970 processors. This reduces the
size of struct kvm_arch for Book3S HV, and also reduces the size of
struct kvm_arch_memory_slot for other PPC KVM variants since the fields
in it are now only compiled in for Book3S HV.

This necessitates making the kvm_arch_create_memslot and
kvm_arch_free_memslot operations specific to each PPC KVM variant.
That in turn means that we now don't allocate the rmap arrays on
Book3S PR and Book E.

Since we now unpin pages and free the slot_phys array in
kvmppc_core_free_memslot, we no longer need to do it in
kvmppc_core_destroy_vm, since the generic code takes care to free
all the memslots when destroying a VM.

We now need the new memslot to be passed in to
kvmppc_core_prepare_memory_region, since we need to initialize its
arch.slot_phys member on Book3S HV.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S HV: Take the SRCU read lock before looking up memslots

The generic KVM code uses SRCU (sleeping RCU) to protect accesses
to the memslots data structures against updates due to userspace
adding, modifying or removing memory slots.  We need to do that too,
both to avoid accessing stale copies of the memslots and to avoid
lockdep warnings.  This therefore adds srcu_read_lock/unlock pairs
around code that accesses and uses memslots.

Since the real-mode handlers for H_ENTER, H_REMOVE and H_BULK_REMOVE
need to access the memslots, and we don't want to call the SRCU code
in real mode (since we have no assurance that it would only access
the linear mapping), we hold the SRCU read lock for the VM while
in the guest.  This does mean that adding or removing memory slots
while some vcpus are executing in the guest will block for up to
two jiffies.  This tradeoff is acceptable since adding/removing
memory slots only happens rarely, while H_ENTER/H_REMOVE/H_BULK_REMOVE
are performance-critical hot paths.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro

The current form of DO_KVM macro restricts its use to one call per input
parameter set. This is caused by kvmppc_resume_\intno\()_\srr1 symbol
definition.
Duplicate calls of DO_KVM are required by distinct implementations of
exeption handlers which are delegated at runtime. Use a rare label number
to avoid conflicts with the calling contexts.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: Support FPU on non-hv systems

When running on HV aware hosts, we can not trap when the guest sets the FP
bit, so we just let it do so when it wants to, because it has full access to
MSR.

For non-HV aware hosts with an FPU (like 440), we need to also adjust the
shadow MSR though. Otherwise the guest gets an FP unavailable trap even when
it really enabled the FP bit in MSR.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: 440: Implement mfdcrx

We need mfdcrx to execute properly on 460 cores.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: 440: Implement mtdcrx

We need mtdcrx to execute properly on 460 cores.

Signed-off-by: Alexander Graf <agraf@suse.de>

Document IACx/DACx registers access using ONE_REG API

Patch to access the debug registers (IACx/DACx) using ONE_REG api
was sent earlier. But that missed the respective documentation.

Also corrected the index number referencing in section 4.69

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: E500: Remove E500_TLB_DIRTY flag

Since we always mark pages as dirty immediately when mapping them read/write
now, there's no need for the dirty flag in our cache.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Use symbols for exit trace

Exit traces are a lot easier to read when you don't have to remember
cryptic numbers for guest exit reasons. Symbolify them in our trace
output.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: Add MCSR SPR support

Add support for the MCSR SPR. This only implements the SPR storage
bits, not actual machine checks.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: 44x: Initialize PVR

We need to make sure that vcpu->arch.pvr is initialized to a sane value,
so let's just take the host PVR.

Signed-off-by: Alexander Graf <agraf@suse.de>

booke: Added ONE_REG interface for IAC/DAC debug registers

IAC/DAC are defined as 32 bit while they are 64 bit wide. So ONE_REG
interface is added to set/get them.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: booke: Add watchdog emulation

This patch adds the watchdog emulation in KVM. The watchdog
emulation is enabled by KVM_ENABLE_CAP(KVM_CAP_PPC_BOOKE_WATCHDOG) ioctl.
The kernel timer are used for watchdog emulation and emulates
h/w watchdog state machine. On watchdog timer expiry, it exit to QEMU
if TCR.WRC is non ZERO. QEMU can reset/shutdown etc depending upon how
it is configured.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
[bharat.bhushan@freescale.com: reworked patch]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
[agraf: adjust to new request framework]
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Add return value to core_check_requests

Requests may want to tell us that we need to go back into host state,
so add a return value for the checks.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Add return value in prepare_to_enter

Our prepare_to_enter helper wants to be able to return in more circumstances
to the host than only when an interrupt is pending. Broaden the interface a
bit and move even more generic code to the generic helper.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Ignore EXITING_GUEST_MODE mode

We don't need to do anything when mode is EXITING_GUEST_MODE, because
we essentially are outside of guest mode and did everything it asked
us to do by the time we check it.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Move kvm_guest_enter call into generic code

We need to call kvm_guest_enter in booke and book3s, so move its
call to generic code.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S: PR: Rework irq disabling

Today, we disable preemption while inside guest context, because we need
to expose to the world that we are not in a preemptible context. However,
during that time we already have interrupts disabled, which would indicate
that we are in a non-preemptible context.

The reason the checks for irqs_disabled() fail for us though is that we
manually control hard IRQs and ignore all the lazy EE framework. Let's
stop doing that. Instead, let's always use lazy EE to indicate when we
want to disable IRQs, but do a special final switch that gets us into
EE disabled, but soft enabled state. That way when we get back out of
guest state, we are immediately ready to process interrupts.

This simplifies the code drastically and reduces the time that we appear
as preempt disabled.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Consistentify vcpu exit path

When getting out of __vcpu_run, let's be consistent about the state we
return in. We want to always

* have IRQs enabled
* have called kvm_guest_exit before

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S: PR: Indicate we're out of guest mode

When going out of guest mode, indicate that we are in vcpu->mode. That way
requests from other CPUs don't needlessly need to kick us to process them,
because it'll just happen next time we enter the guest.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Exit guest context while handling exit

The x86 implementation of KVM accounts for host time while processing
guest exits. Do the same for us.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Book3S: PR: Only do resched check once per exit

Now that we use our generic exit helper, we can safely drop our previous
kvm_resched that we used to trigger at the beginning of the exit handler
function.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: Drop redundant vcpu->mode set

We only need to set vcpu->mode to outside once.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Book3s: PR: Add (dumb) MMU Notifier support

Now that we have very simple MMU Notifier support for e500 in place,
also add the same simple support to book3s. It gets us one step closer
to actual fast support.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Use same kvmppc_prepare_to_enter code for booke and book3s_pr

We need to do the same things when preparing to enter a guest for booke and
book3s_pr cores. Fold the generic code into a generic function that both call.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: No duplicate request != 0 check

We only call kvmppc_check_requests() when vcpu->requests != 0, so drop
the redundant check in the function itself

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: Add some more trace points

Without trace points, debugging what exactly is going on inside guest
code can be very tricky. Add a few more trace points at places that
hopefully tell us more when things go wrong.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: E500: Implement MMU notifiers

The e500 target has lived without mmu notifiers ever since it got
introduced, but fails for the user space check on them with hugetlbfs.

So in order to get that one working, implement mmu notifiers in a
reasonably dumb fashion and be happy. On embedded hardware, we almost
never end up with mmu notifier calls, since most people don't overcommit.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: Add support for vcpu->mode

Generic KVM code might want to know whether we are inside guest context
or outside. It also wants to be able to push us out of guest context.

Add support to the BookE code for the generic vcpu->mode field that describes
the above states.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: Add check_requests helper function

We need a central place to check for pending requests in. Add one that
only does the timer check we already do in a different place.

Later, this central function can be extended by more checks.

Signed-off-by: Alexander Graf <agraf@suse.de>

powerpc/epapr: export epapr_hypercall_start

This fixes breakage introduced by the following commit:

  commit 6d2d82627f4f1e96a33664ace494fa363e0495cb
  Author: Liu Yu-B13201 <Yu.Liu@freescale.com>
  Date:   Tue Jul 3 05:48:56 2012 +0000

    PPC: Don't use hardcoded opcode for ePAPR hcall invocation

when a driver that uses ePAPR hypercalls is built as a module.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Quieten message about allocating linear regions

This is printed once for every RMA or HPT region that get
preallocated. If one preallocates hundreds of such regions
(in order to run hundreds of KVM guests), that gets rather
painful, so make it a bit quieter.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: E500: Fix clear_tlb_refs

Our mapping code assumes that TLB0 entries are always mapped. However, after
calling clear_tlb_refs() this is no longer the case.

Map them dynamically if we find an entry unmapped in TLB0.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: BookE: Expose remote TLB flushes in debugfs

We're already counting remote TLB flushes in a variable, but don't export
it to user space yet. Do so, so we know what's going on.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Expose SYNC cap based on mmu notifiers

Semantically, the "SYNC" cap means that we have mmu notifiers available.
Express this in our #ifdef'ery around the feature, so that we can be sure
we don't miss out on ppc targets when they get their implementation.

Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: PR: Use generic tracepoint for guest exit

We want to have tracing information on guest exits for booke as well
as book3s. Since most information is identical, use a common trace point.

Signed-off-by: Alexander Graf <agraf@suse.de>

PPC: Don't use hardcoded opcode for ePAPR hcall invocation

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

powerpc/fsl-soc: use CONFIG_EPAPR_PARAVIRT for hcalls

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

PPC: select EPAPR_PARAVIRT for all users of epapr hcalls

Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: ev_idle hcall support for e500 guests

Signed-off-by: Liu Yu <yu.liu@freescale.com>
[varun: 64-bit changes]
Signed-off-by: Varun Sethi <Varun.Sethi@freescale.com>
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: Add support for ePAPR idle hcall in host kernel

And add a new flag definition in kvm_ppc_pvinfo to indicate
whether the host supports the EV_IDLE hcall.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
[stuart.yoder@freescale.com: cleanup,fixes for conditions allowing idle]
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
[agraf: fix typo]
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: add pvinfo for hcall opcodes on e500mc/e5500

Signed-off-by: Liu Yu <yu.liu@freescale.com>
[stuart: factored this out from idle hcall support in host patch]
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: PPC: use definitions in epapr header for hcalls

Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

PPC: epapr: create define for return code value of success

Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

KVM: s390: Fix vcpu_load handling in interrupt code

Recent changes (KVM: make processes waiting on vcpu mutex killable)
now requires to check the return value of vcpu_load. This triggered
a warning in s390 specific kvm code. Turns out that we can actually
remove the put/load, since schedule will do the right thing via
the preempt notifiers.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86: Fix guest debug across vcpu INIT reset

If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.

Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This will change to focus of the set_guest_debug vendor op to
update_dp_bp_intercept.

Found while trying to stop on start_secondary.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: Add resampling irqfds for level triggered interrupts

To emulate level triggered interrupts, add a resample option to
KVM_IRQFD.  When specified, a new resamplefd is provided that notifies
the user when the irqchip has been resampled by the VM.  This may, for
instance, indicate an EOI.  Also in this mode, posting of an interrupt
through an irqfd only asserts the interrupt.  On resampling, the
interrupt is automatically de-asserted prior to user notification.
This enables level triggered interrupts to be posted and re-enabled
from vfio with no userspace intervention.

All resampling irqfds can make use of a single irq source ID, so we
reserve a new one for this interface.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: optimize apic interrupt delivery

Most interrupt are delivered to only one vcpu. Use pre-build tables to
find interrupt destination instead of looping through all vcpus. In case
of logical mode loop only through vcpus in a logical cluster irq is sent
to.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

Merge branch 'queue' into next

* queue:
  KVM: MMU: Eliminate pointless temporary 'ac'
  KVM: MMU: Avoid access/dirty update loop if all is well
  KVM: MMU: Eliminate eperm temporary
  KVM: MMU: Optimize is_last_gpte()
  KVM: MMU: Simplify walk_addr_generic() loop
  KVM: MMU: Optimize pte permission checks
  KVM: MMU: Update accessed and dirty bits after guest pagetable walk
  KVM: MMU: Move gpte_access() out of paging_tmpl.h
  KVM: MMU: Optimize gpte_access() slightly
  KVM: MMU: Push clean gpte write protection out of gpte_access()
  KVM: clarify kvmclock documentation
  KVM: make processes waiting on vcpu mutex killable
  KVM: SVM: Make use of asm.h
  KVM: VMX: Make use of asm.h
  KVM: VMX: Make lto-friendly

Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Eliminate pointless temporary 'ac'

'ac' essentially reconstructs the 'access' variable we already
have, except for the PFERR_PRESENT_MASK and PFERR_RSVD_MASK. As
these are not used by callees, just use 'access' directly.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Avoid access/dirty update loop if all is well

Keep track of accessed/dirty bits; if they are all set, do not
enter the accessed/dirty update loop.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Eliminate eperm temporary

'eperm' is no longer used in the walker loop, so we can eliminate it.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Optimize is_last_gpte()

Instead of branchy code depending on level, gpte.ps, and mmu configuration,
prepare everything in a bitmap during mode changes and look it up during
runtime.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Simplify walk_addr_generic() loop

The page table walk is coded as an infinite loop, with a special
case on the last pte.

Code it as an ordinary loop with a termination condition on the last
pte (large page or walk length exhausted), and put the last pte handling
code after the loop where it belongs.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Optimize pte permission checks

walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup. It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.

Optimize this away by precalculating all variants and storing them in a
bitmap. The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).

The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.

The result is short, branch-free code.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Update accessed and dirty bits after guest pagetable walk

While unspecified, the behaviour of Intel processors is to first
perform the page table walk, then, if the walk was successful, to
atomically update the accessed and dirty bits of walked paging elements.

While we are not required to follow this exactly, doing so will allow us
to perform the access permissions check after the walk is complete, rather
than after each walk step.

(the tricky case is SMEP: a zero in any pte's U bit makes the referenced
page a supervisor page, so we can't fault on a one bit during the walk
itself).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Move gpte_access() out of paging_tmpl.h

We no longer rely on paging_tmpl.h defines; so we can move the function
to mmu.c.

Rely on zero extension to 64 bits to get the correct nx behaviour.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Optimize gpte_access() slightly

If nx is disabled, then is gpte[63] is set we will hit a reserved
bit set fault before checking permissions; so we can ignore the
setting of efer.nxe.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Push clean gpte write protection out of gpte_access()

gpte_access() computes the access permissions of a guest pte and also
write-protects clean gptes. This is wrong when we are servicing a
write fault (since we'll be setting the dirty bit momentarily) but
correct when instantiating a speculative spte, or when servicing a
read fault (since we'll want to trap a following write in order to
set the dirty bit).

It doesn't seem to hurt in practice, but in order to make the code
readable, push the write protection out of gpte_access() and into
a new protect_clean_gpte() which is called explicitly when needed.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: clarify kvmclock documentation

- mention that system time needs to be added to wallclock time
- positive tsc_shift means left shift, not right
- mention additional 32bit right shift

Signed-off-by: Stefan Fritsch <sf@sfritsch.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: make processes waiting on vcpu mutex killable

vcpu mutex can be held for unlimited time so
taking it with mutex_lock on an ioctl is wrong:
one process could be passed a vcpu fd and
call this ioctl on the vcpu used by another process,
it will then be unkillable until the owner exits.

Call mutex_lock_killable instead and return status.
Note: mutex_lock_interruptible would be even nicer,
but I am not sure all users are prepared to handle EINTR
from these ioctls. They might misinterpret it as an error.

Cleanup paths expect a vcpu that can't be used by
any userspace so this will always succeed - catch bugs
by calling BUG_ON.

Catch callers that don't check return state by adding
__must_check.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: SVM: Make use of asm.h

Use macros for bitness-insensitive register names, instead of
rolling our own.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Make use of asm.h

Use macros for bitness-insensitive register names, instead of
rolling our own.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Make lto-friendly

LTO (link-time optimization) doesn't like local labels to be referred to
from a different function, since the two functions may be built in separate
compilation units. Use an external variable instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()

find_highest_vector() and count_vectors():
- Instead of using magic values, define and use proper macros.

find_highest_vector():
- Remove likely() which is there only for historical reasons and not
   doing correct branch predictions anymore.  Using such heuristics
   to optimize this function is not worth it now.  Let CPUs predict
   things instead.

- Stop checking word[0] separately.  This was only needed for doing
   likely() optimization.

- Use for loop, not while, to iterate over the register array to make
   the code clearer.

Note that we actually confirmed that the likely() did wrong predictions
by inserting debug code.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: MMU: remove unnecessary check

Checking the return of kvm_mmu_get_page is unnecessary since it is
guaranteed by memory cache

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: Depend on HIGH_RES_TIMERS

KVM lapic timer and tsc deadline timer based on hrtimer,
setting a leftmost node to rb tree and then do hrtimer reprogram.
If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram
do nothing and then make kvm lapic timer and tsc deadline timer fail.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: Improve wording of KVM_SET_USER_MEMORY_REGION documentation

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: use symbolic constant for nr interrupts

interrupt_bitmap is KVM_NR_INTERRUPTS bits in size,
so just use that instead of hard-coded constants
and math.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: emulator: optimize "rep ins" handling

Optimize "rep ins" by allowing emulator to write back more than one
datum at a time. Introduce new operand type OP_MEM_STR which tells
writeback() that dst contains pointer to an array that should be written
back as opposite to just one data element.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: emulator: string_addr_inc() cleanup

Remove unneeded segment argument. Address structure already has correct
segment which was put there during decode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: emulator: make x86 emulation modes enum instead of defines

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: Provide userspace IO exit completion callback

Current code assumes that IO exit was due to instruction emulation
and handles execution back to emulator directly. This patch adds new
userspace IO exit completion callback that can be set by any other code
that caused IO exit to userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: move postcommit flush to x86, as mmio sptes are x86 specific

Other arches do not need this.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
v2: fix incorrect deletion of mmio sptes on gpa move (noticed by Takuya)
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: perform an invalid memslot step for gpa base change

PPC must flush all translations before the new memory slot
is visible.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: split kvm_arch_flush_shadow

Introducing kvm_arch_flush_shadow_memslot, to invalidate the
translations of a single memory slot.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: SVM: constify lookup tables

We never modify direct_access_msrs[], msrpm_ranges[],
svm_exit_handlers[] or x86_intercept_map[] at runtime.
Mark them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: VMX: constify lookup tables

We use vmcs_field_to_offset_table[], kvm_vmx_segment_fields[] and
kvm_vmx_exit_handlers[] as lookup tables only -- make them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86: more constification

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86: constify read_write_emulator_ops

We never change those, make them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86 emulator: constify emulate_ops

We never change emulate_ops[] at runtime so it should be r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86 emulator: mark opcode tables const

The opcode tables never change at runtime, therefor mark them const.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86 emulator: use aligned variants of SSE register ops

As the the compiler ensures that the memory operand is always aligned
to a 16 byte memory location, use the aligned variant of MOVDQ for
read_sse_reg() and write_sse_reg().

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86: minor size optimization

Some fields can be constified and/or made static to reduce code and data
size.

Numbers for a 32 bit build:

        text    data     bss     dec     hex filename
before: 3351      80       0    3431     d67 cpuid.o
after: 3391       0       0    3391     d3f cpuid.o

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: cleanup pic reset

kvm_pic_reset() is not used anywhere. Move reset logic from
pic_ioport_write() there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86: remove unused variable from kvm_task_switch()

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Ignore segment G and D bits when considering whether we can virtualize

We will enter the guest with G and D cleared; as real hardware ignores D in
real mode, and G is taken care of by the limit test, we allow more code to
run in vm86 mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Save all segment data in real mode

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Preserve segment limit and access rights in real mode

While this is undocumented, real processors do not reload the segment
limit and access rights when loading a segment register in real mode.
Real programs rely on it so we need to comply with this behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Return real real-mode segment data even if emulate_invalid_guest_state=1

emulate_invalid_guest_state=1 doesn't mean we don't munge the segments in the
vmcs; we do. So we need to return the real ones (maintained by vmx_set_segment).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Fix #GP error code during linearization

We want the segment selector, nor segment number.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Check segment limits in real mode too

Segment limits are verified in real mode, not just protected mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Leave segment limit and attributs alone in real mode

When loading a segment in real mode, only the base and selector must
be modified. The limit needs to be left alone, otherwise big real mode
users will hit a #GP due to limit checking (currently this is suppressed
because we don't check limits in real mode).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Allow vm86 virtualization of big real mode

Usually, big real mode uses large (4GB) segments. Currently we don't
virtualize this; if any segment has a limit other than 0xffff, we emulate.
But if we set the vmx-visible limit to 0xffff, we can use vm86 to virtualize
real mode; if an access overruns the segment limit, the guest will #GP, which
we will trap and forward to the emulator. This results in significantly
faster execution, and less risk of hitting an unemulated instruction.

If the limit is less than 0xffff, we retain the existing behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Allow real mode emulation using vm86 with dpl=0

Real mode is always entered from protected mode with dpl=0. Since
the dpl doesn't affect execution, and we already override it to 3
in the vmcs (as vmx requires), we can allow execution in that state.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Retain limit and attributes when entering protected mode

Real processors don't change segment limits and attributes while in
real mode. Mimic that behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Use kvm_segment to save protected-mode segments when entering realmode

Instead of using struct kvm_save_segment, use struct kvm_segment, which is what
the other APIs use. This leads to some simplification.

We replace save_rmode_seg() with a call to vmx_save_segment(). Since this depends
on rmode.vm86_active, we move the call to before setting the flag.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>