cascardo/linux.git
8 years agoMerge tag 'kvm-s390-next-4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Bonzini [Thu, 10 Mar 2016 16:06:07 +0000 (17:06 +0100)]
Merge tag 'kvm-s390-next-4.6-2' of git://git./linux/kernel/git/kvms390/linux into HEAD

KVM: s390: Fixes and features for kvm/next (4.6) part 2

- add watchdog diagnose to trace event decoder
- better handle the cpu timer when not inside the guest
- only provide STFLE if the CPU model has STFLE
- reduce DMA page usage

8 years agoKVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
Paolo Bonzini [Tue, 8 Mar 2016 09:00:11 +0000 (10:00 +0100)]
KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch

It is now equal to use_eager_fpu(), which simply tests a cpufeature bit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: x86: disable MPX if host did not enable MPX XSAVE features
Paolo Bonzini [Tue, 8 Mar 2016 08:52:13 +0000 (09:52 +0100)]
KVM: x86: disable MPX if host did not enable MPX XSAVE features

When eager FPU is disabled, KVM will still see the MPX bit in CPUID and
presumably the MPX vmentry and vmexit controls.  However, it will not
be able to expose the MPX XSAVE features to the guest, because the guest's
accessible XSAVE features are always a subset of host_xcr0.

In this case, we should disable the MPX CPUID bit, the BNDCFGS MSR,
and the MPX vmentry and vmexit controls for nested virtualization.
It is then unnecessary to enable guest eager FPU if the guest has the
MPX CPUID bit set.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoMerge tag 'kvm-arm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm...
Paolo Bonzini [Wed, 9 Mar 2016 10:50:42 +0000 (11:50 +0100)]
Merge tag 'kvm-arm-for-4.6' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM updates for 4.6

- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- Various optimizations to the vgic save/restore code

Conflicts:
include/uapi/linux/kvm.h

8 years agoarm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
Marc Zyngier [Tue, 9 Feb 2016 17:36:09 +0000 (17:36 +0000)]
arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit

So far, we're always writing all possible LRs, setting the empty
ones with a zero value. This is obvious doing a low of work for
nothing, and we're better off clearing those we've actually
dirtied on the exit path (it is very rare to inject more than one
interrupt at a time anyway).

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: vgic-v3: Reset LRs at boot time
Marc Zyngier [Thu, 3 Mar 2016 15:43:58 +0000 (15:43 +0000)]
arm64: KVM: vgic-v3: Reset LRs at boot time

In order to let the GICv3 code be more lazy in the way it
accesses the LRs, it is necessary to start with a clean slate.

Let's reset the LRs on each CPU when the vgic is probed (which
includes a round trip to EL2...).

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: vgic-v3: Do not save an LR known to be empty
Marc Zyngier [Tue, 9 Feb 2016 17:09:49 +0000 (17:09 +0000)]
arm64: KVM: vgic-v3: Do not save an LR known to be empty

On exit, any empty LR will be signaled in ICH_ELRSR_EL2. Which
means that we do not have to save it, and we can just clear
its state in the in-memory copy.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: vgic-v3: Save maintenance interrupt state only if required
Marc Zyngier [Tue, 9 Feb 2016 18:53:04 +0000 (18:53 +0000)]
arm64: KVM: vgic-v3: Save maintenance interrupt state only if required

Next on our list of useless accesses is the maintenance interrupt
status registers (ICH_MISR_EL2, ICH_EISR_EL2).

It is pointless to save them if we haven't asked for a maintenance
interrupt the first place, which can only happen for two reasons:
- Underflow: ICH_HCR_UIE will be set,
- EOI: ICH_LR_EOI will be set.

These conditions can be checked on the in-memory copies of the regs.
Should any of these two condition be valid, we must read GICH_MISR.
We can then check for ICH_MISR_EOI, and only when set read
ICH_EISR_EL2.

This means that in most case, we don't have to save them at all.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: vgic-v3: Avoid accessing ICH registers
Marc Zyngier [Wed, 17 Feb 2016 10:25:05 +0000 (10:25 +0000)]
arm64: KVM: vgic-v3: Avoid accessing ICH registers

Just like on GICv2, we're a bit hammer-happy with GICv3, and access
them more often than we should.

Adopt a policy similar to what we do for GICv2, only save/restoring
the minimal set of registers. As we don't access the registers
linearly anymore (we may skip some), the convoluted accessors become
slightly simpler, and we can drop the ugly indexing macro that
tended to confuse the reviewers.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoKVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
Marc Zyngier [Tue, 9 Feb 2016 17:37:39 +0000 (17:37 +0000)]
KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit

The GICD_SGIR register lives a long way from the beginning of
the handler array, which is searched linearly. As this is hit
pretty often, let's move it up. This saves us some precious
cycles when the guest is generating IPIs.

Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoKVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
Marc Zyngier [Tue, 9 Feb 2016 17:36:09 +0000 (17:36 +0000)]
KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit

So far, we're always writing all possible LRs, setting the empty
ones with a zero value. This is obvious doing a lot of work for
nothing, and we're better off clearing those we've actually
dirtied on the exit path (it is very rare to inject more than one
interrupt at a time anyway).

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoKVM: arm/arm64: vgic-v2: Reset LRs at boot time
Marc Zyngier [Thu, 3 Mar 2016 15:43:58 +0000 (15:43 +0000)]
KVM: arm/arm64: vgic-v2: Reset LRs at boot time

In order to let make the GICv2 code more lazy in the way it
accesses the LRs, it is necessary to start with a clean slate.

Let's reset the LRs on each CPU when the vgic is probed.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoKVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
Marc Zyngier [Tue, 9 Feb 2016 17:09:49 +0000 (17:09 +0000)]
KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty

On exit, any empty LR will be signaled in GICH_ELRSR*. Which
means that we do not have to save it, and we can just clear
its state in the in-memory copy.

Take this opportunity to move the LR saving code into its
own function.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoKVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
Marc Zyngier [Tue, 9 Feb 2016 17:07:18 +0000 (17:07 +0000)]
KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function

In order to make the saving path slightly more readable and
prepare for some more optimizations, let's move the GICH_ELRSR
saving to its own function.

No functional change.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoKVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
Marc Zyngier [Tue, 9 Feb 2016 17:01:33 +0000 (17:01 +0000)]
KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required

Next on our list of useless accesses is the maintenance interrupt
status registers (GICH_MISR, GICH_EISR{0,1}).

It is pointless to save them if we haven't asked for a maintenance
interrupt the first place, which can only happen for two reasons:
- Underflow: GICH_HCR_UIE will be set,
- EOI: GICH_LR_EOI will be set.

These conditions can be checked on the in-memory copies of the regs.
Should any of these two condition be valid, we must read GICH_MISR.
We can then check for GICH_MISR_EOI, and only when set read
GICH_EISR*.

This means that in most case, we don't have to save them at all.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoKVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
Marc Zyngier [Tue, 2 Feb 2016 19:35:34 +0000 (19:35 +0000)]
KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers

GICv2 registers are *slow*. As in "terrifyingly slow". Which is bad.
But we're equaly bad, as we make a point in accessing them even if
we don't have any interrupt in flight.

A good solution is to first find out if we have anything useful to
write into the GIC, and if we don't, to simply not do it. This
involves tracking which LRs actually have something valid there.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoKVM: s390: allocate only one DMA page per VM
David Hildenbrand [Wed, 2 Dec 2015 07:53:52 +0000 (08:53 +0100)]
KVM: s390: allocate only one DMA page per VM

We can fit the 2k for the STFLE interpretation and the crypto
control block into one DMA page. As we now only have to allocate
one DMA page, we can clean up the code a bit.

As a nice side effect, this also fixes a problem with crycbd alignment in
case special allocation debug options are enabled, debugged by Sascha
Silbe.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: s390: enable STFLE interpretation only if enabled for the guest
David Hildenbrand [Wed, 2 Dec 2015 08:43:29 +0000 (09:43 +0100)]
KVM: s390: enable STFLE interpretation only if enabled for the guest

Not setting the facility list designation disables STFLE interpretation,
this is what we want if the guest was told to not have it.

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: s390: wake up when the VCPU cpu timer expires
David Hildenbrand [Mon, 22 Feb 2016 13:14:50 +0000 (14:14 +0100)]
KVM: s390: wake up when the VCPU cpu timer expires

When the VCPU cpu timer expires, we have to wake up just like when the ckc
triggers. For now, setting up a cpu timer in the guest and going into
enabled wait will never lead to a wakeup. This patch fixes this problem.
Just as for the ckc, we have to take care of waking up too early. We
have to recalculate the sleep time and go back to sleep.

Please note that the timer callback calls kvm_s390_get_cpu_timer() from
interrupt context. As the timer is canceled when leaving handle_wait(),
and we don't do any VCPU cpu timer writes/updates in that function, we can
be sure that we will never try to read the VCPU cpu timer from the same cpu
that is currentyl updating the timer (deadlock).

Reported-by: Sascha Silbe <silbe@linux.vnet.ibm.com>
Tested-by: Sascha Silbe <silbe@linux.vnet.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: s390: step the VCPU timer while in enabled wait
David Hildenbrand [Mon, 22 Feb 2016 12:52:27 +0000 (13:52 +0100)]
KVM: s390: step the VCPU timer while in enabled wait

The cpu timer is a mean to measure task execution time. We want
to account everything for a VCPU for which it is responsible. Therefore,
if the VCPU wants to sleep, it shall be accounted for it.

We can easily get this done by not disabling cpu timer accounting when
scheduled out while sleeping because of enabled wait.

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: s390: protect VCPU cpu timer with a seqcount
David Hildenbrand [Wed, 17 Feb 2016 20:53:33 +0000 (21:53 +0100)]
KVM: s390: protect VCPU cpu timer with a seqcount

For now, only the owning VCPU thread (that has loaded the VCPU) can get a
consistent cpu timer value when calculating the delta. However, other
threads might also be interested in a more recent, consistent value. Of
special interest will be the timer callback of a VCPU that executes without
having the VCPU loaded and could run in parallel with the VCPU thread.

The cpu timer has a nice property: it is only updated by the owning VCPU
thread. And speaking about accounting, a consistent value can only be
calculated by looking at cputm_start and the cpu timer itself in
one shot, otherwise the result might be wrong.

As we only have one writing thread at a time (owning VCPU thread), we can
use a seqcount instead of a seqlock and retry if the VCPU refreshed its
cpu timer. This avoids any heavy locking and only introduces a counter
update/check plus a handful of smp_wmb().

The owning VCPU thread should never have to retry on reads, and also for
other threads this might be a very rare scenario.

Please note that we have to use the raw_* variants for locking the seqcount
as lockdep will produce false warnings otherwise. The rq->lock held during
vcpu_load/put is also acquired from hardirq context. Lockdep cannot know
that we avoid potential deadlocks by disabling preemption and thereby
disable concurrent write locking attempts (via vcpu_put/load).

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: s390: step VCPU cpu timer during kvm_run ioctl
David Hildenbrand [Mon, 15 Feb 2016 08:42:25 +0000 (09:42 +0100)]
KVM: s390: step VCPU cpu timer during kvm_run ioctl

Architecturally we should only provide steal time if we are scheduled
away, and not if the host interprets a guest exit. We have to step
the guest CPU timer in these cases.

In the first shot, we will step the VCPU timer only during the kvm_run
ioctl. Therefore all time spent e.g. in interception handlers or on irq
delivery will be accounted for that VCPU.

We have to take care of a few special cases:
- Other VCPUs can test for pending irqs. We can only report a consistent
  value for the VCPU thread itself when adding the delta.
- We have to take care of STP sync, therefore we have to extend
  kvm_clock_sync() and disable preemption accordingly
- During any call to disable/enable/start/stop we could get premeempted
  and therefore get start/stop calls. Therefore we have to make sure we
  don't get into an inconsistent state.

Whenever a VCPU is scheduled out, sleeping, in user space or just about
to enter the SIE, the guest cpu timer isn't stepped.

Please note that all primitives are prepared to be called from both
environments (cpu timer accounting enabled or not), although not completely
used in this patch yet (e.g. kvm_s390_set_cpu_timer() will never be called
while cpu timer accounting is enabled).

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: s390: abstract access to the VCPU cpu timer
David Hildenbrand [Mon, 15 Feb 2016 08:40:12 +0000 (09:40 +0100)]
KVM: s390: abstract access to the VCPU cpu timer

We want to manually step the cpu timer in certain scenarios in the future.
Let's abstract any access to the cpu timer, so we can hide the complexity
internally.

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: s390: store cpu id in vcpu->cpu when scheduled in
David Hildenbrand [Fri, 12 Feb 2016 19:41:56 +0000 (20:41 +0100)]
KVM: s390: store cpu id in vcpu->cpu when scheduled in

By storing the cpu id, we have a way to verify if the current cpu is
owning a VCPU.

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: s390: Add diag "watchdog functions" to trace event decoding
Alexander Yarygin [Fri, 19 Feb 2016 09:21:26 +0000 (12:21 +0300)]
KVM: s390: Add diag "watchdog functions" to trace event decoding

DIAG 0x288 may occur now. Let's add its code to the diag table in
sie.h.

Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
8 years agoKVM: MMU: micro-optimize gpte_access
Paolo Bonzini [Tue, 23 Feb 2016 13:19:20 +0000 (14:19 +0100)]
KVM: MMU: micro-optimize gpte_access

Avoid AND-NOT, most x86 processor lack an instruction for it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: simplify last_pte_bitmap
Paolo Bonzini [Tue, 23 Feb 2016 11:51:19 +0000 (12:51 +0100)]
KVM: MMU: simplify last_pte_bitmap

Branch-free code is fun and everybody knows how much Avi loves it,
but last_pte_bitmap takes it a bit to the extreme.  Since the code
is simply doing a range check, like

(level == 1 ||
 ((gpte & PT_PAGE_SIZE_MASK) && level < N)

we can make it branch-free without storing the entire truth table;
it is enough to cache N.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: coalesce more page zapping in mmu_sync_children
Paolo Bonzini [Thu, 25 Feb 2016 09:47:38 +0000 (10:47 +0100)]
KVM: MMU: coalesce more page zapping in mmu_sync_children

mmu_sync_children can only process up to 16 pages at a time.  Check
if we need to reschedule, and do not bother zapping the pages until
that happens.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: move zap/flush to kvm_mmu_get_page
Paolo Bonzini [Wed, 24 Feb 2016 10:26:10 +0000 (11:26 +0100)]
KVM: MMU: move zap/flush to kvm_mmu_get_page

kvm_mmu_get_page is the only caller of kvm_sync_page_transient
and kvm_sync_pages.  Moving the handling of the invalid_list there
removes the need for the underdocumented kvm_sync_page_transient
function.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: invert return value of mmu.sync_page and *kvm_sync_page*
Paolo Bonzini [Wed, 24 Feb 2016 10:07:14 +0000 (11:07 +0100)]
KVM: MMU: invert return value of mmu.sync_page and *kvm_sync_page*

Return true if the page was synced (and the TLB must be flushed)
and false if the page was zapped.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: cleanup __kvm_sync_page and its callers
Paolo Bonzini [Wed, 24 Feb 2016 09:28:01 +0000 (10:28 +0100)]
KVM: MMU: cleanup __kvm_sync_page and its callers

Calling kvm_unlink_unsync_page in the middle of __kvm_sync_page makes
things unnecessarily tricky.  If kvm_mmu_prepare_zap_page is called,
it will call kvm_unlink_unsync_page too.  So kvm_unlink_unsync_page can
be called just as well at the beginning or the end of __kvm_sync_page...
which means that we might do it in kvm_sync_page too and remove the
parameter.

kvm_sync_page ends up being the same code that kvm_sync_pages used
to have before the previous patch.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: use kvm_sync_page in kvm_sync_pages
Paolo Bonzini [Wed, 24 Feb 2016 09:19:30 +0000 (10:19 +0100)]
KVM: MMU: use kvm_sync_page in kvm_sync_pages

If the last argument is true, kvm_unlink_unsync_page is called anyway in
__kvm_sync_page (either by kvm_mmu_prepare_zap_page or by __kvm_sync_page
itself).  Therefore, kvm_sync_pages can just call kvm_sync_page, instead
of going through kvm_unlink_unsync_page+__kvm_sync_page.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: move TLB flush out of __kvm_sync_page
Paolo Bonzini [Wed, 24 Feb 2016 09:03:27 +0000 (10:03 +0100)]
KVM: MMU: move TLB flush out of __kvm_sync_page

By doing this, kvm_sync_pages can use __kvm_sync_page instead of
reinventing it.  Because of kvm_mmu_flush_or_zap, the code does not
end up being more complex than before, and more cleanups to kvm_sync_pages
will come in the next patches.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: introduce kvm_mmu_flush_or_zap
Paolo Bonzini [Wed, 24 Feb 2016 10:21:55 +0000 (11:21 +0100)]
KVM: MMU: introduce kvm_mmu_flush_or_zap

This is a generalization of mmu_pte_write_flush_tlb, that also
takes care of calling kvm_mmu_commit_zap_page.  The next
patches will introduce more uses.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: drop local copy of mul_u64_u32_div
Paolo Bonzini [Fri, 4 Mar 2016 08:28:41 +0000 (09:28 +0100)]
KVM: i8254: drop local copy of mul_u64_u32_div

A function that does the same as i8254.c's muldiv64 has been added
(for KVM's own use, in fact!) in include/linux/math64.h.  Use it
instead of muldiv64.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: check kvm_mmu_pages and mmu_page_path indices
Xiao Guangrong [Wed, 24 Feb 2016 08:46:06 +0000 (09:46 +0100)]
KVM: MMU: check kvm_mmu_pages and mmu_page_path indices

Give a special invalid index to the root of the walk, so that we
can check the consistency of kvm_mmu_pages and mmu_page_path.

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Extracted from a bigger patch proposed by Guangrong. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: Fix ubsan warnings
Paolo Bonzini [Tue, 23 Feb 2016 12:54:25 +0000 (13:54 +0100)]
KVM: MMU: Fix ubsan warnings

kvm_mmu_pages_init is doing some really yucky stuff.  It is setting
up a sentinel for mmu_page_clear_parents; however, because of a) the
way levels are numbered starting from 1 and b) the way mmu_page_path
sizes its arrays with PT64_ROOT_LEVEL-1 elements, the access can be
out of bounds.  This is harmless because the code overwrites up to the
first two elements of parents->idx and these are initialized, and
because the sentinel is not needed in this case---mmu_page_clear_parents
exits anyway when it gets to the end of the array.  However ubsan
complains, and everyone else should too.

This fix does three things.  First it makes the mmu_page_path arrays
PT64_ROOT_LEVEL elements in size, so that we can write to them without
checking the level in advance.  Second it disintegrates kvm_mmu_pages_init
between mmu_unsync_walk (to reset the struct kvm_mmu_pages) and
for_each_sp (to place the NULL sentinel at the end of the current path).
This is okay because the mmu_page_path is only used in
mmu_pages_clear_parents; mmu_pages_clear_parents itself is called within
a for_each_sp iterator, and hence always after a call to mmu_pages_next.
Third it changes mmu_pages_clear_parents to just use the sentinel to
stop iteration, without checking the bounds on level.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Mike Krinkin <krinkin.m.u@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: cleanup handle_abnormal_pfn
Paolo Bonzini [Tue, 23 Feb 2016 14:28:51 +0000 (15:28 +0100)]
KVM: MMU: cleanup handle_abnormal_pfn

The goto and temporary variable are unnecessary, just use return
statements.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: VMX: use vmcs_clear/set_bits for debug register exits
Paolo Bonzini [Fri, 26 Feb 2016 11:09:49 +0000 (12:09 +0100)]
KVM: VMX: use vmcs_clear/set_bits for debug register exits

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: ensure __gfn_to_pfn_memslot initializes *writable
Paolo Bonzini [Tue, 23 Feb 2016 14:36:01 +0000 (15:36 +0100)]
KVM: ensure __gfn_to_pfn_memslot initializes *writable

For the kvm_is_error_hva, ubsan complains if the uninitialized writable
is passed to __direct_map, even though the value itself is not used
(__direct_map goes to mmu_set_spte->set_spte->set_mmio_spte but never
looks at that argument).

Ensuring that __gfn_to_pfn_memslot initializes *writable is cheap and
avoids this kind of issue.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: document KVM_REINJECT_CONTROL ioctl
Radim Krčmář [Wed, 2 Mar 2016 21:56:53 +0000 (22:56 +0100)]
KVM: document KVM_REINJECT_CONTROL ioctl

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: turn kvm_kpit_state.reinject into atomic_t
Radim Krčmář [Wed, 2 Mar 2016 21:56:52 +0000 (22:56 +0100)]
KVM: i8254: turn kvm_kpit_state.reinject into atomic_t

Document possible races between readers and concurrent update to the
ioctl.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: move PIT timer function initialization
Radim Krčmář [Wed, 2 Mar 2016 21:56:51 +0000 (22:56 +0100)]
KVM: i8254: move PIT timer function initialization

We can do it just once.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: don't assume layout of kvm_kpit_state
Radim Krčmář [Wed, 2 Mar 2016 21:56:50 +0000 (22:56 +0100)]
KVM: i8254: don't assume layout of kvm_kpit_state

channels has offset 0 and correct size now, but that can change.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: remove pointless dereference of PIT
Radim Krčmář [Wed, 2 Mar 2016 21:56:49 +0000 (22:56 +0100)]
KVM: i8254: remove pointless dereference of PIT

PIT is known at that point.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: remove pit and kvm from kvm_kpit_state
Radim Krčmář [Wed, 2 Mar 2016 21:56:48 +0000 (22:56 +0100)]
KVM: i8254: remove pit and kvm from kvm_kpit_state

kvm isn't ever used and pit can be accessed with container_of.
If you *really* need kvm, pit_state_to_pit(ps)->kvm.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: refactor kvm_free_pit
Radim Krčmář [Wed, 2 Mar 2016 21:56:47 +0000 (22:56 +0100)]
KVM: i8254: refactor kvm_free_pit

Could be easier to read, but git history will become deeper.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: refactor kvm_create_pit
Radim Krčmář [Wed, 2 Mar 2016 21:56:46 +0000 (22:56 +0100)]
KVM: i8254: refactor kvm_create_pit

Locks are gone, so we don't need to duplicate error paths.
Use goto everywhere.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: remove notifiers from PIT discard policy
Radim Krčmář [Wed, 2 Mar 2016 21:56:45 +0000 (22:56 +0100)]
KVM: i8254: remove notifiers from PIT discard policy

Discard policy doesn't rely on information from notifiers, so we don't
need to register notifiers unconditionally.  We kept correct counts in
case userspace switched between policies during runtime, but that can be
avoided by reseting the state.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: remove unnecessary uses of PIT state lock
Radim Krčmář [Wed, 2 Mar 2016 21:56:44 +0000 (22:56 +0100)]
KVM: i8254: remove unnecessary uses of PIT state lock

- kvm_create_pit had to lock only because it exposed kvm->arch.vpit very
  early, but initialization doesn't use kvm->arch.vpit since the last
  patch, so we can drop locking.
- kvm_free_pit is only run after there are no users of KVM and therefore
  is the sole actor.
- Locking in kvm_vm_ioctl_reinject doesn't do anything, because reinject
  is only protected at that place.
- kvm_pit_reset isn't used anywhere and its locking can be dropped if we
  hide it.

Removing useless locking allows to see what actually is being protected
by PIT state lock (values accessible from the guest).

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: pass struct kvm_pit instead of kvm in PIT
Radim Krčmář [Wed, 2 Mar 2016 21:56:43 +0000 (22:56 +0100)]
KVM: i8254: pass struct kvm_pit instead of kvm in PIT

This patch passes struct kvm_pit into internal PIT functions.
Those functions used to get PIT through kvm->arch.vpit, even though most
of them never used *kvm for other purposes.  Another benefit is that we
don't need to set kvm->arch.vpit during initialization.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: tone down WARN_ON pit.state_lock
Radim Krčmář [Wed, 2 Mar 2016 21:56:42 +0000 (22:56 +0100)]
KVM: i8254: tone down WARN_ON pit.state_lock

If the guest could hit this, it would hang the host kernel, bacause of
sheer number of those reports.  Internal callers have to be sensible
anyway, so we now only check for it in an API function.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: use atomic_t instead of pit.inject_lock
Radim Krčmář [Wed, 2 Mar 2016 21:56:41 +0000 (22:56 +0100)]
KVM: i8254: use atomic_t instead of pit.inject_lock

The lock was an overkill, the same can be done with atomics.

A mb() was added in kvm_pit_ack_irq, to pair with implicit barrier
between pit_timer_fn and pit_do_work.  The mb() prevents a race that
could happen if pending == 0 and irq_ack == 0:

  kvm_pit_ack_irq:                | pit_timer_fn:
   p = atomic_read(&ps->pending); |
                                  |  atomic_inc(&ps->pending);
                                  |  queue_work(pit_do_work);
                                  | pit_do_work:
                                  |  atomic_xchg(&ps->irq_ack, 0);
                                  |  return;
   atomic_set(&ps->irq_ack, 1);   |
   if (p == 0) return;            |

where the interrupt would not be delivered in this tick of pit_timer_fn.
PIT would have eventually delivered the interrupt, but we sacrifice
perofmance to make sure that interrupts are not needlessly delayed.

sfence isn't enough: atomic_dec_if_positive does atomic_read first and
x86 can reorder loads before stores.  lfence isn't enough: store can
pass lfence, turning it into a nop.  A compiler barrier would be more
than enough as CPU needs to stall for unbelievably long to use fences.

This patch doesn't do anything in kvm_pit_reset_reinject, because any
order of resets can race, but the result differs by at most one
interrupt, which is ok, because it's the same result as if the reset
happened at a slightly different time.  (Original code didn't protect
the reset path with a proper lock, so users have to be robust.)

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: add kvm_pit_reset_reinject
Radim Krčmář [Wed, 2 Mar 2016 21:56:40 +0000 (22:56 +0100)]
KVM: i8254: add kvm_pit_reset_reinject

pit_state.pending and pit_state.irq_ack are always reset at the same
time.  Create a function for them.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: simplify atomics in kvm_pit_ack_irq
Radim Krčmář [Wed, 2 Mar 2016 21:56:39 +0000 (22:56 +0100)]
KVM: i8254: simplify atomics in kvm_pit_ack_irq

We already have a helper that does the same thing.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: i8254: change PIT discard tick policy
Radim Krčmář [Wed, 2 Mar 2016 21:56:38 +0000 (22:56 +0100)]
KVM: i8254: change PIT discard tick policy

Discard policy uses ack_notifiers to prevent injection of PIT interrupts
before EOI from the last one.

This patch changes the policy to always try to deliver the interrupt,
which makes a difference when its vector is in ISR.
Old implementation would drop the interrupt, but proposed one injects to
IRR, like real hardware would.

The old policy breaks legacy NMI watchdogs, where PIT is used through
virtual wire (LVT0): PIT never sends an interrupt before receiving EOI,
thus a guest deadlock with disabled interrupts will stop NMIs.

Note that NMI doesn't do EOI, so PIT also had to send a normal interrupt
through IOAPIC.  (KVM's PIT is deeply rotten and luckily not used much
in modern systems.)

Even though there is a chance of regressions, I think we can fix the
LVT0 NMI bug without introducing a new tick policy.

Cc: <stable@vger.kernel.org>
Reported-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: apply page track notifier
Xiao Guangrong [Wed, 24 Feb 2016 09:51:16 +0000 (17:51 +0800)]
KVM: MMU: apply page track notifier

Register the notifier to receive write track event so that we can update
our shadow page table

It makes kvm_mmu_pte_write() be the callback of the notifier, no function
is changed

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: simplify mmu_need_write_protect
Xiao Guangrong [Wed, 24 Feb 2016 09:51:15 +0000 (17:51 +0800)]
KVM: MMU: simplify mmu_need_write_protect

Now, all non-leaf shadow page are page tracked, if gfn is not tracked
there is no non-leaf shadow page of gfn is existed, we can directly
make the shadow page of gfn to unsync

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: use page track for non-leaf shadow pages
Xiao Guangrong [Wed, 24 Feb 2016 09:51:14 +0000 (17:51 +0800)]
KVM: MMU: use page track for non-leaf shadow pages

non-leaf shadow pages are always write protected, it can be the user
of page track

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: page track: add notifier support
Xiao Guangrong [Wed, 24 Feb 2016 09:51:13 +0000 (17:51 +0800)]
KVM: page track: add notifier support

Notifier list is introduced so that any node wants to receive the track
event can register to the list

Two APIs are introduced here:
- kvm_page_track_register_notifier(): register the notifier to receive
  track event

- kvm_page_track_unregister_notifier(): stop receiving track event by
  unregister the notifier

The callback, node->track_write() is called when a write access on the
write tracked page happens

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: clear write-flooding on the fast path of tracked page
Xiao Guangrong [Wed, 24 Feb 2016 09:51:12 +0000 (17:51 +0800)]
KVM: MMU: clear write-flooding on the fast path of tracked page

If the page fault is caused by write access on write tracked page, the
real shadow page walking is skipped, we lost the chance to clear write
flooding for the page structure current vcpu is using

Fix it by locklessly waking shadow page table to clear write flooding
on the shadow page structure out of mmu-lock. So that we change the
count to atomic_t

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: let page fault handler be aware tracked page
Xiao Guangrong [Wed, 24 Feb 2016 09:51:11 +0000 (17:51 +0800)]
KVM: MMU: let page fault handler be aware tracked page

The page fault caused by write access on the write tracked page can not
be fixed, it always need to be emulated. page_fault_handle_page_track()
is the fast path we introduce here to skip holding mmu-lock and shadow
page table walking

However, if the page table is not present, it is worth making the page
table entry present and readonly to make the read access happy

mmu_need_write_protect() need to be cooked to avoid page becoming writable
when making page table present or sync/prefetch shadow page table entries

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: page track: introduce kvm_slot_page_track_{add,remove}_page
Xiao Guangrong [Wed, 24 Feb 2016 09:51:10 +0000 (17:51 +0800)]
KVM: page track: introduce kvm_slot_page_track_{add,remove}_page

These two functions are the user APIs:
- kvm_slot_page_track_add_page(): add the page to the tracking pool
  after that later specified access on that page will be tracked

- kvm_slot_page_track_remove_page(): remove the page from the tracking
  pool, the specified access on the page is not tracked after the last
  user is gone

Both of these are called under the protection both of mmu-lock and
kvm->srcu or kvm->slots_lock

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: page track: add the framework of guest page tracking
Xiao Guangrong [Wed, 24 Feb 2016 09:51:09 +0000 (17:51 +0800)]
KVM: page track: add the framework of guest page tracking

The array, gfn_track[mode][gfn], is introduced in memory slot for every
guest page, this is the tracking count for the gust page on different
modes. If the page is tracked then the count is increased, the page is
not tracked after the count reaches zero

We use 'unsigned short' as the tracking count which should be enough as
shadow page table only can use 2^14 (2^3 for level, 2^1 for cr4_pae, 2^2
for quadrant, 2^3 for access, 2^1 for nxe, 2^1 for cr0_wp, 2^1 for
smep_andnot_wp, 2^1 for smap_andnot_wp, and 2^1 for smm) at most, there
is enough room for other trackers

Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot() are implemented in this patch, they are
internally used to initialize and reclaim the memory of the array

Currently, only write track mode is supported

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: introduce kvm_mmu_slot_gfn_write_protect
Xiao Guangrong [Wed, 24 Feb 2016 09:51:08 +0000 (17:51 +0800)]
KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect

Split rmap_write_protect() and introduce the function to abstract the write
protection based on the slot

This function will be used in the later patch

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: introduce kvm_mmu_gfn_{allow,disallow}_lpage
Xiao Guangrong [Wed, 24 Feb 2016 09:51:07 +0000 (17:51 +0800)]
KVM: MMU: introduce kvm_mmu_gfn_{allow,disallow}_lpage

Abstract the common operations from account_shadowed() and
unaccount_shadowed(), then introduce kvm_mmu_gfn_disallow_lpage()
and kvm_mmu_gfn_allow_lpage()

These two functions will be used by page tracking in the later patch

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed
Xiao Guangrong [Wed, 24 Feb 2016 09:51:06 +0000 (17:51 +0800)]
KVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed

kvm_lpage_info->write_count is used to detect if the large page mapping
for the gfn on the specified level is allowed, rename it to disallow_lpage
to reflect its purpose, also we rename has_wrprotected_page() to
mmu_gfn_lpage_is_disallowed() to make the code more clearer

Later we will extend this mechanism for page tracking: if the gfn is
tracked then large mapping for that gfn on any level is not allowed.
The new name is more straightforward

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agokvm: x86: Check dest_map->vector to match eoi signals for rtc
Joerg Roedel [Mon, 29 Feb 2016 15:04:45 +0000 (16:04 +0100)]
kvm: x86: Check dest_map->vector to match eoi signals for rtc

Using the vector stored at interrupt delivery makes the eoi
matching safe agains irq migration in the ioapic.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agokvm: x86: Track irq vectors in ioapic->rtc_status.dest_map
Joerg Roedel [Mon, 29 Feb 2016 15:04:44 +0000 (16:04 +0100)]
kvm: x86: Track irq vectors in ioapic->rtc_status.dest_map

This allows backtracking later in case the rtc irq has been
moved to another vcpu/vector.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agokvm: x86: Convert ioapic->rtc_status.dest_map to a struct
Joerg Roedel [Mon, 29 Feb 2016 15:04:43 +0000 (16:04 +0100)]
kvm: x86: Convert ioapic->rtc_status.dest_map to a struct

Currently this is a bitmap which tracks which CPUs we expect
an EOI from. Move this bitmap to a struct so that we can
track additional information there.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoMerge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus...
Paolo Bonzini [Thu, 3 Mar 2016 13:35:44 +0000 (14:35 +0100)]
Merge branch 'kvm-ppc-next' of git://git./linux/kernel/git/paulus/powerpc into HEAD

The highlights are:

* Enable VFIO device on PowerPC, from David Gibson
* Optimizations to speed up IPIs between vcpus in HV KVM,
  from Suresh Warrier (who is also Suresh E. Warrier)
* In-kernel handling of IOMMU hypercalls, and support for dynamic DMA
  windows (DDW), from Alexey Kardashevskiy.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 years agoKVM: PPC: Add support for 64bit TCE windows
Alexey Kardashevskiy [Tue, 1 Mar 2016 06:54:40 +0000 (17:54 +1100)]
KVM: PPC: Add support for 64bit TCE windows

The existing KVM_CREATE_SPAPR_TCE only supports 32bit windows which is not
enough for directly mapped windows as the guest can get more than 4GB.

This adds KVM_CREATE_SPAPR_TCE_64 ioctl and advertises it
via KVM_CAP_SPAPR_TCE_64 capability. The table size is checked against
the locked memory limit.

Since 64bit windows are to support Dynamic DMA windows (DDW), let's add
@bus_offset and @page_shift which are also required by DDW.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
8 years agoKVM: PPC: Add @offset to kvmppc_spapr_tce_table
Alexey Kardashevskiy [Tue, 1 Mar 2016 06:54:39 +0000 (17:54 +1100)]
KVM: PPC: Add @offset to kvmppc_spapr_tce_table

This enables userspace view of TCE tables to start from non-zero offset
on a bus. This will be used for huge DMA windows.

This only changes the internal structure, the user interface needs to
change in order to use an offset.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
8 years agoKVM: PPC: Add @page_shift to kvmppc_spapr_tce_table
Alexey Kardashevskiy [Tue, 1 Mar 2016 06:54:38 +0000 (17:54 +1100)]
KVM: PPC: Add @page_shift to kvmppc_spapr_tce_table

At the moment the kvmppc_spapr_tce_table struct can only describe
4GB windows and handle fixed size (4K) pages. Dynamic DMA windows
support more so these limits need to be extended.

This replaces window_size (in bytes, 4GB max) with page_shift (32bit)
and size (64bit, in pages).

This should cause no behavioural change as this is changing
the internal structures only - the user interface still only
allows one to create a 32-bit table with 4KiB pages at this stage.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
8 years agoKVM: PPC: Reserve KVM_CAP_SPAPR_TCE_64 capability number
Alexey Kardashevskiy [Tue, 1 Mar 2016 06:54:37 +0000 (17:54 +1100)]
KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_64 capability number

This adds a capability number for 64-bit TCE tables support.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
8 years agoKVM: arm/arm64: timer: Add active state caching
Marc Zyngier [Fri, 29 Jan 2016 19:04:48 +0000 (19:04 +0000)]
KVM: arm/arm64: timer: Add active state caching

Programming the active state in the (re)distributor can be an
expensive operation so it makes some sense to try and reduce
the number of accesses as much as possible. So far, we
program the active state on each VM entry, but there is some
opportunity to do less.

An obvious solution is to cache the active state in memory,
and only program it in the HW when conditions change. But
because the HW can also change things under our feet (the active
state can transition from 1 to 0 when the guest does an EOI),
some precautions have to be taken, which amount to only caching
an "inactive" state, and always programing it otherwise.

With this in place, we observe a reduction of around 700 cycles
on a 2GHz GICv2 platform for a NULL hypercall.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoARM: KVM: Switch the CP reg search to be a binary search
Marc Zyngier [Thu, 21 Jan 2016 17:34:22 +0000 (17:34 +0000)]
ARM: KVM: Switch the CP reg search to be a binary search

Doing a linear search is a bit silly when we can do a binary search.
Not that we trap that so many things that it has become a burden yet,
but it makes sense to align it with the arm64 code.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoARM: KVM: Rename struct coproc_reg::is_64 to is_64bit
Marc Zyngier [Thu, 21 Jan 2016 17:04:52 +0000 (17:04 +0000)]
ARM: KVM: Rename struct coproc_reg::is_64 to is_64bit

As we're going to play some tricks on the struct coproc_reg,
make sure its 64bit indicator field matches that of coproc_params.

Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoARM: KVM: Enforce sorting of all CP tables
Marc Zyngier [Thu, 21 Jan 2016 15:34:35 +0000 (15:34 +0000)]
ARM: KVM: Enforce sorting of all CP tables

Since we're obviously terrible at sorting the CP tables, make sure
we're going to do it properly (or fail to boot). arm64 has had the
same mechanism for a while, and nobody ever broke it...

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoARM: KVM: Properly sort the invariant table
Marc Zyngier [Thu, 21 Jan 2016 15:37:03 +0000 (15:37 +0000)]
ARM: KVM: Properly sort the invariant table

Not having the invariant table properly sorted is an oddity, and
may get in the way of future optimisations. Let's fix it.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Switch the sys_reg search to be a binary search
Marc Zyngier [Thu, 21 Jan 2016 18:27:04 +0000 (18:27 +0000)]
arm64: KVM: Switch the sys_reg search to be a binary search

Our 64bit sys_reg table is about 90 entries long (so far, and the
PMU support is likely to increase this). This means that on average,
it takes 45 comparaisons to find the right entry (and actually the
full 90 if we have to search the invariant table).

Not the most efficient thing. Specially when you think that this
table is already sorted. Switching to a binary search effectively
reduces the search to about 7 comparaisons. Slightly better!

As an added bonus, the comparison is done by comparing all the
fields at once, instead of one at a time.

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add a new vcpu device control group for PMUv3
Shannon Zhao [Mon, 11 Jan 2016 13:35:32 +0000 (21:35 +0800)]
arm64: KVM: Add a new vcpu device control group for PMUv3

To configure the virtual PMUv3 overflow interrupt number, we use the
vcpu kvm_device ioctl, encapsulating the KVM_ARM_VCPU_PMU_V3_IRQ
attribute within the KVM_ARM_VCPU_PMU_V3_CTRL group.

After configuring the PMUv3, call the vcpu ioctl with attribute
KVM_ARM_VCPU_PMU_V3_INIT to initialize the PMUv3.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Introduce per-vcpu kvm device controls
Shannon Zhao [Mon, 11 Jan 2016 12:56:17 +0000 (20:56 +0800)]
arm64: KVM: Introduce per-vcpu kvm device controls

In some cases it needs to get/set attributes specific to a vcpu and so
needs something else than ONE_REG.

Let's copy the KVM_DEVICE approach, and define the respective ioctls
for the vcpu file descriptor.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add a new feature bit for PMUv3
Shannon Zhao [Mon, 11 Jan 2016 14:46:15 +0000 (22:46 +0800)]
arm64: KVM: Add a new feature bit for PMUv3

To support guest PMUv3, use one bit of the VCPU INIT feature array.
Initialize the PMU when initialzing the vcpu with that bit and PMU
overflow interrupt set.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Free perf event of PMU when destroying vcpu
Shannon Zhao [Fri, 11 Sep 2015 07:18:05 +0000 (15:18 +0800)]
arm64: KVM: Free perf event of PMU when destroying vcpu

When KVM frees VCPU, it needs to free the perf_event of PMU.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Reset PMU state when resetting vcpu
Shannon Zhao [Fri, 11 Sep 2015 03:30:22 +0000 (11:30 +0800)]
arm64: KVM: Reset PMU state when resetting vcpu

When resetting vcpu, it needs to reset the PMU state to initial status.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add PMU overflow interrupt routing
Shannon Zhao [Fri, 26 Feb 2016 11:29:19 +0000 (19:29 +0800)]
arm64: KVM: Add PMU overflow interrupt routing

When calling perf_event_create_kernel_counter to create perf_event,
assign a overflow handler. Then when the perf event overflows, set the
corresponding bit of guest PMOVSSET register. If this counter is enabled
and its interrupt is enabled as well, kick the vcpu to sync the
interrupt.

On VM entry, if there is counter overflowed and interrupt level is
changed, inject the interrupt with corresponding level. On VM exit, sync
the interrupt level as well if it has been changed.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for PMUSERENR register
Shannon Zhao [Tue, 8 Sep 2015 07:15:56 +0000 (15:15 +0800)]
arm64: KVM: Add access handler for PMUSERENR register

This register resets as unknown in 64bit mode while it resets as zero
in 32bit mode. Here we choose to reset it as zero for consistency.

PMUSERENR_EL0 holds some bits which decide whether PMU registers can be
accessed from EL0. Add some check helpers to handle the access from EL0.

When these bits are zero, only reading PMUSERENR will trap to EL2 and
writing PMUSERENR or reading/writing other PMU registers will trap to
EL1 other than EL2 when HCR.TGE==0. To current KVM configuration
(HCR.TGE==0) there is no way to get these traps. Here we write 0xf to
physical PMUSERENR register on VM entry, so that it will trap PMU access
from EL0 to EL2. Within the register access handler we check the real
value of guest PMUSERENR register to decide whether this access is
allowed. If not allowed, return false to inject UND to guest.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add helper to handle PMCR register bits
Shannon Zhao [Wed, 28 Oct 2015 04:10:30 +0000 (12:10 +0800)]
arm64: KVM: Add helper to handle PMCR register bits

According to ARMv8 spec, when writing 1 to PMCR.E, all counters are
enabled by PMCNTENSET, while writing 0 to PMCR.E, all counters are
disabled. When writing 1 to PMCR.P, reset all event counters, not
including PMCCNTR, to zero. When writing 1 to PMCR.C, reset PMCCNTR to
zero.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for PMSWINC register
Shannon Zhao [Tue, 8 Sep 2015 07:49:39 +0000 (15:49 +0800)]
arm64: KVM: Add access handler for PMSWINC register

Add access handler which emulates writing and reading PMSWINC
register and add support for creating software increment event.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for PMOVSSET and PMOVSCLR register
Shannon Zhao [Tue, 8 Sep 2015 07:03:26 +0000 (15:03 +0800)]
arm64: KVM: Add access handler for PMOVSSET and PMOVSCLR register

Since the reset value of PMOVSSET and PMOVSCLR is UNKNOWN, use
reset_unknown for its reset handler. Add a handler to emulate writing
PMOVSSET or PMOVSCLR register.

When writing non-zero value to PMOVSSET, the counter and its interrupt
is enabled, kick this vcpu to sync PMU interrupt.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for PMINTENSET and PMINTENCLR register
Shannon Zhao [Tue, 8 Sep 2015 06:40:20 +0000 (14:40 +0800)]
arm64: KVM: Add access handler for PMINTENSET and PMINTENCLR register

Since the reset value of PMINTENSET and PMINTENCLR is UNKNOWN, use
reset_unknown for its reset handler. Add a handler to emulate writing
PMINTENSET or PMINTENCLR register.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for event type register
Shannon Zhao [Tue, 23 Feb 2016 03:11:27 +0000 (11:11 +0800)]
arm64: KVM: Add access handler for event type register

These kind of registers include PMEVTYPERn, PMCCFILTR and PMXEVTYPER
which is mapped to PMEVTYPERn or PMCCFILTR.

The access handler translates all aarch32 register offsets to aarch64
ones and uses vcpu_sys_reg() to access their values to avoid taking care
of big endian.

When writing to these registers, create a perf_event for the selected
event type.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: PMU: Add perf event map and introduce perf event creating function
Shannon Zhao [Fri, 3 Jul 2015 06:27:25 +0000 (14:27 +0800)]
arm64: KVM: PMU: Add perf event map and introduce perf event creating function

When we use tools like perf on host, perf passes the event type and the
id of this event type category to kernel, then kernel will map them to
hardware event number and write this number to PMU PMEVTYPER<n>_EL0
register. When getting the event number in KVM, directly use raw event
type to create a perf_event for it.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for PMCNTENSET and PMCNTENCLR register
Shannon Zhao [Tue, 8 Sep 2015 04:26:13 +0000 (12:26 +0800)]
arm64: KVM: Add access handler for PMCNTENSET and PMCNTENCLR register

Since the reset value of PMCNTENSET and PMCNTENCLR is UNKNOWN, use
reset_unknown for its reset handler. Add a handler to emulate writing
PMCNTENSET or PMCNTENCLR register.

When writing to PMCNTENSET, call perf_event_enable to enable the perf
event. When writing to PMCNTENCLR, call perf_event_disable to disable
the perf event.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for event counter register
Shannon Zhao [Tue, 8 Dec 2015 07:29:06 +0000 (15:29 +0800)]
arm64: KVM: Add access handler for event counter register

These kind of registers include PMEVCNTRn, PMCCNTR and PMXEVCNTR which
is mapped to PMEVCNTRn.

The access handler translates all aarch32 register offsets to aarch64
ones and uses vcpu_sys_reg() to access their values to avoid taking care
of big endian.

When reading these registers, return the sum of register value and the
value perf event counts.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for PMCEID0 and PMCEID1 register
Shannon Zhao [Mon, 7 Sep 2015 08:11:12 +0000 (16:11 +0800)]
arm64: KVM: Add access handler for PMCEID0 and PMCEID1 register

Add access handler which gets host value of PMCEID0 or PMCEID1 when
guest access these registers. Writing action to PMCEID0 or PMCEID1 is
UNDEFINED.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for PMSELR register
Shannon Zhao [Mon, 31 Aug 2015 09:20:22 +0000 (17:20 +0800)]
arm64: KVM: Add access handler for PMSELR register

Since the reset value of PMSELR_EL0 is UNKNOWN, use reset_unknown for
its reset handler. When reading PMSELR, return the PMSELR.SEL field to
guest.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Add access handler for PMCR register
Shannon Zhao [Thu, 18 Jun 2015 08:01:53 +0000 (16:01 +0800)]
arm64: KVM: Add access handler for PMCR register

Add reset handler which gets host value of PMCR_EL0 and make writable
bits architecturally UNKNOWN except PMCR.E which is zero. Add an access
handler for PMCR.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
8 years agoarm64: KVM: Define PMU data structure for each vcpu
Shannon Zhao [Fri, 11 Sep 2015 01:38:32 +0000 (09:38 +0800)]
arm64: KVM: Define PMU data structure for each vcpu

Here we plan to support virtual PMU for guest by full software
emulation, so define some basic structs and functions preparing for
futher steps. Define struct kvm_pmc for performance monitor counter and
struct kvm_pmu for performance monitor unit for each vcpu. According to
ARMv8 spec, the PMU contains at most 32(ARMV8_PMU_MAX_COUNTERS)
counters.

Since this only supports ARM64 (or PMUv3), add a separate config symbol
for it.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>