pandora-kernel.git
16 years agoKVM: Portability: Move x86 emulation and mmio device hook to x86.c
Carsten Otte [Tue, 30 Oct 2007 17:44:21 +0000 (18:44 +0100)]
KVM: Portability: Move x86 emulation and mmio device hook to x86.c

This patch moves the following functions from kvm_main.c to x86.c:
emulator_read/write_std, vcpu_find_pervcpu_dev, vcpu_find_mmio_dev,
emulator_read/write_emulated, emulator_write_phys,
emulator_write_emulated_onepage, emulator_cmpxchg_emulated,
get_segment_base, emulate_invlpg, emulate_clts, emulator_get/set_dr,
kvm_report_emulation_failure, emulate_instruction

The following data type is moved to x86.c:
struct x86_emulate_ops emulate_ops

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Portability: Move kvm_get/set_msr[_common] to x86.c
Carsten Otte [Tue, 30 Oct 2007 17:44:17 +0000 (18:44 +0100)]
KVM: Portability: Move kvm_get/set_msr[_common] to x86.c

This patch moves the implementation of the functions of kvm_get/set_msr,
kvm_get/set_msr_common, and set_efer from kvm_main.c to x86.c. The
definition of EFER_RESERVED_BITS is moved too.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Fix gfn_to_page() acquiring mmap_sem twice
Anthony Liguori [Mon, 29 Oct 2007 20:15:20 +0000 (15:15 -0500)]
KVM: Fix gfn_to_page() acquiring mmap_sem twice

KVM's nopage handler calls gfn_to_page() which acquires the mmap_sem when
calling out to get_user_pages().  nopage handlers are already invoked with the
mmap_sem held though.  Introduce a __gfn_to_page() for use by the nopage
handler which requires the lock to already be held.

This was noticed by tglx.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
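
The locked/unlocked split described in the gfn_to_page() fix above can be sketched in plain C. This is a minimal illustration, not the kernel code: `toy_lock`, `lookup()`, `lookup_locked()` and `fault_handler()` are hypothetical stand-ins for mmap_sem, gfn_to_page(), __gfn_to_page() and the nopage handler.

```c
#include <assert.h>

/* Toy stand-in for mmap_sem: records whether the lock is currently held. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l)
{
    /* A non-recursive lock must not be taken twice on the same path. */
    assert(l->held == 0);
    l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held == 1);
    l->held = 0;
}

/* Inner helper: the caller must already hold the lock. */
static int lookup_locked(struct toy_lock *l, int gfn)
{
    assert(l->held == 1);
    return gfn + 1000;  /* placeholder for the real translation */
}

/* Outer wrapper: takes the lock itself, for callers that do not hold it. */
static int lookup(struct toy_lock *l, int gfn)
{
    int page;

    toy_lock_acquire(l);
    page = lookup_locked(l, gfn);
    toy_lock_release(l);
    return page;
}

/* A handler that is entered with the lock held calls the inner helper. */
static int fault_handler(struct toy_lock *l, int gfn)
{
    return lookup_locked(l, gfn);
}
```

Calling `lookup()` from `fault_handler()` would trip the assertion in `toy_lock_acquire()`, which is the toy equivalent of the double acquisition the patch removes.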
16 years agoKVM: VMX: Enable memory mapped TPR shadow (FlexPriority)
Sheng Yang [Mon, 29 Oct 2007 01:40:42 +0000 (09:40 +0800)]
KVM: VMX: Enable memory mapped TPR shadow (FlexPriority)

This patch is based on the CR8/TPR patch and enables the TPR shadow
(FlexPriority) for 32-bit Windows.  Because 32-bit Windows accesses the TPR
very frequently, especially in SMP guests, we saw a significant performance
gain with FlexPriority enabled.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Portability: Move control register helper functions to x86.c
Carsten Otte [Mon, 29 Oct 2007 15:09:35 +0000 (16:09 +0100)]
KVM: Portability: Move control register helper functions to x86.c

This patch moves the definitions of CR0_RESERVED_BITS,
CR4_RESERVED_BITS, and CR8_RESERVED_BITS along with the following
functions from kvm_main.c to x86.c:
set_cr0(), set_cr3(), set_cr4(), set_cr8(), get_cr8(), lmsw(),
load_pdptrs()
The static function wrapper inject_gp is duplicated in kvm_main.c and
x86.c for now; the version in kvm_main.c should disappear once its last
user is gone too.
The function load_pdptrs is no longer static and is now defined in x86.h
for the time being, until its last user is gone from kvm_main.c.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Portability: move get/set_apic_base to x86.c
Carsten Otte [Mon, 29 Oct 2007 15:09:10 +0000 (16:09 +0100)]
KVM: Portability: move get/set_apic_base to x86.c

This patch moves the implementation of get_apic_base and set_apic_base
from kvm_main.c to x86.c

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Portability: Move memory segmentation to x86.c
Carsten Otte [Mon, 29 Oct 2007 15:08:51 +0000 (16:08 +0100)]
KVM: Portability: Move memory segmentation to x86.c

This patch moves the definition of segment_descriptor_64 for AMD64 and
EM64T from kvm_main.c to segment_descriptor.h. It also adds a proper
#ifndef...#define...#endif around that header file.
The implementation of segment_base is moved from kvm_main.c to x86.c.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Portability: Split kvm_vm_ioctl v3
Carsten Otte [Mon, 29 Oct 2007 15:08:35 +0000 (16:08 +0100)]
KVM: Portability: Split kvm_vm_ioctl v3

This patch splits kvm_vm_ioctl into architecture independent parts, and
x86 specific parts which go to kvm_arch_vm_ioctl in x86.c.
The patch is unchanged since last submission.

Common ioctls for all architectures are:
KVM_CREATE_VCPU, KVM_GET_DIRTY_LOG, KVM_SET_USER_MEMORY_REGION

x86 specific ioctls are:
KVM_SET_MEMORY_REGION,
KVM_GET/SET_NR_MMU_PAGES, KVM_SET_MEMORY_ALIAS, KVM_CREATE_IRQCHIP,
KVM_CREATE_IRQ_LINE, KVM_GET/SET_IRQCHIP,
KVM_SET_TSS_ADDR

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Topup the mmu memory preallocation caches before emulating an insn
Avi Kivity [Sun, 28 Oct 2007 16:52:05 +0000 (18:52 +0200)]
KVM: MMU: Topup the mmu memory preallocation caches before emulating an insn

Emulation may cause a shadow pte to be instantiated, which requires
memory resources.  Make sure the caches are filled to avoid an oops.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Move page fault processing to common code
Avi Kivity [Sun, 28 Oct 2007 16:48:59 +0000 (18:48 +0200)]
KVM: Move page fault processing to common code

The code that dispatches the page fault and emulates if we failed to map
is duplicated across vmx and svm.  Merge it to simplify further bugfixing.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: don't depend on cr2 for mov abs emulation
Avi Kivity [Sun, 28 Oct 2007 14:34:25 +0000 (16:34 +0200)]
KVM: x86 emulator: don't depend on cr2 for mov abs emulation

The 'mov abs' instruction family (opcodes 0xa0 - 0xa3) still depends on cr2
provided by the page fault handler.  This is wrong for several reasons:

- if an instruction accessed misaligned data that crosses a page boundary,
  and if the fault happened on the second page, cr2 will point at the
  second page, not the data itself.

- if we're emulating in real mode, or due to a FlexPriority exit, there
  is no cr2 generated.

So, this change adds decoding for this instruction form and drops reliance
on cr2.

Signed-off-by: Avi Kivity <avi@qumranet.com>
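
As a rough illustration of decoding the operand from the instruction bytes instead of relying on cr2, the sketch below extracts the absolute offset that follows a 'mov abs' opcode. It is not the emulator's code: `decode_mov_abs()` is a hypothetical helper, and it assumes an instruction with no prefixes and an address size supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the bytes of a 'mov abs' instruction (opcode
 * 0xa0..0xa3, no prefixes) and the effective address size in bytes
 * (2, 4 or 8), store the absolute offset encoded after the opcode.
 * Returns 0 on success, -1 if the input is not a complete 'mov abs' form.
 */
static int decode_mov_abs(const uint8_t *insn, size_t len,
                          unsigned addr_bytes, uint64_t *offset)
{
    uint64_t val = 0;
    unsigned i;

    if (len < 1 || insn[0] < 0xa0 || insn[0] > 0xa3)
        return -1;
    if (addr_bytes != 2 && addr_bytes != 4 && addr_bytes != 8)
        return -1;
    if (len < 1 + (size_t)addr_bytes)
        return -1;

    /* The offset is stored little-endian right after the opcode byte. */
    for (i = 0; i < addr_bytes; i++)
        val |= (uint64_t)insn[1 + i] << (8 * i);

    *offset = val;
    return 0;
}
```

Because the address comes from the instruction stream, it is the same whether the access crossed a page boundary or no page fault happened at all.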
16 years agoKVM: SVM: Let gcc to choose which registers to save (i386)
Laurent Vivier [Thu, 25 Oct 2007 12:18:54 +0000 (14:18 +0200)]
KVM: SVM: Let gcc to choose which registers to save (i386)

This patch lets GCC determine which registers to save when we
switch to/from a VCPU in the case of AMD i386.

* Original code saves the following registers:

    ebx, ecx, edx, esi, edi, ebp

* Patched code:

  - informs GCC, via the clobber description, that we modify the
    following registers:

    ebx, ecx, edx, esi, edi

  - ebp is saved (pop/push) because GCC seems to ignore its use in the clobber
    description.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: Let gcc to choose which registers to save (x86_64)
Laurent Vivier [Thu, 25 Oct 2007 12:18:53 +0000 (14:18 +0200)]
KVM: SVM: Let gcc to choose which registers to save (x86_64)

This patch lets GCC determine which registers to save when we
switch to/from a VCPU in the case of AMD x86_64.

* Original code saves the following registers:

    rbx, rcx, rdx, rsi, rdi, rbp,
    r8, r9, r10, r11, r12, r13, r14, r15

* Patched code:

  - informs GCC, via the clobber description, that we modify the
    following registers:

    rbx, rcx, rdx, rsi, rdi,
    r8, r9, r10, r11, r12, r13, r14, r15

  - rbp is saved (pop/push) because GCC seems to ignore its use in the clobber
    description.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Let gcc to choose which registers to save (i386)
Laurent Vivier [Thu, 25 Oct 2007 12:18:55 +0000 (14:18 +0200)]
KVM: VMX: Let gcc to choose which registers to save (i386)

This patch lets GCC determine which registers to save when we
switch to/from a VCPU in the case of Intel i386.

* Original code saves the following registers:

    eax, ebx, ecx, edx, edi, esi, ebp (using popa)

* Patched code:

  - informs GCC, via the clobber description, that we modify the
    following registers:

    ebx, edi, esi

  - doesn't save eax because it is an output operand (vmx->fail)

  - cannot put ecx in the clobber description because it is an input operand,
    but as we modify it and we want to keep its value (vcpu), we must
    save it (pop/push)

  - ebp is saved (pop/push) because GCC seems to ignore its use in the clobber
    description.

  - edx is saved (pop/push) because it is reserved by GCC (REGPARM) and
    cannot be put in the clobber description.

  - line "mov (%%esp), %3 \n\t" has been removed because %3
    is ecx and ecx is restored just after.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Let gcc to choose which registers to save (x86_64)
Laurent Vivier [Thu, 25 Oct 2007 12:18:52 +0000 (14:18 +0200)]
KVM: VMX: Let gcc to choose which registers to save (x86_64)

This patch lets GCC determine which registers to save when we
switch to/from a VCPU in the case of Intel x86_64.

* Original code saves the following registers:

    rax, rbx, rcx, rdx, rsi, rdi, rbp,
    r8, r9, r10, r11, r12, r13, r14, r15

* Patched code:

  - informs GCC, via the clobber description, that we modify the
    following registers:

    rbx, rdi, rsi,
    r8, r9, r10, r11, r12, r13, r14, r15

  - doesn't save rax because it is an output operand (vmx->fail)

  - cannot put rcx in the clobber description because it is an input operand,
    but as we modify it and we want to keep its value (vcpu), we must
    save it (pop/push)

  - rbp is saved (pop/push) because GCC seems to ignore its use in the clobber
    description.

  - rdx is saved (pop/push) because it is reserved by GCC (REGPARM) and
    cannot be put in the clobber description.

  - line "mov (%%rsp), %3 \n\t" has been removed because %3
    is rcx and rcx is restored just after.

  - line ASM_VMX_VMWRITE_RSP_RDX() is moved out of the ifdef/else/endif

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
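
The clobber-list approach used by the four SVM/VMX patches above can be shown on a much smaller scale. The sketch below is not the VCPU switch code: `add_with_scratch()` is a hypothetical function that uses rbx as a scratch register and declares it in the clobber list, so the compiler handles any save and restore itself. It assumes GCC-style extended asm on x86_64 and falls back to plain C elsewhere.

```c
#include <assert.h>

/*
 * Adds two values while deliberately using rbx as a scratch register.
 * Listing "rbx" in the clobber list tells the compiler not to keep
 * anything live in it across the asm, so no manual push/pop is needed.
 */
static unsigned long add_with_scratch(unsigned long a, unsigned long b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    unsigned long out;

    __asm__ volatile(
        "mov %1, %%rbx\n\t"
        "add %2, %%rbx\n\t"
        "mov %%rbx, %0\n\t"
        : "=r"(out)            /* output operand, chosen by the compiler */
        : "r"(a), "r"(b)       /* input operands, never placed in rbx */
        : "rbx", "cc");        /* registers and flags the asm modifies */

    /* Sanity check: the asm result matches ordinary C addition. */
    assert(out == a + b);
    return out;
#else
    /* Portable fallback so the example builds on other targets. */
    return a + b;
#endif
}
```

Registers that are operands of the asm cannot also appear in the clobber list, which is the constraint the VMX patches work around by saving rcx/ecx by hand.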
16 years agoKVM: Add ioctl to tss address from userspace,
Izik Eidus [Wed, 24 Oct 2007 22:29:55 +0000 (00:29 +0200)]
KVM: Add ioctl to tss address from userspace,

Currently kvm has a wart in that it requires three extra pages for use
as a tss when emulating real mode on Intel.  This patch moves the allocation
internally, only requiring userspace to tell us where in the physical address
space we can place the tss.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add kernel-internal memory slots
Izik Eidus [Wed, 24 Oct 2007 21:57:46 +0000 (23:57 +0200)]
KVM: Add kernel-internal memory slots

Reserve a few memory slots for kernel-internal use.  This is useful when
you have to register a memory region and want to be sure it was not
registered from userspace, and when you want to register a memory region
that won't be seen from userspace.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
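
One way to picture the reserved slots from the patch above is an array with a user-visible prefix and a kernel-only tail. The sketch below is an assumption-laden toy: `USER_SLOTS`, `PRIVATE_SLOTS`, `set_slot()` and `set_slot_from_user()` are hypothetical names and sizes, not the KVM definitions.

```c
#include <assert.h>

/* Hypothetical sizes; the real constants live in the KVM headers. */
#define USER_SLOTS    8   /* slots userspace may configure */
#define PRIVATE_SLOTS 4   /* extra slots reserved for the kernel */

struct slot { int in_use; };

struct vm {
    struct slot slots[USER_SLOTS + PRIVATE_SLOTS];
};

/* Shared allocator: accepts any valid slot index. */
static int set_slot(struct vm *vm, int id)
{
    if (id < 0 || id >= USER_SLOTS + PRIVATE_SLOTS)
        return -1;
    vm->slots[id].in_use = 1;
    return 0;
}

/* ioctl path: userspace is confined to the first USER_SLOTS entries. */
static int set_slot_from_user(struct vm *vm, int id)
{
    if (id >= USER_SLOTS)
        return -1;
    return set_slot(vm, id);
}
```

Kernel code calls `set_slot()` directly with an index in the private range, so a region registered there can never collide with, or be replaced by, a userspace request.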
16 years agoKVM: Export memory slot allocation mechanism
Izik Eidus [Wed, 24 Oct 2007 21:52:57 +0000 (23:52 +0200)]
KVM: Export memory slot allocation mechanism

Move the kvm memory slot allocation mechanism out of the ioctl
and into an exported function.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Unmap kernel-allocated memory on slot destruction
Izik Eidus [Thu, 25 Oct 2007 09:54:04 +0000 (11:54 +0200)]
KVM: Unmap kernel-allocated memory on slot destruction

kvm_vm_ioctl_set_memory_region() is able to remove memory in addition to
adding it.  Therefore, when using kernel swapping support for old userspaces,
we need to munmap the memory if the user requests its removal.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Per-architecture hypercall definitions
Christian Borntraeger [Thu, 11 Oct 2007 13:34:17 +0000 (15:34 +0200)]
KVM: Per-architecture hypercall definitions

Currently kvm provides hypercalls only for x86* architectures. To
provide hypercall infrastructure for other kvm architectures I split
kvm_para.h into a generic header file and architecture specific
definitions.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Split IOAPIC reset function and export for kernel RESET
Eddie Dong [Wed, 10 Oct 2007 10:15:54 +0000 (12:15 +0200)]
KVM: Split IOAPIC reset function and export for kernel RESET

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Export PIC reset for kernel device reset
Eddie Dong [Wed, 10 Oct 2007 10:14:25 +0000 (12:14 +0200)]
KVM: Export PIC reset for kernel device reset

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add a might_sleep() annotation to gfn_to_page()
Avi Kivity [Sun, 21 Oct 2007 09:03:36 +0000 (11:03 +0200)]
KVM: Add a might_sleep() annotation to gfn_to_page()

This will help trap accesses to guest memory in atomic context.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Move vmx_vcpu_reset() out of vmx_vcpu_setup()
Avi Kivity [Sun, 21 Oct 2007 09:00:39 +0000 (11:00 +0200)]
KVM: Move vmx_vcpu_reset() out of vmx_vcpu_setup()

Split guest reset code out of vmx_vcpu_setup().  Besides being cleaner, this
moves the realmode tss setup (which can sleep) outside vmx_vcpu_setup()
(which is executed with preemption disabled).

[izik: remove unused variable]

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Portability: Split kvm_vcpu into arch dependent and independent parts (part 1)
Zhang Xiantao [Sat, 20 Oct 2007 07:34:38 +0000 (15:34 +0800)]
KVM: Portability: Split kvm_vcpu into arch dependent and independent parts (part 1)

First step to split kvm_vcpu.  Currently, we just use a macro to define
the common fields in kvm_vcpu for all archs, and each arch needs to define
its own kvm_vcpu struct.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
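
The macro technique from the kvm_vcpu split above can be sketched as follows. The field names and the two structs are illustrative assumptions, not the contents of the actual patch.

```c
#include <assert.h>
#include <stddef.h>

struct kvm;  /* opaque in this sketch */

/* Hypothetical subset of fields shared by every architecture. */
#define VCPU_COMMON_FIELDS \
    struct kvm *kvm;       \
    int vcpu_id;           \
    unsigned long requests;

/* Each architecture defines its own struct, starting with the macro. */
struct vcpu_x86 {
    VCPU_COMMON_FIELDS
    unsigned long cr3;       /* illustrative x86-only state */
};

struct vcpu_s390 {
    VCPU_COMMON_FIELDS
    unsigned long psw_addr;  /* illustrative s390-only state */
};
```

Since both structs begin with the same members in the same order, the common fields sit at identical offsets, while each architecture remains free to append its own state.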
16 years agoKVM: Allocate userspace memory for older userspace
Anthony Liguori [Thu, 18 Oct 2007 14:59:34 +0000 (09:59 -0500)]
KVM: Allocate userspace memory for older userspace

Allocate a userspace buffer for older userspaces.  Also eliminate phys_mem
buffer.  The memset() in kvmctl really kills initial memory usage but swapping
works even with old userspaces.

A side effect is that the maximum guest size is reduced for older userspace
on i386.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Use virtual cpu accounting if available for guest times.
Christian Borntraeger [Thu, 18 Oct 2007 12:39:10 +0000 (14:39 +0200)]
KVM: Use virtual cpu accounting if available for guest times.

ppc and s390 offer the possibility to track process times precisely
by looking at cpu timer on every context switch, irq, softirq etc.
We can use that infrastructure as well for guest time accounting.
We need to account the used time before we change the state.
This patch adds a call to account_system_vtime to kvm_guest_enter
and kvm_guest_exit. If CONFIG_VIRT_CPU_ACCOUNTING is not set,
account_system_vtime is defined in hardirq.h as an empty function,
which means this patch does not change the behaviour on other
platforms.

I compile tested this patch on x86 and function tested the patch on
s390.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Partial swapping of guest memory
Izik Eidus [Thu, 18 Oct 2007 09:09:33 +0000 (11:09 +0200)]
KVM: MMU: Partial swapping of guest memory

This allows guest memory to be swapped.  Pages which are currently mapped
via shadow page tables are pinned into memory, but all other pages can
be freely swapped.

The patch makes gfn_to_page() elevate the page's reference count, and
introduces kvm_release_page() that pairs with it.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
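
The get/release pairing introduced by the swapping patch above can be modelled with a plain counter. This is a toy sketch under stated assumptions: `toy_page`, `toy_gfn_to_page()`, `toy_release_page()` and `toy_can_swap()` are hypothetical, and the real kernel reference counting is atomic and lives in struct page.

```c
#include <assert.h>

struct toy_page { int refcount; };

/* Lookup takes a reference so the page cannot be reclaimed while in use. */
static struct toy_page *toy_gfn_to_page(struct toy_page *pages, int gfn)
{
    struct toy_page *p = &pages[gfn];

    p->refcount++;
    return p;
}

/* Every lookup must be balanced by exactly one release. */
static void toy_release_page(struct toy_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* In this toy model a page may be swapped only while nobody holds it. */
static int toy_can_swap(const struct toy_page *p)
{
    return p->refcount == 0;
}
```

A caller that forgets the release keeps the page pinned indefinitely, which is why the patch introduces the two functions as a pair.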
16 years agoKVM: MMU: Make gfn_to_page() always safe
Izik Eidus [Wed, 17 Oct 2007 17:17:48 +0000 (19:17 +0200)]
KVM: MMU: Make gfn_to_page() always safe

In case the page is not present in the guest memory map, return a dummy
page the guest can scribble on.

This simplifies error checking in its users.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
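
The "always safe" lookup from the patch above amounts to returning a shared dummy page instead of a null pointer. A minimal sketch, with `toy_mem`, `bad_page` and `toy_lookup()` as hypothetical names:

```c
#include <assert.h>

#define TOY_NPAGES 4

struct toy_mem {
    char pages[TOY_NPAGES][16];  /* pages backing valid guest frames */
    char bad_page[16];           /* shared dummy page for unmapped frames */
};

/* Never returns NULL: unmapped frame numbers resolve to the dummy page. */
static char *toy_lookup(struct toy_mem *m, long gfn)
{
    if (gfn < 0 || gfn >= TOY_NPAGES)
        return m->bad_page;
    return m->pages[gfn];
}
```

Callers can read or write through the returned pointer unconditionally; writes to an unmapped frame land in the dummy page and are simply never observed by valid mappings.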
16 years agoKVM: MMU: Keep a reverse mapping of non-writable translations
Izik Eidus [Tue, 16 Oct 2007 12:43:46 +0000 (14:43 +0200)]
KVM: MMU: Keep a reverse mapping of non-writable translations

The current kvm mmu only reverse maps writable translations.  This is used
to write-protect a page in case it becomes a pagetable.

But with swapping support, we need a reverse mapping of read-only pages as
well:  when we evict a page, we need to remove any mapping to it, whether
writable or not.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Add rmap_next(), a helper for walking kvm rmaps
Izik Eidus [Tue, 16 Oct 2007 12:42:30 +0000 (14:42 +0200)]
KVM: MMU: Add rmap_next(), a helper for walking kvm rmaps

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: cmc, clc, cli, sti
Nitin A Kamble [Wed, 17 Oct 2007 01:23:27 +0000 (18:23 -0700)]
KVM: x86 emulator: cmc, clc, cli, sti

Instruction: cmc, clc, cli, sti
opcodes: 0xf5, 0xf8, 0xfa, 0xfb respectively.

[avi: fix reference to EFLG_IF which is not defined anywhere]

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
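
The four instructions listed in the patch above each touch a single EFLAGS bit, which can be sketched as below. `emulate_flag_op()` and the `FLAG_*` names are hypothetical, and the sketch ignores details such as privilege checks and the one-instruction interrupt shadow after sti.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_CF (1u << 0)  /* carry flag, EFLAGS bit 0 */
#define FLAG_IF (1u << 9)  /* interrupt enable flag, EFLAGS bit 9 */

/* Returns 0 if the opcode was handled, -1 for any other opcode. */
static int emulate_flag_op(uint8_t opcode, uint32_t *eflags)
{
    switch (opcode) {
    case 0xf5: *eflags ^= FLAG_CF;  break;  /* cmc: complement carry */
    case 0xf8: *eflags &= ~FLAG_CF; break;  /* clc: clear carry */
    case 0xfa: *eflags &= ~FLAG_IF; break;  /* cli: disable interrupts */
    case 0xfb: *eflags |= FLAG_IF;  break;  /* sti: enable interrupts */
    default:   return -1;
    }
    return 0;
}
```

None of these forms has a destination operand, so an emulator has nothing to write back apart from the updated flags.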
16 years agoKVM: MMU: Simplify page table walker
Avi Kivity [Wed, 17 Oct 2007 10:18:47 +0000 (12:18 +0200)]
KVM: MMU: Simplify page table walker

Simplify the walker level loop not to carry so much information from one
loop to the next.  In addition to being complex, this made kmap_atomic()
critical sections difficult to manage.

As a result of this change, kmap_atomic() sections are limited to actually
touching the guest pte, which allows the other functions called from the
walker to do sleepy operations.  This will happen when we enable swapping.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Implement emulation of instruction: inc & dec
Nitin A Kamble [Sat, 13 Oct 2007 00:40:33 +0000 (17:40 -0700)]
KVM: x86 emulator: Implement emulation of instruction: inc & dec

Instructions:
inc r16/r32 (opcode 0x40-0x47)
dec r16/r32 (opcode 0x48-0x4f)

Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
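
For the one-byte inc/dec forms above, the low three bits of the opcode select the register. The sketch below shows that decoding on a toy register file; `emulate_inc_dec()` is a hypothetical helper, it leaves out the flag updates, and it assumes a mode in which 0x40-0x4f are not REX prefixes (that is, not 64-bit mode).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: regs[] holds eax, ecx, edx, ebx, esp, ebp, esi, edi
 * in encoding order; op_bytes is the operand size (2 or 4).
 * Returns 0 if handled, -1 otherwise.  Flag updates are omitted.
 */
static int emulate_inc_dec(uint8_t opcode, uint32_t regs[8], int op_bytes)
{
    uint32_t *reg, delta;

    if (opcode < 0x40 || opcode > 0x4f)
        return -1;
    if (op_bytes != 2 && op_bytes != 4)
        return -1;

    reg = &regs[opcode & 7];                     /* low 3 bits pick the register */
    delta = (opcode < 0x48) ? 1u : 0xffffffffu;  /* +1 or -1 modulo 2^32 */

    if (op_bytes == 2)
        /* r16 form: the upper 16 bits of the register are preserved. */
        *reg = (*reg & 0xffff0000u) | ((*reg + delta) & 0xffffu);
    else
        *reg += delta;
    return 0;
}
```

A full emulation would also update the arithmetic flags other than CF, which inc and dec leave untouched.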
16 years agoKVM: Rename KVM_TLB_FLUSH to KVM_REQ_TLB_FLUSH
Avi Kivity [Tue, 16 Oct 2007 15:22:08 +0000 (17:22 +0200)]
KVM: Rename KVM_TLB_FLUSH to KVM_REQ_TLB_FLUSH

We now have a new namespace, KVM_REQ_*, for bits in vcpu->requests.

Signed-off-by: Avi Kivity <avi@qumranet.com>
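
The KVM_REQ_* namespace above names bits in a per-vcpu request word. A simplified, non-atomic sketch of how such a word is typically used follows; `toy_vcpu`, `REQ_TLB_FLUSH`, `toy_make_request()` and `toy_check_request()` are hypothetical, and the bit number is an assumption. Kernel code would use atomic bit operations such as set_bit() and test_and_clear_bit() instead.

```c
#include <assert.h>

#define REQ_TLB_FLUSH 0  /* assumed bit number, for illustration only */

struct toy_vcpu { unsigned long requests; };

/* Post a request for the vcpu to act on later. */
static void toy_make_request(struct toy_vcpu *v, int req)
{
    v->requests |= 1UL << req;
}

/* Returns 1 and clears the bit if the request was pending, else 0. */
static int toy_check_request(struct toy_vcpu *v, int req)
{
    unsigned long mask = 1UL << req;

    if (!(v->requests & mask))
        return 0;
    v->requests &= ~mask;
    return 1;
}
```

The consumer checks each request once before entering the guest, so a pending flush is performed exactly once no matter how many times it was posted.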
16 years agoKVM: Move apic timer interrupt backlog processing to common code
Avi Kivity [Tue, 16 Oct 2007 14:23:22 +0000 (16:23 +0200)]
KVM: Move apic timer interrupt backlog processing to common code

Besides the obvious goodness of making code more common, this prevents
a livelock with the next patch which moves interrupt injection out of the
critical section.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add some \n in ioapic_debug()
Laurent Vivier [Fri, 12 Oct 2007 09:01:59 +0000 (11:01 +0200)]
KVM: Add some \n in ioapic_debug()

Add new-line at end of debug strings.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: apic round robin cleanup
Qing He [Mon, 24 Sep 2007 09:39:41 +0000 (17:39 +0800)]
KVM: apic round robin cleanup

If no apic is enabled in the bitmap of an interrupt delivery with delivery
mode of lowest priority, a warning should be reported rather than selecting
a fallback vcpu.

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie (Yaozu) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Portability: split kvm_vcpu_ioctl
Carsten Otte [Thu, 11 Oct 2007 17:16:52 +0000 (19:16 +0200)]
KVM: Portability: split kvm_vcpu_ioctl

This patch splits kvm_vcpu_ioctl into architecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.

Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.

x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS

An interesting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
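
The split described in the kvm_vcpu_ioctl patch above follows a common dispatch shape: the generic handler deals with the requests it knows and forwards everything else to an architecture hook. The sketch below uses made-up request numbers and return values; `toy_vcpu_ioctl()` and `toy_arch_vcpu_ioctl()` are hypothetical stand-ins for kvm_vcpu_ioctl and kvm_arch_vcpu_ioctl.

```c
#include <assert.h>

/* Made-up request numbers and results, for illustration only. */
enum { REQ_RUN = 1, REQ_SET_CPUID = 2 };
#define RESULT_COMMON  100
#define RESULT_ARCH    200
#define ERR_UNKNOWN    (-1)

/* Architecture hook: handles the requests only this architecture knows. */
static int toy_arch_vcpu_ioctl(int req)
{
    switch (req) {
    case REQ_SET_CPUID:
        return RESULT_ARCH;
    default:
        return ERR_UNKNOWN;
    }
}

/* Generic entry point: common requests first, then the arch fallback. */
static int toy_vcpu_ioctl(int req)
{
    switch (req) {
    case REQ_RUN:
        return RESULT_COMMON;
    default:
        return toy_arch_vcpu_ioctl(req);
    }
}
```

A new architecture then supplies only its own hook, and the generic switch does not have to change.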
16 years agoKVM: MMU: When updating the dirty bit, inform the mmu about it
Avi Kivity [Thu, 11 Oct 2007 13:30:21 +0000 (15:30 +0200)]
KVM: MMU: When updating the dirty bit, inform the mmu about it

Since the mmu uses different shadow pages for dirty large pages and clean
large pages, this allows the mmu to drop ptes that are now invalid.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Move dirty bit updates to a separate function
Avi Kivity [Thu, 11 Oct 2007 13:22:59 +0000 (15:22 +0200)]
KVM: MMU: Move dirty bit updates to a separate function

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Instantiate real-mode shadows as user writable shadows
Avi Kivity [Thu, 11 Oct 2007 13:13:49 +0000 (15:13 +0200)]
KVM: MMU: Instantiate real-mode shadows as user writable shadows

This is consistent with real-mode permissions.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Disable write access on clean large pages
Avi Kivity [Thu, 11 Oct 2007 13:12:24 +0000 (15:12 +0200)]
KVM: MMU: Disable write access on clean large pages

By forcing clean huge pages to be read-only, we have separate roles
for the shadow of a clean large page and the shadow of a dirty large
page.  This is necessary because different ptes will be instantiated
for the two cases, even for read faults.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Fix nx access bit for huge pages
Avi Kivity [Thu, 11 Oct 2007 13:08:41 +0000 (15:08 +0200)]
KVM: MMU: Fix nx access bit for huge pages

We must set the bit before the shift, otherwise the wrong bit gets set.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Move guest pte dirty bit management to the guest pagetable walker
Avi Kivity [Thu, 11 Oct 2007 10:32:30 +0000 (12:32 +0200)]
KVM: Move guest pte dirty bit management to the guest pagetable walker

This is more consistent with the accessed bit management, and makes the dirty
bit available earlier for other purposes.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: More struct kvm_vcpu -> struct kvm cleanups
Anthony Liguori [Thu, 11 Oct 2007 01:08:41 +0000 (20:08 -0500)]
KVM: MMU: More struct kvm_vcpu -> struct kvm cleanups

This time, the biggest change is gpa_to_hpa. The translation of GPA to HPA does
not depend on the VCPU state, unlike GVA to GPA, so there's no need to pass in
the kvm_vcpu.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Clean up MMU functions to take struct kvm when appropriate
Anthony Liguori [Thu, 11 Oct 2007 00:25:50 +0000 (19:25 -0500)]
KVM: MMU: Clean up MMU functions to take struct kvm when appropriate

Some of the MMU functions take a struct kvm_vcpu even though they affect all
VCPUs.  This patch cleans up some of them to instead take a struct kvm.  This
makes things a bit more clear.

The main thing that was confusing me was whether certain functions need to be
called on all VCPUs.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Move x86 msr handling to new files x86.[ch]
Carsten Otte [Wed, 10 Oct 2007 15:16:19 +0000 (17:16 +0200)]
KVM: Move x86 msr handling to new files x86.[ch]

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Support assigning userspace memory to the guest
Izik Eidus [Tue, 9 Oct 2007 17:20:39 +0000 (19:20 +0200)]
KVM: Support assigning userspace memory to the guest

Instead of having the kernel allocate memory to the guest, let userspace
allocate it and pass the address to the kernel.

This is required for s390 support, but also enables features like memory
sharing and using hugetlbfs backed memory.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: CodingStyle cleanup
Mike Day [Mon, 8 Oct 2007 13:02:08 +0000 (09:02 -0400)]
KVM: CodingStyle cleanup

Signed-off-by: Mike D. Day <ncmike@ncultra.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Remove gratuitous casts from lapic.c
Rusty Russell [Mon, 8 Oct 2007 00:55:29 +0000 (10:55 +1000)]
KVM: Remove gratuitous casts from lapic.c

Since vcpu->apic is of the correct type, there's no need to cast.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Hoist kvm_create_lapic() into kvm_vcpu_init()
Rusty Russell [Mon, 8 Oct 2007 00:50:48 +0000 (10:50 +1000)]
KVM: Hoist kvm_create_lapic() into kvm_vcpu_init()

Move kvm_create_lapic() into kvm_vcpu_init(), rather than having svm
and vmx do it.  And make it return the error rather than a fairly
random -ENOMEM.

This also solves the problem that neither svm.c nor vmx.c actually
handles the error path properly.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add kvm_free_lapic() to pair with kvm_create_lapic()
Rusty Russell [Mon, 8 Oct 2007 00:48:30 +0000 (10:48 +1000)]
KVM: Add kvm_free_lapic() to pair with kvm_create_lapic()

Instead of the asymmetry of kvm_free_apic, implement kvm_free_lapic().
And guess what?  I found a minor bug: we don't need to hrtimer_cancel()
from kvm_main.c, because we do that in kvm_free_apic().

Also:
1) kvm_vcpu_uninit should be the reverse order from kvm_vcpu_init.
2) Don't set apic->regs_page to zero before freeing apic.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Allow dynamic allocation of the mmu shadow cache size
Izik Eidus [Tue, 2 Oct 2007 16:52:55 +0000 (18:52 +0200)]
KVM: Allow dynamic allocation of the mmu shadow cache size

The user is now able to set how many mmu pages will be allocated to the guest.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add general accessors to read and write guest memory
Izik Eidus [Mon, 1 Oct 2007 20:14:18 +0000 (22:14 +0200)]
KVM: Add general accessors to read and write guest memory

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Remove the usage of page->private field by rmap
Izik Eidus [Thu, 27 Sep 2007 12:11:22 +0000 (14:11 +0200)]
KVM: Remove the usage of page->private field by rmap

When kvm uses user-allocated pages in the future for the guest, we won't
be able to use page->private for rmap, since page->private is reserved for
the filesystem.  So we move the rmap base pointers to the memory slot.

A side effect of this is that we need to store the gfn of each gpte in
the shadow pages, since the memory slot is addressed by gfn, instead of
hfn like struct page.

Signed-off-by: Izik Eidus <izik@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Simplify vcpu_clear()
Avi Kivity [Sun, 30 Sep 2007 09:02:53 +0000 (11:02 +0200)]
KVM: VMX: Simplify vcpu_clear()

Now that smp_call_function_single() knows how to call a function on the
current cpu, there's no need to check explicitly.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Don't clear the vmcs if the vcpu is not loaded on any processor
Avi Kivity [Sun, 30 Sep 2007 08:50:12 +0000 (10:50 +0200)]
KVM: VMX: Don't clear the vmcs if the vcpu is not loaded on any processor

Noted by Eddie Dong.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Any legacy prefix after a REX prefix nullifies its effect
Laurent Vivier [Tue, 25 Sep 2007 11:36:40 +0000 (13:36 +0200)]
KVM: x86 emulator: Any legacy prefix after a REX prefix nullifies its effect

This patch modifies the management of the REX prefix according to the
behavior I saw in Xen 3.1.  In Xen, this modification was introduced by
Jan Beulich.

http://lists.xensource.com/archives/html/xen-changelog/2007-01/msg00081.html

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Purify x86_decode_insn() error case management
Laurent Vivier [Mon, 24 Sep 2007 15:00:58 +0000 (17:00 +0200)]
KVM: Purify x86_decode_insn() error case management

The only valid case is on protected page access; other cases are errors.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86_emulator: no writeback for bt
Qing He [Mon, 24 Sep 2007 09:22:13 +0000 (17:22 +0800)]
KVM: x86_emulator: no writeback for bt

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Remove no_wb, use dst.type = OP_NONE instead
Laurent Vivier [Mon, 24 Sep 2007 09:10:56 +0000 (11:10 +0200)]
KVM: x86 emulator: Remove no_wb, use dst.type = OP_NONE instead

Remove no_wb, use dst.type = OP_NONE instead, idea stolen from xen-3.1

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: remove _eflags and use directly ctxt->eflags.
Laurent Vivier [Mon, 24 Sep 2007 09:10:55 +0000 (11:10 +0200)]
KVM: x86 emulator: remove _eflags and use directly ctxt->eflags.

Remove _eflags and use ctxt->eflags directly. Caching eflags is not needed, as
it is restored to the vcpu by kvm_main.c:emulate_instruction() from
ctxt->eflags only if emulation doesn't fail.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: split some decoding into functions for readability
Laurent Vivier [Mon, 24 Sep 2007 09:10:54 +0000 (11:10 +0200)]
KVM: x86 emulator: split some decoding into functions for readability

To improve readability, move push, writeback, and grp 1a/2/3/4/5/9 emulation
parts into functions.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Ignore reserved bits in cr3 in non-pae mode
Ryan Harper [Tue, 18 Sep 2007 19:05:16 +0000 (14:05 -0500)]
KVM: MMU: Ignore reserved bits in cr3 in non-pae mode

This patch removes the fault injected when the guest attempts to set reserved
bits in cr3.  X86 hardware doesn't generate a fault when setting reserved bits.
The result of this patch is that vmware-server, running within a kvm guest,
boots and runs memtest from an iso.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Make flooding detection work when guest page faults are bypassed
Avi Kivity [Sun, 23 Sep 2007 12:10:49 +0000 (14:10 +0200)]
KVM: MMU: Make flooding detection work when guest page faults are bypassed

When we allow guest page faults to reach the guests directly, we lose
the fault tracking which allows us to detect demand paging.  So we provide
an alternate mechanism by clearing the accessed bit when we set a pte, and
checking it later to see if the guest actually used it.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Allow not-present guest page faults to bypass kvm
Avi Kivity [Sun, 16 Sep 2007 16:58:32 +0000 (18:58 +0200)]
KVM: Allow not-present guest page faults to bypass kvm

There are two classes of page faults trapped by kvm:
 - host page faults, where the fault is needed to allow kvm to install
   the shadow pte or update the guest accessed and dirty bits
 - guest page faults, where the guest has faulted and kvm simply injects
   the fault back into the guest to handle

The second class, guest page faults, is pure overhead.  We can eliminate
some of it on vmx using the following evil trick:
 - when we set up a shadow page table entry, if the corresponding guest pte
   is not present, set up the shadow pte as not present
 - if the guest pte _is_ present, mark the shadow pte as present but also
   set one of the reserved bits in the shadow pte
 - tell the vmx hardware not to trap faults which have the present bit clear

With this, normal page-not-present faults go directly to the guest,
bypassing kvm entirely.

Unfortunately, this trick only works on Intel hardware, as AMD lacks a
way to discriminate among page faults based on error code.  It is also
a little risky since it uses reserved bits which might become unreserved
in the future, so a module parameter is provided to disable it.
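
The filtering described above can be modelled in a few lines of C. This is only a sketch of the hardware rule, not the code from this patch: the helper name pf_causes_vmexit() and its parameters are hypothetical, and it assumes the VMX behaviour where a page fault exits according to the #PF bit of the exception bitmap when (error_code & mask) == match, and according to the inverse of that bit otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Page-fault error code bits (x86 architectural layout). */
#define PFERR_PRESENT_MASK (1u << 0) /* fault hit a present pte */
#define PFERR_RSVD_MASK    (1u << 3) /* a reserved bit was set in the pte */

/*
 * Hypothetical model of VMX #PF filtering: when the masked error code
 * equals the match value, the exception bitmap bit decides; otherwise
 * the inverse of that bit decides.
 */
static bool pf_causes_vmexit(uint32_t error_code, uint32_t pfec_mask,
                             uint32_t pfec_match, bool bitmap_pf_bit)
{
    bool matches = (error_code & pfec_mask) == pfec_match;

    return matches ? bitmap_pf_bit : !bitmap_pf_bit;
}
```

With mask and match both set to PFERR_PRESENT_MASK and the bitmap bit set, a not-present fault (error code bit 0 clear) stays in the guest, while a fault on a shadow pte marked present with a reserved bit set still exits to the host.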

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Further reduce efer reloads
Avi Kivity [Wed, 29 Aug 2007 00:48:05 +0000 (03:48 +0300)]
KVM: VMX: Further reduce efer reloads

KVM avoids reloading the efer msr when the difference between the guest
and host values consists of the long mode bits (which are switched by
hardware) and the NX bit (which is emulated by the KVM MMU).

This patch also allows KVM to ignore SCE (syscall enable) when the guest
is running in 32-bit mode.  This is because the syscall instruction is
not available in 32-bit mode on Intel processors, so the SCE bit is
effectively meaningless.
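
As a rough illustration of the comparison described above, the sketch below decides whether the efer msr would need reloading. It is not the code from this patch: efer_reload_needed() is a hypothetical name, and "32-bit mode" is approximated here as the guest having EFER.LMA clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* EFER bit positions (architectural). */
#define EFER_SCE (1ull << 0)  /* syscall enable */
#define EFER_LME (1ull << 8)  /* long mode enable */
#define EFER_LMA (1ull << 10) /* long mode active */
#define EFER_NX  (1ull << 11) /* no-execute enable */

/* Hypothetical helper: true if guest and host efer differ in a bit we care about. */
static bool efer_reload_needed(uint64_t guest_efer, uint64_t host_efer)
{
    /* Long mode bits are switched by hardware; NX is emulated by the MMU. */
    uint64_t ignore = EFER_LME | EFER_LMA | EFER_NX;

    /* Assumption: a guest outside long mode cannot use syscall, so SCE is moot. */
    if (!(guest_efer & EFER_LMA))
        ignore |= EFER_SCE;

    return ((guest_efer ^ host_efer) & ~ignore) != 0;
}
```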

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Call x86_decode_insn() only when needed
Laurent Vivier [Tue, 18 Sep 2007 09:27:37 +0000 (11:27 +0200)]
KVM: Call x86_decode_insn() only when needed

Move emulate_ctxt to kvm_vcpu to keep emulate context when we exit from kvm
module. Call x86_decode_insn() only when needed. Modify x86_emulate_insn() to
not modify the context if it must be re-entered.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: emulate_instruction() calls now x86_decode_insn() and x86_emulate_insn()
Laurent Vivier [Tue, 18 Sep 2007 09:27:27 +0000 (11:27 +0200)]
KVM: emulate_instruction() calls now x86_decode_insn() and x86_emulate_insn()

emulate_instruction() now calls x86_decode_insn() and x86_emulate_insn().
x86_emulate_insn() is x86_emulate_memop() without the decoding part.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: move all decoding process to function x86_decode_insn()
Laurent Vivier [Tue, 18 Sep 2007 09:27:19 +0000 (11:27 +0200)]
KVM: x86 emulator: move all decoding process to function x86_decode_insn()

Split the decoding process into a new function x86_decode_insn().

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: move all x86_emulate_memop() to a structure
Laurent Vivier [Tue, 18 Sep 2007 09:52:50 +0000 (11:52 +0200)]
KVM: x86 emulator: move all x86_emulate_memop() to a structure

Move all x86_emulate_memop() common variables between decode and execute to a
structure decode_cache.  This will help in later separating decode and
emulate.

            struct decode_cache {
                u8 twobyte;
                u8 b;
                u8 lock_prefix;
                u8 rep_prefix;
                u8 op_bytes;
                u8 ad_bytes;
                struct operand src;
                struct operand dst;
                unsigned long *override_base;
                unsigned int d;
                unsigned long regs[NR_VCPU_REGS];
                unsigned long eip;
                /* modrm */
                u8 modrm;
                u8 modrm_mod;
                u8 modrm_reg;
                u8 modrm_rm;
                u8 use_modrm_ea;
                unsigned long modrm_ea;
                unsigned long modrm_val;
           };
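
For readers unfamiliar with the modrm fields cached above, the following sketch shows how a ModRM byte splits into its three parts. The struct and function names are hypothetical, and REX prefix extensions of reg and rm are deliberately left out.

```c
#include <stdint.h>

/* Hypothetical container mirroring modrm_mod, modrm_reg and modrm_rm. */
struct modrm_fields {
    uint8_t mod; /* bits 7:6, addressing mode */
    uint8_t reg; /* bits 5:3, register or opcode extension */
    uint8_t rm;  /* bits 2:0, register or memory operand */
};

/* Split a raw ModRM byte into its fields (no REX handling). */
static struct modrm_fields split_modrm(uint8_t modrm)
{
    struct modrm_fields f;

    f.mod = (modrm & 0xc0) >> 6;
    f.reg = (modrm & 0x38) >> 3;
    f.rm  = modrm & 0x07;
    return f;
}
```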

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: remove unused functions
Laurent Vivier [Tue, 18 Sep 2007 09:26:38 +0000 (11:26 +0200)]
KVM: x86 emulator: remove unused functions

Remove #ifdef'd functions that are never used.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Refactor hypercall infrastructure (v3)
Anthony Liguori [Mon, 17 Sep 2007 19:57:50 +0000 (14:57 -0500)]
KVM: Refactor hypercall infrastructure (v3)

This patch refactors the current hypercall infrastructure to better
support live migration and SMP.  It eliminates the hypercall page by
trapping the UD exception that would occur if you used the wrong hypercall
instruction for the underlying architecture and replacing it with the right
one lazily.
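
The lazy replacement can be pictured with the sketch below. It is an illustration only, not the patch's code: patch_hypercall() and enum hv_vendor are hypothetical names, and the buffer stands in for guest memory at the faulting instruction. The opcode bytes are the architectural encodings of vmcall (0f 01 c1) and vmmcall (0f 01 d9).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum hv_vendor { HV_INTEL, HV_AMD }; /* hypothetical host type */

static const uint8_t vmcall_insn[3]  = { 0x0f, 0x01, 0xc1 }; /* Intel */
static const uint8_t vmmcall_insn[3] = { 0x0f, 0x01, 0xd9 }; /* AMD */

/*
 * Called after a #UD trap: if the faulting bytes are the other vendor's
 * hypercall instruction, overwrite them with the host's native one.
 * Returns true when the instruction was patched.
 */
static bool patch_hypercall(uint8_t insn[3], enum hv_vendor host)
{
    const uint8_t *native  = host == HV_INTEL ? vmcall_insn : vmmcall_insn;
    const uint8_t *foreign = host == HV_INTEL ? vmmcall_insn : vmcall_insn;

    if (memcmp(insn, foreign, 3) != 0)
        return false; /* some other #UD, not a misplaced hypercall */

    memcpy(insn, native, 3);
    return true;
}
```

Because the guest image carries only one instruction that is fixed up on first use, nothing host-specific has to be mapped into the guest, which is what makes this friendlier to live migration between vendors.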

A fall-out of this patch is that the unhandled hypercalls no longer trap to
userspace.  There is very little reason though to use a hypercall to
communicate with userspace as PIO or MMIO can be used.  There is no code
in tree that uses userspace hypercalls.

[avi: fix #ud injection on vmx]

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Add vmmcall/vmcall to x86_emulate (v3)
Anthony Liguori [Mon, 17 Sep 2007 19:57:49 +0000 (14:57 -0500)]
KVM: x86 emulator: Add vmmcall/vmcall to x86_emulate (v3)

Add vmmcall/vmcall to x86_emulate.  Future patch will implement functionality
for these instructions.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86
Linus Torvalds [Wed, 30 Jan 2008 13:40:09 +0000 (00:40 +1100)]
Merge git://git./linux/kernel/git/x86/linux-2.6-x86

* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: (890 commits)
  x86: fix nodemap_size according to nodeid bits
  x86: fix overlap between pagetable with bss section
  x86: add PCI IDs to k8topology_64.c
  x86: fix early_ioremap pagetable ops
  x86: use the same pgd_list for PAE and 64-bit
  x86: defer cr3 reload when doing pud_clear()
  x86: early boot debugging via FireWire (ohci1394_dma=early)
  x86: don't special-case pmd allocations as much
  x86: shrink some ifdefs in fault.c
  x86: ignore spurious faults
  x86: remove nx_enabled from fault.c
  x86: unify fault_32|64.c
  x86: unify fault_32|64.c with ifdefs
  x86: unify fault_32|64.c by ifdef'd function bodies
  x86: arch/x86/mm/init_32.c printk fixes
  x86: arch/x86/mm/init_32.c cleanup
  x86: arch/x86/mm/init_64.c printk fixes
  x86: unify ioremap
  x86: fixes some bugs about EFI memory map handling
  x86: use reboot_type on EFI 32
  ...

16 years ago[net] Gracefully handle shared e1000/1000e driver PCI ID's
Linus Torvalds [Wed, 30 Jan 2008 13:30:15 +0000 (00:30 +1100)]
[net] Gracefully handle shared e1000/1000e driver PCI ID's

Both the old e1000 driver and the new e1000e driver can drive some
PCI-Express e1000 cards, and we should avoid ambiguity about which
driver will pick up the support for those cards when both drivers are
enabled.

This solves the problem by having the old driver support those cards if
the new driver isn't configured, but otherwise ceding support for PCI
Express versions of the e1000 chipset to the newer driver.  Thus
allowing both legacy configurations where only the old driver is active
(and handles all chips it knows about) and the new configuration with
the new driver handling the more modern PCIE variants.

Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoMake !NETFILTER_ADVANCED enable IP6_NF_MATCH_IPV6HEADER
Linus Torvalds [Wed, 30 Jan 2008 13:26:10 +0000 (00:26 +1100)]
Make !NETFILTER_ADVANCED enable IP6_NF_MATCH_IPV6HEADER

We want IPV6HEADER matching for the non-advanced default netfilter
configuration, since it's part of the standard netfilter setup of at
least some distributions (eg Fedora).

Otherwise NETFILTER_ADVANCED loses much of its point, since even
non-advanced users would have to enable all the advanced options just to
get a working IPv6 netfilter setup.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agox86: fix nodemap_size according to nodeid bits
Yinghai Lu [Wed, 30 Jan 2008 12:34:12 +0000 (13:34 +0100)]
x86: fix nodemap_size according to nodeid bits

memnode.map is now an s16 array because nodeid is 16 bits wide,

so nodemap_size needs to be increased to match.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: fix overlap between pagetable with bss section
Yinghai Lu [Wed, 30 Jan 2008 12:34:12 +0000 (13:34 +0100)]
x86: fix overlap between pagetable with bss section

one early crash on one 8 node 256g machine:

Command line: console=uart8250,io,0x3f8,115200n8 initrd=kernel.org/mydisk11_x86_64.gz rw root=/dev/ram0 debug initcall_debug apic=debug acpi.debug_level=0x0000000f pci=routeirq ip=dhcp load_ramdisk=1 ramdisk_size=131072 BOOT_IMAGE=kernel.org/bzImage_2.6.25_k8.1
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
 BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000dffe0000 (usable)
 BIOS-e820: 00000000dffe0000 - 00000000dffee000 (ACPI data)
 BIOS-e820: 00000000dffee000 - 00000000dffff050 (ACPI NVS)
 BIOS-e820: 00000000dffff050 - 00000000e0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000004020000000 (usable)
Early serial console at I/O port 0x3f8 (options '115200n8')
console [uart0] enabled
end_pfn_map = 67239936
Kernel panic - not syncing: Duplicated early reservation d40000-e42000

Pid: 0, comm: swapper Not tainted 2.6.24-smp-g5a514e21-dirty #3

Call Trace:
 [<ffffffff80221545>] lapic_get_maxlvt+0x0/0x10
 [<ffffffff80221657>] clear_local_APIC+0x5/0xcf
 [<ffffffff80221726>] disable_local_APIC+0x5/0x17
 [<ffffffff8021fe16>] smp_send_stop+0x46/0x4c
 [<ffffffff80235293>] panic+0x94/0x13e
 [<ffffffff80bc3b03>] sctp_eps_proc_init+0x12/0x34
 [<ffffffff80b9f1c5>] reserve_early+0x30/0x6c
 [<ffffffff80803925>] init_memory_mapping+0x2cd/0x2dc
 [<ffffffff80b9dc01>] setup_arch+0x21f/0x44e
 [<ffffffff80b978be>] start_kernel+0x6f/0x2c7
 [<ffffffff80b971cc>] _sinittext+0x1cc/0x1d3

it turns out there is overlap between pgtable and bss...

in System.map we have
ffffffff80d40420 b rsi_table
ffffffff80d40620 B krb5_seq_lock
ffffffff80d40628 b i.20437
ffffffff80d40630 b xprt_rdma_inline_write_padding
ffffffff80d40638 b sunrpc_table_header
ffffffff80d40640 b zero
ffffffff80d40644 b min_memreg
ffffffff80d40648 b rpcrdma_tk_lock_g
ffffffff80d40650 B sctp_assocs_id_lock
ffffffff80d40658 B proc_net_sctp
ffffffff80d40660 B sctp_assocs_id
ffffffff80d40680 B sysctl_sctp_mem
ffffffff80d40690 B sysctl_sctp_rmem
ffffffff80d406a0 B sysctl_sctp_wmem
ffffffff80d406b0 b sctp_ctl_socket
ffffffff80d406b8 b sctp_pf_inet6_specific
ffffffff80d406c0 b sctp_pf_inet_specific
ffffffff80d406c8 b sctp_af_v4_specific
ffffffff80d406d0 b sctp_af_v6_specific
ffffffff80d406d8 b sctp_rand.33270
ffffffff80d406dc b sctp_memory_pressure
ffffffff80d406e0 b sctp_sockets_allocated
ffffffff80d406e4 b sctp_memory_allocated
ffffffff80d406e8 b sctp_sysctl_header
ffffffff80d406f0 b zero
ffffffff80d406f4 A __bss_stop
ffffffff80d406f4 A _end

need to round up table_start to PAGE_SIZE.

also make the panic more informative.
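
The round-up itself is simple arithmetic; a sketch follows. The helper name round_up_page() is hypothetical, a 4 KiB page size is assumed, and the example address assumes _end from the System.map above sits at physical 0xd406f4.

```c
#include <stdint.h>

#define PAGE_SIZE 4096ull /* assumption: 4 KiB pages */

/* Round an address up to the next page boundary (no-op if already aligned). */
static uint64_t round_up_page(uint64_t addr)
{
    return (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}
```

Under those assumptions round_up_page(0xd406f4) yields 0xd41000, the first page boundary past the end of bss, so a table placed there no longer shares a page with it.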

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: add PCI IDs to k8topology_64.c
Joachim Deguara [Wed, 30 Jan 2008 12:34:12 +0000 (13:34 +0100)]
x86: add PCI IDs to k8topology_64.c

This just adds the PCI IDs of AMD's family 10h and 11h CPU's northbridges to
k8topology discovery.

Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: fix early_ioremap pagetable ops
Jeremy Fitzhardinge [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: fix early_ioremap pagetable ops

Put appropriate pagetable update hooks in so that paravirt knows
what's going on in there.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: use the same pgd_list for PAE and 64-bit
Jeremy Fitzhardinge [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: use the same pgd_list for PAE and 64-bit

Use a standard list threaded through page->lru for maintaining the pgd
list on PAE.  This is the same as 64-bit, and seems saner than using a
non-standard list via page->index.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: defer cr3 reload when doing pud_clear()
Jeremy Fitzhardinge [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: defer cr3 reload when doing pud_clear()

PAE mode requires that we reload cr3 in order to guarantee that
changes to the pgd will be noticed by the processor.  This means that
in principle pud_clear needs to reload cr3 every time.  However,
because reloading cr3 implies a tlb flush, we want to avoid it where
possible.

pud_clear() is only used in a couple of places:
 - in free_pmd_range(), when pulling down a range of process address space, and
 - huge_pmd_unshare()

In both cases, the calling code will do a tlb flush anyway, so
there's no need to do it within pud_clear().

In free_pmd_range(), the pud_clear is immediately followed by
pmd_free_tlb(); we can hook that to make the mmu_gather do an
unconditional full flush to make sure cr3 gets reloaded.

In huge_pmd_unshare, it is followed by flush_tlb_range, which always
results in a full cr3-reload tlb flush.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: early boot debugging via FireWire (ohci1394_dma=early)
Bernhard Kaindl [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: early boot debugging via FireWire (ohci1394_dma=early)

This patch adds a new configuration option, which adds support for a new
early_param which gets checked in arch/x86/kernel/setup_{32,64}.c:setup_arch()
to decide whether OHCI-1394 FireWire controllers should be initialized and
enabled for physical DMA access to allow remote debugging of early problems
like issues in ACPI or other subsystems which are executed very early.

If the config option is not enabled, no code is changed, and if the boot
parameter is not given, no new code is executed, and independent of that,
all new code is freed after boot, so the config option can be even enabled
in standard, non-debug kernels.

With specialized tools, it is then possible to get debugging information
from machines which have no serial ports (notebooks) such as the printk
buffer contents, or any data which can be referenced from global pointers,
if it is stored below the 4GB limit and even memory dumps of the physical
RAM region below the 4GB limit can be taken without any cooperation from the
CPU of the host, so the machine can be crashed early, it does not matter.

In the extreme, even kernel debuggers can be accessed in this way. I wrote
a small kgdb module and an accompanying gdb stub for FireWire which allows
gdb to talk to kgdb using remote memory reads and writes over FireWire.

A version of the gdb stub for FireWire is able to read all global data
from a system which is running a normal kernel without any kernel debugger,
without any interruption or support of the system's CPU. That way, e.g. the
task struct and so on can be read and even manipulated when the physical DMA
access is granted.

A HOWTO is included in this patch, in Documentation/debugging-via-ohci1394.txt
and I've put a copy online at
ftp://ftp.suse.de/private/bk/firewire/docs/debugging-via-ohci1394.txt

It also has links to all the tools which are available to make use of it.
Another copy of it is online at:
ftp://ftp.suse.de/private/bk/firewire/kernel/ohci1394_dma_early-v2.diff

Signed-Off-By: Bernhard Kaindl <bk@suse.de>
Tested-By: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: don't special-case pmd allocations as much
Jeremy Fitzhardinge [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: don't special-case pmd allocations as much

In x86 PAE mode, stop treating pmds as a special case.  Previously
they were always allocated and freed with the pgd.  This modifies the
code to be the same as 64-bit mode, where they are allocated on
demand.

This is a step on the way to unifying 32/64-bit pagetable allocation
as much as possible.

There is a complicating wart, however.  When you install a new
reference to a pmd in the pgd, the processor isn't guaranteed to see
it unless you reload cr3.  Since reloading cr3 also has the
side-effect of flushing the tlb, this is an expense that we want to
avoid wherever possible.

This patch simply avoids reloading cr3 unless the update is to the
current pagetable.  Later patches will optimise this further.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: shrink some ifdefs in fault.c
Harvey Harrison [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: shrink some ifdefs in fault.c

The change from current to tsk in do_page_fault is safe as
this is set at the very beginning of the function.

Removes a likely() annotation from the 64-bit version, this
could have instead been added to 32-bit.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: ignore spurious faults
Jeremy Fitzhardinge [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: ignore spurious faults

When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem.  They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).

This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.
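
The check can be sketched as below. This is a simplified model rather than the patch's implementation: fault_is_spurious() is a hypothetical name, only the final pte is examined (a real page table walk would look at every level), and the pte is passed in directly instead of being looked up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Page-fault error code bits (x86 architectural layout). */
#define PF_PROT  (1u << 0) /* protection fault, pte was present */
#define PF_WRITE (1u << 1) /* write access */
#define PF_USER  (1u << 2) /* fault taken in user mode */
#define PF_RSVD  (1u << 3) /* reserved bit set in a pte */
#define PF_INSTR (1u << 4) /* instruction fetch */

/* Page table entry bits. */
#define PTE_PRESENT (1ull << 0)
#define PTE_RW      (1ull << 1)
#define PTE_NX      (1ull << 63)

/*
 * A kernel fault is spurious when the current pte already permits the
 * access that faulted, i.e. only a stale TLB entry was in the way.
 */
static bool fault_is_spurious(uint32_t error_code, uint64_t pte)
{
    if (error_code & (PF_USER | PF_RSVD))
        return false; /* only kernel permission faults qualify */
    if (!(pte & PTE_PRESENT))
        return false;
    if ((error_code & PF_WRITE) && !(pte & PTE_RW))
        return false; /* write to a page that really is read-only */
    if ((error_code & PF_INSTR) && (pte & PTE_NX))
        return false; /* fetch from a page that really is non-executable */
    return true;
}
```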

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: remove nx_enabled from fault.c
Harvey Harrison [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: remove nx_enabled from fault.c

On !PAE 32-bit, _PAGE_NX will be 0, making is_prefetch always
return early.  The test is sufficient on PAE as __supported_pte_mask
is updated in the same places as nx_enabled in init_32.c which also
takes disable_nx into account.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: unify fault_32|64.c
Harvey Harrison [Wed, 30 Jan 2008 12:34:11 +0000 (13:34 +0100)]
x86: unify fault_32|64.c

Unify includes in moved fault.c.

Modify Makefiles to pick up unified file.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: unify fault_32|64.c with ifdefs
Harvey Harrison [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: unify fault_32|64.c with ifdefs

Elimination of these ifdefs can be done in a unified file.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: unify fault_32|64.c by ifdef'd function bodies
Harvey Harrison [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: unify fault_32|64.c by ifdef'd function bodies

It's about time to get on with unifying these files, elimination
of the ugly ifdefs can occur in the unified file.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: arch/x86/mm/init_32.c printk fixes
Ingo Molnar [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: arch/x86/mm/init_32.c printk fixes

printk fixes. NOP in terms of functionality, but strings got
a bit larger due to the KERN_ markers that were added.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: arch/x86/mm/init_32.c cleanup
Ingo Molnar [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: arch/x86/mm/init_32.c cleanup

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: arch/x86/mm/init_64.c printk fixes
Ingo Molnar [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: arch/x86/mm/init_64.c printk fixes

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: unify ioremap
Thomas Gleixner [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: unify ioremap

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: fixes some bugs about EFI memory map handling
Huang, Ying [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: fixes some bugs about EFI memory map handling

This patch fixes some bugs in the EFI memory map handling code.

- On x86_64, it is possible that EFI memory map can not be mapped via
  identity map, so efi_map_memmap is removed, just use early_ioremap.

- On i386, the EFI memory map mapping takes effect across paging_init,
  so it is not necessary to use efi_map_memmap.

- EFI memory map is unmapped in efi_enter_virtual_mode to avoid
  early_ioremap leak.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: use reboot_type on EFI 32
Huang, Ying [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: use reboot_type on EFI 32

This patch makes the BOOT_EFI reboot_type usable on i386 too, because the
corresponding reboot code of i386 and x86_64 has been merged.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: unify page fault oops printing
Harvey Harrison [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: unify page fault oops printing

This changes the oops dumping format for page faults to
be similar between X86_32 and 64.

This is the first user of printk_address on X86_32.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86: introduce show_fault_oops helper to fault_32|64.c
Harvey Harrison [Wed, 30 Jan 2008 12:34:10 +0000 (13:34 +0100)]
x86: introduce show_fault_oops helper to fault_32|64.c

This will help when unifying the oops dumping code on 32/64
bit.  No functional changes.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>