pandora-kernel.git
16 years agoKVM: VMX: Add ept_sync_context in flush_tlb
Sheng Yang [Sun, 6 Jul 2008 11:16:51 +0000 (19:16 +0800)]
KVM: VMX: Add ept_sync_context in flush_tlb

Fix a potention issue caused by kvm_mmu_slot_remove_write_access(). The
old behavior don't sync EPT TLB with modified EPT entry, which result
in inconsistent content of EPT TLB and EPT table.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: mmu_shrink: kvm_mmu_zap_page requires slots_lock to be held
Marcelo Tosatti [Thu, 3 Jul 2008 21:33:02 +0000 (18:33 -0300)]
KVM: mmu_shrink: kvm_mmu_zap_page requires slots_lock to be held

kvm_mmu_zap_page() needs slots lock held (rmap_remove->gfn_to_memslot,
for example).

Since kvm_lock spinlock is held in mmu_shrink(), do a non-blocking
down_read_trylock().

Untested.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agox86: KVM guest: make kvm_smp_prepare_boot_cpu() static
Adrian Bunk [Mon, 30 Jun 2008 22:19:19 +0000 (01:19 +0300)]
x86: KVM guest: make kvm_smp_prepare_boot_cpu() static

This patch makes the needlessly global kvm_smp_prepare_boot_cpu() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: fix suspend/resume support
Joerg Roedel [Wed, 2 Jul 2008 14:02:11 +0000 (16:02 +0200)]
KVM: SVM: fix suspend/resume support

On suspend the svm_hardware_disable function is called which frees all svm_data
variables. On resume they are not re-allocated. This patch removes the
deallocation of svm_data from the hardware_disable function to the
hardware_unsetup function which is not called on suspend.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: rename private structures
Christian Borntraeger [Fri, 27 Jun 2008 13:05:40 +0000 (15:05 +0200)]
KVM: s390: rename private structures

While doing some tests with our lcrash implementation I have seen a
naming conflict with prefix_info in kvm_host.h vs. addrconf.h

To avoid future conflicts lets rename private definitions in
asm/kvm_host.h by adding the kvm_s390 prefix.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: Set guest storage limit and offset to sane values
Christian Borntraeger [Fri, 27 Jun 2008 13:05:38 +0000 (15:05 +0200)]
KVM: s390: Set guest storage limit and offset to sane values

Some machines do not accept 16EB as guest storage limit. Lets change the
default for the guest storage limit to a sane value. We also should set
the guest_origin to what userspace thinks it is. This allows guests
starting at an address != 0.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Fix memory leak on guest exit
Carsten Otte [Fri, 27 Jun 2008 13:05:34 +0000 (15:05 +0200)]
KVM: Fix memory leak on guest exit

This patch fixes a memory leak, we want to free the physmem when destroying
the vm.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: dont allocate dirty bitmap
Carsten Otte [Fri, 27 Jun 2008 13:05:31 +0000 (15:05 +0200)]
KVM: s390: dont allocate dirty bitmap

This patch #ifdefs the bitmap array for dirty tracking. We don't have dirty
tracking on s390 today, and we'd love to use our storage keys to store the
dirty information for migration. Therefore, we won't need this array at all,
and due to our limited amount of vmalloc space this limits the amount of guests
we can run.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: move slots_lock acquision down to vapic_exit
Marcelo Tosatti [Mon, 23 Jun 2008 15:04:25 +0000 (12:04 -0300)]
KVM: move slots_lock acquision down to vapic_exit

There is no need to grab slots_lock if the vapic_page will not
be touched.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Fake emulate Intel perfctr MSRs
Chris Lalancette [Fri, 20 Jun 2008 07:51:30 +0000 (09:51 +0200)]
KVM: VMX: Fake emulate Intel perfctr MSRs

Older linux guests (in this case, 2.6.9) can attempt to
access the performance counter MSRs without a fixup section, and injecting
a GPF kills the guest.  Work around by allowing the guest to write those MSRs.

Tested by me on RHEL-4 i386 and x86_64 guests, as well as F-9 guests.

Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Fix a wrong usage of vmcs_config
Sheng Yang [Wed, 18 Jun 2008 06:43:38 +0000 (14:43 +0800)]
KVM: VMX: Fix a wrong usage of vmcs_config

The function ept_update_paging_mode_cr0() write to
CPU_BASED_VM_EXEC_CONTROL based on vmcs_config.cpu_based_exec_ctrl. That's
wrong because the variable may not consistent with the content in the
CPU_BASE_VM_EXEC_CONTROL MSR.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Fix printk format
Avi Kivity [Sun, 22 Jun 2008 13:46:22 +0000 (16:46 +0300)]
KVM: MMU: Fix printk format

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: When debug is enabled, make it a run-time parameter
Avi Kivity [Sun, 22 Jun 2008 13:45:24 +0000 (16:45 +0300)]
KVM: MMU: When debug is enabled, make it a run-time parameter

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: lazily evaluate segment registers
Avi Kivity [Sun, 22 Jun 2008 13:22:51 +0000 (16:22 +0300)]
KVM: x86 emulator: lazily evaluate segment registers

Instead of prefetching all segment bases before emulation, read them at the
last moment.  Since most of them are unneeded, we save some cycles on
Intel machines where this is a bit expensive.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: avoid segment base adjust for lea
Avi Kivity [Mon, 16 Jun 2008 05:45:54 +0000 (22:45 -0700)]
KVM: x86 emulator: avoid segment base adjust for lea

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: simplify rip relative decoding
Avi Kivity [Mon, 16 Jun 2008 05:09:11 +0000 (22:09 -0700)]
KVM: x86 emulator: simplify rip relative decoding

rip relative decoding is relative to the instruction pointer of the next
instruction; by moving address adjustment until after decoding is complete,
we remove the need to determine the instruction size.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: simplify r/m decoding
Avi Kivity [Mon, 16 Jun 2008 04:53:26 +0000 (21:53 -0700)]
KVM: x86 emulator: simplify r/m decoding

Consolidate the duplicated code when not in any special case.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: simplify sib decoding
Avi Kivity [Mon, 16 Jun 2008 04:23:17 +0000 (21:23 -0700)]
KVM: x86 emulator: simplify sib decoding

Instead of using sparse switches, use simpler if/else sequences.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: handle undecoded rex.b with r/m = 5 in certain cases
Avi Kivity [Mon, 16 Jun 2008 04:13:41 +0000 (21:13 -0700)]
KVM: x86 emulator: handle undecoded rex.b with r/m = 5 in certain cases

x86_64 does not decode rex.b in certain cases, where the r/m field = 5.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: emulate nop and xchg reg, acc (opcodes 0x90 - 0x97)
Mohammed Gamal [Sun, 15 Jun 2008 16:37:38 +0000 (19:37 +0300)]
KVM: x86 emulator: emulate nop and xchg reg, acc (opcodes 0x90 - 0x97)

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Use printk_rlimit() instead of reporting emulation failures just once
Avi Kivity [Fri, 13 Jun 2008 19:45:42 +0000 (22:45 +0300)]
KVM: Use printk_rlimit() instead of reporting emulation failures just once

Emulation failure reports are useful, so allow more than one per the lifetime
of the module.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Support mixed endian machines
Tan, Li [Fri, 23 May 2008 06:54:09 +0000 (14:54 +0800)]
KVM: Support mixed endian machines

Currently kvmtrace is not portable. This will prevent from copying a
trace file from big-endian target to little-endian workstation for analysis.
In the patch, kernel outputs metadata containing a magic number to trace
log, and changes 64-bit words to be u64 instead of a pair of u32s.

Signed-off-by: Tan Li <li.tan@intel.com>
Acked-by: Jerone Young <jyoung5@us.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Do not calculate linear rip in emulation failure report
Glauber Costa [Tue, 10 Jun 2008 13:46:53 +0000 (10:46 -0300)]
KVM: Do not calculate linear rip in emulation failure report

If we're not gonna do anything (case in which failure is already
reported), we do not need to even bother with calculating the linear rip.

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: only abort guest entry if timer count goes from 0->1
Marcelo Tosatti [Wed, 11 Jun 2008 22:52:53 +0000 (19:52 -0300)]
KVM: only abort guest entry if timer count goes from 0->1

Only abort guest entry if the timer count went from 0->1, since for 1->2
or larger the bit will either be set already or a timer irq will have
been injected.

Using atomic_inc_and_test() for it also introduces an SMP barrier
to the LAPIC version (thought it was unecessary because of timer
migration, but guest can be scheduled to a different pCPU between exit
and kvm_vcpu_block(), so there is the possibility for a race).

Noticed by Avi.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add coalesced MMIO support (ia64 part)
Laurent Vivier [Fri, 30 May 2008 14:05:57 +0000 (16:05 +0200)]
KVM: Add coalesced MMIO support (ia64 part)

This patch enables coalesced MMIO for ia64 architecture.
It defines KVM_MMIO_PAGE_OFFSET and KVM_CAP_COALESCED_MMIO.
It enables the compilation of coalesced_mmio.c.

[akpm: fix compile error on ia64]

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add coalesced MMIO support (powerpc part)
Laurent Vivier [Fri, 30 May 2008 14:05:56 +0000 (16:05 +0200)]
KVM: Add coalesced MMIO support (powerpc part)

This patch enables coalesced MMIO for powerpc architecture.
It defines KVM_MMIO_PAGE_OFFSET and KVM_CAP_COALESCED_MMIO.
It enables the compilation of coalesced_mmio.c.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add coalesced MMIO support (x86 part)
Laurent Vivier [Fri, 30 May 2008 14:05:55 +0000 (16:05 +0200)]
KVM: Add coalesced MMIO support (x86 part)

This patch enables coalesced MMIO for x86 architecture.
It defines KVM_MMIO_PAGE_OFFSET and KVM_CAP_COALESCED_MMIO.
It enables the compilation of coalesced_mmio.c.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add coalesced MMIO support (common part)
Laurent Vivier [Fri, 30 May 2008 14:05:54 +0000 (16:05 +0200)]
KVM: Add coalesced MMIO support (common part)

This patch adds all needed structures to coalesce MMIOs.
Until an architecture uses it, it is not compiled.

Coalesced MMIO introduces two ioctl() to define where are the MMIO zones that
can be coalesced:

- KVM_REGISTER_COALESCED_MMIO registers a coalesced MMIO zone.
  It requests one parameter (struct kvm_coalesced_mmio_zone) which defines
  a memory area where MMIOs can be coalesced until the next switch to
  user space. The maximum number of MMIO zones is KVM_COALESCED_MMIO_ZONE_MAX.

- KVM_UNREGISTER_COALESCED_MMIO cancels all registered zones inside
  the given bounds (bounds are also given by struct kvm_coalesced_mmio_zone).

The userspace client can check kernel coalesced MMIO availability by asking
ioctl(KVM_CHECK_EXTENSION) for the KVM_CAP_COALESCED_MMIO capability.
The ioctl() call to KVM_CAP_COALESCED_MMIO will return 0 if not supported,
or the page offset where will be stored the ring buffer.
The page offset depends on the architecture.

After an ioctl(KVM_RUN), the first page of the KVM memory mapped points to
a kvm_run structure. The offset given by KVM_CAP_COALESCED_MMIO is
an offset to the coalesced MMIO ring expressed in PAGE_SIZE relatively
to the address of the start of th kvm_run structure. The MMIO ring buffer
is defined by the structure kvm_coalesced_mmio_ring.

[akio: fix oops during guest shutdown]

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Akio Takebe <takebe_akio@jp.fujitsu.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: kvm_io_device: extend in_range() to manage len and write attribute
Laurent Vivier [Fri, 30 May 2008 14:05:53 +0000 (16:05 +0200)]
KVM: kvm_io_device: extend in_range() to manage len and write attribute

Modify member in_range() of structure kvm_io_device to pass length and the type
of the I/O (write or read).

This modification allows to use kvm_io_device with coalesced MMIO.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Avoid page prefetch on SVM
Avi Kivity [Thu, 29 May 2008 11:56:28 +0000 (14:56 +0300)]
KVM: MMU: Avoid page prefetch on SVM

SVM cannot benefit from page prefetching since guest page fault bypass
cannot by made to work there.  Avoid accessing the guest page table in
this case.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Move nonpaging_prefetch_page()
Avi Kivity [Thu, 29 May 2008 11:55:03 +0000 (14:55 +0300)]
KVM: MMU: Move nonpaging_prefetch_page()

In preparation for next patch. No code change.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: implement 'push imm' (opcode 0x68)
Avi Kivity [Thu, 29 May 2008 11:38:38 +0000 (14:38 +0300)]
KVM: x86 emulator: implement 'push imm' (opcode 0x68)

Encountered in FC6 boot sequence, now that we don't force ss.rpl = 0 during
the protected mode transition.  Not really necessary, but nice to have.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: simplify push imm8 emulation
Avi Kivity [Thu, 29 May 2008 11:26:29 +0000 (14:26 +0300)]
KVM: x86 emulator: simplify push imm8 emulation

Instead of fetching the data explicitly, use SrcImmByte.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Optimize prefetch_page()
Avi Kivity [Thu, 29 May 2008 11:20:16 +0000 (14:20 +0300)]
KVM: MMU: Optimize prefetch_page()

Instead of reading each pte individually, read 256 bytes worth of ptes and
batch process them.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Add support for mov r, sreg (0x8c) instruction
Guillaume Thouvenin [Tue, 27 May 2008 13:13:28 +0000 (15:13 +0200)]
KVM: x86 emulator: Add support for mov r, sreg (0x8c) instruction

Add support for mov r, sreg (0x8c) instruction

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Add support for mov seg, r (0x8e) instruction
Guillaume Thouvenin [Tue, 27 May 2008 12:49:15 +0000 (14:49 +0200)]
KVM: x86 emulator: Add support for mov seg, r (0x8e) instruction

Add support for mov r, sreg (0x8c) instruction.

[avi: drop the sreg decoding table in favor of 1:1 encoding]

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: adds support to mov r,imm (opcode 0xb8) instruction
Guillaume Thouvenin [Tue, 27 May 2008 08:19:16 +0000 (10:19 +0200)]
KVM: x86 emulator: adds support to mov r,imm (opcode 0xb8) instruction

Add support to mov r, imm (0xb8) instruction.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: add support for jmp far 0xea
Guillaume Thouvenin [Tue, 27 May 2008 08:19:08 +0000 (10:19 +0200)]
KVM: x86 emulator: add support for jmp far 0xea

Add support for jmp far (opcode 0xea) instruction.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Update c->dst.bytes in decode instruction
Guillaume Thouvenin [Tue, 27 May 2008 08:22:20 +0000 (10:22 +0200)]
KVM: x86 emulator: Update c->dst.bytes in decode instruction

Update c->dst.bytes in decode instruction instead of instruction
itself.  It's needed because if c->dst.bytes is equal to 0, the
instruction is not emulated.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Prefixes segment functions that will be exported with "kvm_"
Guillaume Thouvenin [Tue, 27 May 2008 08:18:46 +0000 (10:18 +0200)]
KVM: Prefixes segment functions that will be exported with "kvm_"

Prefixes functions that will be exported with kvm_.
We also prefixed set_segment() even if it still static
to be coherent.

signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MTRR support
Avi Kivity [Mon, 26 May 2008 17:06:35 +0000 (20:06 +0300)]
KVM: MTRR support

Add emulation for the memory type range registers, needed by VMware esx 3.5,
and by pci device assignment.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Order segment register constants in the same way as cpu operand encoding
Avi Kivity [Tue, 27 May 2008 13:26:01 +0000 (16:26 +0300)]
KVM: Order segment register constants in the same way as cpu operand encoding

This can be used to simplify the x86 instruction decoder.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Enable NMI with in-kernel irqchip
Sheng Yang [Thu, 15 May 2008 10:23:25 +0000 (18:23 +0800)]
KVM: VMX: Enable NMI with in-kernel irqchip

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: IOAPIC/LAPIC: Enable NMI support
Sheng Yang [Thu, 15 May 2008 01:52:48 +0000 (09:52 +0800)]
KVM: IOAPIC/LAPIC: Enable NMI support

[avi: fix ia64 build breakage]

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Remove unnecessary ->decache_regs() call
Avi Kivity [Sun, 25 May 2008 11:38:15 +0000 (14:38 +0300)]
KVM: Remove unnecessary ->decache_regs() call

Since we aren't modifying any register, there's no need to decache
the register state.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Remove decache_vcpus_on_cpu() and related callbacks
Avi Kivity [Tue, 13 May 2008 13:29:20 +0000 (16:29 +0300)]
KVM: Remove decache_vcpus_on_cpu() and related callbacks

Obsoleted by the vmx-specific per-cpu list.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Add list of potentially locally cached vcpus
Avi Kivity [Tue, 13 May 2008 13:22:47 +0000 (16:22 +0300)]
KVM: VMX: Add list of potentially locally cached vcpus

VMX hardware can cache the contents of a vcpu's vmcs.  This cache needs
to be flushed when migrating a vcpu to another cpu, or (which is the case
that interests us here) when disabling hardware virtualization on a cpu.

The current implementation of decaching iterates over the list of all vcpus,
picks the ones that are potentially cached on the cpu that is being offlined,
and flushes the cache.  The problem is that it uses mutex_trylock() to gain
exclusive access to the vcpu, which fires off a (benign) warning about using
the mutex in an interrupt context.

To avoid this, and to make things generally nicer, add a new per-cpu list
of potentially cached vcus.  This makes the decaching code much simpler.  The
list is vmx-specific since other hardware doesn't have this issue.

[andrea: fix crash on suspend/resume]

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Handle virtualization instruction #UD faults during reboot
Avi Kivity [Tue, 13 May 2008 10:23:38 +0000 (13:23 +0300)]
KVM: Handle virtualization instruction #UD faults during reboot

KVM turns off hardware virtualization extensions during reboot, in order
to disassociate the memory used by the virtualization extensions from the
processor, and in order to have the system in a consistent state.
Unfortunately virtual machines may still be running while this goes on,
and once virtualization extensions are turned off, any virtulization
instruction will #UD on execution.

Fix by adding an exception handler to virtualization instructions; if we get
an exception during reboot, we simply spin waiting for the reset to complete.
If it's a true exception, BUG() so we can have our stack trace.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Fix false flooding when a pte points to page table
Avi Kivity [Thu, 15 May 2008 10:51:35 +0000 (13:51 +0300)]
KVM: MMU: Fix false flooding when a pte points to page table

The KVM MMU tries to detect when a speculative pte update is not actually
used by demand fault, by checking the accessed bit of the shadow pte.  If
the shadow pte has not been accessed, we deem that page table flooded and
remove the shadow page table, allowing further pte updates to proceed
without emulation.

However, if the pte itself points at a page table and only used for write
operations, the accessed bit will never be set since all access will happen
through the emulator.

This is exactly what happens with kscand on old (2.4.x) HIGHMEM kernels.
The kernel points a kmap_atomic() pte at a page table, and then
proceeds with read-modify-write operations to look at the dirty and accessed
bits.  We get a false flood trigger on the kmap ptes, which results in the
mmu spending all its time setting up and tearing down shadows.

Fix by setting the shadow accessed bit on emulated accesses.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Trivial vmcs_write64() code simplification
Avi Kivity [Mon, 12 May 2008 16:25:43 +0000 (19:25 +0300)]
KVM: VMX: Trivial vmcs_write64() code simplification

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: Fake MSR_K7 performance counters
Chris Lalancette [Mon, 5 May 2008 17:05:16 +0000 (13:05 -0400)]
KVM: SVM: Fake MSR_K7 performance counters

Attached is a patch that fixes a guest crash when booting older Linux kernels.
The problem stems from the fact that we are currently emulating
MSR_K7_EVNTSEL[0-3], but not emulating MSR_K7_PERFCTR[0-3].  Because of this,
setup_k7_watchdog() in the Linux kernel receives a GPF when it attempts to
write into MSR_K7_PERFCTR, which causes an OOPs.

The patch fixes it by just "fake" emulating the appropriate MSRs, throwing
away the data in the process.  This causes the NMI watchdog to not actually
work, but it's not such a big deal in a virtualized environment.

When we get a write to one of these counters, we printk_ratelimit() a warning.
I decided to print it out for all writes, even if the data is 0; it doesn't
seem to make sense to me to special case when data == 0.

Tested by myself on a RHEL-4 guest, and Joerg Roedel on a Windows XP 64-bit
guest.

Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: PIT: support mode 3
Aurelien Jarno [Fri, 2 May 2008 15:02:23 +0000 (17:02 +0200)]
KVM: PIT: support mode 3

The in-kernel PIT emulation ignores pending timers if operating
under mode 3, which for example Hurd uses.

This mode should output a square wave, high for (N+1)/2 counts and low
for (N-1)/2 counts. As we only care about the resulting interrupts, the
period is N, and mode 3 is the same as mode 2 with regard to
interrupts.

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Handle vma regions with no backing page
Anthony Liguori [Wed, 30 Apr 2008 20:37:07 +0000 (15:37 -0500)]
KVM: Handle vma regions with no backing page

This patch allows VMAs that contain no backing page to be used for guest
memory.  This is useful for assigning mmio regions to a guest.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: add tracing support for TDP page faults
Joerg Roedel [Wed, 30 Apr 2008 15:56:04 +0000 (17:56 +0200)]
KVM: SVM: add tracing support for TDP page faults

To distinguish between real page faults and nested page faults they should be
traced as different events. This is implemented by this patch.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: add missing kvmtrace markers
Joerg Roedel [Wed, 30 Apr 2008 15:56:03 +0000 (17:56 +0200)]
KVM: SVM: add missing kvmtrace markers

This patch adds the missing kvmtrace markers to the svm
module of kvm.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: add missing kvmtrace bits
Joerg Roedel [Wed, 30 Apr 2008 15:56:02 +0000 (17:56 +0200)]
KVM: add missing kvmtrace bits

This patch adds some kvmtrace bits to the generic x86 code
where it is instrumented from SVM.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: implement dedicated INTR exit handler
Joerg Roedel [Wed, 30 Apr 2008 15:56:01 +0000 (17:56 +0200)]
KVM: SVM: implement dedicated INTR exit handler

With an exit handler for INTR intercepts its possible to account them using
kvmtrace.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: implement dedicated NMI exit handler
Joerg Roedel [Wed, 30 Apr 2008 15:56:00 +0000 (17:56 +0200)]
KVM: SVM: implement dedicated NMI exit handler

With an exit handler for NMI intercepts its possible to account them using
kvmtrace.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: move APIC_ACCESS trace entry to generic code
Joerg Roedel [Wed, 30 Apr 2008 15:55:59 +0000 (17:55 +0200)]
KVM: VMX: move APIC_ACCESS trace entry to generic code

This patch moves the trace entry for APIC accesses from the VMX code to the
generic lapic code. This way APIC accesses from SVM will also be traced.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: add statics were possible, function definition in lapic.h
Harvey Harrison [Sun, 27 Apr 2008 19:14:13 +0000 (12:14 -0700)]
KVM: add statics were possible, function definition in lapic.h

Noticed by sparse:
arch/x86/kvm/vmx.c:1583:6: warning: symbol 'vmx_disable_intercept_for_msr' was not declared. Should it be static?
arch/x86/kvm/x86.c:3406:5: warning: symbol 'kvm_task_switch_16' was not declared. Should it be static?
arch/x86/kvm/x86.c:3429:5: warning: symbol 'kvm_task_switch_32' was not declared. Should it be static?
arch/x86/kvm/mmu.c:1968:6: warning: symbol 'kvm_mmu_remove_one_alloc_mmu_page' was not declared. Should it be static?
arch/x86/kvm/mmu.c:2014:6: warning: symbol 'mmu_destroy_caches' was not declared. Should it be static?
arch/x86/kvm/lapic.c:862:5: warning: symbol 'kvm_lapic_get_base' was not declared. Should it be static?
arch/x86/kvm/i8254.c:94:5: warning: symbol 'pit_get_gate' was not declared. Should it be static?
arch/x86/kvm/i8254.c:196:5: warning: symbol '__pit_timer_fn' was not declared. Should it be static?
arch/x86/kvm/i8254.c:561:6: warning: symbol '__inject_pit_timer_intr' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: remove long -> void *user -> long cast
Christian Borntraeger [Mon, 21 Apr 2008 11:48:24 +0000 (13:48 +0200)]
KVM: remove long -> void *user -> long cast

kvm_dev_ioctl casts the arg value to void __user *, just to recast it
again to long. This seems unnecessary.

According to objdump the binary code on x86 is unchanged by this patch.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agonet_sched: Add size table for qdiscs
Jussi Kivilinna [Sun, 20 Jul 2008 07:08:47 +0000 (00:08 -0700)]
net_sched: Add size table for qdiscs

Add size table functions for qdiscs and calculate packet size in
qdisc_enqueue().

Based on patch by Patrick McHardy
 http://marc.info/?l=linux-netdev&m=115201979221729&w=2

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agonet_sched: Add accessor function for packet length for qdiscs
Jussi Kivilinna [Sun, 20 Jul 2008 07:08:27 +0000 (00:08 -0700)]
net_sched: Add accessor function for packet length for qdiscs

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agonet_sched: Add qdisc_enqueue wrapper
Jussi Kivilinna [Sun, 20 Jul 2008 07:08:04 +0000 (00:08 -0700)]
net_sched: Add qdisc_enqueue wrapper

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agohighmem: Export totalhigh_pages.
David S. Miller [Sun, 20 Jul 2008 05:39:46 +0000 (22:39 -0700)]
highmem: Export totalhigh_pages.

Hash et al. sizing code in SCTP wants to make the
calculation totalram_pages - totalhigh_pages, just
like TCP.  But this requires an export for the
CONFIG_HIGHMEM case to work.

Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoipv6 mcast: Omit redundant address family checks in ip6_mc_source().
YOSHIFUJI Hideaki [Sun, 20 Jul 2008 05:36:07 +0000 (22:36 -0700)]
ipv6 mcast: Omit redundant address family checks in ip6_mc_source().

The caller has alredy checked for them.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agonet: Use standard structures for generic socket address structures.
YOSHIFUJI Hideaki [Sun, 20 Jul 2008 05:35:47 +0000 (22:35 -0700)]
net: Use standard structures for generic socket address structures.

Use sockaddr_storage{} for generic socket address storage
and ensures proper alignment.
Use sockaddr{} for pointers to omit several casts.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoipv6 netns: Make several "global" sysctl variables namespace aware.
YOSHIFUJI Hideaki [Sun, 20 Jul 2008 05:35:03 +0000 (22:35 -0700)]
ipv6 netns: Make several "global" sysctl variables namespace aware.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agonetns: Use net_eq() to compare net-namespaces for optimization.
YOSHIFUJI Hideaki [Sun, 20 Jul 2008 05:34:43 +0000 (22:34 -0700)]
netns: Use net_eq() to compare net-namespaces for optimization.

Without CONFIG_NET_NS, namespace is always &init_net.
Compiler will be able to omit namespace comparisons with this patch.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluet...
David S. Miller [Sat, 19 Jul 2008 07:30:39 +0000 (00:30 -0700)]
Merge branch 'master' of git://git./linux/kernel/git/holtmann/bluetooth-2.6

16 years agoipv6: remove unused macros from net/ipv6.h
Denis V. Lunev [Sat, 19 Jul 2008 07:29:42 +0000 (00:29 -0700)]
ipv6: remove unused macros from net/ipv6.h

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoipv6: remove unused parameter from ip6_ra_control
Denis V. Lunev [Sat, 19 Jul 2008 07:28:58 +0000 (00:28 -0700)]
ipv6: remove unused parameter from ip6_ra_control

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agotcp: fix kernel panic with listening_get_next
Daniel Lezcano [Sat, 19 Jul 2008 07:15:13 +0000 (00:15 -0700)]
tcp: fix kernel panic with listening_get_next

# BUG: unable to handle kernel NULL pointer dereference at
0000000000000038
IP: [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3
PGD 11e4b9067 PUD 11d16c067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 3
Modules linked in: bridge ipv6 button battery ac loop dm_mod tg3 ext3
jbd edd fan thermal processor thermal_sys hwmon sg sata_svw libata dock
serverworks sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
Pid: 3368, comm: slpd Not tainted 2.6.26-rc2-mm1-lxc4 #1
RIP: 0010:[<ffffffff821ed01e>] [<ffffffff821ed01e>]
listening_get_next+0x50/0x1b3
RSP: 0018:ffff81011e1fbe18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8100be0ad3c0 RCX: ffff8100619f50c0
RDX: ffffffff82475be0 RSI: ffff81011d9ae6c0 RDI: ffff8100be0ad508
RBP: ffff81011f4f1240 R08: 00000000ffffffff R09: ffff8101185b6780
R10: 000000000000002d R11: ffffffff820fdbfa R12: ffff8100be0ad3c8
R13: ffff8100be0ad6a0 R14: ffff8100be0ad3c0 R15: ffffffff825b8ce0
FS: 00007f6a0ebd16d0(0000) GS:ffff81011f424540(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000038 CR3: 000000011dc20000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process slpd (pid: 3368, threadinfo ffff81011e1fa000, task
ffff81011f4b8660)
Stack: 00000000000002ee ffff81011f5a57c0 ffff81011f4f1240
ffff81011e1fbe90
0000000000001000 0000000000000000 00007fff16bf2590 ffffffff821ed9c8
ffff81011f5a57c0 ffff81011d9ae6c0 000000000000041a ffffffff820b0abd
Call Trace:
[<ffffffff821ed9c8>] ? tcp_seq_next+0x34/0x7e
[<ffffffff820b0abd>] ? seq_read+0x1aa/0x29d
[<ffffffff820d21b4>] ? proc_reg_read+0x73/0x8e
[<ffffffff8209769c>] ? vfs_read+0xaa/0x152
[<ffffffff82097a7d>] ? sys_read+0x45/0x6e
[<ffffffff8200bd2b>] ? system_call_after_swapgs+0x7b/0x80

Code: 31 a9 25 00 e9 b5 00 00 00 ff 45 20 83 7d 0c 01 75 79 4c 8b 75 10
48 8b 0e eb 1d 48 8b 51 20 0f b7 45 08 39 02 75 0e 48 8b 41 28 <4c> 39
78 38 0f 84 93 00 00 00 48 8b 09 48 85 c9 75 de 8b 55 1c
RIP [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3
RSP <ffff81011e1fbe18>
CR2: 0000000000000038

This kernel panic appears with CONFIG_NET_NS=y.

How to reproduce ?

    On the buggy host (host A)
       * ip addr add 1.2.3.4/24 dev eth0

    On a remote host (host B)
       * ip addr add 1.2.3.5/24 dev eth0
       * iptables -A INPUT -p tcp -s 1.2.3.4 -j DROP
       * ssh 1.2.3.4

    On host A:
       * netstat -ta or cat /proc/net/tcp

This bug happens when reading /proc/net/tcp[6] when there is a req_sock
at the SYN_RECV state.

When a SYN is received the minisock is created and the sk field is set to
NULL. In the listening_get_next function, we try to look at the field
req->sk->sk_net.

When looking at how to fix this bug, I noticed that is useless to do
the check for the minisock belonging to the namespace. A minisock belongs
to a listen point and this one is per namespace, so when browsing the
minisock they are always per namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agotcp: Remove redundant checks when setting eff_sacks
Adam Langley [Sat, 19 Jul 2008 07:07:02 +0000 (00:07 -0700)]
tcp: Remove redundant checks when setting eff_sacks

Remove redundant checks when setting eff_sacks and make the number of SACKs a
compile time constant. Now that the options code knows how many SACK blocks can
fit in the header, we don't need to have the SACK code guessing at it.

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agotcp: options clean up
Adam Langley [Sat, 19 Jul 2008 07:04:31 +0000 (00:04 -0700)]
tcp: options clean up

This should fix the following bugs:
  * Connections with MD5 signatures produce invalid packets whenever SACK
    options are included
  * MD5 signatures are counted twice in the MSS calculations

Behaviour changes:
  * A SYN with MD5 + SACK + TS elicits a SYNACK with MD5 + SACK

    This is because we can't fit any SACK blocks in a packet with MD5 + TS
    options. There was discussion about disabling SACK rather than TS in
    order to fit in better with old, buggy kernels, but that was deemed to
    be unnecessary.

  * SYNs with MD5 don't include a TS option

    See above.

Additionally, it removes a bunch of duplicated logic for calculating options,
which should help avoid these sort of issues in the future.

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agotcp: Fix MD5 signatures for non-linear skbs
Adam Langley [Sat, 19 Jul 2008 07:01:42 +0000 (00:01 -0700)]
tcp: Fix MD5 signatures for non-linear skbs

Currently, the MD5 code assumes that the SKBs are linear and, in the case
that they aren't, happily goes off and hashes off the end of the SKB and
into random memory.

Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy
Polyakov. Also includes a couple of missed route_caps from Stephen's patch
in [2].

[1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2
[2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agosctp: Update sctp global memory limit allocations.
Vlad Yasevich [Sat, 19 Jul 2008 06:08:21 +0000 (23:08 -0700)]
sctp: Update sctp global memory limit allocations.

Update sctp global memory limit allocations to be the same as TCP.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agosctp: remove unnecessary byteshifting, calculate directly in big-endian
Harvey Harrison [Sat, 19 Jul 2008 06:07:09 +0000 (23:07 -0700)]
sctp: remove unnecessary byteshifting, calculate directly in big-endian

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agosctp: Allow only 1 listening socket with SO_REUSEADDR
Vlad Yasevich [Sat, 19 Jul 2008 06:06:32 +0000 (23:06 -0700)]
sctp: Allow only 1 listening socket with SO_REUSEADDR

When multiple socket bind to the same port with SO_REUSEADDR,
only 1 can be listining.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agosctp: Do not leak memory on multiple listen() calls
Vlad Yasevich [Sat, 19 Jul 2008 06:06:07 +0000 (23:06 -0700)]
sctp: Do not leak memory on multiple listen() calls

SCTP permits multiple listen call and on subsequent calls
we leak he memory allocated for the crypto transforms.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agosctp: Support ipv6only AF_INET6 sockets.
Vlad Yasevich [Sat, 19 Jul 2008 06:05:40 +0000 (23:05 -0700)]
sctp: Support ipv6only AF_INET6 sockets.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agosctp: Prevent uninitialized memory access
Florian Westphal [Sat, 19 Jul 2008 06:04:39 +0000 (23:04 -0700)]
sctp: Prevent uninitialized memory access

valgrind reports uninizialized memory accesses when running
sctp inside the network simulation cradle simulator:

 Conditional jump or move depends on uninitialised value(s)
    at 0x570E34A: sctp_assoc_sync_pmtu (associola.c:1324)
    by 0x57427DA: sctp_packet_transmit (output.c:403)
    by 0x5710EFF: sctp_outq_flush (outqueue.c:824)
    by 0x5710B88: sctp_outq_uncork (outqueue.c:701)
    by 0x5745262: sctp_cmd_interpreter (sm_sideeffect.c:1548)
    by 0x57444B7: sctp_side_effects (sm_sideeffect.c:976)
    by 0x5744460: sctp_do_sm (sm_sideeffect.c:945)
    by 0x572157D: sctp_primitive_ASSOCIATE (primitive.c:94)
    by 0x5725C04: __sctp_connect (socket.c:1094)
    by 0x57297DC: sctp_connect (socket.c:3297)

 Conditional jump or move depends on uninitialised value(s)
    at 0x575D3A5: mod_timer (timer.c:630)
    by 0x5752B78: sctp_cmd_hb_timers_start (sm_sideeffect.c:555)
    by 0x5754133: sctp_cmd_interpreter (sm_sideeffect.c:1448)
    by 0x5753607: sctp_side_effects (sm_sideeffect.c:976)
    by 0x57535B0: sctp_do_sm (sm_sideeffect.c:945)
    by 0x571E9AE: sctp_endpoint_bh_rcv (endpointola.c:474)
    by 0x573347F: sctp_inq_push (inqueue.c:104)
    by 0x572EF93: sctp_rcv (input.c:256)
    by 0x5689623: ip_local_deliver_finish (ip_input.c:230)
    by 0x5689759: ip_local_deliver (ip_input.c:268)
    by 0x5689CAC: ip_rcv_finish (dst.h:246)

#1 is due to "if (t->pmtu_pending)".
8a4794914f9cf2681235ec2311e189fe307c28c7 "[SCTP] Flag a pmtu change request"
suggests it should be initialized to 0.

#2 is the heartbeat timer 'expires' value, which is uninizialised, but
test by mod_timer().
T3_rtx_timer seems to be affected by the same problem, so initialize it, too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agosctp: Don't abort initialization when CONFIG_PROC_FS=n
Florian Westphal [Sat, 19 Jul 2008 06:03:44 +0000 (23:03 -0700)]
sctp: Don't abort initialization when CONFIG_PROC_FS=n

This puts CONFIG_PROC_FS defines around the proc init/exit functions
and also avoids compiling proc.c if procfs is not supported.
Also make SCTP_DBG_OBJCNT depend on procfs.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agotcp: RTT metrics scaling
Stephen Hemminger [Sat, 19 Jul 2008 06:02:15 +0000 (23:02 -0700)]
tcp: RTT metrics scaling

Some of the metrics (RTT, RTTVAR and RTAX_RTO_MIN) are stored in
kernel units (jiffies) and this leaks out through the netlink API to
user space where the units for jiffies are unknown.

This patches changes the kernel to convert to/from milliseconds. This
changes the ABI, but milliseconds seemed like the most natural unit
for these parameters.  Values available via syscall in
/proc/net/rt_cache and netlink will be in milliseconds.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agopkt_sched: Fix noqueue_qdisc initialization.
David S. Miller [Sat, 19 Jul 2008 06:00:11 +0000 (23:00 -0700)]
pkt_sched: Fix noqueue_qdisc initialization.

Like noop_qdisc, it needs a dummy backpointer and
explicit qdisc->q.lock initialization.

Based upon a report by Stephen Hemminger.

Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agopkt_sched: Manage qdisc list inside of root qdisc.
David S. Miller [Sat, 19 Jul 2008 05:50:15 +0000 (22:50 -0700)]
pkt_sched: Manage qdisc list inside of root qdisc.

Idea is from Patrick McHardy.

Instead of managing the list of qdiscs on the device level, manage it
in the root qdisc of a netdev_queue.  This solves all kinds of
visibility issues during qdisc destruction.

The way to iterate over all qdiscs of a netdev_queue is to visit
the netdev_queue->qdisc, and then traverse it's list.

The only special case is to ignore builting qdiscs at the root when
dumping or doing a qdisc_lookup().  That was not needed previously
because builtin qdiscs were not added to the device's qdisc_list.

Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agopkt_sched: Get rid of u32_list.
David S. Miller [Sat, 19 Jul 2008 03:54:17 +0000 (20:54 -0700)]
pkt_sched: Get rid of u32_list.

The u32_list is just an indirect way of maintaining a reference
to a U32 node on a per-qdisc basis.

Just add an explicit node pointer for u32 to struct Qdisc an do
away with this global list.

Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agopacket: add PACKET_RESERVE sockopt
Patrick McHardy [Sat, 19 Jul 2008 01:05:19 +0000 (18:05 -0700)]
packet: add PACKET_RESERVE sockopt

Add new sockopt to reserve some headroom in the mmaped ring frames in
front of the packet payload. This can be used f.i. when the VLAN header
needs to be (re)constructed to avoid moving the entire payload.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agobnx2: Update version to 1.7.9.
Benjamin Li [Sat, 19 Jul 2008 00:58:57 +0000 (17:58 -0700)]
bnx2: Update version to 1.7.9.

Signed-off-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agobnx2: Fix Sparse warnings
Benjamin Li [Sat, 19 Jul 2008 00:57:26 +0000 (17:57 -0700)]
bnx2: Fix Sparse warnings

This patch will fix the following sparse warnings:

/home/benli/sparse/bnx2.c:297:8: warning: symbol 'val' shadows an earlier one
/home/benli/sparse/bnx2.c:286:60: originally declared here
/home/benli/sparse/bnx2.c:7461:7: warning: symbol 'i' shadows an earlier one
/home/benli/sparse/bnx2.c:7265:10: originally declared here

Signed-off-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agobnx2: Add TX multiqueue support.
Benjamin Li [Sat, 19 Jul 2008 00:55:11 +0000 (17:55 -0700)]
bnx2: Add TX multiqueue support.

Signed-off-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agobnx2: Update TPAT firmware
Benjamin Li [Sat, 19 Jul 2008 00:54:17 +0000 (17:54 -0700)]
bnx2: Update TPAT firmware

This change allows the first TX ring (CID 16) and the first TSS TX ring
(CID 32) to be used concurrently.  Before this change, we could get TSO
errors when both TX rings were used concurrently.

Signed-off-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoe1000: resolve tx multiqueue bug
Ben Hutchings [Sat, 19 Jul 2008 00:50:57 +0000 (17:50 -0700)]
e1000: resolve tx multiqueue bug

With the recent changes to tx mutiqueue, e1000 was not calling
netif_start_queue() before calling netif_wake_queue().
This causes an oops during loading of the driver.

(Based on commit d55b53fff0c2ddb639dca04c3f5a0854f292d982
("igb/ixgbe/e1000e: resolve tx multiqueue bug").)

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoigb/ixgbe/e1000e: resolve tx multiqueue bug
Jeff Kirsher [Fri, 18 Jul 2008 11:33:03 +0000 (04:33 -0700)]
igb/ixgbe/e1000e: resolve tx multiqueue bug

With the recent changes to tx mutiqueue, igb/ixgbe/e1000e was not calling
netif_tx_start_all_queues() before calling netif_tx_wake_all_queues().
This causes an issue during loading of the driver.

In addition, updated e1000e to use the updated tx mutliqueue api.

Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoproc: consolidate per-net single-release callers
Pavel Emelyanov [Fri, 18 Jul 2008 11:07:44 +0000 (04:07 -0700)]
proc: consolidate per-net single-release callers

They are symmetrical to single_open ones :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoproc: consolidate per-net single_open callers
Pavel Emelyanov [Fri, 18 Jul 2008 11:07:21 +0000 (04:07 -0700)]
proc: consolidate per-net single_open callers

There are already 7 of them - time to kill some duplicate code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoproc: clean the ip_misc_proc_init and ip_proc_init_net error paths
Pavel Emelyanov [Fri, 18 Jul 2008 11:06:50 +0000 (04:06 -0700)]
proc: clean the ip_misc_proc_init and ip_proc_init_net error paths

After all this stuff is moved outside, this function can look better.

Besides, I tuned the error path in ip_proc_init_net to make it have
only 2 exit points, not 3.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoproc: show per-net ip_devconf.forwarding in /proc/net/snmp
Pavel Emelyanov [Fri, 18 Jul 2008 11:06:26 +0000 (04:06 -0700)]
proc: show per-net ip_devconf.forwarding in /proc/net/snmp

This one has become per-net long ago, but the appropriate file
is per-net only now.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoproc: create /proc/net/snmp file in each net
Pavel Emelyanov [Fri, 18 Jul 2008 11:06:04 +0000 (04:06 -0700)]
proc: create /proc/net/snmp file in each net

All the statistics shown in this file have been made per-net already.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
16 years agoproc: create /proc/net/netstat file in each net
Pavel Emelyanov [Fri, 18 Jul 2008 11:05:17 +0000 (04:05 -0700)]
proc: create /proc/net/netstat file in each net

Now all the shown in it statistics is netnsizated, time to
show it in appropriate net.

The appropriate net init/exit ops already exist - they make
the sockstat file per net - so just extend them.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>