Pandora Sourcecodes - pandora-kernel.git/log

git.openpandora.org / pandora-kernel.git / log

summary | shortlog | log | commit | commitdiff | tree
first ⋅ prev ⋅ next

commit | commitdiff | tree

Vegard Nossum [Fri, 27 Jun 2008 19:35:50 +0000 (21:35 +0200)]

sched: fix warning

This patch fixes the following warning:

kernel/sched.c:1667: warning: 'cfs_rq_set_shares' defined but not used

This seems the correct way to fix this; cfs_rq_set_shares() is only used
in a single place, which is also inside #ifdef CONFIG_FAIR_GROUP_SCHED.

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Ingo Molnar [Fri, 27 Jun 2008 13:42:36 +0000 (15:42 +0200)]

sched: build fix

fix:

kernel/sched.c: In function ‘sched_group_set_shares':
kernel/sched.c:8635: error: implicit declaration of function ‘cfs_rq_set_shares'

Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Ingo Molnar [Sun, 29 Jun 2008 13:01:59 +0000 (15:01 +0200)]

sched: sched_clock_cpu() based cpu_clock(), lockdep fix

Vegard Nossum reported:

> WARNING: at kernel/lockdep.c:2738 check_flags+0x142/0x160()

which happens due to:

unsigned long long cpu_clock(int cpu)
{
         unsigned long long clock;
         unsigned long flags;

         raw_local_irq_save(flags);

as lower level functions can take locks, we must not do that, use
proper lockdep-annotated irq save/restore.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Ingo Molnar [Fri, 27 Jun 2008 12:49:35 +0000 (14:49 +0200)]

sched: export cpu_clock

the rcutorture module relies on cpu_clock.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Dhaval Giani [Tue, 24 Jun 2008 18:09:43 +0000 (23:39 +0530)]

sched: make sched_{rt,fair}.c ifdefs more readable

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:39 +0000 (13:41 +0200)]

sched: bias effective_load() error towards failing wake_affine().

Measurement shows that the difference between cgroup:/ and cgroup:/foo
wake_affine() results is that the latter succeeds significantly more.

Therefore bias the calculations towards failing the test.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:38 +0000 (13:41 +0200)]

sched: incremental effective_load()

Increase the accuracy of the effective_load values.

Not only consider the current increment (as per the attempted wakeup), but
also consider the delta between when we last adjusted the shares and the
current situation.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:37 +0000 (13:41 +0200)]

sched: correct wakeup weight calculations

rw_i = {2, 4, 1, 0}
s_i = {2/7, 4/7, 1/7, 0}

wakeup on cpu0, weight=1

rw'_i = {3, 4, 1, 0}
s'_i = {3/8, 4/8, 1/8, 0}

s_0 = S * rw_0 / \Sum rw_j ->
\Sum rw_j = S*rw_0/s_0 = 1*2*7/2 = 7 (correct)

s'_0 = S * (rw_0 + 1) / (\Sum rw_j + 1) =
1 * (2+1) / (7+1) = 3/8 (correct

so we find that adding 1 to cpu0 gains 5/56 in weight
if say the other cpu were, cpu1, we'd also have to calculate its 4/56 loss

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Srivatsa Vaddagiri [Fri, 27 Jun 2008 11:41:36 +0000 (13:41 +0200)]

sched: fix mult overflow

It was observed these mults can overflow.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:35 +0000 (13:41 +0200)]

sched: update shares on wakeup

We found that the affine wakeup code needs rather accurate load figures
to be effective. The trouble is that updating the load figures is fairly
expensive with group scheduling. Therefore ratelimit the updating.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:34 +0000 (13:41 +0200)]

sched: fix shares boost logic

In case the domain is empty, pretend there is a single task on each cpu, so
that together with the boost logic we end up giving 1/n shares to each
cpu.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:33 +0000 (13:41 +0200)]

sched: disable source/target_load bias

The bias given by source/target_load functions can be very large, disable
it by default to get faster convergence.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:32 +0000 (13:41 +0200)]

sched: optimize effective_load()

s_i = S * rw_i / \Sum_j rw_j

-> \Sum_j rw_j = S * rw_i / s_i

-> s'_i = S * (rw_i + w) / (\Sum_j rw_j + w)

delta s = s' - s = S * (rw + w) / ((S * rw / s) + w)
= s * (S * (rw + w) / (S * rw + s * w) - 1)

a = S*(rw+w), b = S*rw + s*w

delta s = s * (a-b) / b

IOW, trade one divide for two multiplies

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:31 +0000 (13:41 +0200)]

sched: remove prio preference from balance decisions

Priority looses much of its meaning in a hierarchical context. So don't
use it in balance decisions.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:30 +0000 (13:41 +0200)]

sched: fix task_h_load()

Currently task_h_load() computes the load of a task and uses that to either
subtract it from the total, or add to it.

However, removing or adding a task need not have any effect on the total load
at all. Imagine adding a task to a group that is local to one cpu - in that
case the total load of that cpu is unaffected.

So properly compute addition/removal:

s_i = S * rw_i / \Sum_j rw_j
s'_i = S * (rw_i + wl) / (\Sum_j rw_j + wg)

then s'_i - s_i gives the change in load.

Where s_i is the shares for cpu i, S the group weight, rw_i the runqueue weight
for that cpu, wl the weight we add (subtract) and wg the weight contribution to
the runqueue.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:29 +0000 (13:41 +0200)]

sched: fix load scaling in group balancing

doing the load balance will change cfs_rq->load.weight (that's the whole point)
but since that's part of the scale factor, we'll scale back with a different
amount.

Weight getting smaller would result in an inflated moved_load which causes
it to stop balancing too soon.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:28 +0000 (13:41 +0200)]

sched: hierarchical load vs find_busiest_group

find_busiest_group() has some assumptions about task weight being in the
NICE_0_LOAD range. Hierarchical task groups break this assumption - fix this
by replacing it with the average task weight, which will adapt the situation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:27 +0000 (13:41 +0200)]

sched: hierarchical load vs affine wakeups

With hierarchical grouping we can't just compare task weight to rq weight - we
need to scale the weight appropriately.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:26 +0000 (13:41 +0200)]

sched: persistent average load per task

Remove the fall-back to SCHED_LOAD_SCALE by remembering the previous value of
cpu_avg_load_per_task() - this is useful because of the hierarchical group
model in which task weight can be much smaller.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:25 +0000 (13:41 +0200)]

sched: fix sched_balance_self() smp group balancing

Finding the least idle cpu is more accurate when done with updated shares.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:24 +0000 (13:41 +0200)]

sched: fix newidle smp group balancing

Re-compute the shares on newidle - so we can make a decision based on
recent data.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:23 +0000 (13:41 +0200)]

sched: simplify the group load balancer

While thinking about the previous patch - I realized that using per domain
aggregate load values in load_balance_fair() is wrong. We should use the
load value for that CPU.

By not needing per domain hierarchical load values we don't need to store
per domain aggregate shares, which greatly simplifies all the math.

It basically falls apart in two separate computations:
- per domain update of the shares
- per CPU update of the hierarchical load

Also get rid of the move_group_shares() stuff - just re-compute the shares
again after a successful load balance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:22 +0000 (13:41 +0200)]

sched: no need to aggregate task_weight

We only need to know the task_weight of the busiest rq - nothing to do
if there are no tasks there.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:21 +0000 (13:41 +0200)]

sched: dont micro manage share losses

We used to try and contain the loss of 'shares' by playing arithmetic
games. Replace that by noticing that at the top sched_domain we'll
always have the full weight in shares to distribute.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Srivatsa Vaddagiri [Fri, 27 Jun 2008 11:41:20 +0000 (13:41 +0200)]

sched: kill task_group balancing

The idea was to balance groups until we've reached the global goal, however
Vatsa rightly pointed out that we might never reach that goal this way -
hence take out this logic.

[ the initial rationale for this 'feature' was to promote max concurrency
within a group - it does not however affect fairness ]

Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:19 +0000 (13:41 +0200)]

sched: update aggregate when holding the RQs

It was observed that in __update_group_shares_cpu()

rq_weight > aggregate()->rq_weight

This is caused by forks/wakeups in between the initial aggregate pass and
locking of the RQs for load balance. To avoid this situation partially re-do
the aggregation once we have the RQs locked (which avoids new tasks from
appearing).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:18 +0000 (13:41 +0200)]

sched: fix sched_domain aggregation

Keeping the aggregate on the first cpu of the sched domain has two problems:
- it could collide between different sched domains on different cpus
- it could slow things down because of the remote accesses

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:17 +0000 (13:41 +0200)]

sched: add full schedstats to /proc/sched_debug

show all the schedstats in /debug/sched_debug as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:16 +0000 (13:41 +0200)]

sched: fix wakeup granularity and buddy granularity

Uncouple buddy selection from wakeup granularity.

The initial idea was that buddies could run ahead as far as a normal task
can - do this by measuring a pair 'slice' just as we do for a normal task.

This means we can drop the wakeup_granularity back to 5ms.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:15 +0000 (13:41 +0200)]

sched: sched_clock_cpu() based cpu_clock()

with sched_clock_cpu() being reasonably in sync between cpus (max 1 jiffy
difference) use this to provide cpu_clock().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:14 +0000 (13:41 +0200)]

sched: revert revert of: fair-group: SMP-nice for group scheduling

Try again..

Initial commit: 18d95a2832c1392a2d63227a7a6d433cb9f2037e
Revert: 6363ca57c76b7b83639ca8c83fc285fa26a7880e

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:13 +0000 (13:41 +0200)]

sched: fix calc_delta_asym, #2

Ok, so why are we in this mess, it was:

1/w

but now we mixed that rw in the mix like:

rw/w

rw being \Sum w suggests: fiddling w, we should also fiddle rw, humm?

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:12 +0000 (13:41 +0200)]

sched: fix calc_delta_asym()

calc_delta_asym() is supposed to do the same as calc_delta_fair() except
linearly shrink the result for negative nice processes - this causes them
to have a smaller preemption threshold so that they are more easily preempted.

The problem is that for task groups se->load.weight is the per cpu share of
the actual task group weight; take that into account.

Also provide a debug switch to disable the asymmetry (which I still don't
like - but it does greatly benefit some workloads)

This would explain the interactivity issues reported against group scheduling.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:11 +0000 (13:41 +0200)]

sched: revert the revert of: weight calculations

Try again..

initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c
revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Fri, 27 Jun 2008 11:41:10 +0000 (13:41 +0200)]

sched: clean up some unused variables

In file included from /mnt/build/linux-2.6/kernel/sched.c:1496:
/mnt/build/linux-2.6/kernel/sched_rt.c: In function '__enable_runtime':
/mnt/build/linux-2.6/kernel/sched_rt.c:339: warning: unused variable 'rd'
/mnt/build/linux-2.6/kernel/sched_rt.c: In function 'requeue_rt_entity':
/mnt/build/linux-2.6/kernel/sched_rt.c:692: warning: unused variable 'queue'

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Ingo Molnar [Wed, 25 Jun 2008 10:26:59 +0000 (12:26 +0200)]

Merge branch 'linus' into sched/devel

Conflicts:

kernel/sched_rt.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Linus Torvalds [Wed, 25 Jun 2008 01:58:20 +0000 (18:58 -0700)]

Linux 2.6.26-rc8

commit | commitdiff | tree

Linus Torvalds [Wed, 25 Jun 2008 01:12:33 +0000 (18:12 -0700)]

Merge branch 'release' of git://git./linux/kernel/git/aegl/linux-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] Eliminate NULL test after alloc_bootmem in iosapic_alloc_rte()
  [IA64] Handle count==0 in sn2_ptc_proc_write()
  [IA64] Fix boot failure on ia64/sn2

commit | commitdiff | tree

Linus Torvalds [Wed, 25 Jun 2008 01:09:47 +0000 (18:09 -0700)]

Merge git://git./linux/kernel/git/steve/gfs2-2.6-fixes

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
[GFS2] fix gfs2 block allocation (cleaned up)
[GFS2] BUG: unable to handle kernel paging request at ffff81002690e000

commit | commitdiff | tree

Linus Torvalds [Wed, 25 Jun 2008 01:09:06 +0000 (18:09 -0700)]

Merge branch 'kvm-updates-2.6.26' of git://git./linux/kernel/git/avi/kvm

* 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
  KVM: Remove now unused structs from kvm_para.h
  x86: KVM guest: Use the paravirt clocksource structs and functions
  KVM: Make kvm host use the paravirt clocksource structs
  x86: Make xen use the paravirt clocksource structs and functions
  x86: Add structs and functions for paravirt clocksource
  KVM: VMX: Fix host msr corruption with preemption enabled
  KVM: ioapic: fix lost interrupt when changing a device's irq
  KVM: MMU: Fix oops on guest userspace access to guest pagetable
  KVM: MMU: large page update_pte issue with non-PAE 32-bit guests (resend)
  KVM: MMU: Fix rmap_write_protect() hugepage iteration bug
  KVM: close timer injection race window in __vcpu_run
  KVM: Fix race between timer migration and vcpu migration

commit | commitdiff | tree

Linus Torvalds [Tue, 24 Jun 2008 18:23:35 +0000 (11:23 -0700)]

Merge git://git./linux/kernel/git/wim/linux-2.6-watchdog

* git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdog:
Revert "[WATCHDOG] hpwdt: Add CFLAGS to get driver working"

commit | commitdiff | tree

Linus Torvalds [Tue, 24 Jun 2008 18:21:47 +0000 (11:21 -0700)]

Merge branch 'x86-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
xen: remove support for non-PAE 32-bit

commit | commitdiff | tree

Linus Torvalds [Tue, 24 Jun 2008 18:20:59 +0000 (11:20 -0700)]

Merge branch 'for_linus' of git://git./linux/kernel/git/jwessel/linux-2.6-kgdb

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kgdb: sparse fix
kgdb: documentation update - remove kgdboe

commit | commitdiff | tree

Jie Luo [Tue, 24 Jun 2008 17:38:31 +0000 (10:38 -0700)]

enable bus mastering on i915 at resume time

On 9xx chips, bus mastering needs to be enabled at resume time for much of the
chip to function. With this patch, vblank interrupts will work as expected
on resume, along with other chip functions. Fixes kernel bugzilla #10844.

Signed-off-by: Jie Luo <clotho67@gmail.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:33 +0000 (16:17 +0200)]

KVM: Remove now unused structs from kvm_para.h

The kvm_* structs are obsoleted by the pvclock_* ones.
Now all users have been switched over and the old structs
can be dropped.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:32 +0000 (16:17 +0200)]

x86: KVM guest: Use the paravirt clocksource structs and functions

This patch updates the kvm host code to use the pvclock structs
and functions, thereby making it compatible with Xen.

The patch also fixes an initialization bug: on SMP systems the
per-cpu has two different locations early at boot and after CPU
bringup. kvmclock must take that in account when registering the
physical address within the host.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:31 +0000 (16:17 +0200)]

KVM: Make kvm host use the paravirt clocksource structs

This patch updates the kvm host code to use the pvclock structs.
It also makes the paravirt clock compatible with Xen.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:30 +0000 (16:17 +0200)]

x86: Make xen use the paravirt clocksource structs and functions

This patch updates the xen guest to use the pvclock structs
and helper functions.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:29 +0000 (16:17 +0200)]

x86: Add structs and functions for paravirt clocksource

This patch adds structs for the paravirt clocksource ABI
used by both xen and kvm (pvclock-abi.h).

It also adds some helper functions to read system time and
wall clock time from a paravirtual clocksource (pvclock.[ch]).
They are based on the xen code. They are enabled using
CONFIG_PARAVIRT_CLOCK.

Subsequent patches of this series will put the code in use.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Benjamin Marzinski [Tue, 24 Jun 2008 17:53:38 +0000 (12:53 -0500)]

[GFS2] fix gfs2 block allocation (cleaned up)

This patch fixes bz 450641.

This patch changes the computation for zero_metapath_length(), which it
renames to metapath_branch_start(). When you are extending the metadata
tree, The indirect blocks that point to the new data block must either
diverge from the existing tree either at the inode, or at the first
indirect block. They can diverge at the first indirect block because the
inode has room for 483 pointers while the indirect blocks have room for
509 pointers, so when the tree is grown, there is some free space in the
first indirect block. What metapath_branch_start() now computes is the
height where the first indirect block for the new data block is located.
It can either be 1 (if the indirect block diverges from the inode) or 2
(if it diverges from the first indirect block).

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

commit | commitdiff | tree

Julia Lawall [Tue, 24 Jun 2008 08:22:05 +0000 (10:22 +0200)]

[IA64] Eliminate NULL test after alloc_bootmem in iosapic_alloc_rte()

As noted by Akinobu Mita alloc_bootmem and related functions never return
NULL and always return a zeroed region of memory. Thus a NULL test or
memset after calls to these functions is unnecessary.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Tony Luck <tony.luck@intel.com>

commit | commitdiff | tree

Cliff Wickman [Tue, 24 Jun 2008 17:20:06 +0000 (10:20 -0700)]

[IA64] Handle count==0 in sn2_ptc_proc_write()

The fix applied in e0c6d97c65e0784aade7e97b9411f245a6c543e7
"security hole in sn2_ptc_proc_write" didn't take into account
the case where count==0 (which results in a buffer underrun
when adding the trailing '\0'). Thanks to Andi Kleen for
pointing this out.

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

commit | commitdiff | tree

Jes Sorensen [Tue, 24 Jun 2008 15:30:09 +0000 (11:30 -0400)]

[IA64] Fix boot failure on ia64/sn2

Call check_sal_cache_flush() after platform_setup() as
check_sal_cache_flush() now relies on being able to call platform
vector code.

Problem was introduced by: 3463a93def55c309f3c0d0a8aaf216be3be42d64
"Update check_sal_cache_flush to use platform_send_ipi()"

Signed-off-by: Jes Sorensen <jes@sgi.com>
Tested-by: Alex Chiang: <achiang@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

commit | commitdiff | tree

Jason Wessel [Tue, 24 Jun 2008 15:52:55 +0000 (10:52 -0500)]

kgdb: sparse fix

- Fix warning reported by sparse
kernel/kgdb.c:1502:6: warning: symbol 'kgdb_console_write' was not declared.
Should it be static?

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>

commit | commitdiff | tree

Jason Wessel [Tue, 24 Jun 2008 15:52:55 +0000 (10:52 -0500)]

kgdb: documentation update - remove kgdboe

kgdboe is not presently included kgdb, and there should be no
references to it.

Also fix the tcp port terminal connection example.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>

commit | commitdiff | tree

Jeremy Fitzhardinge [Fri, 9 May 2008 11:05:57 +0000 (12:05 +0100)]

xen: remove support for non-PAE 32-bit

Non-PAE operation has been deprecated in Xen for a while, and is
rarely tested or used. xen-unstable has now officially dropped
non-PAE support. Since Xen/pvops' non-PAE support has also been
broken for a while, we may as well completely drop it altogether.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Bob Peterson [Wed, 18 Jun 2008 16:30:40 +0000 (11:30 -0500)]

[GFS2] BUG: unable to handle kernel paging request at ffff81002690e000

This patch fixes bugzilla bug bz448866: gfs2: BUG: unable to
handle kernel paging request at ffff81002690e000.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

commit | commitdiff | tree

Wim Van Sebroeck [Tue, 24 Jun 2008 13:09:26 +0000 (13:09 +0000)]

Revert "[WATCHDOG] hpwdt: Add CFLAGS to get driver working"

After Linus fixed the inline assembly, the CFLAGS option is not
needed anymore.

Signed-off-by: Thomas Mingarelli <Thomas.Mingarelli@hp.com>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

commit | commitdiff | tree

Avi Kivity [Tue, 24 Jun 2008 08:48:49 +0000 (11:48 +0300)]

KVM: VMX: Fix host msr corruption with preemption enabled

Switching msrs can occur either synchronously as a result of calls to
the msr management functions (usually in response to the guest touching
virtualized msrs), or asynchronously when preempting a kvm thread that has
guest state loaded. If we're unlucky enough to have the two at the same
time, host msrs are corrupted and the machine goes kaput on the next syscall.

Most easily triggered by Windows Server 2008, as it does a lot of msr
switching during bootup.

Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Avi Kivity [Tue, 17 Jun 2008 22:36:36 +0000 (15:36 -0700)]

KVM: ioapic: fix lost interrupt when changing a device's irq

The ioapic acknowledge path translates interrupt vectors to irqs.  It
currently uses a first match algorithm, stopping when it finds the first
redirection table entry containing the vector.  That fails however if the
guest changes the irq to a different line, leaving the old redirection table
entry in place (though masked).  Result is interrupts not making it to the
guest.

Fix by always scanning the entire redirection table.

Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Avi Kivity [Thu, 12 Jun 2008 13:54:41 +0000 (16:54 +0300)]

KVM: MMU: Fix oops on guest userspace access to guest pagetable

KVM has a heuristic to unshadow guest pagetables when userspace accesses
them, on the assumption that most guests do not allow userspace to access
pagetables directly. Unfortunately, in addition to unshadowing the pagetables,
it also oopses.

This never triggers on ordinary guests since sane OSes will clear the
pagetables before assigning them to userspace, which will trigger the flood
heuristic, unshadowing the pagetables before the first userspace access. One
particular guest, though (Xenner) will run the kernel in userspace, triggering
the oops. Since the heuristic is incorrect in this case, we can simply
remove it.

Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Marcelo Tosatti [Wed, 11 Jun 2008 23:32:40 +0000 (20:32 -0300)]

KVM: MMU: large page update_pte issue with non-PAE 32-bit guests (resend)

kvm_mmu_pte_write() does not handle 32-bit non-PAE large page backed
guests properly. It will instantiate two 2MB sptes pointing to the same
physical 2MB page when a guest large pte update is trapped.

Instead of duplicating code to handle this, disallow directory level
updates to happen through kvm_mmu_pte_write(), so the two 2MB sptes
emulating one guest 4MB pte can be correctly created by the page fault
handling path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Marcelo Tosatti [Sun, 8 Jun 2008 04:48:53 +0000 (01:48 -0300)]

KVM: MMU: Fix rmap_write_protect() hugepage iteration bug

rmap_next() does not work correctly after rmap_remove(), as it expects
the rmap chains not to change during iteration. Fix (for now) by restarting
iteration from the beginning.

Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Marcelo Tosatti [Fri, 6 Jun 2008 19:37:36 +0000 (16:37 -0300)]

KVM: close timer injection race window in __vcpu_run

If a timer fires after kvm_inject_pending_timer_irqs() but before
local_irq_disable() the code will enter guest mode and only inject such
timer interrupt the next time an unrelated event causes an exit.

It would be simpler if the timer->pending irq conversion could be done
with IRQ's disabled, so that the above problem cannot happen.

For now introduce a new vcpu requests bit to cancel guest entry.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Marcelo Tosatti [Fri, 6 Jun 2008 19:37:35 +0000 (16:37 -0300)]

KVM: Fix race between timer migration and vcpu migration

A guest vcpu instance can be scheduled to a different physical CPU
between the test for KVM_REQ_MIGRATE_TIMER and local_irq_disable().

If that happens, the timer will only be migrated to the current pCPU on
the next exit, meaning that guest LAPIC timer event can be delayed until
a host interrupt is triggered.

Fix it by cancelling guest entry if any vcpu request is pending. This
has the side effect of nicely consolidating vcpu->requests checks.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Thorsten Kranzkowski [Mon, 23 Jun 2008 20:57:22 +0000 (20:57 +0000)]

alpha: fix compile error in arch/alpha/mm/init.c

Commit 9267b4b3880d00dc2dab90f1d817c856939114f7 ("alpha: fix module load
failures on smp (bug #10926)") causes a regression for my ev4
uniprocessor build:

CC arch/alpha/mm/init.o
/export/data/repositories/linux-2.6/arch/alpha/mm/init.c:34: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘typeof’
make[2]: *** [arch/alpha/mm/init.o] Error 1
make[1]: *** [arch/alpha/mm] Error 2
make: *** [sub-make] Error 2

This fixes it for me (compile and boot tested):

Signed-off-by: Thorsten Kranzkowski <dl8bcu@dl8bcu.de>
Acked-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 23:25:11 +0000 (16:25 -0700)]

Merge branch 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6

* 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: nfs_updatepage(): don't mark page as dirty if an error occurred
  NFS: Fix filehandle size comparisons in the mount code
  NFS: Reduce the NFS mount code stack usage.

commit | commitdiff | tree

Trond Myklebust [Thu, 5 Jun 2008 20:02:35 +0000 (16:02 -0400)]

NFS: nfs_updatepage(): don't mark page as dirty if an error occurred

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

commit | commitdiff | tree

Trond Myklebust [Thu, 19 Jun 2008 19:21:11 +0000 (15:21 -0400)]

NFS: Fix filehandle size comparisons in the mount code

Fix a sign issue in xdr_decode_fhstatus3()
Fix incorrect comparison in nfs_validate_mount_data()

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

commit | commitdiff | tree

Trond Myklebust [Thu, 19 Jun 2008 18:20:11 +0000 (14:20 -0400)]

NFS: Reduce the NFS mount code stack usage.

This appears to fix the Oops reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10826

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:49:22 +0000 (12:49 -0700)]

Merge branch 'core-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futexes: fix fault handling in futex_lock_pi

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:48:50 +0000 (12:48 -0700)]

Merge branch 'sched-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: refactor wait_for_completion_timeout()
  sched: fix wait_for_completion_timeout() spurious failure under heavy load
  sched: rt: dont stop the period timer when there are tasks wanting to run

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:48:17 +0000 (12:48 -0700)]

Merge branch 'x86-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  xen: don't drop NX bit
  xen: mask unwanted pte bits in __supported_pte_mask
  xen: Use wmb instead of rmb in xen_evtchn_do_upcall().
  x86: fix NULL pointer deref in __switch_to

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:45:49 +0000 (12:45 -0700)]

Merge branch 'for-linus' of git://git./linux/kernel/git/roland/infiniband

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mthca: Clear ICM pages before handing to FW

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:18:06 +0000 (12:18 -0700)]

Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: sb - Fix wrong assertions
ALSA: aw2 - Fix Oops at initialization

commit | commitdiff | tree

Nick Piggin [Mon, 23 Jun 2008 12:30:30 +0000 (14:30 +0200)]

mm: fix race in COW logic

There is a race in the COW logic.  It contains a shortcut to avoid the
COW and reuse the page if we have the sole reference on the page,
however it is possible to have two racing do_wp_page()ers with one
causing the other to mistakenly believe it is safe to take the shortcut
when it is not.  This could lead to data corruption.

Process 1 and process2 each have a wp pte of the same anon page (ie.
one forked the other).  The page's mapcount is 2.  Then they both
attempt to write to it around the same time...

  proc1 proc2 thr1 proc2 thr2
  CPU0 CPU1 CPU3
  do_wp_page() do_wp_page()
trylock_page()
  can_share_swap_page()
   load page mapcount (==2)
  reuse = 0
pte unlock
copy page to new_page
pte lock
page_remove_rmap(page);
   trylock_page()
    can_share_swap_page()
     load page mapcount (==1)
    reuse = 1
   ptep_set_access_flags (allow W)

  write private key into page
read from page
ptep_clear_flush()
set_pte_at(pte of new_page)

Fix this by moving the page_remove_rmap of the old page after the pte
clear and flush.  Potentially the entire branch could be moved down
here, but in order to stay consistent, I won't (should probably move all
the *_mm_counter stuff with one patch).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 18:21:37 +0000 (11:21 -0700)]

Fix ZERO_PAGE breakage with vmware

Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE
optimization in 'get_user_pages()' and fix XIP") broke vmware, as
reported by Jeff Chua:

  "This broke vmware 6.0.4.
   Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
   /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"

and the reason seems to be that there's an old bug in how we handle do
FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
triggered if the whole page table was missing, nobody had apparently hit
it before.

The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
not just for whole missing page tables, but for individual pages as
well, and exposed this problem.

This fixes it by making the test for when FOLL_ANON is used more
careful, and also makes the code easier to read and understand by moving
the logic to a separate inline function.

Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Gustavo Fernando Padovan [Mon, 23 Jun 2008 11:07:06 +0000 (12:07 +0100)]

removed unused var real_tty on n_tty_ioctl()

I noted that the 'struct tty_struct *real_tty' is not used in this
function, so I removed the code about 'real_tty'.

Signed-off-by: Gustavo Fernando Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Alan Cox [Mon, 23 Jun 2008 11:06:52 +0000 (12:06 +0100)]

tty_driver: Update required method documentation

Some of the requirement rules are now more relaxed. Also correct a
contradiction in the previous update

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Eli Cohen [Mon, 23 Jun 2008 16:29:58 +0000 (09:29 -0700)]

IB/mthca: Clear ICM pages before handing to FW

Current memfree FW has a bug which in some cases, assumes that ICM
pages passed to it are cleared.  This patch uses __GFP_ZERO to
allocate all ICM pages passed to the FW.  Once firmware with a fix is
released, we can make the workaround conditional on firmware version.

This fixes the bug reported by Arthur Kepner <akepner@sgi.com> here:
http://lists.openfabrics.org/pipermail/general/2008-May/050026.html

Cc: <stable@kernel.org>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Rewritten to be a one-liner using __GFP_ZERO instead of vmap()ing
  ICM memory and memset()ing it to 0. - Roland ]

Signed-off-by: Roland Dreier <rolandd@cisco.com>

commit | commitdiff | tree

Thomas Gleixner [Mon, 23 Jun 2008 09:21:58 +0000 (11:21 +0200)]

futexes: fix fault handling in futex_lock_pi

This patch addresses a very sporadic pi-futex related failure in
highly threaded java apps on large SMP systems.

David Holmes reported that the pi_state consistency check in
lookup_pi_state triggered with his test application. This means that
the kernel internal pi_state and the user space futex variable are out
of sync. First we assumed that this is a user space data corruption,
but deeper investigation revieled that the problem happend because the
pi-futex code is not handling a fault in the futex_lock_pi path when
the user space variable needs to be fixed up.

The fault happens when a fork mapped the anon memory which contains
the futex readonly for COW or the page got swapped out exactly between
the unlock of the futex and the return of either the new futex owner
or the task which was the expected owner but failed to acquire the
kernel internal rtmutex. The current futex_lock_pi() code drops out
with an inconsistent in case it faults and returns -EFAULT to user
space. User space has no way to fixup that state.

When we wrote this code we thought that we could not drop the hash
bucket lock at this point to handle the fault.

After analysing the code again it turned out to be wrong because there
are only two tasks involved which might modify the pi_state and the
user space variable:

- the task which acquired the rtmutex
- the pending owner of the pi_state which did not get the rtmutex

Both tasks drop into the fixup_pi_state() function before returning to
user space. The first task which acquired the hash bucket lock faults
in the fixup of the user space variable, drops the spinlock and calls
futex_handle_fault() to fault in the page. Now the second task could
acquire the hash bucket lock and tries to fixup the user space
variable as well. It either faults as well or it succeeds because the
first task already faulted the page in.

One caveat is to avoid a double fixup. After returning from the fault
handling we reacquire the hash bucket lock and check whether the
pi_state owner has been modified already.

Reported-by: David Holmes <david.holmes@sun.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Holmes <david.holmes@sun.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 73 insertions(+), 20 deletions(-)

commit | commitdiff | tree

Takashi Iwai [Mon, 23 Jun 2008 09:58:06 +0000 (11:58 +0200)]

ALSA: sb - Fix wrong assertions

snd_assert() in save_mixer() and restore_mixer() in sb_mixer.c is
just wrong. The debug code wasn't tested at all, obviously...

Signed-off-by: Takashi Iwai <tiwai@suse.de>

commit | commitdiff | tree

Takashi Iwai [Mon, 23 Jun 2008 09:54:05 +0000 (11:54 +0200)]

ALSA: aw2 - Fix Oops at initialization

The irq handler may be called before the proper initialization of hardware.
Call snd_aw2_saa7146_setup() before the irq handler registration.

Signed-off-by: Takashi Iwai <tiwai@suse.de>

commit | commitdiff | tree

Ingo Molnar [Mon, 23 Jun 2008 09:30:23 +0000 (11:30 +0200)]

Merge branch 'linus' into sched/devel

commit | commitdiff | tree

Ingo Molnar [Mon, 23 Jun 2008 09:00:26 +0000 (11:00 +0200)]

Merge branch 'linus' into sched/urgent

commit | commitdiff | tree

Linus Torvalds [Sun, 22 Jun 2008 19:23:15 +0000 (12:23 -0700)]

Fix performance regression on lmbench select benchmark

Christian Borntraeger reported that reinstating cond_resched() with
CONFIG_PREEMPT caused a performance regression on lmbench:

For example select file 500:
23 microseconds
32 microseconds

and that's really because we totally unnecessarily do the cond_resched()
in the innermost loop of select(), which is just silly.

This moves it out from the innermost loop (which only ever loops ove the
bits in a single "unsigned long" anyway), which makes the performance
regression go away.

Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Christoph Lameter [Sat, 21 Jun 2008 23:46:35 +0000 (16:46 -0700)]

Slab: Fix memory leak in fallback_alloc()

The zonelist patches caused the loop that checks for available
objects in permitted zones to not terminate immediately. One object
per zone per allocation may be allocated and then abandoned.

Break the loop when we have successfully allocated one object.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Sat, 21 Jun 2008 23:43:56 +0000 (16:43 -0700)]

Merge branch 'for_linus' of git://git./linux/kernel/git/tytso/ext4

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Ext4: Fix online resize block group descriptor corruption

commit | commitdiff | tree

Linus Torvalds [Sat, 21 Jun 2008 19:31:02 +0000 (12:31 -0700)]

Merge branch 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6

* 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6:
  hwmon: (lm75) sensor reading bugfix
  hwmon: (abituguru3) update driver detection
  hwmon: (w83791d) new maintainer
  hwmon: (abituguru3) Identify Abit AW8D board as such
  hwmon: Update the sysfs interface documentation
  hwmon: (adt7473) Initialize max_duty_at_overheat before use
  hwmon: (lm85) Fix function RANGE_TO_REG()

commit | commitdiff | tree

Bernhard Walle [Sat, 21 Jun 2008 17:01:02 +0000 (19:01 +0200)]

Add return value to reserve_bootmem_node()

This patch changes the function reserve_bootmem_node() from void to int,
returning -ENOMEM if the allocation fails.

This fixes a build problem on x86 with CONFIG_KEXEC=y and
CONFIG_NEED_MULTIPLE_NODES=y

Signed-off-by: Bernhard Walle <bwalle@suse.de>
Reported-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Sat, 21 Jun 2008 15:44:08 +0000 (08:44 -0700)]

Merge git://git./linux/kernel/git/davem/net-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  netns: Don't receive new packets in a dead network namespace.
  sctp: Make sure N * sizeof(union sctp_addr) does not overflow.
  pppoe: warning fix
  ipv6: Drop packets for loopback address from outside of the box.
  ipv6: Remove options header when setsockopt's optlen is 0
  mac80211: detect driver tx bugs

commit | commitdiff | tree

Eric W. Biederman [Sat, 21 Jun 2008 05:16:51 +0000 (22:16 -0700)]

netns: Don't receive new packets in a dead network namespace.

Alexey Dobriyan <adobriyan@gmail.com> writes:
> Subject: ICMP sockets destruction vs ICMP packets oops

> After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
> icmp_reply() wants ICMP socket.
>
> Steps to reproduce:
>
> launch shell in new netns
> move real NIC to netns
> setup routing
> ping -i 0
> exit from shell
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff803fce17>] icmp_sk+0x17/0x30
> PGD 17f3cd067 PUD 17f3ce067 PMD 0
> Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 0
> Modules linked in: usblp usbcore
> Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
> RIP: 0010:[<ffffffff803fce17>]  [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP: 0018:ffffffff8057fc30  EFLAGS: 00010286
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
> RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
> RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
> R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
> FS:  0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
> Stack:  0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
>  ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
>  000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
> Call Trace:
>  <IRQ>  [<ffffffff803fcfe4>] icmp_reply+0x44/0x1e0
>  [<ffffffff803d3a0a>] ? ip_route_input+0x23a/0x1360
>  [<ffffffff803fd645>] icmp_echo+0x65/0x70
>  [<ffffffff803fd300>] icmp_rcv+0x180/0x1b0
>  [<ffffffff803d6d84>] ip_local_deliver+0xf4/0x1f0
>  [<ffffffff803d71bb>] ip_rcv+0x33b/0x650
>  [<ffffffff803bb16a>] netif_receive_skb+0x27a/0x340
>  [<ffffffff803be57d>] process_backlog+0x9d/0x100
>  [<ffffffff803bdd4d>] net_rx_action+0x18d/0x250
>  [<ffffffff80237be5>] __do_softirq+0x75/0x100
>  [<ffffffff8020c97c>] call_softirq+0x1c/0x30
>  [<ffffffff8020f085>] do_softirq+0x65/0xa0
>  [<ffffffff80237af7>] irq_exit+0x97/0xa0
>  [<ffffffff8020f198>] do_IRQ+0xa8/0x130
>  [<ffffffff80212ee0>] ? mwait_idle+0x0/0x60
>  [<ffffffff8020bc46>] ret_from_intr+0x0/0xf
>  <EOI>  [<ffffffff80212f2c>] ? mwait_idle+0x4c/0x60
>  [<ffffffff80212f23>] ? mwait_idle+0x43/0x60
>  [<ffffffff8020a217>] ? cpu_idle+0x57/0xa0
>  [<ffffffff8040f380>] ? rest_init+0x70/0x80
> Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
> 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 <48> 8b 04 c3 48 83 c4 08
> 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
> RIP  [<ffffffff803fce17>] icmp_sk+0x17/0x30
>  RSP <ffffffff8057fc30>
> CR2: 0000000000000000
> ---[ end trace ea161157b76b33e8 ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!

Receiving packets while we are cleaning up a network namespace is a
racy proposition. It is possible when the packet arrives that we have
removed some but not all of the state we need to fully process it.  We
have the choice of either playing wack-a-mole with the cleanup routines
or simply dropping packets when we don't have a network namespace to
handle them.

Since the check looks inexpensive in netif_receive_skb let's just
drop the incoming packets.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Sat, 21 Jun 2008 05:04:34 +0000 (22:04 -0700)]

sctp: Make sure N * sizeof(union sctp_addr) does not overflow.

As noticed by Gabriel Campana, the kmalloc() length arg
passed in by sctp_getsockopt_local_addrs_old() can overflow
if ->addr_num is large enough.

Therefore, enforce an appropriate limit.

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Sat, 21 Jun 2008 04:58:02 +0000 (21:58 -0700)]

pppoe: warning fix

Fix warning:
drivers/net/pppoe.c: In function 'pppoe_recvmsg':
drivers/net/pppoe.c:945: warning: comparison of distinct pointer types lacks a cast
because skb->len is unsigned int and total_len is size_t

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Linus Torvalds [Sat, 21 Jun 2008 00:10:04 +0000 (17:10 -0700)]

Merge branch 'release' of git://git./linux/kernel/git/aegl/linux-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
[IA64] SN2: security hole in sn2_ptc_proc_write

commit | commitdiff | tree

Ivan Kokshaysky [Fri, 20 Jun 2008 23:28:54 +0000 (03:28 +0400)]

alpha: resurrect Cypress IDE quirk

Which was removed in the hope that generic legacy IDE quirk in
drivers/pci/probe.c is sufficient for Cypress IDE.
It isn't, as this controller has non-standard BAR layout:
secondary channel registers are in the BAR0-1 of the second
PCI function - not in the BAR2-3 of the same function, as the
generic quirk routine assumes.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Ivan Kokshaysky [Fri, 20 Jun 2008 23:28:31 +0000 (03:28 +0400)]

alpha: fix compile failures with gcc-4.3 (bug #10438)

Vast majority of these build failures are gcc-4.3 warnings
about static functions and objects being referenced from
non-static (read: "extern inline") functions, in conjunction
with our -Werror.

We cannot just convert "extern inline" to "static inline",
as people keep suggesting all the time, because "extern inline"
logic is crucial for generic kernel build.
So
- just make sure that all callees of critical "extern inline"
functions are also "extern inline";
- use "static inline", wherever it's possible.

traps.c: work around gcc-4.3 being too smart about array
bounds-checking.

TODO: add "gnu_inline" attribute to all our "extern inline"
functions to ensure desired behaviour with future compilers.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Ivan Kokshaysky [Fri, 20 Jun 2008 23:26:21 +0000 (03:26 +0400)]

alpha: link failure fix

With built-in scsi disk driver, the final link fails with a following
error:
`.exit.text' referenced in section `.rodata' of drivers/built-in.o:
defined in discarded section `.exit.text' of drivers/built-in.o

This happens with -Os (CONFIG_CC_OPTIMIZE_FOR_SIZE=y) with all gcc-4
versions, and also with -O2 and gcc-4.3.

The problem is in sd.c:sd_major() being inlined into __exit function
exit_sd(), and the compiler generating a jump table in .rodata section
for the 'switch' statement in sd_major(). So we have references to
discarded section.

Fixed with a big hammer in the form of -fno-jump-tables.

Note that jump tables vs. discarded sections is a generic problem,
other architectures are just lucky not to suffer from it. But with
a slightly more complex switch/case statement it can be reproduced
on x86 as well. So maybe at some point we should consider
-fno-jump-tables as a generic compile option...

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Ivan Kokshaysky [Fri, 20 Jun 2008 23:25:39 +0000 (03:25 +0400)]

alpha: fix module load failures on smp (bug #10926)

To calculate addresses of locally defined variables, GCC uses 32-bit
displacement from the GP. Which doesn't work for per cpu variables in
modules, as an offset to the kernel per cpu area is way above 4G.

The workaround is to force allocation of a GOT entry for per cpu variable
using ldq instruction with a 'literal' relocation.
I had to use custom asm/percpu.h, as a required argument magic doesn't
work with asm-generic/percpu.h macros.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 23:19:44 +0000 (16:19 -0700)]

Linux 2.6.26-rc7

Pandora Kernel Repository