sched: re-tune NUMA topologies
author Ingo Molnar <mingo@elte.hu>
Thu, 29 May 2008 12:32:23 +0000 (14:32 +0200)
committer Ingo Molnar <mingo@elte.hu>
Thu, 29 May 2008 12:46:30 +0000 (14:46 +0200)
improve the sysbench ramp-up phase and its peak throughput on
a 16-way NUMA box by turning on WAKE_AFFINE (rows are client counts):

             tip/sched   tip/sched+wake-affine
-------------------------------------------------
    1:             700              830    +15.65%
    2:            1465             1391    -5.28%
    4:            3017             3105    +2.81%
    8:            5100             6021    +15.30%
   16:           10725            10745    +0.19%
   32:           10135            10150    +0.16%
   64:            9338             9240    -1.06%
  128:            8599             8252    -4.21%
  256:            8475             8144    -4.07%
-------------------------------------------------
  SUM:           57558            57882    +0.56%

this change also improves lat_ctx from 6.69 usecs to 1.11 usecs
(the first run below is with the change, the second without it):

  $ ./lat_ctx -s 0 2
  "size=0k ovr=1.19
  2 1.11

  $ ./lat_ctx -s 0 2
  "size=0k ovr=1.22
  2 6.69

in sysbench it's an overall win, with some weakness at high client
counts. That happens because we now under-balance this workload
a bit. To counter that effect, turn on NEWIDLE:

            wake-affine        wake-affine+newidle
 -------------------------------------------------
     1:             830              834    +0.43%
     2:            1391             1401    +0.65%
     4:            3105             3091    -0.43%
     8:            6021             6046    +0.42%
    16:           10745            10736    -0.08%
    32:           10150            10206    +0.55%
    64:            9240             9533    +3.08%
   128:            8252             8355    +1.24%
   256:            8144             8384    +2.87%
 -------------------------------------------------
   SUM:           57882            58591    +1.21%

as a bonus, this improves not only the many-clients case but
also the (more important) ramp-up phase.

sysbench is a workload that quickly breaks down if the
scheduler over-balances, so the fact that it improved under
NEWIDLE is a good sign that this change does not over-balance.

include/linux/topology.h

index 4bb7074..24f3d22 100644
@@ -166,7 +166,9 @@ void arch_update_cpu_topology(void);
        .busy_idx               = 3,                    \
        .idle_idx               = 3,                    \
        .flags                  = SD_LOAD_BALANCE       \
-                               | SD_SERIALIZE, \
+                               | SD_BALANCE_NEWIDLE    \
+                               | SD_WAKE_AFFINE        \
+                               | SD_SERIALIZE,         \
        .last_balance           = jiffies,              \
        .balance_interval       = 64,                   \
 }
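
The hunk above only changes which bits are OR-ed into the .flags
initializer. As a rough, userspace-only sketch of that bitmask pattern
(not kernel code): the numeric flag values below are illustrative
assumptions rather than the kernel's actual definitions, and the helper
names numa_flags_before(), numa_flags_after() and has_flag() are
hypothetical.

```c
/* Illustrative bit values only (an assumption for this sketch); the
 * kernel defines the real SD_* constants itself and they may differ. */
#define SD_LOAD_BALANCE     0x001u
#define SD_BALANCE_NEWIDLE  0x002u
#define SD_WAKE_AFFINE      0x020u
#define SD_SERIALIZE        0x400u

/* Hypothetical helper: flag set of the NUMA domain before this change. */
static unsigned int numa_flags_before(void)
{
    return SD_LOAD_BALANCE
         | SD_SERIALIZE;
}

/* Hypothetical helper: flag set of the NUMA domain after this change. */
static unsigned int numa_flags_after(void)
{
    return SD_LOAD_BALANCE
         | SD_BALANCE_NEWIDLE
         | SD_WAKE_AFFINE
         | SD_SERIALIZE;
}

/* Returns 1 if every bit of 'flag' is set in 'flags', 0 otherwise. */
static int has_flag(unsigned int flags, unsigned int flag)
{
    return (flags & flag) == flag;
}
```

With these helpers, has_flag(numa_flags_after(), SD_WAKE_AFFINE) is true
while has_flag(numa_flags_before(), SD_WAKE_AFFINE) is false; the
SD_LOAD_BALANCE and SD_SERIALIZE bits are set in both cases.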