Merge branch 'upstream' into for-linus

diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt

index 7254b4b..fe67b5c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -27,7 +27,7 @@ applying a filter to each packet that assigns it to one of a small number
  of logical flows. Packets for each flow are steered to a separate receive
  queue, which in turn can be processed by separate CPUs. This mechanism is
  generally known as “Receive-side Scaling” (RSS). The goal of RSS and
-the other scaling techniques to increase performance uniformly.
+the other scaling techniques is to increase performance uniformly.
  Multi-queue distribution can also be used for traffic prioritization, but
  that is not the focus of these techniques.
  
@@ -52,7 +52,8 @@ module parameter for specifying the number of hardware queues to
  configure. In the bnx2x driver, for instance, this parameter is called
  num_queues. A typical RSS configuration would be to have one receive queue
  for each CPU if the device supports enough queues, or otherwise at least
-one for each cache domain at a particular cache level (L1, L2, etc.).
+one for each memory domain, where a memory domain is a set of CPUs that
+share a particular memory level (L1, L2, NUMA node, etc.).
  
  The indirection table of an RSS device, which resolves a queue by masked
  hash, is usually programmed by the driver at initialization. The
@@ -82,11 +83,17 @@ RSS should be enabled when latency is a concern or whenever receive
  interrupt processing forms a bottleneck. Spreading load between CPUs
  decreases queue length. For low latency networking, the optimal setting
  is to allocate as many queues as there are CPUs in the system (or the
-NIC maximum, if lower). Because the aggregate number of interrupts grows
-with each additional queue, the most efficient high-rate configuration
+NIC maximum, if lower). The most efficient high-rate configuration
  is likely the one with the smallest number of receive queues where no
-CPU that processes receive interrupts reaches 100% utilization. Per-cpu
-load can be observed using the mpstat utility.
+receive queue overflows due to a saturated CPU, because in default
+mode with interrupt coalescing enabled, the aggregate number of
+interrupts (and thus work) grows with each additional queue.
+
+Per-cpu load can be observed using the mpstat utility, but note that on
+processors with hyperthreading (HT), each hyperthread is represented as
+a separate CPU. For interrupt handling, HT has shown no benefit in
+initial tests, so limit the number of queues to the number of CPU cores
+in the system.
  
  
  RPS: Receive Packet Steering
@@ -145,7 +152,7 @@ the bitmap.
  == Suggested Configuration
  
  For a single queue device, a typical RPS configuration would be to set
-the rps_cpus to the CPUs in the same cache domain of the interrupting
+the rps_cpus to the CPUs in the same memory domain of the interrupting
  CPU. If NUMA locality is not an issue, this could also be all CPUs in
  the system. At high interrupt rate, it might be wise to exclude the
  interrupting CPU from the map since that already performs much work.
@@ -154,7 +161,7 @@ For a multi-queue system, if RSS is configured so that a hardware
  receive queue is mapped to each CPU, then RPS is probably redundant
  and unnecessary. If there are fewer hardware queues than CPUs, then
  RPS might be beneficial if the rps_cpus for each queue are the ones that
-share the same cache domain as the interrupting CPU for that queue.
+share the same memory domain as the interrupting CPU for that queue.
  
  
  RFS: Receive Flow Steering
@@ -179,10 +186,10 @@ are steered using plain RPS. Multiple table entries may point to the
  same CPU. Indeed, with many flows and few CPUs, it is very likely that
  a single application thread handles flows with many different flow hashes.
  
-rps_sock_table is a global flow table that contains the *desired* CPU for
-flows: the CPU that is currently processing the flow in userspace. Each
-table value is a CPU index that is updated during calls to recvmsg and
-sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
+rps_sock_flow_table is a global flow table that contains the *desired* CPU
+for flows: the CPU that is currently processing the flow in userspace.
+Each table value is a CPU index that is updated during calls to recvmsg
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
  and tcp_splice_read()).
  
  When the scheduler moves a thread to a new CPU while it has outstanding
@@ -236,7 +243,7 @@ configured. The number of entries in the global flow table is set through:
  
  The number of entries in the per-queue flow table are set through:
  
- /sys/class/net/<dev>/queues/tx-<n>/rps_flow_cnt
+ /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
  
  == Suggested Configuration
  
@@ -326,7 +333,7 @@ The queue chosen for transmitting a particular flow is saved in the
  corresponding socket structure for the flow (e.g. a TCP connection).
  This transmit queue is used for subsequent packets sent on the flow to
  prevent out of order (ooo) packets. The choice also amortizes the cost
-of calling get_xps_queues() over all packets in the connection. To avoid
+of calling get_xps_queues() over all packets in the flow. To avoid
  ooo packets, the queue for a flow can subsequently only be changed if
  skb->ooo_okay is set for a packet in the flow. This flag indicates that
  there are no outstanding packets in the flow, so the transmit queue can
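
Note on the rps_cpus hunks above: the patched text recommends setting rps_cpus to the CPUs in the same memory domain as the interrupting CPU. The value is a CPU bitmap, so a list of CPU indices first has to be converted into a hexadecimal mask string. The following is a minimal sketch of that conversion, not code from the kernel tree. The helper name cpus_to_rps_mask is hypothetical, and the comma-separated 32-bit grouping is assumed from the kernel's usual bitmap string format.

```python
def cpus_to_rps_mask(cpus):
    """Build a hex CPU bitmap string from an iterable of CPU indices.

    Hypothetical helper for illustration only. The output uses
    comma-separated 32-bit hex words, most significant word first.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu  # set the bit that corresponds to this CPU

    # Split the mask into 32-bit words, least significant word first.
    words = []
    while True:
        words.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    words.reverse()

    # The leading word is printed without padding; later words are
    # zero-padded to eight hex digits so the grouping stays unambiguous.
    return ",".join(
        f"{word:08x}" if index else f"{word:x}"
        for index, word in enumerate(words)
    )


if __name__ == "__main__":
    # Example: CPUs 0-3 share a memory domain with the interrupting CPU.
    print(cpus_to_rps_mask([0, 1, 2, 3]))
```

The resulting string is what would be written to the rps_cpus file of a given rx-<n> queue. Which CPUs actually share a memory domain depends on the machine topology and is not derived by this sketch.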