Limelight Networks ixgbe(4) tuning

Hardware

Tuning

netperf gives a quick view of throughput, and netstat -I ix0 -hw 1 shows a running packets-per-second (pps) count.

The processing limit, harvest, and TSO/LRO settings are the ones that could reduce CPU usage. Since we receive very little inbound traffic, TSO (transmit side) is more likely to help than LRO (receive side). The remaining settings are performance related.

BSD Router Project ixgbe(4) tuning (objective: routing)

Specific tuning for obtaining the best PPS performance (smallest packet size).

Hardware tested

Benchmarks were run on a quad-core Intel Xeon L5630 2.13GHz with hyper-threading disabled (IBM System x3550 M3) and an Intel Ethernet Controller 10-Gigabit X540-AT2:

ix0@pci0:21:0:0:        class=0x020000 card=0x00018086 chip=0x15288086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller 10-Gigabit X540-AT2'
    class      = network
    subclass   = ethernet
    bar   [10] = type Prefetchable Memory, range 64, base 0xfbc00000, size 2097152, enabled
    bar   [20] = type Prefetchable Memory, range 64, base 0xfbe04000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks 
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR link x8(x8)
                 speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 a0369fffff1e2814
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SRIOV 1
    ecap 000d[1d0] = ACS 1

Bench lab and packet generator/receiver

Lab diagram:

+-----------------------------------+      +----------------------------------+
|   Packet generator and receiver   |      |         Device under Test        |
|                                   |      |                                  |
| ix0: 8.8.8.1 (a0:36:9f:1e:1e:d8)  |=====>| ix0: 8.8.8.2 (a0:36:9f:1e:28:14) |
|      2001:db8:8::1                |      |      2001:db8:8::2               |
|                                   |      |                                  |
| ix1: 9.9.9.1 (a0:36:9f:1e:1e:da)  |<=====| ix1: 9.9.9.2 (a0:36:9f:1e:28:16) |
|      2001:db8:9::1                |      |      2001:db8:9::2               |
|                                   |      |                                  |
|                                   |      |        static routes             |
|                                   |      |      8.0.0.0/8 => 8.8.8.1        |
|                                   |      |      9.0.0.0/8 => 9.9.9.1        |
|                                   |      | 2001:db8:8::/48 => 2001:db8:8::1 |
|                                   |      | 2001:db8:9::/48 => 2001:db8:9::1 |
+-----------------------------------+      +----------------------------------+

The packet generator uses this command line (2000 flows: 20 different source IPs * 100 different destination IPs):

 pkt-gen -i ix0 -f tx -n 1000000000 -l 60 -d 9.1.1.1:2000-9.1.1.100 -D a0:36:9f:1e:28:14 -s 8.1.1.1:2000-8.1.1.20 -w 4 

The packet receiver uses this command line:

 pkt-gen -i ix1 -f rx -w 4 

=> The reference PPS value for all these benchmarks is the one reported by the packet receiver.
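
For scale, the receiver values can be compared with the theoretical line rate for these frames. The sketch below assumes that the pkt-gen -l value (60) counts the frame without its 4-byte CRC, and uses the standard Ethernet per-frame overhead of 8 bytes of preamble/SFD plus 12 bytes of inter-frame gap. The function name line_rate_pps is only illustrative.

```sh
#!/bin/sh
# Theoretical maximum packets per second on an Ethernet link.
#   $1 = link speed in bit/s
#   $2 = frame length in bytes without CRC (the pkt-gen -l value, assumed)
line_rate_pps() {
    link_bps=$1
    frame_len=$2
    # On the wire each frame also carries 4 bytes of CRC,
    # 8 bytes of preamble/SFD and 12 bytes of inter-frame gap.
    wire_bytes=$((frame_len + 4 + 8 + 12))
    echo $((link_bps / (wire_bytes * 8)))
}

# 10 Gbit/s link, 60-byte frames as generated by pkt-gen -l 60
line_rate_pps 10000000000 60
```

Under these assumptions a 10-Gigabit link tops out at about 14.88 Mpps for minimum-size frames, so the forwarding rates measured below are bounded by the device under test, not by the link.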

Generic advice

See the Router/Gateway page for NIC-independent advice on router/gateway usage (disabling Ethernet flow control, disabling LRO/TSO, etc.).

hw.ix.[r|t]x_process_limit

Disabling rx_process_limit and tx_process_limit gives the best performance (values in pps):

x rx|tx_process_limit=256(default)
+ rx|tx_process_limit=512
* rx|tx_process_limit=-1(nolimit)
+--------------------------------------------------------------------------------+
|*   ++ * x+                x                          *    * *   *             *|
||______M__A_________|                                                           |
|  |__A___|                                                                      |
|                                                      |______M__A________|      |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1656772       1786725       1690620     1704710.4     48677.407
+   5       1657620       1703418       1679787     1681153.8     17459.489
No difference proven at 95.0% confidence
*   5       1918159       2036665       1950208       1963257     44988.621
Difference at 95.0% confidence
        258547 +/- 68356.2
        15.1666% +/- 4.00984%
        (Student's t, pooled s = 46869.3)

…but under heavy load, if the number of NIC queues equals the number of CPUs (queues=ncpu), the system becomes unresponsive.
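
As a rough sketch of how such a setting could be made persistent, the helper below writes a name="value" line into a loader.conf-style file, replacing any earlier line for the same name. It assumes that the installed driver reads hw.ix.rx_process_limit and hw.ix.tx_process_limit as boot-time tunables from /boot/loader.conf; the helper name set_loader_tunable is hypothetical and not part of FreeBSD.

```sh
#!/bin/sh
# set_loader_tunable FILE NAME VALUE
# Hypothetical helper: keep exactly one NAME="VALUE" line in FILE.
set_loader_tunable() {
    conf=$1
    name=$2
    value=$3
    touch "$conf"
    # Copy every line that does not start with NAME= (exact prefix match),
    # then append the new setting and replace the original file.
    awk -v k="${name}=" 'index($0, k) != 1' "$conf" > "${conf}.tmp"
    printf '%s="%s"\n' "$name" "$value" >> "${conf}.tmp"
    mv "${conf}.tmp" "$conf"
}

# Example (as root, takes effect at next boot; assumption, see above):
#   set_loader_tunable /boot/loader.conf hw.ix.rx_process_limit -1
#   set_loader_tunable /boot/loader.conf hw.ix.tx_process_limit -1
```

Given the unresponsiveness noted above, it may be safer to try this on a machine with console access, or with fewer NIC queues than CPUs.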

hw.ix.rxd and hw.ix.txd

What is the impact of modifying these values on PPS?

x hw.ix.[r|t]xd=1024
+ hw.ix.[r|t]xd=2048
* hw.ix.[r|t]xd=4096
+--------------------------------------------------------------------------------+
|+  *                         +  *        +x     ** *         *      x     *  x  |
|                                      |____________________A________M__________||
|         |______________________M__A__________________________|                 |
|                    |_____________________A______M_______________|              |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1937832       2018222       2002526     1985231.6     35961.532
+   5       1881470       2011917       1937898     1943572.2     46782.722
No difference proven at 95.0% confidence
*   5       1886620       1989850       1968491       1956497     40190.155
No difference proven at 95.0% confidence

=> No difference: changing hw.ix.rxd and hw.ix.txd has no measurable effect on PPS.
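
The verdicts in the ministat output can be rechecked by hand from the summary rows. The sketch below applies a pooled-variance Student's t comparison to two rows (N, Avg, Stddev). The critical value 2.306 is the two-sided 95% value for 8 degrees of freedom (5 + 5 - 2 samples) and must be supplied by the caller; the function name ttest_diff is illustrative and this is not ministat's own code.

```sh
#!/bin/sh
# ttest_diff N1 AVG1 SD1 N2 AVG2 SD2 TCRIT
# Prints "<difference> +/- <half-width>: <verdict>" for AVG2 - AVG1.
ttest_diff() {
    awk -v n1="$1" -v m1="$2" -v s1="$3" \
        -v n2="$4" -v m2="$5" -v s2="$6" -v t="$7" 'BEGIN {
        # Pooled standard deviation of the two samples
        sp = sqrt(((n1 - 1) * s1 * s1 + (n2 - 1) * s2 * s2) / (n1 + n2 - 2))
        # Half-width of the confidence interval on the difference of means
        hw = t * sp * sqrt(1 / n1 + 1 / n2)
        d = m2 - m1
        # The difference is significant only if it exceeds the half-width
        verdict = (d > hw || -d > hw) ? "difference proven" : "no difference proven"
        printf "%.0f +/- %.0f: %s\n", d, hw, verdict
    }'
}

# hw.ix.[r|t]xd=1024 (x) against hw.ix.[r|t]xd=4096 (*)
ttest_diff 5 1985231.6 35961.532 5 1956497 40190.155 2.306
```

For the rxd/txd rows the gap between the averages stays inside the confidence half-width, which is why no difference is proven at 95% confidence.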

10gFreeBSD/Intel10G (last edited 2014-10-15 08:22:17 by OlivierCochardLabbé)