Tuning FreeBSD for routing

This wiki page summarizes the information from the BSDRP website and from the AsiaBSDCon 2018 paper "Tuning FreeBSD for routing and firewalling".

The bench results shown here were obtained with FreeBSD 13.0-CURRENT on a 16-core (32-thread) Xeon E5 2697A with a 40G Chelsio T580 and a 100G Mellanox ConnectX-4.

Basic concepts

How to correctly bench a router

The two main functions of a router are forwarding packets between its interfaces and maintaining the routing table.

This page focuses only on optimizing the forwarding rate: maintaining the routing table belongs to the userland daemons.

The only metric measured here is the packet forwarding rate, expressed in packets per second (pps).

Differences with RFC 2544

RFC 2544, "Benchmarking Methodology for Network Interconnect Devices", is a well-known reference for correctly benchmarking routers.

But here we will not follow all the recommendations given by this RFC, in favour of a simpler and faster methodology.

Here are some main divergences:

Ethernet line rate references

The first reference to know is the maximum Ethernet line rate (which implies the smallest frame size, such as a UDP packet with 1 byte of payload):

From these values, and because Ethernet is a full-duplex (FDX) medium (able to receive and transmit at the same time), an FDX line-rate router must be able to forward at:
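As a rough illustration of where such line-rate figures come from, here is a minimal sketch. It assumes the usual minimum Ethernet frame of 64 bytes (FCS included) plus 20 bytes of preamble and inter-frame gap on the wire; the function names are hypothetical and not part of any tool used on this page.

```sh
#!/bin/sh
# Bytes occupied on the wire by one minimum-size frame (assumption):
# 64 (minimum frame, FCS included) + 8 (preamble + SFD) + 12 (inter-frame gap)
WIRE_BYTES=84

# line_rate_pps LINK_BPS: maximum frames per second in one direction
line_rate_pps() {
    echo $(( $1 / (WIRE_BYTES * 8) ))
}

# fdx_line_rate_pps LINK_BPS: both directions at the same time
fdx_line_rate_pps() {
    echo $(( 2 * $(line_rate_pps "$1") ))
}

for gbps in 1 10 40 100; do
    bps=$(( gbps * 1000000000 ))
    echo "${gbps}Gb/s: $(line_rate_pps "$bps") pps, FDX: $(fdx_line_rate_pps "$bps") pps"
done
```

For a 10Gb/s link this gives about 14.88Mpps in one direction, so about 29.76Mpps for an FDX line-rate router.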

Throughput to bandwidth

In real use cases there is no need for such line-rate routers, because Internet traffic does not consist only of small packets but of a mix of multiple sizes.

This packet size distribution evolves with time, but there is a fixed-in-time reference called Simple Internet Mix (IMIX), which uses this distribution: 7 packets of 40 bytes, 4 packets of 576 bytes and 1 packet of 1500 bytes (sizes at the IP layer).

Using this Simple IMIX distribution, it is possible to convert packets per second into a more common unit: bandwidth in bits per second (bps).

bps at the IP layer = PPS * ( 7 * 40 + 4 * 576 + 1500 ) / 12 * 8

Or the bandwidth at the Ethernet layer (14 bytes are added to each packet for the Ethernet header), as seen by switch counters:

bps at the Ethernet layer = PPS * ( 7 * 54 + 4 * 590 + 1514 ) / 12 * 8
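The two formulas above can be sketched as small shell helpers. This is only an illustration using integer arithmetic; the function names are hypothetical.

```sh
#!/bin/sh
# Total bytes carried by the 12 packets of one Simple IMIX cycle
IMIX_IP_BYTES=$(( 7 * 40 + 4 * 576 + 1500 ))     # 4084 bytes at the IP layer
IMIX_ETH_BYTES=$(( 7 * 54 + 4 * 590 + 1514 ))    # 4252 bytes with Ethernet headers

# pps_to_ip_bps PPS: bandwidth at the IP layer, in bits per second
pps_to_ip_bps() {
    echo $(( $1 * IMIX_IP_BYTES * 8 / 12 ))
}

# pps_to_eth_bps PPS: bandwidth at the Ethernet layer, in bits per second
pps_to_eth_bps() {
    echo $(( $1 * IMIX_ETH_BYTES * 8 / 12 ))
}

pps_to_ip_bps 1000000     # 2722666666, about 2.7Gb/s
pps_to_eth_bps 1000000    # 2834666666, about 2.8Gb/s
```

In other words, each Mpps forwarded corresponds to roughly 2.7Gb/s of Simple IMIX traffic at the IP layer.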

For real-life use cases, the interesting figure is the packet rate at which a Simple IMIX size distribution fills the link capacity.

And here are the minimum PPS for a "FDX IMIX link-speed router":
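As a hedged sketch of how such minimum values can be derived, the Ethernet-layer formula can be inverted to obtain the packet rate that fills a link with Simple IMIX traffic. It assumes the same 14 bytes of Ethernet header per packet as the formula above and ignores preamble, inter-frame gap and FCS, so the results are approximations; the function names are hypothetical.

```sh
#!/bin/sh
# Bytes in one Simple IMIX cycle of 12 packets, Ethernet headers included
IMIX_ETH_BYTES=$(( 7 * 54 + 4 * 590 + 1514 ))

# imix_min_pps LINK_BPS: packets per second filling the link in one direction
imix_min_pps() {
    echo $(( $1 * 12 / (IMIX_ETH_BYTES * 8) ))
}

# imix_min_fdx_pps LINK_BPS: same, with both directions loaded
imix_min_fdx_pps() {
    echo $(( 2 * $1 * 12 / (IMIX_ETH_BYTES * 8) ))
}

imix_min_pps 1000000000        # about 0.35Mpps for 1Gb/s
imix_min_pps 10000000000       # about 3.5Mpps for 10Gb/s
imix_min_fdx_pps 10000000000   # about 7Mpps for FDX 10Gb/s
```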

Setting a benchmarking lab

A very simple benchmarking lab can be set up with only 2 servers:

+---------+   +-------+
|         |->-|       |
| pkt-gen |   |  DUT  |
|         |-<-|       |
+---------+   +-------+

The purpose is to measure the throughput (number of packets per second) forwarded by the device under test (DUT) in the worst case: receiving only smallest-size packets at media line rate on one interface and forwarding them to the packet receiver through its other interface.

The throughput is measured on the packet receiver side. A switch with advanced monitoring counters on each port can be useful to cross-check its counters against the pkt-gen and Ethernet driver counters.

A more complex setup, such as one using 40Gb/s NICs and ranges of IP addresses with pkt-gen (which consumes too much CPU for a single server), can use 3 nodes:

+---------+   +-------+   +--------+
|         |   |       |   |        |
| pkt-gen |->-|  DUT  |->-|pkt-gen |
|generator|   |       |   |receiver|
+---------+   +-------+   +--------+

Switch configuration

When netmap pkt-gen runs as a packet receiver it will NEVER generate a frame:

As a result, during your bench test the generator sends line-rate traffic toward the packet receiver, but the MAC address of the receiver ages out of the switch's MAC table, and the switch then floods ALL traffic from the generator to ALL ports belonging to this VLAN ;-)

This is why it is always wise to statically configure the MAC address of the packet receiver on the switch before this kind of bench.

If you want to check the driver statistics against the switch statistics, you should also disable ALL advanced features on the switch ports used for the bench, like:

Multi-flows to benefit from multi-queue RSS

Current NIC chipset and driver behaviour:

  1. NIC drivers create one queue per direction (transmit and receive) and per detected core, up to a driver-dependent maximum number of queues: for example 16 receive (RX) queues for mlx4en, and 8 RX queues for cxgbe and ixgbe.
  2. NIC chipsets use a Toeplitz hash to balance received packets across the RX queues: the 4-tuple of each packet (source IP, destination IP, source port and destination port) is used.

To be load-balanced between cores, IP traffic must therefore include multiple flows to hash: tunnelling features like IPsec, GRE or PPPoE prevent this distribution, since all tunnelled packets share the same outer header.
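As an illustration of generating multiple flows, netmap pkt-gen accepts source and destination ranges with its -s and -d options. The sketch below only prints a candidate command line instead of running it; the interface name, the MAC address and the 198.18.0.0/15 benchmarking addresses are hypothetical values, and the helper name is an assumption.

```sh
#!/bin/sh
# pktgen_tx_cmd IFACE DST_MAC: print a pkt-gen transmit command using
# a range of 20 source and 100 destination addresses (hypothetical values),
# so that the DUT's NIC has many different flows to hash.
pktgen_tx_cmd() {
    echo "pkt-gen -f tx -i $1 -l 60 -D $2" \
         "-s 198.18.10.1:2000-198.18.10.20" \
         "-d 198.19.10.1:2000-198.19.10.100"
}

# DST_MAC is the MAC address of the DUT interface receiving the traffic
pktgen_tx_cmd ix0 00:07:43:2e:e5:90
```

The -l 60 option asks for the smallest frames (60 bytes without the CRC), which matches the worst case measured on this page.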

Minimum hardware requirements

Before any tuning, to reach the minimum requirement for a 10Gb/s router, the CPU must have at least 8 cores and be of Intel Xeon class or equivalent.

NIC manufacturers known to take care of their FreeBSD drivers are Chelsio, Intel and Mellanox.

Tuning

Re-enabling fastforwarding

The first step is to disable the sending of ICMP redirects: the tryforward fast path does not generate redirects, so it is only used once they are disabled.

echo net.inet.ip.redirect=0 >> /etc/sysctl.conf
echo net.inet6.ip6.redirect=0 >> /etc/sysctl.conf
service sysctl restart
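Appending with echo adds a duplicate line each time the command is re-run. As a minimal sketch, a small helper can replace an existing assignment instead; set_conf is a hypothetical name, not a FreeBSD tool.

```sh
#!/bin/sh
# set_conf FILE KEY VALUE: make FILE contain exactly one "KEY=VALUE" line,
# dropping any previous assignment of KEY and keeping every other line.
set_conf() {
    file=$1; key=$2; value=$3
    touch "$file"
    # keep only the lines that do not start with "KEY="
    awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$file.tmp"
    echo "$key=$value" >> "$file.tmp"
    # overwrite in place so the original file keeps its owner and mode
    cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Usage on the router (as root):
#   set_conf /etc/sysctl.conf net.inet.ip.redirect 0
#   set_conf /etc/sysctl.conf net.inet6.ip6.redirect 0
#   service sysctl restart
```

The same helper can be used for the /boot/loader.conf tunables shown later on this page.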

The impact of this first tuning step is massive:

Number of inet4 packets-per-second forwarded:
x net.inet.ip.redirect=1 (default)
+ net.inet.ip.redirect=0
+--------------------------------------------------------------------------+
|x                                                                        +|
|xx                                                                      ++|
|xx                                                                      ++|
|MA                                                                        |
|                                                                        |A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5     2199966.5       2309062       2210097     2230250.6     45711.484
+   5       8211578     8259515.5       8244041     8235045.2      20946.73
Difference at 95.0% confidence
        6.00479e+06 +/- 51854.8
        269.243% +/- 7.86461%
        (Student's t, pooled s = 35554.9)

From 2.2Mpps with redirects enabled, the forwarding rate jumps to 8.2Mpps with redirects disabled.

Configure the NIC to use all cores/threads

Some drivers are not using all available cores or threads by default:

As an example, with the default Chelsio settings on a system with 32 logical CPUs (16 cores * 2 hardware threads), the driver uses a maximum of only 8 RX queues and 16 TX queues:

[root@pxetest2]~# sysctl kern.smp.cores
kern.smp.cores: 16
[root@pxetest2]~# sysctl hw.ncpu
hw.ncpu: 32
[root@pxetest2]~# sysctl hw.cxgbe.nrxq
hw.cxgbe.nrxq: 8
[root@pxetest2]~# sysctl hw.cxgbe.ntxq
hw.cxgbe.ntxq: 16

Increasing the number of queues to match the number of cores or threads needs to be done in /boot/loader.conf:

echo hw.cxgbe.nrxq=32 >> /boot/loader.conf
echo hw.cxgbe.ntxq=32 >> /boot/loader.conf

With Mellanox the variable to use is dev.mce.X.conf.channels=32, and with Intel it is dev.X.Y.iflib.override_nrxqs=32.
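Rather than hard-coding 32, the Chelsio tunables can be generated from the number of logical CPUs. A minimal sketch follows; cxgbe_queue_tunables is a hypothetical helper name.

```sh
#!/bin/sh
# cxgbe_queue_tunables NCPU: print the loader.conf lines that give the
# cxgbe driver one RX queue and one TX queue per logical CPU.
cxgbe_queue_tunables() {
    echo "hw.cxgbe.nrxq=$1"
    echo "hw.cxgbe.ntxq=$1"
}

# On the DUT the count would come from the system itself:
#   cxgbe_queue_tunables "$(sysctl -n hw.ncpu)" >> /boot/loader.conf
cxgbe_queue_tunables 32
```

Using kern.smp.cores instead of hw.ncpu would produce the "all cores" variant measured below.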

And the benefit will be huge:

Number of inet4 packets-per-second forwarded
x 8 RX queues (default)
+ 16 RX queues (= all cores on this setup)
* 32 RX queues (= all threads on this setup)
+--------------------------------------------------------------------------+
|x                 +                                                      *|
|xx               ++                                                     **|
|xx               ++                                                     **|
|A|                                                                        |
|                 |A                                                       |
|                                                                        AM|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       8160234       8261012       8230285     8221687.8       42278.4
+   5      10787172      10913225      10820346      10830279     49477.458
Difference at 95.0% confidence
        2.60859e+06 +/- 67115.9
        31.7282% +/- 0.934431%
        (Student's t, pooled s = 46018.9)
*   5      19034716      19176514      19107114      19097753     56388.197
Difference at 95.0% confidence
        1.08761e+07 +/- 72681.8
        132.285% +/- 1.42045%
        (Student's t, pooled s = 49835.2)

From 8.2Mpps with the default value to 19Mpps using 32 RX queues (= all threads).

Allow interrupts on HTT logical CPUs

Another very interesting feature added in 13-head allows interrupts on hyperthreaded logical CPUs:

echo machdep.hyperthreading_intr_allowed=1 >> /boot/loader.conf

On an Intel Xeon CPU E5-2697A v4 2.60GHz (16 cores / 32 threads) server with a Chelsio T580-LP-CR configured with 32 RX queues (incoming traffic) and a Mellanox ConnectX-4 (outgoing traffic), the benefit is noticeable:

Number of inet4 packets-per-second forwarded:
x machdep.hyperthreading_intr_allowed=0 (default) and hw.cxgbe.nXxq=32
+ machdep.hyperthreading_intr_allowed=1 and hw.cxgbe.nXxq=32
+--------------------------------------------------------------------------+
| x                                                                       +|
| x                                                                       +|
|xx                                                                     +++|
||A                                                                        |
|                                                                        AM|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      19028337      19126428      19099721      19090535     37168.444
+   5      24396932      24508797      24494912      24476810     45506.933
Difference at 95.0% confidence
        5.38628e+06 +/- 60594.5
        28.2144% +/- 0.355956%
        (Student's t, pooled s = 41547.4)

With this feature, the forwarding rate goes from 19Mpps to 24.5Mpps.

Note that this setup reaches such a value because, on the DUT, the generated flow was not using the same ASIC for RX and TX, which avoids overloading the ASIC: all traffic was entering through the Chelsio and leaving through the Mellanox. When traffic enters one Chelsio port and exits through the other Chelsio port connected to the same ASIC, the performance is stuck at about 19Mpps.

On a different setup, a Xeon E5 2650L V2 1.70GHz (10 cores / 20 threads) with a Chelsio T540-CR, the results are less impressive but still show an improvement:

Number of inet4 packets-per-second forwarded:
x machdep.hyperthreading_intr_allowed=0 (default) and hw.cxgbe.nXxq=16
+ machdep.hyperthreading_intr_allowed=1 and hw.cxgbe.nXxq=16
+--------------------------------------------------------------------------+
|        x                                                              ++ |
|        xx                                                   x   +   x ++ |
||________M_____________________A______________________________|           |
|                                                                   |__AM_||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5     7586270.5       8644626       7616992     7990242.7     537190.89
+   5       8583402       8701804       8689267     8669742.3     48818.605
Difference at 95.0% confidence
        679500 +/- 556274
        8.50412% +/- 7.54931%
        (Student's t, pooled s = 381417)

Chelsio drivers tuning

By default, the Chelsio driver reserves hardware resources for accelerated hardware features (RDMA, iSCSI, FCoE, TOE) that are useless on a router.

Disabling these features gives more resources to the chip:

cat <<EOF >>/boot/loader.conf
hw.cxgbe.toecaps_allowed="0"
hw.cxgbe.rdmacaps_allowed="0"
hw.cxgbe.iscsicaps_allowed="0"
hw.cxgbe.fcoecaps_allowed="0"
EOF

The benefit can be huge:

Number of inet4 packets-per-second forwarded:
x hw.cxgbe.toecaps|rdmacaps|iscsicaps|fcoecaps_allowed=1 (default)
+ hw.cxgbe.toecaps|rdmacaps|iscsicaps|fcoecaps_allowed=0
+--------------------------------------------------------------------------+
|                                                                         +|
|x                                                                        +|
|x                                                                        +|
|x                                                                        +|
|xx                                                                       +|
|A                                                                         |
|                                                                         A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      12747220      12830462      12779478      12785165     31206.467
+   5      24485838      24535582      24491950      24500818     20931.042
Difference at 95.0% confidence
        1.17157e+07 +/- 38751.1
        91.6348% +/- 0.51107%
        (Student's t, pooled s = 26570.2)

From 12.7Mpps to 24.5Mpps (+91%).

Disabling Ethernet Flow-control

Using Ethernet flow control is not a good idea on a router (or on a server): if your NIC is overloaded, it sends an Ethernet "pause" request to its direct peer (the switch), which temporarily stops emitting ALL flows toward the server.

But this kind of mechanism already exists in TCP, and TCP manages it better than this very basic Ethernet flow control.

More information about this can be found in "When Flow Control is not a Good Thing".

Disabling this feature is driver dependent:

And there is no real impact of disabling this feature on forwarding performance:

Number of inet4 packets-per-second forwarded:
x hw.cxgbe.pause_settings=0 (default)
+ hw.cxgbe.pause_settings=1
+--------------------------------------------------------------------------+
|x                           +  x   +  +    + x         +                 x|
|         |_____________________M____A_________________________|           |
|                              |_______M_A_________|                       |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      24440254      24536843      24481016      24487790      35015.44
+   5      24477223      24513195      24490633      24493013     13385.716
No difference proven at 95.0% confidence

Disabling LRO and TSO

All modern NICs support the LRO and TSO features, which need to be disabled on a router:

To disable these features, add -tso4 -tso6 -lro -vlanhwtso at the end of the ifconfig_xxx line in /etc/rc.conf.
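As an example, such a line in /etc/rc.conf could look like the following. The interface name cxl0 and the IP address are hypothetical values chosen for illustration.

```sh
# /etc/rc.conf: interface configuration with TSO, LRO and VLAN hardware TSO disabled
ifconfig_cxl0="inet 198.18.0.1/24 -tso4 -tso6 -lro -vlanhwtso"
```

After the options are applied, TSO4, TSO6, LRO and VLAN_HWTSO should no longer appear in the options line printed by ifconfig for this interface.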

There is no negative impact of disabling these features on forwarding performance, and on some hardware setups it can even bring a good improvement:

Number of inet4 packets-per-second forwarded
x enabled (default)
+ disabled (-tso4 -tso6 -lro -vlanhwtso)
+--------------------------------------------------------------------------+
|                                                                        ++|
|xxxxx                                                                  +++|
||_A_|                                                                     |
|                                                                        A||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      27463046      27724202      27568318      27599181     109278.63
+   5      32355798      32463268      32416758      32414376     47287.862
Difference at 95.0% confidence
        4.8152e+06 +/- 122795
        17.4469% +/- 0.511089%
        (Student's t, pooled s = 84196.1)

10gFreeBSD/Router (last edited 2020-02-19 10:09:09 by OlivierCochardLabbé)