Tuning FreeBSD for routing
This wiki page merges and updates data from the BSDRP website and from the AsiaBSDCon 2018 paper "Tuning FreeBSD for routing and firewalling".
FreeBSD 13.0-CURRENT running on a Xeon E5-2697A (16 cores, 32 threads) with a 40G Chelsio T580 and a 100G Mellanox ConnectX-4 is used here for the benchmark results.
Basic concepts
How to correctly bench a router
The two main functions of a router are:
- Forwarding packets between its interfaces;
- Maintaining the routing table using routing protocols.
This page focuses only on optimizing the forwarding rate: maintaining the routing table is the job of the userland routing daemons.
The only metric measured here will be the packet forwarding speed using packets-per-second (pps) unit.
Differences with RFC 2544
RFC 2544, Benchmarking Methodology for Network Interconnect Devices, is a well-known reference for correctly benchmarking routers.
But here we will not follow all the recommendations given by this RFC, in favour of a simpler and faster methodology.
Here are the main divergences:
- Multiple frame sizes: Here only the worst case matters, which is the smallest Ethernet frame size. In this document one frame = one packet, so fps = pps.
- Throughput is defined as the maximum frame rate supported by the DUT (device under test) without any drop: In this document the throughput is the outgoing forwarded frame rate when receiving at the maximum line rate.
- Bidirectional traffic: To simplify the methodology, the bench labs described here generate only unidirectional traffic.
Ethernet line rate references
The first reference to know is the maximum Ethernet line rate (implying the smallest frame size, e.g. a UDP packet with 1 byte of payload):
- Gigabit: 1.48 Mfps (frame-per-second)
- 10 Gigabit: 14.8 Mfps
- 40 Gigabit: 59.5 Mfps
From these values, and the fact that Ethernet is a full-duplex (FDX) medium (able to receive and transmit at the same time), a FDX line-rate router must be able to forward at (a quick check follows this list):
- 3 Mpps = Gigabit FDX line-rate router
- 30 Mpps = 10 Gigabit FDX line-rate router
- 119 Mpps = 40 Gigabit FDX line-rate router
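These figures follow directly from the frame overhead: a minimum-sized Ethernet frame is 64 bytes (including the FCS), preceded on the wire by 8 bytes of preamble/SFD and followed by a 12-byte inter-frame gap. A quick check with bc(1):

echo "10^9 / ((64 + 8 + 12) * 8)" | bc            # Gigabit: 1488095 fps
echo "2 * 40 * 10^9 / ((64 + 8 + 12) * 8)" | bc   # 40 Gigabit FDX: 119047619 pps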
Throughput to bandwidth
In real use cases there is no need for such line-rate routers, because Internet traffic is not made of only small packets but of a mix of multiple sizes.
This packet size distribution evolves with time, but there is a fixed-in-time reference called Simple Internet Mix (IMIX) which uses this distribution (counts are per group of 12 packets; the percentages are each group's share of the total bytes):
- 1 large (1500 Bytes) packet: 37%
- 4 medium (576 Bytes) packets: 56%
- 7 small (40 Bytes) packets: 7%
Using this Simple IMIX distribution it’s now possible to convert the packets-per-second to a more common value which is the bandwidth in bits per second (bps).
bps at the IP layer = PPS * ( 7 * 40 + 4 * 576 + 1500 ) / 12 * 8
Or the bandwidth at the Ethernet layer (need to add 14 Bytes for Ethernet headers), as seen by switch counters:
bps at the Ethernet layer = PPS * ( 7 * 54 + 4 * 590 + 1514 ) / 12 * 8
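As a worked example with bc(1): 1 Mpps of Simple IMIX traffic corresponds to roughly 2.72 Gbit/s at the IP layer and 2.83 Gbit/s at the Ethernet layer:

echo "10^6 * (7*40 + 4*576 + 1500) * 8 / 12" | bc   # 2722666666, about 2.72 Gbit/s at the IP layer
echo "10^6 * (7*54 + 4*590 + 1514) * 8 / 12" | bc   # 2834666666, about 2.83 Gbit/s at the Ethernet layer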
For real-life use cases, the interesting figure is the packet rate at which a Simple IMIX distribution fills the link capacity.
And here are the minimum PPS for a "FDX IMIX link-speed router" (the 10 Gigabit value is rechecked just after the list):
- 700 Kpps = Gigabit IMIX router
- 7 Mpps = 10 Gigabit IMIX router
- 28 Mpps = 40 Gigabit IMIX router
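The 10 Gigabit value can be rechecked with bc(1) from the Ethernet-layer frame sizes used above: a full-duplex 10 Gbit/s link filled with Simple IMIX traffic carries about 7.06 Mpps.

echo "2 * 10^10 * 12 / ((7*54 + 4*590 + 1514) * 8)" | bc   # 7055503, about 7.06 Mpps for a 10 Gigabit FDX IMIX router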
Setting up a benchmarking lab
A very simple benchmarking lab can be set up with only 2 servers, as shown here:
+---------+   +-------+
|         |->-|       |
| pkt-gen |   |  DUT  |
|         |-<-|       |
+---------+   +-------+
- The first server, with a dual-port Network Interface Card (NIC), is used as the packet generator and receiver (using net/pkt-gen).
- The second server is the Device Under Test (DUT) running FreeBSD that will be tuned.
The purpose is to measure the throughput (number of packets per second) forwarded by the DUT in the worst case: receiving only the smallest packet size at media line rate on one interface and forwarding it to the packet receiver through its other interface.
The throughput is measured on the packet receiver side: using a switch with advanced per-port monitoring counters can be useful to cross-check those counters against the pkt-gen and Ethernet driver counters.
A more complex setup, e.g. using a 40G NIC and a range of IP addresses with pkt-gen (which consumes too much CPU to share a single server), can use 3 nodes (example pkt-gen invocations follow the diagram):
+---------+   +-------+   +--------+
|         |   |       |   |        |
| pkt-gen |->-|  DUT  |->-|pkt-gen |
|generator|   |       |   |receiver|
+---------+   +-------+   +--------+
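Here is a minimal pkt-gen sketch for this 3-node lab; the interface names (ix0 on the generator, ix1 on the receiver), the addresses and the DUT MAC are placeholders to adapt to your setup:

# Generator: send minimum-sized frames (60 bytes + 4-byte CRC) at line rate,
# sweeping source and destination addresses/ports so the DUT can hash the flows
# across its RX queues; -D must be the MAC of the DUT ingress interface.
pkt-gen -f tx -i ix0 -l 60 -s 198.18.0.2:2000-198.18.0.2:4000 -d 198.19.0.2:2000-198.19.0.2:4000 -D 00:00:00:aa:bb:cc
# Receiver: only count the packets forwarded by the DUT.
pkt-gen -f rx -i ix1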
Switch configuration
When netmap pkt-gen runs as a packet receiver it will NEVER generate a frame:
during your bench the generator sends line-rate traffic toward the packet receiver… but the MAC address of the receiver will age out of the switch's MAC table, and the switch will then flood ALL the traffic from the generator to ALL ports belonging to this VLAN.
This is why it's always wise to statically configure the MAC address of the packet receiver on the switch before this kind of bench.
If you want to check the driver statistics against the switch statistics, you should also disable ALL advanced features on the switch ports used for the bench, like:
- spanning-tree
- CDP/LLDP
- DTP (Cisco)
- Keep-alive
Multi-flows to benefit from multi-queue RSS
Current NIC chipset & drivers behaviour:
- NIC drivers create one queue per direction (transmit and receive) and per detected core, up to a driver-dependent maximum number of queues: 16 receive (RX) queues for mlx4en, 8 RX queues for cxgbe and ixgbe, as examples.
- NIC chipsets use a Toeplitz hash to balance received packets across the RX queues: the packet's 4-tuple (source IP, destination IP, source port and destination port) is used.
To be able to load-balance IP flows between cores, the IP traffic must contain multiple flows to hash: using tunnelling features like IPsec, GRE or PPPoE prevents this distribution.
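To verify that the flows are really spread across the RX queues, the per-queue interrupt counters and the per-CPU load can be watched while the bench is running; the grep pattern below is an assumption for a Chelsio T5 adapter (its queue interrupts show up under the t5nex nexus device), adapt it to your NIC:

vmstat -i | grep t5nex   # one interrupt line per queue, all counters should grow
top -HPS                 # per-thread and per-CPU load distribution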
Minimum hardware requirements
Before tuning, to be able to reach the minimum requirement of a 10Gb/s router, the CPU must be at least an 8-core Intel Xeon or equivalent class.
NIC manufacturers known to take care of their FreeBSD drivers: Chelsio, Intel and Mellanox.
Tuning
Re-enabling fastforwarding
This step should be unnecessary since r367628 (Nov 12, 2020), which "..adds redirects to the fastforward path" and fixed this behaviour.
The first step is to disable sending IP redirects to enable tryforward.
echo net.inet.ip.redirect=0 >> /etc/sysctl.conf
echo net.inet6.ip6.redirect=0 >> /etc/sysctl.conf
service sysctl restart
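The service sysctl restart applies the new values from /etc/sysctl.conf; they can then be verified with sysctl(8):

sysctl net.inet.ip.redirect net.inet6.ip6.redirect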
The impact of this first tuning is massive:
Number of inet4 packets-per-second forwarded:
x net.inet.ip.redirect=1 (default)
+ net.inet.ip.redirect=0
    N           Min           Max        Median           Avg        Stddev
x   5     2199966.5       2309062       2210097     2230250.6     45711.484
+   5       8211578     8259515.5       8244041     8235045.2      20946.73
Difference at 95.0% confidence
        6.00479e+06 +/- 51854.8
        269.243% +/- 7.86461%
        (Student's t, pooled s = 35554.9)
From 2.2Mpps with redirect enabled, it jumps to 8.2Mpps with redirect disabled.
Configure the NIC to use all cores/threads
Some drivers do not use all available cores or threads by default:
- cxgbe: "The default is 8 or the number of CPU cores in the system, whichever is less."
- ixgbe seems to use the same default as cxgbe
- mlx4en|mlx5en use as many queues as there are threads
As an example, with default Chelsio settings on a 32-CPU system (16 cores * 2 HT), the driver uses only a maximum of 8 RX queues and 16 TX queues:
[root@pxetest2]~# sysctl kern.smp.cores
kern.smp.cores: 16
[root@pxetest2]~# sysctl hw.ncpu
hw.ncpu: 32
[root@pxetest2]~# sysctl hw.cxgbe.nrxq
hw.cxgbe.nrxq: 8
[root@pxetest2]~# sysctl hw.cxgbe.ntxq
hw.cxgbe.ntxq: 16
Increasing the number of RX queues to match the number of cores or threads needs to be done in /boot/loader.conf:
echo hw.cxgbe.nrxq=32 >> /boot/loader.conf
echo hw.cxgbe.ntxq=32 >> /boot/loader.conf
With Mellanox the variable to use is dev.mce.X.conf.channels=32, and with Intel (iflib) dev.X.Y.iflib.override_nrxqs=32; a sketch for both follows below.
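A minimal sketch for these two drivers, assuming a single adapter (instance 0 of mce for Mellanox, unit 0 of an ix port for Intel) and that your driver version accepts these variables as loader tunables:

echo dev.mce.0.conf.channels=32 >> /boot/loader.conf
echo dev.ix.0.iflib.override_nrxqs=32 >> /boot/loader.conf
echo dev.ix.0.iflib.override_ntxqs=32 >> /boot/loader.conf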
And the benefit will be huge:
Number of inet4 packets-per-second forwarded
x 8 RX queues (default)
+ 16 RX queues (= all cores on this setup)
* 32 RX queues (= all threads on this setup)
    N           Min           Max        Median           Avg        Stddev
x   5       8160234       8261012       8230285     8221687.8       42278.4
+   5      10787172      10913225      10820346      10830279     49477.458
Difference at 95.0% confidence
        2.60859e+06 +/- 67115.9
        31.7282% +/- 0.934431%
        (Student's t, pooled s = 46018.9)
*   5      19034716      19176514      19107114      19097753     56388.197
Difference at 95.0% confidence
        1.08761e+07 +/- 72681.8
        132.285% +/- 1.42045%
        (Student's t, pooled s = 49835.2)
From 8.1Mpps with the default value to 19Mpps using 32 RX queues (= all threads).
Allow interrupts on HTT logical CPUs
Another very interesting feature was added in 13-head: allowing interrupts on hyperthreaded cores.
echo machdep.hyperthreading_intr_allowed=1 >> /boot/loader.conf
On an Intel Xeon CPU E5-2697A v4 2.60GHz (16 cores / 32 threads) server with a Chelsio T580-LP-CR configured with 32 RX queues (incoming traffic) and a Mellanox ConnectX-4 (outgoing traffic), the benefit is noticeable:
Number of inet4 packets-per-second forwarded:
x machdep.hyperthreading_intr_allowed=0 (default) and hw.cxgbe.nXxq=32
+ machdep.hyperthreading_intr_allowed=1 and hw.cxgbe.nXxq=32
    N           Min           Max        Median           Avg        Stddev
x   5      19028337      19126428      19099721      19090535     37168.444
+   5      24396932      24508797      24494912      24476810     45506.933
Difference at 95.0% confidence
        5.38628e+06 +/- 60594.5
        28.2144% +/- 0.355956%
        (Student's t, pooled s = 41547.4)
With this feature enabled, forwarding goes from 19Mpps to 24Mpps.
Notice this setup reaches such a value because, on the DUT, the generated flow wasn't using the same ASIC for RX and TX, which avoids overloading the ASIC: all traffic came in on the Chelsio and went out toward the Mellanox. With traffic entering and exiting through the other Chelsio port connected to the same ASIC, performance stays stuck at about 19Mpps.
On a different setup, a Xeon E5 2650L V2 1.70GHz (10 cores / 20 threads) with a Chelsio T540-CR, results are less impressive but still show an improvement:
Number of inet4 packets-per-second forwarded:
x machdep.hyperthreading_intr_allowed=0 (default) and hw.cxgbe.nXxq=16
+ machdep.hyperthreading_intr_allowed=1 and hw.cxgbe.nXxq=16
    N           Min           Max        Median           Avg        Stddev
x   5     7586270.5       8644626       7616992     7990242.7     537190.89
+   5       8583402       8701804       8689267     8669742.3     48818.605
Difference at 95.0% confidence
        679500 +/- 556274
        8.50412% +/- 7.54931%
        (Student's t, pooled s = 381417)
Chelsio drivers tuning
By default, the Chelsio reserves hardware resources for accelerated hardware features (RDMA, iSCSI, FCoE, TOE) that are useless on a router.
Disabling these features gives more resources to the chip:
cat <<EOF >>/boot/loader.conf
hw.cxgbe.toecaps_allowed="0"
hw.cxgbe.rdmacaps_allowed="0"
hw.cxgbe.iscsicaps_allowed="0"
hw.cxgbe.fcoecaps_allowed="0"
EOF
The benefit can be huge:
Number of inet4 packets-per-second forwarded:
x hw.cxgbe.toecaps|rdmacaps|iscsicaps|fcoecaps_allowed=1 (default)
+ hw.cxgbe.toecaps|rdmacaps|iscsicaps|fcoecaps_allowed=0
    N           Min           Max        Median           Avg        Stddev
x   5      12747220      12830462      12779478      12785165     31206.467
+   5      24485838      24535582      24491950      24500818     20931.042
Difference at 95.0% confidence
        1.17157e+07 +/- 38751.1
        91.6348% +/- 0.51107%
        (Student's t, pooled s = 26570.2)
From 12.7Mpps to 24.5Mpps (+91%).
Disabling Ethernet Flow-control
Using Ethernet flow-control is not a good idea on a router (nor on a server): if your NIC is overloaded it will send an Ethernet "pause" to its direct peer (the switch), which will temporarily stop emitting ALL flows toward the server.
But this kind of mechanism already exists in TCP, and TCP manages it far better than this very basic Ethernet flow-control.
More information about this on When Flow Control is not a Good Thing
Disabling this feature is driver dependent (a consolidated sketch follows this list):
- Chelsio: add hw.cxgbe.pause_settings="0" into /boot/loader.conf
- Mellanox: add dev.mce.X.rx_pauseframe_control=0 into /boot/loader.conf, or add -mediaopt rxpause,txpause to the ifconfig_XX line of /etc/rc.conf
- Intel: add dev.[ix|igb|em].X.fc=0 into /etc/sysctl.conf
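Putting this list into commands, a sketch assuming one adapter per driver at instance/unit 0 (adjust device names and numbers to your hardware):

echo 'hw.cxgbe.pause_settings="0"' >> /boot/loader.conf      # Chelsio (loader tunable, needs a reboot)
echo dev.mce.0.rx_pauseframe_control=0 >> /boot/loader.conf  # Mellanox
echo dev.ix.0.fc=0 >> /etc/sysctl.conf                       # Intel
service sysctl restart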
And there is no real impact of disabling this feature on forwarding performance:
Number of inet4 packets-per-second forwarded:
x hw.cxgbe.pause_settings=0 (default)
+ hw.cxgbe.pause_settings=1
    N           Min           Max        Median           Avg        Stddev
x   5      24440254      24536843      24481016      24487790      35015.44
+   5      24477223      24513195      24490633      24493013     13385.716
No difference proven at 95.0% confidence
Disabling LRO and TSO
All modern NICs support LRO and TSO, features that need to be disabled on a router:
- By waiting to collect multiple packets at the NIC level before handing them up to the stack, they add latency, and because all packets need to be sent out again, the stack has to split them back into individual packets before handing them down to the NIC. The Intel drivers readme includes this note: "The result of not disabling LRO when combined with ip forwarding or bridging can be low throughput or even a kernel panic."
- This breaks the end-to-end principle.
To disable these features, you need to add -tso4 -tso6 -lro -vlanhwtso at the end of the ifconfig_xxx line in /etc/rc.conf, for example:
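Assuming a Chelsio port named cxl0 (the interface name and address are placeholders to adapt to your setup):

ifconfig_cxl0="inet 198.18.1.1/24 -tso4 -tso6 -lro -vlanhwtso"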
There is no negative impact of disabling these features on forwarding performance, and on some hardware setups it can even bring a good improvement:
Number of inet4 packets-per-second forwarded
x enabled (default)
+ disabled (-tso4 -tso6 -lro -vlanhwtso)
    N           Min           Max        Median           Avg        Stddev
x   5      27463046      27724202      27568318      27599181     109278.63
+   5      32355798      32463268      32416758      32414376     47287.862
Difference at 95.0% confidence
        4.8152e+06 +/- 122795
        17.4469% +/- 0.511089%
        (Student's t, pooled s = 84196.1)