Compare FreeBSD and Linux TCP congestion control algorithms over an emulated 1Gbps x 40ms WAN
Emulab environment
The testbed is built from Emulab.net pc3000 nodes with 1Gbps links (see the Emulab hardware page) and is configured in a dumbbell topology: two nodes (s1, s2) act as TCP traffic senders running iperf3, two nodes (rt1, rt2) serve as routers joined by a single bottleneck link, and two nodes (r1, r2) act as TCP traffic receivers.
The sender nodes (s1, s2) are used to test TCP congestion control algorithms on different operating systems, such as FreeBSD and Ubuntu Linux. The receiver nodes (r1, r2) run Ubuntu Linux 22.04.
The router nodes (rt1, rt2) operate on Ubuntu Linux 18.04 with shallow TX/RX ring buffers configured to 128 descriptors, and the L3 buffer set to 128 packets, resulting in a total routing buffer of approximately 256 packets.
A Dummynet box introduces a 40ms round-trip delay (RTT) on the bottleneck link. All senders transmit data traffic to their corresponding receivers (e.g., s1 => r1 and s2 => r2) at the same time.
The bottleneck link has a 1Gbps bandwidth, and the TCP traffic from the two senders will experience congestion at the rt1 node's output port.
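For scale, a quick back-of-the-envelope check (a sketch, assuming 1500-byte full-size packets) compares that buffer with the path bandwidth-delay product (BDP):

# path BDP at 1Gbps x 40ms, in bytes and in 1500-byte packets
echo $(( 1000000000 / 8 * 40 / 1000 ))           # 5000000 bytes = 5 MB
echo $(( 1000000000 / 8 * 40 / 1000 / 1500 ))    # ~3333 packets
# total rt1 egress buffering: 128-descriptor TX ring + 128-packet pfifo queue
echo $(( (128 + 128) * 1500 ))                   # 384000 bytes, well under one BDP

So the bottleneck provides well under one BDP of buffering, keeping the shared queue shallow for both competing flows.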
test config
root@s1:~ # ping -c 5 r1
PING r1-link5 (10.1.4.3): 56 data bytes
64 bytes from 10.1.4.3: icmp_seq=0 ttl=62 time=40.313 ms
64 bytes from 10.1.4.3: icmp_seq=1 ttl=62 time=40.186 ms
64 bytes from 10.1.4.3: icmp_seq=2 ttl=62 time=40.233 ms
64 bytes from 10.1.4.3: icmp_seq=3 ttl=62 time=40.150 ms
64 bytes from 10.1.4.3: icmp_seq=4 ttl=62 time=40.250 ms

--- r1-link5 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 40.150/40.227/40.313/0.056 ms
root@s1:~ #

root@s2:~ # ping -c 5 r2
PING r2-link6 (10.1.1.3): 56 data bytes
64 bytes from 10.1.1.3: icmp_seq=0 ttl=62 time=40.228 ms
64 bytes from 10.1.1.3: icmp_seq=1 ttl=62 time=40.249 ms
64 bytes from 10.1.1.3: icmp_seq=2 ttl=62 time=40.094 ms
64 bytes from 10.1.1.3: icmp_seq=3 ttl=62 time=40.571 ms
64 bytes from 10.1.1.3: icmp_seq=4 ttl=62 time=40.188 ms

--- r2-link6 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 40.094/40.266/40.571/0.161 ms
root@s2:~ #
- sender/receiver sysctl tuning
root@s1:~ # cat /etc/sysctl.conf
...
net.inet.tcp.hostcache.enable=0
kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216

root@r1:~# cat /etc/sysctl.conf
...
# allow testing with 256MB buffers
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# increase Linux autotuning TCP buffer limit to 256MB
net.ipv4.tcp_rmem = 4096 131072 268435456
net.ipv4.tcp_wmem = 4096 16384 268435456
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
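Both files are read at boot; to apply the settings to a running node without a reboot, something like the following should work (re-reading /etc/sysctl.conf on each side):

# FreeBSD sender: re-read /etc/sysctl.conf
service sysctl restart
# Linux receiver: re-read /etc/sysctl.conf
sysctl -p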
- routers shallow buffer tuning
root@rt1:~# /sbin/ethtool -g eth5
Ring parameters for eth5:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             128
RX Mini:        0
RX Jumbo:       0
TX:             128
root@rt1:~#
root@rt1:~# ifconfig eth5
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.3.2  netmask 255.255.255.0  broadcast 10.1.3.255
        inet6 fe80::204:23ff:feb7:17c9  prefixlen 64  scopeid 0x20<link>
        ether 00:04:23:b7:17:c9  txqueuelen 128  (Ethernet)
        RX packets 186  bytes 18000 (18.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 196  bytes 19354 (19.3 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
root@rt1:~#
root@rt1:~# /sbin/tc qdisc show dev eth5
qdisc pfifo 8005: root refcnt 2 limit 128p
root@rt1:~#
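The output above only shows the resulting state; a sketch of the commands that produce it (assuming the eth5 interface and pfifo root qdisc shown above) looks like:

# shrink the NIC TX/RX descriptor rings to 128 entries
/sbin/ethtool -G eth5 rx 128 tx 128
# cap the interface queue and install a 128-packet FIFO as the root qdisc
ip link set dev eth5 txqueuelen 128
/sbin/tc qdisc replace dev eth5 root pfifo limit 128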
- switch to the FreeBSD RACK TCP stack
root@s1:~ # kldstat
Id Refs Address                Size Name
 1   14 0xffffffff80200000  23a57e8 kernel
 2    1 0xffffffff83200000   3cb1f8 zfs.ko
 3    1 0xffffffff83020000    31e48 tcp_rack.ko
 4    1 0xffffffff83052000     e0f0 tcphpts.ko
root@s1:~ #
root@s1:~ # sysctl net.inet.tcp.functions_default=rack
net.inet.tcp.functions_default: freebsd -> rack
root@s1:~ #
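On a kernel where the RACK stack is not built in, the modules must be loaded before the stack can be selected; a minimal sketch of loading it and making the choice persistent (the usual loader.conf/sysctl.conf route) is:

# load the RACK TCP stack; tcphpts.ko is pulled in as a dependency
kldload tcp_rack
# keep it loaded and selected across reboots
echo 'tcp_rack_load="YES"' >> /boot/loader.conf
echo 'net.inet.tcp.functions_default=rack' >> /etc/sysctl.conf
# list the registered TCP stacks to confirm
sysctl net.inet.tcp.functions_available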
senders' kernel info | FreeBSD 15.0-CURRENT #0 c6767dc1f236: Thu Jan 30 05:48:17 MST 2025
routers' kernel info | Ubuntu 18.04.5 LTS (GNU/Linux 4.15.0-117-generic x86_64)
receivers' kernel info | Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-122-generic x86_64)
- iperf3 command
iperf3 -B ${src} --cport ${tcp_port} -c ${dst} -l 1M -t 200 -i 1 -f m -VC ${name}
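The variables are placeholders; a slightly simplified concrete example for the s1 => r1 flow (arbitrary client port, -B omitted, and -VC split into -V -C for readability) could be:

# on r1 (receiver): start the iperf3 server
iperf3 -s
# on s1 (sender): 200-second run toward r1 (10.1.4.3 from the ping output above);
# ${name} is the congestion control handed to -C/--congestion, e.g. cubic or newreno
iperf3 --cport 50001 -c 10.1.4.3 -l 1M -t 200 -i 1 -f m -V -C cubic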
test result
- highlighted performance stats (FreeBSD congestion control performs worse than its Linux peers by a large margin)
TCP congestion control algo | average link utilization over the 200s test (1Gbps bottleneck) | vs. Linux peer
Linux CUBIC | 846 Mbits/sec (2nd iter) | base
FreeBSD default stack CUBIC | 470 Mbits/sec (2nd iter) | -44.4%
FreeBSD RACK stack CUBIC | 505 Mbits/sec (2nd iter) | -40.3%
Linux newreno | 695 Mbits/sec (1st iter) | base
FreeBSD default stack newreno | 301 Mbits/sec (1st iter) | -56.7%
FreeBSD RACK stack newreno | 444 Mbits/sec (3rd iter) | -36.1%
- throughput and congestion window of CUBIC in Linux TCP stack
TCP throughput: TCP congestion window:
- throughput and congestion window of CUBIC in FreeBSD default TCP stack
TCP throughput: TCP congestion window:
- throughput and congestion window of CUBIC in FreeBSD RACK TCP stack
TCP throughput: TCP congestion window:
- throughput and congestion window of NewReno in Linux TCP stack
TCP throughput: TCP congestion window:
- throughput and congestion window of NewReno in FreeBSD default TCP stack
TCP throughput: TCP congestion window:
- throughput and congestion window of NewReno in FreeBSD RACK TCP stack
TCP throughput: TCP congestion window:
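The throughput curves can be taken from iperf3's per-second interval reports (-i 1); the congestion-window curves have to be sampled separately. A minimal sketch of one way to do that on a Linux sender (polling ss once a second against the r1 address from the ping output; on the FreeBSD senders the siftr(4) kernel module can log cwnd instead) is:

# sample the congestion window of the flow to r1 (10.1.4.3) once per second
while sleep 1; do
    cwnd=$(ss -tin dst 10.1.4.3 | grep -o 'cwnd:[0-9]*' | head -1)
    echo "$(date +%s) ${cwnd}"
done | tee cwnd.log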
VM environment
test config
Virtual machines (VMs) are hosted by bhyve on two separate physical boxes (Beelink SER5 AMD Mini PCs) running FreeBSD 14.1-RELEASE. The two physical boxes are connected through a 1Gbps hub.
In each test, only one data sender and one data receiver are used, and both are VMs. First the FreeBSD VM n1fbsd and then the Linux VM n1linuxvm is used to send TCP data traffic over the same physical link to the Linux VM receiver n2linuxvm. A 40ms delay is added at the Linux receiver.
There are occasional TCP packet drops, which lets us evaluate congestion control performance. The minimum bandwidth-delay product (BDP) is 1000Mbps x 40ms == 5 Mbytes.
root@n2linuxvm:~ # tc qdisc add dev enp0s5 root netem delay 40ms
root@n2linuxvm:~ # tc qdisc show dev enp0s5
qdisc netem 8001: root refcnt 2 limit 1000 delay 40ms
root@n2linuxvm:~ #

root@n1fbsd:~ # ping -c 5 -S 192.168.50.37 192.168.50.89
PING 192.168.50.89 (192.168.50.89) from 192.168.50.37: 56 data bytes
64 bytes from 192.168.50.89: icmp_seq=0 ttl=64 time=44.003 ms
64 bytes from 192.168.50.89: icmp_seq=1 ttl=64 time=44.837 ms
64 bytes from 192.168.50.89: icmp_seq=2 ttl=64 time=43.978 ms
64 bytes from 192.168.50.89: icmp_seq=3 ttl=64 time=43.513 ms
64 bytes from 192.168.50.89: icmp_seq=4 ttl=64 time=43.631 ms

--- 192.168.50.89 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 43.513/43.993/44.837/0.463 ms
root@n1fbsd:~ #

root@n1linuxvm:~ # ping -c 5 -I 192.168.50.154 192.168.50.89
PING 192.168.50.89 (192.168.50.89) from 192.168.50.154: 56(84) bytes of data.
64 bytes from 192.168.50.89: icmp_seq=1 ttl=64 time=43.9 ms
64 bytes from 192.168.50.89: icmp_seq=2 ttl=64 time=44.0 ms
64 bytes from 192.168.50.89: icmp_seq=3 ttl=64 time=43.7 ms
64 bytes from 192.168.50.89: icmp_seq=4 ttl=64 time=44.0 ms
64 bytes from 192.168.50.89: icmp_seq=5 ttl=64 time=43.7 ms

--- 192.168.50.89 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4031ms
rtt min/avg/max/mdev = 43.706/43.862/44.015/0.130 ms
root@n1linuxvm:~ #
- sender/receiver sysctl tuning
root@n1fbsd:~ # cat /etc/sysctl.conf
...
# Don't cache ssthresh from previous connection
net.inet.tcp.hostcache.enable=0
# Increase FreeBSD maximum socket buffer size up to 128MB
kern.ipc.maxsockbuf=134217728
# Increase FreeBSD max size of automatic send/receive buffer up to 128MB
net.inet.tcp.sendbuf_max=134217728
net.inet.tcp.recvbuf_max=134217728
root@n1fbsd:~ #

root@n2linuxvm:~ # cat /etc/sysctl.conf
...
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# Increase Linux autotuning TCP buffer max up to 128MB
net.ipv4.tcp_rmem = 4096 131072 134217728
net.ipv4.tcp_wmem = 4096 16384 134217728
# Don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
root@n2linuxvm:~ #
- switch to the FreeBSD RACK TCP stack
root@n1fbsd:~ # kldstat
Id Refs Address                Size Name
 1    5 0xffffffff80200000  1f75ca0 kernel
 2    1 0xffffffff82810000    368d8 tcp_rack.ko
 3    1 0xffffffff82847000     f0f0 tcphpts.ko
root@n1fbsd:~ # sysctl net.inet.tcp.functions_default
net.inet.tcp.functions_default: rack
root@n1fbsd:~ #
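Independent of the iperf3 -C option, the default and available congestion control algorithms on each sender can be checked with the stock sysctl knobs:

# FreeBSD sender: available CC modules and the current default
sysctl net.inet.tcp.cc.available net.inet.tcp.cc.algorithm
# Linux sender: the same information
sysctl net.ipv4.tcp_available_congestion_control net.ipv4.tcp_congestion_control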
sender info for FreeBSD | FreeBSD 15.0-CURRENT (GENERIC) #0 main-n273771-e8263ace39c8
sender info for Linux | Ubuntu 22.04.5 LTS (GNU/Linux 5.15.0-124-generic x86_64)
receiver info for Linux | Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-124-generic x86_64)
- iperf3 command
iperf3 -B ${src} --cport ${tcp_port} -c ${dst} -l 1M -t 100 -i 1 -f m -VC ${name}
test result
- highlighted performance stats (FreeBSD congestion controls perform worse than their Linux peers by a large margin)
kern.hz value | TCP congestion control algo | iperf3 100-second average bitrate
100 (default) | FreeBSD default stack CUBIC | 455 Mbits/sec (-46.9%)
100 (default) | FreeBSD default stack newreno | 497 Mbits/sec (-37.4%)
100 (default) | FreeBSD RACK stack CUBIC | 682 Mbits/sec (-20.4%)
100 (default) | FreeBSD RACK stack newreno | 442 Mbits/sec (-44.3%)
250, but irrelevant | Linux CUBIC | 857 Mbits/sec (base)
250, but irrelevant | Linux newreno | 794 Mbits/sec (base)
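The kern.hz column can be verified directly on each sender; a quick check (the Linux path assumes a stock Ubuntu kernel config under /boot) is:

# FreeBSD sender: kernel tick rate
sysctl kern.hz
# Linux sender: compiled-in tick rate
grep 'CONFIG_HZ=' /boot/config-$(uname -r)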
- throughput and congestion window of CUBIC & NewReno in FreeBSD default TCP stack
TCP throughput: TCP congestion window:
- throughput and congestion window of CUBIC & NewReno in FreeBSD RACK TCP stack
TCP throughput: TCP congestion window:
- throughput and congestion window of CUBIC & NewReno in Linux
TCP throughput: TCP congestion window: