What needs to be done to tune the networking stack (WIP)

End host



All packets coming from the network use kernel memory buffers called mbufs. A packet is stored as a chain (linked list) of mbufs. A single mbuf takes 256 bytes; larger packets attach an mbuf cluster, which takes another 2048 bytes (or more, for jumbo frames). Very small packets fit in one mbuf, but more commonly a packet consumes an mbuf cluster plus one extra mbuf.
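Current mbuf usage and denied allocations can be inspected with netstat -m. If you run out of clusters, the limit can be raised with the kern.ipc.nmbclusters loader tunable; as a sketch, in /boot/loader.conf (the value below is illustrative, not a recommendation):

 # /boot/loader.conf: raise the mbuf cluster limit (example value)
 kern.ipc.nmbclusters="262144"

The new limit takes effect after a reboot; on recent FreeBSD it can also be set at runtime via sysctl.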


This section describes how to reach 10G performance with your server.


First, you need good NIC(s) capable of multiple hardware receive queues (RSS).

The Intel 82576 is a good example of a gigabit NIC.


A good overview of Intel NIC capabilities can be found here.

Second, you need a good CPU with many cores. Since you can easily get 16 different queues even from an 82576 (8 for each port), it is worth considering a box with 24 hardware threads, such as a dual-socket E5645 system. AMD seems to perform very badly on routing (however, I can't prove it with any tests at the moment).
It seems that disabling HT speeds things up a bit (despite the decreased number of queues).

OS tuning

Fixed in -CURRENT (r233937, r233938). Set net.bpf.optimize_writers=1 for best performance.
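In /etc/sysctl.conf this is simply:

 # /etc/sysctl.conf: optimize BPF for write-mostly consumers (e.g. pcap injectors)
 net.bpf.optimize_writers=1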

On FreeBSD older than 11.0, use the more compact ip_fastforward routine. It processes most packets, falling back to the 'normal' forwarding routine for fragments, packets with IP options, etc. It can gain you up to 20% in speed (but will break IPsec). Since FreeBSD 11.0, fast forwarding was improved, renamed to tryforward (it no longer breaks IPsec), and made the default method.
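On those pre-11.0 systems, fast forwarding is enabled via a sysctl (not needed, and no longer present, on 11.0+):

 # /etc/sysctl.conf: pre-11.0 fast forwarding (breaks IPsec)
 net.inet.ip.fastforwarding=1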

Do not send IP redirects
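The corresponding /etc/sysctl.conf lines, for both IPv4 and IPv6:

 # /etc/sysctl.conf: a router should not send ICMP redirects
 net.inet.ip.redirect=0
 net.inet6.ip6.redirect=0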

Skip feeding /dev/random from the network. Since FreeBSD 11.0: add harvest_mask="351" to /etc/rc.conf. For older FreeBSD, add these lines to /etc/sysctl.conf:
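A sketch of the older-style knobs, using the sysctl names as they existed before 11.0 (verify the exact names on your release with sysctl kern.random):

 # /etc/sysctl.conf (pre-11.0): stop harvesting entropy from network traffic
 kern.random.sys.harvest.ethernet=0
 kern.random.sys.harvest.point_to_point=0
 kern.random.sys.harvest.interrupt=0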

Entropy harvesting collects the first 2 bytes of each frame, and does so under a single mutex :-( Some benchmarks of this impact can be found in "Receipt for building a 10Mpps FreeBSD based router".

sendmsg(2) can't send messages larger than the maxdgram size. The default value causes routing software to fail with OSPF if jumbo frames are turned on.
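A sketch of the usual fix: raise the raw-socket limits in /etc/sysctl.conf so they cover the largest jumbo frame in use (the value below is illustrative):

 # /etc/sysctl.conf: let OSPF daemons send/receive jumbo-sized raw datagrams
 net.inet.raw.maxdgram=16384
 net.inet.raw.recvspace=16384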

Interface capabilities


This can affect you only if you're doing shaping.

The current netisr implementation can't split traffic into different ISR queues (patches are coming, 2012-02-23).
Every queue is protected by a mutex, which is much worse than using the buf_ring(9) API (patches are coming, 2012-02-23).

A performance loss of 10-30% was observed in various scenarios (direct dispatch vs. deferred or hybrid).
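Given the numbers above, you will usually want direct dispatch; the policy can be pinned in /etc/sysctl.conf:

 # /etc/sysctl.conf: process packets in the interrupt context, skip netisr queueing
 net.isr.dispatch=direct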

Traffic flow

Unfortunately, RSS is usually only capable of hashing IPv4 and IPv6 traffic (L3+L4). All other traffic, like PPPoE or MPLS or .., is usually received by queue 0.
This is bad, but even worse is that e1000 (and maybe others) unconditionally sets the flowid to 0, effectively causing later hashing (by netisr, or flowtable, or lagg, or ..) to be skipped (patches for: igb).
The patch adds a dev.XXX.Y.generate_flowid sysctl, which needs to be set to 0 on links with non-IP traffic.
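Assuming the patched igb(4) driver described above, disabling the hardware flowid on one port might look like this (driver name and unit number are illustrative):

 # /etc/sysctl.conf: force software rehashing on a link carrying non-IP traffic
 dev.igb.0.generate_flowid=0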


Note that the following nodes are single-threaded (they use NG_NODE_FORCE_WRITER):


Firewalling is possibly one of the major reasons to use a software router on 10G+ links. Some advice:


libalias-based NAT

NATs based on libalias(3) - ipfw nat, ng_nat, natd - are single-threaded. However, the lock is held per instance. That means you can work around this and utilize SMP by using several NAT instances (different ipfw nat's or several ng_nat nodes), which will run in parallel. For example, if you have 8 public addresses to NAT behind, you can create 8 instances, one for each address, and send one share of the traffic to the first instance, another to the second, etc. Using more than 2*(number of CPUs) NAT instances will not give you any performance gain.
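A minimal sketch with two ipfw nat instances; the public addresses, internal ranges, and interface name are made up for illustration:

 # ipfw rules fragment: two libalias instances running in parallel
 ipfw nat 1 config ip 198.51.100.1
 ipfw nat 2 config ip 198.51.100.2
 # split internal clients between the instances
 ipfw add 1000 nat 1 ip from 10.0.0.0/25 to any out via em0
 ipfw add 1010 nat 2 ip from 10.0.0.128/25 to any out via em0
 # send return traffic back to the instance that owns each address
 ipfw add 1020 nat 1 ip from any to 198.51.100.1 in via em0
 ipfw add 1030 nat 2 ip from any to 198.51.100.2 in via em0

Each instance holds its own lock, so the two halves of the client range are translated concurrently on different CPUs.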

Also, you may need to raise the hash table sizes. Open /sys/netinet/libalias/alias_local.h and patch it to the following sizes:

 #define LINK_TABLE_OUT_SIZE        8123 
 #define LINK_TABLE_IN_SIZE         16411

then recompile the libalias consumers (kernel, natd, ppp), as this change breaks the ABI. If that's not enough for you, the values can be set even bigger; just keep in mind that:

Dynamic routing


NetworkPerformanceTuning (last edited 2016-12-17 09:47:25 by OlivierCochardLabbé)