What needs to be done to tune the networking stack (WIP)

End host

Tunables

Sysctls

All packets coming from the network use kernel memory buffers called mbufs and mbuf clusters. A chain is a linked list of mbufs holding all the data of a single packet; larger packets keep their payload in an mbuf cluster attached to an mbuf. A single mbuf takes 256 bytes and an mbuf cluster takes another 2048 bytes (or more, for jumbo frames). Very small packets fit in one mbuf, but more commonly a packet consumes an mbuf cluster plus one extra mbuf.
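Current mbuf usage and any denied or delayed allocations can be checked with netstat -m. If clusters are being denied under load, the limit can be raised; a hedged sketch for /boot/loader.conf (the value is illustrative and should be sized to RAM and expected traffic):

 # /boot/loader.conf: raise the mbuf cluster limit (illustrative value)
 kern.ipc.nmbclusters="262144"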

Router

This section describes how to reach 10G routing performance with your server.

Hardware

First, you need good NIC(s) with multiple hardware queues:

The Intel 82576 is a good example of a gigabit NIC.

10G

A good overview of Intel NIC capabilities can be found here.

Second, you need a good CPU with many cores. Since you can easily get 16 different queues even with the 82576 (8 per port), it is worth considering a CPU with many cores, such as the E5645. AMD seems to perform very badly on routing (however, I can't prove that with any tests at the moment).
It seems that disabling HT speeds things up a bit (despite the decreased number of queues).

OS tuning

bpf(4): fixed in -CURRENT (r233937, r233938). Set net.bpf.optimize_writers=1 for best performance.

Use the more compact ip_fastforward routine. It processes most packets, falling back to the 'normal' forwarding routine for fragments, packets with options, etc. It can save you up to 20% in speed (but will break IPSec).

Do not send IP redirects

Skip feeding /dev/random from network.

sendmsg(2) can't send messages larger than the maxdgram length. The default value causes routing software to fail with OSPF if jumbo frames are turned on.
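The items above map to a handful of sysctls; a hedged /etc/sysctl.conf sketch, assuming an 8.x/9.x-era kernel (names and defaults vary between versions, and net.inet.raw.maxdgram/recvspace are my assumption for the sendmsg() item above):

 net.bpf.optimize_writers=1               # faster bpf writers (needs r233937/r233938)
 net.inet.ip.fastforwarding=1             # compact forwarding path (breaks IPSec)
 net.inet.ip.redirect=0                   # do not send IP redirects
 kern.random.sys.harvest.ethernet=0       # skip feeding /dev/random from network
 kern.random.sys.harvest.point_to_point=0
 kern.random.sys.harvest.interrupt=0
 net.inet.raw.maxdgram=16384              # let routing software send jumbo-sized messages
 net.inet.raw.recvspace=16384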

Interface capabilities

Netisr

This can affect you only if you're doing shaping.

The current netisr implementation can't split traffic into different ISR queues (patches are coming, 2012-02-23).
Each queue is protected by a mutex, which is much worse than using the buf_ring(9) API (patches are coming, 2012-02-23).

A performance loss of 10-30% was observed in various scenarios (direct dispatch vs. deferred or hybrid).
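The dispatch policy and the number of netisr threads are controlled by net.isr tunables; a hedged /boot/loader.conf sketch (values are illustrative, names per netisr(9)):

 net.isr.dispatch=direct   # alternatives: deferred, hybrid
 net.isr.maxthreads=4      # one thread per forwarding core is a common choice
 net.isr.bindthreads=1     # pin netisr threads to CPUs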

Traffic flow

Unfortunately, RSS is usually only capable of hashing IPv4 and IPv6 traffic (L3+L4). All other traffic, like PPPoE or MPLS or .., is usually received by queue 0.
This is bad, but even worse is that e1000 (and maybe others) unconditionally sets the flowid to 0, effectively causing later hashing (by netisr, or flowtable, or lagg, or ..) to be skipped (patches for: igb).
The patch adds a dev.XXX.Y.generate_flowid sysctl which needs to be set to 0 on links with non-IP traffic.
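For example, on an igb(4) port carrying PPPoE (the interface number is hypothetical, and this requires the patch mentioned above):

 # /etc/sysctl.conf
 dev.igb.0.generate_flowid=0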

Netgraph

Note that the following nodes are single-threaded (they use NG_NODE_FORCE_WRITER):

Firewalls

Firewalling is possibly one of the major reasons to use a software router on 10G+ links. Some advice:

IPFW:

libalias-based NAT

NATs based on libalias(3) - ipfw nat, ng_nat, natd - are single-threaded. However, the lock is held per-instance. That means you can work around this and utilize SMP by using several NAT instances (different ipfw nat's or several ng_nat nodes), which will run in parallel. For example, if you have 8 public addresses and need to NAT 192.168.0.0/24, then you can have 8 instances, one for each address, and send traffic from 192.168.0.0/27 to the first instance, 192.168.0.32/27 to the second, etc. Using more NAT instances than 2x the number of CPUs will not give you any additional performance gain.
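The scheme above can be sketched as an ipfw ruleset fragment; the public addresses (203.0.113.x), rule numbers, and the em0 interface are hypothetical, and only two of the eight instances are shown:

 ipfw nat 1 config ip 203.0.113.1
 ipfw add 100 nat 1 ip from 192.168.0.0/27 to any out via em0
 ipfw add 110 nat 1 ip from any to 203.0.113.1 in via em0
 ipfw nat 2 config ip 203.0.113.2
 ipfw add 200 nat 2 ip from 192.168.0.32/27 to any out via em0
 ipfw add 210 nat 2 ip from any to 203.0.113.2 in via em0
 # ... instances 3-8 for the remaining /27 blocks and addresses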

Also, you may need to raise the hash table sizes. Open /sys/netinet/libalias/alias_local.h and patch it to use the following sizes:

 #define LINK_TABLE_OUT_SIZE        8123 
 #define LINK_TABLE_IN_SIZE         16411

then recompile the libalias consumers (kernel, natd, ppp), as this breaks the ABI. If that's not enough for you, the values can be set even bigger; just keep in mind that:

Dynamic routing

Software

NetworkPerformanceTuning (last edited 2014-02-03 12:52:44 by AlexanderChernikov)