What needs to be done to tune the networking stack (WIP)
- hw.igb.max_interrupt_rate - loader tunable limiting the maximum number of interrupts per second generated by a single igb(4)-driven NIC. The default value is 8000, which is too low; you may want to increase it up to 32000 or more.
- net.inet.tcp.tcbhashsize - loader tunable and read-only sysctl, size of the hash used to find the socket for an incoming packet. 512 by default; a busy content provider may benefit from an increase up to 32K.
- kern.ipc.maxsockets - loader tunable and read/write sysctl, global limit for the number of sockets in the system; each open socket takes roughly 1800 bytes of kernel memory (look at the kern.ipc.numopensockets value to see the current number of sockets). The count includes closed but not yet destroyed sockets in the TIME_WAIT state (use net.inet.tcp.maxtcptw to limit their number).
- kern.maxfiles - read/write sysctl, global limit for the number of open files in the system; each accepted socket produces an open file that takes 128 bytes of kernel memory in addition to the socket's kernel data (look at the kern.openfiles value to see the current number of open files, and do not forget the per-process limit kern.maxfilesperproc).
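To get a feel for what a given kern.ipc.maxsockets and kern.maxfiles setting costs, the per-socket figures above can be turned into a rough estimate. The following is only a back-of-the-envelope sketch: the function name and the sample socket counts are made up for illustration, and the byte sizes are the approximate values quoted in the text, not measured ones.

```python
# Approximate per-object kernel memory, taken from the text above.
SOCKET_BYTES = 1800  # roughly, per open socket
FILE_BYTES = 128     # per open file created for an accepted socket


def socket_kernel_memory(num_sockets):
    """Rough kernel memory in bytes for num_sockets accepted sockets.

    Hypothetical helper: assumes every socket also has exactly one open file.
    """
    return num_sockets * (SOCKET_BYTES + FILE_BYTES)


if __name__ == "__main__":
    # Sample socket counts chosen only for illustration.
    for n in (25_000, 100_000, 500_000):
        mib = socket_kernel_memory(n) / 2**20
        print(f"{n:>7} sockets: about {mib:.0f} MiB of kernel memory")
```

The estimate ignores socket buffers and mbufs, so real usage under load will be higher.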
All packets coming from the network use kernel memory buffers named mbufs and mbuf clusters. An mbuf chain is a linked list of mbufs holding all the data of a single packet; a cluster is a larger external buffer attached to an mbuf. A single mbuf takes 256 bytes and an mbuf cluster takes another 2048 bytes (or more, for jumbo frames). Very small packets fit in one mbuf, but more commonly a packet consumes an mbuf cluster plus one extra mbuf.
- kern.ipc.nmbclusters - loader tunable and read/write sysctl, global limit for the number of mbuf clusters in the system; packet drops happen when this value is reached.
- kern.ipc.nmbjumbop - loader tunable and read/write sysctl limiting the number of page-sized mbuf clusters (4K in general); starting from FreeBSD 7, TCP sockets use these instead of 2K-sized mbuf clusters for outgoing data.
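The buffer sizes above give a rough upper bound on the memory pinned when the cluster limit is reached. The sketch below is an estimate under stated assumptions: the helper names are hypothetical, and it assumes the common case described in the text, where a packet occupies one 2048-byte cluster, the mbuf that owns the cluster, and one extra mbuf.

```python
MBUF_BYTES = 256       # size of a single mbuf (from the text)
CLUSTER_BYTES = 2048   # size of a standard mbuf cluster (from the text)


def packet_buffer_bytes(mbufs=2, clusters=1):
    """Memory held by one packet using the given number of mbufs and clusters.

    The default (2 mbufs, 1 cluster) is an assumption: the cluster's own mbuf
    plus the one extra mbuf mentioned in the text.
    """
    return mbufs * MBUF_BYTES + clusters * CLUSTER_BYTES


def cluster_pool_bytes(nmbclusters, mbufs_per_cluster=2):
    """Approximate memory in use when all nmbclusters clusters are allocated."""
    return nmbclusters * packet_buffer_bytes(mbufs_per_cluster, 1)


if __name__ == "__main__":
    # 262144 is an arbitrary sample value, not a recommendation.
    limit = 262144
    print(f"nmbclusters={limit}: about {cluster_pool_bytes(limit) / 2**20:.0f} MiB")
```

Jumbo clusters are larger, so the same calculation with a bigger cluster size applies to kern.ipc.nmbjumbop.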
This section describes how to reach 10G performance from your server
First, you need good NIC(s) capable of:
- [vlan] hardware checksum
- multiple receive queues (RSS)
- vlan hardware filtering
Intel 82576 is a good example of a gigabit NIC.
Intel 82599/X520 can be chosen for 10G. RSS supports 16 queues per port.
A good overview of Intel NIC capabilities can be found here
- Chelsio (10/20/40G)
Second, you need a good CPU with many cores. Since you can easily get 16 different queues even for the 82576 (8 for each port), it is worth considering buying a 24-core CPU like the E5645. AMD seems to perform very badly on routing (however, I can't prove it with any tests at the moment).
It seems that disabling HT speeds up things a bit (despite decreased number of queues).
- Recent -STABLE version should always be used, say NO to -RELEASE.
- Say NO to the i386 platform, which greatly limits kernel virtual memory; move to amd64.
Ensure BPF is OFF. No tcpdump, cdpd, lldpd, dhcpd, dhcp-relay. Patches are coming. Check netstat -B.
- Ensure you're not running with rtsock generating tons of RTM_MISS messages
Use the more compact ip_fastforward routine. It processes most packets, falling back to the 'normal' forward routine for fragments, packets with options, etc. It can save you up to 20% speed (but will break IPSec).
Do not send IP redirects
Skip feeding /dev/random from network.
sendmsg() can't send messages longer than maxdgram. The default value causes routing software to fail with OSPF if jumbo frames are turned on.
Ensure RXCSUM, TXCSUM and VLAN_HWCSUM are turned on. Checksumming is the easiest thing that can be offloaded without any problems.
Ensure VLAN_HWFILTER is turned on if you're running vlans. The card extracts the VLAN ID and sends the resulting mbuf directly to the vlan interface.
- Bump net.route.netisr_maxqlen to 2048 or higher value.
This can affect you iff you're doing shaping.
- Do NOT use netisr policy other than 'direct' if you can.
Current netisr implementation can't split traffic into different ISR queues (patches are coming, 2012-02-23).
Every queue is covered by mutex which is much worse than using buf_ring(9) api (patches are coming, 2012-02-23).
A performance loss of 10-30% was observed in various scenarios (direct dispatch vs deferred or hybrid).
- When using RSS, the NIC driver usually sets the mbuf flowid to the number of the receiving queue. This lets later consumers (like lagg, netisr or multipath routing) use the existing data instead of calculating a hash.
Unfortunately, RSS is usually capable of hashing only IPv4 and IPv6 traffic (L3+L4). All other traffic, like PPPoE or MPLS, is usually received by queue 0.
This is bad, but even worse is that e1000 (and maybe others) unconditionally sets flowid to 0, effectively causing later hashing (by netisr, or flowtable, or lagg, or ..) to be skipped (patches for: igb).
The patch adds a dev.XXX.Y.generate_flowid sysctl, which needs to be set to 0 on links with non-IP traffic.
Note that the following nodes are single-threaded (they use NG_NODE_FORCE_WRITER):
- ng_nat (see about libalias-based NAT below)
Firewalling is possibly one of the major reasons to use a software router on 10G+ links. Some advice:
- Use as few rules as possible. ipfw seems to consume CPU linearly with the number of rules. 10-15 rules seem to be a reasonable maximum for the main traffic flow.
- FIGHT FOR EVERY RULE. Again and again. Even 10 rules (which are traversed by most packets) can consume up to 40% of processing time. Complex configurations eat much more.
- Split rules by direction (out | in) for each inbound and outbound interface.
- Use tables, tablearg in every place you can.
- Note skipto tablearg works in O(log(n)), where n is number of rules, so it possibly can be used to implement per-interface firewall. Use with care.
- The same applies for returning to ipfw from netgraph (if one_pass is set to 0)
Use dynamic rules with care. While dynamic rules are now much faster (243707, merge to 8/9, 2012-12-21), you should still use them as little as possible.
Note limit action implicitly calls check-state
- ipfw nat is libalias-based, see below
NATs based on libalias(3) - ipfw nat, ng_nat, natd - are single-threaded. However, the lock is held per instance. That means you can work around this and utilize SMP by using several NAT instances (different ipfw nat instances or several ng_nat nodes), which will run in parallel. For example, if you have 8 public addresses and need to NAT 192.168.0.0/24, then you can have 8 instances, one for each address, and send traffic from 192.168.0.0/27 to the first instance, 192.168.0.32/27 to the second, etc. Using more than 2*CPUs NAT instances will not give you any performance gain.
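The split described above is easy to compute mechanically. This is a minimal sketch using Python's standard ipaddress module; the function name is hypothetical and the public addresses are placeholders from the 203.0.113.0/24 documentation range, not values from this page. It assumes the number of NAT instances is a power of two so the inside network divides into equal subnets.

```python
import ipaddress


def split_for_nat(inside, public_addrs):
    """Pair equal-sized subnets of `inside` with public addresses, one per NAT instance."""
    n = len(public_addrs)
    if n == 0 or n & (n - 1):
        raise ValueError("number of NAT instances must be a power of two")
    extra_bits = n.bit_length() - 1  # log2(n) additional prefix bits
    subnets = ipaddress.ip_network(inside).subnets(prefixlen_diff=extra_bits)
    return [(str(subnet), addr) for subnet, addr in zip(subnets, public_addrs)]


if __name__ == "__main__":
    # Placeholder public addresses (documentation range), for illustration only.
    public = [f"203.0.113.{i}" for i in range(1, 9)]
    for instance, (subnet, addr) in enumerate(split_for_nat("192.168.0.0/24", public), 1):
        print(f"NAT instance {instance}: {subnet} -> {addr}")
```

Each resulting pair would then become one ipfw nat instance or one ng_nat node, with the firewall rules steering each /27 to its own instance.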
Also, you may need to raise hash table size. Open /sys/netinet/libalias/alias_local.h and patch it for the following sizes:
#define LINK_TABLE_OUT_SIZE 8123
#define LINK_TABLE_IN_SIZE 16411
then recompile libalias consumers (kernel, natd, ppp) as it breaks ABI. If that's not enough for you, values can be set even bigger, just keep in mind that:
- they must be prime numbers
- this consumes memory for all libalias instances in the system even if there is no traffic
- due to deficiency of hash function, there is no sense to make 'in' table size slightly more than 64k
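Since the table sizes must be prime, a small helper can pick a suitable value near the size you have in mind. This is an illustrative sketch only (the function names are hypothetical, and simple trial division is plenty for numbers this small); it confirms that the two sizes suggested above are prime.

```python
def is_prime(n):
    """Trial-division primality test; adequate for hash-table-sized numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n):
    """Return the smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


if __name__ == "__main__":
    # The two sizes suggested in the text.
    for size in (8123, 16411):
        print(size, "is prime" if is_prime(size) else "is NOT prime")
    # Candidate for a larger table: first prime at or above a chosen target.
    print("candidate size:", next_prime(32768))
```

Remember that any value chosen this way still has to be patched into alias_local.h and all libalias consumers rebuilt.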
- Reduce number of routes as much as you can. Typical solution is to use IGP routes + default instead of full-view.
- Quagga. Famous cisco-like daemon.
- bird. Juniper-like configs, multiple kernel tables, ability to filter kernel routes