This is a placeholder for a reorganise/update of the Networking page.
Yandex (TODO: name) has done a lot of investigation into packet forwarding performance on FreeBSD-9 and FreeBSD-10.
The locking around rtentry is expensive - do we actually need to refcount rtentry, or can we architect it out by not passing it by reference into the forwarding stack?
- Implement counter(9)-based interface for lockless refcounting on ifp's
- Convert current users of the rtentry(9) API to the new forwarding functions
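As a rough illustration of the per-CPU counting idea behind the first item above, the sketch below keeps one reference slot per CPU so that taking or dropping a reference never touches a cache line shared with another CPU. This is plain user-space C, not the kernel counter(9) API: the names pcpu_ref, pcpu_slot and NSLOTS are hypothetical, and a real implementation would also have to guarantee that the caller stays on its CPU while it updates its slot.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 4	/* hypothetical CPU count, fixed for the sketch */

/* One counter per CPU, padded to 64 bytes so slots do not share a cache line. */
struct pcpu_slot {
	int64_t count;
	char pad[64 - sizeof(int64_t)];
};

struct pcpu_ref {
	struct pcpu_slot slot[NSLOTS];
};

/* Reset every per-CPU slot to zero. */
static void
pcpu_ref_init(struct pcpu_ref *r)
{
	for (unsigned i = 0; i < NSLOTS; i++)
		r->slot[i].count = 0;
}

/* Take a reference on behalf of `cpu`; only that CPU's slot is written. */
static void
pcpu_ref_acquire(struct pcpu_ref *r, unsigned cpu)
{
	assert(cpu < NSLOTS);
	r->slot[cpu].count++;
}

/*
 * Drop a reference. The release may happen on a different CPU from the
 * acquire, so an individual slot is allowed to go negative.
 */
static void
pcpu_ref_release(struct pcpu_ref *r, unsigned cpu)
{
	assert(cpu < NSLOTS);
	r->slot[cpu].count--;
}

/*
 * Sum all slots. The result is only meaningful once no acquire/release
 * can be in flight, e.g. when the object is being torn down.
 */
static int64_t
pcpu_ref_total(const struct pcpu_ref *r)
{
	int64_t sum = 0;

	for (unsigned i = 0; i < NSLOTS; i++)
		sum += r->slot[i].count;
	return (sum);
}
```

The trade-off is that the fast path is cheap, but reading the total is a slow operation that has to visit every slot, so it suits objects whose count is only checked rarely.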
Contention between CPUs when forwarding between multi-queue interfaces
There's contention between threads when forwarding packets from one interface to another: by default, each receive thread on a NIC will transmit into a number of TX queues on the destination NIC (based on the mbuf flowid). Yandex modified lagg(4) to always transmit to the same CPU queue that the packet was received on, eliminating the contention. This may however introduce some out-of-order behaviour, which needs to be quantified.
(AdrianChadd - it shouldn't, as the RX NIC should already be hashing flows for us anyway. As long as all frames for a flow come in on a given NIC RX queue AND the TX/RX threads are pinned, the destination TX queue should also be constant, and this should result in no out-of-order behaviour within a given flow.)
(AlexanderChernikov - there is another approach we can consider after andre@'s work: check if we have the source interface set on the mbuf and send it to the queue which is bound to the current CPU.)
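The two queue selection policies discussed above can be contrasted with a small sketch. Both function names are hypothetical and neither is taken from lagg(4) or any driver; the modulo mapping is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Flow-hash policy: the TX queue depends on the packet's flow ID, so one
 * RX thread can end up transmitting into many different TX queues.
 */
static unsigned
txq_by_flowid(uint32_t flowid, unsigned ntxq)
{
	assert(ntxq > 0);
	return (flowid % ntxq);
}

/*
 * Same-queue policy: the TX queue depends only on the RX queue index
 * (and so on the pinned CPU), so each RX thread uses one TX queue.
 */
static unsigned
txq_by_rxq(unsigned rxq, unsigned ntxq)
{
	assert(ntxq > 0);
	return (rxq % ntxq);
}
```

Under the second policy, two flows that arrive on the same RX queue always share a TX queue, and as long as a flow stays on one RX queue its packets stay on one TX queue, which is the ordering argument made in the comment above.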
Radix trie efficiencies (or lack thereof)
(2013-10-14) melifaro@, luigi@
Split routing structures (radix/rtentry stuff) into 2 different things: feature-rich RIB for determining best routes among several ones and fast af-dependent one used to forward packets. Planned work:
- Fix masks bug in radix (done)
- Fix multipath
- Implement API to program FIB for given family
- Implement IPv4 FIB
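To make the RIB/FIB split above more concrete, here is a minimal sketch of what a forwarding-only IPv4 table might hold: just prefix, prefix length and a nexthop index, with every other route attribute left in the RIB. The names fib4_entry, fib4_mask and fib4_lookup are hypothetical, and the linear scan is a stand-in chosen for brevity; a real FIB would use a structure designed for fast longest-prefix match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Forwarding-only view of a route; addresses are in host byte order. */
struct fib4_entry {
	uint32_t prefix;	/* network address, already masked */
	uint8_t  plen;		/* prefix length, 0..32 */
	uint16_t nhop;		/* index into a separate nexthop table */
};

/* Build the netmask for a prefix length; /0 is handled without a 32-bit shift. */
static uint32_t
fib4_mask(uint8_t plen)
{
	assert(plen <= 32);
	return (plen == 0 ? 0 : 0xffffffffu << (32 - plen));
}

/*
 * Longest-prefix match by linear scan.
 * Returns the nexthop index of the most specific match, or -1 if none.
 */
static int
fib4_lookup(const struct fib4_entry *tbl, size_t n, uint32_t dst)
{
	int best = -1;
	int best_len = -1;

	for (size_t i = 0; i < n; i++) {
		if ((dst & fib4_mask(tbl[i].plen)) == tbl[i].prefix &&
		    tbl[i].plen > best_len) {
			best = tbl[i].nhop;
			best_len = tbl[i].plen;
		}
	}
	return (best);
}
```

In this arrangement the RIB would pick the best route among several candidates and then program only the winning prefix and nexthop index into the table, so the per-packet path never sees the richer routing state.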
Socket / mbuf lifecycle and refcounting
Netflix have noticed a lot of memory accesses when managing externally refcounted mbufs. The root seems to be the refcount itself - it comes out of a separate UMA pool full of DWORD (4-byte) refcounts, which almost always ends up being a read from memory rather than from L2/L3 cache. glebius@ has a patch to migrate mbuf refcounting into a couple of methods so mbufs that are backed by vmpage/sfbuf can use the refcount already in that structure, rather than adding another layer of refcounting.
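The idea of routing refcount operations through methods can be sketched as follows. This is not glebius@'s patch: ext_buf, ext_ops and fake_page are hypothetical names, and the counts are plain (non-atomic) integers to keep the example short. The point it tries to show is that the count lives inside the backing object, so no separately allocated refcount has to be fetched from memory.

```c
#include <assert.h>
#include <stddef.h>

struct ext_buf;

/* Per-backing-store refcount methods. */
struct ext_ops {
	void (*ref)(struct ext_buf *);
	int  (*unref)(struct ext_buf *);	/* returns 1 when the last reference is dropped */
};

/* External storage descriptor: no refcount of its own, only methods. */
struct ext_buf {
	const struct ext_ops *ops;
	void *backing;
};

/* Hypothetical page-like object that already carries a reference count. */
struct fake_page {
	int refs;
};

static void
page_ref(struct ext_buf *b)
{
	((struct fake_page *)b->backing)->refs++;
}

static int
page_unref(struct ext_buf *b)
{
	return (--((struct fake_page *)b->backing)->refs == 0);
}

static const struct ext_ops page_ops = { page_ref, page_unref };

/* Generic entry points used by buffer code; they only dispatch. */
static void
ext_ref(struct ext_buf *b)
{
	assert(b->ops != NULL);
	b->ops->ref(b);
}

static int
ext_unref(struct ext_buf *b)
{
	assert(b->ops != NULL);
	return (b->ops->unref(b));
}
```

A buffer type without an embedded count could supply its own pair of methods backed by a separate counter, so the existing behaviour would remain available as one implementation among several.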
TCP / UDP incoming load balancing across multiple threads/processes
(2013-10-04) - glebius@
mbuf backing store representation
(2013-10-04) - adrian@
Under memory fragmentation, a large send buffer can be represented by a large set of underlying, discontiguous memory pages. In the most pessimistic case, a large read via sendfile() can be decomposed into a large number of 4KiB page entries which are individually converted into separate mbufs and then passed to the network stack to send. If some pages are cached but others aren't, then this also results in a list of mbufs that represent each set of pages. Then in the network driver, this list of mbufs has to be walked again and turned into a gather DMA list for the NIC. This all takes time and touches memory.
Ideally the network stack should be able to deal with an mbuf which is made up of multiple underlying buffers. For the above case(s), an array of pages could be passed up, or some API that allows for 'struct buf' (and others) to be exposed as the underlying buffer without overhead.
The trick here is that almost all of the networking code assumes that:
- you can directly fondle the m->m_data pointer for a given mbuf, without modifying the actual backing store, in order to reserve header space;
- there's only one buffer per mbuf, that you can access via mtod();
- walking the buffers that make up an mbuf is done by walking the m_next pointer list.
It may be quite a win to be able to represent mbufs as an abstract data source that will default to a single buffer, but could be backed by an array of vm_page_t entries, or a 'struct buf', or similar. A lot of code may need modifying to handle this (i.e. all the direct mbuf pointer adjustment and manual iteration), but it will result in the mbuf consumer code being tidier.
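A minimal sketch of such an abstract data source, assuming a fixed-size array of segments behind one handle, might look like the following. The names xbuf, seg, xbuf_append and xbuf_copydata are hypothetical and this is not the mbuf(9) API. Consumers go through an accessor that hides segment boundaries instead of touching a data pointer or walking a next-pointer chain themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XBUF_MAXSEG 8	/* hypothetical limit on segments per buffer */

/* One contiguous piece of backing store (a page, a cluster, ...). */
struct seg {
	const void *base;
	size_t len;
};

/* A single logical buffer made of several discontiguous segments. */
struct xbuf {
	struct seg seg[XBUF_MAXSEG];
	size_t nseg;
	size_t len;		/* total bytes across all segments */
};

/* Attach another segment; returns 0 on success, -1 if the array is full. */
static int
xbuf_append(struct xbuf *x, const void *base, size_t len)
{
	if (x->nseg == XBUF_MAXSEG)
		return (-1);
	x->seg[x->nseg].base = base;
	x->seg[x->nseg].len = len;
	x->nseg++;
	x->len += len;
	return (0);
}

/*
 * Copy up to `len` bytes starting at logical offset `off` into `dst`,
 * crossing segment boundaries as needed. Returns the bytes copied.
 */
static size_t
xbuf_copydata(const struct xbuf *x, size_t off, size_t len, void *dst)
{
	char *out = dst;
	size_t copied = 0;

	for (size_t i = 0; i < x->nseg && copied < len; i++) {
		const struct seg *s = &x->seg[i];
		size_t n;

		if (off >= s->len) {	/* requested range starts later */
			off -= s->len;
			continue;
		}
		n = s->len - off;
		if (n > len - copied)
			n = len - copied;
		memcpy(out + copied, (const char *)s->base + off, n);
		copied += n;
		off = 0;		/* later segments are read from their start */
	}
	return (copied);
}
```

A driver could walk the same segment array directly to build its gather DMA list, which would avoid the second pass over a chain of per-page buffers described above.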
(2013-10-04) - adrian@
There's been some recent work by Yahoo (TODO: URL) looking at how effective it is to batch IO related syscalls. They found that grouping up to 32 read/write/accept/connect/close syscalls (zero copy or otherwise) into a single batch gave a significant performance boost.
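The shape of a batched interface can be sketched with an array of operation descriptors that is handed over in one call. Everything here is hypothetical: batch_op, batch_submit and BATCH_MAX are illustrative names and not the interface from the work mentioned above. This user-space stand-in also still issues one syscall per entry; in a real batching design the whole array would cross the user/kernel boundary once.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define BATCH_MAX 32	/* batch size limit, matching the figure quoted above */

enum batch_opcode { BATCH_READ, BATCH_WRITE, BATCH_CLOSE };

/* One queued operation plus a slot for its result. */
struct batch_op {
	enum batch_opcode op;
	int fd;
	void *buf;
	size_t len;
	ssize_t result;
};

/*
 * Run the operations in order and stop at the first failure.
 * Returns the number of operations that completed successfully.
 */
static size_t
batch_submit(struct batch_op *ops, size_t n)
{
	assert(n <= BATCH_MAX);
	for (size_t i = 0; i < n; i++) {
		switch (ops[i].op) {
		case BATCH_READ:
			ops[i].result = read(ops[i].fd, ops[i].buf, ops[i].len);
			break;
		case BATCH_WRITE:
			ops[i].result = write(ops[i].fd, ops[i].buf, ops[i].len);
			break;
		case BATCH_CLOSE:
			ops[i].result = close(ops[i].fd);
			break;
		}
		if (ops[i].result < 0)
			return (i);
	}
	return (n);
}
```

Keeping a per-operation result field lets the caller see exactly which entry failed, which matters once several dependent operations are submitted together.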
Contention between TCP RX (ACK -> TCP window update, transmit) and TCP TX
(2013-10-04) - rwatson@
Eliminate the default use of ifnet send queues
(2013-10-09) - andre@
There are a few areas where the ifnet send queue macros and queue structure are in place. Andre has suggested the following:
- Remove the default ifnet send queue from the ifnet structure
- If a device wishes to use an ifnet send queue (e.g. for altq), then it's up to the network device in question to create, destroy and manage the queue itself
- .. this way the ifnet send queue won't be accidentally used by code that shouldn't be using it.