DevSummit/201405/NetworkStack

FreeBSD Developer Summit: Network Stack

Thursday May 15th, 9.00-12.00 location TBD

Overview

This working group will discuss networking stack topics. Please send proposal for topics to discuss to Luigi Rizzo rizzo@iet.unipi.it , they will be summarized here and then a selection will be done based on interest and relevance. Preference will be given to topics that actually need interactive discussion rather than one-way communications or status updates.

Goals

A list of changes and APIs that will help facility FreeBSD better using the above listed features.

Topics

The order is just for naming topics

#	Topic Description
1	batching
2	parallelism
3	alternate FIBs
4	TCP segmentation and reassembly
5	ifnet abstraction
6	ifnet extensions for multiqueue/ethtool
7	mbuf revisiting
8	backpressure
9	vlan QinQ, vxlan
10	Lightweight refcounters

Batching
improve batching support in the network paths, starting from if_input and ether_output. Preferably using backward compatible APIs so we can move on incrementally. Obvious example is to use the m_nextpkt field in mbufs. The solution does not have to be perfect, we need something badly now.
Parallelism
improve parallelism in the network stack. Not my area but there have been various suggestions to reduce contention on tcpcb's etc. Surely taking inspiration on what has been done on other platforms including research ones. See also lock contention issues with short lived connections at http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/183659
alternate FIBs (routing tables)
Low end machines can live with the current, slow routing tables (perhaps with a bit of caching), but we should provide hooks for alternatives such as Marko Zec's DXR, or other solutions which may come up.
TCP/UDP segmentation (and reassembly), checksums
Introduce support for software TCP segmentation, as low as possible in the stack so that we can push large segments down to the driver where it will be taken care of. This saves high repetitive costs for IP and ARP lookups on every segment. GSO is segmentation offloading in software above the driver, but we could also do something more efficient (sometimes called DSO) operating in the driver which could pre-allocate the extra buffers for headers. NIC-based checksums should be used if possible, and in any case support should be moved down the stack.
ifnet abstraction - see previous emails from Juniper ?
ifnet extensions for multiqueue / ethtool support
we should provide info on queue numbers and sizes, and other features that might be of use for multiqueue devices, mostly in the control path (so they can be externally linked). In linux these are managed with device-independent tools, which is much better than the custom methods we have now, and avoids polluting the ifnet with extra information.
mbuf revisiting
We need to move forward on this on a number of things.
- 'ephemeral' mbufs (which must be consumed inline, e.g. allocated on the stack or otherwise owned by the caller). These have a large number of uses.
- in-mbuf space for mtags (largely used for dummynet and perhaps other things).
- arrays vs linked lists of mbufs (trying to reduce the cost of chain traversals)
- L2/L3/L4 pointers in the pkthdr to avoid repeated packet parsing
backpressure
FreeBSD has no good backpressure mechanism to signal sockets that a link is busy. The most evident symptom is that UDP sockets are never blocking on transmission but can generate significant losses. For TCP, we need to backoff when the tx queue is full and hope that incoming acks will wake us up.
In linux, buffers in the tx queue hold a reference to the socket so completions can be used to notify sockets. Implementing the same mechanism in FreeBSD should be relatively straightforward.
VLAN QinQ
Nested VLAN tags are found in some environments. It would be good if we supported that. Also, it would be nice to import rwatson@'s VLAN priority patch.
vxlan
Similar to VLAN, but packets are instead encapsulated in UDP so they can be routed, and IP multicasting is used to do broadcasts.
Lightweight refcounters
We have several objects (interfaces, bridges, modules, etc.) that are long lived but do change occasionally, so some form of reference counting is needed. We are looking for solutions that are cheap on acquire/release.

Attending

In order to attend you need register for the developer summit as well as by email for the session and be confirmed by the working group organizer. Follow the guidelines described on the main page or what you received by email. For questions or if in doubt ask the session chairs.

Please do NOT add yourself here. Your name will appear automatically once you have received the confirmation email. You need to put your name on the general developer summit attendees list though.

Name	Username / Affiliation	Topics of Interest	Notes
Michael Bentkofsky	Verisign	all
Kevin Bowling	LLNW
Chris Busick
Adrian Chadd	adrian
Julien Charbon	Verisign	TCP stack parallelism
David Christensen	broadcom	ethtool
Eric Davis	edavis
Suresh Gumpula	Netapp
John Mark Gurney	jmg
Hash Haven	isilon	OFED, rx livelock
Marcus Hoffman	Norse
Mike Karels	mcafee
Joe Kloss	Panasas
Jeremiah Lott	Avere Systems
Jeffrey Meegan	isilon
Navdeep Parhar	navdeep / chelsio
Luis Otavio O Souza
Alfred Perlstein	Norse
Luigi Rizzo	luigi / Univ. Pisa
Jonathan Saks	xinuos
Hiroki Sato
Mike Silbersack	silby
Gleb Smirnoff	glebius	lightweight refcounters
Matt Smith	netgate
Lawrence Stewart	lstewart / netflix	backpressure
Jim Thompson	netgate
Bryan Venteicher	bryanv	mbuf ext, vlan, vxlan
Claudia Y	Norse
Bjoern Zeeb	bz

Notes

https://etherpad.wikimedia.org/p/123www (dumped below)

Network Session BSDCan May 2014

Batching

improve batching support in the network paths, starting from if_input and ether_output. Preferably using backward compatible APIs so we can move on incrementally. Obvious example is to use the m_nextpkt field in mbufs. The solution does not have to be perfect, we need something badly now.

Can we use a flag in the mbuf to handle batching? This may relate to scatter/gather lists (which are going to be discussed in the mbuf revisiting) At the moment there are no outstanding patches. Netflix suggests they may work on this in the fall. Modification to ether_input Drivers would have to know about ether_input changes — short list of ifp functions need a batched version added

Parallelism

improve parallelism in the network stack. Not my area but there have been various suggestions to reduce contention on tcpcb's etc. Surely taking inspiration on what has been done on other platforms including research ones. See also lock contention issues with short lived connections at http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/183659

TCP_INFO lock is a major point of contention for these. Two patches accepted so far: don’t lock from accept(); hold for shorter interval when handling timeouts. Verisign tried the pcbgroups lock but that didn't scale (issues with shared code between TCP & UDP as well). There is an unshared patch from Verisign that changes how the read and write locks are taken (more often read than write). Adrian has a patch to use RSS hash everywhere to prevent the use of locks? http://lists.freebsd.org/pipermail/freebsd-net/2014-May/038702.html http://people.freebsd.org/~adrian/norse/20140514-tcp-rss-timers-1.diff http://people.freebsd.org/~adrian/norse/20140514-tcp-rss-timers-2.diff Verisign folks present their code on screen. Patch to follow, this is their internal github I believe. Switching inpcb lock to reader/writer has doubled the connection setup/teardown performance (connections per second)

alternate FIBs (routing tables)

Low end machines can live with the current, slow routing tables (perhaps with a bit of caching), but we should provide hooks for alternatives such as Marko Zec's DXR, or other solutions which may come up.

Luigi will put up a projects branch so people can do an API review TCP/UDP segmentation (and reassembly), checksums

Introduce support for software TCP segmentation, as low as possible in the stack so that we can push large segments down to the driver where it will be taken care of. This saves high repetitive costs for IP and ARP lookups on every segment. GSO is segmentation offloading in software above the driver, but we could also do something more efficient (sometimes called DSO) operating in the driver which could pre-allocate the extra buffers for headers. NIC-based checksums should be used if possible, and in any case support should be moved down the stack.

Luigi can get a patch put up. Mike Karels asks about perf measurements. Luigi says that with their implementation in software they can get up to 9.2Gbps on hardware with segmentation turned off.

ifnet abstraction - see previous emails from Juniper ? The code is fully working and being tested for deployment at Juniper. gnn has patches (and marcel has said he'll share it general) One thing to add is inlining of a lot of code. Makes calls to ifnet procedural but can be inlined.

ifnet extensions for multiqueue / ethtool support

we should provide info on queue numbers and sizes, and other features that might be of use for multiqueue devices, mostly in the control path (so they can be externally linked). In linux these are managed with device-independent tools, which is much better than the custom methods we have now, and avoids polluting the ifnet with extra information.

Navdeep says he is interested in having the ifnet layer be aware of multiqueue Eric Davis (broadcom) - Highly configurable filtering engines are being pushed down into hardware. As stacks are pushed into user space it is important to be able to have control over NIC rings to steer traffic as needed. With multiqueue support we envision three types of demuxing: traditional RSS, filtered direction, and load balancing (single flow spread across queues).

mbuf revisiting

We need to move forward on this on a number of things.

'ephemeral' mbufs (which must be consumed inline, e.g. allocated on the stack or otherwise owned by the caller). These have a large number of uses. Navdeep says that proper use of the M_FREE bit provides this functionality already. in-mbuf space for mtags (largely used for dummynet and perhaps other things). arrays vs linked lists of mbufs (trying to reduce the cost of chain traversals) Issue at Netflix. Takes a lot of CPU time to walk mbuf chains (many cache misses). Navdeep says there was a similar mechanism in FreeBSD 7.0 or 7.1 time frame, but it was removed [no consumers or didn’t work]. Can look at that. L2/L3/L4 pointers in the pkthdr to avoid repeated packet parsing Might add a lot of space to each packet. Not concerned about that, but is the time it would take to populate these pointers really useful?

backpressure

FreeBSD has no good backpressure mechanism to signal sockets that a link is busy. The most evident symptom is that UDP sockets are never blocking on transmission but can generate significant losses. For TCP, we need to backoff when the tx queue is full and hope that incoming acks will wake us up. In linux, buffers in the tx queue hold a reference to the socket so completions can be used to notify sockets. Implementing the same mechanism in FreeBSD should be relatively straightforward.

Lawrence: As I understand it, Linux also defers socket updates until transmission time, e.g. there is implicit backpressure because they do not decrement the socket buffer count until the packet is actually put on the wire. You still have the ability to send packets which get gratuitously dropped below you. Waiting RTT to retransmit really sucks. Does anyone have any experience in a custom BSD stack implementing any sort of backpressure? Discussion about different ways to detect the problem, e.g. by probing the driver to see if it still has space. The cause here is simply heavy load (the driver’s hardware is full). Another issue is the “thundering herd” problem — really want a scheduler rather than simply waking every socket with data ready to transmit when the driver is ready to accept data again. Juniper has dealt with this, in part by having a “quench” mechanism for an interface, but the code may not be shareable (and may not work with FreeBSD network stack). A comment from (Scott?) — In the storage world, we never fail an I/O at entry, we always queue it up (infinitely long queue if necessary). Don’t signal back up the stack with the resulting thundering herd problem. Rather, just process all I/O operations in order when hardware resources are available. Navdeep points out that in networking, there is no hard cap on the amount of memory in the queue. Agreement that this needs to be dealt with, perhaps by notifying the application.

VLAN QinQ

Nested VLAN tags are found in some environments. It would be good if we supported that. Also, it would be nice to import rwatson@'s VLAN priority patch.

Can kind of do nested VLAN with netgraph today, but it would be better to support it directly, especially if it is a small patch. pfsense has support for priority bits already…

vxlan

Similar to VLAN, but packets are instead encapsulated in UDP so they can be routed, and IP multicasting is used to do broadcasts.

Lightweight refcounters

We have several objects (interfaces, bridges, modules, etc.) that are long lived but do change occasionally, so some form of reference counting is needed. We are looking for solutions that are cheap on acquire/release.

Gleb Smirnoff showed some slides from his presentation for context. Questions from Marcel @ Juniper — (a) x86-specific? yes, at least for now; other architectures can use critical sections. (b) code is not inlineable since it depends on checking PC @ interrupt time? correct. Patch was done about a month ago, some experiments done; no real data yet. Not API-compatible with RCU; can’t take a driver built for RCU and just have it call this instead.

DCTCP Patch for Review

http://simula.stanford.edu/~alizade/Site/DCTCP.html

MPTCP Lawrence will get us patches

Attending

some raw notes

Batching.

Luigi and people working on high-speed device drivers found that locking, etc. for every packet was very expensive. Batching amortizes these fixed costs. It would be good to support batching; we move networks up & down the stack in many places. Input is a good place to start, one interrupt usually already gives you a batch of packets. Output is also important too. How to do it in a backward-compatible way, so the entire stack doesn't need to change at once? mbufs already have a m_nextpkt field. Work on this stalled a bit because some people wanted a perfect solution, with errors caught by the compiler, etc. if_input & ether_output are two functions we could start with. Of course, we want performance analysis before committing any code. Think that would be relatively easy to debug.
- Suggestion (Alfred): We could use a flag in the mbuf, so that only if the flag is set you examine m_nextpkt. Older software wouldn’t set it and hence wouldn’t run batching code. Question: How similar is scatter/gather suggestion from Netflix?
  - Scott: I was thinking more single-packet, rather than multi-packet. Will be covered in “mbuf revisiting” topic later. Luigi: This approach is different, and attacks primarily the input buffer problem, where you have potentially multiple packets for different connections. Question: This [scatter/gather] would never be used on receive?
    - Some interfaces might spread traffic onto different buffers in their ring.
Question: Has anyone gotten halfway through developing batching? No? Many layers release a lock to avoid deadlock, handle one packet, take the lock again. Batching allows this to be done once for several packets instead. Is anyone planning to work on this? Scott from Netflix says, probably this fall. For input buffer, you would need to sort the packets (by consumer) into new batches. Not sure how much this will help. For one connection, something like large-receive offload would be similar (reducing the number of packets). Already implemented in some drivers. Batching would improve the case where traffic is received for multiple connections [or if the driver doesn’t have LRO or has limitations]. Has to be a two-way contract; there has to be a flag in the ifp that says “I will take packets with the batch flag sent” to ensure they will be processed (m_nextpkt not ignored). Or use a different entry point.

Parallelism.

PR 183659.
- http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/183659
Long-standing connections are mostly parallelized, the remaining problem is mostly when establishing or tearing down connections. TCP info lock is a point of contention (top 6 lock profiling entries are all related to this). Various patches. Two accepted so far.
- Don’t lock it from userspace accept() call. Hold lock less when handling timeouts.
SYN/FIN packets. Ideas.
- Make TCP info lock, or a new lock, the last lock acquired.
  PCB groups. Verisign tried this, based on a hash. Did not work because of [code shared between UDP & TCP?] … Verisign has an experimental patch which changes how the read & write locks are taken. (More often read than write).
  - Testing in pre-production systems with a synthetic workload (need other eyes). Suggestions: (a) tell the community what the workload being tested is, when the patch is released; (b) produce a dtrace flamegraph to show the effect of the change.
    - The workload is published in next week’s FreeBSD Journal.
  Adrian also has some ideas, including using RSS-like ideas to avoid locking at all.
  - http://people.freebsd.org/~adrian/norse/20140514-tcp-rss-timers-1.diff http://people.freebsd.org/~adrian/norse/20140514-tcp-rss-timers-2.diff
  Verisign experimental patches.
  - Avoid taking lock on TH_SYN. Add global pcbinfo list lock, to allow deferring info lock.
    - This is the first step towards removing the [write-mode] “inp” lock.
      Shared code between TCP & UDP, because INP lock is shared.
    Change to use read inp lock where possible (using pcbinfo list lock).
  Collectively, 2x improvement measured on a 12-core machine using 8 RSS queues.
Luigi: It would be great to have some text describing the new locking model with this type of change. Luigi: Not really comfortable with adding a single global lock (as the pcb list lock). Question: Is it just a single listening socket, or multiple?
- Have done it with both…
Sandvine: tcpinfo locking is too much, has ideas for reducing contention.
- code accepted: New lock for time_wait timers.
- provide detailed analysis: kern/183659 also “status wiki page???” with suggestions+ideas basic idea is to only use tcpinfo as leaf lock. SYN/FIN/RST
- Switch use of read/write lock and locking the inp instead result: *60k -> 130k (200% performance improvement)*
- note: tested in synthetic workload, want more testing.
Patch not to take INP_INFO lock on input for SYN
- Lock order is INP_INFO then INP lock.
Aggressive caching of the route in the tcpcb.
- There are races, but they are “ok” as they will happen anyhow.
- Add a function to allow for plugging in different routing table algorithms.

Alternative FIBs.

Luigi is advocating for an approach which does aggressive caching and is not too bothered if there is wrong information for a while. Current approach does a lookup based on the first N bits of the destination address and then a binary search. Marko’s work has been stalled for about a year, he published some patches last year for FreeBSD 8.
- Community probably cares the most about seeing the API, so having it on a private branch for people to look at it would be good.

TCP/UDP segmentation/reassembly & checksums.

We don’t have software TCP segmentation, we have to carry information in the mbufs.
Performance was doubled, without hardware support, by doing segmentation very low in the stack, right before input into driver. (Student project.) Linux calls this approach GSO, pushing large segments through the stack; the hardware can do segmentation if supported, otherwise we do it at the bottom layer. Simplifies TCP code since you can send arbitrarily large segments. George says: Put up a patch. He’s concerned about the quality of student code. (Needs review as always.) Basic approach: If you have a feature which is sometimes supported by hardware, assume you have it, and if the hardware doesn’t, emulate it at the lowest possible point. Question: How does this interact with the batching idea?
- If you get a huge segment, and the NIC does not support it, translate the huge segment into a batch and send that.

ifnet abstraction.

Compile-time option to let the procedural interface be inlined… At Juniper, we want the hardware drivers to be compiled once and be usable with either the Juniper network stack or FreeBSD network stack. Could be finished within two weeks. Challenge has been … trying to get an engineer to handle the commit … but this is now the third BSDcan where something should have been checked in already and isn’t.
- George can work to help get this taken care of.
Andre’s changes [?] to the ifnet layer have stopped, no more churn.

ifnet extensions for multiqueue/ethtool

Linux has their standard ifnet interface, with a single pointer to the extensions; if the interface does not support them, the system still runs. If it does, have interfaces to configure numbers of queues, numbers of buffers, etc.
- All of this is slow-path (configuration) code.
Think we should go for a similar route — ease configuration of 10gig interfaces. Do driver developers think this is a good idea? [Navdeep] It would be a good idea to have the ifnet layer aware of multiqueue. The queue scheduling is done within the NIC [driver?], it would be good to move all this into the ifnet layer. Getting to the point where RSS is common, putting complex filtering into hardware; it would be useful to be able to control this, steering traffic appropriately to the right queues etc.
- Especially important if you are running the stack in userland ala DPDK. There will be FreeBSD reference drivers for new hardware.
Question: Can we also move the buffering [which?] into this layer?

mbuf revisiting.

Ephemeral mbufs: Allocated by the caller, e.g. on the stack.
- Already exists with M_NOFREE?
In-mbuf space for mtags (for dummynet and perhaps other things). Arrays vs. linked lists of mbufs.
- Netflix case: takes a lot of time to walk mbuf lists. Navdeep: in 7.0 or 7.1 there was a similar mechanism, but it was removed because there were no consumers or it didn’t work. Lots of cache misses when processing these lists.
L2/L3/L4 pointers in the pkthdr could avoid repeated packet processing.
- Several drivers have to do the same parsing to e.g. determine whether a packet is TCP and find various fields. Luigi: Might add a huge number of bits to each packet. The issue is not so much the size as the time it takes to prepare. Is it really useful?

backpressure.

In Linux, if you stop and there is a packet in the queue, you notify the socket. In principle it is easy to do this. At end of transmission it might be tricky, it is interrupt context and might conflict with other processing being done on the socket itself. Lawrence: As I understand it, Linux also defers socket updates until transmission time, e.g. there is implicit backpressure because they do not decrement the socket buffer count until the packet is actually put on the wire. You still have the ability to send packets which get gratuitously dropped below you. Waiting RTT to retransmit really sucks. Does anyone have any experience in a custom BSD stack implementing any sort of backpressure?
- Have potentially multiple transmit queues; when in a backpressure situation and an mbuf is marked as complete, want some mechanism for waking applications which avoids the thundering herd problem. Ideally would have a scheduler, but at least a first start would be to have a list of sockets to be woken up via a task queue when a driver has space available. Haven’t thought this through in detail yet.
Q: Can you detect the problem in some other way? e.g. is there a way to detect that the driver is running out of space? what causes this? … A: The cause is purely that the hardware is full. Even for non-blocking transmission, don’t want to drop packets. In the Juniper network stack, an interface has a “quench” flag which gives it a more permanent characteristic. Getting back just an error indication is usually not enough on a heavily loaded system. Obviously Juniper has this sort of problem too (and also similar problems for input).
- Does Juniper have some code? Yes, but it may not be shareable; it also may not be code which would work with the FreeBSD network stack.
Comment (Scott?): In the storage world, we never fail an I/O at entry, we always queue it up (infinitely long queue if necessary). Don’t signal back up the stack with the resulting thundering herd problem. Rather, just process all I/O operations in order when hardware resources are available. Navdeep: There is no hard cap on the amount of memory in the queue.

vlan QinQ.

Encapsulating VLAN in VLAN. Is there interest in this? Can kind of do it with netgraph today, but not a perfect solution. Some people need this for a project, look at using FreeBSD, and decide they can’t use it because it doesn’t support it; it would be good for us to support it, especially if it is a small patch. Some NICs have support for this nested case (e.g. checksum offloading of encapsulated packet). Main use case would be in datacenter, e.g. a server is moved across the datacenter (using VLAN) but don’t want to change apps which assume it’s on a different VLAN. Also, some bits in VLAN are for priority.
- pfsense already has support for this, there is some overhead since it uses mbuf tags, but it works.
bryan V:
- stack vlan interfaces.
- patch is small.
- should go in, relatively simple.
- also has patch to enable bits for vlan priority.
- Suggestion from group: perhaps using virtual interface nesting makes it cleaner.

vxlan.

lightweight refcounters.

Gleb Smirnoff showed some slides from his presentation for context.
Removing some locks, and making assumptions about a static configuration, was shown to greatly reduce cache line contention & improve throughput of a FreeBSD-derived network stack. counter_u64 (implemented as a set of per-cpu variables) was a big performance improvement for networking. Want to go further. Can we build a safe reference counting implementation on top of this? … Question [Marcel @ Juniper] — Is this x86 specific? [During a presentation of assembly language and restart-after-interrupt.]
- Yes right now, but other architectures can use critical sections (at higher cost). Other architectures could potentially use a similar mechanism to reduce overhead.
Code is not inlineable if interrupt works by checking PC. [Another question from Marcel.] Question: Have you looked at the Dragonfly lightweight mechanism?
- Think it is similar to RCU… Idea here is to be very lightweight, but absolutely not RCU to avoid legal issues.
Patch was done about a month ago, some experiments done; no real data yet. Question: What data structures would you use this for? For example, routing tables or ipfw rules… Question: Is this API-compatible with RCU, so that someone with a driver which used RCU could easily transition? Not really…

Navdeep has run into problems with bufring, particularly on Haswell CPUs, where the queue thinks it is full when it is not. (Probably missing a memory barrier?) Have others seen this?

Anton suggested examining the Intel DPDK — it has an implementation of bufring which presumably has been tested on Haswell CPUs. DPDK is under BSD license. Found bugs in the CAM layer where there are reads coming from the bufring which are not atomic. If you wrap in proper atomics, it will fix these problems. But atomics are too expensive… Shows up trivially on Haswell.

Results

(Add a list or attach slides detailing the achieved results here.)

DevSummit/201405/NetworkStack (last edited 2021-04-25T09:09:46+0000 by JethroNederhof)