FreeBSD Developer Summit: Network Stack

Thursday May 15th, 9.00-12.00 location TBD


This working group will discuss networking stack topics. Please send proposals for topics to Luigi Rizzo; they will be summarized here, and a selection will then be made based on interest and relevance. Preference will be given to topics that actually need interactive discussion rather than one-way communication or status updates.


A list of changes and APIs that will help FreeBSD make better use of the features listed above.


The order is just for naming topics


Topic Description






alternate FIBs


TCP segmentation and reassembly


ifnet abstraction


ifnet extensions for multiqueue/ethtool


mbuf revisiting




vlan QinQ, vxlan


Lightweight refcounters


In order to attend you need to register for the developer summit, register by email for the session, and be confirmed by the working group organizer. Follow the guidelines described on the main page or in the email you received. For questions, or if in doubt, ask the session chairs.

Please do NOT add yourself here. Your name will appear automatically once you have received the confirmation email. You need to put your name on the general developer summit attendees list though.


Username / Affiliation

Topics of Interest


Michael Bentkofsky



Kevin Bowling


Chris Busick

Adrian Chadd


Julien Charbon


TCP stack parallelism

David Christensen



Eric Davis


Suresh Gumpula


John Mark Gurney


Hash Haven


OFED, rx livelock

Marcus Hoffman


Mike Karels


Joe Kloss


Jeremiah Lott

Avere Systems

Jeffrey Meegan


Navdeep Parhar

navdeep / chelsio

Luis Otavio O Souza

Alfred Perlstein


Luigi Rizzo

luigi / Univ. Pisa

Jonathan Saks


Hiroki Sato

Mike Silbersack


Gleb Smirnoff


lightweight refcounters

Matt Smith


Lawrence Stewart

lstewart / netflix


Jim Thompson


Bryan Venteicher


mbuf ext, vlan, vxlan

Claudia Y


Bjoern Zeeb


Notes (dumped below)

Network Session BSDCan May 2014


Can we use a flag in the mbuf to handle batching? This may relate to scatter/gather lists (which are going to be discussed under mbuf revisiting). At the moment there are no outstanding patches. Netflix suggests they may work on this in the fall. This needs a modification to ether_input. Drivers would have to know about the ether_input changes: a short list of ifp functions needs a batched version added.
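One possible shape for a batched input path is sketched below. It is only an illustration under stated assumptions: struct pkt, input_one() and input_batch() are hypothetical stand-ins and not the kernel's struct mbuf or ether_input; the only idea modeled is that a driver hands a chain of packets to the stack in a single call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet buffer; only the batch link is modeled. */
struct pkt {
	struct pkt *nextpkt;	/* next packet in the batch */
	int	    len;	/* payload length in bytes */
};

/* Per-packet input path: here it only accounts for the bytes received. */
static int
input_one(struct pkt *p, long *bytes)
{
	*bytes += p->len;
	return (1);
}

/* Batched variant: the driver passes a whole chain in one call. */
static int
input_batch(struct pkt *head, long *bytes)
{
	int n = 0;

	assert(bytes != NULL);
	while (head != NULL) {
		struct pkt *next = head->nextpkt;

		head->nextpkt = NULL;	/* detach before passing it up */
		n += input_one(head, bytes);
		head = next;
	}
	return (n);
}
```

With this shape a driver that only knows the single-packet call keeps working, since a chain of length one is a valid batch.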


TCP_INFO lock is a major point of contention for these. Two patches accepted so far: don’t lock from accept(); hold the lock for a shorter interval when handling timeouts. Verisign tried the pcbgroups lock but that didn't scale (there are also issues with code shared between TCP and UDP). There is an unshared patch from Verisign that changes how the read and write locks are taken (read more often than write). Adrian has a patch to use the RSS hash everywhere to prevent the use of locks? Verisign folks presented their code on screen; a patch is to follow (this is their internal github, I believe). Switching the inpcb lock to reader/writer has doubled connection setup/teardown performance (connections per second).

alternate FIBs (routing tables)

Luigi will put up a projects branch so people can do an API review.

TCP/UDP segmentation (and reassembly), checksums

Luigi can get a patch put up. Mike Karels asks about perf measurements. Luigi says that with their implementation in software they can get up to 9.2Gbps on hardware with segmentation turned off.

ifnet abstraction

See previous emails from Juniper(?). The code is fully working and being tested for deployment at Juniper. gnn has patches (and marcel has said he'll share it generally). One thing to add is inlining of a lot of code: it makes calls to ifnet procedural, but they can be inlined.

ifnet extensions for multiqueue / ethtool support

Navdeep says he is interested in having the ifnet layer be aware of multiqueue.

Eric Davis (Broadcom): Highly configurable filtering engines are being pushed down into hardware. As stacks are pushed into user space it is important to be able to have control over NIC rings to steer traffic as needed. With multiqueue support we envision three types of demuxing: traditional RSS, filtered direction, and load balancing (single flow spread across queues).
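The three demuxing types mentioned above can be contrasted with a small model. This is a sketch under stated assumptions, not a proposed ifnet API: struct steer, struct flow and pick_queue() are hypothetical names, the filter is reduced to a single destination-port match, and load balancing is modeled as plain round-robin.

```c
#include <assert.h>
#include <stdint.h>

enum demux_mode { DEMUX_RSS, DEMUX_FILTER, DEMUX_SPREAD };

struct flow {
	uint32_t hash;		/* flow hash, e.g. as supplied by the NIC */
	uint16_t dst_port;	/* destination port, used by the filter */
};

struct steer {
	enum demux_mode mode;
	unsigned nqueues;	/* number of receive queues */
	uint16_t filter_port;	/* DEMUX_FILTER: port to match */
	unsigned filter_queue;	/* DEMUX_FILTER: queue for matching packets */
	unsigned next;		/* DEMUX_SPREAD: round-robin cursor */
};

static unsigned
pick_queue(struct steer *s, const struct flow *f)
{
	assert(s->nqueues > 0);
	switch (s->mode) {
	case DEMUX_FILTER:
		/* Explicit rule first; unmatched traffic falls back to hashing. */
		if (f->dst_port == s->filter_port)
			return (s->filter_queue);
		return (f->hash % s->nqueues);
	case DEMUX_SPREAD:
		/* A single flow is spread over all queues, packet by packet. */
		return (s->next++ % s->nqueues);
	case DEMUX_RSS:
	default:
		/* Every packet of a flow lands on the same queue. */
		return (f->hash % s->nqueues);
	}
}
```

The point of the contrast is that RSS keeps a flow on one queue, a filter pins selected traffic to a chosen queue, and spreading gives up per-flow ordering in exchange for using all queues.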

mbuf revisiting

'Ephemeral' mbufs (which must be consumed inline, e.g. allocated on the stack or otherwise owned by the caller). These have a large number of uses. Navdeep says that proper use of the M_FREE bit provides this functionality already.

In-mbuf space for mtags (largely used for dummynet and perhaps other things).

Arrays vs linked lists of mbufs (trying to reduce the cost of chain traversals). This is an issue at Netflix: it takes a lot of CPU time to walk mbuf chains (many cache misses). Navdeep says there was a similar mechanism in the FreeBSD 7.0 or 7.1 time frame, but it was removed [no consumers or didn’t work]. Can look at that.

L2/L3/L4 pointers in the pkthdr to avoid repeated packet parsing. This might add a lot of space to each packet. Not concerned about that, but is populating these pointers really worth the time it would take?
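To make the L2/L3/L4 idea concrete, here is a minimal sketch of parsing a frame once and caching the header offsets. It is an assumption-laden illustration: struct hdr_offsets and parse_offsets() are hypothetical and not part of the pkthdr, offsets are used instead of pointers, and only untagged Ethernet carrying IPv4 is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached offsets, as might be stored with a packet header. */
struct hdr_offsets {
	uint16_t l2;	/* start of the Ethernet header */
	uint16_t l3;	/* start of the IPv4 header */
	uint16_t l4;	/* start of the transport header */
	int	 valid;	/* nonzero once the offsets are populated */
};

/* Parse an untagged Ethernet + IPv4 frame once and record the offsets. */
static int
parse_offsets(const uint8_t *buf, size_t len, struct hdr_offsets *o)
{
	size_t ihl;

	assert(buf != NULL && o != NULL);
	o->valid = 0;
	if (len < 14 + 20)
		return (-1);
	if (buf[12] != 0x08 || buf[13] != 0x00)	/* EtherType 0x0800 is IPv4 */
		return (-1);
	ihl = (size_t)(buf[14] & 0x0f) * 4;	/* IPv4 header length in bytes */
	if ((buf[14] >> 4) != 4 || ihl < 20 || len < 14 + ihl)
		return (-1);
	o->l2 = 0;
	o->l3 = 14;
	o->l4 = (uint16_t)(14 + ihl);
	o->valid = 1;
	return (0);
}
```

Later layers would check valid and jump straight to l3 or l4 instead of re-parsing; whether that saves more time than populating the fields costs is exactly the open question raised in the notes.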


Lawrence: As I understand it, Linux also defers socket updates until transmission time, e.g. there is implicit backpressure because they do not decrement the socket buffer count until the packet is actually put on the wire. You still have the ability to send packets which get gratuitously dropped below you. Waiting RTT to retransmit really sucks. Does anyone have any experience in a custom BSD stack implementing any sort of backpressure?

Discussion about different ways to detect the problem, e.g. by probing the driver to see if it still has space. The cause here is simply heavy load (the driver’s hardware is full). Another issue is the “thundering herd” problem: we really want a scheduler rather than simply waking every socket with data ready to transmit when the driver is ready to accept data again.

Juniper has dealt with this, in part by having a “quench” mechanism for an interface, but the code may not be shareable (and may not work with the FreeBSD network stack).

A comment from (Scott?): In the storage world, we never fail an I/O at entry, we always queue it up (infinitely long queue if necessary). Don’t signal back up the stack with the resulting thundering herd problem. Rather, just process all I/O operations in order when hardware resources are available. Navdeep points out that in networking, there is no hard cap on the amount of memory in the queue. Agreement that this needs to be dealt with, perhaps by notifying the application.
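The implicit backpressure described at the start of this discussion can be shown with a toy accounting model. This is a sketch under stated assumptions and is neither Linux nor FreeBSD code: struct sockbuf_model, sb_send() and tx_complete() are hypothetical names, and the only behavior modeled is that bytes stay charged to the socket buffer until the driver reports them as transmitted.

```c
#include <assert.h>

/* Hypothetical socket send buffer: capacity and bytes currently charged. */
struct sockbuf_model {
	unsigned hiwat;	/* capacity in bytes */
	unsigned cc;	/* bytes charged and not yet on the wire */
};

/* The sender is refused (would block) while earlier data is still charged. */
static int
sb_send(struct sockbuf_model *sb, unsigned len)
{
	if (sb->cc + len > sb->hiwat)
		return (-1);
	sb->cc += len;
	return (0);
}

/* Called when the driver reports that the bytes were actually transmitted. */
static void
tx_complete(struct sockbuf_model *sb, unsigned len)
{
	assert(len <= sb->cc);
	sb->cc -= len;
}
```

Because space is released only on completion, a full transmit ring shows up to the sender as a full socket buffer rather than as silent drops below the socket layer.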


Can kind of do nested VLAN with netgraph today, but it would be better to support it directly, especially if it is a small patch. pfsense has support for priority bits already…


Lightweight refcounters

Gleb Smirnoff showed some slides from his presentation for context. Questions from Marcel @ Juniper: (a) x86-specific? Yes, at least for now; other architectures can use critical sections. (b) Code is not inlineable since it depends on checking the PC at interrupt time? Correct. The patch was done about a month ago and some experiments have been done; no real data yet. Not API-compatible with RCU; can’t take a driver built for RCU and just have it call this instead.

DCTCP Patch for Review

MPTCP: Lawrence will get us patches


some raw notes



Alternative FIBs.

TCP/UDP segmentation/reassembly & checksums.

ifnet abstraction.

ifnet extensions for multiqueue/ethtool

mbuf revisiting.


vlan QinQ.


lightweight refcounters.

Navdeep has run into problems with bufring, particularly on Haswell CPUs, where the queue thinks it is full when it is not. (Probably missing a memory barrier?) Have others seen this?
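For readers unfamiliar with why ordering matters in such a ring, here is a minimal single-producer/single-consumer sketch using C11 atomics. It is not the buf_ring code and makes no claim about the cause of the Haswell problem; struct ring, ring_enqueue() and ring_dequeue() are hypothetical, and the sketch only marks where acquire/release pairs are needed between the index updates and the slot accesses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8	/* number of slots, a power of two */

struct ring {
	_Atomic unsigned head;	/* next slot to write, advanced by the producer */
	_Atomic unsigned tail;	/* next slot to read, advanced by the consumer */
	void *slot[RING_SIZE];
};

/* Producer side: returns -1 when the ring is full. */
static int
ring_enqueue(struct ring *r, void *p)
{
	unsigned head, tail;

	assert(p != NULL);	/* NULL is reserved to mean "empty" on dequeue */
	head = atomic_load_explicit(&r->head, memory_order_relaxed);
	/* Acquire pairs with the consumer's release store of tail, so a
	 * slot is known to have been read before it is overwritten. */
	tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	if (head - tail == RING_SIZE)
		return (-1);
	r->slot[head % RING_SIZE] = p;
	/* Release publishes the slot contents before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return (0);
}

/* Consumer side: returns NULL when the ring is empty. */
static void *
ring_dequeue(struct ring *r)
{
	unsigned head, tail;
	void *p;

	tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	/* Acquire pairs with the producer's release store of head. */
	head = atomic_load_explicit(&r->head, memory_order_acquire);
	if (head == tail)
		return (NULL);
	p = r->slot[tail % RING_SIZE];
	/* Release hands the slot back to the producer only after the read. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return (p);
}
```

If either index were read or written without the appropriate ordering, one side could act on an inconsistent view of the other side's progress, which is the general class of bug the question about a missing memory barrier points at.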



201405DevSummit/NetworkStack (last edited 2014-05-26 18:01:52 by AlfredPerlstein)