FreeBSD Developer Summit: Network Stack

Thursday May 15th, 9.00-12.00 location TBD

Overview

This working group will discuss networking stack topics. Please send proposals for topics to Luigi Rizzo (rizzo@iet.unipi.it). They will be summarized here, and a selection will be made based on interest and relevance. Preference will be given to topics that actually need interactive discussion rather than one-way communication or status updates.

Goals

A list of changes and APIs that will help FreeBSD make better use of the features listed under Topics.

Topics

The numbers are only labels for the topics; the order carries no meaning.

| #  | Topic Description                       |
|----|-----------------------------------------|
| 1  | batching                                |
| 2  | parallelism                             |
| 3  | alternate FIBs                          |
| 4  | TCP segmentation and reassembly         |
| 5  | ifnet abstraction                       |
| 6  | ifnet extensions for multiqueue/ethtool |
| 7  | mbuf revisiting                         |
| 8  | backpressure                            |
| 9  | vlan QinQ, vxlan                        |
| 10 | Lightweight refcounters                 |

Attending

In order to attend you need to register for the developer summit as well as by email for the session, and be confirmed by the working group organizer. Follow the guidelines described on the main page or in the email you received. For questions, or if in doubt, ask the session chairs.

Please do NOT add yourself here. Your name will appear automatically once you have received the confirmation email. You need to put your name on the general developer summit attendees list though.

| Name                | Username / Affiliation | Topics of Interest      | Notes |
|---------------------|------------------------|-------------------------|-------|
| Michael Bentkofsky  | Verisign               | all                     |       |
| Kevin Bowling       | LLNW                   |                         |       |
| Chris Busick        |                        |                         |       |
| Adrian Chadd        | adrian                 |                         |       |
| Julien Charbon      | Verisign               | TCP stack parallelism   |       |
| David Christensen   | broadcom               | ethtool                 |       |
| Eric Davis          | edavis                 |                         |       |
| Suresh Gumpula      | Netapp                 |                         |       |
| John Mark Gurney    | jmg                    |                         |       |
| Hash Haven          | isilon                 | OFED, rx livelock       |       |
| Marcus Hoffman      | Norse                  |                         |       |
| Mike Karels         | mcafee                 |                         |       |
| Joe Kloss           | Panasas                |                         |       |
| Jeremiah Lott       | Avere Systems          |                         |       |
| Jeffrey Meegan      | isilon                 |                         |       |
| Navdeep Parhar      | navdeep / chelsio      |                         |       |
| Luis Otavio O Souza |                        |                         |       |
| Alfred Perlstein    | Norse                  |                         |       |
| Luigi Rizzo         | luigi / Univ. Pisa     |                         |       |
| Jonathan Saks       | xinuos                 |                         |       |
| Hiroki Sato         |                        |                         |       |
| Mike Silbersack     | silby                  |                         |       |
| Gleb Smirnoff       | glebius                | lightweight refcounters |       |
| Matt Smith          | netgate                |                         |       |
| Lawrence Stewart    | lstewart / netflix     | backpressure            |       |
| Jim Thompson        | netgate                |                         |       |
| Bryan Venteicher    | bryanv                 | mbuf ext, vlan, vxlan   |       |
| Claudia Y           | Norse                  |                         |       |
| Bjoern Zeeb         | bz                     |                         |       |

Notes

https://etherpad.wikimedia.org/p/123www (dumped below)

Network Session BSDCan May 2014

Batching

Can we use a flag in the mbuf to handle batching? This may relate to scatter/gather lists (which will be discussed under mbuf revisiting).

At the moment there are no outstanding patches. Netflix suggests they may work on this in the fall.

This would require a modification to ether_input. Drivers would have to know about the ether_input changes; a short list of ifp functions would need a batched version added.
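As a rough illustration of what a batched input path could look like, the sketch below hands a whole chain of packets to the stack in one call instead of one call per packet. Everything here is hypothetical: `struct pkt` is a toy stand-in for an mbuf packet header (its `nextpkt` link plays the role of `m_nextpkt`), and `input_batch` and `struct if_stats` are invented names, not proposed APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an mbuf packet header; NOT the real struct mbuf. */
struct pkt {
	struct pkt *nextpkt;	/* next packet in the batch (cf. m_nextpkt) */
	size_t      len;	/* packet length in bytes */
};

/* Hypothetical per-interface counters. */
struct if_stats {
	unsigned long ipackets;	/* packets delivered */
	unsigned long ibytes;	/* bytes delivered */
	unsigned long calls;	/* times the input path was entered */
};

/* Per-packet work; in a real stack this would be the protocol demux. */
static void
pkt_deliver(struct pkt *p, struct if_stats *st)
{
	st->ipackets++;
	st->ibytes += p->len;
}

/*
 * Hypothetical batched input routine: the driver links received packets
 * through nextpkt and enters the stack once for the whole chain.
 */
static void
input_batch(struct pkt *head, struct if_stats *st)
{
	struct pkt *p, *next;

	assert(st != NULL);
	st->calls++;
	for (p = head; p != NULL; p = next) {
		next = p->nextpkt;	/* save the link before unlinking */
		p->nextpkt = NULL;	/* each packet is delivered on its own */
		pkt_deliver(p, st);
	}
}
```

The point of the sketch is only that per-call overhead (function entry, locking, statistics) is paid once per batch rather than once per packet; how a real driver would build the chain is left open.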

Parallelism

The TCP INP_INFO lock is a major point of contention. Two patches have been accepted so far: don't take the lock in accept(), and hold it for a shorter interval when handling timeouts.

Verisign tried the pcbgroups lock, but that didn't scale (there are also issues with code shared between TCP and UDP). There is a not-yet-shared patch from Verisign that changes how the read and write locks are taken (more often read than write).

Adrian has a patch to use the RSS hash everywhere to prevent the use of locks?

http://lists.freebsd.org/pipermail/freebsd-net/2014-May/038702.html
http://people.freebsd.org/~adrian/norse/20140514-tcp-rss-timers-1.diff
http://people.freebsd.org/~adrian/norse/20140514-tcp-rss-timers-2.diff

Verisign folks presented their code on screen. Patch to follow; this is their internal github, I believe.

Switching the inpcb lock to a reader/writer lock has doubled connection setup/teardown performance (connections per second).
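The idea of using a flow hash to keep all work for a connection in one place can be sketched as follows. This is a minimal illustration, not Adrian's patch: the hash is FNV-1a, used as a stand-in for the Toeplitz hash that real RSS hardware computes, and `struct flow4`, `flow_hash` and `flow_bucket` are invented names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IPv4 4-tuple identifying a TCP connection. */
struct flow4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/*
 * FNV-1a over the 4-tuple. ASSUMPTION: a stand-in only; real RSS uses
 * a Toeplitz hash with a configured key.
 */
static uint32_t
flow_hash(const struct flow4 *f)
{
	const uint32_t words[3] = {
		f->saddr, f->daddr,
		((uint32_t)f->sport << 16) | f->dport
	};
	uint32_t h = 2166136261u;	/* FNV offset basis */

	for (int i = 0; i < 3; i++) {
		for (int b = 0; b < 4; b++) {
			h ^= (words[i] >> (8 * b)) & 0xffu;
			h *= 16777619u;	/* FNV prime */
		}
	}
	return h;
}

/*
 * Map a flow to one of nbuckets (e.g. one per CPU). Every packet and
 * timer for the same connection lands in the same bucket, so state in
 * that bucket is only ever touched from one place.
 */
static unsigned
flow_bucket(const struct flow4 *f, unsigned nbuckets)
{
	assert(nbuckets > 0);
	return flow_hash(f) % nbuckets;
}
```

If timers and input processing for a connection are both dispatched by `flow_bucket`, they never run concurrently on different CPUs, which is what would allow some locks to be avoided.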

alternate FIBs (routing tables)

Luigi will put up a projects branch so people can do an API review.

TCP/UDP segmentation (and reassembly), checksums

Luigi can get a patch put up. Mike Karels asks about perf measurements. Luigi says that with their software implementation they can get up to 9.2 Gbps on hardware with segmentation turned off.
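For readers unfamiliar with software segmentation, the core step is splitting one large send into MSS-sized pieces with consecutive sequence numbers. The sketch below shows only that arithmetic; `struct seg` and `sw_segment` are hypothetical names, and header construction and checksums are deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one output segment. */
struct seg {
	uint32_t seq;	/* TCP sequence number of the first byte */
	size_t   off;	/* offset of the segment within the payload */
	size_t   len;	/* bytes of payload in this segment */
};

/*
 * Split payload_len bytes starting at sequence number seq0 into
 * segments of at most mss bytes. Writes up to max entries to out
 * and returns how many were written.
 */
static size_t
sw_segment(uint32_t seq0, size_t payload_len, size_t mss,
    struct seg *out, size_t max)
{
	size_t n = 0, off = 0;

	assert(mss > 0);
	while (off < payload_len && n < max) {
		size_t len = payload_len - off;

		if (len > mss)
			len = mss;	/* the last segment may be shorter */
		out[n].seq = seq0 + (uint32_t)off;
		out[n].off = off;
		out[n].len = len;
		off += len;
		n++;
	}
	return n;
}
```

For example, a 4000-byte send with an MSS of 1460 yields three segments of 1460, 1460 and 1080 bytes.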

ifnet abstraction

See previous emails from Juniper? The code is fully working and being tested for deployment at Juniper. gnn has patches (and Marcel has said he'll share it generally). One thing to add is inlining of a lot of code: the abstraction makes calls to ifnet procedural, but they can be inlined.

ifnet extensions for multiqueue / ethtool support

Navdeep says he is interested in having the ifnet layer be aware of multiqueue.

Eric Davis (broadcom): Highly configurable filtering engines are being pushed down into hardware. As stacks are pushed into user space, it is important to have control over NIC rings to steer traffic as needed. With multiqueue support we envision three types of demuxing: traditional RSS, filtered direction, and load balancing (a single flow spread across queues).
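The three demuxing types mentioned above can be contrasted with a small queue-selection sketch. This is an illustration under stated assumptions, not a proposed ifnet API: `steer_select`, `struct steer_ctx` and the rule format (exact match on destination port) are all invented here, and round-robin is just one possible way to spread a single flow across queues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum steer_mode {
	STEER_RSS,	/* traditional RSS: queue chosen by flow hash */
	STEER_FILTER,	/* filtered direction: explicit rule picks the queue */
	STEER_SPREAD	/* load balancing: one flow spread across queues */
};

/* Hypothetical filter rule: exact match on destination port. */
struct steer_rule {
	uint16_t dport;
	unsigned queue;
};

struct steer_ctx {
	enum steer_mode          mode;
	unsigned                 nqueues;
	const struct steer_rule *rules;
	size_t                   nrules;
	unsigned                 next;	/* round-robin cursor for STEER_SPREAD */
};

/* Pick the receive queue for one packet. */
static unsigned
steer_select(struct steer_ctx *c, uint32_t flow_hash, uint16_t dport)
{
	assert(c->nqueues > 0);
	switch (c->mode) {
	case STEER_FILTER:
		for (size_t i = 0; i < c->nrules; i++)
			if (c->rules[i].dport == dport)
				return c->rules[i].queue % c->nqueues;
		/* No rule matched: behave like plain RSS. */
		return flow_hash % c->nqueues;
	case STEER_SPREAD: {
		unsigned q = c->next % c->nqueues;

		c->next = (q + 1) % c->nqueues;
		return q;
	}
	case STEER_RSS:
	default:
		return flow_hash % c->nqueues;
	}
}
```

Note that spreading one flow across queues gives up the per-flow ordering that RSS preserves, so a consumer of that mode has to tolerate reordering.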

mbuf revisiting

'Ephemeral' mbufs (which must be consumed inline, e.g. allocated on the stack or otherwise owned by the caller). These have a large number of uses. Navdeep says that proper use of the M_FREE bit provides this functionality already.

In-mbuf space for mtags (largely used for dummynet and perhaps other things).

Arrays vs linked lists of mbufs (trying to reduce the cost of chain traversals). This is an issue at Netflix: it takes a lot of CPU time to walk mbuf chains (many cache misses). Navdeep says there was a similar mechanism in the FreeBSD 7.0 or 7.1 time frame, but it was removed [no consumers or didn't work]. Can look at that.

L2/L3/L4 pointers in the pkthdr to avoid repeated packet parsing. This might add a lot of space to each packet. Not concerned about that, but is populating these pointers really worth the time it would take?
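To make the L2/L3/L4 idea concrete, here is a minimal sketch of parsing a frame once and caching where each header starts. It is an assumption-laden illustration: `struct pkt_offsets` and `parse_offsets` are invented, offsets are stored instead of pointers (they are smaller and stay valid if the buffer moves), and only untagged Ethernet carrying IPv4 is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHER_HDR_LEN_SKETCH	14	/* dst MAC + src MAC + ethertype */

/* Hypothetical cached header offsets, as might live in a pkthdr. */
struct pkt_offsets {
	uint16_t l2;	/* start of the Ethernet header */
	uint16_t l3;	/* start of the IP header */
	uint16_t l4;	/* start of the transport header */
	int      valid;	/* nonzero once the offsets are filled in */
};

/*
 * Parse an untagged Ethernet/IPv4 frame once and record the offsets.
 * Returns 0 on success, -1 if the frame is too short or not IPv4.
 */
static int
parse_offsets(const uint8_t *frame, size_t len, struct pkt_offsets *o)
{
	size_t ihl;

	assert(frame != NULL && o != NULL);
	o->valid = 0;
	if (len < ETHER_HDR_LEN_SKETCH + 20)
		return -1;
	if (frame[12] != 0x08 || frame[13] != 0x00)
		return -1;			/* ethertype is not IPv4 */
	if ((frame[14] >> 4) != 4)
		return -1;			/* IP version is not 4 */
	ihl = (size_t)(frame[14] & 0x0f) * 4;	/* IP header length in bytes */
	if (ihl < 20 || ETHER_HDR_LEN_SKETCH + ihl > len)
		return -1;
	o->l2 = 0;
	o->l3 = ETHER_HDR_LEN_SKETCH;
	o->l4 = (uint16_t)(ETHER_HDR_LEN_SKETCH + ihl);
	o->valid = 1;
	return 0;
}
```

Later layers could then read `o->l4` directly instead of re-deriving it from the IP header length, which is the saving the notes are weighing against the cost of filling the fields in.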

backpressure

Lawrence: As I understand it, Linux also defers socket updates until transmission time, i.e. there is implicit backpressure because they do not decrement the socket buffer count until the packet is actually put on the wire. You still have the ability to send packets which get gratuitously dropped below you. Waiting an RTT to retransmit really sucks. Does anyone have any experience implementing any sort of backpressure in a custom BSD stack?

Discussion about different ways to detect the problem, e.g. by probing the driver to see if it still has space. The cause here is simply heavy load (the driver's hardware is full).

Another issue is the "thundering herd" problem: we really want a scheduler rather than simply waking every socket with data ready to transmit when the driver is ready to accept data again.

Juniper has dealt with this, in part by having a "quench" mechanism for an interface, but the code may not be shareable (and may not work with the FreeBSD network stack).

A comment from (Scott?): In the storage world, we never fail an I/O at entry; we always queue it up (infinitely long queue if necessary). Don't signal back up the stack, with the resulting thundering herd problem. Rather, just process all I/O operations in order when hardware resources are available. Navdeep points out that in networking, there is no hard cap on the amount of memory in the queue. Agreement that this needs to be dealt with, perhaps by notifying the application.
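The implicit backpressure Lawrence describes (not releasing socket buffer space until the packet is on the wire) can be sketched as simple accounting. This is a toy model under assumptions: `struct sockbuf_acct`, `sb_try_send` and `sb_tx_complete` are invented names and do not correspond to the real sockbuf code in either Linux or FreeBSD.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical send-buffer accounting for one socket. */
struct sockbuf_acct {
	size_t hiwat;	/* maximum bytes the socket may have outstanding */
	size_t charged;	/* bytes handed down but not yet on the wire */
};

/*
 * Try to send len bytes. The bytes stay charged to the socket until the
 * driver reports completion, so a busy driver naturally stalls the sender.
 * Returns 1 if accepted, 0 if the caller must wait.
 */
static int
sb_try_send(struct sockbuf_acct *sb, size_t len)
{
	assert(sb->charged <= sb->hiwat);
	if (len > sb->hiwat - sb->charged)
		return 0;
	sb->charged += len;
	return 1;
}

/* Called when the driver has actually transmitted len bytes. */
static void
sb_tx_complete(struct sockbuf_acct *sb, size_t len)
{
	assert(len <= sb->charged);
	sb->charged -= len;
}
```

With this accounting a sender is throttled by transmit completions rather than by drops further down the stack; it does not by itself address the thundering herd issue of deciding which blocked sender to wake first.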

VLAN QinQ

You can sort of do nested VLANs with netgraph today, but it would be better to support them directly, especially if it is a small patch. pfSense already has support for priority bits…
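For reference, a QinQ frame simply stacks two VLAN tags between the MAC addresses and the real ethertype: an outer service tag (TPID 0x88a8 per IEEE 802.1ad, though 0x8100 is also seen in practice) followed by an inner customer tag (TPID 0x8100). Each tag carries a 3-bit priority and a 12-bit VLAN ID. The parser below is a minimal sketch with invented names (`struct vlan_tags`, `parse_vlan_tags`), not code from any FreeBSD patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TPID_CTAG	0x8100	/* IEEE 802.1Q customer tag */
#define TPID_STAG	0x88a8	/* IEEE 802.1ad service tag */

/* Hypothetical result of parsing up to two stacked VLAN tags. */
struct vlan_tags {
	int      ntags;		/* 0, 1 or 2; index 0 is the outer tag */
	uint16_t vid[2];	/* 12-bit VLAN IDs */
	uint8_t  pcp[2];	/* 3-bit priority code points */
	uint16_t ethertype;	/* first type field after the parsed tags */
};

/* Returns 0 on success, -1 if the frame is truncated. */
static int
parse_vlan_tags(const uint8_t *frame, size_t len, struct vlan_tags *t)
{
	size_t off = 12;	/* skip destination and source MAC */

	assert(frame != NULL && t != NULL);
	t->ntags = 0;
	for (;;) {
		uint16_t type, tci;

		if (off + 2 > len)
			return -1;
		type = (uint16_t)((frame[off] << 8) | frame[off + 1]);
		if ((type != TPID_CTAG && type != TPID_STAG) || t->ntags == 2) {
			t->ethertype = type;
			return 0;
		}
		if (off + 4 > len)
			return -1;
		tci = (uint16_t)((frame[off + 2] << 8) | frame[off + 3]);
		t->vid[t->ntags] = tci & 0x0fff;	/* low 12 bits */
		t->pcp[t->ntags] = (uint8_t)(tci >> 13);	/* top 3 bits */
		t->ntags++;
		off += 4;
	}
}
```

The priority bits mentioned in the notes are the `pcp` field here; the remaining bit of the tag (drop eligible / CFI) is ignored by this sketch.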

vxlan

Lightweight refcounters

Gleb Smirnoff showed some slides from his presentation for context.

Questions from Marcel @ Juniper:
(a) x86-specific? Yes, at least for now; other architectures can use critical sections.
(b) Code is not inlineable since it depends on checking the PC at interrupt time? Correct.

The patch was done about a month ago and some experiments have been done; no real data yet.

Not API-compatible with RCU; you can't take a driver built for RCU and just have it call this instead.
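The notes do not describe the data layout, so the following is only a guess at the general shape of a lightweight reference counter: one counter slot per CPU, updated without atomic instructions, with the true count being the sum over all slots. The names (`struct lwref`, `lwref_acquire`, and so on) and the fixed `NCPU_SKETCH` are invented, and the part that makes the real patch safe (restarting an interrupted update by checking the PC at interrupt time, or using a critical section) is NOT modeled here.

```c
#include <assert.h>

#define NCPU_SKETCH	4	/* hypothetical fixed CPU count */

/* One slot per CPU; each CPU only ever writes its own slot. */
struct lwref {
	long count[NCPU_SKETCH];
};

/* Take a reference on the given CPU: a plain, non-atomic increment. */
static void
lwref_acquire(struct lwref *r, unsigned cpu)
{
	assert(cpu < NCPU_SKETCH);
	r->count[cpu]++;
}

/* Drop a reference, possibly on a different CPU than the acquire. */
static void
lwref_release(struct lwref *r, unsigned cpu)
{
	assert(cpu < NCPU_SKETCH);
	r->count[cpu]--;
}

/*
 * The real reference count is the sum of all slots. Individual slots
 * can be negative when acquire and release happen on different CPUs.
 */
static long
lwref_total(const struct lwref *r)
{
	long sum = 0;

	for (unsigned i = 0; i < NCPU_SKETCH; i++)
		sum += r->count[i];
	return sum;
}
```

The fast path touches only CPU-local memory, which is where the "lightweight" comes from; the cost moves to whoever needs the total, since that requires visiting every slot.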

DCTCP Patch for Review

http://simula.stanford.edu/~alizade/Site/DCTCP.html

MPTCP: Lawrence will get us patches.

some raw notes


Batching.

Parallelism.

Alternative FIBs.

TCP/UDP segmentation/reassembly & checksums.

ifnet abstraction.

ifnet extensions for multiqueue/ethtool

mbuf revisiting.

backpressure.

vlan QinQ.

vxlan.

lightweight refcounters.

Navdeep has run into problems with bufring, particularly on Haswell CPUs, where the queue thinks it is full when it is not. (Probably missing a memory barrier?) Have others seen this?
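To illustrate why a missing barrier can make a ring look full (or empty) when it is not, here is a minimal single-producer/single-consumer ring using C11 atomics. It is a sketch only: the real buf_ring is a different, multi-producer design, and the names `struct ring`, `ring_enqueue` and `ring_dequeue` are invented. The relevant part is the acquire/release pairing on the head and tail indices.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE	8	/* must be a power of two */

/* Hypothetical SPSC ring; NOT the real struct buf_ring. */
struct ring {
	_Atomic unsigned head;	/* next slot to write; owned by producer */
	_Atomic unsigned tail;	/* next slot to read; owned by consumer */
	void *slot[RING_SIZE];
};

static void
ring_init(struct ring *r)
{
	atomic_init(&r->head, 0);
	atomic_init(&r->tail, 0);
}

/* Producer side. Returns 0 on success, -1 if the ring is full. */
static int
ring_enqueue(struct ring *r, void *item)
{
	unsigned head, tail;

	assert(item != NULL);	/* NULL is reserved to mean "empty" */
	head = atomic_load_explicit(&r->head, memory_order_relaxed);
	/* Acquire: see the consumer's latest tail, or we may report
	 * "full" based on a stale value. */
	tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	if (head - tail == RING_SIZE)
		return -1;
	r->slot[head % RING_SIZE] = item;
	/* Release: publish the slot contents before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 0;
}

/* Consumer side. Returns the item, or NULL if the ring is empty. */
static void *
ring_dequeue(struct ring *r)
{
	unsigned head, tail;
	void *item;

	tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	/* Acquire: pairs with the producer's release store to head. */
	head = atomic_load_explicit(&r->head, memory_order_acquire);
	if (head == tail)
		return NULL;
	item = r->slot[tail % RING_SIZE];
	/* Release: the slot has been read before it is handed back. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return item;
}
```

If the loads of the other side's index were reordered or served stale, the `head - tail == RING_SIZE` check could fail spuriously, which matches the symptom described, though whether that is the actual cause on Haswell is not established in these notes.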

Results

(Add a list or attach slides detailing the achieved results here.)

201405DevSummit/NetworkStack (last edited 2014-05-26 18:01:52 by AlfredPerlstein)