DevSummit/201806/Transport

FreeBSD Developer Summit: Transport Working Group

June 7, 2018 (Thursday), 13:00-17:00

DMS 1160

Overview

We will discuss ongoing work on, and ideas for improvements to, the transport protocols in the FreeBSD kernel.

There is a group that meets regularly to discuss transport work in the kernel. Notes of the group's work can be found in TransportProtocols. The face-to-face time will allow us to whiteboard and discuss complex topics at length. It will also allow us to include participants who are not able to make the regular meetings.

If you would like to participate, contact the working group chairs below and CC devsummit@. You will then be added to this page. Please include a list of things you want to talk about or the areas you are interested in. This helps us plan the session and bring together people with common interests.

It may be possible to bring in people who cannot attend in person via video conference or chat tools. Notes during the session will be published later on for the whole community to see what we discussed.

Goals

In general, there are two areas we would like to cover:

Discussion of ongoing work that is complex, requires coordination, or requires architectural decisions, and that would therefore benefit from face-to-face discussion among a larger group.
Exchange of ideas for upcoming work to gauge community interest, solicit feedback, look for conflicts/overlap, and generally keep everyone informed.

In particular, we may (or may not) cover the following suggested topics. This is not an exhaustive list and if you feel there is something missing that you want to talk about, contact one of the session chairs and we will include your topic here. Note that the numbering of the topics does not represent an ordering or importance indication of any kind, but rather a reference to the second table with the "topic of interest" column.

The final agenda will be guided by the interest the attendees express, so we may not even talk about any topics listed below if it appears there is little to no interest in the topic among the attendees. Therefore, if you feel strongly that we should discuss a topic, please communicate that to the chairs.

Topics

Note: At the moment, these are mostly just suggestions the chair has gleaned from ongoing conversations. Please email me with suggestions for better topics. :-)

#	Topic Description
1	RACK, BBR (RandallStewart)
2	Alternate stacks: How do we maintain them? What is the support expectation? How do we minimize code duplication? Etc.
3	Alternate stacks: Building the "default" TCP stack as a module (and renaming it) (JonathanLooney)
4	Software packet pacing / burst mitigation
5	(iflib status??)
6	Unified API for hardware/software packet pacing
7	UDP option support
8	In-line TLS API(?)
9	Kernel TLS
10	More efficient mbuf design for sendfile and friends (DrewGallatin)
11	High-precision timestamps D15337 (MattMacy)

Note: General presentations about work you have done that does not require further discussions will generally receive lower priority than work which would benefit from further face-to-face feedback. It may be worth seeking other forums for these discussions.

Suggested Agenda

TCP Stacks
- RACK, BBR (RandallStewart)
- Alternate stacks: How do we maintain them? What is the support expectation? How do we minimize code duplication? Etc.
- Alternate stacks: Building the "default" TCP stack as a module (and renaming it) (JonathanLooney)
UDP option support
iflib status??
High-precision timestamps D15337 (MattMacy)
12.0 Preparation
- ABI padding check.
- D15386 (BrooksDavis)
Software packet pacing / burst mitigation
- Unified API for hardware/software packet pacing?
TLS-related stuff:
- Kernel TLS
- In-line TLS API
- More efficient mbuf design for sendfile and friends

Attending

In order to attend, you need to register for the developer summit, register by email for the session, and be confirmed by the working group organizers. Follow the guidelines described on the main page or in the email you received. If you have questions or are in doubt, ask the session chairs.

Please do NOT add yourself here. Your name will appear automatically once you have received the confirmation email. However, you will also need to register for the developer summit by adding your name to the general developer summit attendees list.

#	Name	Username / Affiliation	Topics of Interest	Notes
1	JonathanLooney	jtl@		Session chair
2	RandallStewart	rrs@
3	MichaelTuexen	tuexen@
4	LawrenceStewart	lstewart@
5	MikeKarels	karels@
6	NavdeepParhar	np@
7	MariusStrobl	marius@
8	JohnBaldwin	jhb@		Second half only

Results

Alternate TCP Stacks

RACK to be committed soon unless there is further traffic on the review
- Includes Tail Loss Probe
- Used in production by Netflix
- RACK committed during session
BBR has an open issue that is being worked on with Google
- Netflix: Bandwidth measurements need tweaking (RACK outperforms it for them)
- Not going to make 12.0-RELEASE
setsockopt() can choose TCP stack for an individual connection
- Can be changed on an in-flight socket
example stack to be ripped out after RACK is in
Pluggable Congestion Control vs. Separate Stack
- Ideally RACK and BBR would be pluggable cc modules
- In practice, existing abstractions are insufficient
  - Unclear what abstractions are correct
- RACK would really require pluggable loss recovery
- Goal is to eventually get back to one TCP stack + plugins for RACK, BBR, etc
Alternate stacks aren't being built by default
- Do it in Jenkins?
- LINT already does it?
Building Default Stack as a kld module
- Significant code movement, will cause merge headaches
- Allows on-the-fly TCP upgrades via loading a new module
"default" stack is a poor name
Currently require at least one TCP stack
- jtl hopes to fix this for 13-CURRENT
Need to fix the module penalty (lock inlining, etc)
- jtl: current inline atomic primitives might be slower than function call (due to memory barrier)
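
The per-connection stack selection mentioned above can be sketched roughly as follows. This is a minimal sketch, assuming the FreeBSD TCP_FUNCTION_BLK socket option and struct tcp_function_set from netinet/tcp.h; the helper name select_tcp_stack() is hypothetical, and the stack name passed in (for example "rack") is assumed to match whatever name the loaded stack module registers. On systems that lack the option, the helper simply fails.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: ask the kernel to switch socket `s` to the
 * named TCP stack. Returns 0 on success, -1 with errno set on failure.
 * Where TCP_FUNCTION_BLK is not available, fails with ENOPROTOOPT.
 */
int
select_tcp_stack(int s, const char *name)
{
	assert(name != NULL);
#ifdef TCP_FUNCTION_BLK
	struct tcp_function_set tfs;
	size_t len = strlen(name);

	/* The name must fit, including its terminating NUL. */
	if (len >= sizeof(tfs.function_set_name)) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	memset(&tfs, 0, sizeof(tfs));
	memcpy(tfs.function_set_name, name, len + 1);
	return (setsockopt(s, IPPROTO_TCP, TCP_FUNCTION_BLK,
	    &tfs, sizeof(tfs)));
#else
	(void)s;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

A caller would typically invoke select_tcp_stack(fd, "rack") after socket() and check the return value, falling back to the default stack if the requested one is not loaded.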

UDP Options

Tom Jones: new committer, working on IETF draft for UDP options
UDP processing path to change (ip_len - sizeof(struct udphdr) != payload length)
Review imminent for notification purposes
Won't be pushed into tree until IETF draft finalized
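
To illustrate why the length check in the UDP input path changes: as we understand the draft, options are carried in the area between the end of the UDP datagram (as given by the UDP Length field) and the end of the IP datagram. The sketch below is an illustration only, not the code under review; the function name udp_split_lengths() and its parameters are hypothetical, and it ignores the UDP Length of zero special case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UDP_HDR_LEN	8	/* fixed UDP header size in bytes */

/*
 * Hypothetical helper: given the IP payload length (bytes following
 * the IP header) and the UDP Length field, compute the user data
 * length and the trailing surplus length where options could live.
 * Returns 0 on success, -1 if the lengths are inconsistent.
 */
int
udp_split_lengths(size_t ip_payload_len, uint16_t udp_len,
    size_t *data_len, size_t *surplus_len)
{
	assert(data_len != NULL && surplus_len != NULL);

	/* UDP Length covers header plus data and cannot exceed the IP payload. */
	if (udp_len < UDP_HDR_LEN || udp_len > ip_payload_len)
		return (-1);
	*data_len = (size_t)udp_len - UDP_HDR_LEN;
	*surplus_len = ip_payload_len - udp_len;
	return (0);
}
```

A non-zero surplus length is exactly the case that the existing "IP length minus UDP header equals payload length" assumption does not cover.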

iflib Status

ixl(9) iflib driver almost ready to be committed to head
Benno working on Aquantia driver
Some netmap problems still
Invaluable for vxlan development
Need feedback from driver maintainers on KPI + documentation
- erj feedback: Documentation is okay
- shurd: Got up and running pretty easily
LOR between netmap and iflib
- Needs somebody to look into and fix
e1000 regressions (CURRENT)
- em: errata workarounds missing for TSO + TSO on by default
  - Probably causing significant fraction of MAC hang complaints
  - Disable TSO by default in em? Workarounds have a significant perf penalty anyway
  - 10-STABLE, 11-STABLE seem fine
- Endian bugs on big-endian machines
  - POWER9 bringup might shake these out?
  - jtl: Regression in a major driver, probably would block 12.0-RELEASE
  - See PR 228811

High Precision Timestamps

Two separate issues: higher precision internal timers, and on-wire TCP timestamp format
Changes to on-wire format require standardization (ideally)
RACK calculates RTT based off of timer values maintained internally. The on-wire timestamps are not an input
Currently uses millisecond granularity (BBR goes up to microsecond)
Easiest path forward would be to increase the granularity and use the RACK stack. This removes the need to

12.0 Preparation

xtcpcb, x* compat broken last year
brooks has a review to add spares, remove dependencies on internal kernel data structures
netstat -M broken; should really fix it for RELEASE

lstewart

Reschedule the transport call?

Packet Pacing / Burst Mitigation

Can be done in s/w, can be quite expensive
At least two NIC vendors support h/w pacing
Need a unified API
RACK
- Attempts to not send "large" bursts at once (defaults to 40 * MSS)
- If RACK tries to exceed this, it queues and delays for "a little while"
- Burst mitigation is not done in at-clock mode
BBR
- Tries to find the optimal TSO size based on the measured bandwidth
- Will pace its output rate to the measured bandwidth
Chelsio h/w pacing
- Pacing is configured per-flow
- The NIC will accept as many packets as you send and pace them out
- Leads to buffer bloat, needs backpressure
MLX h/w pacing
- CX-4: can only pace to a certain set of rates (12-14 rates), pacing is done per-interface
- CX-5: Can pace to larger set of rates (10k rates w/ configurable ), pacing can be done per-flow
s/w pacing is limited to 50us granularity
There is currently a unified API, but there is no way to query capabilities
- Especially rate values supported
Two use cases: automatic tuning of rate values, or user-specified rates where the stack paces in s/w with the h/w pacing taken into account
Decision of how much sw pacing to do is a policy decision; need to offer users some level of control here
- it may be better to fail to pace than burn CPU on sw pacing
MLX: expect that with future NICs, h/w pacing should have enough granularity in its configuration to make h/w-only pacing possible
- e.g. in the future you may be either h/w paced or s/w paced, but not a combination
What about pacing on a middlebox, rather than an endpoint?
- Requires a pacing API to work without an associated pcb
- e.g. dummynet would tag packets with a pacing queue id rather than doing it in ip_output()
- Would allow dummynet to do per-flow pacing
"Easy-mode" API: user can specify rate only + some kind of tolerance (or a lower bound + upper bound)
- Allow the user to choose the worst implementation they will accept (i.e. HW-only, HW-SW combo, SW-only)
- Defer discussion of how to handle routing changes
  - It gets really bad if you fail over to a NIC that has a worse implementation than what you currently have
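
One way the "easy-mode" request and a capability query could fit together is sketched below. Every name here (pace_impl, pace_request, pace_select, pace_bytes_per_tick) is hypothetical and none of it is an existing FreeBSD KPI. It assumes the driver can report a table of supported h/w rates, which is the capability query the notes say is currently missing. The selection policy is only one plausible reading of the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: implementations ordered from best (0) to worst. */
enum pace_impl {
	PACE_HW_ONLY = 0,	/* NIC paces at a supported rate */
	PACE_HW_SW = 1,		/* NIC caps the rate, stack refines it */
	PACE_SW_ONLY = 2,	/* stack paces using timers */
	PACE_FAIL = 3		/* nothing acceptable: do not pace */
};

struct pace_request {
	uint64_t rate_bps;	/* desired rate, bits per second */
	uint64_t tolerance_bps;	/* acceptable deviation from rate_bps */
	enum pace_impl worst;	/* worst implementation the user accepts */
};

/*
 * Pick an implementation for `req` given the NIC's supported rates.
 * On return, *hw_rate is the rate to program into the NIC, or 0 if
 * the NIC is not used.
 */
enum pace_impl
pace_select(const struct pace_request *req, const uint64_t *hw_rates,
    size_t nrates, uint64_t *hw_rate)
{
	uint64_t best_diff = UINT64_MAX, best_rate = 0;
	uint64_t ceil_rate = UINT64_MAX, diff;
	size_t i;

	assert(req != NULL && hw_rate != NULL);
	for (i = 0; i < nrates; i++) {
		diff = hw_rates[i] > req->rate_bps ?
		    hw_rates[i] - req->rate_bps : req->rate_bps - hw_rates[i];
		if (diff < best_diff) {		/* closest h/w rate so far */
			best_diff = diff;
			best_rate = hw_rates[i];
		}
		/* Smallest h/w rate that is not below the request. */
		if (hw_rates[i] >= req->rate_bps && hw_rates[i] < ceil_rate)
			ceil_rate = hw_rates[i];
	}
	*hw_rate = 0;
	if (nrates > 0 && best_diff <= req->tolerance_bps) {
		*hw_rate = best_rate;
		return (PACE_HW_ONLY);
	}
	if (req->worst >= PACE_HW_SW && ceil_rate != UINT64_MAX) {
		*hw_rate = ceil_rate;
		return (PACE_HW_SW);
	}
	if (req->worst >= PACE_SW_ONLY)
		return (PACE_SW_ONLY);
	return (PACE_FAIL);	/* better to fail than burn CPU */
}

/* Bytes the stack may release per s/w timer tick at a given rate. */
uint64_t
pace_bytes_per_tick(uint64_t rate_bps, uint64_t tick_us)
{
	return (rate_bps / 8 * tick_us / 1000000);
}
```

With the 50us s/w granularity noted above, pace_bytes_per_tick(1000000000, 50) yields 6250 bytes per tick at 1 Gb/s, which gives a feel for how coarse s/w-only pacing is at low rates. Routing changes are deliberately not handled, matching the decision to defer that discussion.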

NIC Configuration

MLX has 2000 sysctl nodes for configuration
- f/w version, stats, steering

TLS

Netflix has in-kernel TLS
Requires encryption to be in the kernel
Key negotiation is done in userland
Framing is done in-kernel
Gets openssl out of the rx/tx path
Netflix to upstream "soonish", probably not in time for 12.0-RELEASE
"Unmapped mbuf", which contains an array of physical addresses
Encrypt from an unmapped mbuf, or into an unmapped mbuf
Unmapped mbuf is independent of kernel TLS, but kernel TLS depends on it
Special extra step where plaintext from sendfile is encrypted and then enqueued for tx
h/w crypto offload:
- Needs a way to locate session key
- TCP needs special knowledge about encrypted packets (i.e. don't copy them and lose the encryption metadata!)
- mbufs can share TLS frames
- For rexmits, the cipher algorithms are much simpler if you start from the beginning of the TLS frame
- Chelsio can re-encrypt the start of the frame but not re-transmit the ACK'ed data (MSS granularity)
- A bunch of rough edges to sand down before committing
re-keying not supported for in-kernel, but doesn't seem to work for vanilla openssl anyway
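
For readers unfamiliar with what "framing" involves, the sketch below shows the standard 5-byte TLS record header and how a payload is split into records of at most 2^14 plaintext bytes. It is an illustration of the wire format only, not the Netflix implementation; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HDR_LEN		5	/* type + version + length */
#define TLS_MAX_PLAINTEXT	16384	/* 2^14 bytes per record */
#define TLS_TYPE_APP_DATA	23	/* application_data content type */

/*
 * Write a TLS record header. `frag_len` is the length of the record
 * body as it will appear on the wire; multi-byte fields are big-endian.
 */
void
tls_write_header(uint8_t *hdr, uint8_t type, uint16_t version,
    uint16_t frag_len)
{
	assert(hdr != NULL);
	hdr[0] = type;
	hdr[1] = (uint8_t)(version >> 8);
	hdr[2] = (uint8_t)(version & 0xff);
	hdr[3] = (uint8_t)(frag_len >> 8);
	hdr[4] = (uint8_t)(frag_len & 0xff);
}

/* Number of records needed to carry `payload_len` plaintext bytes. */
size_t
tls_record_count(size_t payload_len)
{
	return ((payload_len + TLS_MAX_PLAINTEXT - 1) / TLS_MAX_PLAINTEXT);
}
```

Because each record is encrypted as a unit, this framing is also why retransmission is simplest when it restarts at a record boundary, as noted in the h/w offload items above.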

Actions

sbruno: ensure alternate stacks are built in Jenkins

DevSummit/201806/Transport (last edited 2018-06-08T18:59:29+0000 by MarkLinimon)