FreeBSD Developer Summit: Transport Working Group
June 7, 2018 (Thursday), 13:00-17:00
DMS 1160
Overview
We will discuss ongoing work on, and ideas for improvements to, the transport protocols in the FreeBSD kernel.
There is a group that meets regularly to discuss transport work in the kernel. Notes of the group's work can be found in TransportProtocols. The face-to-face time will allow us to whiteboard and discuss complex topics in an extended time. It will also allow us to include participants who are not able to make the regular meetings.
If you would like to participate, contact the working group chairs below and CC devsummit@. You will then be added to this page. Please include a list of things you want to talk about or the areas you are interested in. This helps us plan the session and bring together people with common interests.
It may be possible to bring in people who cannot attend in person via video conference or chat tools. Notes taken during the session will be published later so the whole community can see what we discussed.
Goals
In general, there are two areas we would like to cover:
- Discussion of ongoing work that is complex, requires coordination, or requires architectural decisions, and that would benefit from face-to-face discussion among a larger group.
- Exchange of ideas for upcoming work to gauge community interest, solicit feedback, look for conflicts/overlap, and generally keep everyone informed.
In particular, we may (or may not) cover the following suggested topics. This is not an exhaustive list and if you feel there is something missing that you want to talk about, contact one of the session chairs and we will include your topic here. Note that the numbering of the topics does not represent an ordering or importance indication of any kind, but rather a reference to the second table with the "topic of interest" column.
The final agenda will be guided by the interest the attendees express, so we may not even talk about any topics listed below if it appears there is little to no interest in the topic among the attendees. Therefore, if you feel strongly that we should discuss a topic, please communicate that to the chairs.
Topics
Note: At the moment, these are mostly just suggestions the chair has gleaned from ongoing conversations. Please email me with suggestions for better topics. :-)
# | Topic Description |
1 | RACK, BBR (RandallStewart) |
2 | Alternate stacks: How do we maintain them? What is the support expectation? How do we minimize code duplication? Etc. |
3 | Alternate stacks: Building the "default" TCP stack as a module (and renaming it) (JonathanLooney) |
4 | Software packet pacing / burst mitigation |
5 | (iflib status??) |
6 | Unified API for hardware/software packet pacing |
7 | UDP option support |
8 | In-line TLS API(?) |
9 | Kernel TLS |
10 | More efficient mbuf design for sendfile and friends (DrewGallatin) |
11 | |
Note: General presentations about work you have done that does not require further discussions will generally receive lower priority than work which would benefit from further face-to-face feedback. It may be worth seeking other forums for these discussions.
Suggested Agenda
- TCP Stacks
- RACK, BBR (RandallStewart)
- Alternate stacks: How do we maintain them? What is the support expectation? How do we minimize code duplication? Etc.
- Alternate stacks: Building the "default" TCP stack as a module (and renaming it) (JonathanLooney)
- UDP option support
- iflib status??
- 12.0 Preparation
- ABI padding check.
- Software packet pacing / burst mitigation
- Unified API for hardware/software packet pacing?
- TLS-related stuff:
- Kernel TLS
- In-line TLS API
- More efficient mbuf design for sendfile and friends
Attending
To attend, you need to register for the developer summit, sign up for the session by email, and be confirmed by the working group organizers. Follow the guidelines described on the main page or in the email you received. If you have questions or are in doubt, ask the session chairs.
Please do NOT add yourself here. Your name will appear automatically once you have received the confirmation email. However, you will also need to register for the developer summit by adding your name to the general developer summit attendees list.
# | Name | Username / Affiliation | Topics of Interest | Notes |
1 | jtl@ | | Session chair | |
2 | rrs@ | | | |
3 | tuexen@ | | | |
4 | lstewart@ | | | |
5 | karels@ | | | |
6 | np@ | | | |
7 | marius@ | | | |
8 | jhb@ | | Second half only |
Results
Alternate TCP Stacks
- RACK to be committed soon unless there is traffic on the review
- Plus Tail Loss Probe
- Used in prod. by Netflix
- RACK committed during session
- BBR has open issue that is being worked with Google
- Netflix: Bandwidth measurements need tweaking (RACK outperforms it for them)
- Not going to make 12.0-RELEASE
- setsockopt() can choose TCP stack for an individual connection
- Can be changed on an in-flight socket
- example stack to be ripped out after RACK is in
- Pluggable Congestion Control vs. Separate Stack
- Ideally RACK and BBR would be pluggable cc modules
- In practice, existing abstractions are insufficient
- Unclear what abstractions are correct
- RACK would really require pluggable loss recovery
- Goal is to eventually get back to one TCP stack + plugins for RACK, BBR, etc
- Alternate stacks aren't being built by default
- Do it in Jenkins?
- LINT already does it?
- Building Default Stack as a kld module
- Significant code movement, will cause merge headaches
- Allows on-the-fly TCP upgrades via loading a new module
- "default" stack is a poor name
- Currently require at least one TCP stack
- jtl hopes to fix this for 13-CURRENT
- Need to fix the module penalty (lock inlining, etc)
- jtl: current inline atomic primitives might be slower than function call (due to memory barrier)
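The per-connection stack selection noted above can be sketched from userland as follows. This is a minimal sketch, assuming the socket option is the `TCP_FUNCTION_BLK` option and `struct tcp_function_set` from FreeBSD's `<netinet/tcp.h>`; the helper name `select_tcp_stack` is hypothetical, and the stack name passed in (for example "rack") depends on which stack modules are loaded. On systems without the option, the helper simply fails.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Ask the kernel to switch the TCP stack used by one socket.
 * Returns 0 on success, -1 with errno set on failure.
 * select_tcp_stack() is a hypothetical helper, not an existing API.
 */
int
select_tcp_stack(int fd, const char *stack)
{
	assert(stack != NULL);
#ifdef TCP_FUNCTION_BLK
	struct tcp_function_set tfs;

	memset(&tfs, 0, sizeof(tfs));
	/* Bounded copy; the memset above guarantees NUL termination. */
	strncpy(tfs.function_set_name, stack,
	    sizeof(tfs.function_set_name) - 1);
	return (setsockopt(fd, IPPROTO_TCP, TCP_FUNCTION_BLK,
	    &tfs, sizeof(tfs)));
#else
	/* Assumption: no TCP_FUNCTION_BLK means no selectable stacks. */
	(void)fd;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

A caller would typically invoke `select_tcp_stack(fd, "rack")` right after `socket()`, and fall back to the default stack if the call fails.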
UDP Options
- Tom Jones: new committer, working on IETF draft for UDP options
- UDP processing path to change (ip_len - sizeof(struct udphdr) != payload length)
- Review imminent for notification purposes
- Won't be pushed into tree until IETF draft finalized
iflib Status
- ixl(9) iflib driver almost ready to be committed to head
- Benno working on Aquantia driver
- Some netmap problems still
- Invaluable for vxlan development
- Need feedback from driver maintainers on KPI + documentation
- erj feedback: Documentation is okay
- shurd: Got up and running pretty easily
- LOR between netmap and iflib
- Needs somebody to look into and fix
- e1000 regressions (CURRENT)
- em: errata workarounds missing for TSO + TSO on by default
- Probably causing significant fraction of MAC hang complaints
- Disable TSO by default in em? Workarounds have a significant perf penalty anyway
- 10-STABLE, 11-STABLE seem fine
- Endian bugs on big-endian machines
- POWER9 bringup might shake these out?
- jtl: Regression in a major driver, probably would block 12.0-RELEASE
See PR 228811
High Precision Timestamps
- Two separate issues: higher precision internal timers, and on-wire TCP timestamp format
- Changes to on-wire format require standardization (ideally)
- RACK calculates RTT based off of timer values maintained internally. The on-wire timestamps are not an input
- Currently uses millisecond granularity (BBR goes up to microsecond)
- Easiest path forward would be to increase the granularity and use the RACK stack. This removes the need to
12.0 Preparation
- xtcpcb, x* compat broken last year
- brooks has a review to add spares, remove dependencies on internal kernel data structures
- netstat -M broken; should really fix it for RELEASE
lstewart
- Reschedule the transport call?
Packet Pacing / Burst Mitigation
- Can be done in s/w, can be quite expensive
- At least two NIC vendors support h/w pacing
- Need a unified API
- RACK
- Attempts to not send "large" bursts at once (defaults to 40 * MSS)
- If RACK tries to exceed this, it queues and delays for "a little while"
- Burst mitigation is not done in at-clock mode
- BBR
- Tries to find the optimal TSO size based on the measured bandwidth
- Will pace its output rate to the measured bandwidth
- Chelsio h/w pacing
- Pacing is configured per-flow
- The NIC will accept as many packets as you send and pace them out
- Leads to buffer bloat, needs backpressure
- MLX h/w pacing
- CX-4: can only pace to a certain set of rates (12-14 rates), pacing is done per-interface
- CX-5: Can pace to larger set of rates (10k rates w/ configurable ), pacing can be done per-flow
- s/w pacing is limited to 50us granularity
- There is currently a unified API, but there is no way to query capabilities
- Especially rate values supported
- Two use cases: automatic tuning of rate values, or rates specified by user and stack sw paces with hw pace taken into account
- Decision of how much sw pacing to do is a policy decision; need to offer users some level of control here
- it may be better to fail to pace than burn CPU on sw pacing
- MLX: expect that with future NICs, h/w pacing should have enough granularity in its configuration to make h/w-only pacing possible
- e.g. in the future you may be either h/w paced or s/w paced, but not a combination
- What about pacing on a middlebox, rather than an endpoint?
- Requires a pacing API to work without an associated pcb
- e.g. dummynet would tag packets with a pacing queue id rather than doing it in ip_output()
- Would allow dummynet to do per-flow pacing
- "Easy-mode" API: user can specify rate only + some kind of tolerance (or a lower bound + upper bound)
- Allow the user to choose the worst implementation they will accept (i.e. HW-only, HW-SW combo, SW-only)
- Defer discussion of how to handle routing changes
- It gets really bad if you fail over to a NIC that has a worse implementation than what you currently have
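One way the "easy-mode" idea could look is sketched below. Everything here is an assumption made for illustration: the enum, the function name `pace_choose`, and the selection policy are hypothetical and were not agreed in the session. The sketch takes a list of hardware-supported rates (which the current API cannot yet report), a lower and upper bound on the desired rate, and the worst implementation the user will accept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical implementation levels, ordered best to worst. */
enum pace_impl {
	PACE_HW_ONLY = 0,
	PACE_HW_SW = 1,
	PACE_SW_ONLY = 2,
	PACE_NONE = 3		/* refuse to pace */
};

/*
 * Choose a pacing implementation for a flow.
 * hw_rates/n_rates: nonzero rates the NIC supports (may be empty).
 * lo/hi:            acceptable rate window requested by the user.
 * worst_ok:         worst implementation the user will accept.
 * *chosen receives the hardware rate (HW_ONLY, HW_SW), the software
 * rate (SW_ONLY), or 0 (NONE).
 */
enum pace_impl
pace_choose(const uint64_t *hw_rates, size_t n_rates, uint64_t lo,
    uint64_t hi, enum pace_impl worst_ok, uint64_t *chosen)
{
	uint64_t in_window = 0, above = 0;
	int have_in = 0, have_above = 0;
	size_t i;

	assert(chosen != NULL && lo <= hi);
	for (i = 0; i < n_rates; i++) {
		uint64_t r = hw_rates[i];

		if (r >= lo && r <= hi) {
			/* Keep the largest hardware rate inside the window. */
			if (!have_in || r > in_window)
				in_window = r;
			have_in = 1;
		} else if (r > hi) {
			/* Keep the smallest hardware rate above the window. */
			if (!have_above || r < above)
				above = r;
			have_above = 1;
		}
	}
	if (have_in) {
		*chosen = in_window;
		return (PACE_HW_ONLY);
	}
	if (have_above && worst_ok >= PACE_HW_SW) {
		/* Hardware caps the rate; software trims down to hi. */
		*chosen = above;
		return (PACE_HW_SW);
	}
	if (worst_ok >= PACE_SW_ONLY) {
		*chosen = hi;
		return (PACE_SW_ONLY);
	}
	*chosen = 0;
	return (PACE_NONE);
}
```

Returning `PACE_NONE` rather than silently falling back reflects the point above that it may be better to fail to pace than to burn CPU on software pacing.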
NIC Configuration
- MLX has 2000 sysctl nodes for configuration
- f/w version, stats, steering
TLS
- Netflix has in-kernel TLS
- Requires encryption to be in the kernel
- Key negotiation is done in userland
- Framing is done in-kernel
- Gets openssl out of the rx/tx path
- Netflix to upstream "soonish", probably not in time for 12.0-RELEASE
- "Unmapped mbuf", which contains an array of physical addresses
- Encrypt from an unmapped mbuf, or into an unmapped mbuf
- Unmapped mbuf is independent of kernel TLS, but kernel TLS depends on it
- Special extra step where plaintext from sendfile is encrypted and then enqueued for tx
- h/w crypto offload:
- Needs a way to locate session key
- TCP needs special knowledge about encrypted packets (i.e. don't copy them and lose the encryption metadata!)
- mbufs can share TLS frames
- For rexmits, the cipher algorithms are much simpler if you start from the beginning of the TLS frame
- Chelsio can re-encrypt the start of the frame but not re-transmit the ACK'ed data (MSS granularity)
- A bunch of rough edges to sand down before committing
- re-keying not supported for in-kernel, but doesn't seem to work for vanilla openssl anyway
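For readers unfamiliar with what "framing" covers, the sketch below writes the 5-byte TLS record header that precedes each encrypted fragment: content type, protocol version, and fragment length, with multi-byte fields in network byte order. The constant and function names are hypothetical and this is not the Netflix implementation; it only illustrates the record layout that in-kernel framing has to produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HDR_LEN		5
#define TLS_TYPE_APP_DATA	23	/* application_data content type */
#define TLS_VERSION_1_2		0x0303

/*
 * Fill in the record header for one fragment.
 * frag_len is the length of the fragment as it appears on the wire.
 * tls_frame_header() is a hypothetical helper for illustration only.
 */
void
tls_frame_header(uint8_t hdr[TLS_HDR_LEN], uint8_t type, uint16_t version,
    uint16_t frag_len)
{
	assert(hdr != NULL);
	hdr[0] = type;
	hdr[1] = (uint8_t)(version >> 8);	/* major version byte */
	hdr[2] = (uint8_t)(version & 0xff);	/* minor version byte */
	hdr[3] = (uint8_t)(frag_len >> 8);	/* length, high byte */
	hdr[4] = (uint8_t)(frag_len & 0xff);	/* length, low byte */
}
```

Because each record carries its own header and is encrypted as a unit, a retransmission that starts mid-record cannot simply be re-encrypted from that offset, which is consistent with the note above about starting from the beginning of the TLS frame.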
Actions
- sbruno: ensure alternate stacks are built in Jenkins