Session Leader: jtl
https://hackmd.io/LvjwJwQ4RJ-WS1mm_KLBgg
BSDCam Transport Session
- Agenda Bashing
- Linux NetDev Report from thj@
- Co-located with IETF event
- Not especially useful for FreeBSD people
- Things they are doing:
- tight vendor integration for switch ASICs
- switchdev API, switch configurations
- Mellanox, Barefoot, and Cumulus
- FreeBSD likely to lag behind
- Barefoot: Intellectual Property in compiler
- Would be willing to open source spec for configuring ASIC
- Librification of netfilter tools (firewall rules in JSON)
- Write firewall config tools in higher level languages
- What do sysadmins want: libraries, JSON, etc.?
- Demo of netfilter implemented in eBPF
- Have a "Tell developers what you think/want" session at MeetBSD
- Getting more feedback from users and sysadmins
- Have a FreeBSDCon, a devsummit focused on getting users to tell us about their needs/pains/desires
- Making IPv6 Suck Less
- Perform Better
- Missing RFCs
- thj is implementing RFC7112
- Roaming WiFi: IPv4 renegotiates DHCP, but SLAAC doesn't get reset
- jtl's concern: the complexity of headers, cases where host may be instructed to do work.
- There are some measurements of what % of traffic gets dropped if it has extension headers. Cisco is apparently doing fresh stats on this.
- jtl would like a sysctl bitmask to ignore specific types of extension headers
- A bug with v6 fragments: if RSS is enabled, the counter of how many headers have been processed gets reset to 0
- Optimizations that have only been done to v4, may need to be replicated for v6
- v4 may be more strictly compliant, v6 is often less compliant
- v4 would not accept more than 16 fragments
- bz@ would like us to be RFC 8200 compliant
- Who wants to actually work on v6: thj, bz, gallatin@, left 1/2 of rrs@, right 1/2 of tuexen@
- Old IPv6 to-do page: IPv6 To Do
- An equivalent to the v4 RFC page: Transport Protocols TCP RFC compliance
- We need more test cases, both for things that work (so we don't break them), and for things that are broken (so we know when it is fixed)
- OpenBSD has a Python-based v6 test suite that works on FreeBSD
- tuexen@ has a set of test packages that are ready to be hooked up to CI
- Take Away: status reports on the bi-weekly transport call
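The sysctl bitmask jtl asked for could behave roughly like the sketch below. This is a minimal Python illustration, not kernel code: the bit positions and the names `EXT_HDR_BITS`, `walk_extension_headers`, and `should_drop` are hypothetical, and "ignore" is modelled here as "drop the packet", which is an assumption. Only the Next Header protocol numbers (0, 43, 44, 60) are taken from the IANA registry.

```python
# IANA protocol numbers for common IPv6 extension headers.
HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS = 0, 43, 44, 60

# Hypothetical bit positions for an imagined sysctl; not an existing FreeBSD knob.
EXT_HDR_BITS = {
    HOP_BY_HOP: 1 << 0,
    ROUTING: 1 << 1,
    FRAGMENT: 1 << 2,
    DEST_OPTS: 1 << 3,
}


def walk_extension_headers(first_next_header, payload):
    """Yield the type of each extension header found at the start of payload."""
    nxt, off = first_next_header, 0
    while nxt in EXT_HDR_BITS and off + 8 <= len(payload):
        yield nxt
        # The Fragment header is fixed at 8 bytes; the others encode their
        # length in 8-octet units, not counting the first 8 octets.
        length = 8 if nxt == FRAGMENT else (payload[off + 1] + 1) * 8
        nxt, off = payload[off], off + length


def should_drop(first_next_header, payload, mask):
    """True if any extension header in the chain is selected by mask."""
    return any(EXT_HDR_BITS[h] & mask
               for h in walk_extension_headers(first_next_header, payload))


# Hop-by-Hop header (next = Fragment), then a Fragment header (next = TCP, 6).
chain = bytes([FRAGMENT, 0, 1, 4, 0, 0, 0, 0]) + bytes([6, 0, 0, 0, 0, 0, 0, 1])
```

With this chain, a mask that selects only the Fragment bit causes a drop, while a mask that selects only the Routing bit lets the packet through.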
- IP[46]/TCP Reassembly Bugs/Stuff
- Researcher found that 'walking linked list is slow, and bad'
- The kernel created long linked lists for out-of-order TCP segments and fragment chains.
- IPv6: Used to limit resources very differently than v4, now uses the same vocabulary
- IPv6 fragments were not hashed into buckets, now they are
- Performance suffers too much when the list exceeds 100 entries, so this is the new limit
- Mostly just a workaround, papers over the problem. Needs an algorithmic fix
- If there are more than a trivial number of fragments, a better solution is needed. glebius@ is working on an implementation of the fragment processing code using a red-black tree. It needs a security review. Is the performance impact acceptable?
- TCP: rrs@ working on coalescing code
- Updated version coming to phabricator soon
- tuexen@ wrote test cases for reassembly
- jhb@ and jtl@ have a todo list
- use queue.h
- v6 code requires changes in many places
- Need a modernization pass, remove #ifdef KAME etc
- Too much noise in the code, harder to read and reason about
- Need a regression suite
- Give it the FreeBSD stink(tm)
- bz@ may have old project in perforce that does some cleanup, likely applies fairly well
- Todo: pf
- brooks@ would prefer a cleanup of the IOCTLs
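The reassembly discussion above (a hard cap on queue length plus an ordered structure instead of a walked linked list) can be sketched as follows. This is an illustrative Python model only: `FragmentQueue` and its methods are hypothetical names, a sorted list with `bisect` stands in for the red-black tree, and the "more fragments" flag is ignored for brevity.

```python
import bisect

MAX_FRAGMENTS = 100  # per-queue cap mentioned in the notes


class FragmentQueue:
    """Keep fragments ordered by offset; bisect stands in for a tree lookup."""

    def __init__(self, limit=MAX_FRAGMENTS):
        self.limit = limit
        self.offsets = []   # sorted fragment offsets
        self.data = {}      # offset -> payload bytes

    def insert(self, offset, payload):
        """Return False when the fragment is a duplicate or the queue is full."""
        if offset in self.data or len(self.offsets) >= self.limit:
            return False
        bisect.insort(self.offsets, offset)  # binary search, no list walk
        self.data[offset] = payload
        return True

    def reassemble(self):
        """Return the datagram if fragments are contiguous from 0, else None."""
        out = b""
        for off in self.offsets:
            if off != len(out):
                return None  # hole or overlap: not complete yet
            out += self.data[off]
        return out
```

Rejecting inserts once the cap is reached bounds the work an attacker can force per queue, which is the workaround the notes describe; the ordered structure is the part that addresses the algorithmic cost.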
- TFO (TCP Fast Open)
- Who might have patches?
- Known interop problem with Windows
- TCP option alignment
- tuexen@ has test cases for this, need to extract them from him (with pliers)
- Limelight extension with shared secret
- Alternate Stacks
- Infrastructure
- Allow different TCP stacks concurrently (side-by-side)
- Use setsockopt() to assign individual sockets to the alternative stacks
- Requires that when you switch stacks you must update the common TCPcb
- A/B test stacks, route n% of traffic to the new stack, compare stats from the two stacks
- Can be used for different workloads
- Live-patching by loading newer version of stack without rebooting
- Allows much more active development, frees development from usual requirements (work across low cpu/ram count to high cpu/ram count)
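The A/B testing idea above (send n% of traffic to the new stack and compare stats) needs a stable way to choose a stack per connection. A minimal Python sketch, assuming a string connection identifier and two hypothetical stack names; the actual assignment would then be made with setsockopt() as described in the notes.

```python
import zlib


def pick_stack(conn_id, percent_new, default="default", candidate="candidate"):
    """Route roughly percent_new% of connections to the candidate stack."""
    # crc32 gives a stable bucket in 0..99 for the same connection id,
    # so a given connection always lands on the same stack.
    bucket = zlib.crc32(conn_id.encode()) % 100
    return candidate if bucket < percent_new else default
```

Hashing the identifier rather than drawing a random number keeps the choice reproducible, which makes it easier to attribute per-stack statistics after the fact.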
- RACK
IETF draft: https://tools.ietf.org/html/draft-ietf-tcpm-rack-04
- Our code only supports draft -02.
- Netflix not driven to update at this time
- Recent ACK + Tail loss probe
- use RTT to predict when to try to keep transmitting
- use SACK information together with RTT to predict when to retransmit
- PRR, Proportional Rate Reduction (https://tools.ietf.org/html/rfc6937): keep sending more data as you get ACKs instead of waiting for 1/2 of the window
- Burst mitigation, high precision timing system
- Much better quality of experience
- Keeps a send map, how many times each segment has been sent, better than old SACK
- robert@ asks about reducing diff between base stack and RACK
- Improved recovery
- In head, higher cost to use
- Almost all video traffic at Netflix uses RACK
- Even fill traffic will use RACK eventually
- Head is a bit different than what Netflix is using right now, head is considered far better
- Doing new tests to compare 2017 to 2018 stack
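The PRR behaviour mentioned above (keep sending as ACKs arrive rather than stalling for half a window) follows from the per-ACK calculation in RFC 6937. The sketch below is a Python transcription of that pseudocode using the slow-start reduction bound variant; the function name and argument layout are my own, and it says nothing about how the FreeBSD RACK stack implements it.

```python
import math


def prr_sndcnt(prr_delivered, prr_out, ssthresh, recover_fs, pipe,
               delivered_data, mss):
    """Bytes that may be sent in response to one ACK during recovery.

    prr_delivered must already include delivered_data for this ACK.
    """
    if pipe > ssthresh:
        # Proportional part: spread the window reduction across recovery.
        return math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out
    # Slow-start reduction bound (PRR-SSRB): rebuild pipe toward ssthresh.
    limit = max(prr_delivered - prr_out, delivered_data) + mss
    return min(ssthresh - pipe, limit)
```

For example, with 20000 bytes in flight at the start of recovery and ssthresh of 10000, the first ACK that delivers 1000 bytes allows 500 bytes to be sent, so the sender transmits about one segment for every two delivered.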
- BBR
- Experimental congestion control, but actually a different stack
- Builds on RACK
- Even higher cost than RACK
- BBR v1.0 is controversial
- Netflix has enhanced this for their implementation
- In router small buffer scenarios it is unfair to newreno/cubic
- BBR v2.0 looks to improve this
- Netflix not necessarily sold on Google's ideas
- Assumes loss is not congestion based
- "Policer detection" to notice when you are being rate limited by a middlebox
- Infrastructure
- "Blackbox Recorder"
- Volunteer to make ports?
- Came from Netflix
- Log state of TCPcb, the packet, timers, other data to ring buffer
- Can be dumped out to userspace
- Tooling exists, needs ports
- Writes out pcapng files
- Traceviewer provides visual interface
- Analysis daemon that runs continuously and runs tests against the data, in the form of assertions
- After panic, can extract the data from the ring buffer
- RACK and BBR development depended upon blackbox
- Extend Wireshark to understand the metadata
Attend SharkFest to present FreeBSD work
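The core of the blackbox recorder as described above is a fixed-size ring buffer of state snapshots that can be dumped later. A toy Python model, assuming hypothetical names (`BlackboxLog`, `record`, `dump`) and arbitrary example fields; the real recorder logs the TCPcb, packet, and timer state inside the kernel.

```python
from collections import deque


class BlackboxLog:
    """Fixed-size ring buffer: new records overwrite the oldest ones."""

    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)

    def record(self, event, **state):
        # A real recorder would snapshot the TCPcb, packet headers, and timers.
        self.ring.append({"event": event, **state})

    def dump(self):
        """Return the buffered records oldest-first, as a reader would see them."""
        return list(self.ring)
```

An analysis pass in the style of the assertion daemon could then be as simple as checking an invariant over `dump()`, for instance that a logged congestion window never drops to zero.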
- RCU "Locking"
- mmacy@ applied RCU to IP stack
- Requires a mindset change
- read-locks are not always "locks"
- Register your intention to read the data structure
- ConcurrencyKit will not garbage collect the data while you are using it
- In 13 we should shift to using these more
- To date we only have a first pass
- More thought needed about which data structures require "full" locks
- Make engineering decisions to use the new CK features more
- Avoid "lock chains" that require acquiring many locks in a sequence
- Rethink locking from a more fundamental perspective
- Used to allow add/remove from list, while another process is walking through the list
- Netflix is committed to upstreaming and being good community citizens
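The RCU mindset described above (readers register interest instead of taking a lock, and writers may add or remove entries while a reader is still walking the list) can be illustrated with a copy-on-write list. This is a conceptual Python sketch only: `RcuList` is a hypothetical name, and Python's reference counting plays the role that deferred reclamation plays in ConcurrencyKit, which works quite differently in the kernel.

```python
import threading


class RcuList:
    """Readers walk an immutable snapshot; writers publish a new one."""

    def __init__(self, items=()):
        self._snapshot = tuple(items)
        self._write_lock = threading.Lock()  # serializes writers only

    def read(self):
        # No lock: holding the reference keeps this snapshot alive.
        return self._snapshot

    def add(self, item):
        with self._write_lock:
            self._snapshot = self._snapshot + (item,)

    def remove(self, item):
        with self._write_lock:
            self._snapshot = tuple(x for x in self._snapshot if x != item)
```

A reader that started iterating before a removal finishes its walk over the old snapshot, while any reader that starts afterwards sees the updated list.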