FreeBSD Developer Summit: Transport Working Group

June 9, 2016 (Thursday), 13:30-16:30

DMS 1140

Overview

We will discuss ongoing work on, and ideas for improvements to, the transport protocols in the FreeBSD kernel.

There is a group that meets regularly to discuss transport work in the kernel. Notes of the group's work can be found in TransportProtocols. The face-to-face time will allow us to whiteboard and discuss complex topics over an extended period. It will also allow us to include participants who are not able to make the regular meetings.

If you would like to participate, contact the working group chairs below and CC devsummit@. You will then be added to this page. Please include a list of things you want to talk about or the areas you are interested in. This helps us plan the session and bring people together with common interests.

It may be possible to bring in people who cannot attend in person via video conference or chat tools. Notes taken during the session will be published afterwards so the whole community can see what we discussed.

Goals

In general, there are two areas we would like to cover:

1. Discussions of ongoing work that is sufficiently complex, requires coordination, or requires architectural decisions that would benefit from face-to-face discussion among a larger group.
2. Exchange of ideas for upcoming work to gauge community interest, solicit feedback, look for conflicts/overlap, and generally keep everyone informed.

In particular, we would like to cover the following topics. This is not an exhaustive list; if something you want to talk about is missing, contact one of the session chairs and we will include your topic here. Note that the numbering of the topics does not indicate ordering or importance of any kind; it only serves as a reference for the "Topics of Interest" column in the second table.

Topics

#    Topic Description
1    Status of TCP modularity (jtl, randall)
2    DCP/MPTCP updates (nigel)
3    Packet Pacing (navdeep, hans, randall)
4    RFC 6675 support (hiren, jtl)
5    Congestion control related work?
6    iflib??
7    RACK, PRR (randall)
8    Route caching, FlowTable
9    N/W stack as a module (Steve k)
10   Time to extend tcpcb for 11? It is running out of padding

Note: General presentations about work you have done that do not require further discussion should be submitted for the FreeBSD Developer Summit track at BSDCan (see the general developer summit page).

Attending

In order to attend, you need to register for the developer summit, register by email for the session, and be confirmed by the working group organizers. Follow the guidelines described on the main page or in the email you received. For questions, or if in doubt, ask the session chairs. (Note: Please send email to both working group chairs to ensure your email is received despite holidays, etc.)

Please do NOT add yourself here. Your name will appear automatically once you have received the confirmation email. You do, however, need to put your name on the general developer summit attendees list yourself.

#    Name                  Username / Affiliation   Topics of Interest   Notes
1    JonathanLooney        jtl@                                          Session co-chair
2    HirenPanchasara       hiren@ / LLNW                                 Session co-chair
3    RyanStone             rstone@
4    KevinBowling          LLNW
5    JasonWolfe            LLNW
6    JasonEggleston        LLNW
7    Eric van Gyzen        vangyzen@
8    David A. Bright       DELL
9    Navdeep Parhar        np@
10   Patrick Kelsey        pkelsey@
11   Hans Petter Selasky   hps@ / mellanox
12   Drew Gallatin         gallatin@
13   Randall Stewart       rrs@
14   Jeremiah Lott         Avere Systems
15   Steve Kiernan         Juniper

Results

Session conducted by: jtl
Session notes by: hiren (apologies in advance for errors in notes :-))

Status of TCP modularity (jtl, randall)
-------------------------
Randall: With this work, you can now replace a large portion of the TCP stack.

ToDos:
- Changes have to happen pretty early in the connection lifetime, i.e. an
application cannot change stacks after a connection is established. rrs is
going to add this functionality.
- Currently we build all alternate stacks as part of buildkernel. That needs
to change.
- jtl is going to work on these makefile changes.
- This feature needs documentation. There is none right now.
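
A hedged sketch of the application-facing side of this work follows. The details (the TCP_FUNCTION_BLK socket option, struct tcp_function_set, the "rack" stack name, and the net.inet.tcp.functions_available sysctl) are assumptions based on the modular TCP patches, not necessarily what ends up committed.

/*
 * Minimal userland sketch (hedged): select an alternate TCP stack on a
 * socket before connecting.  Assumes the TCP_FUNCTION_BLK socket option
 * and struct tcp_function_set from the modular TCP work; the stack name
 * "rack" is only an example.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <err.h>

int
main(void)
{
    struct tcp_function_set tfs;
    int s;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        err(1, "socket");

    /* Must happen early in the connection lifetime (see the ToDos above). */
    memset(&tfs, 0, sizeof(tfs));
    strlcpy(tfs.function_set_name, "rack", sizeof(tfs.function_set_name));
    if (setsockopt(s, IPPROTO_TCP, TCP_FUNCTION_BLK, &tfs, sizeof(tfs)) == -1)
        err(1, "setsockopt(TCP_FUNCTION_BLK)");

    /* ... connect() and use the socket as usual ... */
    return (0);
}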

MPTCP updates (nigel)
--------------
Nigel gave a brief intro to MPTCP and a status update on the state of
development. He asked for help with reviewing.
To make it into the tree, he is only going to work on fail-over mode;
concurrency mode will be done later if/when time permits.
The concurrency patch can be developed in another (project?) branch as it
may not be ready for prime time.

jtl: Please add it to Phabricator and add #transport as a reviewer.

No one in the room explicitly expressed interest in reviewing the work
at this point in time.

Packet Pacing (navdeep, hans, randall)
---------------
Randall gave a quick intro to the software packet pacing that he developed as
part of RACK. (Slides are attached.)  See also https://labs.ripe.net/Members/gih/bbr-tcp

High-level comments from Randall:
- Until h/w pacing is standardized, s/w pacing can be very useful.
- Drain rate: largest cwnd / lowest RTT.
- Timers are sloppy, but we try to be slightly more precise with SBT-type timers.
- The calculation works out even without pacing.

jegg: Is it a problem if we only send every 1 ms?
rrs: Slot 0 serves when there is no pacing, and it works out in what I've seen so far.
hans: Are you using more precise timers?
rrs: Yes, Lawrence changed our stack to use SBT-type timers and I use that. It
has 1 ms granularity, more or less.
hans: Are you calling tcp_output on each packet?
rrs: No, on each sending opportunity. The pacer wakes up, looks at its slot, pulls
the RACK tcbs out, and tells output to send.
jtl: No, it is connection pacing and not an actual packet per ms.
rrs: Yes, I'm pacing connections as best as I can.
hiren: How can we use more precise timers, and is that something available in the tree?
hans: Yes, you can toggle it with the sysctl kern.timecounter.alloweddeviation.
hiren: Of course it would cost more CPU to get that.

Discussion ensued around the 1 ms timer granularity. Randall clarified that 1 ms
seems to work for the Netflix workload; if the data path is faster, we may need to
change it.
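
For reference, here is a purely illustrative sketch of the slotted software pacer described above. The data structures and names (pacer_ent, pacer_slots, pacer_tick) are invented, not the actual RACK code; only callout_reset_sbt() and SBT_1MS are existing kernel interfaces.

/*
 * Hypothetical sketch of a 1 ms slotted software pacer.  A callout fires
 * roughly every millisecond, drains the connections queued in the current
 * slot, and gives each one a send opportunity.  Includes, initialization,
 * and locking are abbreviated.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/callout.h>
#include <sys/malloc.h>
#include <sys/queue.h>
#include <sys/time.h>
#include <netinet/tcp_var.h>            /* struct tcpcb, tcp_output() */

#define PACER_NSLOTS    256             /* one slot per millisecond */

struct pacer_ent {
    TAILQ_ENTRY(pacer_ent)  pe_link;
    struct tcpcb           *pe_tp;      /* connection due in this slot */
};

static TAILQ_HEAD(, pacer_ent) pacer_slots[PACER_NSLOTS];
static struct callout pacer_callout;
static int pacer_cur;

static void
pacer_tick(void *arg)
{
    struct pacer_ent *pe;

    /* Give every connection scheduled in this slot a send opportunity. */
    while ((pe = TAILQ_FIRST(&pacer_slots[pacer_cur])) != NULL) {
        TAILQ_REMOVE(&pacer_slots[pacer_cur], pe, pe_link);
        (void)tcp_output(pe->pe_tp);
        free(pe, M_TEMP);
    }
    pacer_cur = (pacer_cur + 1) % PACER_NSLOTS;

    /* SBT-based callouts give the ~1 ms granularity discussed above. */
    callout_reset_sbt(&pacer_callout, SBT_1MS, 0, pacer_tick, NULL, 0);
}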

Navdeep started the h/w pacing discussion.

- A general solution which is a superset of the pacing we see today.
- Pacing at the socket layer can be done with setsockopt, but h/w can do much more.
  It can pace arbitrary flows based on traffic class or "tags" on the flows.
- We need firewall-like rules for pacing that can be easily configured on the
  transmit side. And we need kernel hooks to be able to do this in h/w.
- The way Chelsio works is, it has a limited number of pacing rules which
  determine the draining rate. A cookie specifies that and can be attached to a
  flow. H/w can also do "aggregate pacing" to let connections from a certain
  customer be paced that way. To make all this work in a generic fashion, we
  need at least one 32-bit field in the mbuf.

The socket part is easy, but we need hooks in the packet filters.
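
For the socket part, a hedged example follows, assuming the SO_MAX_PACING_RATE socket option is available and the underlying NIC/driver actually enforces the rate.

/*
 * Hedged example of the "socket part": request a per-connection pacing
 * rate in bytes per second.  Assumes the SO_MAX_PACING_RATE socket option
 * is present and that the NIC/driver underneath honors it.
 */
#include <sys/socket.h>
#include <stdint.h>
#include <err.h>

static void
set_pacing_rate(int s, uint32_t bytes_per_sec)
{
    if (setsockopt(s, SOL_SOCKET, SO_MAX_PACING_RATE, &bytes_per_sec,
        sizeof(bytes_per_sec)) == -1)
        err(1, "setsockopt(SO_MAX_PACING_RATE)");
}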

Hans: So for Chelsio, a packet goes to the h/w with a cookie, and based on the
cookie, the h/w puts the packet into the "correct" pacing queue?
np: We have only a limited set of queues, i.e. traffic classes, for pacing. Queues
only need to be processed in-order when they are serving a single connection.
Hans: In Mellanox's solution, we have one queue for each TCP stream and reuse the
flowid.
np: For our solution, we'd need a field in the mbuf.
drew: I'd avoid adding more fields to the mbuf, and that's why Mellanox's approach
seems better: they can reuse flowids and don't need an extra field.
np: In that case, for 30k connections, you'd need 30k queues. In Chelsio's case
we can have all those in just 4 queues, but we need a field to do proper queueing
between those 4 queues.
drew: We should try to reuse an existing field in the mbuf, if possible.
chelsio: a cookie and not a flowid. rxifp?
np: rxifp is not used on transmit today. Probably reuse that?
np: I think we should have a first-class pacing identifier in the mbuf which is not
the flowid. They are generally different.
drew: I am happy if we can do it without growing the mbuf.
np: For the FreeBSD 11 timeline, can we add a couple of fields to the mbuf?
drew: No, that'll mess with cache coherency.
<some discussion that I missed>
drew: Having a new "qos cookie" field should be sufficient for both Chelsio and
Mellanox to use in their respective solutions. The flowid would still be the same
and this new field would be used for pacing. Mellanox would need to adapt to
this new scheme, but that seems doable.
General discussion: RSS and pacing can both stay. Pacing has a flowtype, so the
flowid is a queue number.
drew: What if we have both cards in the same box and both understand/parse this
field differently?
np: Should not be a problem.
hans: Might be a problem.
np: We need to store the policies about qos in a general global namespace so any
solution can use that and work with it. Probably a new function pointer in the
ifnet? This way the footprint of the ifnet is unchanged and the KBI is stable.
hans: I am not quite qualified to decide if this is the correct approach or not,
but we should discuss.
drew: Mellanox should try to adapt to Chelsio's approach instead of the other way
around. It'd be more efficient, too.

rrs: I need 100k for this connection, so give me a cookie and I can put that on it.
np: Yes, correct.
rrs: We'd also need a flag to say whether the interface/card supports this feature.
np: You don't need an additional flag. If the "cookie" field in the ifnet is
non-zero, it means this feature is available to use. This field in the ifnet would
point to a set of "operations" available.

drew: Does any other h/w vendor want to chime in?
david: How many traffic classes?
np: Chelsio has 16 classes. But you can bind them.

jegg: When you say "qos", do you really mean "qos" or "pacing"?
sjg: Pacing is a subset of qos. Solving the qos problem will make pacing easy.

david: How does it work?
np: When you set up a connection to do flow-pacing, the allocation function from
the ifnet's pointer is called and gives you an opaque cookie corresponding to your
request, which you can then use.
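
A purely hypothetical sketch of that interface is below; none of these names exist in the tree, they only illustrate the "ops pointer in the ifnet plus opaque cookie" idea being discussed.

/*
 * Hypothetical sketch only: if_pace_ops, ipo_alloc/ipo_free, and the mbuf
 * "qos cookie" are invented names illustrating the discussion above.
 */
#include <sys/types.h>
#include <errno.h>

struct ifnet;

struct if_pace_ops {
    /* Ask the driver for a pacing class draining at most 'bps' bytes/sec. */
    int  (*ipo_alloc)(struct ifnet *, uint64_t bps, uint32_t *cookiep);
    void (*ipo_free)(struct ifnet *, uint32_t cookie);
};

/*
 * A driver that supports hardware pacing publishes a non-NULL pointer to
 * its if_pace_ops; that by itself signals the capability, so no separate
 * flag is needed (as np notes above).  The returned 32-bit cookie would be
 * carried in a new (or reused) mbuf field and tells the hardware which
 * pacing queue/class each packet belongs to.
 */
static int
flow_request_pacing(struct ifnet *ifp, const struct if_pace_ops *ops,
    uint64_t bps, uint32_t *cookiep)
{
    if (ops == NULL)
        return (EOPNOTSUPP);    /* interface cannot pace in hardware */
    return (ops->ipo_alloc(ifp, bps, cookiep));
}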

rrs: Someone should summarize this and post the discussion to keep all vendors
in the loop.
hiren: np has already sent out a review.
np: Yes, I will resurrect that with the updates from today's discussion.
But this cannot work if we don't have proper back-pressure in the stack, and we
don't have it yet.
np: Today TCP tries to send a packet, and if it can't be sent from the interface,
ENOBUFS is returned, the congestion window is brought down to 1 MSS, and we take
a timeout. We need a per-flowid feedback mechanism to let the upper layers know
about the "space left" or "% busy" so they can slow down.

ryan: mmacy has done this.
np: It seems incomplete.
hiren: AFAIK, it's replaying packets when ENOBUFS is returned.
rrs: We need some way of saying "my queue is x% full" so TCP can stop for a
while, set a timer, and come back in a bit. It's rather easy in s/w pacing, but
I'm unsure about h/w pacing.

hans: TSO becomes a problem when we have multiple connections sharing the queue.
drew: We should signal when we have one max TSO size left in the queue.
rrs: Return the number of bytes that a queue can take as feedback.
drew: That's what Linux does.

How to do back-pressure right?
------------------------------
drew: The biggest problem is that the upper layers have no notion of a "queue".
jtl: A queue in trouble may be shared by many sessions; how do we notify them all?
rrs: You cannot do that without very hacky locking trouble.
jegg: The buffer space used is shared, so how do we let individual connections know?
sjg: Each layer of the stack has a different notion/meaning of backpressure, so a
general solution may not work.
drew/rrs: Linux maintains the queue-mgmt/pacing layer above the drivers and lets
the drivers hook in with function pointers. So you can take a flowid and ask the
card for the capacity left to send. That may have high overhead, so we need to
measure it. It may depend on how we store that information, i.e. in what struct.
But that's the only sane way to do this without changing a whole lot of places.
pkelsey: Why not let the h/w do the scheduling?
sjc: That's a lot of callbacks and not real back-pressure.
hiren: That's just delaying the problem of letting the sender know.
pkelsey: We may need to think about this outside TCP too, as this is a general
problem.
jtl: This approach more or less adds one more layer of caching. Whether we really
need that is the question.
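
To make the idea concrete, here is a purely hypothetical sketch of such a per-flow feedback hook. The names are invented; nothing like this exists in the tree yet.

/*
 * Hypothetical sketch: a driver callback reports how many bytes a flow's
 * transmit queue can still take, so TCP can back off before hitting
 * ENOBUFS.  if_txq_space_t and flow_can_send_now() are invented names.
 */
#include <sys/types.h>

struct ifnet;

/* Driver callback: bytes of space left in the queue serving this flowid. */
typedef int64_t (*if_txq_space_t)(struct ifnet *, uint32_t flowid);

static int
flow_can_send_now(struct ifnet *ifp, if_txq_space_t space, uint32_t flowid,
    u_int max_tso_burst)
{
    /*
     * As suggested above, back off once less than one maximum TSO burst
     * remains, instead of waiting for ENOBUFS and the cwnd-to-1-MSS
     * fallout described earlier; the caller would set a short timer and
     * retry instead of sending now.
     */
    return (space(ifp, flowid) >= (int64_t)max_tso_burst);
}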

RFC 6675 support
----------------
hiren: Working on a few changes to the pipe calculations after discussions with jtl.
jtl: Our SACK implementation is incomplete. We need to improve it so we do better
tracking per SACK hole and can recover better/faster. Basically, tracking ACKs
per segment and maintaining a better scoreboard.
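
For reference, a hedged sketch of the RFC 6675 "pipe" estimate this work targets is below, written per-octet as in the RFC text. A real implementation iterates SACK holes/segments instead, and the helper names are invented, not the in-tree scoreboard API.

/*
 * Illustrative sketch of the RFC 6675 "pipe" estimate: count octets still
 * believed to be in flight.  sack_is_sacked()/sack_is_lost() stand in for
 * hypothetical scoreboard queries.
 */
#include <sys/types.h>
#include <netinet/tcp.h>        /* tcp_seq */
#include <netinet/tcp_seq.h>    /* SEQ_LT, SEQ_LEQ */

int sack_is_sacked(tcp_seq seq);    /* octet covered by a SACK block? */
int sack_is_lost(tcp_seq seq);      /* deemed lost per RFC 6675 IsLost()? */

static u_long
rfc6675_pipe(tcp_seq snd_una, tcp_seq snd_nxt, tcp_seq snd_highrxt)
{
    u_long pipe = 0;
    tcp_seq s;

    for (s = snd_una; SEQ_LT(s, snd_nxt); s++) {
        /* (a) Not SACKed and not lost: assumed still in the network. */
        if (!sack_is_sacked(s) && !sack_is_lost(s))
            pipe++;
        /* (b) Already retransmitted: the retransmission is in flight. */
        if (SEQ_LEQ(s, snd_highrxt))
            pipe++;
    }
    return (pipe);
}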

iflib status
------------
rstone explained the basics of iflib: a common framework for Ethernet drivers so
we can reuse code and reduce copy-paste errors.

drew: Netflix gained 15% idle CPU.
drew: It's committed in head.
hiren: Only the lib is; no drivers are committed.
drew: What about Broadcom? bxe(10)
Someone: mmacy got annoyed while trying to port bxe to iflib as the quality of
the driver was pretty bad.
np: 64-bit atomics are a problem with iflib.
Someone: mmacy fixed it.
np: No, it's a hack, i.e. using a mutex, so performance goes down.
ryan: OCE (Emulex) also needs to be converted.

RACK
----
Randall explained RACK.
https://tools.ietf.org/html/draft-cheng-tcpm-rack-01

rstone: How important is the 1 ms slop?
rrs: That's part of the draft.
wolfe: Should you include 3-dupack loss detection as a fallback mechanism?
rrs: QUIC does that, but I've never seen that to be the case in my testing on the
Internet. Sometimes you don't want to retransmit because there is reordering going
on and you want to wait. The current stack cannot distinguish between a false and
a real retransmission; RACK can.
jegg: How closely are the RACK implementations (Linux and FreeBSD) following the
draft?
rrs: I've not looked at Linux, but I am not following the draft closely as it was
not efficient.
hiren: The reason being that the code was written first at Google and the draft
came out later, so it tends to be more friendly to the Linux implementation.
rrs: The draft itself is pretty clear other than the TLP confusion, which will be
fixed in the next rev.
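
For readers not familiar with the draft, the core detection rule boils down to something like the hedged sketch below (times in microseconds). This is an illustration of the draft's rule, not the committed code, and the names are invented.

/*
 * Hedged sketch of the RACK rule from the draft above: a segment is
 * declared lost once a segment sent *after* it has already been delivered
 * and this one is overdue by roughly RTT plus a reordering window (the
 * ~1 ms "slop" discussed earlier is folded into reord_wnd_us).
 */
#include <sys/types.h>

static int
rack_seg_is_lost(uint64_t seg_sent_us, uint64_t newest_delivered_sent_us,
    uint64_t now_us, uint64_t rtt_us, uint64_t reord_wnd_us)
{
    /* A segment transmitted later must already have been (S)ACKed ... */
    if (newest_delivered_sent_us <= seg_sent_us)
        return (0);
    /* ... and enough time must have passed to rule out mere reordering. */
    return (now_us > seg_sent_us + rtt_us + reord_wnd_us);
}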

TCP incast (rstone)
-----------------
mmacy and Ryan are working on incast congestion and the resulting drops at switches.
We are trying to mitigate incast problems by retransmitting faster on such low-RTT
links.

jtl: What's the precise timer?
rstone: Matt is working on precise timers using sbt-wrap (?)
wolfe: Can you not use latency info? i.e. if latency is not increasing, it's
probably just drops at the NICs.
rstone: The current implementation's timer granularity is just not usable for our
low-RTT networks.
hiren: Have you tried DCTCP (Datacenter TCP)?
rstone: Need to. The first try was unsuccessful.
rstone: Trying to get more precise TCP timers out of the stack.
jtl: On low-RTT links, sometimes the sender falsely retransmits a packet whose ACK
is waiting to be processed.
rstone: I've seen something similar where LRO combines ACKs.
ryan: DCTCP doesn't like LRO as it hides ECN marks. We fixed it in-house and it
should be upstreamed.
np: Packet batching would help a lot.

Route caching, FlowTable
------------------------
gnn: I will commit Mike's next patch and then address melifaro@'s concerns.
gnn: We can remove flowtable if the new route-caching patches are comparable to it
in terms of performance.
gleb: Measure it first, then remove it.
np: The inpcb's reference to the ifnet that the patch proposes is a layering
violation. Can I also do that?
gnn: It's usually not advised, but this looks like a good exception.
gleb: Maintaining layering violations is a mess. Flowtable is a very non-intrusive
change, while route-caching is a hard-wired layering violation, and that's why it
was removed.
gleb: For just serving data with a static configuration this is great, but for
multi-interface routers where interfaces come and go (like Netgate's) and for
short-lived TCP connections (like WhatsApp's), this can be problematic.
np: We need to hang an ifnet-specific cookie off something related to the socket.

N/W stack as a module (Steve k)
---------------------
Slides are available.
You can build FreeBSD without the n/w stack.
This makes it easy to experiment with new n/w stacks.
There are some issues with VNET; not sure if those are fixed. Need to ask bz@.
There are a lot of layering violations in the current stack.
There is a Git repo with the changes, also in ~marcel's home dir.
A lot of the changes could be great for general FreeBSD as well.

rrs: Make sure Michael is aware of SCTP.
stevek: Yes, he is aware. We've been talking.
hiren: Is whatever you have running in prod?
stevek: Yes, for about a couple of years.

Time to extend tcpcb for FreeBSD 11? It is running out of padding
-----------------------------------------------------------------
gleb/drew: We have spares throughout, so why do we need to fix it?
jtl: But we need a careful redesign.
drew: Who uses/accesses the tcpcb?
netstat does a direct copy.
gleb: We can fix this if it's only netstat.
drew: A lot of things in this struct are u_long.
jtl: Yeah, not sure if we need those to be u_long.
gleb: For 11: systat, netstat, tcpdrop, ipfilter, trpt.
General consensus: we need to do the reshuffle of the tcpcb after 11.
rrs: When we do it, drew should run VTune and see how it fixes things.


