Network RSS

What is RSS?

In this context, it's "Receive Side Scaling". It started at Microsoft.

Start here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff570736(v=vs.85).aspx

Then, read this: http://msdn.microsoft.com/en-us/library/windows/hardware/ff567236(v=vs.85).aspx

The basic idea is to keep the code and data for each given TCP and UDP flow on a single CPU, which aims to:

Where's the code?

Robert's initial RSS work and the PCBGROUPS work are in -HEAD.

I'm committing things to -HEAD as they get tested.

I do development in this branch - but right now everything is in -HEAD:

RSS support overview

The RSS code is in sys/netinet/in_rss.[ch]. It provides the framework for mapping a given mbuf and RSS hash value to an RSS bucket, which then typically maps to a CPU, netisr context and PCBGROUP. For now the RSS code treats the RSS bucket as the netisr context and PCBGROUP value.

The PCBGROUP code in sys/netinet/in_pcbgroups.[ch] creates one PCB table per configured "thing". For RSS, it's one per RSS bucket - not per CPU. The RSS bucket -> CPU mapping occurs separately.

The RSS hash is a 32-bit number calculated from a Toeplitz hash and the RSS key. The Microsoft links above describe how the RSS hash is calculated for each frame.
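As a rough illustration of the hash itself, here is a minimal Toeplitz sketch in Python. It is not the code in sys/netinet/in_rss.[ch]. The key below is an arbitrary placeholder rather than a real RSS key, and `toeplitz_hash` and `ipv4_4tuple_input` are names made up for this example. The input ordering (source address, destination address, source port, destination port) follows the Microsoft documents.

```python
import socket
import struct

# Arbitrary 40-byte placeholder key, for illustration only (not a real RSS key).
EXAMPLE_KEY = bytes(range(1, 41))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of `data` under `key`."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least len(data) + 4 bytes long")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                # For every set input bit (most significant bit first), XOR in
                # the 32-bit window of the key starting at that bit position.
                shift = key_bits - 32 - (i * 8 + bit)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def ipv4_4tuple_input(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Build the 12-byte hash input for an IPv4 TCP or UDP 4-tuple."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


if __name__ == "__main__":
    flow = ipv4_4tuple_input("192.0.2.1", "192.0.2.2", 12345, 80)
    print(hex(toeplitz_hash(EXAMPLE_KEY, flow)))
```

A 2-tuple hash uses the same function with only the two addresses as input.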

The RSS code chooses the number of RSS bits, and therefore the number of buckets, based on the number of CPUs - typically enough bits for twice as many buckets as CPUs, so rebalancing is possible. The maximum number of RSS buckets is 128, or 7 bits.
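The sizing rule can be sketched as below. This is an assumption-laden illustration, not the kernel's exact arithmetic: it picks the smallest bit count that gives at least twice as many buckets as CPUs, caps it at 7 bits, and assumes the bucket is taken from the low bits of the hash. `rss_bits_for_cpus` and `rss_bucket` are hypothetical names.

```python
RSS_MAX_BITS = 7  # 128 buckets, the maximum stated in the text


def rss_bits_for_cpus(ncpus: int) -> int:
    """Smallest bit count giving at least 2 * ncpus buckets, capped at 7."""
    if ncpus < 1:
        raise ValueError("need at least one CPU")
    bits = (2 * ncpus - 1).bit_length()
    return min(bits, RSS_MAX_BITS)


def rss_bucket(hash_value: int, bits: int) -> int:
    """Assumption: the bucket is the low `bits` bits of the 32-bit RSS hash."""
    return hash_value & ((1 << bits) - 1)


if __name__ == "__main__":
    for ncpus in (1, 4, 6, 64, 256):
        bits = rss_bits_for_cpus(ncpus)
        print(ncpus, "CPUs ->", bits, "bits,", 1 << bits, "buckets")
```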

The sysctl "net.inet.rss.bits" is currently the only tunable (set at boot time) and controls how many of the RSS bits described above are used when creating the RSS buckets.

Some NICs (e.g. the Intel igb(4) NICs) only support 8 RSS queues (and some earlier NICs only support 4) - but the RSS code doesn't yet know about this. net.inet.rss.bits may need to be capped to the maximum number of RSS queues that the NIC(s) in use support.

The RSS code then allocates CPUs to each RSS bucket in a round-robin fashion.
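A minimal sketch of round-robin allocation, assuming buckets are simply walked in order over a list of CPU IDs. `assign_buckets_round_robin` is a made-up name and the kernel may order things differently.

```python
from typing import Dict, List


def assign_buckets_round_robin(nbuckets: int, cpu_ids: List[int]) -> Dict[int, int]:
    """Map each bucket index to a CPU ID, cycling through cpu_ids in order."""
    if not cpu_ids:
        raise ValueError("need at least one CPU")
    return {bucket: cpu_ids[bucket % len(cpu_ids)] for bucket in range(nbuckets)}


if __name__ == "__main__":
    # With 8 buckets and 4 CPUs, each CPU ends up serving two buckets.
    print(assign_buckets_round_robin(8, [0, 1, 2, 3]))
```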

Userland can query the RSS bucket to CPU mapping in "net.inet.rss.bucket_mapping" - it's a string of bucket:CPUID queue values. Userland can use this to create one worker thread per RSS bucket and then bind it to the right CPU.
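A userland sketch of reading that mapping. It assumes the string is a whitespace-separated list of bucket:CPUID pairs (for example "0:0 1:1 2:0 3:1"), which is an assumption about the exact format, and it shells out to sysctl(8) with -n to get the value. The function names are hypothetical.

```python
import subprocess
from typing import Dict


def parse_bucket_mapping(text: str) -> Dict[int, int]:
    """Parse 'bucket:cpu bucket:cpu ...' into {bucket: cpu}."""
    mapping: Dict[int, int] = {}
    for pair in text.split():
        bucket, cpu = pair.split(":", 1)
        mapping[int(bucket)] = int(cpu)
    return mapping


def read_bucket_mapping() -> Dict[int, int]:
    """Read the mapping from the running kernel (needs an RSS-enabled kernel)."""
    out = subprocess.run(
        ["sysctl", "-n", "net.inet.rss.bucket_mapping"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_bucket_mapping(out)


if __name__ == "__main__":
    # Parsing a sample string, so this runs without an RSS-enabled kernel.
    for bucket, cpu in sorted(parse_bucket_mapping("0:0 1:1 2:0 3:1").items()):
        print("bucket", bucket, "-> CPU", cpu)
```

A worker thread per bucket can then be created from the resulting dictionary. Binding each thread to its CPU is platform-specific and is not shown here.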

The RSS kernel calls are as follows:

TODO list

Here's the list of things to do for basic RSS support.

Current work

The current RSS work aims to finish up RSS awareness in the UDP path and tidy up the immediate loose ends around IPv4 and IPv6 fragment handling.

Completed work

Later work

What isn't required for basic RSS but would be nice:

Drivers

Work is being done using igb(4) and ixgbe(4), as that's what AdrianChadd has on hand.

TODO - hopefully also cxgbe

Each RSS aware driver needs two things:

RSS support in NICs typically comes with:

Then if RSS is enabled:

UDP RSS

UDP is mostly the same as TCP, except where it isn't.

Some NICs support hashing on IPv4 / IPv6 UDP information. It's not part of the Microsoft specification but the support is there.

The UDP transmit path doesn't assign a flowid / flowtype during udp_append(). The only time this occurs is when udp_append() calls ip_output() - if flowtable is enabled, the flowtable code will assign the flowtable hash to the mbuf flowid.

The ip_output() path has some code that inspects the inp for flowid/flowtype and assigns that to the mbuf. For UDP this won't really work in all cases as the inp may not actually reflect the actual source/destination of the frame. So the inp can't just be populated with the flow details.

That isn't entirely true: it can be, but then during udp_append() the inp flowid/flowtype can only be used if the source/destination is an exact match, not a wildcard match. (I.e., it's a connected socket with a defined local and remote address/port - then send / recv will just use those IPs. If it's sourcing from INADDR_ANY or it's being overridden by sendto / sendmsg, the flowid will need re-calculating.)
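The exact-match rule can be sketched as a small predicate. Everything here is hypothetical: `Inp` is a stand-in for the kernel's inpcb, not its real layout, and `can_reuse_inp_flowid` does not exist in the tree. The sketch conservatively treats any per-packet destination override as requiring a recalculation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Endpoint = Tuple[str, int]   # (address, port)
WILDCARD_ADDR = "0.0.0.0"    # INADDR_ANY


@dataclass
class Inp:
    """Hypothetical stand-in for the fields of an inpcb that matter here."""
    local: Endpoint
    remote: Optional[Endpoint] = None   # None means the socket is not connected
    flowid: Optional[int] = None


def can_reuse_inp_flowid(inp: Inp, dst_override: Optional[Endpoint] = None) -> bool:
    """True only for a connected socket with an exact local/remote match."""
    if inp.flowid is None:
        return False            # nothing cached on the inp
    if inp.remote is None:
        return False            # unconnected socket: no fixed remote endpoint
    if inp.local[0] == WILDCARD_ADDR:
        return False            # sourcing from INADDR_ANY
    if dst_override is not None:
        return False            # sendto / sendmsg supplied its own destination
    return True


if __name__ == "__main__":
    connected = Inp(("192.0.2.1", 5000), ("192.0.2.2", 53), flowid=0x1234)
    print(can_reuse_inp_flowid(connected))                         # connected, exact match
    print(can_reuse_inp_flowid(connected, ("198.51.100.9", 53)))   # overridden destination
```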

So for transmit:

There's also the problem of IP fragments - see below.

The PCBGROUPS/RSS code doesn't know about UDP hashing. By default the UDP setup code configures the PCB info table as a two-tuple hash, rather than a four-tuple hash. The default igb/ixgbe NIC RSS configuration, however, hashes on the UDP 4-tuple. So unless a few pieces are updated to treat UDP hashing as a four-tuple, the NIC-provided hashing won't line up with the expected hash types for RSS/PCBGROUPS, and things won't work.

So, the TODO list looks something like this:

It's possible that for now the correct thing to do is to change the NIC hash configuration to not hash on UDP 4-tuple and to treat UDP as a straight IPv4 or IPv6 packet. It means that all UDP traffic between any given pair of hosts will be mapped to the same CPU instead of being distributed, but it avoids worrying about IP fragment hash handling.

So to finish it off:

Handling IP Fragments

This is especially a problem with UDP.

For the IPv4 path, it's handled via ip_reass() in sys/netinet/ip_input.c.

For the IPv6 path, IPv6 fragments are a different IPv6 protocol. They're effectively treated as a tunnel - re-encapsulation is performed and the complete frame is then re-parsed as a full IPv6 frame. That's handled in sys/netinet6/frag6.c.

So, once the fragments have been received and reassembled, a 2-tuple or 4-tuple RSS hash is required. Which one depends upon the protocol type (TCP, UDP, other) and whether TCP/UDP hashing is enabled in RSS.

The easiest solution would be to just recalculate the hash on the completed frame. The more complicated solution is to check whether the required hash type has changed (e.g. an IPv4 TCP frame which the hardware stamped with a 2-tuple hash type) and, if the frame still has the correct hash type, treat the existing hash as valid.
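The "check the hash type" option might look roughly like this for IPv4. The enum values and function names are invented for this sketch and are not the kernel's mbuf hash type constants. The idea is to work out which hash type the reassembled packet should carry and compare it with what the hardware stamped on the fragment.

```python
from enum import Enum


class HashType(Enum):
    """Hypothetical hash type labels for IPv4 (not the kernel's constants)."""
    NONE = "none"
    IPV4 = "ipv4 2-tuple"
    TCP_IPV4 = "tcp/ipv4 4-tuple"
    UDP_IPV4 = "udp/ipv4 4-tuple"


def expected_hash_type(proto: str, tcp_hash_enabled: bool,
                       udp_hash_enabled: bool) -> HashType:
    """Hash type a complete packet should carry, given the RSS configuration."""
    if proto == "tcp" and tcp_hash_enabled:
        return HashType.TCP_IPV4
    if proto == "udp" and udp_hash_enabled:
        return HashType.UDP_IPV4
    return HashType.IPV4   # everything else falls back to the 2-tuple hash


def needs_rehash(stamped: HashType, proto: str, tcp_hash_enabled: bool,
                 udp_hash_enabled: bool) -> bool:
    """True if the hash stamped on the fragments can't be reused as-is."""
    return stamped != expected_hash_type(proto, tcp_hash_enabled, udp_hash_enabled)


if __name__ == "__main__":
    # A TCP fragment stamped with a 2-tuple hash, with TCP 4-tuple hashing on.
    print(needs_rehash(HashType.IPV4, "tcp", True, False))
    # A UDP fragment stamped with a 2-tuple hash, with UDP hashing off.
    print(needs_rehash(HashType.IPV4, "udp", True, False))
```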

Another amusing thing is where to reinject the completed packet. The destination RSS bucket likely maps to a different CPU than the one which reassembled the frames. Somehow the packet needs to be re-injected into the correct netisr queue and handled appropriately.
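One way to picture the reinjection problem is the toy model below. It is only a model under stated assumptions: one queue per RSS bucket, the bucket taken from the low bits of the hash, and a hypothetical `BucketQueues` class standing in for the netisr machinery. It does not use the real netisr API.

```python
from collections import deque
from typing import Dict, Tuple


class BucketQueues:
    """Toy model: one packet queue per RSS bucket, each bucket owned by a CPU."""

    def __init__(self, nbuckets: int, bucket_to_cpu: Dict[int, int]):
        if nbuckets < 1 or nbuckets & (nbuckets - 1):
            raise ValueError("nbuckets must be a power of two")
        self.mask = nbuckets - 1
        self.bucket_to_cpu = bucket_to_cpu
        self.queues = [deque() for _ in range(nbuckets)]

    def reinject(self, packet: bytes, rss_hash: int, current_cpu: int) -> Tuple[str, int]:
        """Route a reassembled packet to the bucket chosen by its new hash."""
        bucket = rss_hash & self.mask
        if self.bucket_to_cpu[bucket] == current_cpu:
            # Already on the owning CPU: it could be processed in place.
            return ("direct", bucket)
        # Otherwise hand it over to the owning bucket's queue.
        self.queues[bucket].append(packet)
        return ("queued", bucket)


if __name__ == "__main__":
    q = BucketQueues(4, {0: 0, 1: 1, 2: 0, 3: 1})
    print(q.reinject(b"reassembled", 0x7, current_cpu=0))
    print(q.reinject(b"reassembled", 0x2, current_cpu=0))
```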

NetworkRSS (last edited 2014-12-14 20:02:28 by AdrianChadd)