FreeBSD 8.x TCP Parallelism Project
FreeBSD 5.3 introduced an MPSAFE network stack.
FreeBSD 6.0 introduced significant performance optimizations and stability improvements.
FreeBSD 7.0 introduced further performance optimizations, lock granularity improvements, and strengthened invariants.
By FreeBSD 7.0, all transmit paths allowed fully parallel execution for independent TCP connections on different processors, although bind() and connect() still acquired a global lock in order to manage TCP connection state. The input path, however, relies heavily on the global tcbinfo lock, both to protect global connection lists and to guard against potential state changes that might manipulate those lists. In practice, this means a brief exclusive global lock acquisition for regular data/ACK segments, but holding the global lock across all processing of SYN/FIN/RST segments. All timers and connection-oriented paths (such as sendto() with a destination address) also acquire the tcbinfo lock. With a single input thread for network processing, this isn't horrific, although a high rate of connections being opened and closed without much data sent may lead to high contention. With a parallel input path, such as with multi-queue network cards, multiple input interfaces, or parallel loopback processing, this presents a serious problem for scalability.
The Plan: TCP
The objective is to expand support for, and opportunities for, fully parallel processing of independent TCP connections on multiple CPUs, as well as to continue to support pipelined processing of input data across CPUs.
The plan is to refactor TCP global data structures and the tcbinfo lock as follows:
- Break out tcbinfo and its lock into a series of data structures associated with independent connection groups, each with independent global locking. These groups may be accessed from any thread or CPU, but we will try to use connection affinity to discourage collisions (contention).
- Convert the tcbinfo lock to an rwlock, allowing acquisitions that will not lead to potential state changes (and hence global list changes), such as those for data/ACK segments, to run concurrently within a single connection group.
- Improve handling of timers and connection shutdown in order to avoid several race conditions that would otherwise be exposed by this work, as well as to reduce overhead by going from multiple timeouts to a single timeout (work previously explored by Andre Oppermann).
- Possibly add a reference count to struct inpcb so that when connection locks are dropped in order to reacquire global locks (which precede connection locks in the lock order), we don't lose the inpcb. Currently we are unable to drop global locks early because we *may* later need a global lock and can't upgrade.
The Plan: UDP
Unlike TCP, UDP is not stateful for most operations, and so synchronization requirements are even weaker. Much of the infrastructure is shared with TCP, so all TCP plans apply, but we also plan to:
- Convert the inpcb lock to an rwlock and acquire the lock read-only for most send/receive operations.
- Use only read-only acquisitions during input processing, since input cannot lead to connection state changes (and hence global changes).
Progress and Results
All TCP work is currently in the planning stages. A new benchmark, tcpp, has been developed to explore the impact of parallelism and concurrency on TCP performance. We are also interested in benchmarks involving web serving, web proxying/caching, NFS over TCP, and database sessions over TCP.
The conversion to rwlocks for pcbinfo and inpcb has been completed in the rwatson_udp branch, with most UDP operations converted to read-only operations. This leads to significant performance improvements for nsd and bind, which make extensive parallel use of a single socket from many threads.