This article outlines some generic routing-related problems in our network stack and describes the solutions proposed in the projects/routing branch.

Current problems

Performance issues

Many people consider routing lookups slow, and have created various mechanisms to avoid them, such as route caching in tunnels (or even in protocols), or keeping a per-flow hash record (FLOWTABLE).

A typical packet processing path (forwarding for a router, or output for a web server) consists of:

Actually, the given number of locks is roughly equivalent to having a single Giant lock: single-CPU performance (0.6-1 Mpps) is very similar to the results you can get from a modern dual-CPU system with 16/32 cores. The most expensive operation is acquiring the struct rtentry mutex (think of the default route, which concentrates HUGE contention on a single lock that is also held for plenty of time). However, the other locks also heavily impact the processing path.
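To make the cost concrete, below is a rough sketch of the locks the stock (pre-projects/routing) forwarding path touches for every packet. The function and macro names follow the 10.x-era code and are given from memory as an illustration, not the exact call chain; legacy_forward_one() is a made-up wrapper.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/route.h>

    /* Approximate per-packet lock sequence in the legacy forwarding path. */
    static void
    legacy_forward_one(struct mbuf *m)
    {
            struct route ro;

            bzero(&ro, sizeof(ro));
            /* ... fill in ro.ro_dst from the packet's destination ... */

            /*
             * 1. Route lookup: takes the per-FIB radix node head read lock,
             *    then the matched rtentry mutex to bump its refcount.  With
             *    a default route every flow contends on the same mutex.
             */
            in_rtalloc_ign(&ro, 0, M_GETFIB(m));
            if (ro.ro_rt == NULL)
                    return;                 /* no route */

            /*
             * 2. L2 resolution (arpresolve/nd6): takes the interface
             *    if_afdata read lock plus the llentry rwlock to copy out
             *    the Ethernet header.
             */

            /*
             * 3. Transmit, then drop the route reference: RTFREE() takes
             *    the rtentry mutex yet again.
             */
            RTFREE(ro.ro_rt);
    }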

Routing code itself

Dealing with the performance issues revealed a number of problems which also have to be addressed:

LLE code

While a really good job of decoupling L3/L2 was done in 8/, it was not completed. The LLE code suffers from the same problems:

Proposed solution

Most of the performance issues described in the "Current problems" section were addressed in the projects/routing branch (some already merged to HEAD). We managed to remove rwlocks/mutexes from the hot path, so these changes permit close-to-linear scaling with the number of cores. In fact, we were able to reach close to 12 Mpps for IPv4 forwarding (2xE2660, 8 cores), which is 5-10 times faster than current benchmarks.

Some benchmarks

Forwarding benchmark

[figure: ipv4_routing_1.png]

Forwarding/locking benchmark

[figure: lock_fwd.png]

High-level overview

The key concept is to hide all routing internals and provide a pre-computed, easy-to-use result to all consumers with minimal locking. We provide separate functions for each address family (fib4/fib6/rib). We also provide several methods of retrieving routing data, differing in the information returned and their relative overhead. Typically, the lookup result is pre-computed next-hop data, copied into a caller-provided 64-byte nhop_* structure, with (optionally) all the necessary data _including_ the data that needs to be prepended to the mbuf (currently a pre-computed L2 header).
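As a minimal sketch of that model: the caller provides a small next-hop structure on its own stack and the lookup fills it in, with no rtentry reference or lock escaping to the caller. The names below (netinet/in_fib.h, fib4_lookup_nh_basic(), struct nhop4_basic and its fields) follow the variant of the API that was merged to HEAD, as I recall it; treat the details as illustrative.

    #include <sys/param.h>
    #include <sys/errno.h>
    #include <net/if.h>
    #include <net/route.h>
    #include <netinet/in.h>
    #include <netinet/in_fib.h>

    /* Illustrative helper: fetch the egress MTU towards dst. */
    static int
    get_egress_mtu(uint32_t fibnum, struct in_addr dst, uint32_t *mtu)
    {
            struct nhop4_basic nh4;         /* caller-provided, on-stack */

            /* Optimised fib lookup: no refcounting, nothing to unlock. */
            if (fib4_lookup_nh_basic(fibnum, dst, 0, 0, &nh4) != 0)
                    return (EHOSTUNREACH);  /* no route */

            *mtu = nh4.nh_mtu;              /* pre-computed next-hop data */
            return (0);
    }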

Locking changes

The new API/structures are described separately on the API page.

More details on routing

Routing consumers can be split into several categories:

  1. check-rtable users (firewalls doing uRPF, various prefix checkers).
    • Here we don't need to do ANY refcounting; we only need to provide very basic info on the most-specific match using an optimised fib table lookup (see the sketch after this list).
  2. L4 level code (like TCP, trying to extract some useful info like MTU)
    • Here we might need to provide things like MTU or some L4-specific data saved in the given rtentry.
  3. Forwarding code (_output() routines)
    • Here again we need very basic info, so we can use optimised fib table lookups.
  4. Control plane code (rtsock, address routines)
    • Here we use slower rib-based procedures that return individual rtentries.
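For the first category, here is a hedged sketch of what a strict-mode uRPF check could look like on top of the same basic lookup; urpf_check_src() is a hypothetical helper, and real uRPF also has loose mode and multipath cases to handle.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/if.h>
    #include <net/route.h>
    #include <netinet/in.h>
    #include <netinet/in_fib.h>
    #include <netinet/ip.h>

    /* Hypothetical strict-mode uRPF check; illustration only. */
    static int
    urpf_check_src(struct mbuf *m, struct ifnet *ifp)
    {
            struct nhop4_basic nh4;
            struct ip *ip = mtod(m, struct ip *);

            /* Most-specific match on the *source* address, no refcounting. */
            if (fib4_lookup_nh_basic(M_GETFIB(m), ip->ip_src, 0,
                m->m_pkthdr.flowid, &nh4) != 0)
                    return (0);             /* no route back: fail */

            /* Strict mode: the reverse path must point out the ingress ifp. */
            return (nh4.nh_ifp == ifp);
    }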

More details on LLE

Regressions

Merge plan

The basic plan looks like the following:

  1. Merge most of the LLE changes not related to locking/timers (e.g. all new lltable callbacks/enumerators/etc.)

  2. Discuss/merge the LLE locking/timers changes (D3688, D3780)

  3. Perform one-by-one conversion of non-critical route(4) consumers [PARTIALLY DONE]
  4. Discuss/merge new routing lookup API for data path
  5. Discuss/merge lltable locking (afdata rwlock -> lltable rmlock+rwlock)

  6. Discuss/merge routing locking changes

Status

Current conversion status is available here.

Considerations/Open questions

  1. Wider use of rmlock in the kernel. Rmlock has the side effect of "pinning" the current thread, which might lead to more severe consequences when using (contested) mutexes/rwlocks.
    • Here we try to avoid acquiring any additional locks as much as possible (no locks for routing lookups, rare lookups for LLE stuff). However, more could be done (for example, explicitly avoiding lock acquisition when sending LOTS of traffic to a single dead host with no L2 lle info). Additionally, a write lock costs a LOT more, so some sort of batching should be considered when inserting kernel routes/lle entries.
  2. Could we teach PFIL to start looking at the mbuf from a certain offset (eliminating the 2x LLE header copy)?
  3. "slowpath" netisr queue: we would be better off deferring the handling of some packets (like the first packet sent to a host, which triggers creation of an LLE record) to a separate netisr queue so that such cases can be batched (since LLE insertion requires a WLOCK). See the sketch after this list.
  4. Should we insert link-local routes into the RIB?
    • A typical routing lookup now looks like the following: "What is the egress interface/gw for link-local address X with embedded scope Y?" "You know, it is an interface route and the egress interface is Y!"
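To make item 3 concrete, here is a hedged sketch of how a dedicated "slowpath" netisr could be wired up using the standard netisr(9) registration; the protocol slot, handler body and queue-on-miss call are all hypothetical, and the actual batching policy is exactly the open question above.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/netisr.h>

    /* Hypothetical protocol slot for the routing slow path. */
    #define NETISR_RT_SLOWPATH      15

    static void
    rt_slowpath_handler(struct mbuf *m)
    {
            /*
             * Runs off the fast path, so taking the lltable write lock
             * here to create the missing LLE entry is acceptable;
             * afterwards the packet can be re-injected or dropped.
             */
            m_freem(m);                     /* placeholder */
    }

    static struct netisr_handler rt_slowpath_nh = {
            .nh_name = "rt_slowpath",
            .nh_handler = rt_slowpath_handler,
            .nh_proto = NETISR_RT_SLOWPATH,
            .nh_policy = NETISR_POLICY_SOURCE,
    };

    /* Registration (e.g. from a SYSINIT): netisr_register(&rt_slowpath_nh);  */
    /* On an LLE miss in the fast path:    netisr_queue(NETISR_RT_SLOWPATH, m); */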

Future plans
