Networking session
Topics we are interested in
- mbuf allocation improvements - variable size mbufs (*)
- TCP - work deferral, Van Jacobson net channels (*)
- Connection Groups with RSS [ ]
- supporting 2-tuple (on all groups) vs. the currently supported 4-tuple
- patches exist in industry, people trying to extract them
- TODO:
- get the patch
- affinity APIs for sockets - (1) find a real application (2) layer - driver vs. TCP vs. socket
- nginx as a load balancer
- pps benchmark
- memcached (?) - cpuset?
- CPUIDs vs. sets - API friendliness
- Batching vs. latency (e.g. LRO, TSO) (discuss)
- Profiling
- NUMA - DMA
- "config lock": rmlocks - RCU - giant lock [ ]
- upsides:
- many fewer lock ops
- less line contention
- cache locality
- downsides:
- update latency - loss of granularity
- pinning - priority inversions
- Questions:
- what goes under rmlocks?
- what uses RW/etc.?
- What write rate is suitable for a config lock?
- What is the read-time distribution?
- What is priority propagation cost?
- Profile paths followed under read + write
- WITNESS?
- Profiling
- Entry points to possibly acquire global rmlock?
- netisr
- socket entry
- protocol callouts/ioctls
- device-driver ithread - BPF
- device-driver task queues
- device-driver ioctls
- firewall / dummynet / altq
- link-layer (e.g. 802.11)
- interface cloning/destroy
- netgraph (e.g. ng_tty)
- sysctls
- VIMAGE/SYSINITS
- upsides:
- route caches, flowcache, ... [ ]
- vertical lock coalescing [ ] (discuss)
- affinity-APIs - user space
- netmap changes
Legend:
- (*) Robert can tell us about
- [ ] make decisions
- (discuss) well, discuss
Measurements (we want to see)
- Config lock-related
- ioctls/etc.
- measure read/write lock ratio
- global ifnet
- ifnet address locks + ifaddr
- lltable + llentry
- firewall lists
- firewall rules
- host cache - IPv6
- |? routing table + rtentries
- || inpcbinfo
- UDP
- TCP
- syncache
- nd6 prefix list, etc.
- || connection groups
- ethernet bridge config
- SNMP-monitored things
- netisr
- BPF
- new connection setup
- investigate Nicholais file descriptor optimisation = non-linear fd allocation (possibly not a big deal for real http)
- Lock traces for common paths, stats
- Cache-line contention measurements
- packet narrative traces
- measurement scenarios:
- classic web
- web w/ lagg
- varnish
- mySQL
- tunnel services / VPN / L2TP
- (full) BGP actively changed (peak 25/s ?)
- NFS scenarios
- different LAN sites
- dynamic firewall rules
- ethernet bridge
- DNS-like (heavy UDP)
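Sketch for the "measure read/write lock ratio" item above: a minimal userspace illustration, not kernel code. It assumes counters are bumped next to each read or write acquisition of the lock being studied, and that they are kept in cache-line-padded slots (indexed by CPU or thread number) so the measurement itself does not add the line contention we are trying to measure. All names (`lockstat`, `lockstat_note_read`, etc.) and the slot count are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOCKSTAT_SLOTS 8   /* assumed number of CPUs/threads being tracked */
#define CACHE_LINE     64  /* assumed cache line size in bytes */

/* One slot per CPU/thread, aligned so slots never share a cache line. */
struct lockstat_slot {
	_Alignas(CACHE_LINE) atomic_uint_fast64_t reads;
	atomic_uint_fast64_t writes;
};

struct lockstat {
	struct lockstat_slot slot[LOCKSTAT_SLOTS];
};

static void
lockstat_init(struct lockstat *s)
{
	for (unsigned i = 0; i < LOCKSTAT_SLOTS; i++) {
		atomic_init(&s->slot[i].reads, 0);
		atomic_init(&s->slot[i].writes, 0);
	}
}

/* Call right after a read (shared) acquisition. */
static void
lockstat_note_read(struct lockstat *s, unsigned who)
{
	atomic_fetch_add_explicit(&s->slot[who % LOCKSTAT_SLOTS].reads, 1,
	    memory_order_relaxed);
}

/* Call right after a write (exclusive) acquisition. */
static void
lockstat_note_write(struct lockstat *s, unsigned who)
{
	atomic_fetch_add_explicit(&s->slot[who % LOCKSTAT_SLOTS].writes, 1,
	    memory_order_relaxed);
}

/* Sum all slots; the totals are a snapshot, not an atomic cut. */
static void
lockstat_totals(struct lockstat *s, uint64_t *reads, uint64_t *writes)
{
	uint64_t r = 0, w = 0;

	for (unsigned i = 0; i < LOCKSTAT_SLOTS; i++) {
		r += atomic_load_explicit(&s->slot[i].reads, memory_order_relaxed);
		w += atomic_load_explicit(&s->slot[i].writes, memory_order_relaxed);
	}
	*reads = r;
	*writes = w;
}

/* Reads per write, or -1.0 if no write has been recorded yet. */
static double
lockstat_reads_per_write(struct lockstat *s)
{
	uint64_t r, w;

	lockstat_totals(s, &r, &w);
	return (w == 0) ? -1.0 : (double)r / (double)w;
}
```

One such structure per candidate lock (global ifnet, lltable, firewall rules, ...) would give the ratio per subsystem, which is the input needed for the "what goes under rmlocks?" question.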
Research TODO
- HTM / STM (TM == transactional memory) features
Optimisations to try out
- RSS aligned PCBGROUPS (whole stack) (adrian)
- Global config lock (bz)
- Van Jacobson TCP deferral (rwatson)
- pmcstat + netstat hybrid (adrian)
- mbuf allocator (rwatson)
- mbuf abstractions hiding (adrian)
- deferred cluster free for buffer cache (adrian)
- route caching restore (bz)
- remove flowtable? (bz)
- splice? (adrian)
- vertical lock coalescing (rwatson, in future)
Chat
- High connection accept rates with pcbgroups: not all hardware can deliver SYNs on multiple queues, etc. In the UDP/DNS case, use multiple IPs to push the steering down and do it at a different layer. ... vs. affinity, ...
- High accept rate: contention in syncache processing and in new socket creation for short-lived TCP connections. Explanation of pcbgroups, with two lookup tables, more work, ... Toeplitz problems.
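For reference on the 2-tuple vs. 4-tuple discussion, a minimal sketch of the Toeplitz hash that RSS hardware computes, and of mapping the result to a connection group. This is a plain userspace illustration, not the stack's implementation. The key is assumed to match the 40-byte key used in Microsoft's published RSS verification examples (any 40-byte key works for the sketch), and `hash_ipv4` and `group_for_hash` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed example key; real NICs are programmed with a key by the driver. */
static const uint8_t example_key[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/*
 * Toeplitz hash: for every set bit j of the input (MSB first), XOR in
 * the 32-bit window of the key that starts at key bit j.
 */
static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	assert(keylen >= 4);

	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	uint32_t hash = 0;
	size_t next = 32;	/* index of the next key bit to shift in */

	for (size_t i = 0; i < datalen; i++) {
		for (int bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			/* Slide the window one bit along the key. */
			window <<= 1;
			if (next / 8 < keylen &&
			    (key[next / 8] & (0x80u >> (next % 8))))
				window |= 1;
			next++;
		}
	}
	return hash;
}

/*
 * Hash an IPv4 flow. with_ports == 0 gives the 2-tuple hash (addresses
 * only); with_ports != 0 gives the 4-tuple hash (addresses + ports).
 * Arguments are in host byte order; bytes are fed most significant first.
 */
static uint32_t
hash_ipv4(uint32_t src, uint32_t dst, uint16_t sport, uint16_t dport,
    int with_ports)
{
	uint8_t in[12];

	in[0] = (uint8_t)(src >> 24);  in[1] = (uint8_t)(src >> 16);
	in[2] = (uint8_t)(src >> 8);   in[3] = (uint8_t)src;
	in[4] = (uint8_t)(dst >> 24);  in[5] = (uint8_t)(dst >> 16);
	in[6] = (uint8_t)(dst >> 8);   in[7] = (uint8_t)dst;
	in[8] = (uint8_t)(sport >> 8); in[9] = (uint8_t)sport;
	in[10] = (uint8_t)(dport >> 8); in[11] = (uint8_t)dport;

	return toeplitz_hash(example_key, sizeof(example_key), in,
	    with_ports ? 12 : 8);
}

/* Hypothetical group selection: one group per RSS bucket/CPU. */
static unsigned
group_for_hash(uint32_t hash, unsigned ngroups)
{
	return hash % ngroups;
}
```

With the 2-tuple variant every connection between the same pair of hosts lands in the same group regardless of ports, which is one reason the choice between 2-tuple and 4-tuple matters for load balancers and DNS-like traffic.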