Networking session

Topics we are interested in

mbuf allocation improvements - variable size mbufs (*)
TCP - work deferral, Van Jacobson net channels (*)
Connection Groups with RSS [ ]
- supporting 2-tuple (on all groups) vs. the 4-tuple currently supported
- patches exist in industry, people are trying to extract them
- TODO:
  - get the patch
  - affinity APIs for sockets - (1) find a real application (2) layer - driver vs. TCP vs. socket
    - nginx as a load balancer
    - pps benchmark
    - memcached (?) - cpuset?
  - CPUIDs vs. sets - API friendliness
Batching vs. Latency (e.g. LRO, TSO) (discuss)
Profiling
NUMA - DMA
"config lock": rmlocks - RCU - giant lock [ ]
- upsides:
  - far fewer lock ops
  - less cache-line contention
  - cache locality
- downsides:
  - update latency - loss of granularity
  - pinning - priority inversions
- Questions:
  - what goes under rmlocks?
  - what uses RW/etc.?
  - What write rate is suitable for a config lock?
  - What is the read time distribution?
  - What is priority propagation cost?
  - Profile paths followed under read + write
- WITNESS?
- Profiling
- Entry points to possibly acquire global rmlock?
  - netisr
  - socket entry
  - protocol callouts/ioctls
  - device-driver ithread - BPF
  - device-driver task queues
  - device-driver ioctls
  - firewall / dummynet / altq
  - link-layer (e.g. 802.11)
  - interface cloning/destroy
  - netgraph (e.g. ng_tty)
  - sysctls
  - VIMAGE/SYSINITS
route caches, flowcache, ... [ ]
vertical lock coalescing [ ] (discuss)
affinity-APIs - user space
netmap changes

Legend:

- (*) Robert can tell us about
- [ ] make decisions
- (discuss) well, discuss

Measurements (we want to see)

Config lock-related
- ioctls/etc.
- measure read/write lock ratio
  - global ifnet
  - ifnet address locks + ifaddr
  - lltable + llentry
  - firewall lists
  - firewall rules
  - host cache - IPv6
  - |? routing table + rtentries
  - || inpcbinfo
    - UDP
    - TCP
    - syncache
  - nd6 prefix list, etc.
  - || connection groups
  - ethernet bridge config
  - SNMP-monitored things
  - netisr
  - BPF
- new connection setup
- investigate Nicholai's file descriptor optimisation = non-linear fd allocation (possibly not a big deal for real http)
- Lock traces for common paths, stats
- Cache-line contention measurements
- packet narrative traces
- measurement scenarios:
  - classic web
  - web w/ lagg
  - varnish
  - mySQL
  - tunnel services / VPN / L2TP
  - (full) BGP actively changed (peak 25/s ?)
  - NFS scenarios
  - different LAN sites
  - dynamic firewall rules
  - ethernet bridge
  - DNS-like (heavy UDP)
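
The read/write lock ratio measurement and the "what write rate is suitable" question can be tied together with a little arithmetic. The sketch below is only an illustration: the cost model (a fixed cost per read and per write for each lock type), the function names, and every number in the demo are assumptions, not measurements or kernel APIs. It assumes a read-mostly lock makes reads cheaper and writes more expensive than a plain reader/writer lock, and asks how many writes per read can be tolerated before the saving is lost.

```python
from dataclasses import dataclass


@dataclass
class LockSample:
    """Acquisition counts for one lock over a measurement window."""
    name: str
    reads: int
    writes: int


def read_write_ratio(sample: LockSample) -> float:
    """Reads per write; infinity if the lock was never write-locked."""
    if sample.writes == 0:
        return float("inf")
    return sample.reads / sample.writes


def break_even_writes_per_read(rw_read_cost: float, rm_read_cost: float,
                               rw_write_cost: float, rm_write_cost: float) -> float:
    """Writes per read at which both lock types cost the same in total.

    Total cost is modelled as reads * read_cost + writes * write_cost.
    All four costs are hypothetical inputs in arbitrary units.
    """
    saved_per_read = rw_read_cost - rm_read_cost
    extra_per_write = rm_write_cost - rw_write_cost
    if saved_per_read <= 0:
        return 0.0            # reads are not cheaper: never worth it
    if extra_per_write <= 0:
        return float("inf")   # writes are not dearer: always worth it
    return saved_per_read / extra_per_write


def favours_read_mostly(sample: LockSample, threshold: float) -> bool:
    """True when the observed writes-per-read rate is below the break-even."""
    if sample.reads == 0:
        return False
    return sample.writes / sample.reads < threshold


if __name__ == "__main__":
    # Placeholder costs and counts, invented purely to exercise the code.
    threshold = break_even_writes_per_read(100, 20, 100, 10_000)
    for s in (LockSample("lock A", 1_000_000, 10),
              LockSample("lock B", 50_000, 5_000)):
        print(s.name, read_write_ratio(s), favours_read_mostly(s, threshold))
```

Real per-operation costs would have to come from the profiling and cache-line contention measurements listed above; the model also ignores read hold times and priority propagation, which the questions under "config lock" call out separately.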

Research TODO

HTM / STM (TM == transactional memory) features

Optimisations to try out

1. RSS aligned PCBGROUPS (whole stack) (adrian)
2. Global config lock (bz)
3. Van Jacobson TCP deferral (rwatson)
4. pmcstat + netstat hybrid (adrian)
5. mbuf allocator (rwatson)
6. mbuf abstractions hiding (adrian)
7. deferred cluster free for buffer cache (adrian)
8. route caching restore (bz)
9. remove flowtable? (bz)
10. splice? (adrian)
11. vertical lock coalescing (rwatson, in future)

Chat

High connection accept rates with pcbgroups: not all HW can do SYN on multiple queues, etc. In the UDP/DNS case, use multiple IPs to push the steering down and do it at a different layer. ... vs. affinity, ...
- high accept rate: contention in syncache processing and new socket creation for short-lived (TCP) connections. pcbgroups explanations: two lookup tables, more work, ... Toeplitz problems
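
To make the 2-tuple vs. 4-tuple distinction concrete, here is a minimal user-space sketch of a Toeplitz hash over IPv4 flows. It is not the kernel or NIC implementation. The key is a widely circulated 40-byte sample RSS key (any 40-byte key would do here), and `group_for` with its simple modulo mapping is a hypothetical stand-in for however a stack maps a hash to a connection group.

```python
import ipaddress
import struct

# Sample 40-byte RSS key; an assumption for illustration, not a required value.
KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0d0ca2bcb"
    "ae7b30b477cb2da38030f20c6a42b73bbeac01fa"
)


def toeplitz_hash(data: bytes, key: bytes = KEY) -> int:
    """XOR together the 32-bit key window at every set bit of the input."""
    key_bits = len(key) * 8
    if len(data) * 8 + 32 > key_bits:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                # Window starts at the position of this input bit.
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def _addrs(src: str, dst: str) -> bytes:
    return ipaddress.IPv4Address(src).packed + ipaddress.IPv4Address(dst).packed


def hash_2tuple(src: str, dst: str) -> int:
    """Hash over source and destination address only."""
    return toeplitz_hash(_addrs(src, dst))


def hash_4tuple(src: str, dst: str, sport: int, dport: int) -> int:
    """Hash over addresses plus source and destination port."""
    return toeplitz_hash(_addrs(src, dst) + struct.pack("!HH", sport, dport))


def group_for(hash_value: int, ngroups: int) -> int:
    """Hypothetical mapping from hash to connection group index."""
    return hash_value % ngroups


if __name__ == "__main__":
    # Documentation-range addresses; ports chosen arbitrarily.
    for sport in (40000, 40001):
        h2 = hash_2tuple("192.0.2.1", "198.51.100.7")
        h4 = hash_4tuple("192.0.2.1", "198.51.100.7", sport, 80)
        print(sport, group_for(h2, 4), group_for(h4, 4))
```

With a 2-tuple hash every connection between the same pair of hosts lands in the same group regardless of port, which is what a stack wants for traffic the hardware cannot hash on ports; a 4-tuple hash can spread those connections across groups.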

DevSummit/201308/Networking (last edited 2021-04-25T07:13:08+0000 by JethroNederhof)