802.11 driver/stack TX path notes

The 802.11 TX path is quirky. Here are some notes about it.

TX path

There are a few TX path entry points:

Normal data

Fragments

This is the dirty bit. It occurs via ieee80211_fragment(). This breaks the mbuf into a chain of mbufs, linked via m->m_nextpkt.

However, the moment one calls if_start or if_transmit with this chain of mbuf fragments, the m_nextpkt pointer gets cleared and only the first fragment of the frame is transmitted. The remaining fragments are never transmitted, and they are silently leaked, because m_nextpkt is simply cleared rather than the mbufs it pointed to being freed.
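To make the leak concrete, here is a minimal userland sketch. It is not kernel code: struct frag is a hypothetical stand-in for struct mbuf that models only the m_nextpkt link, and free() stands in for handing a fragment to the hardware. It contrasts clearing the link on the head (which orphans the rest of the chain) with detaching and sending every fragment in turn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mbuf; only the packet link is modelled. */
struct frag {
	struct frag *nextpkt;	/* plays the role of m->m_nextpkt */
	int id;
};

/* Build a chain of n fragments, linked the way a fragmented frame would be. */
static struct frag *
frag_chain_alloc(int n)
{
	struct frag *head = NULL, **tailp = &head;

	for (int i = 0; i < n; i++) {
		struct frag *f = calloc(1, sizeof(*f));
		if (f == NULL)
			abort();
		f->id = i;
		*tailp = f;
		tailp = &f->nextpkt;
	}
	return head;
}

/* Count the fragments still reachable from f. */
static int
frag_chain_len(const struct frag *f)
{
	int n = 0;

	for (; f != NULL; f = f->nextpkt)
		n++;
	return n;
}

/*
 * The failure mode: the link is cleared and only the head is sent.
 * The rest of the chain is returned via *orphans so the sketch can
 * show what would have been leaked.  Returns the number sent.
 */
static int
send_head_only(struct frag *head, struct frag **orphans)
{
	assert(head != NULL);
	*orphans = head->nextpkt;
	head->nextpkt = NULL;
	free(head);		/* stand-in for "transmitted" */
	return 1;
}

/* The safe pattern: save the next pointer, detach, send, repeat. */
static int
send_all(struct frag *head)
{
	int sent = 0;

	while (head != NULL) {
		struct frag *next = head->nextpkt;

		head->nextpkt = NULL;
		free(head);	/* stand-in for "transmitted" */
		sent++;
		head = next;
	}
	return sent;
}
```

With a four-fragment chain, send_head_only() sends one fragment and leaves three unreachable unless the caller kept its own pointer to them; send_all() sends all four.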

In the past, the driver would get the 802.3 frame and it would then do the net80211 encapsulation, fragmenting and encryption. So it could loop over a list of fragments and transmit those.

This changed as part of the work to enable VAP transmission, where frames go into the VAP first and encapsulation happens at this layer. This includes fast-frames handling and fragmentation.

Raw (management) frames

This path is rather annoying - if it's a "raw" frame, it goes straight to the driver. Otherwise it goes via if_transmit() to get encapsulated.

There's also an ic_raw_xmit() function; it gets used directly by some management frame transmission routines.

Power save queue

Power save queue packets get pushed onto the queue here from the net80211 TX path. It gets drained (normally) by the receive path - either on a ps-poll or a data frame with the power management bit cleared.

Fast-frames staging queue

This happens via ff_transmit(), which can be called by ieee80211_ff_check() and ff_flush().

This gets called from a variety of contexts: the TX path, the TX completion path in ath(4) (which calls the FF flush routine), and the ath(4) RX path.

Serialisation issues

The big thing to realise here is that if_transmit() and if_start() do not have any kind of serialisation guarantees. Even if_start(), which uses the ifnet send queue, can and will have multiple threads calling the stack and driver start method at the same time. Overlapping _and_ preempting. Keep that in mind.

The first problem here is the lack of TX serialisation. Specifically, TX can occur from multiple thread contexts and can be preempted. This includes multiple TX sending threads (eg multiple iperf processes), the RX context (eg an RX of an ACK kick-starts the TX of the next frame), and various timers/callouts (eg BAR retransmission).

There's also the raw versus non-raw TX path. Some raw frames (eg EAPOL frames, as part of WPA/WPA2 handling) have sequence numbers assigned to them and are encrypted. These need to be pushed into the TX path in the "right" order; any out of order handling here will cause the receiver to drop frames.

The next problem is the disconnect between net80211 TX handling and encryption. The encryption is done in the driver, not the net80211 TX path. Thus the frames pushed into the driver have to be handled in order by the driver, otherwise there may be a slight disconnect between the 802.11 sequence numbers and the CCMP IV sequence number.

Solutions?

The Linux solution is to hold a lock in the TX code, so that frames get handled in order from the point where they're grabbed by mac80211 and processed (sequence number, 802.11 state) all the way through the driver, including encryption handling.
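As a rough illustration of the "one lock across the whole path" idea, here is a userland sketch using a pthread mutex. It is not mac80211 or net80211 code; struct txpath, txpath_init() and tx_locked() are hypothetical names. The point is only that the sequence number, the CCMP packet number and the hand-off to hardware all happen inside a single critical section, so no other sender can get in between them.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical frame: the two counters that must stay in step. */
struct txframe {
	uint16_t seqno;		/* 802.11 sequence number (12 bits) */
	uint64_t pn;		/* CCMP packet number */
};

/* Hypothetical TX path state, protected by one lock. */
struct txpath {
	pthread_mutex_t lock;
	uint16_t next_seqno;
	uint64_t next_pn;
	uint64_t hw_last_pn;	/* PN of the last frame handed to hardware */
};

static void
txpath_init(struct txpath *tp)
{
	pthread_mutex_init(&tp->lock, NULL);
	tp->next_seqno = 0;
	tp->next_pn = 0;
	tp->hw_last_pn = 0;
}

/*
 * Do the stack work, the encryption bookkeeping and the hardware
 * hand-off under one lock.  Frames therefore reach the hardware in
 * exactly the order their sequence numbers were assigned.
 */
static void
tx_locked(struct txpath *tp, struct txframe *f)
{
	pthread_mutex_lock(&tp->lock);

	/* "Stack" step: assign the sequence number. */
	f->seqno = tp->next_seqno;
	tp->next_seqno = (tp->next_seqno + 1) & 0x0fff;

	/* "Driver" step: assign the packet number. */
	f->pn = ++tp->next_pn;

	/* Hand-off: packet numbers must only ever increase here. */
	assert(f->pn > tp->hw_last_pn);
	tp->hw_last_pn = f->pn;

	pthread_mutex_unlock(&tp->lock);
}
```

The cost is that everything inside the lock is serialised, so a slow driver path stalls every other sender for as long as the lock is held.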

The "better" solution is to serialise both the net80211 TX path and the driver TX path so there's a TX queue that frames get pushed into, then either:

The side-issues:

There are significant issues that creep up when you do this:

The problem with taskqueues

I've tried the TX taskqueue thing with ath(4), out of curiosity. (The real solution involves doing it with net80211 too, and either having a separate taskqueue for ath(4), or having the net80211 TX taskqueue call the driver direct and have the driver TX _only_ ever called from that context.)

The initial problem is "dramatic TX throughput reduction."

After tracing things a bit, I discovered the cause: when TX isn't direct dispatch, there are significant delays between TXing a frame and scheduling/running the taskqueue. Yes, even if it's a separate taskqueue (rather than a task in the same taskqueue as the RX task.)

The normal case:

The shared RX/TX/TX completion taskqueue case:

Now, there's added latency between the TX occurring and the TX tasklet running because they run in the same shared taskqueue context. That extra latency causes a noticeable performance dip. Very noticeable.

The shared RX/TX taskqueue case:

This is a similar problem to the shared TX/RX taskqueue. The higher priority means that the RX tasklet (and TX completion tasklet) will get preempted.

However, it's not a magic bullet. Specifically, the TX taskqueue will get woken up on every packet, and if it can't do anything (eg the hardware TXQ is filled) it'll just software queue the frame and go back to sleep. This is a lot of scheduler thrashing.

Now, my ideal solution could be:

This is still suboptimal:

The last point could be addressed by tracking how deep the hardware queue was about to be, rather than how deep the hardware queue _is_. That's just playing with fire though (if you get all of that wrong, you may screw up the queue depth figure and never handle things.)

.. And this is all just for the ath(4) TX serialisation. This doesn't at all address the net80211 TX serialisation, which includes many of the same issues here.

WiFi/TxNotes (last edited 2018-04-05T23:31:41+0000 by MateuszPiotrowski)