IP, TCP/UDP/SCTP checksum and TCP segmentation offloading plan
This document describes a vision developed jointly by several developers (glebius@, tuexen@, Peter Lei, pouria@, Timo Völker ... feel free to add yourself) for the direction in which the network stack should move with regard to checksum offloading. You may treat this document as a request for comments. Feel free to participate in the discussion by email, or to join the bi-weekly transport group video calls.
Introduction
Goals
An implementation should achieve the following goals:
1. Hardware offload capabilities should be used whenever possible.
2. For packets not leaving the physical machine, no CPU resources should be spent for validation.
Historical review
FreeBSD's network stack is a direct descendant of 4.3BSD, developed in the 1980s, when not a single device with hardware checksum offloading existed. As a result, the stack was developed without this feature in mind, and when the very first NICs with checksum offloading appeared, they were treated as a new, rare kind of hardware. Support for offloading was therefore added on top of the existing stack. Here is the problem in a nutshell: the stack is designed with the assumption that, by default, NICs can't do offloading, and if a NIC can do it, we perform special maneuvers to get it working.
Current State (March 2026)
A network interface has some offload capabilities. By setting flags in ifp->if_capabilities or ifp->if_capenable, the driver indicates which capabilities are supported or enabled, respectively. The user can usually enable or disable capabilities with ifconfig.
Transmit
These are capabilities relevant for transmission checksum/segmentation offloading.
Flag | ifconfig arg | Semantic |
IFCAP_TXCSUM | txcsum | Perform transmit checksum offload for IPv4 packets as described in if_hwassist |
IFCAP_TXCSUM_IPV6 | txcsum6 | Perform transmit checksum offload for IPv6 packets as described in if_hwassist |
IFCAP_VLAN_HWCSUM | vlanhwcsum | Inherit IFCAP_TXCSUM on derived VLAN interfaces |
IFCAP_VXLAN_HWCSUM | vxlanhwcsum | |
IFCAP_TSO4 | tso4 | Perform TSO for TCP/IPv4 packets |
IFCAP_TSO6 | tso6 | Perform TSO for TCP/IPv6 packets |
IFCAP_VLAN_HWTSO | vlanhwtso | |
IFCAP_VXLAN_HWTSO | vxlanhwtso | |
IFCAP_TXTLS4 | txtls | |
IFCAP_TXTLS6 | txtls | |
If transmission checksum offloading (txcsum and/or txcsum6) is enabled, the driver sets flags in ifp->if_hwassist to indicate for which protocols or protocol combinations the interface can compute and insert the checksum. These are the relevant flags.
Flag | Semantic | Support |
CSUM_IP | IPv4 header checksum will be computed | |
CSUM_IP_UDP | UDP checksum will be computed for UDP/IPv4 | |
CSUM_IP_TCP | TCP checksum will be computed for TCP/IPv4 | |
CSUM_IP_SCTP | SCTP checksum will be computed for SCTP/IPv4 | |
CSUM_IP_TSO | TSO can be performed for TCP/IPv4 | |
CSUM_IP_ISCSI | | not in tree |
CSUM_INNER_IP6_UDP | | cxgbe, mlx5en, vxlan |
CSUM_INNER_IP6_TCP | | cxgbe, mlx5en, vxlan |
CSUM_INNER_IP6_TSO | | cxgbe, mlx5en, vxlan |
CSUM_IP6_UDP | UDP checksum will be computed for UDP/IPv6 | |
CSUM_IP6_TCP | TCP checksum will be computed for TCP/IPv6 | |
CSUM_IP6_SCTP | SCTP checksum will be computed for SCTP/IPv6 | |
CSUM_IP6_TSO | TSO can be performed for TCP/IPv6 | |
CSUM_IP6_ISCSI | | not in tree |
CSUM_INNER_IP | | cxgbe, mlx5en, vxlan |
CSUM_INNER_IP_UDP | | cxgbe, mlx5en, vxlan |
CSUM_INNER_IP_TCP | | cxgbe, mlx5en, vxlan |
CSUM_INNER_IP_TSO | | cxgbe, mlx5en, vxlan |
CSUM_ENCAP_VXLAN | | cxgbe, mlx5en, vxlan |
When the FreeBSD network stack sends a packet, it uses the same flags in mbuf->m_pkthdr.csum_flags to indicate which checksums still need to be computed by the interface. Note that TCP or UDP checksum offloading assumes that the pseudo header checksum has been computed in software and stored in the TCP or UDP checksum field. The following figure shows the relation between the fields.
                          IP packet csum_flags
                                  ^
                                   \
                                    \ defines
User ----+ enables/                  \ usable
          \ disables                  \ flags
           \                           \
            v                           \
         Interface caps -------------> hwassist
                        driver sets

The following examples aim to give a better understanding of how csum_flags and if_hwassist are used.
Example: Checksum offloading with TCP over IPv4 using a physical interface
tcp_output() computes the pseudo IP header checksum, inserts that value in the TCP checksum field, and sets CSUM_IP_TCP in mbuf->m_pkthdr.csum_flags.
ip_output() first sets CSUM_IP in mbuf->m_pkthdr.csum_flags and then checks whether the interface supports TCP checksum offloading and IPv4 header checksum offloading, that is, whether the same flags are set in ifp->if_hwassist. If not, it computes and inserts the IP header and/or TCP checksum in software. In case of IP fragmentation, it computes the TCP checksum in software independently of ifp->if_hwassist.
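The decision described above can be sketched as a pure function. This is a simplified userland model, not the in-tree ip_output() code: the SK_* names and their bit values are placeholders chosen for illustration and are not taken from sys/sys/mbuf.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration; not copied from sys/sys/mbuf.h. */
#define SK_CSUM_IP      0x01u
#define SK_CSUM_IP_UDP  0x02u
#define SK_CSUM_IP_TCP  0x04u

/*
 * Return the subset of csum_flags that has to be handled in software
 * before the packet is handed to the driver.
 */
static uint32_t
sk_sw_csum_needed(uint32_t csum_flags, uint64_t if_hwassist, bool fragmenting)
{
	/* Everything the stack requested but the interface does not offer. */
	uint32_t sw = csum_flags & ~(uint32_t)if_hwassist;

	/*
	 * With IP fragmentation the transport checksum cannot be left to
	 * the interface, whatever if_hwassist says.
	 */
	if (fragmenting)
		sw |= csum_flags & (SK_CSUM_IP_UDP | SK_CSUM_IP_TCP);

	return (sw);
}
```

With an interface that offers only SK_CSUM_IP, a TCP packet carrying SK_CSUM_IP | SK_CSUM_IP_TCP yields SK_CSUM_IP_TCP, so only the TCP checksum is computed in software.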
Example: Checksum offloading using VLAN
The network stack sets mbuf->m_pkthdr.csum_flags in the same way as without a VLAN. However, here the VLAN interface is the outgoing interface, and thus ifp->if_hwassist of the VLAN interface is checked. The VLAN interface sets the same if_hwassist IPv4 or IPv6 flags as its parent interface if the parent interface has capability IFCAP_VLAN_HWCSUM enabled and the VLAN interface has capability IFCAP_TXCSUM or IFCAP_TXCSUM_IPV6 enabled, respectively.
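A minimal sketch of this inheritance rule, assuming placeholder bit values and a reduced flag set (the SK_* names are hypothetical and do not match the FreeBSD headers):

```c
#include <stdint.h>

/* Placeholder capability bits; not those of the FreeBSD headers. */
#define SK_IFCAP_TXCSUM       0x01u
#define SK_IFCAP_TXCSUM_IPV6  0x02u
#define SK_IFCAP_VLAN_HWCSUM  0x04u

/* Placeholder checksum flags, reduced to a few representatives. */
#define SK_CSUM_IP       0x01u
#define SK_CSUM_IP_TCP   0x02u
#define SK_CSUM_IP6_TCP  0x04u

#define SK_CSUM_V4  (SK_CSUM_IP | SK_CSUM_IP_TCP)
#define SK_CSUM_V6  (SK_CSUM_IP6_TCP)

/* Derive the VLAN interface's if_hwassist from its parent. */
static uint64_t
sk_vlan_hwassist(uint32_t parent_capenable, uint64_t parent_hwassist,
    uint32_t vlan_capenable)
{
	uint64_t hwa = 0;

	/* The parent must allow checksum offloading for VLAN traffic. */
	if ((parent_capenable & SK_IFCAP_VLAN_HWCSUM) == 0)
		return (0);

	/* Inherit per IP version, depending on the VLAN's own settings. */
	if (vlan_capenable & SK_IFCAP_TXCSUM)
		hwa |= parent_hwassist & SK_CSUM_V4;
	if (vlan_capenable & SK_IFCAP_TXCSUM_IPV6)
		hwa |= parent_hwassist & SK_CSUM_V6;

	return (hwa);
}
```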
Example: Checksum offloading using VxLAN
The VxLAN interface sets the if_hwassist IPv4 or IPv6 CSUM_* flags if the parent interface has the corresponding CSUM_INNER_* flags set and has capability IFCAP_VXLAN_HWCSUM enabled, and the VxLAN interface has capability IFCAP_TXCSUM or IFCAP_TXCSUM_IPV6 enabled, respectively.
The following example shows how csum_flags are set for a TCP packet sent over a VxLAN interface whose parent interface's if_hwassist includes the flags CSUM_IP, CSUM_ENCAP_VXLAN, CSUM_INNER_IP, and CSUM_INNER_IP_TCP.
tcp_output()
  |    | TCP | Data ... |
  |    csum_flags = CSUM_IP_TCP
  v
ip_output()
  |    | IPv4 | TCP | Data ... |
  |    csum_flags = CSUM_IP | CSUM_IP_TCP
  v
if_output()
== ether_output()
  |    | Eth | IPv4 | TCP | Data ... |
  |    csum_flags = CSUM_IP | CSUM_IP_TCP
  v
if_transmit()
== vxlan_transmit()
  |    | UDP | VxLAN | Eth | IPv4 | TCP | Data ... |
  |    csum_flags = CSUM_ENCAP_VXLAN | CSUM_INNER_IP | CSUM_INNER_IP_TCP
  v
ip_output()
  |    | IPv4 | UDP | VxLAN | Eth | IPv4 | TCP | Data ... |
  |    csum_flags = CSUM_IP | CSUM_ENCAP_VXLAN | CSUM_INNER_IP | CSUM_INNER_IP_TCP
  v
if_output()
== ether_output()
  |    | Eth | IPv4 | UDP | VxLAN | Eth | IPv4 | TCP | Data ... |
  |    csum_flags = CSUM_IP | CSUM_ENCAP_VXLAN | CSUM_INNER_IP | CSUM_INNER_IP_TCP
  v
if_transmit()
  by the driver
  Interface inserts outer and inner IPv4 header checksum and inner TCP checksum
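The flag translation at the vxlan_transmit() step of the diagram can be sketched as follows. This is an illustrative model only: the SK_* names and bit values are placeholders, and the rule that the encapsulation flag is set only when some inner checksum is pending is an assumption of this sketch.

```c
#include <stdint.h>

/* Placeholder bit values; not those of sys/sys/mbuf.h. */
#define SK_CSUM_IP            0x001u
#define SK_CSUM_IP_UDP        0x002u
#define SK_CSUM_IP_TCP        0x004u
#define SK_CSUM_INNER_IP      0x010u
#define SK_CSUM_INNER_IP_UDP  0x020u
#define SK_CSUM_INNER_IP_TCP  0x040u
#define SK_CSUM_ENCAP_VXLAN   0x100u

/*
 * Translate the flags of the packet being encapsulated into flags that
 * describe the same work for the inner headers of the VxLAN packet.
 */
static uint32_t
sk_vxlan_encap_flags(uint32_t csum_flags)
{
	uint32_t out = 0;

	if (csum_flags & SK_CSUM_IP)
		out |= SK_CSUM_INNER_IP;
	if (csum_flags & SK_CSUM_IP_UDP)
		out |= SK_CSUM_INNER_IP_UDP;
	if (csum_flags & SK_CSUM_IP_TCP)
		out |= SK_CSUM_INNER_IP_TCP;

	/* Assumption: mark the encapsulation only if inner work is pending. */
	if (out != 0)
		out |= SK_CSUM_ENCAP_VXLAN;

	return (out);
}
```

For the TCP packet of the diagram, SK_CSUM_IP | SK_CSUM_IP_TCP becomes SK_CSUM_ENCAP_VXLAN | SK_CSUM_INNER_IP | SK_CSUM_INNER_IP_TCP; the second ip_output() pass then adds the outer CSUM_IP.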
Example: TSO with IPv4
On an incoming segment with the SYN bit set, tcp_input() checks whether the interface supports TSO. If so, it sets the TF_TSO flag and the TSO parameters in the TCP control block tp:
    tp->t_flags |= TF_TSO;
    tp->t_tsomax = cap.tsomax = ifp->if_hw_tsomax;
    tp->t_tsomaxsegcount = cap.tsomaxsegcount = ifp->if_hw_tsomaxsegcount;
    tp->t_tsomaxsegsize = cap.tsomaxsegsize = ifp->if_hw_tsomaxsegsize;
tcp_output() uses TSO if TF_TSO is set and TSO is appropriate. In that case, tcp_output() sets CSUM_IP_TSO and CSUM_IP6_TSO (always both) in mbuf->m_pkthdr.csum_flags and the MSS in mbuf->m_pkthdr.tso_segsz.
ip_output() refrains from doing IP fragmentation if CSUM_IP_TSO or CSUM_IP6_TSO is set in mbuf->m_pkthdr.csum_flags.
Note that setting CSUM_IP6_TSO here is incorrect, or at least unnecessary. Originally, there was only one CSUM_TSO flag for both IP versions, which was later split into CSUM_IP_TSO and CSUM_IP6_TSO. However, tcp_output() and ip_output() still use CSUM_TSO, which translates to (CSUM_IP_TSO | CSUM_IP6_TSO).
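The interaction between the TSO flags and IP fragmentation can be sketched as below. The SK_* names and bit values are placeholders, and the function is a simplified model of the check described above, not the in-tree code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values; not those of sys/sys/mbuf.h. */
#define SK_CSUM_IP_TSO   0x1u
#define SK_CSUM_IP6_TSO  0x2u
/* The combined flag, as described in the text. */
#define SK_CSUM_TSO      (SK_CSUM_IP_TSO | SK_CSUM_IP6_TSO)

/*
 * Decide whether the IP layer has to fragment a packet.  A packet
 * marked for TSO is larger than the MTU on purpose and is segmented
 * later, so it must not be fragmented here.
 */
static bool
sk_must_fragment(uint32_t pkt_len, uint32_t mtu, uint32_t csum_flags)
{
	if (csum_flags & SK_CSUM_TSO)
		return (false);
	return (pkt_len > mtu);
}
```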
Receive
These are capabilities relevant for receive checksum or large receive offloading.
Flags | ifconfig arg | Semantic |
IFCAP_RXCSUM | rxcsum | Perform receive checksum offload for IPv4 packets and report result in m_pkthdr.csum_flags |
IFCAP_RXCSUM_IPV6 | rxcsum6 | Perform receive checksum offload for IPv6 packets and report result in m_pkthdr.csum_flags |
IFCAP_LRO | lro | Perform LRO (Hardware LRO?) |
If receive checksum offloading (rxcsum and/or rxcsum6) is enabled, the interface may validate the checksum, and the driver sets flags in mbuf->m_pkthdr.csum_flags to indicate whether validation was performed and whether the checksum is valid. These are the relevant flags.
Flag | Semantic |
CSUM_L3_CALC | IPv4 checksum validation was performed |
CSUM_L3_VALID | IPv4 checksum is valid |
CSUM_L4_CALC | SCTP/TCP/UDP checksum validation was performed |
CSUM_L4_VALID | SCTP/TCP/UDP checksum is valid |
CSUM_L5_CALC | |
CSUM_L5_VALID | |
CSUM_INNER_L3_CALC | |
CSUM_INNER_L3_VALID | |
CSUM_INNER_L4_CALC | |
CSUM_INNER_L4_VALID | |
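Reading the table literally, each CALC/VALID pair encodes three outcomes. The sketch below shows this for the layer 4 pair. The SK_* names and bit values are placeholders, and the sketch deliberately ignores how the in-tree input paths interpret these bits in detail.

```c
#include <stdint.h>

/* Placeholder bit values; not those of sys/sys/mbuf.h. */
#define SK_CSUM_L4_CALC   0x1u
#define SK_CSUM_L4_VALID  0x2u

enum sk_rx_csum {
	SK_RX_UNCHECKED,	/* hardware did not validate the checksum */
	SK_RX_GOOD,		/* validated and found correct */
	SK_RX_BAD		/* validated and found incorrect */
};

/* Classify the transport checksum state reported by the driver. */
static enum sk_rx_csum
sk_rx_l4_state(uint32_t csum_flags)
{
	if ((csum_flags & SK_CSUM_L4_CALC) == 0)
		return (SK_RX_UNCHECKED);
	return ((csum_flags & SK_CSUM_L4_VALID) ? SK_RX_GOOD : SK_RX_BAD);
}
```

Only SK_RX_UNCHECKED would require the stack to validate the checksum in software.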
Problems with Current State
As mentioned in the Introduction, the FreeBSD network stack was developed without the offloading features in mind that are common today. Besides the conceptual change to a network stack that expects a specific set of offloading features, changes are necessary to address the following problems.
Checksum Offloading with Virtual Interfaces
Occasionally, bugs were reported where packets were sent without a valid checksum. In other cases, checksums were computed in software although the outgoing interface supports checksum offloading, which is a common problem with virtual interfaces like epair.
epair is often used to connect a Jail to the FreeBSD host.
      Host
+---------------+                          Jail
|               |                       +---------+
|    bridge0    |                       |         |
|    |     |    |                       |         |
|    |     +----*-----------------------*         |
|    |          | epair0a       epair0b |         |
+----*----------+                       +---------+
    igb0

If the network stack in the Jail sends a packet, it checks ifp->if_hwassist of epair0b, which is probably different from igb0's hwassist.
Originally, no flags were set in epair's hwassist. Thus, the network stack in the Jail always computed the checksums in software, which was most often unnecessary. Now, epair's hwassist has the common checksum offloading flags set (IPv4 header, TCP, UDP). However, cases exist where the outgoing physical interface does not support all of these common checksum offloading features, and a packet from the Jail is then sent out with an incorrect checksum.
TSO
The same problem with virtual interfaces exists for TSO. However, using TSO independently of the outgoing interface's hwassist has the potential to improve efficiency in general. Segmentation could be done at a lower layer (driver or right before the driver) in case the interface hardware does not support it.
csum_flags bits exhausted
In total, 32 CSUM_* flags are defined in sys/sys/mbuf.h, which are set as bits in ifp->if_hwassist and mbuf->m_pkthdr.csum_flags.
While ifp->if_hwassist is 64 bits long, mbuf->m_pkthdr.csum_flags is only 32 bits long and, thus, has no bit available for a new CSUM_* flag (e.g., to indicate TCP/UDP pseudo header offloading).
Unlike in ifp->if_hwassist (which describes one interface), a number of CSUM_* flag combinations make no sense in mbuf->m_pkthdr.csum_flags (which describes one packet), e.g., a combination of CSUM_IP_TCP and CSUM_IP_UDP. This fact allows the same information to be stored in mbuf->m_pkthdr.csum_flags with fewer than 32 bits.
A perfect implementation
If we designed a network stack from scratch today, we would assume by default that all NICs support offloading. Historical drivers for hardware that can't would need to call into software checksum routines to meet this expectation. We would develop a stack with modern high-performance NICs in mind, leaving crutches and hacks to the slow legacy code.
Vision
In a perfect implementation:
- In struct ifnet there will be no IFF_*CSUM or IFCAP_*CSUM flags, and no if_hwassist field.
- In struct mbuf we would have a single bit that describes the packet checksum state:
  - If set, the packet checksum is known to be correct, e.g. a packet was received by a driver and forwarded to the stack. It is the driver's business whether the checksum was checked in hardware or in software.
  - If not set, the packet checksum is not yet calculated, but the packet is known to be good. This applies to packets that were generated locally. If such a packet is to be transmitted on a wire, the driver shall take care of calculating the checksum.
- The main network stack, ip_output()/ip_input()/ip_forward() and the respective IPv6 siblings, does not do any checksum checks or calculations, nor does it check any flags on an outgoing interface.
- The so-called "network stacklets", i.e. alternative packet forwarding engines such as bridge(4), netgraph(4), etc., shall follow the same logic as the main network stack.
- The software checksum routines in_cksum() and others remain as a kernel library for the sake of legacy drivers.
Issues
During the planning phase for implementation, some problems with the initial vision became apparent.
- Without IFCAP_*CSUM the user cannot control what the interface shall do in hardware. In the past, this control was sometimes used, for example, when a bug in the hardware was suspected.
- One bit seems to be too little. If an interface receives a packet with an incorrect TCP checksum, it would set this one bit to 0, and tcp_input() would drop that packet. However, if this packet is not for the local host and is forwarded instead (by a bridge or IP routing), the outgoing interface sees that this bit is 0 and inserts a correct checksum, which hides the corruption from the final receiver. Two bits per checksum (IPv4 header, SCTP/TCP/UDP, and the inner siblings) might be enough. For example, for TCP the values could have the following semantics.
00 Checksum still needs to be computed (rx (on a virtual interface) and tx).
01 Checksum still needs to be computed, but pseudo header checksum is present (rx (on a virtual interface) and tx).
10 Checksum present but incorrect (rx only).
11 Checksum present and correct (rx or tx).
- In case of IP fragmentation, the main network stack still needs to compute the transport protocol checksum because the interface would have to insert the checksum in the first fragment while it might not have the remaining fragments.
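The two-bit idea from the list above can be sketched as an enumeration plus the decisions derived from it. All names are hypothetical, and the chosen actions (in particular leaving a known-bad checksum untouched when forwarding) are assumptions of this sketch, not an agreed design.

```c
/* Hypothetical two-bit checksum state, following the value list above. */
enum sk_csum_state {
	SK_CS_TODO        = 0x0,	/* 00: still needs to be computed */
	SK_CS_TODO_PSEUDO = 0x1,	/* 01: as 00, pseudo header sum present */
	SK_CS_BAD         = 0x2,	/* 10: present but incorrect (rx only) */
	SK_CS_GOOD        = 0x3		/* 11: present and correct */
};

enum sk_tx_action {
	SK_TX_COMPUTE,		/* compute pseudo header and data checksum */
	SK_TX_COMPLETE,		/* add the data to the stored pseudo header sum */
	SK_TX_LEAVE		/* do not touch the checksum field */
};

/* What the transmit side has to do for a packet in the given state. */
static enum sk_tx_action
sk_tx_action_for(enum sk_csum_state st)
{
	switch (st) {
	case SK_CS_TODO:
		return (SK_TX_COMPUTE);
	case SK_CS_TODO_PSEUDO:
		return (SK_TX_COMPLETE);
	case SK_CS_BAD:		/* never "repair" a corrupted packet */
	case SK_CS_GOOD:
	default:
		return (SK_TX_LEAVE);
	}
}

/* Local delivery: only a checksum known to be wrong causes a drop. */
static int
sk_rx_deliverable(enum sk_csum_state st)
{
	return (st != SK_CS_BAD);
}
```

The point of the second bit shows in the SK_CS_BAD case: a forwarded packet with a wrong checksum keeps it, so the final receiver can still detect the corruption.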
New Concept
The idea for the new concept can be summarized as follows.
Define a set of offloading features (i.e., a set of CSUM_* flags) every interface (driver) must provide. Thus, the main network stack can expect their support and does not have to check it.
The interface driver still sets the enabled offloading features in ifp->if_hwassist.
When a network interface goes up, sys/net/if.c checks whether the driver has all the expected offloading features enabled. If not, it stores the current ifp->if_transmit function pointer in a new ifp->if_transmit_drv field and sets ifp->if_transmit to a new function, if_offload_transmit().
if_offload_transmit() performs all the expected offloading features the interface (driver) won't do. This means it might insert checksums or do TSO in software. After that, it calls the if_transmit function the driver originally set (ifp->if_transmit_drv).
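The wrapping mechanism can be modeled in userland as follows. The struct layouts, the SK_* flag values, the expected feature set, and the stand-in driver function are all hypothetical; the software checksum work itself is only indicated by clearing the handled bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values and expected feature set, for illustration. */
#define SK_CSUM_IP      0x1u
#define SK_CSUM_IP_TCP  0x2u
#define SK_CSUM_IP_UDP  0x4u
#define SK_EXPECTED     (SK_CSUM_IP | SK_CSUM_IP_TCP | SK_CSUM_IP_UDP)

struct sk_pkt { uint32_t csum_flags; };
struct sk_ifnet;
typedef int (*sk_transmit_t)(struct sk_ifnet *, struct sk_pkt *);

struct sk_ifnet {
	uint64_t	if_hwassist;
	sk_transmit_t	if_transmit;
	sk_transmit_t	if_transmit_drv;	/* the proposed new field */
};

/* Stand-in for a driver: records the flags it is asked to handle. */
static uint32_t sk_last_flags;

static int
sk_drv_transmit(struct sk_ifnet *ifp, struct sk_pkt *p)
{
	(void)ifp;
	sk_last_flags = p->csum_flags;
	return (0);
}

/* Do in software what the driver will not do, then call the driver. */
static int
sk_if_offload_transmit(struct sk_ifnet *ifp, struct sk_pkt *p)
{
	uint32_t todo = p->csum_flags & SK_EXPECTED &
	    ~(uint32_t)ifp->if_hwassist;

	assert(ifp->if_transmit_drv != NULL);
	/* Software checksum routines would run here for each bit in todo. */
	p->csum_flags &= ~todo;
	return (ifp->if_transmit_drv(ifp, p));
}

/* Called when the interface goes up. */
static void
sk_if_up(struct sk_ifnet *ifp)
{
	if ((ifp->if_hwassist & SK_EXPECTED) != SK_EXPECTED &&
	    ifp->if_transmit != sk_if_offload_transmit) {
		ifp->if_transmit_drv = ifp->if_transmit;
		ifp->if_transmit = sk_if_offload_transmit;
	}
}
```

A driver that enables the whole expected set keeps its own if_transmit and pays no extra cost; only drivers with missing features go through the wrapper.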
Packet Information
How efficiently if_offload_transmit() can be implemented depends on the provided information about the packet. The main information is what needs to be done. Depending on what needs to be done, there is additional information that would be helpful.
Information | Why | How to determine |
IPv4 Header Checksum | | |
Start of IP header | To find the header. | |
Length of IP header | To know how many bytes to sum up. | IPv4: Header length field; IPv6: Fixed header length + length of extension headers. |
TCP/UDP Pseudo Header Checksum | | |
Start of IP header | To find the header. | |
IP version | To correctly read the parts of the IP header required for the checksum. | IP version field |
Protocol (TCP or UDP) | Checksum includes protocol. | Without UDP encap.: protocol/next header in the IP header |
Length of the TCP/UDP packet | Checksum includes the length. | Without UDP encap.: m->m_pkthdr.len - Start of IP header - IP header length |
Start of TCP/UDP checksum field | To know where to insert the checksum. | Without UDP encap.: Start of IP header + IP header length + TCP/UDP's checksum field offset |
SCTP/TCP/UDP Checksum | | |
Start of SCTP/TCP/UDP header | To find the header. | |
Protocol (SCTP, TCP, or UDP) | To know how to compute the checksum. | |
Start of SCTP/TCP/UDP checksum field | To know where to insert the checksum. | Check the protocol and determine the offset to the checksum field. |
TSO | | |
Start of IP header | To update in each segment ID (IPv4), length, header checksum (IPv4, if present), and pseudo header checksum (if present). | |
IP version | To correctly update the values. | IP version field |
IPv4 header checksum present | To know if an update of the header checksum is required. | |
TCP Pseudo header checksum is present | To know if an update of the pseudo header checksum is required. | |
Start of TCP header | To update in each segment: flags and sequence number, and insert the TCP checksum if required. | Without UDP encap.: Start of IP header + IP header length |
Start of TCP checksum field | To insert the updated pseudo header checksum. | Start of TCP header + offset to checksum field |
Start of TCP payload | To know where the data starts (everything before will be copied in each segment). | Start of TCP header + TCP data offset field value |
MSS | To know the max. bytes of data to insert in each segment. | mbuf->m_pkthdr.tso_segsz |
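Several entries of the "How to determine" column can be derived from the start of the IP header alone. The sketch below does this for IPv4 without UDP encapsulation, working on a flat byte buffer instead of an mbuf chain. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_IPPROTO_TCP  6
#define SK_IPPROTO_UDP  17

struct sk_l4_info {
	size_t	l4_off;		/* start of the TCP/UDP header */
	size_t	l4_len;		/* length of the TCP/UDP packet */
	size_t	csum_off;	/* start of the TCP/UDP checksum field */
	uint8_t	proto;		/* protocol from the IP header */
};

/*
 * Locate the transport header and its checksum field, given the offset
 * of the IPv4 header.  Returns 0 on success and -1 if the packet is
 * malformed or does not carry TCP or UDP.
 */
static int
sk_locate_l4_ipv4(const uint8_t *pkt, size_t pkt_len, size_t ip_off,
    struct sk_l4_info *out)
{
	assert(pkt != NULL && out != NULL);
	if (ip_off > pkt_len || pkt_len - ip_off < 20)
		return (-1);

	const uint8_t *ip = pkt + ip_off;
	if ((ip[0] >> 4) != 4)			/* IP version field */
		return (-1);
	size_t hlen = (size_t)(ip[0] & 0x0f) * 4;	/* header length field */
	if (hlen < 20 || pkt_len - ip_off < hlen)
		return (-1);

	size_t field;				/* checksum field offset */
	if (ip[9] == SK_IPPROTO_TCP)
		field = 16;
	else if (ip[9] == SK_IPPROTO_UDP)
		field = 6;
	else
		return (-1);

	out->proto = ip[9];
	out->l4_off = ip_off + hlen;
	out->l4_len = pkt_len - out->l4_off;
	out->csum_off = out->l4_off + field;
	if (out->csum_off + 2 > pkt_len)
		return (-1);
	return (0);
}
```

For an Ethernet frame with a 20-byte IPv4 header and a 20-byte TCP header, ip_off is 14, the TCP header starts at 34, and the checksum field at 50.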
Provided Packet Information and Encoding
To store all the information in the mbuf, a list with elements for each used offload feature would be necessary (e.g., one element for IPv4 header checksum with start of IP header and header length, and one element for SCTP/TCP/UDP checksum with start of the header, protocol, and start of checksum field). However, instead of providing all information, it seems a practical solution to provide only information that is easy to provide and store in the mbuf, and let if_offload_transmit() find out the rest.
The current format of mbuf->m_pkthdr.csum_flags already contains this information:
- What to do
- IP version
- Transport protocol
- Encapsulation protocol (currently only VxLAN)
- Inner IP version
- Inner transport protocol (without SCTP)
Another csum field is mbuf->m_pkthdr.csum_data, which contains the offset from the start of the transport protocol header to its checksum field.
Proposal
To start with, leave the format of mbuf->m_pkthdr.csum_flags as is. The essential information missing is where the header starts.
The field mbuf->m_pkthdr.csum_data is 32 bits long. Since 16 bits are enough to store the offset to the checksum field, the other 16 bits could be used to store the offset from the start of the mbuf to the IP or transport header. Which header it points to could depend on what needs to be done.
- Only transport header checksum: offset to the transport header.
- Otherwise: offset to the IP header.
For example, if the IPv4 header and TCP checksum need to be inserted, it would be the offset to the IP header. This means if_offload_transmit() needs to parse the IP header to find the start of the transport header (by skipping the UDP header in case of UDP encapsulation).
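A minimal sketch of the proposed split of csum_data into two 16-bit halves. Which half holds which offset is an assumption of this sketch, and the helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed layout: the upper 16 bits hold the offset from the start of
 * the mbuf to the IP or transport header, the lower 16 bits hold the
 * offset from the transport header to its checksum field.
 */
static uint32_t
sk_csum_data_pack(uint16_t hdr_off, uint16_t csum_off)
{
	return (((uint32_t)hdr_off << 16) | csum_off);
}

static uint16_t
sk_csum_data_hdr_off(uint32_t csum_data)
{
	return ((uint16_t)(csum_data >> 16));
}

static uint16_t
sk_csum_data_csum_off(uint32_t csum_data)
{
	return ((uint16_t)(csum_data & 0xffff));
}
```

Keeping the checksum field offset in the lower half would leave existing readers of that value unaffected as long as they mask it to 16 bits.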
Implementation Plan
The plan is to iteratively implement the new concept in the following steps.
Determine how to provide and encode the required packet information. For each checksum that needs to be computed and for TSO, the information described in the section Packet Information should be made available in some form. This could be a minimal form as described in the section Provided Packet Information and Encoding.
Create a new sys/net/if_offload.c and .h with the if_offload_transmit() function and subroutines that compute and insert checksums if the interface (driver) won't do it. Start with a moderate set of expected offloading features (CSUM_IP, CSUM_IP(6)_SCTP, CSUM_IP(6)_TCP, CSUM_IP(6)_UDP). If an interface goes up, set ifp->if_transmit in sys/net/if.c to if_offload_transmit() if the driver has not enabled all expected offload features.
Extend sys/net/if_offload.c by a function that computes and inserts the TCP/UDP pseudo header checksum. Use (and rename) the CSUM_IP(6)_ISCSI flags and add them to the set of expected offload features.
Extend sys/net/if_offload.c by a function that does TSO in software and add CSUM_IP(6)_TSO to the set of expected offload features.
Change the format of mbuf->m_pkthdr.csum_flags to save space(?)
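The third step above calls for a function that computes the TCP/UDP pseudo header checksum. A minimal sketch for IPv4 is shown below. It takes the addresses in host byte order, returns the folded but not inverted sum in host byte order, and uses hypothetical function names; it is not the kernel's own routine.

```c
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t
sk_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/*
 * Sum the IPv4 pseudo header: source address, destination address,
 * a zero byte followed by the protocol, and the TCP/UDP length.
 */
static uint16_t
sk_pseudo_ipv4(uint32_t src, uint32_t dst, uint8_t proto, uint16_t l4_len)
{
	uint32_t sum = 0;

	sum += src >> 16;
	sum += src & 0xffff;
	sum += dst >> 16;
	sum += dst & 0xffff;
	sum += proto;		/* the zero byte contributes nothing */
	sum += l4_len;
	return (sk_fold(sum));
}
```

For a 20-byte TCP segment from 192.168.0.1 to 192.168.0.2, the result is 0x816e, which would be stored in network byte order in the TCP checksum field before the packet is handed down.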