High-performance TCP/IP networking for bhyve VMs using netmap passthrough

Original project proposal

Netmap passthrough (ptnetmap) has recently been introduced on the Linux and FreeBSD platforms, where the QEMU-KVM and bhyve hypervisors allow VMs to exchange over 20 Mpps through VALE switches. Unfortunately, the original ptnetmap implementation was not able to exchange packets with the guest TCP/IP stack; it only supported guest applications running directly over netmap. Moreover, ptnetmap did not support multi-ring netmap ports.

I have recently developed a prototype of ptnet, a new multi-ring paravirtualized device for Linux and QEMU/KVM that builds on ptnetmap to allow VMs to exchange TCP traffic at 20 Gbps, while still offering the same ptnetmap performance to native netmap applications.

In this project I would like to implement ptnet for FreeBSD and bhyve, which currently cannot carry TCP/IP traffic at such high rates. Taking the above prototype as a reference, the required work is outlined in the deliverables below.

An overview of netmap and ptnetmap

Netmap is a framework for high-performance network I/O. It exposes a hardware-independent API that allows userspace applications to interact directly with NIC hardware rings, in order to receive and transmit Ethernet frames. Rings are always accessed in the context of system calls, and NIC interrupts are used to notify applications about NIC processing completion. The performance boost of netmap with respect to the traditional socket API primarily comes from: (i) batching, since it is possible to send/receive hundreds of packets with a single system call, and (ii) preallocation of packet buffers, which are memory-mapped into the application address space.
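
To give a feel for the programming model, the following minimal transmitter uses the nm_open()/NETMAP_TXRING helpers from net/netmap_user.h; the port name netmap:em0 and the zero-filled 60-byte frames are placeholders, not part of the original project:

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <string.h>

    int
    main(void)
    {
        /* Open a netmap port (a NIC or a VALE port) and map its rings. */
        struct nm_desc *d = nm_open("netmap:em0", NULL, 0, NULL);
        struct pollfd pfd;

        if (d == NULL)
            return 1;
        pfd.fd = NETMAP_FD(d);
        pfd.events = POLLOUT;

        for (;;) {
            /* Block until TX ring 0 has free slots; the kernel also
             * pushes out any frames queued by the previous iteration. */
            poll(&pfd, 1, -1);
            struct netmap_ring *ring = NETMAP_TXRING(d->nifp, 0);

            /* Batching: fill every available slot, so that the next
             * system call flushes the whole batch to the port. */
            while (!nm_ring_empty(ring)) {
                struct netmap_slot *slot = &ring->slot[ring->cur];
                char *buf = NETMAP_BUF(ring, slot->buf_idx);

                memset(buf, 0, 60);             /* placeholder frame */
                slot->len = 60;                 /* minimum frame size */
                ring->head = ring->cur = nm_ring_next(ring, ring->cur);
            }
        }
        /* not reached */
        nm_close(d);
        return 0;
    }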

Several netmap extensions have been developed to support virtualization. Netmap support for various paravirtualized drivers - e.g. virtio-net, Xen netfront/netback - allows netmap applications to run in the guest on top of fast paravirtualized I/O devices.

The Virtual Ethernet (VALE) software switch, which supports scalable high-performance local communication (over 20 Mpps between two switch ports), can then be used to connect multiple VMs together.

However, in a typical scenario with two communicating netmap applications running in different VMs (on the same host) connected through a VALE switch, the journey of a packet is still quite convoluted. As a matter of fact, while netmap is fast on both the host (the VALE switch) and the guest (interaction between the application and the emulated device), each packet still needs to be processed by the hypervisor, which has to emulate the device model used in the guest (e.g. e1000, virtio-net). The emulation involves device-specific overhead: queue processing, format conversions, packet copies, address translations, etc. As a consequence, the maximum packet rate between the two VMs is often limited to 2-5 Mpps.

To overcome these limitations, ptnetmap has been introduced as a passthrough technique that completely avoids hypervisor processing in the packet datapath, unlocking the full potential of netmap also in virtual machine environments. With ptnetmap, a netmap port on the host can be exposed to the guest in a protected way, so that netmap applications in the guest can directly access the rings and packet buffers of the host port, avoiding all the extra overhead involved in the emulation of network devices. System calls issued by guest applications on ptnetmap ports are served by kernel threads (one per ring) running in the host netmap module.

Similarly to VirtIO paravirtualization, synchronization between guest netmap (the driver) and host netmap (the kernel threads) happens through a shared memory area called the Communication Status Block (CSB), which is used to store producer-consumer state and notification suppression flags.
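
The actual CSB layout is defined by the netmap headers; purely as an illustration of what it contains, a per-ring entry can be pictured along these lines (the structure and field names here are hypothetical):

    #include <stdint.h>

    /* Illustrative sketch only: field names are hypothetical and do not
     * reproduce the actual ptnetmap CSB layout. */
    struct csb_ring_sketch {
        /* Written by the guest driver, read by the host kernel thread. */
        uint32_t guest_head;        /* first slot made available by the guest */
        uint32_t guest_cur;         /* guest wakeup point */
        uint32_t guest_need_kick;   /* host must send an interrupt if set */

        /* Written by the host kernel thread, read by the guest driver. */
        uint32_t host_hwcur;        /* next slot the host will process */
        uint32_t host_hwtail;       /* first slot not yet released by the host */
        uint32_t host_need_kick;    /* guest must write the kick register if set */
    };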

Two notification mechanisms need to be supported by the hypervisor to allow guest and host netmap to wake each other up. On QEMU and bhyve, notifications from guest to host are implemented with accesses to I/O registers, which cause a trap into the hypervisor. Notifications in the other direction are implemented using the KVM/bhyve interrupt injection mechanisms. MSI-X interrupts are used since they have less overhead than traditional PCI interrupts.

Since I/O register accesses and interrupts are very expensive in the common case of hardware-assisted virtualization, they are suppressed when not needed, i.e. whenever the host (or the guest) is actively polling the CSB to check for more work. From a high-level perspective, the system dynamically switches between polling operation under high load and interrupt-based operation under lower load.
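
Reusing the hypothetical per-ring CSB sketch above, the guest transmit path can elide notifications roughly as follows (again, names are illustrative and this is not the actual driver code):

    /* Sketch of notification suppression on the guest TX path; it reuses
     * the hypothetical csb_ring_sketch structure shown earlier. */
    static void
    guest_tx_notify(struct csb_ring_sketch *csb, void (*kick_device)(void))
    {
        /* The caller has just advanced csb->guest_head/guest_cur after
         * publishing new packets in the shared netmap ring. */

        /* Make the CSB update visible before reading the host flag. */
        __sync_synchronize();

        /* Trap into the hypervisor only if the host kernel thread is
         * sleeping and has asked to be woken up; while the host is
         * busy polling the CSB, host_need_kick stays clear and the
         * expensive I/O register access is skipped. */
        if (csb->host_need_kick)
            kick_device();  /* e.g. write the per-ring kick register */
    }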

The ptnet paravirtualized device

The original ptnetmap implementation required ptnetmap-enabled virtio-net/e1000 drivers. Only the notification functionalities of those devices were reused, while the datapath (e.g. e1000 rings or virtio-net Virtual Queues) was completely bypassed.

The ptnet device has been introduced as a cleaner approach to ptnetmap that also adds the ability to interact with the standard TCP/IP network stack and supports multi-ring netmap ports. The introduction of a new device model does not limit the adoption of this solution, since ptnet drivers are distributed together with netmap, and hypervisor modifications are needed in any case.

The ptnet device belongs to the class of paravirtualized devices, like virtio-net. Unlike virtio-net, however, ptnet does not define its own interface to exchange packets (datapath); the existing netmap API is used instead. A CSB - cleaned up and extended to support an arbitrary number of rings - is still used for producer-consumer synchronization and notification suppression.

A number of device registers are used for configuration (number of rings and slots, device MAC address, supported features, ...), while "kick" registers are used for guest-to-host notifications. Moreover, the ptnetmap kthread infrastructure has already been extended to support an arbitrary number of rings, where currently each ring is served by a different kernel thread.
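
Purely as an illustration (the names and offsets below are hypothetical, not the real ptnet register map), the device I/O space can be pictured as:

    /* Hypothetical sketch of a ptnet-like register layout; the real
     * offsets and names are defined by the netmap/ptnet headers. */
    enum {
        SK_REG_FEATURES  = 0x00,  /* supported/acknowledged features */
        SK_REG_MAC_HI    = 0x04,  /* device MAC address (high part) */
        SK_REG_MAC_LO    = 0x08,  /* device MAC address (low part) */
        SK_REG_CSB_BAH   = 0x0c,  /* CSB physical address, high 32 bits */
        SK_REG_CSB_BAL   = 0x10,  /* CSB physical address, low 32 bits */
        SK_REG_NUM_RINGS = 0x14,  /* number of TX/RX rings */
        SK_REG_NUM_SLOTS = 0x18,  /* slots per ring */
        SK_REG_VNET_HDR  = 0x1c,  /* virtio-net header length in use */
        SK_REG_CTL       = 0x20,  /* start/stop the ptnetmap kthreads */
        SK_REG_KICK_BASE = 0x24,  /* one kick register per ring follows */
    };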

Deliverables

D1 (due by week 3)

Implement a ptnet driver for FreeBSD guests, supporting native netmap applications only. This new driver can be tested using Linux and QEMU-KVM as the hypervisor, which already supports ptnetmap and emulates the ptnet device model. Since the datapath will be equivalent, we expect the same performance as the original ptnetmap (over 20 Mpps on VALE ports, 14.88 Mpps on 10 Gbit hardware ports).

D2 (due by mid-term)

Extend the ptnet FreeBSD driver to export a regular network interface to the FreeBSD kernel. In terms of latency, we expect performance similar to that of the ptnet Linux driver.

D3 (due by week 9)

Extend the ptnet FreeBSD driver to support TCP Segmentation Offloading (TSO) and checksum offloading by means of the virtio-net header, similarly to what is done in the Linux driver. After this step we expect TCP performance similar to that of Linux.

D4 (due by the end of project)

Implement the emulation of the ptnet device model in bhyve, starting from a bhyve version supporting netmap and ptnetmap, which is already available. At this point we expect FreeBSD guests on bhyve to achieve TCP/IP throughput and latency similar to Linux guests on QEMU-KVM (about 20 Gbps for bulk TCP traffic and about 40 thousand HTTP-like transactions per second between two guests communicating through a VALE switch).

Milestones

Start date: 2016/05/23

Estimated end date: 2016/08/23

Timetable:

Final submission

The final code of my project is available at the following SVN repository:

which refers to FreeBSD head (11.0-CURRENT).

The complete list of my SVN commits can be obtained by running the following command on the SVN repository:

    $ svn log -r 302612:HEAD

However, please be aware that additional work has been done after the end of the GSoC project, so the code in the SVN repository is outdated. Part of these contributions (the modifications to netmap) have already been upstreamed to FreeBSD. The modifications to bhyve and VMM.ko are yet to be upstreamed, and are available at https://github.com/vmaffione/freebsd/tree/ptnet-10.3.

Instructions to extract patches of the subsystems I modified are reported below.

All the modifications I made to netmap (see below) have also been merged into the netmap Git repository, and can also be found at https://github.com/luigirizzo/netmap .

My code modifications belong to two different subsystems:

    $ git diff --author="Vincenzo Maffione" 09936864fa5b67b82ef4a9907819b7018e9a38f2 master

    $ git diff b62280e683e2d7abd347a4549c51e086b1b8911a ptnet-10.3

A modified version of QEMU that supports ptnet (not developed in the context of this GSOC project) is available here: https://github.com/vmaffione/qemu/tree/ptnet

Short report of the work done during the project

As a first step, I implemented the ptnet driver for FreeBSD, using as a reference the driver that I had already developed on Linux (https://github.com/luigirizzo/netmap/blob/master/LINUX/ptnet/ptnet.c).

I implemented the routines required to attach/detach the ptnet PCI device (ptnet_probe, ptnet_attach, ptnet_detach, ...). I developed the routines necessary to open and use a ptnet NIC in netmap mode (ptnet_nm_register, ptnet_nm_txsync, ptnet_nm_rxsync, ptnet_nm_config, ...), and to react to MSI-X interrupts. I managed to share some routines (ptnet_nm_krings_create, ptnet_nm_krings_delete, ptnet_nm_dtor) between the Linux and FreeBSD drivers, to improve code reuse.
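
For reference, in the netmap kernel API a driver exposes such callbacks by filling a struct netmap_adapter and passing it to netmap_attach(); the sketch below only illustrates the general shape (ring sizes are placeholders and the surrounding code is not the actual ptnet driver):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <net/if.h>
    #include <net/if_var.h>
    #include <net/netmap.h>
    #include <dev/netmap/netmap_kern.h>

    /* Kernel-side sketch: how a driver advertises rings and callbacks to
     * netmap. The callbacks are the ones mentioned above, but their
     * implementation is not shown here. */
    static void
    ptnet_netmap_attach_sketch(struct ifnet *ifp)
    {
        struct netmap_adapter na;

        bzero(&na, sizeof(na));
        na.ifp = ifp;
        na.num_tx_desc = 256;                 /* placeholder ring sizes */
        na.num_rx_desc = 256;
        na.num_tx_rings = 1;
        na.num_rx_rings = 1;
        na.nm_register = ptnet_nm_register;   /* enter/exit netmap mode */
        na.nm_txsync = ptnet_nm_txsync;       /* publish TX slots to the host */
        na.nm_rxsync = ptnet_nm_rxsync;       /* collect RX completions */
        na.nm_config = ptnet_nm_config;       /* report ring geometry */
        na.nm_krings_create = ptnet_nm_krings_create;
        na.nm_krings_delete = ptnet_nm_krings_delete;
        netmap_attach(&na);
    }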

To allow the ptnet driver to be used by the network stack (and thus by socket applications), I implemented the FreeBSD network driver callbacks (ptnet_init, ptnet_ioctl, ptnet_transmit, ptnet_qflush, ...). I also implemented a ptnet_poll() routine to support FreeBSD polling (DEVICE_POLLING).
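
On FreeBSD 10/11, these callbacks are hooked into the network stack through the usual ifnet attach pattern; the following is only a rough sketch of that wiring (error handling, locking and softc management are omitted, and the function body is illustrative, not the actual driver code):

    #include <sys/param.h>
    #include <sys/bus.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>
    #include <net/if_types.h>
    #include <net/ethernet.h>

    /* Kernel-side sketch of the ifnet wiring for the driver callbacks
     * mentioned above. */
    static void
    ptnet_ifnet_attach_sketch(device_t dev, uint8_t *hwaddr)
    {
        struct ifnet *ifp = if_alloc(IFT_ETHER);

        if_initname(ifp, device_get_name(dev), device_get_unit(dev));
        ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
        ifp->if_init = ptnet_init;          /* bring the interface up */
        ifp->if_ioctl = ptnet_ioctl;        /* SIOCSIFMTU, SIOCSIFCAP, ... */
        ifp->if_transmit = ptnet_transmit;  /* TX entry point */
        ifp->if_qflush = ptnet_qflush;      /* flush queued packets */

        ether_ifattach(ifp, hwaddr);        /* register with the stack */
    }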

Using the FreeBSD virtio-net driver (if_vtnet.c) as a reference, I added support for hardware offloads (TSO and TCP/UDP checksum offloading) by means of the virtio-net header, which is prepended to each Ethernet frame. The virtio-net header can be enabled or disabled by configuring a register of the ptnet device. A boolean sysctl belonging to the netmap module (named ptnet_vnet_hdr) can be tuned for this purpose.
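
As an illustration of what enabling these offloads involves on the ifnet side, here is a hedged sketch modeled on if_vtnet.c (the capability selection in the real driver depends on whether the virtio-net header is in use; this is not the actual ptnet code):

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>

    /* Sketch: advertise checksum/TSO offloads to the network stack. */
    static void
    ptnet_setup_offloads_sketch(struct ifnet *ifp)
    {
        ifp->if_capabilities |= IFCAP_TXCSUM | IFCAP_RXCSUM |
            IFCAP_TSO4 | IFCAP_TSO6;
        ifp->if_capenable = ifp->if_capabilities;

        /* Tell the stack which checksums the "hardware" will compute. */
        ifp->if_hwassist = CSUM_TCP | CSUM_UDP | CSUM_TSO;
    }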

As a second step, I extended bhyve to emulate the ptnet device, using as a reference the emulation that I had already developed for QEMU (https://github.com/vmaffione/qemu/blob/ptnet/hw/net/ptnetmap-netif.c).

As a preliminary step, I adapted previous work by myself, my mentor Luigi Rizzo, and Stefano Garzarella to (i) add netmap and ptnetmap support to bhyve, and (ii) extend libvmm and the vmm module to allow netmap (kernel-space code) to inject interrupts into the guest and intercept guest I/O register accesses.

The emulation for ptnet is available here

which implements (i) the callbacks to manage accesses to the registers of the ptnet device (PTCTL, CSBBAH, CSBBAL, VNET_HDR_LEN, ...), and (ii) the setup necessary to let netmap inject MSI-X interrupts and intercept writes to the ptnet kick registers.
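
For context, a bhyve device model plugs into the PCI emulation layer through a struct pci_devemu; the sketch below shows that structure for a ptnet-like device, with empty callback bodies that merely mark where the register handling described above goes (it is not the actual emulation code):

    #include "pci_emul.h"

    /* Sketch of a bhyve device model registration; callback bodies are
     * intentionally empty and only indicate where the work is done. */
    static int
    pci_ptnet_init(struct vmctx *ctx, struct pci_devinst *pi, char *opts)
    {
        /* Parse the "ptnet,vale1:1" options, open the backing netmap
         * port, allocate the I/O BAR, set up MSI-X and the CSB. */
        return (0);
    }

    static void
    pci_ptnet_bar_write(struct vmctx *ctx, int vcpu, struct pci_devinst *pi,
        int baridx, uint64_t offset, int size, uint64_t value)
    {
        /* Handle writes to PTCTL, CSBBAH, CSBBAL, VNET_HDR_LEN, ... and
         * to the per-ring kick registers. */
    }

    static uint64_t
    pci_ptnet_bar_read(struct vmctx *ctx, int vcpu, struct pci_devinst *pi,
        int baridx, uint64_t offset, int size)
    {
        /* Report features, ring geometry, MAC address, ... */
        return (0);
    }

    static struct pci_devemu pci_de_ptnet = {
        .pe_emu      = "ptnet",
        .pe_init     = pci_ptnet_init,
        .pe_barwrite = pci_ptnet_bar_write,
        .pe_barread  = pci_ptnet_bar_read,
    };
    PCI_EMUL_SET(pci_de_ptnet);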

Finally, I reorganized and fixed the userspace netmap backend for bhyve, completing the work we did in the past. This new backend is intended to replace the old one currently in FreeBSD head. The code in pci_virtio_net.c has been refactored to be independent of the backend used, which is particularly useful with respect to the management of the virtio-net header. The backends (TAP, netmap, ptnetmap) are now implemented in a separate file:

Tests and examples

On my test platform (an 8-core i7 machine with 8 GB of memory), I ran two identical bhyve VMs, each with a ptnet NIC. The backend of each ptnet NIC is a different port of the same VALE switch, so that the two VMs sit on the same virtual LAN. With this testbed I measured the unidirectional throughput of netmap and socket applications, with one VM acting as the sender and the other as the receiver. To test netmap applications I used pkt-gen, measuring about 25 Mpps at the minimum packet size (60 bytes). To test socket applications I used the popular netperf tool, in particular its TCP_STREAM test, which reports about 20 Gbps.

An example of how to use various NICs (including ptnet) with a bhyve VM; the role of each NIC and its backend is described below the command:

# bhyve -c 2 -m 1G -A -H -P \
        -s 31,lpc -l com1,stdio \
        -s 0:0,hostbridge \
        -s 1:0,virtio-net,tap1 \
        -s 2:0,virtio-net,vale0:2 \
        -s 3:0,ahci-hd,freebsdimg.raw \
        -s 4:0,ptnet,vale1:1 \
        -s 5:0,ptnetmap-memdev \
        vm1
where the bhyve VM named "vm1" has three virtual NICs: a virtio-net NIC with a TAP backend (tap1) in PCI slot 1, a virtio-net NIC with the userspace netmap backend (VALE port vale0:2) in slot 2, and a ptnet NIC with the kernel-space ptnetmap backend (VALE port vale1:1) in slot 4. The ptnetmap-memdev device in slot 5 is the ptnetmap memory device needed by the ptnet NIC.
