High Performance P4 Software Switch

Project description

Software switches are a key component of any cloud infrastructure, and even proposals focused on hardware targets, such as P4 (http://p4.org), cannot ignore software implementations and their performance.

Historically, software packet processors have been limited in performance by network I/O, but this is no longer true with high-speed frameworks such as netmap and DPDK.

In this project I would like to implement a modified version of the reference P4 switch (https://github.com/p4lang/behavioral-model) on FreeBSD that uses netmap for faster packet I/O. The goal is to reach a throughput on the order of 1 Mpps (the current reference implementation is limited to 150 Kpps for a simple L2 switch with 2 hosts).

This would enable P4 to be used in fast networking experiments as well as in real environments.

Approach to solving the problem

The required steps would include:



Port the existing reference implementation code to FreeBSD. Develop a simplified P4 switch target which replaces pcap with netmap for packet I/O.


Once the I/O is faster, find the new bottlenecks and address them. I will probably need to modify the target to use pre-allocated packet buffers and to process packets in multiple threads arranged as a pipeline, while keeping synchronization costs low.


Extend the resulting target to support the more advanced features used by common P4 programs (packet replication, learning, multicast groups), and evaluate its performance against the reference implementation.


Develop an extended demo which implements the OpenFlow protocol on top of the developed target, and evaluate its performance against the reference implementation and other OpenFlow software switches.
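
The pre-allocated packet buffers mentioned in the steps above could look roughly like the following. This is only a minimal sketch, not the code of the project: the names PacketBuf and PacketPool, the 2048-byte slot size and the single-owner (non thread-safe) design are all assumptions made for illustration. The idea is that every buffer is allocated once at startup and then recycled through a free list, so the per-packet path never calls the allocator.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size packet buffer; 2048 bytes is an assumed slot size.
struct PacketBuf {
  std::array<std::uint8_t, 2048> data;
  std::size_t len = 0;
};

// Hypothetical pool: all buffers are allocated once in the constructor and
// then recycled through a free list. Not thread-safe: one owning thread.
class PacketPool {
 public:
  explicit PacketPool(std::size_t count) : storage_(count) {
    free_.reserve(count);  // release() will never need to reallocate
    for (auto& buf : storage_) free_.push_back(&buf);
  }

  // Returns nullptr when the pool is exhausted instead of using the heap.
  PacketBuf* acquire() {
    if (free_.empty()) return nullptr;
    PacketBuf* buf = free_.back();
    free_.pop_back();
    buf->len = 0;
    return buf;
  }

  // Hands a buffer previously obtained from acquire() back to the pool.
  void release(PacketBuf* buf) {
    assert(buf != nullptr);
    free_.push_back(buf);
  }

  std::size_t available() const { return free_.size(); }

 private:
  std::vector<PacketBuf> storage_;  // backing memory, allocated once
  std::vector<PacketBuf*> free_;    // stack of currently unused buffers
};
```

A receive thread would call acquire() for each incoming packet and the last pipeline stage would call release(); sharing one pool between threads would need additional synchronization that this sketch leaves out.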


This is the expected timeline with the milestones, with the actual work done added by linking to relevant soc-status mailing list posts.

End of GSoC: State of the Project

I successfully implemented D1 and D2, improving the performance of the target by 2x-4x (depending on the P4 program).

Deliverable D3 was about improving the performance of advanced features like multicast and learning, but after many attempts I did not obtain a significant speedup. The existing implementation of these features is nevertheless compatible with the improvements introduced by D1 and D2.

To allow the simple_switch target to use the newly developed lockless queue, I had to adapt the queue to more advanced behaviors such as rate limiting and multiple producers (my original lockless queue was single producer, single consumer), so I used the time allocated for D3 to do that.
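
For readers unfamiliar with the technique, a single-producer, single-consumer lockless queue can be sketched as a ring buffer with two atomic counters. The code below is an illustration only and is not the queue developed in this project: the name SpscRing, the power-of-two capacity requirement and the 64-byte alignment are assumptions, and it does not cover the multiple-producer or rate-limiting extensions described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical single-producer/single-consumer ring buffer.
// Exactly one thread may call try_push and exactly one may call try_pop.
template <typename T>
class SpscRing {
 public:
  // Capacity must be a power of two so that wrapping is a cheap bit mask.
  explicit SpscRing(std::size_t capacity)
      : slots_(capacity), mask_(capacity - 1) {
    assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
  }

  // Producer side. Returns false when the ring is full.
  bool try_push(T value) {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (tail - head == slots_.size()) return false;  // full
    slots_[tail & mask_] = std::move(value);
    // Publish the slot: the consumer's acquire load of tail_ sees the write.
    tail_.store(tail + 1, std::memory_order_release);
    return true;
  }

  // Consumer side. Returns std::nullopt when the ring is empty.
  std::optional<T> try_pop() {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head == tail) return std::nullopt;  // empty
    T value = std::move(slots_[head & mask_]);
    // Free the slot: the producer's acquire load of head_ sees the read done.
    head_.store(head + 1, std::memory_order_release);
    return value;
  }

 private:
  std::vector<T> slots_;
  const std::size_t mask_;
  // Separate cache lines (64 bytes assumed) to limit false sharing.
  alignas(64) std::atomic<std::size_t> head_{0};  // next slot to read
  alignas(64) std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

Because each counter is written by only one thread, no lock or compare-and-swap is needed. Supporting multiple producers breaks that property, since several threads then compete for the tail counter and have to claim slots atomically.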

I couldn't set up the OpenFlow demo for a performance comparison because I ran out of time (I lost almost two weeks because this summer I also had my thesis dissertation). I still have more basic performance comparisons that show the improvements.

The project as of now is structured this way:

How to test the improvements

To test the performance of the fast_switch target (which includes all the features):

The Code

SummerOfCode2016/HighPerformanceP4SoftwareSwitch (last edited 2016-08-22 11:08:52 by YuriIozzelli)