MartaCarbone/profile - FreeBSD Wiki

The motivation for removing the giant switch was mostly readability, although it might eventually also help to optimize some common checks (e.g. by providing different dispatch tables for IPv4 vs IPv6 and so on).

The replacement was done as follows:

in a first pass, I removed all the 'goto' statements (and the corresponding labels) from the body of ipfw_chk, adjusting the code and variables to support a clean exit from the nested loops;
at this point, each 'case' of the switch was encapsulated in a function;
the body of the main switch was replaced by a call to these functions through a dispatch table. At the moment there is only one table, irrespective of the type of packet (IPv4, IPv6, layer 2);
as a further cleanup, the pullup macro was replaced with a function;
all changes were extensively commented.
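
The steps above can be illustrated with a minimal user-space sketch in C. It is not the actual ipfw code: the opcode names, the chk_args structure and the handler signature are hypothetical and only show the general shape of a switch replaced by a single dispatch table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes; the real ipfw opcode set is much larger. */
enum opcode { OP_NOP, OP_PROTO, OP_COUNT, OP_LAST };

/* Hypothetical state handed to every handler. */
struct chk_args {
    int proto;              /* IP protocol number of the packet */
    int arg;                /* argument of the current microinstruction */
    unsigned long counter;  /* packets counted so far */
};

/* Each former 'case' becomes a function: 1 on match, 0 otherwise. */
typedef int (*op_handler)(struct chk_args *);

static int op_nop(struct chk_args *a)   { (void)a; return 1; }
static int op_proto(struct chk_args *a) { return a->proto == a->arg; }
static int op_count(struct chk_args *a) { a->counter++; return 1; }

/* One table for all packet types, indexed by opcode. */
static const op_handler dispatch[OP_LAST] = {
    [OP_NOP]   = op_nop,
    [OP_PROTO] = op_proto,
    [OP_COUNT] = op_count,
};

/* The body of the former switch collapses to one indirect call. */
static int run_op(int op, struct chk_args *a)
{
    assert(op >= 0 && op < OP_LAST && dispatch[op] != NULL);
    return dispatch[op](a);
}
```

Providing separate tables per packet type, as mentioned earlier, would then only mean selecting a different array before the rule loop.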

The resulting code is more readable: the huge switch is now a single line calling a function through the dispatch table, and the main loops over the rules and over the microinstructions are clearly visible.

Before and after these changes I took some performance measurements to evaluate their impact. Since the main architecture of the code was unmodified, I did not expect big changes; the only unknowns were the cost of the indirect call compared with the direct jump of a switch(), and possible optimizations that might have been lost because the code is no longer inline.

Instead of heavily instrumenting the code, I ran the tests by doing a set of pings with different kernel sources (before and after the change) and with different ipfw configurations, and plotting the distribution of ping times. The resolution of the measurement is 1us, and the base level for the ping times is around 70us (at least for the machine and the 100Mbit/s switched network I was working with).

To see the effect of the changes (which might be in the nanosecond range), in most of the tests I forced the loops to be run many times, either with explicit ipfw configurations (e.g. 100 'count' instructions, possibly with multiple microinstructions for each rule), or by wrapping the call to ipfw_chk() in an explicit loop which is run 100 times. This gives slightly better resolution without requiring heavier modifications to the code to read the TSC and report the values to userland. Experiments with the TSC will be done later.
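
The explicit loop around ipfw_chk() can be sketched in C as follows. This is only an illustration of the idea: chk_fn, chk_repeated and fake_chk are hypothetical names, and the real ipfw_chk() has a different signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the signature of ipfw_chk(). */
typedef int (*chk_fn)(void *args);

#define CHK_REPEAT 100  /* repetition count used in the tests */

/*
 * Call the check function CHK_REPEAT times on the same packet and
 * return the last verdict, so that its cost is amplified enough to
 * show up in the ping times.
 */
static int chk_repeated(chk_fn chk, void *args)
{
    int verdict = 0;

    assert(chk != NULL);
    for (int i = 0; i < CHK_REPEAT; i++)
        verdict = chk(args);
    return verdict;
}

/* Demonstration stand-in: counts how many times it was invoked. */
static int fake_chk(void *args)
{
    (*(int *)args)++;
    return 0;
}
```

Note that repeating the call also repeats any side effect of the rules, such as the counters updated by 'count', which is acceptable for a timing test but not for normal operation.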

In detail, the following test cases were considered:

A. 1, 10 or 100 simple rules with 1 microinstruction each (count proto icmp);
B. 10 or 100 rules with 5 microinstructions each (count proto icmp not proto tcp not proto udp not proto tcp not proto udp).

These have been repeated with three versions of the code:

- switch: the giant switch that is in HEAD;
- dispatch: the dispatch table;
- dispatch100: the dispatch table and a wrapper around ipfw_chk() that calls the function 100 times on each packet.

The tests were run on RELENG_7, HEAD, and Linux 2.6.28 using 500 pings (ping -c 500 -i 0.05) from a computer connected through a 100Mbit full-duplex switch. The distribution of the response times was then plotted, and I took as a reference the values at 20% of the distribution.

Both client and machine under test were unloaded, so the distribution curves were mostly flat up to 80-90% of the samples.
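
Reading a reference value off the distribution can be sketched in C as below. The exact method used to pick the 20% point is not stated, so the indexing rule here (the sample at position floor(frac * (n - 1)) of the sorted array) is an assumption, and value_at_fraction is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>

/* Ascending order comparison for qsort(). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;

    return (x > y) - (x < y);
}

/*
 * Sort the ping times in place and return the sample found at the
 * given fraction (0.0 to 1.0) of the sorted distribution.
 */
static double value_at_fraction(double *samples, size_t n, double frac)
{
    assert(n > 0 && frac >= 0.0 && frac <= 1.0);
    qsort(samples, n, sizeof(samples[0]), cmp_double);
    return samples[(size_t)(frac * (double)(n - 1))];
}
```

Taking a low fraction such as 0.2 keeps occasional slow replies in the tail of the distribution from affecting the reference value.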

Results (ping times in us, at 20% of the distribution) are in the following table:

test case        switch     dispatch   dispatch100
HEAD-A-1         79 (79)    79 (79)    98 - 98
HEAD-A-10        80 (80)    80 (80)    200 - 200
HEAD-A-100       88 (86)    89 (87)    1040 - 1050
HEAD-B-1
HEAD-B-10        81 (80)    81 (81)    316 - 315
HEAD-B-100       95 (94)    100 (97)   2203 - 2203
RELENG_7-A-1     74
RELENG_7-A-10    75
RELENG_7-A-100   82
RELENG_7-B-1
RELENG_7-B-10    76
RELENG_7-B-100   90
linux(*)
linux-A-1        75         75         111 - 107
linux-A-10       76         76         205 - 207
linux-A-100      85         85         1038 - 1041
linux-B-1        75         75         113 - 106
linux-B-10       78         77         323 - 315
linux-B-100      99         96         2170 - 2170

Values in parentheses refer to an amd64 system. (*) The Linux system tested runs a 2.6.28-11-generic (i686) Linux kernel on Ubuntu. Where two values are separated by "-", the second one comes from a further test done under the same conditions.

Looking at the curves (http://info.iet.unipi.it/~marta/ipfw/report2_plot/), one can see that the introduction of the dispatch table does not significantly affect the execution time of the code. Test cases with few rules are almost the same, while slight differences arise when evaluating more rules with complex microinstructions (see curves HEAD-B-10 and HEAD-B-100 for each case).

Also, from the above and other measurements (done against a Linux system on the same PC) we can derive the time spent in each of the phases of ipfw_chk() processing, namely:

- entering ipfw_chk() and setting up variables for the processing of instructions (1 RWlock): 100ns
- processing the rule header and action: 50ns
- processing a simple microinstruction: 12ns
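
As a rough illustration, the three figures above can be combined into a first-order estimate of the cost of one ipfw_chk() call. The sketch below assumes that the costs simply add up and that every rule and microinstruction is evaluated; this additive model and the name estimate_chk_ns are assumptions made here for illustration, not something derived in the measurements.

```c
#include <assert.h>

/* Per-phase figures quoted in the text, in nanoseconds. */
#define CHK_ENTRY_NS   100UL  /* entering ipfw_chk(), setup, 1 RWlock */
#define RULE_HEADER_NS  50UL  /* rule header and action */
#define MICROINSN_NS    12UL  /* one simple microinstruction */

/*
 * Hypothetical helper: estimated cost in ns of one call that walks
 * 'rules' rules, each with 'microinsns_per_rule' microinstructions.
 */
static unsigned long
estimate_chk_ns(unsigned long rules, unsigned long microinsns_per_rule)
{
    assert(rules > 0);
    return CHK_ENTRY_NS +
        rules * (RULE_HEADER_NS + microinsns_per_rule * MICROINSN_NS);
}
```

Under this model, 100 rules with one microinstruction each come to 100 + 100 * (50 + 12) = 6300ns per call, and 100 rules with five microinstructions each to 11100ns.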

As expected, the results do not show significant performance differences between the switch and the dispatch table versions, but the resulting code is definitely more readable than the old one.