The motivation for removing the giant switch was mostly readability, although eventually it might also help to optimize some common checks (e.g. by providing different dispatch tables for IPv4 vs IPv6, and so on).

The replacement was done as follows:

The resulting code is more readable: the huge switch becomes a single line calling through the dispatch table, and the main loops around rule parsing and microinstruction processing are clearly visible.
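As a purely illustrative sketch of the transformation (the names check_handler_t, check_nop, check_proto and ipfw_check_dispatch are invented for this example; the real ipfw opcodes and handler arguments differ), the switch-to-table conversion looks roughly like this:

{{{
/*
 * Illustrative sketch only: names are hypothetical, not taken from the
 * actual ipfw sources.
 */

/* every microinstruction handler gets the same signature */
typedef int (*check_handler_t)(void *args);

static int check_nop(void *args)   { (void)args; return 1; }  /* always matches */
static int check_proto(void *args) { (void)args; return 0; }  /* protocol check */

/* one entry per opcode, indexed by the opcode value */
static check_handler_t dispatch_table[] = {
	[0 /* O_NOP   */] = check_nop,
	[1 /* O_PROTO */] = check_proto,
	/* ... one handler per opcode ... */
};

/*
 * What used to be a several-hundred-line "switch (cmd->opcode)" inside
 * the inner loop becomes a single indirect call through the table.
 */
static int
ipfw_check_dispatch(int opcode, void *args)
{
	return dispatch_table[opcode](args);
}
}}}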

Before and after these changes I ran some performance measurements to evaluate their impact. Since the overall architecture of the code was unmodified, I did not expect big differences; the only unknowns were the cost of the indirect call compared to the direct jump of a switch(), and possible compiler optimizations lost because the code is no longer inlined.

Instead of heavily instrumenting the code, I ran the tests by doing a set of pings with different kernel sources (before and after the change) and different ipfw configurations, and plotting the distribution of ping times. The resolution of the measurement is 1 us, and the base level for the ping times is around 70 us (at least for the machine and the 100 Mbit/s switched network I was working with).

To see the effect of the changes (which might be in the nanosecond range), in most of the tests I forced the loops to run many times, either with explicit ipfw configurations (e.g. 100 'count' rules, possibly with multiple microinstructions per rule), or by wrapping the call to ipfw_chk() in an explicit loop run 100 times. This gives slightly better resolution without requiring heavier modifications to the code to read the TSC and report the values to userland; experiments with the TSC will be done later.
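A minimal userland sketch of the loop-amplification idea (ipfw_chk_stub() below is just a placeholder; the real patch wraps the existing in-kernel call to ipfw_chk(), whose arguments are not shown here):

{{{
#include <stdio.h>

#define PROFILE_ITERATIONS 100	/* amplify the per-packet cost */

/* placeholder standing in for the real firewall check */
static int
ipfw_chk_stub(void)
{
	return 0;
}

int
main(void)
{
	int i, ret = 0;

	/*
	 * Repeating the check 100 times per packet makes a per-call cost
	 * in the tens of nanoseconds visible at the ~1 us ping resolution.
	 */
	for (i = 0; i < PROFILE_ITERATIONS; i++)
		ret = ipfw_chk_stub();

	printf("last result: %d\n", ret);
	return 0;
}
}}}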

In detail, the following test cases were considered:

These have been repeated with three versions of the code:

The tests were run on RELENG_7, HEAD, and Linux 2.6.28 using 500 pings (ping -c 500 -i 0.05) from a computer connected through a 100 Mbit full-duplex switch. The distribution of the response times was then plotted, and I took as a reference the values at the 20th percentile of the distribution.

Both the client and the machine under test were unloaded, so the distribution curves were mostly flat up to 80-90% of the samples.

Results are in the following table:

Values reported in "()" refer to an amd64 system. (*) The Linux system tested runs the 2.6.28-11-generic (i686) kernel on Ubuntu. Values marked with "-" are further tests done under the same conditions.

Looking at the curves (http://info.iet.unipi.it/~marta/ipfw/report2_plot/) one can see that the introduction of the dispatch table does not affect the execution time of the code. Test cases with few rules are almost the same, while slight differences arise when evaluating more rules with complex microinstructions (see curves HEAD-B-10 and HEAD-B100 for each case).

Also, from the above and other measurements (done against a Linux system on the same PC) we can derive the time spent in each phase of ipfw_chk() processing, namely:

- entering ipfw_chk() and setting up variables
- processing the rule header and action: 50 ns
- processing a simple microinstruction: 12 ns
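As a rough consistency check (purely illustrative arithmetic, not code from the patch), the figures above predict the extra cost of one of the test configurations, e.g. 100 'count' rules with one simple microinstruction each:

{{{
#include <stdio.h>

int
main(void)
{
	const double t_rule  = 50e-9;	/* rule header + action, seconds */
	const double t_micro = 12e-9;	/* one simple microinstruction   */
	const int    nrules  = 100;	/* e.g. a ruleset of 100 'count' rules */

	/* expected extra time per packet over the baseline configuration */
	double delta = nrules * (t_rule + t_micro);

	printf("expected extra cost: %.1f us\n", delta * 1e6);	/* about 6.2 us */
	return 0;
}
}}}

An extra cost of a few microseconds per packet is well above the 1 us measurement resolution, which is why the 100-rule configurations make the per-rule cost visible against the ~70 us ping baseline.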

As expected, the results do not show significant performance differences between the switch and the dispatch table versions, but the resulting code is definitely more readable than the old one.
