IPFW NG (WIP)

This is a page for the work-in-progress ipfw modernization project, IPFW NetGraph (or Next Generation if you insist, but those words are worn out). This page is more developer-oriented, but please keep in mind that the CLI must be user-friendly (being more user-friendly at that level is one of the goals).

Overview

Motivation

Our ipfw(4) is good, but badly outdated. It's a crazy mix of shiny blocks, brilliant ideas and ancient squeaky farts. Its opcode-based structure, amongst some other cool things, has very great potential - in theory - which often stays unimplemented in practice for many years. And everyone who has had to manage rulesets of thousands of rules knows it is manageable, but requires effort. In fact, in some cases even a setup of fifty-odd rules doing something non-trivial can be very hard to grok - imagine e.g. a NAT for several external channels with "hairpinning" port redirects (from the LAN to the gateway's external address and then back to the LAN).

So this drives the motivation to do ipfw's... rewrite? No. Rewriting anything from scratch is almost always a very bad idea: it loses the userbase accustomed to the previous syntax etc. and leaves users angry about new bugs. So this will be a rework, preserving as much backwards compatibility for users (POLA) as possible. Given that ipfw already has many good ideas in its internals, this is doable.

The first thing that simplifies ruleset management, and was demanded by users, is the ability to support multiple rulesets. That is, a thing like Cisco ACLs - one can have a separate ruleset for every interface/direction (in or out). Something slightly better is a thing like pf anchors or iptables "chains" - where one can call a chain like a subroutine and then return from it. I've already implemented the ipfw call/ipfw return actions (available in 8.3R and 9.0R), but that was just a hack, very quick to implement, a stub for those who need at least something to ease maintainability immediately. A more complete and consistent solution must be created.

The second big thing in ipfw which is inadequate for current needs is dynamic rules. Even leaving that aside, they have to be rewritten just to support multiple rulesets, as will be shown below.

To be generic and the Right Thing(tm), this is HUGE work, because we need to implement the following:

Quick unsystematic summary

  ipfw chain mychain create
  ipfw chain common create default-policy allow
  ipfw chain lanusers create
  ipfw chain mychain add 200 ...
  ipfw chain mychain delete 300 ...
  ipfw chain mychain destroy
  ipfw add 100 call common ip from any to any   # call from main ruleset
  ipfw ng_bpf mypktbody create
  ipfw ng_bpf mypktbody config proto udp and udp[42:4]=0x01020304
  ipfw add 200 deny ngmatch mypktbody
  ipfw bind mychain to interface em0 direction out
  ipfw bind lanusers to interface vlan100
  ipfw pfil-order move ipfilter bottom

Hope that's intuitive enough for users. And the netgraph-based design here is very straightforward: for the examples above, you just create netgraph nodes named "ipfwmychain", "ipfwcommon", etc., then send an NGM_IPFW_ADD message to the node named after the argument of the "chain" keyword, or to the main ng_ipfw node ("ipfw:") if there is none.
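As a rough illustration of the dispatch just described, the following Python sketch models how /sbin/ipfw might choose the netgraph node that receives a command. The function name `target_node` is hypothetical, and the convention "node name = ipfw + chain name" is inferred from the examples above; the netgraph plumbing itself is not shown.

```python
def target_node(argv):
    """Return the netgraph address that should receive this ipfw command.

    argv is the command line after "ipfw",
    e.g. ["chain", "mychain", "add", "200"].
    """
    if len(argv) >= 2 and argv[0] == "chain":
        # Chains live in nodes named "ipfw" + chain name (assumed convention).
        return "ipfw" + argv[1] + ":"
    # No "chain" keyword: the command goes to the main ng_ipfw node.
    return "ipfw:"
```

With this rule, `ipfw chain mychain add 200 ...` would be addressed to "ipfwmychain:", while a plain `ipfw add 100 ...` would still go to "ipfw:".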

Why Netgraph?

Netgraph here is not a stack, but a clean alternative to hacking raw_ip.c for new setsockopt()'s. Currently, ipfw is not very modular - to implement a new command, you need to introduce a new socket option and put it into a file totally unrelated to ipfw. And support parts of the ipfw code there - just to check if the module is loaded, for example. Or invent a home-grown protocol to multiplex all new commands into one optname. And try to support previous versions (it was hard to keep interfaces compatible when the ABI changed in 8.2)... None of that sounds hard or important, but ipfw's control path is nowadays still a total mess (remember the broken ipfw nat buffer? it was broken for several major releases...).

And in Netgraph all those problems are already solved in a clean way (netgraph messages do provide clean versioning, etc.). As a bonus, an advanced user is able to manually attach a node (a ruleset lives in a node) to some non-typical point in a complex configuration (usually /sbin/ipfw will do this). In fact, the main use of Netgraph here is to give us something like an alternative to Linux's Netlink.

NB: For typical uses all Netgraph complexity must be hidden from the user behind the habitual /sbin/ipfw. This must follow mpd's model - you may know something about Netgraph, but you don't have to (as a consequence, mpd is widely used, but Netgraph itself, the Netgraph-scary-to-user, is not). The system, however, should be transparent to an advanced user who knows what they are doing.

What is the problem with dynamic rules?

The main problem is that a dynamic rule is just a pointer to a static rule:

So if a packet entered via interface X and there is a state (dynamic rule) for it, then whatever you write in the rules for Y, the first keep-state will accept the packet, no matter which rule it belongs to.

You can try to exploit the fact that in ipfw a dynamic rule just jumps to the parent rule's action part and, if that action is "skipto", continues in the static ruleset (where the interface can then be checked).

But this becomes an unmaintainable mess even with 3-4 interfaces (it's like assembly language programming), and if you have hundreds of interfaces, you may find there are not enough of the 65536 rule numbers to handle all X*Y combinations (not all interfaces need be stateful), even if the config is somehow generated automatically. At least one real-life company has actually run into this situation.

And now... Suppose you have several rulesets, one per interface (X and Y), and just the usual ipfw2 dynrules with a skipto. The skipto sits somewhere in the ruleset for X, and a packet in the opposite direction (from B to A) meets a check-state in another ruleset, the one for Y. And then it suddenly jumps to... the first ruleset, for X. Oooops.

That's why the current dynamic rules are absolutely not suited to multiple rulesets and must be rewritten. The other reason is to simplify other things: six rules in the example above just for one reply-to are too complex. We want a cleaner solution.
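To put a number on the X*Y problem, here is a small Python calculation. It assumes, purely for illustration, that every ordered pair of distinct stateful interfaces consumes exactly one rule number; a real configuration may need more than one per pair, in which case the limit is reached sooner. The function names are made up for this sketch.

```python
RULE_NUMBERS = 65536  # size of the rule-number space quoted in the text

def ordered_pairs(n_ifaces):
    """Number of (in-interface, out-interface) combinations with X != Y."""
    return n_ifaces * (n_ifaces - 1)

def first_count_that_overflows(limit=RULE_NUMBERS):
    """Smallest number of stateful interfaces whose pairs exceed `limit`."""
    n = 2
    while ordered_pairs(n) <= limit:
        n += 1
    return n
```

Under that assumption 256 interfaces still fit (65280 pairs), while 257 interfaces need 65792 numbers and no longer do, so "hundreds of interfaces" is indeed where the scheme breaks down.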

What do dynamic rules need?

Really, ipfw and many other components of the network subsystem suffer from the fact that each component is forced to do its own connection state tracking, e.g. NAT, dynamic rules, Netflow, all are separate. This can be good sometimes, but:

We need unified state tracking in our firewall which will do all the needed operations. It will be like Linux's conntrack or pf's states, in that it must:

What else?

Modules

We need loadable modules able to implement loadable opcodes. For example, reserve the first 128 opcodes for built-in functions (those present nowadays) and the other 128 for dynamically registered modules. Currently there is only one module, ipfw_nat, and it is implemented as a hack. DIFFUSE is another example; it made a better attempt but is still not in the tree. In fact, having a framework for modules will encourage people to write them. DIFFUSE might have been finished and committed much earlier if we had had a framework, who knows...

Another application of modules is utilizing hooks in dynamic rules, e.g. a helper that tracks FTP to allow data connections.
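The 128/128 split can be pictured with a small user-space model. The Python sketch below is only an illustration of the numbering scheme; the class `OpcodeRegistry` and its methods are hypothetical and say nothing about how the kernel side would be locked or laid out.

```python
BUILTIN_OPCODES = 128   # opcodes 0..127 stay reserved for built-in functions
TOTAL_OPCODES = 256     # the text's 128 + 128 split

class OpcodeRegistry:
    """Hands out opcode numbers 128..255 to loadable modules."""

    def __init__(self):
        self._handlers = {}          # opcode number -> (name, handler)

    def register(self, name, handler):
        """Give the module the lowest free dynamic opcode number."""
        for opcode in range(BUILTIN_OPCODES, TOTAL_OPCODES):
            if opcode not in self._handlers:
                self._handlers[opcode] = (name, handler)
                return opcode
        raise RuntimeError("no free dynamic opcode slots")

    def unregister(self, opcode):
        """Release a dynamic opcode, e.g. when the module is unloaded."""
        if opcode < BUILTIN_OPCODES:
            raise ValueError("built-in opcodes cannot be unregistered")
        del self._handlers[opcode]

    def dispatch(self, opcode, packet):
        """Run the handler registered for a dynamic opcode."""
        _name, handler = self._handlers[opcode]
        return handler(packet)
```

One consequence worth noting: because numbers are handed out at load time, a rule stored in the kernel would have to refer to a module opcode by the number it received, while userland would need the name to translate back and forth.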

Parser in /sbin/ipfw

Totally a mess now, but we can't rewrite it in lex/yacc because all the historical quirks of the syntax do not fit into a clean grammar. Given that modules are wanted, the idea should be borrowed from GEOM - a userland module for each kernel module, consistent keywords at the beginning of the command line with the rest of it passed to the module, etc.

More human-friendly and more readable? Let's introduce a directory /etc/ipfw where files, if present, control behaviour. E.g. we can have /etc/ipfw/tables listing number and name pairs, just like /etc/protocols does (the idea is stolen from iproute2, which names routing tables in a file). Benefits:

Example:

# cat /etc/ipfw/tables
1       badguys
2       client-2000421-3000454
# ipfw list
00001 deny ip from table(badguys) to any
00002 nat ip from table(client-2000421-3000454) to any out via em0
00003 pipe 1 ip from table(3) to any in

Here two tables are named and will be recognized in commands like ipfw add deny ip from table(badguys) to any, while table 3 is not named and is shown as is, like in previous versions (POLA, don't use it if you don't want to, etc.).

Another thing the parser should have is a macro/aliases file, expanded before actual command processing. The obvious use is for things like regexps:
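A minimal Python sketch of the name mapping follows. The file format beyond "number, whitespace, name" is not specified here, so treating `#` as a comment character (as /etc/protocols does) is an assumption, and all three function names are hypothetical.

```python
import re

def parse_tables(text):
    """Parse "number name" lines; '#' starts a comment (assumed)."""
    names = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue                      # blank or malformed line
        names[int(fields[0])] = fields[1]
    return names

def show_rule(rule, names):
    """For output: replace table(N) with table(name) when N is named."""
    def repl(match):
        number = int(match.group(1))
        return "table(%s)" % names.get(number, number)
    return re.sub(r"table\((\d+)\)", repl, rule)

def resolve_rule(rule, names):
    """For input: turn table(name) back into table(N); unknown refs stay."""
    numbers = {name: number for number, name in names.items()}
    def repl(match):
        ref = match.group(1)
        return "table(%s)" % numbers.get(ref, ref)
    return re.sub(r"table\(([^)]+)\)", repl, rule)
```

Fed the two-line file from the example, `show_rule` leaves `table(3)` untouched and prints `table(badguys)` for table 1, which matches the listing above.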

^all$ -> ip4 from any to any

to be able to write e.g.

ipfw add block all

instead of the traditional string. Of course, that's trivial; the more useful application is supporting older syntax. For example, suppose the new dynrules NAT engine will be invoked by the keywords nat44 and nat64 (for traditional IPv4-to-IPv4 NAT and for IPv6-to-IPv4 NAT, respectively), and the older libalias-based ipfw nat will be renamed to libalias ("ipfw add libalias ip from $lan to any"). Then for two major releases /etc/ipfw/preparse will contain the lines:

nat -> libalias
sho -> show
sh -> show

and all older scripts for ipfw nat will still work with libalias as before - POLA preserved. And you'll no longer get the annoying message:

# ipfw sh
ipfw: DEPRECATED: 'sh' matched 'show' as a sub-string

but could also define your own handy aliases.
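The expansion step could look roughly like the Python sketch below. The matching semantics are not defined here, so the sketch assumes each pattern is a regular expression that must match a whole word of the command line, that the first matching line wins, and that replacements are not expanded again. `parse_aliases` and `preparse` are hypothetical names.

```python
import re

def parse_aliases(text):
    """Parse "pattern -> replacement" lines into an ordered list of rules."""
    rules = []
    for line in text.splitlines():
        if "->" not in line:
            continue
        pattern, replacement = line.split("->", 1)
        rules.append((re.compile(pattern.strip()), replacement.strip()))
    return rules

def preparse(command, rules):
    """Expand each word of `command` with the first rule matching it whole."""
    out = []
    for word in command.split():
        for pattern, replacement in rules:
            if pattern.fullmatch(word):
                out.append(replacement)
                break
        else:
            out.append(word)          # no rule matched: keep the word
    return " ".join(out)
```

Whole-word matching is what keeps the `nat -> libalias` line from also rewriting `nat44`; with substring matching the alias file would need explicit anchors on every line.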

Simple substitution is of little use on its own, of course. Still, the "'sh' matched 'show'" behaviour was marked deprecated because it is one of the things that make the /sbin/ipfw parser code such a big mess, and yet another obstacle on the way to converting the parser to clean lex/yacc parsing. The goal is to be able to rewrite /sbin/ipfw to speak only the new syntax, and to support as many of the older syntax quirks as possible via this preparsing mechanism. Consider the following old syntax:

add 60001 deny { tcp or udp } from { table(3) or me } 80,443 to { 127.0.0.1 or 127.0.0.2 or 127.0.0.3 } established

It is old, it is messy to process in the parser, and it regularly shows up in GNATS as bugs with various corner cases. It should really die, all this "proto from X to Y", in favor of a cleaner regular grammar of just an options list (section "RULE OPTIONS (MATCH PATTERNS)" in the man page). But we can't just let it die, because of backward compatibility and users accustomed to that syntax. A more complex preparser, maybe with several sections in lex style, may be provided to deal with some subset of that. It is still an open question which part and how exactly, but something like this should be done.

Syntax and framework parts/extensibility

As shown in the nf-hipac discussion, a better, human-friendly syntax is desirable, but it introduces problems for automated scripts that change the configuration. It would be good to introduce a new Juniper-like syntax, but that syntax is yet to be designed. However, we must foresee this possibility and make this rework of ipfw suitable for such additions in the future.

The first thing which comes to mind is a flat space of opcodes (like in BPF) instead of many small ones in every rule. Then a big Juniper-like/C-style config (or even the current rules chain) could be compiled to them with optimizations. The flat space of opcodes is also a step toward Someone™ implementing a JIT from it to native machine code (as was done for BPF).

The "human vs automation" problem could be solved by division into multiple rulesets. Each chain (the ruleset in the corresponding netgraph node) has:

  1. Stateful settings - dynamic rules always work before static ones. One of:
    • purely static - don't deal with states in this chain at all
    • check dynamic rules, but just skip this chain if a dynamic rule merely exists
    • check dynamic rules and fully process their actions
  2. Static part - this chain consists of one of:
    • a traditional linked list of rules
    • a flat space of opcodes (the opcodes are the same as today, with a few extensions for control flow)
    • an external function for the static check (returns a verdict), e.g. an nf-hipac analogue or a module the user has written in C
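The per-chain evaluation order implied by this list can be sketched in Python as follows. Everything here is a model under assumptions: the three static-part variants are reduced to a single callable that returns a verdict, `find_state` stands in for the dynamic rule lookup, and the `SKIP_CHAIN` verdict and all names are invented for the sketch.

```python
from enum import Enum

class StateMode(Enum):
    STATIC_ONLY = 1      # never consult dynamic rules in this chain
    SKIP_IF_STATE = 2    # a matching dynamic rule means "skip this chain"
    FULL = 3             # run the matching dynamic rule's actions

SKIP_CHAIN = "skip-chain"   # hypothetical verdict: carry on after this chain

def run_chain(mode, find_state, static_check, packet):
    """Evaluate one chain: dynamic rules first, then the static part.

    find_state(packet) returns a callable action or None;
    static_check(packet) returns a verdict. Both are stand-ins for
    whatever the chain is actually built from.
    """
    if mode is not StateMode.STATIC_ONLY:
        action = find_state(packet)
        if action is not None:
            if mode is StateMode.SKIP_IF_STATE:
                return SKIP_CHAIN
            return action(packet)
    return static_check(packet)
```

Hiding the static part behind one callable is what would let a linked list of rules, a compiled opcode program, or an external C module be swapped per chain without the caller noticing.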

It is an open question, however, what to do with counters in such "optimized ways". Personally I prefer not to have them at all (an explicit user request may be an exception): they are already inaccurate on SMP (seen in the wild), and trying to make them accurate wastes memory for an unclear gain - traffic accounting is better done with Netflow, and debugging with counters is just a poor man's solution (proper logging and tracing should exist). That's said about static counters, of course - counters for state tracking are useful and should be precise enough to be the foundation of Netflow export, if that is ever needed.

The modules are planned to use the khelp(9) framework, with hooks in different places, e.g. on dynamic rule state transitions, for Layer 7 tracking (e.g. PORT commands for FTP-DATA connections), etc.

MAY BE CONTINUED

Aside projects/ideas/info

These are not directly about ipfw_ng, but this info can help in discussions.

About nf-hipac and table_ng

There was the nf-hipac project for Linux, offering a way to run a very big static ruleset in an optimized way instead of linear per-rule evaluation. Unfortunately, they went commercial, abandoned the open-source version and even deleted the technical presentation describing the maths behind it. I've found it by file name at www.singularity.be/Trashcan/nf-hipac-nfws2005.pdf - please read there, from slide 9, how a very big ruleset (200,000 static rules) could be optimized into a bunch of B-trees (I've found a more detailed 99-page thesis about how it works here).

This could be thought of as an ipfw table containing not only IP addresses but all packet header fields, both source and destination, while supporting ranges for each parameter (e.g. covering IP addresses in the range 1.2.3.8-2.3.4.5, not only strict prefixes, and likewise ranges for every field). Or it could be thought of as the entire ruleset of a previous-generation firewall (like ipfw1), presented in table form.

Denis Sotchenko (ibl on the RusNet IRC network) proposed implementing such an all-in-one optimized "table" as something called e.g. ipfw table_ng, in the existing ipfw, just as another rule option. This comes from the observation that the current ipfw has, in addition to the static part, a somewhat "algorithmic" part - divert, netgraph, nat, pipe, etc. - where a packet could be changed or dropped in an unpredictable way. Thus an ipfw ruleset can't be converted directly into this optimized table, but this way is not very intrusive - you can move all static matches out into this table_ng and leave in the ruleset only the table_ng/pipe/nat/netgraph/etc. "algorithmic" calls. An even less intrusive way exists (no patches to the current ipfw) if this table is implemented as a separate netgraph node.

There are several problems with this way, however. First, it deals only with the static rules part; the stateful parts of ipfw still need reworking. Second, this solution involves complex math and is hard to implement. Third, it creates a maintenance problem: whenever new opcodes are added to ipfw, the new functionality must be implemented in this table_ng, too.

Another subtle problem comes from the implementation and is not specific to table_ng. Every time you have a human-readable source and compile it to a machine representation in an optimized way, you can't decompile it back to source (to do e.g. ipfw show) - if you can, the optimization is very weak and not worth doing (or it's a very specialized case, like the current radix trees). So you can't do ipfw show if the ruleset is optimized, which is inconvenient. Thus you have to keep source/config files (e.g. pf.conf) - and if you compile anyway, why not make the language very handy for humans, as is done in pf.conf or, even better, in the C-like config of Juniper firewalls?

Here comes the problem: human-friendly languages are a pain to automate. You can easily script ipfw table add/ipfw table delete (and, with slightly more effort, the usual rules too), but not a config file which is compiled at once as one big thing. Trying to work around the absence of ipfw show, or to allow automated "add/delete then recompile", leads to keeping some form of source together with the compiled representation. And this hugely wastes kernel memory and requires a compiler - a complex piece of code - in the kernel. To mitigate this problem, Denis Sotchenko suggested a daemon that keeps the source, allows add/delete rule operations by automated scripts, and then compiles, keeping only the compiled version in the kernel.

Yet another small problem is interface names (via/xmit/recv). These are strings, and the nf-hipac model deals with finite number ranges only, but interfaces can be renamed, can arrive and depart, so you can't just use ifIndex. In the Linux version this is "solved" by virtual index mappings and by vetoing several interface operations while the firewall ruleset is active. But ipfw allows not only single interfaces but also wildcards like "vlan1[2-4]*"...

Why not pf, given its SMP problem will be fixed?

TBD (thing in itself, when syntax sugar causes problems, no L7, cache inefficiency, pfsync version incompat)

Architecture

One of the main performance goals is cache efficiency. The older BSD network code tends to mix different things in universal structures - this has its academic beauty, but times have changed. Nowadays memory is like a disk with some sectors cached in "memory", and even that has been split (L1/L2/L3 caches). Typical cache line size is 64 bytes for 32-bit machines and 128 bytes for 64-bit machines (it may vary, but let's look toward the mainstream). This guides us to the following architectural goals, given that cache is always limited:

Dynamic rules

Currently a dynamic rule is a struct with a key (struct ipfw_flow_id), counters, state and a pointer to the parent rule, which is consulted for the action on a match. The action part can be rather large and consist of any opcodes. It is useful to be able to do anything when a state is found, and we will keep that - that's the reason why it will still be called a dynamic rule instead of the more proper "state". But we can't jump to the parent rule in the case of multiple rulesets, so the obvious solution would be to just copy a buffer of opcodes into the dynrule's struct. But how big should the buffer be? One opcode could be as long as 63 ipfw_insn's, 4 bytes each. Then you discover that for most states there will be just 4-8 bytes.

So a new struct opbuf is introduced, with a length and a refcount. All dynamic rules created at the same point in the ruleset will share the same opbuf. As a further optimization, the pointer to the opbuf should be made a union with one struct ipfw_insn, for the common case of just one opcode (usually "allow"), so as not to reference even one external cache line for the opbuf.

Next, dynrules are supposed to handle the NAT function, too. Then your key needs to consist of three address:port pairs: the LAN endpoint, the external server's endpoint, and the externally visible (translated) address:port of the NAT gateway itself (struct alias_link in libalias is made this way). So you have more complex code to compare addresses, a need to specify which interface is internal and which is external (as in pf), etc. But then you discover that the many installations without NAT just waste 18 bytes for every tracked connection. And what about address family translation (NAT64), with two IPv6 address:port pairs on one side and two IPv4 address:port pairs on the other?

So the state key is taken out of the dynrule and defined as only two address:port pairs. If you have NAT, the dynrule will reference two keys (possibly of different address families), otherwise only one. The information about which address to translate to can also be taken from the key, saving space. The other benefit is that a state key can be (relatively) immutable - separating locks is also good.

What else could be taken out into separate structs? From the perspective of the planned ipfwsync, some info rarely changes while other info can change with every arriving packet - that's the counters. Counters are also hard to keep precise on SMP without a performance loss. Thus counters should also move out of the main dynrule, and will be split into at least forward and opposite directions (a little help for SMP, and they can be checked from action code, e.g. to shape only after 100 Kb of traffic). It should be investigated whether the VM subsystem can offer CoW bits in a separate UMA zone for counters, to ease detecting which of them should be sent in updates by ipfwsync.
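The size argument can be checked with a layout model. The sketch below uses Python's ctypes only as a calculator for C-style struct sizes; `IpfwInsn` is loosely modelled on struct ipfw_insn, while `OpBuf` and `ActionRef` are hypothetical layouts, not a proposed kernel definition.

```python
import ctypes

class IpfwInsn(ctypes.Structure):
    """One 4-byte instruction word, loosely modelled on struct ipfw_insn."""
    _fields_ = [("opcode", ctypes.c_uint8),
                ("len", ctypes.c_uint8),      # length in 32-bit words
                ("arg1", ctypes.c_uint16)]

MAX_INSN_WORDS = 63   # longest single opcode, per the text

class OpBuf(ctypes.Structure):
    """Hypothetical shared action buffer header; the opcodes follow it."""
    _fields_ = [("refcount", ctypes.c_uint32),
                ("length", ctypes.c_uint32)]  # number of words that follow

class ActionRef(ctypes.Union):
    """Either a pointer to a shared OpBuf or a single inline instruction."""
    _fields_ = [("opbuf", ctypes.POINTER(OpBuf)),
                ("inline_insn", IpfwInsn)]

def worst_case_copy_bytes():
    """Bytes needed if every dynrule carried a full-size private copy."""
    return MAX_INSN_WORDS * ctypes.sizeof(IpfwInsn)
```

A private worst-case copy would cost 252 bytes per state, whereas the union costs one pointer and still holds a lone "allow" inline. A flag elsewhere in the dynrule would have to say which member of the union is live; that flag is not shown.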
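A similar ctypes size model shows where the saving comes from. The 18-byte figure is read here as one IPv6 address (16 bytes) plus one port (2 bytes); that is an interpretation rather than something the text states, and every type name in the sketch is hypothetical.

```python
import ctypes

class Endpoint(ctypes.Structure):
    """One address:port pair, sized for IPv6 (assumed layout)."""
    _fields_ = [("addr", ctypes.c_uint8 * 16),
                ("port", ctypes.c_uint16)]

class TripleKey(ctypes.Structure):
    """Rejected layout: every state carries room for a NAT endpoint."""
    _fields_ = [("lan", Endpoint),
                ("remote", Endpoint),
                ("translated", Endpoint)]

class StateKey(ctypes.Structure):
    """Proposed layout: exactly two endpoints per key."""
    _fields_ = [("src", Endpoint),
                ("dst", Endpoint)]

def wasted_bytes_without_nat():
    """Per-connection overhead of the triple key when NAT is unused."""
    return ctypes.sizeof(TripleKey) - ctypes.sizeof(StateKey)

class DynRule:
    """A dynrule references one key, or two when NAT is involved."""
    def __init__(self, key, nat_key=None):
        self.keys = (key,) if nat_key is None else (key, nat_key)
```

With two-endpoint keys a NAT64 state simply references one key per address family, so the key layout itself never has to mix IPv4 and IPv6 fields.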
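The split counters and the "shape only after some traffic" check might be modelled as below. This is a plain Python sketch with invented names; it ignores SMP and ipfwsync entirely, and it reads "100 Kb" as 100 * 1024 bytes, which is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class DirCounters:
    packets: int = 0
    octets: int = 0

@dataclass
class DynCounters:
    """Per-state counters, kept apart from the rarely changing dynrule data."""
    forward: DirCounters = field(default_factory=DirCounters)
    opposite: DirCounters = field(default_factory=DirCounters)

    def account(self, length, is_forward):
        """Charge one packet of `length` bytes to the right direction."""
        side = self.forward if is_forward else self.opposite
        side.packets += 1
        side.octets += length

    def total_octets(self):
        return self.forward.octets + self.opposite.octets

SHAPE_AFTER_OCTETS = 100 * 1024  # "100 Kb" from the text, read as bytes

def should_shape(counters):
    """Action-side check: shape only once enough traffic has passed."""
    return counters.total_octets() > SHAPE_AFTER_OCTETS
```

Keeping the two directions in separate objects mirrors the idea that each direction is mostly updated by a different flow of packets, so the hot, frequently written data stays out of the mostly read-only dynrule.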

TO BE CONTINUED

IpfwNg (last edited 2012-05-11 01:46:49 by VadimGoncharov)