Project/routing API
LLTABLE
There are two main functions in lltable that external callers use to look up lladdr data:
int arpresolve(ifnet *ifp, int is_gw, mbuf *m, sockaddr *dst, u_char *desten, uint32_t *pflags)
and the corresponding
int nd6_resolve(ifnet *ifp, int is_gw, mbuf *m, sockaddr *dst, u_char *desten, uint32_t *pflags)
API Problems
is_gw argument
is_gw is passed only to alter the returned error (EHOSTUNREACH instead of EHOSTDOWN). Lltable should not know anything about error specifics. This can be (and in some packet output routines already is) computed lazily by the caller (D4102 changed that for ether_output()).
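The caller-side computation can be sketched as follows. This is a minimal illustration, not the code from D4102: the helper name toy_refine_resolve_error() and its route_has_gw parameter are hypothetical, and the only assumption taken from the text is that the resolver reports EHOSTDOWN and the caller upgrades it to EHOSTUNREACH when a gateway is involved.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper, not part of lltable or ether_output().
 * The resolver reports a plain EHOSTDOWN; the output routine, which
 * knows whether the route goes through a gateway, refines the error.
 */
int
toy_refine_resolve_error(int error, int route_has_gw)
{
	assert(error >= 0);	/* errno values are non-negative */
	if (error == EHOSTDOWN && route_has_gw)
		return (EHOSTUNREACH);
	return (error);
}
```

With this split the resolver needs no is_gw argument at all; the extra check runs only on the error path.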
Broadcast/Multicast
Broadcast/multicast is handled inside <arp|nd6_>resolve(). This is conceptually wrong, since the mapping depends on the actual protocol set being used. De facto, each output routine handles it itself instead of relying on the _resolve() routines.
Sockaddr usage
There is very little sense in passing a sockaddr as the key to a lookup routine for a particular address family, especially since lltable has internally been using addresses instead of sockaddrs since r286624.
New API
Given that, a set of two new lookup functions is proposed:
int in_resolve_lla(ifnet *ifp, mbuf *m, struct in_addr dst, u_char *desten, uint32_t *pflags)
int in6_resolve_lla(ifnet *ifp, mbuf *m, struct in6_addr *dst, u_char *desten, uint32_t *pflags)
More details in D4962.
Routing
There are numerous functions and wrappers that perform rtentry lookup.
Two most notable are:
struct rtentry *rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags, u_int fibnum)
It performs a radix lookup under the radix head read lock and returns a locked and refcounted struct rtentry, or NULL.
void rtalloc_ign_fib(struct route *ro, u_long ignore, u_int fibnum)
This one is a (popular) wrapper. After calling rtalloc1_fib() it saves the rtentry into the passed ro->ro_rt and unlocks it, providing an unlocked and refcounted rtentry.
API problems
Locking rtentry
The rtentry is locked and refcounted to ensure that the caller can safely dereference its fields. The important fact here is that refcounting the rtentry does not protect you from the rt_ifp ifnet being destroyed after the lookup.
The only reference held by the rtentry is on rt_ifa (which, in turn, is just a structure that does not refcount anything itself). So refcounting the rtentry just gives you stable storage, nothing more. Theoretically, the rtentry could be copied to a structure on the stack, but in practice that is not feasible (sizeof(struct rtentry) > 200 bytes on amd64, with prefix/gw allocated separately, netmask pointer from radix, etc.).
Caller complexity
Instead of getting a pre-computed result, the consumer has to post-process the result itself: if RTF_GATEWAY is set then.., check which MTU is smaller: rt_mtu or ifp->ifp_mtu or rt_ifa->ifa_ifp->ifp_mtu or IN6_LINKMTU(), etc. Again, since most callers are AF-specific, passing sockaddrs as a key adds complexity (especially with IPv6 LLA). Most consumers need very basic things: check if ifp is "right", get the MTU, etc. The current API is overkill for most of these tasks.
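The kind of per-caller post-processing described above, and what a pre-computed result would replace it with, can be sketched as below. All names here (struct toy_route, struct toy_nhop, toy_precompute(), TOY_RTF_GATEWAY) are hypothetical simplifications for illustration, addresses are plain host-order integers, and the convention that a route MTU of 0 means "not set" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RTF_GATEWAY	0x2	/* hypothetical flag value for this sketch */

/* Hypothetical, simplified view of what a caller inspects today. */
struct toy_route {
	uint32_t rt_flags;
	uint32_t rt_gateway;	/* gateway address, if TOY_RTF_GATEWAY is set */
	uint32_t rt_mtu;	/* assumed: 0 means "not set" */
	uint32_t if_mtu;	/* MTU of the transmit interface */
};

/* Compact on-stack result that a lookup could hand back instead. */
struct toy_nhop {
	uint32_t nh_addr;	/* gateway if present, else the destination */
	uint32_t nh_mtu;	/* smaller of route MTU and interface MTU */
};

struct toy_nhop
toy_precompute(const struct toy_route *rt, uint32_t dst)
{
	struct toy_nhop nh;

	assert(rt != NULL);
	nh.nh_addr = (rt->rt_flags & TOY_RTF_GATEWAY) ? rt->rt_gateway : dst;
	nh.nh_mtu = (rt->rt_mtu != 0 && rt->rt_mtu < rt->if_mtu) ?
	    rt->rt_mtu : rt->if_mtu;
	return (nh);
}
```

Doing this once inside the routing code, rather than in every consumer, is the idea behind the goals listed in the next section.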
Internal problems
Returning individual rtentries binds the internal implementation to radix (and to the particular radix.c code). It also prevents a (good) multipath implementation. In general, exposing all internals makes it really hard to hack on the routing code.
New API
Goals
- Provide a pre-computed, easy-to-use result
- Use compact on-stack structures
- Provide different functions for different tasks varying in terms of provided information and speed
- Have per-AF functions and use addresses instead of sockaddr
- Be multipath-ready from day 1.
- Hide all of the internals avoiding passing rtentries to any data/control plane functions outside routing code.
Data path functions
This set of functions returns the minimal amount of data needed to forward a packet (or verify some basic data), namely next-hop data. A next-hop consists of interface, MTU, flags, and either data to prepend or a next-hop address. Right now the next-hop size is 64 bytes (leaving about 40 bytes for pre-calculated prepend data such as an ethernet header).
The next-hop based approach is a very important conversion step.
- It permits pre-calculating/tracking full prepend (ethernet/IB header first, lagg/vlan and tunnel headers next) since number of next-hops is typically very small.
- It permits using different lookup algorithms, since the algorithm's task is reduced to executing 'f(addr) = nhop_index' w/o embedding any special data into internal structures (which is not the case for either 'struct rtentry' or RADIX_MPATH).
- Different lookup algorithms _are_ required: patricia is not so well suited for long IPv6 keys. Additionally, cases like 'single interface with default route' and 'I have an IPv4 full view' also deserve different algorithms to achieve optimal performance. So, having next-hops is a necessary prerequisite for building infrastructure for modular lookup algorithms (see the ipfw(4) modular table lookup infrastructure, where similar ideas were implemented).
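The 'f(addr) = nhop_index' split can be illustrated with a toy example. This is only a sketch under stated assumptions: every name in it is hypothetical, addresses are host-order integers, and the linear longest-prefix match merely stands in for whatever algorithm (radix or otherwise) a real table would use. The point is that both lookup functions share the same next-hop array and can be swapped behind one function pointer type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical next-hop table: the only place forwarding data lives. */
struct toy_nhop {
	const char *ifname;
	uint16_t mtu;
};

static const struct toy_nhop toy_nhops[] = {
	{ "ix0", 1500 },	/* index 0 */
	{ "ix1", 9000 },	/* index 1 */
};

/* Hypothetical prefix table: maps a prefix to a next-hop index only. */
struct toy_prefix {
	uint32_t addr;		/* host byte order */
	uint8_t plen;
	uint16_t nhop_index;
};

static const struct toy_prefix toy_prefixes[] = {
	{ 0x00000000, 0, 0 },	/* 0.0.0.0/0   -> nhop 0 */
	{ 0x0a000000, 8, 1 },	/* 10.0.0.0/8  -> nhop 1 */
	{ 0x0a010000, 16, 0 },	/* 10.1.0.0/16 -> nhop 0 */
};

/* Any algorithm only has to implement this: f(addr) = nhop_index. */
typedef int (*toy_lookup_fn)(uint32_t addr);

static uint32_t
toy_mask(uint8_t plen)
{
	return (plen == 0 ? 0 : 0xffffffffu << (32 - plen));
}

/* Algorithm 1: naive linear longest-prefix match; -1 if nothing matches. */
int
toy_lookup_linear(uint32_t addr)
{
	int best = -1, best_len = -1;
	size_t i;

	for (i = 0; i < sizeof(toy_prefixes) / sizeof(toy_prefixes[0]); i++) {
		const struct toy_prefix *p = &toy_prefixes[i];

		if ((addr & toy_mask(p->plen)) == p->addr &&
		    p->plen > best_len) {
			best = p->nhop_index;
			best_len = p->plen;
		}
	}
	return (best);
}

/* Algorithm 2: "single interface with default route" needs no table. */
int
toy_lookup_default_only(uint32_t addr)
{
	(void)addr;
	return (0);
}

/* Consumers resolve the index against the shared next-hop table. */
const struct toy_nhop *
toy_nhop_get(toy_lookup_fn f, uint32_t addr)
{
	int idx = f(addr);

	assert(idx < (int)(sizeof(toy_nhops) / sizeof(toy_nhops[0])));
	return (idx < 0 ? NULL : &toy_nhops[idx]);
}
```

Because the prefix table stores nothing but an index, the small next-hop array is the natural place to keep pre-calculated prepend data.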
Source: ip_fib.c ip_fib.h in6_fib.c in6_fib.h
Actual functions description:
fibX_lookup_nh_basic()
The simplest form of lookup, no references are provided.
/*
 * Performs IPv4 route table lookup on @dst. Returns 0 on success.
 * Stores nexthop info into provided @pnh4 structure.
 * Note that
 * - nh_ifp cannot be safely dereferenced
 * - nh_ifp represents logical transmit interface (rt_ifp) (e.g. if
 *   looking up address on interface "ix0" pointer to "lo0" interface
 *   will be returned instead of "ix0")
 * - nh_ifp represents "address" interface if NHR_IFAIF flag is passed
 * - however mtu from "transmit" interface will be returned.
 */
int fib4_lookup_nh_basic(uint32_t fibnum, struct in_addr dst, uint32_t flags,
    uint32_t flowid, struct nhop4_basic *pnh4)

struct nhop4_basic {
    struct ifnet *nh_ifp;      /* Logical egress interface */
    uint16_t nh_mtu;           /* nexthop mtu */
    uint16_t nh_flags;         /* nhop flags */
    struct in_addr nh_addr;    /* GW/DST IPv4 address */
};

/*
 * Performs IPv6 route table lookup on @dst. Returns 0 on success.
 * Stores basic nexthop info into provided @pnh6 structure.
 * Note that
 * - nh_ifp represents logical transmit interface (rt_ifp) by default
 * - nh_ifp represents "address" interface if NHR_IFAIF flag is passed
 * - mtu from logical transmit interface will be returned.
 * - nh_ifp cannot be safely dereferenced
 * - nh_ifp represents rt_ifp (e.g. if looking up address on
 *   interface "ix0" pointer to "ix0" interface will be returned instead
 *   of "lo0")
 * - however mtu from "transmit" interface will be returned.
 * - scope will be embedded in nh_addr
 */
int fib6_lookup_nh_basic(uint32_t fibnum, const struct in6_addr *dst,
    uint32_t scopeid, uint32_t flags, uint32_t flowid,
    struct nhop6_basic *pnh6)

struct nhop6_basic {
    struct ifnet *nh_ifp;      /* Logical egress interface */
    uint16_t nh_mtu;           /* nexthop mtu */
    uint16_t nh_flags;         /* nhop flags */
    uint8_t spare[4];
    struct in6_addr nh_addr;   /* GW/DST IPv6 address */
};
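A typical consumer of the basic lookup only wants to know whether an interface is the "right" one, which needs a pointer comparison and no dereference. The userland sketch below shows that pattern. It does not call the real fib4_lookup_nh_basic(): struct toy_ifnet, struct toy_nhop4_basic, toy_stub_lookup4() and toy_src_reachable_via() are hypothetical stand-ins, addresses are host-order integers, and the stub's ENOENT return value is an assumption (the description above only states that 0 means success).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ifnet; never dereferenced below. */
struct toy_ifnet {
	const char *if_name;
};

struct toy_ifnet toy_ix0 = { "ix0" };
struct toy_ifnet toy_lo0 = { "lo0" };

/* Simplified version of the basic nexthop layout. */
struct toy_nhop4_basic {
	struct toy_ifnet *nh_ifp;	/* logical egress interface */
	uint16_t nh_mtu;
	uint16_t nh_flags;
	uint32_t nh_addr;		/* host byte order in this sketch */
};

typedef int (*toy_lookup4_fn)(uint32_t fibnum, uint32_t dst,
    struct toy_nhop4_basic *pnh4);

/* Stub table: 10.0.0.0/8 is reachable via ix0, nothing else is routed. */
int
toy_stub_lookup4(uint32_t fibnum, uint32_t dst, struct toy_nhop4_basic *pnh4)
{
	(void)fibnum;
	if ((dst & 0xff000000u) != 0x0a000000u)
		return (ENOENT);	/* assumed error code */
	pnh4->nh_ifp = &toy_ix0;
	pnh4->nh_mtu = 1500;
	pnh4->nh_flags = 0;
	pnh4->nh_addr = dst;
	return (0);
}

/*
 * Reverse-path style check: is @src routed back through @rcvif?
 * Only the pointer value of nh_ifp is compared, so no reference
 * on the interface is needed.
 */
int
toy_src_reachable_via(toy_lookup4_fn lookup, uint32_t fibnum, uint32_t src,
    const struct toy_ifnet *rcvif)
{
	struct toy_nhop4_basic nh4;	/* compact on-stack result */

	assert(lookup != NULL);
	if (lookup(fibnum, src, &nh4) != 0)
		return (0);
	return (nh4.nh_ifp == rcvif);
}
```

For example, toy_src_reachable_via(toy_stub_lookup4, 0, 0x0a000001, &toy_ix0) is true, while the same query against &toy_lo0 is false.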
fibX_lookup_nh_ext()
A more advanced variant: the IPv4 source address is returned, and some pointers can be referenced if required. Right now internal referencing/dereferencing is a no-op (exactly as in the current rtalloc/rte implementation).
/*
 * Performs IPv4 route table lookup on @dst. Returns 0 on success.
 * Stores extended nexthop info into provided @pnh4 structure.
 * Note that
 * - nh_ifp cannot be safely dereferenced unless NHR_REF is specified.
 * - in that case you need to call fib4_free_nh_ext()
 * - nh_ifp represents logical transmit interface (rt_ifp) (e.g. if
 *   looking up address of interface "ix0" pointer to "lo0" interface
 *   will be returned instead of "ix0")
 * - nh_ifp represents "address" interface if NHR_IFAIF flag is passed
 * - however mtu from "transmit" interface will be returned.
 */
int fib4_lookup_nh_ext(uint32_t fibnum, struct in_addr dst, uint32_t flags,
    uint32_t flowid, struct nhop4_extended *pnh4)

struct nhop4_extended {
    struct ifnet *nh_ifp;      /* Logical egress interface */
    uint16_t nh_mtu;           /* nexthop mtu */
    uint16_t nh_flags;         /* nhop flags */
    uint8_t spare[4];
    struct in_addr nh_addr;    /* GW/DST IPv4 address */
    struct in_addr nh_src;     /* default source IPv4 address */
    uint64_t spare2[2];
};

void fib4_free_nh_ext(uint32_t fibnum, struct nhop4_extended *pnh4)

/*
 * Performs IPv6 route table lookup on @dst. Returns 0 on success.
 * Stores extended nexthop info into provided @pnh6 structure.
 * Note that
 * - nh_ifp cannot be safely dereferenced unless NHR_REF is specified.
 * - in that case you need to call fib6_free_nh_ext()
 * - nh_ifp represents logical transmit interface (rt_ifp) by default
 * - nh_ifp represents "address" interface if NHR_IFAIF flag is passed
 * - mtu from logical transmit interface will be returned.
 * - scope will be embedded in nh_addr
 */
int fib6_lookup_nh_ext(uint32_t fibnum, const struct in6_addr *dst,
    uint32_t scopeid, uint32_t flags, uint32_t flowid,
    struct nhop6_extended *pnh6)

struct nhop6_extended {
    struct ifnet *nh_ifp;      /* Logical egress interface */
    uint16_t nh_mtu;           /* nexthop mtu */
    uint16_t nh_flags;         /* nhop flags */
    uint8_t spare[4];
    struct in6_addr nh_addr;   /* GW/DST IPv6 address */
    uint64_t spare2[2];
};

void fib6_free_nh_ext(uint32_t fibnum, struct nhop6_extended *pnh6)
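The lookup/free pairing that NHR_REF implies can be sketched in userland as follows. This is an illustration of the calling discipline only: the text above notes that referencing is currently a no-op, so the refcount in this sketch is an assumption about how it might behave, and all names (struct toy_ifnet, TOY_NHR_REF and its value, toy_lookup_nh_ext(), toy_free_nh_ext(), toy_ifmtu_for()) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NHR_REF	0x01	/* hypothetical value, for this sketch only */

/* Hypothetical interface object with an explicit reference count. */
struct toy_ifnet {
	int if_refcount;
	uint16_t if_mtu;
};

struct toy_ifnet toy_em0 = { 0, 1500 };

struct toy_nhop4_ext {
	struct toy_ifnet *nh_ifp;
	uint16_t nh_mtu;
	uint16_t nh_flags;
};

/* Stub lookup: every destination except 0 is routed via toy_em0. */
int
toy_lookup_nh_ext(uint32_t dst, uint32_t flags, struct toy_nhop4_ext *pnh)
{
	if (dst == 0)
		return (ENOENT);	/* assumed error code */
	pnh->nh_ifp = &toy_em0;
	pnh->nh_mtu = toy_em0.if_mtu;
	pnh->nh_flags = 0;
	if (flags & TOY_NHR_REF)
		toy_em0.if_refcount++;	/* pin the interface for the caller */
	return (0);
}

/* Must be called exactly once per successful TOY_NHR_REF lookup. */
void
toy_free_nh_ext(struct toy_nhop4_ext *pnh)
{
	assert(pnh->nh_ifp->if_refcount > 0);
	pnh->nh_ifp->if_refcount--;
}

/* Caller that needs to dereference nh_ifp: returns its MTU or -1. */
int
toy_ifmtu_for(uint32_t dst)
{
	struct toy_nhop4_ext nh;
	int mtu;

	if (toy_lookup_nh_ext(dst, TOY_NHR_REF, &nh) != 0)
		return (-1);
	mtu = nh.nh_ifp->if_mtu;	/* safe: reference is held */
	toy_free_nh_ext(&nh);
	return (mtu);
}
```

A caller that only compares nh_ifp against another pointer would pass no flag and skip the free call entirely.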
fibX_lookup_prepend()
The actual functions used for forwarding. Passing struct route permits the function to automatically program the ro_len, ro_flags and ro_prepend fields, so pre-calculated route data can be transparently passed to the output function.
/*
 * Performs lookup in IPv4 table fib @fibnum.
 * Assumes @ro->ro_rt points to 'struct nhop_prepend' storage.
 * In case of successful lookup ro->ro_rt is filled with
 * appropriate interface info and full L2 header to prepend or
 * nhop address. If route does not contain gateway, or gateway is unreachable,
 * NHF_L2_INCOMPLETE flag and gateway address is stored into nh->d.gw4
 * If @hh is not NULL, additional nexthop data is stored there.
 *
 * Returns 0 on success.
 */
int fib4_lookup_prepend(uint32_t fibnum, struct in_addr dst, uint32_t flags,
    uint32_t flowid, struct route *ro, struct nhop4_helper *hh);

/* Non-recursive nexthop */
struct nhop_prepend {
    uint16_t nh_flags;         /* NH flags */
    uint8_t nh_count;          /* Number of nexthops or data length */
    uint8_t spare[3];
    uint16_t nh_mtu;           /* given nhop MTU */
    struct ifnet *nh_lifp;     /* Logical transmit interface */
    struct ifnet *nh_aifp;     /* Interface address */
    union {
        char nh_data[MAX_PREPEND_LEN];  /* data to prepend */
        struct in_addr nh4_addr;        /* IPv4 gw address */
        struct in6_addr nh6_addr;       /* IPv6 gw address */
    };
};

void fib4_free_prepend(struct nhop_prepend *pnh);
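How an output path might consume a prepend-style nexthop can be sketched as below: if the L2 data is complete it is copied in front of the payload, otherwise the caller has to resolve the stored gateway address first. This is a hedged userland illustration. struct toy_nhop_prepend is a reduced copy of the layout above, the flag value TOY_NHF_L2_INCOMPLETE and the size TOY_MAX_PREPEND_LEN are made up for the sketch (40 follows the "about 40 bytes" figure mentioned earlier), and toy_build_frame() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_PREPEND_LEN	40	/* assumed size for this sketch */
#define TOY_NHF_L2_INCOMPLETE	0x01	/* hypothetical flag value */

/* Reduced, hypothetical version of the prepend nexthop. */
struct toy_nhop_prepend {
	uint16_t nh_flags;
	uint8_t nh_count;		/* here: length of nh_data */
	uint16_t nh_mtu;
	union {
		char nh_data[TOY_MAX_PREPEND_LEN];	/* data to prepend */
		uint32_t nh4_addr;			/* IPv4 gw address */
	} d;
};

/* Nexthop with a complete 14-byte ethernet-like header. */
const struct toy_nhop_prepend toy_nh_ready = {
	.nh_flags = 0, .nh_count = 14, .nh_mtu = 1500,
	.d = { .nh_data = "AAAAAA" "BBBBBB" "\x08\x00" },
};

/* Nexthop whose L2 address still has to be resolved. */
const struct toy_nhop_prepend toy_nh_incomplete = {
	.nh_flags = TOY_NHF_L2_INCOMPLETE, .nh_count = 0, .nh_mtu = 1500,
	.d = { .nh4_addr = 0xc0000201 },
};

/*
 * Writes prepend data followed by @payload into @out.
 * Returns the frame length, 0 if the L2 header is incomplete (the
 * caller must resolve d.nh4_addr first), or -1 if it does not fit.
 */
long
toy_build_frame(const struct toy_nhop_prepend *nh, const void *payload,
    size_t plen, void *out, size_t outlen)
{
	assert(nh != NULL && out != NULL);
	if (nh->nh_flags & TOY_NHF_L2_INCOMPLETE)
		return (0);
	if (nh->nh_count > TOY_MAX_PREPEND_LEN ||
	    plen > (size_t)nh->nh_mtu ||
	    (size_t)nh->nh_count + plen > outlen)
		return (-1);
	memcpy(out, nh->d.nh_data, nh->nh_count);
	memcpy((char *)out + nh->nh_count, payload, plen);
	return ((long)((size_t)nh->nh_count + plen));
}
```

In the fast path no per-packet address resolution is needed; the header bytes come straight from the nexthop.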
Control plane functions
Some consumers want to check whether a particular prefix exists, or to check actual rte flags, which cannot be achieved using the fibX functions. Luckily, (al)most all of these consumers are control-plane ones (arp/nd code).
rib_lookup()
The goal is to provide real per-rte information and, if possible, avoid direct rte access. This is accomplished by using struct rt_addrinfo, which already serves as the interaction mechanism between the routing socket and the routing table functions. The tricky part is that rt_addrinfo itself has no storage for prefix, mask and gateway (the latter could be of any size). So the decision is the following: the caller allocates (either on the stack or on the heap) sockaddrs of appropriate size for the fields it wants to retrieve, sets the sa.sa_len field to the size of the allocated buffer and sets the info.rti_info[RTAX_XXX] pointer to that sockaddr. On successful lookup the function then tries to fill every sockaddr whose pointer is non-NULL.
/*
 * Looks up route entry for @dst in RIB database for fib @fibnum.
 * Exports entry data to @info using rt_exportinfo().
 *
 * if @flags contains NHR_REF, refcounting is performed on rt_ifp.
 * All references can be released later by calling rib_free_info()
 *
 * Returns 0 on success.
 * Returns ENOENT for lookup failure, ENOMEM for export failure.
 */
int rib_lookup_info(uint32_t fibnum, const struct sockaddr *dst, uint32_t flags,
    uint32_t flowid, struct rt_addrinfo *info)

void rib_free_info(struct rt_addrinfo *info)
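The caller-allocated buffer convention described for rib_lookup_info() can be sketched in portable userland C as follows. This does not use the real struct rt_addrinfo or rt_exportinfo(): struct toy_sockaddr, struct toy_addrinfo, the TOY_RTAX_* indexes, toy_export_info() and toy_fetch_gateway() are hypothetical stand-ins, and the single hard-coded route is made up. What the sketch keeps from the text is the contract: the caller sets sa_len to its buffer size, only non-NULL entries are filled, and a too-small buffer yields ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_RTAX_DST, TOY_RTAX_GATEWAY, TOY_RTAX_NETMASK, TOY_RTAX_MAX };

/* Hypothetical BSD-style sockaddr with a length byte (32 bytes total). */
struct toy_sockaddr {
	uint8_t sa_len;
	uint8_t sa_family;
	uint8_t sa_data[30];
};

struct toy_addrinfo {
	struct toy_sockaddr *rti_info[TOY_RTAX_MAX];
};

/* Made-up route 10.0.0.0/8 via 192.0.2.1; sa_len is the real length. */
static const struct toy_sockaddr toy_rt_sa[TOY_RTAX_MAX] = {
	[TOY_RTAX_DST]     = { 16, 2, { 0, 0, 10, 0, 0, 0 } },
	[TOY_RTAX_GATEWAY] = { 16, 2, { 0, 0, 192, 0, 2, 1 } },
	[TOY_RTAX_NETMASK] = { 16, 2, { 0, 0, 255, 0, 0, 0 } },
};

/* Export side: fill every requested entry that has enough room. */
int
toy_export_info(struct toy_addrinfo *info)
{
	int i;

	assert(info != NULL);
	for (i = 0; i < TOY_RTAX_MAX; i++) {
		struct toy_sockaddr *dst = info->rti_info[i];

		if (dst == NULL)
			continue;		/* caller did not ask */
		if (dst->sa_len < toy_rt_sa[i].sa_len)
			return (ENOMEM);	/* buffer too small */
		memcpy(dst, &toy_rt_sa[i], toy_rt_sa[i].sa_len);
	}
	return (0);
}

/* Caller side: request only the gateway, offering @buflen bytes. */
int
toy_fetch_gateway(uint8_t buflen, uint8_t gw_out[4])
{
	struct toy_sockaddr gw = { 0 };
	struct toy_addrinfo info = { { NULL } };
	int error;

	assert(buflen <= sizeof(gw));
	gw.sa_len = buflen;			/* advertise buffer size */
	info.rti_info[TOY_RTAX_GATEWAY] = &gw;
	error = toy_export_info(&info);
	if (error == 0)
		memcpy(gw_out, &gw.sa_data[2], 4);
	return (error);
}
```

After a successful export the buffer's sa_len holds the real address length rather than the buffer size, which matches the usual BSD sockaddr convention.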