A sparse uintptr_t-sized array implemented as efficient trie

This is a WIP to implement a structure allowing to have a practically unbound array: mapping of uintptr_t to uintptr_t - something conceptually like:

uintptr_t array[UINTPTR_MAX];

typically used to associate a pointer to some structure with some int-sized index, being simple in use (not having to deal with optimal hash functions, changing number of buckets on growing, etc.).

Array is implemented as a digital tree (or trie), that is, 16-ary tree - this means that, for 32-bit words, the tree in worst case will have 8 levels depth (8 cache-line misses) - and that is upper bound (measures are taken to lower it in common cases). In contrast, red-black tree will achieve such depth already at very small number of entries (about 256). Implementation is inspired by Judy arrays and intended to be used like them, but in FreeBSD kernel.

Why just not port Judy itself to kernel? There are several reasons:

Judy is very complex beast. In Judy, there are approximately 25 (85 for amd64) major data structures and a similar number of minor structures, done in about 1 Mb of source code, different for different architectures. Very hard to understand completely, not even to maintain or debug.
Judy tries to optimize on memory/cache very agressively and sometimes is known to be a CPU hog, while most times such hard optimization is not needed in practice, a slightly worse solution would be enough (e.g. 8 levels instead of 4 is not much worse, compared to current 32 for binary trees).
Last but not least, Judy is LGPLed and is covered by a patent in some countries.

Requirements

The data structure must:

be relatively simple and portable between 32 and 64 bits architectures
use cache as efficiently as possible, provided other goals are met
provide basic functionality: set value (insert-or-replace), get value, delete (clear) value
provide additional handy functionality of B-trees:
- iterate over not all but just set values in (reverse) sorted order
- get next and previous neighbours for a given element
be easy to use

Design overview

Since typical usage is to store pointers, value of NULL/zero is used to mark absence of value - this is limitation of interface. NULL values are not actually stored.

All used data structures have the size of 16 machine words (thus the name "hextree") - so 64 bytes on 32-bit architectures and 128 on 64-bit ones. That sizes are also a typical cache-line size on those architectures, thus all structures are allocated (from UMA zone) cache-line aligned (in cases when cache-line sizes differ, these still are usually multiples of each other - so still good).

The fact structures are always size-aligned gives us following observation: the low 6 bits of pointer to structure are always zero (7 bits on 64-bit architectures, but for portability we have only 6). So we can utilize a trick from Judy: pack important information into these bits, like:

Type of structure we point to (leaf or intermediate node).
In some cases, next hex-digit of path, saving us from storing intermediate node with just one key.

What is path? Well, in tries / digital trees the keys them are usually not stored: the key for a given value is inferred by it's position in the tree, that is, the path from root node. In a straightforward implementation of 16-ary tree, you do all nodes as single type - array of 16 words, getting always 8 levels to reach all values. Very memory consuming, of course (so digital trees are rarely used in practice). But observe that array is sparse, and there will be typical situation when some intermediae node holds just one pointer, other are zero - the whole purpose of it is just to provide you next hex-digit of path, and then you go to next level. Such nodes can be eliminated, decreasing tree depth and cache-line fills, if you store this digit directly in parent node's pointer.

What other compressing technique can be used, while staying simple in implementation? If, on some level, you have very small number of children, it doesn't matter, one or more of next levels - you could store them directly in this node, which will be able to store up to 8 key-value pairs (remember, our leaves are just machine words, they're small, we could store them right here). This is not limited to final leaf values, but could also be used for intermediate nodes, provided we store number of significant digits in key (this limits us to max 16 levels, though - but hope at the times of 128-bit machines better structures will be invented already). Time to linear scan 8 pairs is still much faster than dereferencing pointer with possible cache miss.

So we end up with following structure types:

HEXTREE_TYPE_LEAF_16 - plain array of 16 final stored user values, always on last level
HEXTREE_TYPE_LEAF_8 - list of up to 8 key-value pairs, values are final stored user values
HEXTREE_TYPE_NODE_16 - plain array of 16 pointers to nodes on next level, index in it gives next hex-digit of path
HEXTREE_TYPE_NODE_8 - list of up to 8 key-value pairs, values are pointers to next intermediate or leaf node

We know type of each node via least-significant bits of pointer to it, because all space in nodes is used for data, type itself is not stored there. This leads us to requirement for user code to pass pointer to pointer to root node (user stores just one pointer to Hextree), and roughly the following algorithm of work:

First, user code initializes it's root pointer to NULL - this means empty hextree.
Then, user sets one values - hextree allocates HEXTREE_TYPE_LEAF_8 structure and modifies root pointer in user's data to reflect place and type of root node.
Up to 8 key-value pairs from user could be stored without further allocations.
When user adds 9-th value, new nodes are allocated and root node is converted in-place to HEXTREE_TYPE_NODE_16 or HEXTREE_TYPE_NODE_8, depending of how many differing digits are on first level. All leaf values are moved to newly allocated nodes, root pointer in user's data is modified to reflect new root node's type.
On addition of new values, process is repeated recursively: when new nodes are allocated, values are moved to them, and pointers in parent nodes are updated to reflect new type.
When user deletes value - that is, sets it to NULL - values are zeroed, if they are in HEXTREE_TYPE_LEAF_16, or both key and value are zeroed, if this is HEXTREE_TYPE_LEAF_8; in latter case, key-value pairs are stored in sorted order (zero means end-of-list), so trailing elements are shifted left to fill the gap.
Hextree always tries to allocate minimum number of new nodes, using possible compressing techniques for this goal. When number of elements on particular level and may be it's direct descendants, falls below some threshold, the tree begins to shrink, converting plain 16-element arrays to up-to-8-key-value lists and freeing memory. To avoid too often rebalances, current threshold is 4, but may be changed upon further investigation.

The HEXTREE_TYPE_NODE_8 stores the keys of key-value pairs with least-significant digit of key indicating how many of other key hex-digits make a path, or, in other words, how deep (which level) is node pointed to. Because HEXTREE_TYPE_NODE_8 already specifies full path to next node, both pointers to it or from it don't need to use compressing technique of placing one path's hex-digit to low bits of pointer. The same applies to pointers to HEXTREE_TYPE_LEAF_8. This leads us to the following possible values of least-significant bits of pointer in Hextree:

#define HEXTREE_TYPE_LEAF_8         0
#define HEXTREE_TYPE_LEAF_16        1
#define HEXTREE_TYPE_NODE_16        2
#define HEXTREE_TYPE_NODE_8         3
#define HEXTREE_TYPE_LEAF_16_DIGIT (HEXTREE_TYPE_LEAF_16 << 4)  // with digit: 16 .. 31 = 01xxxx
#define HEXTREE_TYPE_NODE_16_DIGIT (HEXTREE_TYPE_NODE_16 << 4)  // with digit: 32 .. 47 = 10xxxx

TO BE CONTINUED