gvirstor implementation details
gvirstor is (mostly) done and working, but real-world testing is required. See gvirstor page for details.
Firstly: the name. Any ideas for a sexier name than "gvirstor"?
The idea behind gvirstor is simple:
- Create a virtual storage device of some predefined user-specified size (e.g. 2TB)
- Take a bunch of drives/partitions/etc (we'll call them "components")
- Split components into two groups: allocated and unallocated. A component is in the allocated group if any part of its contents is allocated to data storage in the virtual device.
- Divide the space on allocated providers into chunks/extents of a user-specified size (all chunks are the same size)
- Build and maintain an allocation table which records which chunk in the virtual device maps to which chunk on an allocated provider
- Allocate space from providers chunk by chunk on an as-needed basis
The goal: allow creation of huge storage devices with limited available physical storage space, and allow adding physical space as needed instead of at creation time.
Storage of allocation table
The allocation table is stored at the start of the first component, as an array of the following struct:

    struct {
            uint16_t flags;          /* including: GVIRSTOR_FLAG_ALLOCATED */
            uint16_t provider_no;    /* sequential number of the geom which stores this chunk */
            uint32_t provider_chunk; /* where on the provider this chunk is stored (physical chunk index) */
    };
A single allocation map entry is 2+2+4 = 8 bytes in size. Since the providers are divided into chunks, only a chunk index needs to be kept. This lets us use a uint32 for the provider_chunk member instead of off_t (which is 64-bit), avoiding both large tables and alignment issues with a 64-bit struct member, and keeping the struct size a power of 2. Virtual offsets will still use off_t for calculations, but with 4MB chunks and a 32-bit chunk index, physical providers will be limited to 16777216GB (more if larger chunks are used).
If any of the allocated components of gvirstor fails, the whole virstor device is unusable. Thus, it's recommended to build virstor components out of mirrored (RAID1) providers.
In theory, if there's only one writer (application), allocation would proceed smoothly, first gobbling up the first geom, then the next, etc., with discontinuities due to cylinder groups on UFS. Since the GEOM system doesn't know what file is being written and by whom, it can't perform optimisation on this level.
Since allocation from physical providers proceeds in a linear, monotonic fashion, we don't need to keep a bitmap of free chunks for each provider, but rather a last-allocated-index field.
Usage of BIO_DELETE has been considered, but it would require keeping track of (de)allocation within each chunk, thus complicating things immensely. So, BIO_DELETE requests will be ignored.
Allocation table synchronisation
The obvious way to do this is synchronously with an allocation event (i.e. don't allow the BIO_WRITE request to return until the table is updated). As current drives can do 60 MB/s, this results in ~15 additional seeks/s when the chunk size is 4MB.
There should be a utility to measure performance for different chunk sizes, to enable us to select an optimal chunk size. My guess is that chunk size won't matter much except under heavily concurrent I/O.