gvirstor implementation details

{i} gvirstor is (mostly) done and working, but real-world testing is required. See gvirstor page for details.

Firstly: the name. Any ideas for a sexier name than "gvirstor"? :)

The idea behind gvirstor is simple:

The goal: allow creation of huge storage devices with limited available physical storage space, and allow adding physical space as needed instead of at creation time.

Details

Storage of allocation table

The allocation table is stored at the start of the first component, as an array of the following struct:

    uint16_t  flags;          /* including: GVIRSTOR_FLAG_ALLOCATED */
    uint16_t  provider_no;    /* sequential number of the geom which stores this chunk */
    uint32_t  provider_chunk; /* where on the provider this chunk is stored (physical chunk index) */

A single allocation map entry is 2+2+4 = 8 bytes in size. Since the providers are divided into chunks, only the chunk index needs to be kept; this allows using a uint32 for the provider_chunk member instead of off_t (which is 64-bit), avoiding both large tables and alignment issues with a 64-bit struct member, and keeping the struct size a power of 2. Virtual offsets will still use off_t for calculations, but with 4 MB chunks and a 32-bit chunk index, physical providers will be limited to 16777216 GB (more if larger chunks are used).
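
As an illustration, here is a minimal user-space sketch of the lookup, assuming a hypothetical struct name (the text above does not name the struct) and 4 MB chunks; it translates a virtual byte offset into a map index and then into a physical offset on the selected provider:

    #include <stdint.h>

    #define CHUNK_SIZE      (4ULL * 1024 * 1024)    /* 4 MB chunks, as above */
    #define FLAG_ALLOCATED  0x0001                  /* GVIRSTOR_FLAG_ALLOCATED */

    /* Hypothetical name; layout is the 2+2+4 = 8 byte entry described above. */
    struct virstor_map_entry {
        uint16_t flags;
        uint16_t provider_no;
        uint32_t provider_chunk;
    };

    /*
     * Translate a virtual byte offset into (provider, physical offset) by
     * looking it up in the allocation map.  Returns 0 on success, -1 if the
     * virtual chunk has not been allocated yet.
     */
    static int
    virstor_translate(const struct virstor_map_entry *map, uint64_t map_entries,
        uint64_t virtual_offset, uint16_t *provider_no, uint64_t *physical_offset)
    {
        uint64_t map_index = virtual_offset / CHUNK_SIZE;  /* which virtual chunk */
        uint64_t in_chunk  = virtual_offset % CHUNK_SIZE;  /* offset inside it */

        if (map_index >= map_entries ||
            (map[map_index].flags & FLAG_ALLOCATED) == 0)
            return (-1);

        *provider_no = map[map_index].provider_no;
        *physical_offset = (uint64_t)map[map_index].provider_chunk * CHUNK_SIZE +
            in_chunk;
        return (0);
    }

With this layout the largest addressable physical offset is 2^32 chunks * 4 MB = 2^54 bytes = 16777216 GB, which matches the limit quoted above.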

If any of the allocated components of gvirstor fails, the whole virstor becomes unusable. Thus, it's recommended to create virstor components out of mirrored (RAID1) providers.

Allocation

In theory, if there's only one writer (application), the allocation would proceed smoothly, first gobbling up the first geom, then the next, etc., with discontinuities due to cylinder groups on UFS. Since the GEOM system doesn't know what file is being written and by whom, it can't perform optimisation on this level.

Since allocation from physical providers proceeds in a linear, monotonic fashion, we don't need to keep a bitmap of free chunks for each provider, only a last-allocated-index field.
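
A rough user-space sketch of this (hypothetical names, not the actual kernel code): each provider only remembers how many of its chunks have already been handed out, and allocation moves on to the next provider once the current one is exhausted.

    #include <stdint.h>

    /* Hypothetical per-provider bookkeeping; chunk_count is fixed at creation. */
    struct virstor_component {
        uint32_t chunk_count;   /* total physical chunks on this provider */
        uint32_t chunk_next;    /* last-allocated-index: next free chunk */
    };

    /*
     * Allocate one physical chunk.  Chunks are handed out in strictly
     * increasing order per provider, so no free-chunk bitmap is needed.
     * Returns 0 and fills *prov and *chunk, or -1 when all providers are full.
     */
    static int
    virstor_alloc_chunk(struct virstor_component *comp, int ncomp,
        int *current, uint16_t *prov, uint32_t *chunk)
    {
        while (*current < ncomp) {
            struct virstor_component *c = &comp[*current];

            if (c->chunk_next < c->chunk_count) {
                *prov = (uint16_t)*current;
                *chunk = c->chunk_next++;
                return (0);
            }
            (*current)++;   /* this provider is exhausted, move to the next */
        }
        return (-1);        /* out of physical space */
    }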

Usage of BIO_DELETE has been considered, but it would require keeping track of (de)allocation within each chunk, thus complicating things immensely. So, BIO_DELETE requests will be ignored.

Allocation table synchronisation

The obvious way to do this is synchronously with the allocation event (i.e. don't allow the BIO_WRITE request to return until the table is updated). Since current drives can do about 60 MB/s, this results in ~15 additional seeks/s when the chunk size is 4 MB (one table update per newly allocated 4 MB chunk).
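
For illustration only (the numbers are the assumptions from the paragraph above, not measurements), the estimate can be reproduced as:

    #include <stdio.h>

    /*
     * Extra table-update seeks per second = write throughput / chunk size,
     * because one new chunk is allocated per chunk_size bytes written.
     */
    static double
    extra_seeks_per_sec(double throughput_mb_s, double chunk_mb)
    {
        return (throughput_mb_s / chunk_mb);
    }

    int
    main(void)
    {
        /* 60 MB/s sequential writes, 4 MB chunks -> ~15 extra seeks/s */
        printf("%.1f seeks/s\n", extra_seeks_per_sec(60.0, 4.0));
        return (0);
    }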

Performance

There should be a utility to measure performance for different chunk sizes, to enable us to select an optimal chunk size. My guess is that chunk size won't matter much except under heavily concurrent I/O access.
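
As a starting point, a trivial sequential-write timer could form the core of such a utility (a sketch with an assumed command-line interface, not a finished tool; concurrent and random access patterns would have to be added):

    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    /*
     * Write `count` blocks of `block_size` bytes sequentially to the given
     * device (or file) and print the throughput.  block_size should be a
     * multiple of the device's sector size.
     */
    int
    main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s device block_size count\n", argv[0]);
            return (1);
        }
        size_t bsize = strtoul(argv[2], NULL, 10);
        long count = strtol(argv[3], NULL, 10);
        void *buf;

        if (posix_memalign(&buf, 4096, bsize) != 0)
            err(1, "posix_memalign");
        memset(buf, 0xa5, bsize);

        int fd = open(argv[1], O_WRONLY);
        if (fd < 0)
            err(1, "open %s", argv[1]);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < count; i++) {
            if (write(fd, buf, bsize) != (ssize_t)bsize)
                err(1, "write");
        }
        fsync(fd);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f MB/s\n", (double)bsize * count / secs / 1048576.0);
        close(fd);
        free(buf);
        return (0);
    }

Such a tool could then be run against virstor devices created with different chunk sizes and against the raw underlying providers, to see how much the chunk size actually matters.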
