XFS FUSE Implementation (Week 10)
Metadata Integrity Checks
The largest scaling problem facing XFS is verification. Most metadata in XFS is allocated dynamically, so its structures have to be discovered by walking the filesystem in different ways. Version 5 (V5) of the filesystem adds several fields for checking the integrity of that metadata, including:
- Magic numbers, to classify all types of metadata.
- A copy of the filesystem UUID, to confirm that a given disk block is connected to the superblock.
- The owner, to avoid accessing a piece of metadata which belongs to some other part of the filesystem.
- The filesystem block number, to detect misplaced writes.
- A CRC32c checksum of the entire block, to detect minor corruption.
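Note that the CRC algorithm used here is CRC_32_ISCSI, the parameter set commonly called CRC32c. Just as important, the stored value of the CRC field must not feed into its own checksum when verifying a block: the Linux kernel computes the checksum over the whole block with the four bytes of the CRC field treated as zero.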
Magic numbers are the most basic integrity check: they make sure that we are reading the kind of structure we expect. Each type of structure has its own magic number, which should be checked whenever a block is read. If the type of the block that was read doesn't match the expected type, we should stop walking and report the error.
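The magic number check described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `MAGIC` table lists only a few V5 structures whose four-byte magic sits at offset 0, and the names `check_magic` and `BadMagic` are hypothetical.

```python
# Four-byte magic numbers found at offset 0 of the block.
# Illustrative subset only; many structure types are not listed here.
MAGIC = {
    "superblock": b"XFSB",
    "agf": b"XAGF",
    "agi": b"XAGI",
    "bnobt": b"AB3B",  # V5 free space B+tree indexed by block number
}


class BadMagic(Exception):
    """Raised when a block does not carry the magic number we expected."""


def check_magic(block: bytes, expected_kind: str) -> None:
    """Stop the walk (by raising) if the block is not of the expected kind."""
    expected = MAGIC[expected_kind]
    found = bytes(block[:4])
    if found != expected:
        raise BadMagic(
            f"expected {expected!r} for {expected_kind}, found {found!r}"
        )
```

Raising an exception rather than returning a flag mirrors the "stop walking and report" rule: the caller cannot accidentally keep parsing a block of the wrong type.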
As a primary concern, self describing metadata needs some form of overall integrity checking. We cannot trust the metadata if we cannot verify that it has not been changed as a result of external influences. Hence we need some form of integrity check, and this is done by adding CRC32c validation to the metadata block. If we can verify the block contains the metadata it was intended to contain, a large amount of the manual verification work can be skipped.
CRC32c was selected because metadata cannot be more than 64k in length in XFS, and hence a 32 bit CRC is more than sufficient to detect multi-bit errors in metadata blocks. CRC32c is also now hardware accelerated on common CPUs, so it is fast. So while CRC32c is not the strongest of the integrity checks that could be used, it is more than sufficient for our needs and has relatively little overhead. Adding support for larger integrity fields and/or algorithms does not really provide any extra value over CRC32c, but it does add a lot of complexity, and so there is no provision for changing the integrity checking mechanism.
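A block checksum along these lines could be verified as in the sketch below. It uses a plain bitwise CRC-32C so that it has no dependencies; a real implementation would use a table-driven or hardware-accelerated routine. Two things are assumptions here: `crc_offset` is a hypothetical parameter (the position of the CRC field differs per structure type), and the stored checksum is read as little-endian with the field zeroed during computation, which is how I understand the Linux kernel to behave.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), also catalogued as CRC-32/ISCSI."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Reflected polynomial 0x82F63B78, one bit at a time.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def block_crc(block: bytes, crc_offset: int) -> int:
    """Checksum of the whole block with the 4-byte CRC field taken as zero."""
    zeroed = bytes(block[:crc_offset]) + bytes(4) + bytes(block[crc_offset + 4:])
    return crc32c(zeroed)


def verify_crc(block: bytes, crc_offset: int) -> bool:
    """Compare the stored checksum (assumed little-endian) with a fresh one."""
    stored = int.from_bytes(block[crc_offset:crc_offset + 4], "little")
    return stored == block_crc(block, crc_offset)
```

The standard check string is a quick way to confirm the parameters: `crc32c(b"123456789")` should give `0xE3069283`.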
BLKNO and UUID
Self describing metadata needs to contain enough information so that the metadata block can be verified as being in the correct place without needing to look at any other metadata. This means it needs to contain location information. Just adding a block number to the metadata is not sufficient to protect against misdirected writes - a write might be misdirected to the wrong LUN and so be written to the “correct block” of the wrong filesystem. Hence location information must contain a filesystem identifier as well as a block number.
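A location check of this kind needs both comparisons, as the sketch below shows. The `Location` record and `check_location` function are hypothetical names for illustration; where the block number and UUID live inside a given header depends on the structure type and is not modelled here.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Location:
    """Location fields as decoded from a metadata block header."""
    blkno: int
    fs_uuid: uuid.UUID


def check_location(found: Location, expected_blkno: int, sb_uuid: uuid.UUID) -> list:
    """Return a list of problems; an empty list means the block is in place."""
    problems = []
    if found.fs_uuid != sb_uuid:
        # Right block number on the wrong filesystem is still a misdirected write.
        problems.append("uuid mismatch: block belongs to another filesystem")
    if found.blkno != expected_blkno:
        problems.append(
            f"blkno mismatch: header says {found.blkno}, read from {expected_blkno}"
        )
    return problems
```

Returning every mismatch instead of stopping at the first one keeps the report useful for forensic work, since a UUID mismatch and a block number mismatch point to different failure modes.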
Another key information point in forensic analysis is knowing who the metadata block belongs to. We already know the type, the location, whether it is valid or corrupted, and how long ago it was last modified. Knowing the owner of the block is important as it allows us to find other related metadata to determine the scope of the corruption. For example, given an extent btree block with no owner field, we don’t know which inode it belongs to and hence have to walk the entire filesystem to find the owner of the block. Worse, the corruption could mean that no owner can be found (i.e. it’s an orphan block), and so without an owner field in the metadata we have no idea of the scope of the corruption. If we have an owner field in the metadata object, we can immediately do top down validation to determine the scope of the problem.
Different types of metadata have different owner identifiers. For example, directory, attribute and extent tree blocks are all owned by an inode, whilst freespace btree blocks are owned by an allocation group. Hence the size and contents of the owner field are determined by the type of metadata object we are looking at. The owner information can also identify misplaced writes (e.g. freespace btree block written to the wrong AG).
The integrity fields are not applied uniformly to every type of metadata:
- Inodes can have multiple “owners” in the directory tree; therefore the record contains the inode number instead of an owner or a block number.
- Superblocks have no owners.
- The disk quota file has no owner or block numbers.
- Metadata owned by files list the inode number as the owner.
- Per-AG data and B+tree blocks list the AG number as the owner.
- Per-AG header sectors don’t list owners or block numbers, since they have fixed locations.
- Remote attribute blocks are not logged and therefore the LSN must be -1.
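The ownership rules in the list above can be captured in a small lookup table, sketched below. The kind labels (`file_metadata`, `per_ag_btree`, and so on) and the `check_owner` function are hypothetical names chosen for this example; inode records and the LSN rule for remote attribute blocks are left out because they are not owner comparisons.

```python
# What the owner field should hold for each kind of metadata.
# None means the structure carries no owner to compare against.
OWNER_RULE = {
    "file_metadata": "inode",  # directory, attribute and extent tree blocks
    "per_ag_btree": "ag",      # per-AG data and B+tree blocks
    "superblock": None,
    "quota": None,
    "ag_header": None,         # fixed location, nothing to check
}


def check_owner(kind, found_owner, inode=None, agno=None) -> bool:
    """Return True if the owner field is consistent with the walk so far."""
    rule = OWNER_RULE[kind]
    if rule is None:
        return True  # no owner field for this kind
    expected = inode if rule == "inode" else agno
    return expected is not None and found_owner == expected
```

For example, `check_owner("per_ag_btree", 2, agno=3)` is false, which is the "freespace btree block written to the wrong AG" case.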