It is critical that data for the VCS is maintained in a state where any corruption is noticed almost immediately. A good read on the shortcomings of the various VCSs can be found at:

Current Implementation

SVN Implementation

SVN has two current implementations of the repository backend.

The original one used Berkeley DB ("bdb"). It was notoriously fragile and troublesome: it stressed the db code in ways it wasn't used to and exposed bugs regularly. This gave early versions of Subversion a bad reputation that it struggled for a long time to overcome.

The current "main" one is called "fsfs". Each commit is stored in a single file in a mostly-ASCII format. Once the transaction is final, the commit files are never modified, ever. You can trivially checksum these files (e.g. with MD5/SHA), sign them, or whatever suits your fancy. The revision-to-revision format is a forward-delta format with periodic bundling of deltas; the Subversion folks call this "skip-deltas", and it is documented on their site. Changes are transaction-based. The on-disk layout of files originally stressed the file system with very large numbers of commit files in one place; however, it has since been changed to a friendlier, sharded layout that avoids directories with huge numbers of files in them.
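Because finalized fsfs rev files are immutable, integrity checking can be as simple as recording a digest per file and re-checking later. The sketch below is a minimal illustration of that idea, assuming the sharded `db/revs/<shard>/<rev>` layout; the path names are only an assumption about a typical fsfs repository, not an official API.

```python
import hashlib
import os

def file_digest(path, algo="sha256"):
    """Digest one immutable rev file in streaming chunks."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot_revs(repo_root):
    """Walk db/revs (assumed fsfs layout) and map each rev file,
    relative to db/revs, to its digest; re-running and comparing the
    two maps detects any later corruption of the immutable files."""
    digests = {}
    revs_dir = os.path.join(repo_root, "db", "revs")
    for dirpath, _dirnames, filenames in os.walk(revs_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digests[os.path.relpath(path, revs_dir)] = file_digest(path)
    return digests
```

Comparing a fresh `snapshot_revs()` result against a stored one flags exactly which rev file changed, which is safe precisely because fsfs never rewrites finalized commit files.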

There has been talk of a future third backend based on SQLite. svn-1.5 may be using SQLite for multi-path merge tracking.

Hg Implementation

Hg uses a well-thought-out repository format, aiming at both efficiency and robustness. It is described in more detail elsewhere, but to simplify:

  1. every checked-in file gets two files inside .hg/data: <foo>.i, an index of the revisions within <foo>.d, which contains the actual file data. The file data is an aggregate of revisions, each stored either as a compressed full copy or as a compressed delta; a full copy is generated whenever the deltas would be bigger than the file itself. This leads to small disk usage and fast access times. Starting with 0.9, as an optimisation, when the combined size of <foo>.d and <foo>.i is under a given threshold (currently 128 KB), the two files are merged into one to keep the file count down

  2. the files within .hg are append-only as the history grows. This makes the on-disk data more robust and some operations very fast

  3. every changeset is represented by a manifest and a changelog entry (the files are 00manifest.{d,i} and 00changelog.{d,i} in .hg), again stored the same way as file data. Revlogs offer constant-time access to any revision, which is good for performance.

  4. every repository operation is protected by a transaction-based system and in case of a problem, rollback is automatic.
  5. the repository does not need to be packed regularly
  6. the repository is lexically ordered, whereas git's is modification-ordered, meaning that the latter degrades towards random access over time as the history grows.
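The snapshot-or-delta rule from point 1 can be sketched in a few lines. This is a toy, not Mercurial's implementation: it uses `difflib` in place of Mercurial's bdiff, zlib for compression, and plain Python lists to play the roles of the append-only `.d` and `.i` files.

```python
import difflib
import zlib

class MiniRevlog:
    """Toy revlog: each revision is stored either as a compressed full
    snapshot or as a compressed line delta against the previous
    revision, whichever is smaller -- mirroring the rule above."""

    def __init__(self):
        self.data = []   # plays the ".d" role: compressed blobs, append-only
        self.index = []  # plays the ".i" role: (is_snapshot, base_rev)

    def add(self, text):
        full = zlib.compress(text.encode())
        if not self.index:
            self.data.append(full)
            self.index.append((True, -1))
            return 0
        prev = self.revision(len(self.index) - 1)
        delta = "".join(difflib.ndiff(prev.splitlines(True),
                                      text.splitlines(True)))
        packed = zlib.compress(delta.encode())
        if len(packed) < len(full):   # delta is smaller: store it
            self.data.append(packed)
            self.index.append((False, len(self.index) - 1))
        else:                         # diff bigger than the file: full copy
            self.data.append(full)
            self.index.append((True, -1))
        return len(self.index) - 1

    def revision(self, rev):
        is_snap, base = self.index[rev]
        text = zlib.decompress(self.data[rev]).decode()
        if is_snap:
            return text
        self.revision(base)  # ensure the base chain is valid
        return "".join(difflib.restore(text.splitlines(True), 2))
```

Note that nothing already in `data` or `index` is ever rewritten; new revisions only append, which is exactly the property point 2 relies on for robustness.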

Tags are represented differently: they are stored in a version-controlled file called .hgtags in the working directory, not in the repository itself. This allows for both local and global tags, the latter being distributed when the repository is cloned.
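Since .hgtags is an ordinary versioned text file, its format is easy to work with: one `<node-hash> <tagname>` pair per line. A minimal parser, assuming the documented conventions that a later entry for the same tag overrides earlier ones and that mapping a tag to the all-zero null hash removes it:

```python
def parse_hgtags(text):
    """Parse .hgtags content into a {tag: node-hash} mapping.
    Later lines win; the null hash (40 zeros) deletes a tag."""
    tags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        node, tag = line.split(" ", 1)
        if node == "0" * 40:
            tags.pop(tag, None)   # tag was removed
        else:
            tags[tag] = node      # add or override
    return tags
```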

As the .hg directory holds the entire history, there is no need to scatter the working directory with things like CVS/* directories, and a repository is self-contained.

Git Implementation

Git's repository format is very simple. There are various kinds of files.

A branch is just a pointer to a commit, and since each commit points at its parents, you get the entire history from a single SHA-1 name. Each object file is named by the SHA-1 hash of its contents, so it's easy to verify the object store for corruption using git-fsck-objects.
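The content-addressed naming is simple enough to reproduce by hand: a blob's object name is the SHA-1 of a small header ("blob <size>" plus a NUL byte) followed by the raw file contents. A minimal sketch:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object name = SHA-1("blob <size>\\0" + content), which is how
    git derives the filename under which a blob is stored."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo hello | git hash-object --stdin`
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

This is also why corruption is easy to detect: re-hashing an object's contents and comparing against its filename either matches or proves the object is damaged.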

git-fsck-objects has other uses, too. If you were doing some complicated operation and realized you had reverted an important change, or a crash left one of your branch files empty, you can run git-fsck-objects to find which objects in your repository are unreferenced, use git-show on the commits to see which one is right, and then create the branch again. I've used this when I made a mistake in a rebase operation and realized it several commits down the road: I recovered my old branch and just replayed my new commits on top of it.

As is, this file store would be a very inefficient mechanism: there is no compression across files that are very similar but have different hashes. This is where packs come in. Generally, when a repository accumulates many megabytes of unpacked files, a git-repack is done, which packs a bunch of objects together into one larger compressed file. The packing achieves good compression: Mozilla conversions have apparently gone from a 3GB CVS repository to a 300MB git repository, compared to a 12GB fsfs Subversion repository for the same conversion. Even with a packed repository, git operations remain fast, and the user doesn't need any knowledge of packs.
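The gain from packing comes from letting the compressor see similar objects together instead of one at a time. The toy below only illustrates that principle: real packs use binary deltas against chosen base objects, not plain concatenation, and the file contents here are made up for the demonstration.

```python
import zlib

# Two near-identical revisions of a source file (hypothetical content).
rev1 = b"int main(void) {\n    return 0;\n}\n" * 40
rev2 = rev1.replace(b"return 0", b"return 1")

# Loose objects: each revision compressed on its own, redundancy
# between them is invisible to the compressor.
loose = len(zlib.compress(rev1)) + len(zlib.compress(rev2))

# "Packed": one stream, so the second revision compresses against
# the first and costs almost nothing extra.
packed = len(zlib.compress(rev1 + rev2))

assert packed < loose
```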

Monotone Implementation

Monotone's repository is stored in an ACID-compliant SQLite database file. All monotone operations that write to the database occur inside transactions, and so either happen completely or not at all. This includes not only normal VCS operations, but also administrative actions such as database schema upgrades that may occur from time to time as monotone develops.
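The all-or-nothing behavior described above is a standard property of SQLite transactions, and a small sketch shows the shape of it. The schema and function here are hypothetical, chosen only to demonstrate that a failure partway through a multi-row write leaves none of its rows behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, data BLOB)")

def commit_files(entries):
    """Write a whole changeset atomically: all rows land, or none do."""
    try:
        with conn:  # opens a transaction; rolls back on any exception
            for path, data in entries:
                conn.execute("INSERT INTO files VALUES (?, ?)", (path, data))
    except sqlite3.IntegrityError:
        pass  # failed midway: the whole transaction was rolled back

commit_files([("a.c", b"A"), ("b.c", b"B")])     # succeeds as a unit
commit_files([("c.c", b"C"), ("a.c", b"dup!")])  # second row conflicts
rows = sorted(r[0] for r in conn.execute("SELECT path FROM files"))
print(rows)  # ['a.c', 'b.c'] -- 'c.c' was rolled back with the failure
```

The second call inserts "c.c" successfully before hitting the duplicate-key error, yet "c.c" is absent afterwards, which is exactly the guarantee monotone leans on for both normal operations and schema upgrades.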

This feature has at least two facets:

Robust Repository

Supported (also see monotone:FeatureRobustRepository)

Monotone is almost fanatically pedantic and paranoid about internal consistency checking, and has a strict policy of failing early. If any problem is detected during an operation, monotone aborts rather than risk committing bad information to the database. This is a very particular kind of robustness, one that favours historical integrity over the user experience when abrupt errors are triggered. Such events are generally rare and always of interest to the developers, even when that means a more user-friendly error message needs to be emitted.

Compact Repository

Supported (also see monotone:FeatureCompactRepository)

Monotone compresses file and delta storage with gzip before writing to the database. For typical source-file content, and for large projects with long histories, the resulting database file is usually similar in size to, or smaller than, the corresponding CVS ,v files. Storage size has not been a particularly strong development focus, and some information is stored uncompressed for fast indexing and searching. Further efforts at more compact storage could be made, but given the already good results they are not considered particularly important for size alone at this stage; such changes may instead happen in the course of performance or other improvements.

VCSFeatureGoodRepositoryFormat (last edited 2008-06-17 21:37:43 by localhost)