Project Description
(from Google Summer of Code 2008 project listing)
Prior to UFS2, the performance of name lookups in directories with many entries on FreeBSD could be quite slow. FFS does not maintain an on-disk index of directory entries, so each name lookup required a linear search of the directory entries. With UFS2 a new technique called dirhash was developed to generate a hash map in memory of directory entries the first time a name lookup is run on a large directory, so that subsequent name lookups will be much faster. Dirhash has had a substantial positive impact on filesystem performance.
However, the amount of memory which can be used by dirhashes is fixed by the kernel (although it is tunable at runtime via a sysctl). I propose to implement dynamic memory allocation for dirhashes, allowing more memory to be allocated for dirhash when needed if it is available, and reclaimed if memory becomes scarce. The goal of this is to improve performance for filesystems with very large directories on systems with memory to spare, without the user having to manually tune the amount of memory available for dirhash.
Furthermore, if time permits I would like to investigate implementing on-disk indexing for UFS2. If directory indices are generated and stored on disk, there would no longer be a need to generate a completely new dirhash each time a directory is first accessed after system startup, or after a dirhash for a directory not recently accessed has been disposed of to conserve memory.
Plan
Either change dirhash memory allocation to use buffer cache, or keep existing memory allocation model but register new callback function to free older dirhashes when vm_lowmem events occur. Done
Once code works reasonably well in my own tests, call for testing from the community to shake out more bugs if possible. Done
Test for performance differences with dynamic dirhash vs. static under different work loads (e.g. huge maildirs, subversion fsfs repositories with a great many revisions, different amounts of memory in machine, etc.). Done
Get code in a state suitable for being committed to -CURRENT. Done
(Time permitting) Begin work on on-disk directory indexing in UFS2. Not done
Status
I blogged about this project here. The code I am working on can be found in the FreeBSD Perforce depot here.
(2008-6-19) I am working on implementing a function to be registered with EVENTHANDLER_REGISTER() for vm_lowmem events that will free dirhashes. I am trying to determine how to decide which hashes to free, and will need to find how many should be freed each time this event occurs. Determining the latter might be easiest through trial and error, watching how often vm_lowmem events pop up when the system is low on memory.
(2008-6-27) Here is a patch with a first attempt at freeing dirhashes when a low memory signal is sent: dirhash_lowmem_event_2008-6-28.patch. In this patch just the one oldest dirhash that can be freed will be.
(2008-7-12) I now have an updated patch: dirhash_lowmem_event_2008-7-12.patch. Now the dirhash data structure has a new field for the last used time stamp. When a vm_lowmem event occurs, every dirhash that has not been used for five seconds (this will probably need to be tuned) is discarded. If all dirhashes have been used more recently than this, then the first one on the dirhash list TAILQ will be deleted.
(2008-8-14) Updated patches again. Now there are two: one against HEAD dirhash_lowmem_head_2008-8-14.patch and one that should work both on 7.0 and 7-STABLE dirhash_lowmem_7-stable_2008-8-14.patch. These fix a bug in the previous code where hashes older than 5 seconds would never be deleted and instead only one hash at the head of the TAILQ would be removed with each vm_lowmem event. Also, now the deletion code is more aggressive: all hashes older than 5 seconds (tunable via a sysctl, vfs.ufs.dirhash_reclaimage) will always be deleted when a vm_lowmem event occurs, and if that doesn't free up at least 10% of the memory currently being used by dirhash, more hashes will be destroyed from the head of the TAILQ until we do free up at least 10% of the memory initially used. Also, below I have posted some results from some benchmarking.
(2008-10-8) Results about a whole bunch of new benchmarks are added below. Also, the dirhash code for 8-CURRENT has been updated in perforce to work with some new changes made to the dirhash code in SVN. New patch: dirhash_lowmem_head_2008-10-12.patch
(2008-11-19) My slides and audio from my talk at EuroBSDCon 2008 are available online.
(2009-7-7) I committed the dirhash vm_lowmem handler to -CURRENT about a month ago, and it will be included in 8.0-RELEASE. Also I plan to commit a backport of these changes to 7-STABLE, probably around September.
Benchmarks
I measured the performance of the changes I made to the dirhash code using a Python script that does a number of different operations on large directories, hopefully in a fairly realistic way. The script can be found here in perforce. It first untars 1000 randomly generated maildirs containing a total of one million tiny email messages. Then it begins ten threads which add new messages to these maildirs, while simultaneously kicking off a build of a FreeBSD kernel. The mail delivery threads continue running as the script then extracts a very large tarball containing the entire FreeBSD src subversion repository from a revision a few months old, checks out from the extracted repository and does a few other subversion operations on it, then removes the extracted repository. Next the script blocks until the mail delivery threads finish delivering all the messages specified in the script, and next kicks off ten threads that perform a number of IMAP operations on all of the maildirs. Once those are all finished, all the generated directories are removed. Times are measured for the initial maildir extraction, the kernel build, time to untar the svnroot, all svn operations, removing the svnroot while mail continue to be delivered, the mail creation threads, the imap threads, and removing all the remaining directories at the end.
I ran the tests on both FreeBSD 7.0 and 8-CURRENT for the AMD64 architecture. A GENERIC kernel is used on 7.0, and on -CURRENT the only change to the GENERIC config is that INVARIANTS and WITNESS related options are disabled. Furthermore malloc debugging is disabled on the -CURRENT system in userland. On the 7.0 system the creation threads deliver a total of 25 000 new messages to maildirs, while on 8-CURRENT I increased this to 50 000 because delivering 25 000 messages would take less time than all the other tasks which run alongside the mail creation threads.
The hardware used for all tests has a 1.8GHz Intel Core 2 Duo, 1GB memory, and a 3.5" 80GB 7200 RPM IDE hard drive. The maildirs are extracted, have messages added, and IMAP operations are done on a secondary hard drive, which is a 2.5" 40GB 5400RPM SATA drive. All tests are run with the system running multiuser, but not much besides sshd, openntpd, and postfix along with normal default FreeBSD daemons are running in the background.
Each battery of tests is run 5 times for a number of different dirhash settings. Using a kernel without my patch, tests are run with the maximum dirhash memory usage at the default 2MB and 64MB. With the kernel containing the vm_lowmem handler patch, 64MB is used across the board but the dirhash_reclaimage parameter is tweaked. On 7.0, 1s, 3s, 5s, and 10s are all tried for reclaim age, and on 8-CURRENT all these values are tried in addition to 4s and 8s. I put the results from all the test runs into a Gnumeric spreadsheet here: dh_benchmark_results.gnumeric. The graphs from this spreadsheet were exported as pngs and included below.
FreeBSD 7.0 Results
FreeBSD 8-CURRENT Results