Project Description
(from Google Summer of Code 2008 project listing)
Prior to UFS2, the performance of name lookups in directories with many entries on FreeBSD could be quite slow. FFS does not maintain an on-disk index of directory entries, so each name lookup required a linear search of the directory entries. With UFS2 a new technique called dirhash was developed to generate a hash map in memory of directory entries the first time a name lookup is run on a large directory, so that subsequent name lookups will be much faster. Dirhash has had a substantial positive impact on filesystem performance.
However, the amount of memory which can be used by dirhashes is fixed by the kernel (although it is tunable at runtime via a sysctl). I propose to implement dynamic memory allocation for dirhashes, allowing more memory to be allocated for dirhash when needed if it is available, and reclaimed if memory becomes scarce. The goal of this is to improve performance for filesystems with very large directories on systems with memory to spare, without the user having to manually tune the amount of memory available for dirhash.
Furthermore, if time permits I would like to investigate implementing on-disk indexing for UFS2. If directory indices are generated and stored on disk, there would no longer be a need to generate a completely new dirhash each time a directory is first accessed after system startup, or after a dirhash for a directory not recently accessed has been disposed of to conserve memory.
Plan
- Either change dirhash memory allocation to use buffer cache, or keep existing memory allocation model but register new callback function to free older dirhashes when vm_lowmem events occur.
- Once code works reasonably well in my own tests, call for testing from the community to shake out more bugs if possible.
- Test for performance differences with dynamic dirhash vs. static under different work loads (e.g. huge maildirs, subversion fsfs repositories with a great many revisions, different amounts of memory in machine, etc.)
- Get code in a state suitable for being committed to -CURRENT
- (Time permitting) Begin work on on-disk directory indexing in UFS2.
Status
I blog about this project here. The code I am working on can be found in the FreeBSD Perforce depot here.
(2008-6-19) I am working on implementing a function to be registered with EVENTHANDLER_REGISTER() for vm_lowmem events that will free dirhashes. I am trying to determine how to decide which hashes to free, and will need to find how many should be freed each time this event occurs. Determining the latter might be easiest through trial and error, watching how often vm_lowmem events pop up when the system is low on memory.
(2008-6-27) Here is a patch with a first attempt at freeing dirhashes when a low memory signal is sent: dirhash_lowmem_event_2008-6-28.patch. In this patch just the one oldest dirhash that can be freed will be.
(2008-7-12) I now have an updated patch: dirhash_lowmem_event_2008-7-12.patch. Now the dirhash data structure has a new field for the last used time stamp. When a vm_lowmem event occurs, every dirhash that has not been used for five seconds (this will probably need to be tuned) is discarded. If all dirhashes have been used more recently than this, then the first one on the dirhash list TAILQ will be deleted.
(2008-8-14) Updated patches again. Now there are two: one against HEAD dirhash_lowmem_head_2008-8-14.patch and one that should work both on 7.0 and 7-STABLE dirhash_lowmem_7-stable_2008-8-14.patch. These fix a bug in the previous code where hashes older than 5 seconds would never be deleted and instead only one hash at the head of the TAILQ would be removed with each vm_lowmem event. Also, now the deletion code is more aggressive: all hashes older than 5 seconds (tunable via a sysctl, vfs.ufs.dirhash_reclaimage) will always be deleted when a vm_lowmem event occurs, and if that doesn't free up at least 10% of the memory currently being used by dirhash, more hashes will be destroyed from the head of the TAILQ until we do free up at least 10% of the memory initially used. Also, below I have posted some results from some benchmarking.
Benchmarks
Here is a comparison of the speed running a number of different tasks that are fairly filesystem intensive with and without my dirhash low memory event handling code compiled into the kernel. The system used for testing is a 1.83GHz Core Duo Macbook with 2GB of memory and a 5400RPM 2.5" SATA hard drive. It's running i386 8-CURRENT, and WITNESS and INVARIANTS related options in the kernel are disabled. The system was running multiuser with only a few extra processes like wpa_supplicant, sshd, and powerd running. The vfs.ufs.dirhash_maxmem sysctl was set to 100MB at all times for both kernels, and vfs.ufs.dirhash_reclaimage was 5 seconds when using the vm_lowmem handling kernel. Each test was run with a random amount of wait time from 1 to 5 minutes between each test, and the whole set of tests was repeated for a total of four runs for the set.
|
mean run time - no vm_lowmem handler |
std. deviation - no vm_lowmem handler |
mean run time - with vm_lowmem handler |
std. deviation - with vm_lowmem handler |
postmark |
2378s |
16s |
2327s |
77s |
portsnap extract |
312s |
55s |
315s |
57s |
tar extract |
505s |
1,8s |
498s |
5,5s |
svn checkout |
1084s |
131s |
1023s |
16,7s |
rm -rf ports |
76s |
3,7s |
78s |
1,5s |
make buildworld |
4373s |
2,3s |
4382s |
23s |
rm -rf src |
115s |
6,5s |
115s |
9,1s |
rm -rf svnroot |
20s |
5,5s |
18,7s |
4,3s |
postmark was run with 1 000 000 files, 500 subdirectories, and 10 000 transactions. The portsnap extract test just involved extracting an already fetched ports tree into a clean directory with portsnap extract. tar extraction was of a bzipped tarball containing a copy of the FreeBSD svn repository, svnmirror-base-r179637.tbz2, fetched from ftp://ftp.freebd.org/pub/FreeBSD/development/subversion. The svn checkout test did a subversion checkout from this newly extracted local svnroot of revision 178000 from HEAD, and make buildworld was run in the checked out src directory. The different rm tests just timed doing rm -rf on the extracted ports and svnroot directories, and the checked out src directory.
The difference in run times was not significantly different with the two versions of code. During the postmark and portsnap tests no vm_lowmem events were seen, but a few of them did get fired and delete dirhashes in the later tests that used more memory. This did not seem to slow things down much though. Another test run was bonnie++ -r 2048 -n 1000 -s 0 -u root, and that also had little difference in the number of various operations per second for the different kernels.
Next I will be testing different values for vfs.ufs.dirhash_reclaimage to find a more optimal value there.