See also: Solaris: ZFS Evil Tuning Guide, loader.conf(5), sysctl(8).
ZFS Tuning Guide
(Work in Progress)
To use ZFS, at least 1 GB of memory is recommended (for all architectures) but more is helpful as ZFS needs *lots* of memory. Depending on your workload, it may be possible to use ZFS on systems with less memory, but it requires careful tuning to avoid panics from memory exhaustion in the kernel.
A 64-bit system is preferred due to its larger address space and better performance on 64-bit variables, which are used extensively by ZFS. 32-bit systems are supported though, with sufficient tuning.
History of FreeBSD releases with ZFS is as follows:
- 7.0+ - original ZFS import, ZFS v6; requires significant tuning for stable operation (no longer supported)
- 7.2 - still ZFS v6, improved memory handling, amd64 may need no memory tuning (no longer supported)
- 7.3+ - backport of new ZFS v13 code, similar to the 8.0 code
- 8.0 - new ZFS v13 code, lots of bug fixes - recommended over all past versions (no longer supported)
- 8.1+ - ZFS v14
- 8.2+ - ZFS v15
- 9.0+ - ZFS v28
i386
Typically you need to increase vm.kmem_size_max and vm.kmem_size (with vm.kmem_size_max >= vm.kmem_size) to avoid kernel panics (kmem too small). The right value depends upon the workload. If you need to extend them beyond 512M, you need to recompile your kernel with an increased KVA_PAGES option, e.g. add the following line to your kernel configuration file to increase the available space for vm.kmem_size beyond 1 GB:
options KVA_PAGES=512
To choose a good value for KVA_PAGES, read the explanation in the sys/i386/conf/NOTES file.
By default the kernel receives 1 GB of the 4 GB of address space available on the i386 architecture, and this is used for all of the kernel address space needs, not just the kmem map. By increasing KVA_PAGES you can allocate a larger proportion of the 4 GB address space to the kernel (2 GB in the above example), allowing more room to increase vm.kmem_size. The trade-off is that user applications have less address space available, and some programs (e.g. those that rely on mapping data at a fixed address that is now in the kernel address space, or which require close to the full 3 GB of address space themselves) may no longer run. If you change KVA_PAGES and the system reboots (without a panic) after running for a while, this may be because the address space for userland applications is now too small.
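As a rough cross-check of the arithmetic above, the following sketch converts a KVA_PAGES value into the resulting kernel address space. It assumes a non-PAE i386 kernel, where each KVA_PAGES unit covers 4 MB (PAE kernels use a different unit size, see sys/i386/conf/NOTES). The function name kva_pages_to_mb is made up for this illustration.

```shell
#!/bin/sh
# Sketch: kernel address space implied by a KVA_PAGES value.
# Assumption: non-PAE i386 kernel, where one KVA_PAGES unit maps 4 MB.
# kva_pages_to_mb is a hypothetical helper name.
kva_pages_to_mb() {
    echo $(( $1 * 4 ))
}

kva_pages_to_mb 256   # the 1 GB default: prints 1024
kva_pages_to_mb 512   # the example above: prints 2048
```

Whatever the kernel gains is taken from the 4 GB total, so the 512 setting leaves 2 GB for each userland process.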
For *really* memory constrained systems it is also recommended to strip as many unused drivers and options as possible from the kernel (which will free a couple of MB of memory). A stable configuration with vm.kmem_size="1536M" has been reported using an unmodified 7.0-RELEASE kernel, relatively sparse drivers as required for the hardware and options KVA_PAGES=512.
Some workloads need a greatly reduced ARC size and VDEV cache size. ZFS manages the ARC through a multi-threaded process. If it requires more memory for the ARC, ZFS will allocate it. Previously the ARC exceeded arc_max (vfs.zfs.arc_max) from time to time, but with 7.3 and 8-stable as of mid-January 2010 this is no longer the case. On memory constrained systems it is safer to use an arbitrarily low arc_max. For example, it is possible to set vm.kmem_size and vm.kmem_size_max to 512M, vfs.zfs.arc_max to 160M, and vfs.zfs.vdev.cache.size to half its default size of 10 MB (setting it to 5 MB can even improve stability, but this depends upon your workload).
There is one example (CySchubert) of ZFS running nicely on a laptop with 768 Megs of physical RAM with the following settings in /boot/loader.conf:
vm.kmem_size="330M"
vm.kmem_size_max="330M"
vfs.zfs.arc_max="40M"
vfs.zfs.vdev.cache.size="5M"
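Because a bad value can make the kernel panic on boot, it may help to sanity-check the numbers before writing them to /boot/loader.conf. The sketch below checks the two relationships stated in this guide: vm.kmem_size_max >= vm.kmem_size, and vfs.zfs.arc_max smaller than vm.kmem_size. The helper names to_bytes and check_tuning are hypothetical, and the check says nothing about whether the values suit your workload.

```shell
#!/bin/sh
# Sketch: sanity-check loader.conf memory values before rebooting.
# to_bytes and check_tuning are hypothetical helper names.

# Convert a size such as "330M", "1G" or "524288" to bytes.
to_bytes() {
    case "$1" in
        *[Kk]) echo $(( ${1%?} * 1024 )) ;;
        *[Mm]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# Usage: check_tuning KMEM_SIZE KMEM_SIZE_MAX ARC_MAX
check_tuning() {
    kmem=$(to_bytes "$1")
    kmem_max=$(to_bytes "$2")
    arc=$(to_bytes "$3")
    if [ "$kmem_max" -lt "$kmem" ]; then
        echo "vm.kmem_size_max must be >= vm.kmem_size"
        return 1
    fi
    if [ "$arc" -ge "$kmem" ]; then
        echo "vfs.zfs.arc_max must be smaller than vm.kmem_size"
        return 1
    fi
    echo "ok"
}

# The laptop settings above pass both checks.
check_tuning 330M 330M 40M
```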
Kernel memory should be monitored while tuning to ensure a comfortable amount of free kernel address space. The following script will summarize kernel memory utilization and assist in tuning arc_max and VDEV cache size.
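#!/bin/sh -
# Sum the sizes of all loaded kernel modules (kldstat column 4, hexadecimal).
TEXT=`kldstat | awk 'BEGIN {print "16i 0";} NR>1 {print toupper($4) "+"} END {print "p"}' | dc`
# Sum the per-type memory use reported by vmstat -m (in KB) and convert to bytes.
DATA=`vmstat -m | sed -Ee '1s/.*/0/;s/.* ([0-9]+)K.*/\1+/;$s/$/1024*p/' | dc`
TOTAL=$((DATA + TEXT))
# Print each figure in bytes and in MB.
echo TEXT=$TEXT, `echo $TEXT | awk '{print $1/1048576 " MB"}'`
echo DATA=$DATA, `echo $DATA | awk '{print $1/1048576 " MB"}'`
echo TOTAL=$TOTAL, `echo $TOTAL | awk '{print $1/1048576 " MB"}'`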
Note: Perhaps there is a more precise way to calculate / measure how large of a vm.kmem_size setting can be used with a particular kernel, but the authors of this wiki do not know it. Experimentation does work.
However, if you set vm.kmem_size too high in loader.conf, the kernel will panic on boot. You can fix this by dropping to the boot loader prompt and typing set vm.kmem_size="512M" (or a similarly small value known to work).
The vm.kmem_size_max setting is not used directly during the system operation (i.e. it is not a limit which kmem can "grow" into) but for initial autoconfiguration of various system settings, the most important of which for this discussion is the ARC size. If kmem_size and arc_max are tuned manually, kmem_size_max will be ignored, but it is still required to be set.
The issue of kernel memory exhaustion is a complex one, involving the interaction between disk speeds, application loads and the special caching ZFS does. Faster drives will write the cached data faster but will also fill the caches up faster. Generally, larger and faster drives will need more memory for ZFS.
To increase performance, you may raise kern.maxvnodes (in /etc/sysctl.conf) substantially if you have the RAM for it (e.g. 400000 for a 2 GB system). On i386, keep an eye on vfs.numvnodes during production to see where it stabilizes. (AMD64 uses direct mapping for vnodes, so you don't have to worry about address space for vnodes on this architecture.)
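To see how close vfs.numvnodes is to the configured limit, a small helper can turn the two counters into a percentage. This is only a sketch; vnode_usage_pct is a hypothetical name and the sample numbers are illustrative.

```shell
#!/bin/sh
# Sketch: percentage of kern.maxvnodes currently in use.
# vnode_usage_pct is a hypothetical helper name.
# Usage: vnode_usage_pct NUMVNODES MAXVNODES
vnode_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# On a live FreeBSD system the inputs would come from sysctl:
#   vnode_usage_pct "$(sysctl -n vfs.numvnodes)" "$(sysctl -n kern.maxvnodes)"
vnode_usage_pct 300000 400000   # prints 75
```

Sampling this periodically (e.g. from cron) shows where the value stabilizes under production load.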
amd64
NOTE (gcooper): this blanket statement is far from true 100% of the time, depending on how the system with ZFS is being used.
FreeBSD 7.2+ has an improved kernel memory allocation strategy, and no tuning may be necessary on systems with more than 2 GB of RAM.
Generic ARC discussion
The value for vfs.zfs.arc_max needs to be smaller than the value for vm.kmem_size, because ZFS is not the only consumer of kmem.
To monitor the ARC, you can use the script at http://jhell.googlecode.com/files/arc_summary.pl (ported from the Solaris version at http://cuddletech.com/arc_summary/). Another script which may be helpful is http://jhell.googlecode.com/files/arcstat.pl (ported from the Solaris version at http://blogs.sun.com/realneel/entry/zfs_arc_statistics).
To improve random read performance, a separate L2ARC device can be used (zpool add <pool> cache <device>). A cheap solution is to add a USB memory stick (see http://www.leidinger.net/blog/2010/02/10/making-zfs-faster/). The high performance solution is to add an SSD.
Using an L2ARC device will increase the amount of memory ZFS needs to allocate; see http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg34674.html for more info.
Application Issues
ZFS is a copy-on-write filesystem. As such, metadata from the top of the hierarchy is copied in order to maintain consistency in case of sudden failure, e.g. loss of power during a write operation. This obviates the need for an fsck-like check of ZFS filesystems at boot. The downside is that applications which update large files in place, e.g. databases, will likely perform poorly on ZFS due to excessive I/O from copy-on-write. A fast SLOG device (e.g. an SSD) can help the write performance of databases, or of any application which does synchronous writes (e.g. open with O_FSYNC) to make sure the data is on non-volatile storage when the write call returns. Additionally, database applications such as Oracle that maintain a large cache in memory (called the SGA in Oracle) will perform poorly due to double caching of data in the ARC and in the application's own cache. Reducing the ARC to a minimum can improve the performance of applications which maintain their own cache. At http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide there are some generic recommendations for ZFS on Solaris which mostly apply to FreeBSD too.
General Tuning
There are some changes that can be made to improve performance in certain situations and avoid the bursty IO that's often seen with ZFS.
Loader tunables (in /boot/loader.conf):
# Disable ZFS prefetching
# http://southbrain.com/south/2008/04/the-nightmare-comes-slowly-zfs.html
# Increases overall speed of ZFS, but when disk flushing/writes occur,
# system is less responsive (due to extreme disk I/O).
# NOTE: Systems with 4 GB of RAM or more have prefetch enabled by default.
vfs.zfs.prefetch_disable="1"
# Decrease ZFS txg timeout value from 30 (default) to 5 seconds. This
# should increase throughput and decrease the "bursty" stalls that
# happen during immense I/O with ZFS.
# http://lists.freebsd.org/pipermail/freebsd-fs/2009-December/007343.html
# http://lists.freebsd.org/pipermail/freebsd-fs/2009-December/007355.html
# default in FreeBSD since ZFS v28
vfs.zfs.txg.timeout="5"
Sysctl variables (/etc/sysctl.conf):
# Increase number of vnodes; we've seen vfs.numvnodes reach 115,000
# at times. Default max is a little over 200,000. Playing it safe...
# If numvnodes reaches maxvnode performance substantially decreases.
kern.maxvnodes=250000
# Set TXG write limit to a lower threshold. This helps "level out"
# the throughput rate (see "zpool iostat"). A value of 256MB works well
# for systems with 4 GB of RAM, while 1 GB works well for us w/ 8 GB on
# disks which have 64 MB cache.
vfs.zfs.txg.write_limit_override=1073741824
Be aware that the vfs.zfs.txg.write_limit_override tuning you see above may need to be adjusted for your system. It's up to you to figure out what works best in your environment.
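The write limit override is given in bytes, so trying a different threshold means converting from MB first. A minimal sketch of that conversion follows; mb_to_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: convert a size in MB to the byte value used for
# vfs.zfs.txg.write_limit_override. mb_to_bytes is a hypothetical name.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 256    # prints 268435456  (the 256MB value mentioned for 4 GB systems)
mb_to_bytes 1024   # prints 1073741824 (the 1 GB value used above)
```

The result can be placed in /etc/sysctl.conf as shown above.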
Deduplication
Deduplication is a misunderstood feature in ZFS v21+; some users see it as a silver bullet for increasing capacity by reducing redundancies in data. Here are the author's (gcooper's) observations:
- There are some resources that suggest that one needs 2GB per TB of storage with deduplication [i] (in fact this is a misinterpretation of the text). In practice with FreeBSD, based on empirical testing and additional reading, it's closer to 5GB per TB.
- Using deduplication is slower than not running it.
- Deduplication [on 8.x/9.x at least] lies via stat(2) / statvfs(2); it reports the theoretical used space -- not the actual used space -- which can confuse scripts that look at df output, etc (TODO: find PR that mentions this).
Suggestions
- If you are going to use deduplication and your machine is underspec'ed, you must set vfs.zfs.arc_max to a sane value or ZFS will wire down as much available memory as possible, which can create memory starvation scenarios.
- It's a much better idea in general to use compression, as opposed to deduplication, if you're trying to save space and you know that you can benefit from compression.
- When in doubt, check how much you would actually gain from deduplication via zdb -S <zpool> instead of just turning it on. Please note that this will take a while to run, depending on the dataset/zpool selected.
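The per-TB rules of thumb above translate into a quick back-of-the-envelope estimate. The sketch below uses the roughly 5 GB per TB figure from the observations as its default, with the often-cited 2 GB per TB available as an alternative. Both are rough figures rather than guarantees, and dedup_ram_gb is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: rough RAM estimate for deduplication from the pool size.
# dedup_ram_gb is a hypothetical helper; the default of 5 GB per TB is the
# empirical figure quoted above, not a guarantee.
# Usage: dedup_ram_gb POOL_TB [GB_PER_TB]
dedup_ram_gb() {
    echo $(( $1 * ${2:-5} ))
}

dedup_ram_gb 4      # prints 20 (GB, at 5 GB per TB)
dedup_ram_gb 4 2    # prints 8  (GB, at the often-cited 2 GB per TB)
```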
References
NFS tuning
The combination of ZFS and NFS stresses the ZIL to the point that performance falls significantly below expected levels. The best solution is to put the ZIL on a fast SSD (or a pair of SSDs in a mirror, for added redundancy). The next best solution is to disable ZIL with the following setting in loader.conf (up to ZFS version 15):
vfs.zfs.zil_disable="1"
In the latest ZFS (version 28), the vfs.zfs.zil_disable loader tunable was replaced with the "sync" dataset property. You can now enable or disable the ZIL on a per-dataset basis.
zfs set sync=disabled tank/dataset
Disabling ZIL is not recommended where data consistency is required (such as database servers) but will not result in file system corruption. See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29 .
ZFS is designed to be used with "raw" drives - i.e. not over already created hardware RAID volumes (this is sometimes called "JBOD" or "passthrough" mode when used with RAID controllers), but can benefit greatly from good and fast controllers.
MySQL
This assumes lots of RAM
Tweaks for MySQL
Tweaks for ZFS
- zfs set primarycache=metadata tank/db
- zfs set atime=off tank/db
- zfs set recordsize=16k tank/db
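- set zfs:zfs_nocacheflush = 1 (Solaris /etc/system syntax, not a zfs set property; the FreeBSD counterpart is the vfs.zfs.cache_flush_disable tunable)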
- sysctl vfs.zfs.prefetch_disable=1
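The three dataset properties in the list above can be applied in one pass. The following sketch only prints the commands by default, so they can be reviewed first. The function name apply_db_props and the ZFS variable are hypothetical, and tank/db is the example dataset name used above. The cache flush and prefetch settings are system-wide rather than per-dataset, so they are left out.

```shell
#!/bin/sh
# Sketch: apply the dataset properties listed above to one dataset.
# apply_db_props and the ZFS variable are hypothetical. By default the
# commands are only printed (dry run); set ZFS=zfs to really run them.
ZFS=${ZFS:-"echo zfs"}

apply_db_props() {
    for prop in primarycache=metadata atime=off recordsize=16k; do
        # $ZFS is left unquoted on purpose so "echo zfs" splits into two words.
        $ZFS set "$prop" "$1" || return 1
    done
}

apply_db_props tank/db
```

To run the commands for real, set the variable when invoking the script, e.g. ZFS=zfs sh script.sh, where script.sh stands for whatever file the sketch was saved in.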