Benchmark Advice
Introduction
A FreeBSD release is shipped configured by default to work everywhere, from small single-CPU 486/arm/mips/... compatible systems to powerful multi-core 64-bit servers. As such, some defaults may not be optimal for a specific benchmark/workload. See the tuning guide for more information on optimizations, and don't forget to benchmark each stage of the tuning process.
Some points in this text are highly recommended; consider not publishing results if they have not been followed. In any case, following all of the advice given below to the letter will not harm benchmark results.
Basics
Remember: when benchmarking two things, make sure that everything possible is the same (the constants) and that the only difference between the two things being compared is what is benchmarked (the variable). For example, when comparing the performance of GCC-compiled binaries, use the same hardware, the same OS install and the same source code, and change only the compiler version used to compile the "benchmark" binaries. That way the only variable is the version of GCC, everything else is constant, and the benchmark actually tests the performance of GCC.
Likewise, to benchmark the performance of two OSes, eliminate as many variables as possible:
- same hardware
- running the same benchmark binaries
- using the same versions of GCC
- using the same filesystems
- etc.
That provides a starting point.
Then, modify one of the constants above, and re-run the benchmarks.
Then, modify one more of the constants above, and re-run the benchmarks.
And so forth. Each time, vary only 1 thing, so that only the impact of that *ONE* thing is measured.
Comparing "random binary compile with GCC X on FreeBSD Y on filesystem Z on hardware config A" against "random binary built with GCC Q on Linux R on fileystem S on hardware config B" doesn't show anything. Was the performance difference due to hardware? Filesystem? OS? GCC version? Something else?
DON'Ts
- DON'T benchmark under virtualization (e.g. VMware, VirtualBox, etc.) unless intentionally benchmarking the virtualization system, even if it is "hardware-supported" (like Intel VT / AMD Pacifica). See WhyNotBenchmarkUnderVMWare for details.
- DON'T benchmark -CURRENT or -BETA out of the box: extensive debugging features are enabled by default, so benchmarking these versions accurately requires building FreeBSD from source. Disable *at least* the WITNESS and DIAGNOSTIC options in the kernel configuration and define MALLOC_PRODUCTION in src/lib/libc/stdlib/malloc.c or /etc/src.conf (see the configuration sketch after this list).
- DON'T compare the speed of traditional filesystems (like UFS/UFS+SU/UFS+J/UFS+SUJ/ext3/ext4/...) against modern ones (like ZFS/btrfs) in cross-OS(-version) tests. If everything else is identical (OS/version/compiler/benchmarks) and only the FS is different, this does not apply, of course.
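As a rough sketch of the debugging-options point above, a benchmark kernel configuration and /etc/src.conf could look like the following; which options GENERIC actually enables, and the exact src.conf spelling, vary between FreeBSD branches, so verify against the sources in use:

    # sys/<arch>/conf/BENCHMARK (file name is an example), based on GENERIC
    include     GENERIC
    ident       BENCHMARK
    nooptions   WITNESS             # lock-order verification
    nooptions   WITNESS_SKIPSPIN
    nooptions   INVARIANTS          # kernel consistency checks
    nooptions   INVARIANT_SUPPORT
    nooptions   DIAGNOSTIC          # if enabled in the configuration at hand

    # /etc/src.conf
    MALLOC_PRODUCTION=yes           # disable malloc(3) debugging in libc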
Be aware of exactly what is being benchmarked. For example, the base FreeBSD system currently includes two compilers: GCC 4.2.1 (FreeBSD 8+) and Clang/LLVM 3.0 (FreeBSD 9+). If FreeBSD / GCC 4.2.1 is compared against, for example, Ubuntu / GCC 4.7, the results are unlikely to say anything meaningful about FreeBSD vs. Ubuntu; they will just say something about high-performance computing with the default compiler, and for high-performance computing one normally chooses the most suitable compiler for the task anyway. Newer versions of GCC are available in ports and LLVM/Clang is available for most other systems, so use the same compiler on both systems for compute-bound benchmarks when comparing the influence of the system/kernel.
DOs
Include detailed information
When publishing benchmarks, always include at least the following information so that other people can fully understand the environment the benchmark was run in and, ideally, replicate the results (the sketch after this list shows one way to collect most of it):
- dmesg output of the test configuration(s).
- File system configuration, if the benchmark depends upon filesystem I/O, or could be influenced by it
- Hardware (disk, controller, etc) information
- Software (firmware, etc) information
- RAID config, cabling setup
- Filesystem parameters, mount options, etc.
- Process scheduler used. If unsure, or if the default was used, state that it is the default scheduler.
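One way to collect much of the information above (output file names are examples; adapt the commands to the tools actually available on the system):

    dmesg > env-dmesg.txt                       # kernel/hardware probe messages
    sysctl hw.model hw.ncpu hw.physmem > env-hw.txt
    gpart show > env-partitions.txt             # disk/partition layout
    mount -p > env-mounts.txt                   # mounted filesystems and their options
    sysctl kern.sched.name > env-scheduler.txt  # process scheduler in use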
Obtain/produce reliable results
- Use ministat to see if the results are statistically significant (see the example after this list). Consider buying "The Cartoon Guide to Statistics" (ISBN 0062731025); it is highly recommended for learning about the standard deviation and Student's T.
- Provide the difference in percent, the standard deviation (stddev) and Student's T. Do not provide only raw numbers or only graphs of raw numbers.
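A minimal ministat invocation, assuming one measurement per line and one file per configuration (the file names are made up):

    ministat -c 95 before.txt after.txt    # compare at a 95% confidence level

ministat prints the min/max/median/average/stddev of each dataset and states whether the difference is significant at the chosen confidence level.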
Benchmark explanations and pitfalls
Generic information
Choosing the right scheduler
As of this writing (9.0-RC3), the ULE scheduler has some issues when more compute-bound threads compete for CPU time than there are CPUs available; this is under investigation. If the benchmark generates such a load, check whether the BSD scheduler is better suited.
XXX to be confirmed: Single CPU systems may benefit from the BSD scheduler too.
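Switching to the BSD scheduler requires building a custom kernel; a minimal sketch of the relevant lines in the kernel configuration (assuming the configuration currently contains SCHED_ULE):

    nooptions   SCHED_ULE      # remove the default ULE scheduler
    options     SCHED_4BSD     # use the traditional 4BSD scheduler instead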
Parallel read/write tests
In filesystem / disk I/O tests where writes and reads are interleaved or run in parallel, be aware that FreeBSD prioritizes writes over reads. (XXX: explain why?)
Huge write throughput difference when comparing to another OS
FreeBSD has a low limit on dirty buffers (= the amount of written data kept in RAM instead of being flushed to disk), since under realistic load already-cached data is much more likely to be reused, and thus more valuable, than freshly written data; aggressively caching dirty data would significantly reduce throughput and responsiveness under high load (= a huge difference in throughput only means the system is mostly idle and the interesting use-case is not being benchmarked). It can happen that FreeBSD accepts only some tens of megabytes of dirty buffers, whereas another OS accepts 100 times more. This can give the impression that the other OS has better write throughput, whereas in reality FreeBSD has better real-world behavior. While there are surely cases where 100 times more dirty buffers do not hurt or are even desirable, FreeBSD prefers to optimize for the mixed use-case instead of the write-only use-case.
An interesting benchmark in this case is to generate a load which causes the other system to exceed the amount of allowed dirty buffers so that the system starts to flush the data to disk.
XXX: how to tune the amount of dirty buffers, vfs.hidirtybuffers?
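The XXX aside, the current limits can at least be inspected via sysctl; whether vfs.hidirtybuffers (and the related vfs.lodirtybuffers) are the right knobs to tune is exactly the open question above:

    sysctl vfs.hidirtybuffers vfs.lodirtybuffers    # current dirty-buffer limits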
Tests which involve a lot of calls to get the current time
FreeBSD supports high-precision timecounters. This does not really matter if one just wants to know the current time to the minute, but as FreeBSD is supposed to be suitable for a lot of tasks by default, high-precision timekeeping is the default. Applications which make a lot of calls to get the current time are impacted by this (e.g. because on Linux a faster but less precise way of obtaining the time is used and the application in question was developed mainly on Linux); see the sysctl example at the end of this subsection.
Applications which are known to be impacted:
- MySQL
- ...
XXX: a (link to an) explanation of how to fix such applications would be good
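One thing that can be experimented with, with the usual caveat that cheaper timecounters may be less reliable on some hardware, is the timecounter selection:

    sysctl kern.timecounter.choice         # timecounters available on this machine
    sysctl kern.timecounter.hardware       # the one currently in use
    sysctl kern.timecounter.hardware=TSC   # example: switch to the TSC, if it is listed as usable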
HTTP benchmarks
FreeBSD does not enable full HTTP optimizations by default. Make sure the accf_http kernel module is loaded. The module only helps for HTTP serving; HTTPS does not benefit.
This can be either done at the command line via kldload accf_http, or by adding accf_http_load="YES" to /boot/loader.conf and rebooting the system.
The HTTP server also needs to support and enable the HTTP accept filter. Apache 2.2, for example, supports it, and its start script can auto-load the accf_http module if it is not running in a jail; add apache22_http_accept_enable="YES" to /etc/rc.conf. XXX: add a list of other HTTP servers which support this?
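Putting the commands and knobs from this section together (the rc.conf setting applies to the apache22 port as described above):

    kldload accf_http                        # load the accept filter right away

    # /boot/loader.conf: load the module at boot
    accf_http_load="YES"

    # /etc/rc.conf: let the Apache 2.2 start script enable/auto-load it
    apache22_http_accept_enable="YES"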
Benchmarking ZFS
ZFS performance relies heavily on ample resources (disk performance, memory). Using ZFS on one or two disks will not give improved performance (compared to e.g. UFS), but it does provide improved data safety, such as detecting when data has been damaged by radiation or by disk errors that corrupt data.
Give the system sufficient memory, and/or a read-optimized SSD as L2ARC cache for read performance (the number of SSDs depends on the size of the working set), or two mirrored write-optimized SSDs for the ZIL (mirrored for data safety in case one SSD fails) for synchronous (DBs/NFS/...) write performance.
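As an illustration of adding such devices to an existing pool (pool and device names are placeholders):

    zpool add tank cache ada4              # read-optimized SSD as L2ARC
    zpool add tank log mirror ada5 ada6    # mirrored write-optimized SSDs for the ZIL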
Benchmark specific information
Blogbench
From the blogbench website: "Blogbench is a portable filesystem benchmark that tries to reproduce the load of a real-world busy file server."
So blogbench is a test which exercises the filesystem.
Take both the read and the write performance into account. Reads and writes are done in parallel, so publishing only one of these numbers does not make sense (malicious people may think otherwise).
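Blogbench is available in the ports tree (benchmarks/blogbench); a basic run against a directory on the filesystem under test looks like this (the path is a placeholder):

    blogbench -d /mnt/testfs

It reports separate read and write scores; publish both.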
LAME
When using LAME to benchmark FreeBSD against another OS, it is most probably not the systems that are being compared but the compilers (as with every userland compute-bound application).
High precision benchmarking
Most of the text on which this page is based was originally posted here:
http://lists.freebsd.org/pipermail/freebsd-current/2004-January/019600.html
Note that this advice is mostly concerned with high-precision measurement of CPU-intensive tasks and may introduce needless complications for simpler benchmarks. For reference, PHK has used the procedures outlined below for fine-tuning his work on code that keeps track of machine time, thus the references to quartz crystals and temperature drift.
Also note that an extended version of these hints can be found in the FreeBSD Developers Handbook.
Additional tips
PHK> A number of people have started to benchmark things seriously now, and run into the usual problem of noisy data preventing any conclusions. Rather than repeat myself many times, I decided to send this email.
I experimented with micro-benchmarking some years back, here are some bullet points with a lot of the stuff I found out. You will not be able to use them all every single time, but the more you use, the better your ability to test small differences will be.
- Disable APM and any other kind of clock fiddling (powerd, ACPI?).
- Run in single-user mode; cron(8) and other daemons only add noise.
- If syslog events are generated but unwanted, run syslogd with an empty syslogd.conf; otherwise, do not run it at all.
- If applicable (i.e. when not benchmarking file systems or disk IO), minimize the disk-I/O, avoid it entirely if possible
- Don't mount unnecessary filesystems.
- Mount / and /usr and any other filesystem where possible as read-only. This removes atime updates to disk (etc.) from the I/O picture. Mount all read-write filesystems with the "noatime" option if aiming for the highest performance.
- If benchmarking file systems or disk I/O, newfs the read-write test filesystem and populate it from a tar or dump file before every run. Unmount and mount it again before starting the test. This results in a consistent filesystem layout (see the sketch after this list). For a worldstone test this would apply to /usr/obj (just newfs and mount). For 100% reproducibility, populate the filesystem from a dd(1) image (e.g.: dd if=myimage of=/dev/ad0s1h bs=1m).
- Put each filesystem on its own disk. This minimizes jitter from head-seek optimizations. If comparing different file systems on classical (mechanical) disk drives, it is *crucial* to benchmark them on the smallest partition size possible, and always on the same partition, as there are huge performance differences between different areas of a mechanical drive. E.g. if the machine has 16 GB of RAM, perform the benchmark on a 32-40 GB partition.
- Use malloc-backed or preloaded md(4) partitions to benchmark raw file system performance (also shown in the sketch after this list).
- Reboot between individual iterations of the test; this gives a more consistent state.
- Remove all non-essential device drivers from the kernel. For instance, if USB is not needed for the test, don't put USB in the kernel. Drivers which attach often have timeouts ticking away.
- Unconfigure unused hardware. Detach disks with atacontrol(8) and camcontrol(8) if they are not used for the test.
- Do not configure the network unless that's part of what is tested (or until after the test to ship the results off to another computer).
- Do not run NTPD.
- Minimize output to serial or VGA consoles. Directing output into files gives less jitter (serial consoles can easily become a bottleneck). Do not touch the keyboard while the test is running; even <space><backspace> shows up in the numbers.
- Make sure the test is long enough, but not too long. If the test is too short, timestamping is a problem. If it is too long, temperature changes and drift will affect the frequency of the quartz crystals in the computer. Rule of thumb: more than a minute, less than an hour.
- Try to keep the temperature as stable as possible around the machine. This affects both quartz crystals and disk drive algorithms. To take this further, consider stabilized clock injection (get an OCXO + PLL and inject its output into the clock circuits instead of the motherboard crystal; send me (phk) an email).
- Run at least 3, but preferably more than 20, iterations of both the "before" and the "after" code. Try to interleave if possible (i.e. do not run 20x before and then 20x after); this makes it possible to spot environmental effects. Do not interleave 1:1, but 3:3; this makes it possible to spot interaction effects.
My preferred pattern: bababa{bbbaaa}*
This gives a hint after the first 1+1 runs (so it can be stopped if it goes entirely the wrong way), a stddev after the first 3+3 (a good indication of whether it is going to be worth a long run), and trending and interaction numbers later on.
- Make sure that comparable systems either all have a RAM-based tmpfs (on /tmp) or none of them have it.
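A sketch of the per-run filesystem reset and of a malloc-backed md(4) filesystem mentioned in the list above (device names, size, mount point and archive are placeholders):

    # re-create and re-populate the scratch filesystem before every run
    newfs /dev/ada1p1
    mount /dev/ada1p1 /mnt/bench
    tar -xf /data/testset.tar -C /mnt/bench
    umount /mnt/bench && mount /dev/ada1p1 /mnt/bench   # remount before starting the test

    # malloc-backed md(4) device for raw file system performance
    mdconfig -a -t malloc -s 4g -u 10
    newfs /dev/md10
    mount /dev/md10 /mnt/bench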
Enjoy, and please share any other tricks you might develop!