Preface
A FreeBSD release ships with a default configuration meant to work everywhere, from small single-CPU 486/arm/mips/... compatible systems to powerful multi-core 64-bit servers. As such, some defaults may not be optimal for a specific benchmark/workload. A tuning guide is available in case you want to get the most out of it. Do not forget to benchmark whether the tuning actually makes a difference for your workload.
Highly recommended points in this text are marked with the icon. Please consider not publishing your results if you haven't followed at least those points. In any case, following all of the advice given below to the letter will not harm benchmark results.
Remember: when benchmarking two things, you need to make sure that everything possible is the same (constants) and that the only difference between the two things to compare is what you want to benchmark (variables). For example, if you want to compare the performance of GCC-compiled binaries, then you would use the same hardware host, the same OS install, the same source code, and only change the compiler versions used to compile the "benchmark" binaries. That way, the only variable is "version of GCC", everything else is constant, and thus the benchmark is actually testing the performance of GCC.
Likewise, if you want to benchmark the performance of two OSes, you need to eliminate as many variables as possible:
- same hardware
- running the same benchmark binaries
- using the same versions of GCC
- using the same filesystems
- etc.
That gives you the starting point.
Then, you modify one of the constants above, and re-run the benchmarks.
Then you modify one more of the constants above, and re-run the benchmarks.
And so forth. Each time, you vary only 1 thing, so that you can measure the impact of that *ONE* thing.
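The procedure above can be sketched as a small run plan. This is only an illustration of one way to read it (every run differs from the baseline in exactly one factor); the function name `one_factor_plan` and the factor values are hypothetical and not part of any FreeBSD tool.

```python
def one_factor_plan(baseline, alternatives):
    """Return a list of (label, config) runs: the baseline first, then
    one run per alternative value, each differing from the baseline in
    exactly one factor."""
    plan = [("baseline", dict(baseline))]
    for factor, values in alternatives.items():
        for value in values:
            config = dict(baseline)   # start again from the constants
            config[factor] = value    # change exactly one of them
            plan.append((f"{factor}={value}", config))
    return plan


# Hypothetical factors and values, for illustration only.
baseline = {"os": "FreeBSD", "compiler": "gcc-4.2.1", "fs": "UFS+SU"}
alternatives = {"compiler": ["clang-3.0"], "fs": ["ZFS"]}

for label, config in one_factor_plan(baseline, alternatives):
    print(label, config)
```

Any difference between a run and the baseline can then be attributed to the single factor named in the run's label.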
Comparing "random binary compiled with GCC X on FreeBSD Y on filesystem Z on hardware config A" against "random binary built with GCC Q on Linux R on filesystem S on hardware config B" doesn't show anything. Was the performance difference due to hardware? Filesystem? OS? GCC version? Something else?
Do not do this!
DO NOT BENCHMARK UNDER VMWARE or any other virtual machine emulation unless you're benchmarking VMWare instead of your system, even if it's "hardware-supported" (like Intel VT / AMD Pacifica). This will mess up your results badly even in the best case. See WhyNotBenchmarkUnderVMWare for details.
Do not benchmark -CURRENT or -BETA. Extensive debugging features are most probably enabled. If you absolutely have to benchmark one of them, you have to build your own FreeBSD from source. You need to at least disable options WITNESS and DIAGNOSTIC in the kernel configuration and define MALLOC_PRODUCTION in src/lib/libc/stdlib/malloc.c or /etc/src.conf.
- Do not compare the speed of a traditional FS (like UFS/UFS+SU/UFS+J/UFS+SUJ/ext3/ext4/...) with a more modern FS (like ZFS/btrfs) in a cross-OS(-version) test. If you keep everything else (OS/version/compiler/benchmarks) the same and only the FS is different, this does not apply, of course.
Please also be aware of what you are benchmarking. For example, the base FreeBSD system includes two compilers at present: GCC 4.2.1 (FreeBSD 8+) and Clang/LLVM 3.0 (FreeBSD 9+). If you compare FreeBSD / GCC 4.2.1 against, for example, Ubuntu / GCC 4.7, then the results are unlikely to tell you anything meaningful about FreeBSD vs Ubuntu. They will just tell you something about high-performance computing with the default compiler, but if you are into high-performance computing, you normally choose the most suitable compiler for your task anyway. Newer versions of GCC are available in ports, and LLVM/Clang is available for most other systems. Make sure that you are using the same compiler on both systems for compute-bound benchmarks if you want to compare the influence of the system/kernel.
Information to include
If you publish benchmark results, you should always include the following information so that other people are able to fully understand the environment the benchmark was run in:
Include dmesg output of your test configuration(s) in the benchmark results
- if your benchmark depends upon FS I/O, or is at least influenced by it, add your FS config (RAID config / hardware config / cabling info / which parameters were used to create the FS / mount options / ...)
- state which process scheduler is used (if you don't know and if you didn't change it, state that it is the default scheduler)
Make sure you get good numbers
Use ministat to see if your numbers are significant. Consider buying "Cartoon guide to statistics" ISBN: 0062731025. Highly recommended, if you've forgotten or never learned about stddev and Student's T.
- Do not only provide pure numbers or only graphs of the pure numbers. Always provide the difference in percent, the stddev and the Student's T.
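As a rough illustration of the three figures asked for above, the following sketch computes the difference in percent, the standard deviations and a pooled-variance Student's t statistic for two sets of measurements. It is not a replacement for ministat; the function name `compare` and the sample numbers are made up for the example, and the pooled-variance form assumes both sets have similar variance.

```python
import math
import statistics


def compare(before, after):
    """Summarise two sets of timings with a pooled-variance Student's t."""
    n1, n2 = len(before), len(after)
    m1, m2 = statistics.mean(before), statistics.mean(after)
    s1, s2 = statistics.stdev(before), statistics.stdev(after)
    # Pooled standard deviation (assumes similar variance in both sets;
    # it must be non-zero, i.e. the samples must not all be identical).
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    t = (m2 - m1) / (pooled * math.sqrt(1 / n1 + 1 / n2))
    return {
        "mean_before": m1,
        "mean_after": m2,
        "stddev_before": s1,
        "stddev_after": s2,
        "diff_percent": (m2 - m1) / m1 * 100.0,
        "t": t,
        "degrees_of_freedom": n1 + n2 - 2,
    }


# Made-up run times in seconds, for illustration only.
before = [10.1, 10.4, 9.9, 10.2, 10.0]
after = [9.6, 9.8, 9.5, 9.9, 9.7]
for key, value in compare(before, after).items():
    print(f"{key}: {value:.3f}")
```

The absolute value of `t` is then compared against a Student's t table for the reported degrees of freedom and the confidence level you want.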
Benchmark explanations and pitfalls
Generic information
Choosing the right scheduler
As of this writing (9.0 RC3) the ULE scheduler has some issues when more compute-bound threads are competing for CPU usage than there are available CPUs. This is under investigation. If your benchmark is doing something like this, you should investigate to see if the BSD scheduler may be better suited for this benchmark.
XXX to be confirmed: Single CPU systems may benefit from the BSD scheduler too.
Parallel read/write tests
If you do a FS/disk I/O test where writes and reads are interleaved / in parallel, you need to be aware that FreeBSD prioritizes writes over reads. (XXX: explain why?)
Huge write throughput difference when comparing to another OS
FreeBSD has a low limit on dirty buffers (= the amount of written data kept in RAM instead of being flushed to disk), since under realistic load the already cached data is much more likely to be reused and is thus more valuable than freshly written data; aggressively caching dirty data would significantly reduce throughput and responsiveness under high load (= a huge difference in throughput only means your system is mostly idle and you are not benchmarking the interesting use-case). FreeBSD may accept dirty buffers somewhere in the tens of megabytes, whereas another OS accepts 100 times more. This could lead to the impression that the other OS has better write throughput, whereas in reality FreeBSD has better real-world behavior. While there are surely cases where 100 times more dirty buffers don't hurt or are even something you want to have, FreeBSD prefers to optimize for the mixed use-case instead of the write-only use-case.
An interesting benchmark in this case is to generate a load which causes the other system to exceed the amount of allowed dirty buffers so that the system starts to flush the data to disk.
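A minimal sketch of such a measurement, assuming you simply want to see how much of the apparent write throughput comes from data still sitting in dirty buffers: it times the same write twice, once until write() returns and once until fsync() has completed. The function name `write_throughput` and the sizes are illustrative; for a real test the amount written has to be far larger than whatever the system is willing to buffer.

```python
import os
import tempfile
import time


def write_throughput(path, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes to path and return (buffered, flushed) in MB/s.

    'buffered' stops the clock when the last write() has returned;
    'flushed' also waits for fsync(), so data that is still held in
    dirty buffers is included in the measurement.
    """
    chunk = os.urandom(chunk_bytes)  # random data, so FS compression does not help
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        buffered_s = time.perf_counter() - start
        os.fsync(f.fileno())
        flushed_s = time.perf_counter() - start
    mb = total_bytes / 1e6
    return mb / buffered_s, mb / flushed_s


# Tiny demonstration size; far too small for a meaningful benchmark.
with tempfile.TemporaryDirectory() as d:
    buffered, flushed = write_throughput(os.path.join(d, "testfile"), 8 << 20)
    print(f"buffered: {buffered:.1f} MB/s, flushed: {flushed:.1f} MB/s")
```

A large gap between the two numbers suggests the write-only figure mostly reflects caching rather than what the disks can sustain.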
XXX: how to tune the amount of dirty buffers, vfs.hidirtybuffers?
Tests which involve a lot of calls to get the current time
FreeBSD has high precision timecounters. This does not really matter if you just want to know the current time to the minute, but as FreeBSD is supposed to be suitable for a lot of tasks by default, you get high precision timekeeping by default. Some applications make a lot of calls to get the current time, for example because they were developed mainly on Linux, where a faster but less precise way of obtaining the time is used. Such applications are impacted by this.
Applications which are known to be impacted:
- MySQL
- ...
XXX: (Link to) explanation how to fix the applications would be good
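To get a feeling for whether time queries matter on a given system, their average cost can be measured directly. The sketch below does this from Python, so interpreter and loop overhead are included in every number; it is only useful for comparing systems that run the same Python version, and `cost_per_call` is a name made up for this example.

```python
import time


def cost_per_call(func, calls=200_000):
    """Return the average wall-clock cost in seconds of one call to func.

    The loop overhead is included, so treat the result as an upper bound.
    """
    start = time.perf_counter()
    for _ in range(calls):
        func()
    return (time.perf_counter() - start) / calls


for name, func in (("time.time", time.time), ("time.monotonic", time.monotonic)):
    print(f"{name}: {cost_per_call(func) * 1e9:.0f} ns per call")
```

If an application performs millions of such calls per second, even a small per-call difference between two systems shows up in its benchmark results.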
HTTP benchmarks
FreeBSD does not enable full HTTP-optimizations by default. If you want to get the most performance out of FreeBSD, make sure the accf_http kernel module is loaded. The module only helps for HTTP serving; HTTPS does not benefit from this.
You can either do this on the command line via kldload accf_http, or by adding accf_http_load="YES" to /boot/loader.conf and rebooting the system. The HTTP server also needs to support the HTTP accept filter. This is the case for e.g. apache 2.2 (and the start script of apache can auto-load the accf_http module, if it is not run in a jail, when you add apache22_http_accept_enable="YES" to /etc/rc.conf). XXX: adding a list of other HTTP servers which support this?
Benchmarking ZFS
If you want to benchmark ZFS, be aware that it will only shine if you are willing to spend money. Using ZFS on one or two disks will not give improved performance (compared to e.g. UFS), but it will give improved safety for your data (you know when your data is damaged by e.g. radiation or data-manipulating harddisk-errors). To make it shine you need to add at least a lot of RAM, or one read-optimized SSD as L2ARC cache for read performance (the number of SSDs depends upon the size of the working set), or two mirrored (for data safety in case one SSD gets damaged) write-optimized SSDs for the ZIL for synchronous (DBs/NFS/...) write performance.
Benchmark specific information
Blogbench
From their website: "Blogbench is a portable filesystem benchmark that tries to reproduce the load of a real-world busy file server."
So blogbench is a test which exercises the FS (if you are not Wordpress.com (or similar), it's unlikely that you should think about it as a benchmark for blogs).
You have to take the read and the write performance into account. Reads and writes are done in parallel, so presenting only one of the numbers (in a publication or to your boss or whoever) does not make sense (malicious people may think otherwise).
LAME
If you compare FreeBSD against another OS using LAME as one of the benchmarks: most probably you are not comparing the systems, you are comparing the compilers (as with every userland-compute-bound application).
High precision benchmarking
Most of the text on which this page is based was originally posted here:
http://lists.freebsd.org/pipermail/freebsd-current/2004-January/019600.html
Note that this advice is mostly concerned with high-precision measurement of CPU-intensive tasks and may introduce needless complications for simpler benchmarks. For reference, PHK has used the procedures outlined below for fine-tuning his work on code that keeps track of machine time, thus the references to quartz crystals and temperature drift.
Also note that an extended version of these hints can be found in the FreeBSD Developers Handbook.
PHK> A number of people have started to benchmark things seriously now, and run into the usual problem of noisy data preventing any conclusions. Rather than repeat myself many times, I decided to send this email.
I experimented with micro-benchmarking some years back, here are some bullet points with a lot of the stuff I found out. You will not be able to use them all every single time, but the more you use, the better your ability to test small differences will be.
Disable APM and any other kind of clock fiddling (powerd, ACPI ?). Run in single user mode. cron(8) and other daemons only add noise.
If syslog events are generated but unwanted, run syslogd with an empty syslogd.conf, otherwise, do not run it.
- If applicable (i.e. you are not benchmarking file systems or disk IO), minimize the disk-I/O, avoid it entirely if you can.
- Don't mount filesystems you do not need.
Mount / and /usr and any other filesystem possible as read-only. This removes atime updates to disk (etc.) from your I/O picture. You should mount all RW file systems with the "noatime" option if aiming for the highest performance.
If benchmarking file systems or disk IO, newfs your R/W test filesystem and populate it from a tar or dump file before every run. Unmount and mount it before starting the test. This results in a consistent filesystem layout. For a worldstone test this would apply to /usr/obj (just newfs and mount). If you want 100% reproducibility, populate your filesystem from a dd(1) file (i.e.: dd if=myimage of=/dev/ad0s1h bs=1m).
Put each filesystem on its own disk. This minimizes jitter from head-seek optimizations. If comparing different file systems on classical (mechanical) disk drives, it is *crucial* to benchmark them on the smallest partition size possible, and always on the same partition, as there are huge performance differences when accessing different areas on a mechanical drive. E.g. if the machine has 16 GB of RAM, perform the benchmark on a 32 GB - 40 GB partition.
- Use malloc backed or preloaded MD(4) partitions to benchmark raw file system performance.
- Reboot between individual iterations of your test; this gives a more consistent state.
- Remove all non-essential device drivers from the kernel. For instance, if you don't need USB for the test, don't put USB in the kernel. Drivers which attach often have timeouts ticking away.
Unconfigure hardware you don't use. Detach disks with atacontrol and camcontrol if you do not use them for the test.
- Do not configure the network unless you are testing it (or after your test to ship the results off to another computer).
- Do not run NTPD.
Minimize output to serial or VGA consoles. Running output into files gives less jitter. (Serial consoles can easily become a bottleneck). Do not touch the keyboard while the test is running, even <space><back-space> shows up in your numbers.
Make sure your test is long enough, but not too long. If your test is too short, timestamping is a problem. If it is too long, temperature changes and drift will affect the frequency of the quartz crystals in your computer. Rule of thumb: more than a minute, less than an hour.
- Try to keep the temperature as stable as possible around the machine. This affects both quartz crystals and disk drive algorithms. If you really want to get nasty, consider stabilized clock injection. (get an OCXO + PLL, inject output into clock circuits instead of motherboard xtal. Send me an email).
Run at least 3 iterations, but preferably more than 20, for both "before" and "after" code. Try to interleave if possible (i.e: do not run 20xbefore then 20xafter); this makes it possible to spot environmental effects. Do not interleave 1:1, but 3:3; this makes it possible to spot interaction effects.
My preferred pattern: bababa{bbbaaa}*
This gives a hint after the first 1+1 runs (so you can stop it if it goes entirely the wrong way), a stddev after the first 3+3 (gives a good indication if it is going to be worth a long run) and trending and interaction numbers later on.
- Make sure that comparable systems either all have RAM-based tmpfs (on /tmp) or none of them have.
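The interleaving pattern bababa{bbbaaa}* can be generated mechanically, which helps when the runs are driven by a script. In this sketch `run_order` is a made-up helper name, "b" stands for the "before" code and "a" for the "after" code.

```python
def run_order(blocks):
    """Return the run order 'bababa' followed by `blocks` repetitions
    of 'bbbaaa' ('b' = before code, 'a' = after code)."""
    return "bababa" + "bbbaaa" * blocks


# Three 1:1 pairs first, then three 3:3 blocks.
for number, which in enumerate(run_order(3), start=1):
    print(f"run {number}: {'before' if which == 'b' else 'after'}")
```

Each value of `blocks` adds three more runs of each variant, so both sides always end up with the same number of samples.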
Enjoy, and please share any other tricks you might develop!