Performance analysis of the caching infrastructure (the caching daemon plus the libc client part) took me quite a long time. I've made two sequences of tests.
First testing sequence
The whole idea of this sequence relies on executing (100 times) this shell code (similar code was also written for the "services" database):
counter=0
while [ $counter -lt $2 ]
do
    cat $1 | cut -d ':' -f 1 | xargs -L 1 getent passwd
    cat $1 | cut -d ':' -f 3 | xargs -L 1 getent passwd
    counter=$(( $counter + 1 ))
done
$2 is the number of iterations.
$1 is the snapshot file name. The snapshot file is generated by the regression tests (NsswitchRegressionDetails). In the "passwd" case, each line of the snapshot file looks similar to /etc/passwd:
root:*:0:0:0::Charlie &:/root:/bin/csh:0:5087
toor:*:0:0:0::Bourne-again Superuser:/root::0:4831
So the idea of the testing code is to run multiple get**byname() and get**byid() queries using the snapshot file and measure the execution time with the time utility. The "-L 1" argument was used intentionally, because a process that makes only one or two queries through nsdispatch() is a pretty typical situation.
During the performance analysis 2 nsswitch.conf files were used. One with caching turned off:
passwd: files
services: files
And one with caching turned on:
passwd: cache files
services: cache files
This cached.conf file was used for testing:
enable-cache passwd yes
enable-cache services yes
keep-hot-count services 16000
suggested-size services 1889
(Tuning the services options makes cached more suitable for working with a very large amount of services data, as in the IANA services list.)
All tests ran on FreeBSD 6.1-RELEASE #0 with getent tweaked to use libc.so.7 and separated nsswitch modules.
Test 1 - "passwd" database
In each iteration:
- lines 1-3 - results without caching
- lines 4-6 - caching was enabled and cached was working in single-threaded mode
- lines 7-9 - caching was enabled and cached was working with 8 threads
Each line is 100*23 getpwnam() queries + 100*23 getpwuid() queries + 2 full getent() sequences (23 queries each).
iteration 1:
2,14 real 0,71 user 1,43 sys
2,12 real 0,91 user 1,22 sys
2,12 real 1,03 user 1,12 sys
2,23 real 1,07 user 1,22 sys
2,27 real 1,04 user 1,26 sys
2,21 real 0,90 user 1,39 sys
2,25 real 1,08 user 1,21 sys
2,21 real 0,84 user 1,43 sys
2,26 real 0,92 user 1,38 sys
iteration 2:
2,10 real 0,59 user 1,56 sys
2,08 real 1,00 user 1,14 sys
2,14 real 0,64 user 1,51 sys
2,22 real 0,77 user 1,50 sys
2,26 real 0,74 user 1,56 sys
2,23 real 1,08 user 1,21 sys
2,23 real 0,71 user 1,58 sys
2,26 real 0,44 user 1,86 sys
2,23 real 1,24 user 1,04 sys
...
These results were almost predictable. The cached results are only slightly (about 5%) worse than plain "files", and threading adds a little overhead, compared to the single-threaded version, when cached is not heavily loaded.
Test 2 - "passwd" database
8192 getent() read sequences, each consisting of 23 queries.
Several iterations. Line 1 - results without caching. Line 2 - results with caching.
14.522u 23.262s 0:37.99 99.4% 2+25k 0+0io 0pf+0w
16.781u 25.695s 0:42.96 98.8% 2+23k 0+0io 0pf+0w
14.567u 23.489s 0:38.29 99.3% 3+24k 0+0io 0pf+0w
17.953u 24.502s 0:42.86 99.0% 2+23k 0+0io 0pf+0w
14.256u 23.646s 0:38.32 98.8% 3+25k 0+0io 0pf+0w
17.089u 25.412s 0:42.90 99.0% 2+23k 0+0io 0pf+0w
14.210u 23.706s 0:38.36 98.8% 2+26k 0+0io 0pf+0w
15.135u 27.395s 0:43.23 98.3% 2+23k 0+0io 0pf+0w
12.895u 24.879s 0:38.00 99.3% 3+26k 0+0io 0pf+0w
16.398u 26.181s 0:43.28 98.3% 2+24k 0+0io 0pf+0w
13.115u 24.663s 0:38.08 99.1% 3+24k 0+0io 0pf+0w
18.333u 24.254s 0:43.05 98.9% 2+24k 0+0io 0pf+0w
These results are also quite good and almost predictable. Of course, traversing the BDB database is faster than communicating with cached through its socket and marshalling/unmarshalling the sent and received data.
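To make that overhead concrete, here is a toy Python sketch. It is not the real cached wire protocol - the key, the record, and the length-prefix framing are invented for the example - but it shows the extra syscalls and marshalling a socket round-trip adds compared to an in-process lookup standing in for BDB:

```python
import socket
import struct
import threading

# In-process "database" standing in for a local BDB lookup.
db = {b"root": b"root:*:0:0:0::Charlie &:/root:/bin/csh"}

def serve(sock):
    # Server side: unmarshal one length-prefixed request,
    # look it up, marshal and send the response.
    (n,) = struct.unpack("!I", sock.recv(4))
    key = sock.recv(n)
    resp = db.get(key, b"")
    sock.sendall(struct.pack("!I", len(resp)) + resp)

client, server = socket.socketpair()
t = threading.Thread(target=serve, args=(server,))
t.start()

# Client side: marshal the request, do the round-trip, unmarshal the answer.
key = b"root"
client.sendall(struct.pack("!I", len(key)) + key)
(n,) = struct.unpack("!I", client.recv(4))
answer = client.recv(n)
t.join()

# Same result either way - but the socket path pays for two sends, two
# receives, and the length-prefix packing/unpacking on both ends.
assert answer == db[key]
```

Even in this minimal form, the socket path performs four I/O syscalls plus marshalling where the direct lookup performs none, which is consistent with the timing gap above.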
Test 3 - "services" database
2*1977 getservbyname() queries.
Several iterations. Line 1 - results without caching. Line 2 - results with caching.
25,24 real 6,84 user 17,28 sys
24,68 real 6,80 user 16,35 sys
25,16 real 6,80 user 17,32 sys
24,50 real 6,75 user 16,16 sys
25,23 real 6,85 user 17,29 sys
24,62 real 6,76 user 16,18 sys
And these results were a bit disappointing - the performance gain from cached could have been a little higher. Well, no worries - the explanation comes further down the road.
Actually, the testing technique used for the above series of performance tests apparently was not precise enough. Besides, I really expected a larger drawback from cached in the "passwd" case and a larger performance boost in the "services" case.
So the second series of tests used a different approach. The getent utility was tweaked by adding getrusage() calls, as in the fragment below:
getrusage(RUSAGE_SELF, &r1);
if (parsenum(argv[i], &id))
        pw = getpwuid((uid_t)id);
else
        pw = getpwnam(argv[i]);
getrusage(RUSAGE_SELF, &r2);
This script was used to execute getent:
counter=0
while [ $counter -lt $1 ]
do
    tail -n 10 snapshot_pwd | cut -d ':' -f 1 | xargs -L 2 getent passwd >> res_pwd
    tail -n 100 snapshot_serv | cut -d ' ' -f 1 | xargs -L 2 getent services >> res_serv
    head -n 10 snapshot_pwd | cut -d ':' -f 1 | xargs -L 2 getent passwd >> res_pwd
    head -n 100 snapshot_serv | cut -d ' ' -f 1 | xargs -L 2 getent services >> res_serv
    counter=$(( $counter + 1 ))
done
Probably the most important thing in this script is the "-L 2" xargs argument, as it makes getent execute two subsequent calls. So we have two queries per process, not one as in the first series of tests. The first query usually incurs a lot of nss-related overhead, because it calls nsdispatch(3) for the first time: reading the configuration file, searching for and opening libraries, and so on. The second query doesn't have this overhead and shows the performance difference between the cache and files sources much better.
The getent outputs were collected in the res_pwd and res_serv files, which were then analyzed by a simple Python script that calculated basic statistical values separately for odd and even lines. Odd lines correspond to the first query made by each getent process, and even lines to the second query.
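The original analysis script is not shown here, but a minimal sketch of it could look like the following (the function names, the dictionary layout, and the choice of population standard deviation are my assumptions; the input is one time value per line, in microseconds):

```python
import statistics

def analyze(times):
    # Basic statistics over a list of per-query times (microseconds).
    return {
        "total": sum(times),
        "average": sum(times) / len(times),
        "median": statistics.median(times),
        "stdev": statistics.pstdev(times),
        "min": min(times),
        "max": max(times),
    }

def split_odd_even(lines):
    # Odd lines (index 0, 2, ...) are each process's first query;
    # even lines (index 1, 3, ...) are the second query.
    values = [float(line) for line in lines if line.strip()]
    return analyze(values[0::2]), analyze(values[1::2])
```

For example, calling split_odd_even(open("res_pwd")) would yield one statistics dictionary for the first queries and one for the second queries, which is the shape of the tables below.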
The above script was run 100 times for each system setting. The analysis of the results can be seen below:
Note: all times are in microseconds.

Caching turned off.
"passwd", first queries:    total 3603763.00, average 3603.76, median 3551.00, std. deviation 278.66, min 3200.00, max 5782.00
"passwd", second queries:   total 44880.00, average 44.88, median 47.00, std. deviation 13.39, min 27.00, max 157.00
"services", first queries:  total 40777567.00, average 4077.76, median 5163.00, std. deviation 662.35, min 3100.00, max 22805.00
"services", second queries: total 5529766.00, average 552.98, median 1069.00, std. deviation 492.91, min 30.00, max 1209.00

Caching turned on (caching daemon in single-threaded mode).
"passwd", first queries:    total 3533011.00, average 3533.01, median 3520.00, std. deviation 131.42, min 3317.00, max 5405.00
"passwd", second queries:   total 102717.00, average 102.72, median 100.00, std. deviation 21.58, min 71.00, max 197.00
"services", first queries:  total 35081940.00, average 3508.19, median 3586.00, std. deviation 136.52, min 3228.00, max 10474.00
"services", second queries: total 1010379.00, average 101.04, median 169.00, std. deviation 22.31, min 71.00, max 214.00

Caching turned on (caching daemon in multithreaded mode).
"passwd", first queries:    total 3565034.00, average 3565.03, median 3493.00, std. deviation 122.95, min 3361.00, max 4358.00
"passwd", second queries:   total 124147.00, average 124.15, median 150.00, std. deviation 27.58, min 78.00, max 232.00
"services", first queries:  total 35509606.00, average 3550.96, median 3466.00, std. deviation 207.83, min 3298.00, max 19709.00
"services", second queries: total 1213242.00, average 121.32, median 137.00, std. deviation 28.25, min 80.00, max 257.00

Caching turned on (single-threaded mode + cache flushed after each query: "each time like the first time" mode :)).
"passwd", first queries:    total 3681771.00, average 3681.77, median 3608.00, std. deviation 96.18, min 3370.00, max 4169.00
"passwd", second queries:   total 205721.00, average 205.72, median 167.00, std. deviation 27.16, min 79.00, max 380.00
"services", first queries:  total 38467155.00, average 3846.72, median 4658.00, std. deviation 497.23, min 3264.00, max 6305.00
"services", second queries: total 5138883.00, average 513.89, median 1343.00, std. deviation 540.94, min 69.00, max 1479.00
NOTE: the above results were calculated from getrusage() output. With gettimeofday() output the results change, but not significantly. The gettimeofday() measurements will be posted here a bit later.
These results are far more precise than the previous ones, basically because the first and second queries are treated differently and because all kinds of nsswitch-unrelated system overhead are left aside. As far as I'm concerned, this testing method is quite suitable for nsswitch performance testing.
While showing excellent results with caching enabled for the "services" database, this test also shows the ugly truth: it's nearly impossible to beat the BDB query time with the cached query time. This fact is easily explainable, but it makes using cached for local sources questionable (not impossible, though). BDB's and cached's response times are comparable, and since we are dealing with lengths of 100-200 microseconds, using cached instead of BDB won't affect system performance much. And though BDB is obviously the fastest solution, the caching daemon can check local files and update the cache when they change (with all precautions taken not to flush the old data if something is wrong with the updated file), and it is a more lightweight solution that does not require BDB in the tree. But, once again, if speed is the main and only concern, then BDB is the choice.
What's been optimized
After analysing the performance measurements, I've found three parts of cached that could possibly be optimized:
- Socket IO
- Multithreading drawbacks
- Lack of precaching
Items 1 and 2 appeared to be closely connected, as the major drawback is caused by sharing one kqueue among several threads. The actual slowdown is proportional to the number of threads. The simple and straightforward solution, but probably the only one, is to give each thread its own kqueue. This eliminates the problem. However, it causes an issue: if a thread hangs (which is possible when the "perform-actual-lookups" mode is turned on), then all the requests it had started to receive would be abandoned. Since this situation is very rare, the #ifdef-ed solution was committed to Perforce.
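The actual change is C code using kqueue(2), but the "one private event queue per worker thread" idea can be sketched in Python, with the selectors module standing in for kqueue (everything here is illustrative, not cached's real code): each worker polls only its own selector, so a stalled worker cannot steal or block events belonging to the others.

```python
import selectors
import socket
import threading

def worker(sel, done):
    # Each worker loops over its OWN selector; no sharing between threads.
    while not done.is_set():
        for key, _ in sel.select(timeout=0.1):
            conn = key.fileobj
            data = conn.recv(64)
            conn.sendall(data.upper())  # trivial stand-in for request handling

done = threading.Event()
threads, clients = [], []
for _ in range(2):
    sel = selectors.DefaultSelector()   # private event queue for this thread
    client, server = socket.socketpair()
    sel.register(server, selectors.EVENT_READ)
    t = threading.Thread(target=worker, args=(sel, done))
    t.start()
    threads.append(t)
    clients.append(client)

results = []
for i, client in enumerate(clients):
    client.sendall(b"req%d" % i)
    results.append(client.recv(64))

done.set()
for t in threads:
    t.join()

assert results == [b"REQ0", b"REQ1"]
```

With a single shared queue, both workers would contend on one select/kevent call; with a queue per thread, each wakeup goes straight to the owning worker, which is the effect the kqueue-per-thread change aims for.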
Socket IO optimizations appeared very hard to do without major changes to cached's architecture. Moreover, most of the socket-IO-related calls have normal execution times, which are nevertheless much longer than the times of BDB-related calls. For these two reasons, I didn't change anything significant in the socket IO part.
Precaching can be very useful for databases such as "services" when the "perform-actual-lookups" mode is on. If we precache the data on startup, all queries to cached become read-request-search-read-response queries, and this type of query is the fastest one in cached: it has no overhead of writing to the cache or of performing the nsdispatch() lookup. So the "precache [cachename] [yes|no]" feature was implemented and submitted.
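With that feature, the cached.conf shown earlier could be extended, for example, like this (using the "precache [cachename] [yes|no]" syntax quoted above):

```
enable-cache services yes
keep-hot-count services 16000
suggested-size services 1889
precache services yes
```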
Further tasks that can definitely help in estimating cached and nsswitch performance are:
- Collect statistics on how often different nsswitch functions are called on a typical system under standard load.
- Make performance measurements more precise, use more iterations.
- Include the internals of nsdispatch (such as nss_common_cached_read()) in the performance measurements and estimate their performance too. This had actually been done while analysing cached performance, but the code needs to be made much cleaner.