Callchain Capture and Analysis with PmcTools

Introduction

With this patch, at the time of a PMC interrupt, hwpmc(4) will record a caller information from the stack in addition to the interrupted PC value.

In this wiki page we will take a brief look at the new information this feature can provide. We will also examine the shortcomings of the current set of heuristics used by tools.

How to capture callchains

pmcstat(8) will configure sampling PMCs to capture callchains by default, so you do not need to do anything special to turn this feature on.

The new -N option to pmcstat(8) turns off callchain capture for the times you prefer 'flat' profiles.

For example, to allocate system-wide sampling PMCs that count "instructions retired" on all CPUs of your system, use:

# pmcstat -S instructions -O sample.out

Analysis

The information captured in the log can be post-processed by pmcstat. Two kinds of summaries can be generated:

  1. A callchain summary that summarizes frequently sampled functions and their callers. Callchain summaries are requested using the new -G option.

  2. Traditional gprof(1) compatible gmon.out files can also be generated. gmon.out output is specified using the -g option.

An Example

The samples for this example were collected during a short interval (10 seconds) while the system was executing a make buildworld.

The Callchain Summary

The command pmcstat -R sample.out -G cc-summary-file generates a callchain summary.

# cat cc-summary-file
@ p4-instr-retired [95282 samples]

03.10%  [2951]     fnmatch1 @ /usr/bin/ld
 98.58%  [2909]      fnmatch
 01.42%  [42]        fnmatch1

The first line of the summary informs us that the log file had 95282 valid samples for PMC event "p4-instr-retired".

The most frequently sampled function was fnmatch1 in executable /usr/bin/ld, with 2951 samples out of the total. Function fnmatch, in the same executable object, was the most frequent caller of function fnmatch1.

The next most frequent samples fell in function Buf_AddByte in executable /home/obj/home/fcpi/src/make.i386/make:

02.30%  [2194]     Buf_AddByte @ /home/obj/home/fcpi/src/make.i386/make
 45.81%  [1005]      Parse_File
  17.01%  [171]        ReadMakefile
   87.72%  [150]         TryReadMakefile
   12.28%  [21]          main
 26.66%  [585]       Var_SubstOnly
  22.22%  [130]        For_Run
   100.0%  [130]         parse_for
 12.40%  [272]       parse_for
  99.26%  [270]        Parse_File
   100.0%  [270]         ReadMakefile
 04.79%  [105]       CondGetArg
  89.52%  [94]         CondToken
   100.0%  [94]          CondT
...output snipped for brevity...

The callchain summary is sorted with the most frequently sampled locations and their locations first.

Traditional gprof(1) style profiles

The -g option to pmcstat(8) turns on gmon.out generation:

# pmcstat -R sample.out -g

You would then run gprof(1) on the generated gmon.out files as before:

# gprof /usr/bin/ld p4-instr-retired/ld.gmon > ld.gprof
# gprof /boot/kernel/kernel p4-instr-retired/kernel.gmon > kernel.gprof
# gprof /lib/libc.so.7 p4-instr-retired/libc.so.7.gmon > libc.gprof

Lets take a look at what the kernel was doing:

# cat kernel.gprof
...header snipped...
                                  called/total       parents 
index  %time    self descendents  called+self    name           index
                                  called/total       children

                                                     <spontaneous>
[1]     24.7   64.00     2696.05                 syscall [1]
                1.00     1476.19      37/37          stat [7]
                1.00      655.74       1/1           open [18]
                0.00      204.05      72/72          execve [39]
                2.00      129.17       2/2           read [57]
                0.00       61.12       4/4           lstat [102]
                0.00       32.95       1/1           mkdir [155]
                0.00       27.99       2/2           unlink [165]
                1.00       20.70       2/2           write [193]
                0.00       16.52       1/1           fstatfs [231]
                1.00       14.20       2/2           munmap [238]
                1.00       12.31       1/1           sigaction [248]
               12.60        0.00       7/10          generic_copyout [215]
                2.00        8.00       1/1           fstat [269]
                7.64        0.88       2/11          userret [126]
                4.86        0.00       1/7           sched_userret [149]
                0.00        2.13       1/1           getdirentries [422]

As to be expected, gprof(1) has propagated most of the samples to function syscall.

                                                     <spontaneous>
[2]     21.0 2348.00        0.00                 sse2_pagezero [2]

2348 samples fell into sse2_pagezero. However, hwpmc(4) was not able to trace the call stack further.

                0.60       25.96       1/87          kern_rmdir [166]
                0.60       25.96       1/87          kern_execve [60]
                0.60       25.96       1/87          kern_unlink [164]
                0.60       25.96       1/87          kern_lstat [159]
                0.60       25.96       1/87          lstat [102]
                0.60       25.96       1/87          vn_open [21]
                0.60       25.96       1/87          kern_chown [171]
                0.60       25.96       1/87          elf32_load_file [170]
                0.60       25.96       1/87          kern_mkdir [154]
                1.20       51.92       2/87          kern_readlink [118]
                1.20       51.92       2/87          kern_access [108]
               12.55      545.17      21/87          vn_open_cred [22]
               31.68     1375.90      53/87          kern_stat [8]
[3]     20.7   52.00     2258.56      87         namei [3]
              282.00     1747.26     132/132         lookup [4]
                2.12       69.18       3/41          VOP_LOOKUP_APV [10]
                3.53       33.70       7/131         vput [16]
                7.43       25.85       4/7           vrele [110]
               27.33        5.28      18/27          uma_zfree_arg [123]
               19.11        5.46       8/18          uma_zalloc_arg [115]
                2.60       13.92       1/52          _vn_lock [12]
                0.00        9.50       1/1           VOP_READLINK_APV [272]
                0.29        1.31       1/17          vfs_busy [168]
                1.50        0.00       1/2           compute_cn_lkflags [375]
                1.18        0.00       1/39          critical_enter [127]


Here's what ld was doing:

# cat ld.gprof
...header snipped...
                                      18             map_input_to_output_sections [1]
[1]     86.9    1.00     6079.64       0+18      map_input_to_output_sections [1]
                0.00     6079.64      67/67          walk_wild [2]
                                      18             map_input_to_output_sections [1]

-----------------------------------------------

                0.00     6079.64      67/67          map_input_to_output_sections [1]
[2]     86.9    0.00     6079.64      67         walk_wild [2]
               12.00     6012.36    1742/1742        walk_wild_file [3]
               16.88       31.33      17/1840        fnmatch [5]
                0.33        3.49       1/21          bfd_check_format [20]
                0.08        3.17       1/1818        walk_wild_section [4]

-----------------------------------------------

               12.00     6012.36    1742/1742        walk_wild [2]
[3]     86.1   12.00     6012.36    1742         walk_wild_file [3]
              138.92     5765.95    1817/1818        walk_wild_section [4]
                6.67       69.78      20/21          bfd_check_format [20]
               22.18        0.00       4/11          strcmp [25]
                5.00        0.00       5/610         wildcardp [7]
                0.95        2.91       1/20          bfd_check_format_matches [21]
...output snipped...


And libc.so.7:

# cat libc.so.7.gmon
...header snipped...
                                                     <spontaneous>
[1]     95.0    1.00     3167.77                 regcomp [1]
              382.00     2785.77       1/1           regexec [2]

-----------------------------------------------

                                       1             regexec [2]
              382.00     2785.77       1/1           regcomp [1]
[2]     95.0  382.00     2785.77       1+1       regexec [2]
             2643.00       25.37    1618/1618        fgetwc [3]
                4.00       49.24       9/9           malloc [5]
               32.00        0.00       1/1           malloc_usable_size [8]
               17.33        9.54       7/21          free [4]
                5.29        0.00       5/35          __i686.get_pc_thunk.bx [6]
                                       1             regexec [2]

Caveats

hwpmc(4) currently uses a simple heuristic to determine the initial stack frame given an interrupted PC value. This heuristic doesn't always work correctly in the presence of aggressive instruction scheduling that is the default with the current GCC compiler. Thus a small percentage of samples could appear to be missing call frames. For example, a call chain foo->bar->buzz may be collected as foo->buzz, if the interrupted PC value fell in function buzz when it was just about to return to bar.

This has the following undesirable side-effects:

PmcTools/CallchainCaptureAndAnalysis (last edited 2008-06-17T21:38:01+0000 by anonymous)