PmcTools/PmcHardwareHowTo

This wiki page briefly describes the process of adding support for a new class of PMC to FreeBSD's hwpmc(4)/libpmc(3). It is meant to be read in conjunction with the code in sys/dev/hwpmc/*.

Contents

Quick Q&A
Adding support for a new kind of PMC
1. Userland Changes
2. The Kernel Driver
  1. PMC allocation
  2. Sampling

Quick Q&A

What are hwpmc(4), pmc(3) and pmcstat(8)?
hwpmc(4) and pmc(3) together make up a platform created to research tools that use the in-CPU performance monitoring counters present in modern hardware. A simple command line tool, pmcstat(8), was created as a proof of concept application for the platform.
Recently, Harald Servat ported the PAPI toolkit to the platform thereby enabling a large number of existing performance monitoring applications to run on FreeBSD.
What are the design goals of this platform?
These are:
1. Safety, security, correctness and reliability
  PMCs should be safe to be used by untrusted userland processes; the hwpmc(4) driver should enforce safety when accessing processor hardware resources. The framework should conform to the existing security architecture of the OS (i.e., no security holes).
2. Concurrent use of PMCs by multiple processes
  The framework should allow multiple processes to use PMCs in as transparent a manner as possible, multiplexing hardware as needed.
3. Ease of use
  Specification of PMC events should be easy and intuitive to the user. libpmc(3) allows users to specify measurement events using names that are close to those in vendor documentation. Additional event modifiers may be specified using a comma-separated "name=value" syntax.
  PMC architectures vary widely. Specifying PMC events and modifiers using ASCII strings is a flexible way to accommodate a wide range of PMC architectures.

Adding support for a new kind of PMC

Adding support for a new PMC class involves the following steps:

Read the hwpmc(4) and pmc(3) manual pages to understand how everything works and to get an idea of the kind of documentation expected of a new PMC.
Read vendor documentation for the list of PMC measurements supported, and to understand the constraints imposed by hardware.
Teach userland to parse event names and event modifiers specified in vendor documentation and pass them to hwpmc(4).
Incorporate a new MD layer into hwpmc(4) if needed.

The recommended sequence is to start with userland and to tackle the kernel driver after that.

Userland Changes

[ See "lib/libpmc/libpmc.c", "sys/sys/pmc.h" and "sys/dev/hwpmc/pmc_events.h". ]

You would need to add new values to the 'enum pmc_cputype' and 'enum pmc_class' enumerations for your new PMC (see <sys/pmc.h>).

Users specify PMC measurement events and modifiers using ASCII strings. libpmc needs to convert this string representation to a 'PMC_OP_PMCALLOCATE' request to hwpmc(4).

PMC events are described using the __PMC_EV() macro in "sys/dev/hwpmc/pmc_events.h". There should be one such description for each measurement event listed in vendor documentation. Most PMC implementations can measure a moderate to large number of events (in the low hundreds).
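One common way to keep such an event list in a single place is the "X-macro" pattern: the list is written once as a series of macro invocations and expanded several times under different definitions of the macro. The sketch below only illustrates that idea. The class `DEMO`, the event names, and the helpers `demo_event_name()` and `demo_event_lookup()` are all hypothetical, and the real definitions in "pmc_events.h" may be organized differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical event list for an imaginary PMC class "DEMO". */
#define DEMO_EVENTS()				\
	DEMO_EV(DEMO, CYCLES)			\
	DEMO_EV(DEMO, INSTRUCTIONS_RETIRED)	\
	DEMO_EV(DEMO, DCACHE_MISSES)

/* Expansion 1: one enumerator per event. */
#define DEMO_EV(C, N) DEMO_EVENT_##C##_##N,
enum demo_event {
	DEMO_EVENTS()
	DEMO_EVENT_COUNT
};
#undef DEMO_EV

/* Expansion 2: a parallel table of printable names. */
#define DEMO_EV(C, N) #N,
static const char *const demo_event_names[] = {
	DEMO_EVENTS()
};
#undef DEMO_EV

/* Map an event code to its name; NULL if the code is out of range. */
static const char *
demo_event_name(int ev)
{
	if (ev < 0 || ev >= DEMO_EVENT_COUNT)
		return (NULL);
	return (demo_event_names[ev]);
}

/* Map a name to its event code; -1 if the name is unknown. */
static int
demo_event_lookup(const char *name)
{
	int ev;

	assert(name != NULL);
	for (ev = 0; ev < DEMO_EVENT_COUNT; ev++)
		if (strcmp(name, demo_event_names[ev]) == 0)
			return (ev);
	return (-1);
}
```

Because both expansions come from the same list, adding an event means adding one line, and the enumeration and the name table cannot drift apart.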

You should use event names that are "close" to the names in vendor documentation.

PMC measurements usually support additional modifiers, some common to all possible measurements and some specific to the measurement event in question. You will need to add a parser for these modifiers to libpmc and ensure that your parser rejects any modifier combinations that the vendor's documentation states are illegal.

Here too, choose names for modifiers that are close to vendor names.
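As a rough illustration of what such a parser might look like, the sketch below splits an event specifier of the form "EVENT,name=value,flag" and rejects unknown modifiers, out-of-range values, and one illegal combination. It is not the libpmc code: `struct demo_spec`, the modifier names, and the rule that "inv" requires a nonzero "count" are assumptions made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of an event specifier. */
struct demo_spec {
	char     event[64];	/* event name, before the first comma */
	unsigned mask;		/* "mask=N", 0..255 */
	unsigned count;		/* "count=N", 0..255 */
	int      edge;		/* "edge" flag */
	int      inv;		/* "inv" flag */
};

/* Parse a decimal, octal or 0x-prefixed number in [0, 255]. */
static int
demo_parse_u8(const char *s, unsigned *out)
{
	char *end;
	unsigned long v;

	if (!isdigit((unsigned char)*s))
		return (-1);
	v = strtoul(s, &end, 0);
	if (*end != '\0' || v > 255)
		return (-1);
	*out = (unsigned)v;
	return (0);
}

/* Parse "EVENT[,modifier...]" into *sp; 0 on success, -1 on error. */
static int
demo_parse_spec(const char *spec, struct demo_spec *sp)
{
	char buf[128], *p, *tok;

	assert(spec != NULL && sp != NULL);
	if (strlen(spec) >= sizeof(buf))
		return (-1);
	strcpy(buf, spec);
	memset(sp, 0, sizeof(*sp));

	/* The event name is everything up to the first comma. */
	p = strchr(buf, ',');
	if (p != NULL)
		*p++ = '\0';
	if (buf[0] == '\0' || strlen(buf) >= sizeof(sp->event))
		return (-1);
	strcpy(sp->event, buf);

	/* Walk the comma-separated modifiers. */
	while (p != NULL) {
		tok = p;
		p = strchr(tok, ',');
		if (p != NULL)
			*p++ = '\0';
		if (strncmp(tok, "mask=", 5) == 0) {
			if (demo_parse_u8(tok + 5, &sp->mask) != 0)
				return (-1);
		} else if (strncmp(tok, "count=", 6) == 0) {
			if (demo_parse_u8(tok + 6, &sp->count) != 0)
				return (-1);
		} else if (strcmp(tok, "edge") == 0)
			sp->edge = 1;
		else if (strcmp(tok, "inv") == 0)
			sp->inv = 1;
		else
			return (-1);	/* unknown or empty modifier */
	}

	/* Made-up vendor rule: "inv" is only legal with a nonzero count. */
	if (sp->inv && sp->count == 0)
		return (-1);
	return (0);
}
```

With these assumptions, "CYCLES,mask=0x3,count=2,inv" parses successfully, while "CYCLES,inv" and "CYCLES,bogus=1" are rejected.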

You would then need to document each possible measurement event and its associated set of legal modifiers in the pmc(3) manual page. Keep the list in pmc(3) sorted alphabetically.

The PMC-independent parts of the allocate request are described by a 'struct pmc_op_pmcallocate' (see <sys/pmc.h>). The PMC-dependent parts of the allocation request are encoded in binary form in 'union pmc_md_op_pmcallocate' (see <machine/pmc_mdep.h>). Choose a PMC-dependent representation that hwpmc(4) can use later with as little extra work as possible.

The Kernel Driver

[ See "sys/dev/hwpmc/*.[ch]" and 'struct pmc_mdep' in <sys/pmc.h> ]

Major chunks of a new driver include:

initialization & teardown	`(*pmd_{init,cleanup})()`
dealing with PMC allocation	`(*pmd_allocate_pmc)()`
configuring a PMC onto hardware	`(*pmd_config_pmc)()`
starting & stopping a PMC	`(*pmd_{start,stop}_pmc)()`
reading and writing a PMC's value	`(*pmd_{read,write}_pmc)()`
handling a PMC interrupt	`(*pmd_intr)()`
special processing for context switches	`(*pmd_switch_{in,out})()`

These functions deal with hardware, leaving the higher level 'generic' layers to deal with the 'abstract' PMCs that are exposed to applications.

Of these, PMC allocation and interrupt handling are described below. The others are relatively straightforward and can be understood by browsing existing code in "sys/dev/hwpmc/*.c".

PMC allocation

PMC allocation requests are passed to the PMC dependent layer for vetting (see the *pmd_allocate_pmc() function pointer in 'struct pmc_mdep'). The generic code in "hwpmc_mod.c" calls into the PMC dependent layer once for each PMC resource present in the CPU, passing in the parameters for the allocation request. The PMC dependent function should return '0' if allocation is ok for the specified hardware resource.

The PMC dependent allocation function needs to ensure that the allocation request is legal for the given hardware resource and also 'safe' in the sense of not violating any constraints documented by the vendor. Examples of such checks include:

Whether the desired PMC event is supported by the current CPU model and revision, per the manufacturer's documentation.
Whether the desired hardware event can be measured by the named hardware resource. On some processors, certain hardware events can only be measured on specific hardware counters.
Whether the set of additional PMC dependent bits specified by the PMC dependent parts of the 'struct pmc_op_pmcallocate' request (i.e., the binary representation of the "modifiers") is legal for the specified hardware resource and for the event being measured on it.
Whether all architectural constraints are satisfied (e.g., the special programming needs for multi-core CPUs are taken care of).

If all is well, this function can precompute useful information and store this inside the 'struct pmc' associated with the allocation request.
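The vetting described above can be sketched as a per-resource check plus a generic loop that offers the request to each hardware resource in turn. Everything in the sketch is hypothetical: the event table, the constraint fields, `struct demo_pmc`, and the function names are stand-ins chosen for the example, not the hwpmc(4) implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NPMCS 4

/* Hypothetical per-event constraints, as if taken from vendor documentation. */
struct demo_event_desc {
	uint8_t code;		/* hardware event code */
	uint8_t legal_umask;	/* unit-mask bits the event accepts */
	uint8_t counters;	/* bitmask of counters that can count it */
};

static const struct demo_event_desc demo_events[] = {
	{ 0x10, 0x00, 0x01 },	/* countable on counter 0 only */
	{ 0x41, 0x1f, 0x0f },	/* countable on any counter */
};

/* State precomputed at allocation time for later use by start/stop. */
struct demo_pmc {
	uint32_t config;
};

/* Vet one request against one hardware resource; 0 means "acceptable". */
static int
demo_allocate_pmc(int ri, uint8_t code, uint8_t umask, struct demo_pmc *pm)
{
	const size_t n = sizeof(demo_events) / sizeof(demo_events[0]);
	const struct demo_event_desc *ev;
	size_t i;

	assert(pm != NULL);
	if (ri < 0 || ri >= DEMO_NPMCS)
		return (EINVAL);
	for (i = 0; i < n; i++)
		if (demo_events[i].code == code)
			break;
	if (i == n)
		return (EINVAL);	/* event not supported on this CPU */
	ev = &demo_events[i];
	if ((ev->counters & (1u << ri)) == 0)
		return (EINVAL);	/* event not countable on this counter */
	if ((umask & ~ev->legal_umask) != 0)
		return (EINVAL);	/* illegal modifier bits for this event */

	/* All checks passed: precompute the control word. */
	pm->config = (uint32_t)code | ((uint32_t)umask << 8);
	return (0);
}

/* Generic-layer loop: offer the request to each free row until one accepts. */
static int
demo_find_row(uint8_t code, uint8_t umask, const int *busy,
    struct demo_pmc *pm)
{
	int ri;

	for (ri = 0; ri < DEMO_NPMCS; ri++) {
		if (busy[ri])
			continue;
		if (demo_allocate_pmc(ri, code, umask, pm) == 0)
			return (ri);
	}
	return (-1);
}
```

In this toy model an event restricted to counter 0 cannot be allocated once counter 0 is busy, while an unrestricted event simply lands on the next free counter.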

Sampling

The PMC dependent function ((*pmd_intr)() in 'struct pmc_mdep') is called when the PMC interrupts the CPU. It needs to check which PMC on the current CPU caused the interrupt; it then needs to invoke the 'generic' processing routine 'pmc_process_interrupt()'. If the generic processing routine indicates that the sample/callchain was successfully recorded, the PMC dependent function needs to re-arm the interrupting PMC; otherwise it should leave the interrupting PMC unarmed (the higher level layer will re-arm the PMC when the congestion clears).
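The control flow of such a handler can be sketched with simulated hardware state. In the sketch below, `demo_record_sample()` is a hypothetical stand-in for the generic processing routine (its real signature is not reproduced here), and the `overflowed`, `armed` and `reload` fields stand in for whatever status and counter registers the actual hardware provides.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NPMCS   4
#define DEMO_BUFSIZE 2

struct demo_hwpmc {
	int      sampling;	/* allocated in sampling mode */
	int      overflowed;	/* stand-in for a hardware overflow bit */
	int      armed;		/* counter is running and may interrupt */
	uint64_t reload;	/* value to write when re-arming */
	uint64_t value;		/* stand-in for the counter register */
};

static struct demo_hwpmc demo_pmcs[DEMO_NPMCS];
static int demo_samples_buffered;

/* Stand-in for the generic sample recorder; nonzero means "buffer full". */
static int
demo_record_sample(int ri)
{
	assert(ri >= 0 && ri < DEMO_NPMCS);
	if (demo_samples_buffered >= DEMO_BUFSIZE)
		return (-1);
	demo_samples_buffered++;
	return (0);
}

/* Returns the number of PMCs serviced; 0 means the interrupt was not ours. */
static int
demo_intr(void)
{
	int ri, claimed = 0;

	for (ri = 0; ri < DEMO_NPMCS; ri++) {
		struct demo_hwpmc *hp = &demo_pmcs[ri];

		/* Skip PMCs that cannot be the source of this interrupt. */
		if (!hp->sampling || !hp->overflowed)
			continue;
		claimed++;
		hp->overflowed = 0;	/* acknowledge the overflow */
		hp->armed = 0;
		if (demo_record_sample(ri) == 0) {
			hp->value = hp->reload;	/* re-arm for the next sample */
			hp->armed = 1;
		}
		/* Otherwise stay unarmed until the generic layer restarts us. */
	}
	return (claimed);
}
```

Returning the number of serviced PMCs lets the caller tell whether the interrupt belonged to the PMC code at all, which matters when the interrupt line is shared with other sources.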

On the x86 (i386, amd64) architecture, PMCs interrupt using an NMI. This means that on these architectures your handler could be invoked at any time, including when the CPU is in the middle of a critical section. The handler routine therefore cannot rely on the kernel's state being consistent. It also cannot use any of the kernel's normal synchronization primitives. Thus this routine needs to be carefully coded; usually you can rely on the processor's memory accesses being strongly ordered with respect to itself, but check the processor's documentation for the other cases.

Your code should be safe in the following scenarios:

When multiple PMCs on a CPU interrupt in close succession (i.e., handling of nested/back-to-back NMIs).
When different CPUs on the system are concurrently executing your (NMI) handler.
When the CPU is already in the PMC code (say it was servicing a 'pmc_write()' on the current CPU) when the sampling interrupt was taken.

Another point to keep in mind is that accessing CPU MSRs (model-specific registers) is usually expensive, as most architectures define an MSR read or write as a "synchronizing" instruction. So when searching for interrupting PMCs you should first eliminate other possibilities before looking at PMC status registers.