This describes the mechanism used on a variety of platforms (ARM, MIPS, PowerPC) for handling "complicated" interrupt topologies involving multiple interrupt controllers attached at random throughout the device tree. This is targeted in particular at systems using either flattened device trees or CHRP-style Open Firmware (from which FDT is derived), but is generally applicable. Linux uses a similar mechanism on at least PowerPC.
Overview of mechanism
The core part of this system is a registry in machine-dependent code that maps some description of an interrupt to an IRQ number used by the rest of the kernel. This number is arbitrary; on systems in which a useful human-readable number can be extracted in a general way from the description, it is helpful for users for the IRQ number to be related to something about the system (e.g. the interrupt pin on single-controller systems) but it can be just a monotonically increasing integer.
Currently one interrupt mapping strategy is implemented: Open Firmware (or FDT) interrupt-parent / interrupt specifier tuples to IRQ. Bus code maps the Open Firmware interrupt specifier using the ofw_bus_map_intr() function, which is cascaded through the bus hierarchy and is usually resolved by nexus.
int ofw_bus_map_intr(device_t dev, phandle_t iparent, int icells, pcell_t *intr);
This takes the requesting device, the xref phandle of the interrupt parent (e.g. from the "interrupt-parent" property, or the equivalent entry in an interrupt-map) and the byte string describing the interrupt (e.g. the contents of the "interrupts" property, or the equivalent entry in an interrupt-map) and returns a unique IRQ number that can be added to a resource list and used with bus_alloc_resource(), bus_setup_intr(), etc.
Under the hood, this calls into machine-dependent code. The most general implementations (e.g. on PowerPC) have a mapping table that records the assigned IRQ and the interrupt specifier data. As PICs attach themselves during newbus discovery, they register their interest in a particular iparent value with the machine-dependent interrupt layer. Eventually (potentially at attach time, but at the latest by configure_final()) each PIC is passed the section of the table corresponding to its registered iparent value. It is then asked to configure each interrupt as specified by the PIC-dependent byte string and to send interrupts matching that descriptor to the assigned IRQ.
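The registration and replay described above can be sketched in user-space C. This is only an illustration under stated assumptions: the names md_record_mapping(), md_register_pic(), struct irq_map_entry and the demo_pic callback are hypothetical, the typedefs are stand-ins for the kernel types, and the fixed-size table is a simplification rather than what any real port does.

```c
#include <stdint.h>
#include <string.h>

#define MAX_IRQS  16
#define MAX_CELLS 4

typedef uint32_t phandle_t;   /* stand-in for the kernel type */
typedef uint32_t pcell_t;     /* stand-in for the kernel type */

/* One row of the hypothetical machine-dependent mapping table. */
struct irq_map_entry {
    int       irq;              /* IRQ number handed to the rest of the kernel */
    phandle_t iparent;          /* xref phandle of the interrupt parent */
    int       ncells;           /* length of the specifier in cells */
    pcell_t   spec[MAX_CELLS];  /* opaque, PIC-specific specifier */
    int       configured;       /* set once the PIC has been told about it */
};

/* Callback supplied by a PIC driver: "route this specifier to this IRQ". */
typedef void (*pic_map_fn)(void *softc, int irq, int ncells,
    const pcell_t *spec);

static struct irq_map_entry irq_table[MAX_IRQS];
static int irq_count;

/* Record a mapping; this may happen long before the PIC attaches. */
int
md_record_mapping(int irq, phandle_t iparent, int ncells, const pcell_t *spec)
{
    struct irq_map_entry *e;

    if (irq_count == MAX_IRQS || ncells < 1 || ncells > MAX_CELLS)
        return (-1);
    e = &irq_table[irq_count++];
    e->irq = irq;
    e->iparent = iparent;
    e->ncells = ncells;
    memcpy(e->spec, spec, ncells * sizeof(pcell_t));
    e->configured = 0;
    return (0);
}

/*
 * Called by a PIC when it attaches. Every not-yet-configured entry that
 * names this iparent is replayed to the driver. Returns the number of
 * entries handed over.
 */
int
md_register_pic(phandle_t iparent, pic_map_fn map, void *softc)
{
    int i, n = 0;

    for (i = 0; i < irq_count; i++) {
        struct irq_map_entry *e = &irq_table[i];

        if (e->iparent != iparent || e->configured)
            continue;
        map(softc, e->irq, e->ncells, e->spec);
        e->configured = 1;
        n++;
    }
    return (n);
}

/* Hypothetical PIC driver state and callback, used only for illustration. */
struct demo_pic {
    int nmapped;
    int last_irq;
};

void
demo_pic_map(void *softc, int irq, int ncells, const pcell_t *spec)
{
    struct demo_pic *pic = softc;

    (void)ncells;
    (void)spec;   /* a real driver would decode its own encoding here */
    pic->nmapped++;
    pic->last_irq = irq;
}
```

In this sketch a PIC that attaches late simply receives everything recorded for its iparent in one pass, which is why the attach order of controllers stops mattering.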
To be used with other enumeration mechanisms (e.g. ACPI), you could either abuse the function above or add another function (e.g. acpi_bus_map_intr) that takes the appropriate information and conveys it to the machine-dependent code appropriately (e.g. it would take a struct acpi_resource_extended_irq instead of an interrupt parent code and specifier).
This system is designed to support, without conflicts, systems with multiple interrupt controllers that have overlapping ranges of interrupt pins and that attach in an order unrelated to the need for the interrupts in the bus hierarchy. Because the interrupt hierarchy is unrelated to the bus hierarchy -- it can involve lateral traversals, can flow the wrong way on branches (i.e. bus parents can depend on interrupts provided by children), and devices can have multiple interrupt parents -- this cannot be done purely with bus methods.
There are three requirements here for any system that handles the complexity of interrupt routing allowed by CHRP/PAPR and the FDT spec:
- Map any N-byte (usually 4-12 for device trees) interrupt specifier + interrupt parent combination onto some globally unique 32-bit integer. The uniqueness and 32-bit requirements come from a number of places in the kernel: the PCI MSI code and rman in particular, though there are other places.
- Allow bus_alloc_resource() and bus_setup_intr() to succeed in early boot even if the interrupt parent is not yet attached. This is required to break what would otherwise be dependency loops involving devices with interrupts handled by their bus children.
- Make the interrupt controller driver aware of the full N-byte string on attach, as it can encode flags (e.g. trigger mode and level) needed for interrupt setup that are encoded in a way specific to the model of interrupt controller.
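To illustrate the third requirement, here is a minimal sketch of a PIC driver decoding its own specifier. The encoding is invented for the example (cell 0 is the pin, cell 1 carries sense flags) and does not describe any real controller; examplepic_decode() and the types around it are hypothetical names.

```c
#include <stdint.h>

typedef uint32_t pcell_t;   /* stand-in for the kernel type */

enum trigger  { TRIG_EDGE, TRIG_LEVEL };
enum polarity { POL_LOW, POL_HIGH };

struct pic_intr_config {
    uint32_t      pin;
    enum trigger  trig;
    enum polarity pol;
};

/*
 * Hypothetical layout, known only to this driver:
 *   cell 0        interrupt pin
 *   cell 1 bit 0  1 = level triggered, 0 = edge triggered
 *   cell 1 bit 1  1 = active high,     0 = active low
 * A one-cell specifier gets the defaults (edge, active high).
 */
int
examplepic_decode(int ncells, const pcell_t *spec,
    struct pic_intr_config *out)
{
    if (ncells < 1 || ncells > 2)
        return (-1);

    out->pin = spec[0];
    out->trig = TRIG_EDGE;
    out->pol = POL_HIGH;
    if (ncells == 2) {
        out->trig = (spec[1] & 0x1) ? TRIG_LEVEL : TRIG_EDGE;
        out->pol = (spec[1] & 0x2) ? POL_HIGH : POL_LOW;
    }
    return (0);
}
```

Only the driver for that controller model contains this knowledge; the generic layer passes the cells through untouched.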
To make this concrete, here are some examples of the kinds of things that drove the current API.
Case 1: the G5 Powermac
Apple PowerMac G5s have two PICs, one cascaded from the other. PIC 1 lives in the northbridge, and PIC 2 lives on a device on the PCI bus behind a couple of PCI<->PCI and HyperTransport bridges. Depending on the era of the hardware, these are cascaded in different directions.
This presents three puzzles:
- How do I represent interrupts on the PCI bus parent of PIC 2 that are handled by PIC 2? PIC 2 obviously can't attach before its bus parent, but the bus parent can't complete initialization without the ability to set up its interrupts.
- Devices on the PCI bus have interrupts handled by a mixture of PIC 1 and PIC 2, sometimes on the same device and not always expressible through the bus hierarchy. For example, one of the two storage controllers has an interrupt on PIC 2 run through a wire that doesn't go through the PCI connection and so isn't in the interrupt-map of the PCI bus, which is wired (mostly) to PIC 1, and about which the parent bus can and should know nothing.
- Both PICs are OpenPIC devices. In a multipass-type framework, how do you ensure that the attach order of the two OpenPICs is correct? The two OpenPIC devices are distinguished only by one of them (which one depends on the model of the machine) having an interrupts property and the other not and are implemented by the same driver.
These are solved by allowing bus_setup_intr() etc. to succeed before the relevant PIC has attached, which breaks the circular dependency in (1). This also fixes (3) by removing any requirement on the attach order of the two PICs. Because devices are free to modify their own resource lists and can call ofw_bus_map_intr() themselves, the ATA driver is free to allocate its own interrupt to satisfy (2).
Case 2: IBM OPAL firmware
The /ibm,opal device on IBM PowerNV systems has a non-standard interrupts property ("opal-interrupts") that contains the list of IRQs that should be forwarded to the firmware. These are not interrupts belonging to a single physical device at /ibm,opal (which is a virtual device anyway) and so are not in the interrupts property; nor do they necessarily share an interrupt parent. How do I represent this?
This is solved by making the interrupt tree independent of the bus tree: since the parent doesn't need to know anything about an interrupt after assignment time, the child can call ofw_bus_map_intr() and add its special interrupts directly to its resource list using the appropriate newbus methods. After this, it behaves normally.
Case 3: IBM XICS interrupts
On virtualized (and most non-virtualized) IBM hardware, the interrupts are one cell and that single cell encodes the interrupt parent, the line sense, and the IRQ. It is not possible to disentangle in a generic way the interrupt parameters from pin number or handling IC since that depends on firmware behavior.
This is solved by treating the interrupt specifier as fully opaque: with no required inference about contents, an inability to infer contents doesn't matter. This would break systems that expect to be able to disentangle a "true" IRQ number and config flags.
Case 4: MSIs
MSIs are assigned purely by the PCI bus and the PCI bus parent can't know about them from the device tree. How does the bus parent sensibly decorate resources like this? The PCI MSI API assumes that these all exist purely as 32-bit integers and they are not assigned through resource lists in the conventional fashion.
These are solved by allowing the IRQ to be a single scalar number and making knowledge of the routing independent of the newbus hierarchy after assignment time. The PCI layer gets a number like it wants and handles it appropriately after assignment by the device-tree-aware PCI host bridge driver.
Case 5: PCI LSIs
The PCI LSI API (PCIB_ROUTE_INTERRUPT()) applies when it is possible to match a device's unit address and interrupt pin against the interrupt-map property of the PCI bridge, which yields an interrupt specifier and an interrupt parent. The output of this routine is a single number (the IRQ). By mapping the specifier to this IRQ using ofw_bus_map_intr(), the PCI code can route the interrupt properly with no further modifications.
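The matching step can be sketched as a lookup over an already parsed interrupt-map. This is an assumption-laden illustration: struct imap_entry and imap_lookup() are hypothetical, the map is assumed to have been decoded into fixed-size rows beforehand, and the mask plays the role of the interrupt-map-mask property (three PCI address cells plus one pin cell).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t phandle_t;   /* stand-in for the kernel type */
typedef uint32_t pcell_t;     /* stand-in for the kernel type */

#define PCI_ADDR_CELLS   3
#define MAX_PARENT_CELLS 4

/* One pre-parsed row of a PCI bridge's interrupt-map (hypothetical). */
struct imap_entry {
    pcell_t   unit[PCI_ADDR_CELLS];   /* child unit address */
    pcell_t   pin;                    /* child interrupt pin, INTA = 1 */
    phandle_t iparent;                /* interrupt parent for this row */
    int       ncells;                 /* parent specifier length */
    pcell_t   spec[MAX_PARENT_CELLS]; /* parent specifier, opaque here */
};

/*
 * Find the row matching (unit, pin) after applying the mask to both
 * sides. Returns NULL if nothing matches.
 */
const struct imap_entry *
imap_lookup(const struct imap_entry *map, size_t nentries,
    const pcell_t mask[PCI_ADDR_CELLS + 1],
    const pcell_t unit[PCI_ADDR_CELLS], pcell_t pin)
{
    size_t i;
    int c, match;

    for (i = 0; i < nentries; i++) {
        match = 1;
        for (c = 0; c < PCI_ADDR_CELLS; c++) {
            if ((unit[c] & mask[c]) != (map[i].unit[c] & mask[c]))
                match = 0;
        }
        if ((pin & mask[PCI_ADDR_CELLS]) !=
            (map[i].pin & mask[PCI_ADDR_CELLS]))
            match = 0;
        if (match)
            return (&map[i]);
    }
    return (NULL);
}
```

The iparent and spec fields of the matching row are what would then be handed to ofw_bus_map_intr() to obtain the IRQ.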
API Control Flow: The life and times of an interrupt descriptor
A bus driver walks the device tree, enumerating its children. As it does so, it reads the interrupt-parent and interrupts properties of its children (or the interrupt-map or...) and collects the xref phandle of the interrupt parent and the N-byte interrupt specifier. It then calls ofw_bus_map_intr(dev, iparent, icells, intr) (or the equivalent for ACPI, etc.), which returns an IRQ number. This is typically implemented by nexus, which calls into some general MD code. The MD code does the following:
- Check if that (parent, specifier) pair has been mapped already.
- If so, return the previously allocated IRQ. Otherwise, allocate a fresh one, potentially based on the specifier in some way for ease of reading dmesg (the first word is usually an interrupt pin, so that's a friendly first guess).
- Add that (IRQ, (parent, specifier)) mapping to an internal table.
- If the relevant PIC is attached already, tell the PIC driver to map the given specifier to the allocated IRQ number. Otherwise, wait until the PIC attaches and tell it to map any previously allocated IRQs when it does.
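The first three steps above can be sketched as follows. This is a user-space illustration, not the actual MD code: md_map_intr(), the fixed-size table, and the "first cell, then count upward" allocation policy are assumptions made for the example. The fourth step is reduced to a configured flag that a PIC hand-off would set.

```c
#include <stdint.h>
#include <string.h>

#define MAX_IRQS  32
#define MAX_CELLS 4

typedef uint32_t phandle_t;   /* stand-in for the kernel type */
typedef uint32_t pcell_t;     /* stand-in for the kernel type */

struct irq_map_entry {
    int       irq;
    phandle_t iparent;
    int       ncells;
    pcell_t   spec[MAX_CELLS];
    int       configured;   /* would be set once the PIC has been told */
};

static struct irq_map_entry irq_table[MAX_IRQS];
static int irq_count;

static int
irq_in_use(int irq)
{
    int i;

    for (i = 0; i < irq_count; i++)
        if (irq_table[i].irq == irq)
            return (1);
    return (0);
}

/* Map an opaque (parent, specifier) pair to a globally unique IRQ. */
int
md_map_intr(phandle_t iparent, int ncells, const pcell_t *spec)
{
    struct irq_map_entry *e;
    int i, irq;

    if (ncells < 1 || ncells > MAX_CELLS)
        return (-1);

    /* Steps 1 and 2a: reuse an existing mapping; cells are only compared. */
    for (i = 0; i < irq_count; i++) {
        e = &irq_table[i];
        if (e->iparent == iparent && e->ncells == ncells &&
            memcmp(e->spec, spec, ncells * sizeof(pcell_t)) == 0)
            return (e->irq);
    }
    if (irq_count == MAX_IRQS)
        return (-1);

    /* Step 2b: friendly first guess from the first cell, then count up. */
    irq = (int)(spec[0] & 0xffff);
    while (irq_in_use(irq))
        irq++;

    /* Step 3: remember the mapping for the PIC to pick up later. */
    e = &irq_table[irq_count++];
    e->irq = irq;
    e->iparent = iparent;
    e->ncells = ncells;
    memcpy(e->spec, spec, ncells * sizeof(pcell_t));
    e->configured = 0;
    return (irq);
}
```

Two controllers that both use pin 5 end up with distinct IRQs here, while asking twice for the same pair returns the same number.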
For interrupts added by other mechanisms (MSI allocation, interrupts added by children rather than their parents, etc.), the control flow is identical and ofw_bus_map_intr() is called by whatever code initially provides an IRQ number (e.g. PCIB_ALLOC_MSI()).
- bus_alloc_resource() does nothing special
- bus_setup_intr(), if the PIC has attached, will unmask and configure the requested interrupt. Otherwise, like mapping (above), the setup request will be queued until the PIC attaches and success returned immediately.
- bus_config_intr() will, if the PIC is attached, set the requested polarity and trigger mode. Otherwise, the request will be queued until the PIC attaches and success returned immediately.
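The queue-until-attach behaviour of bus_setup_intr() and bus_config_intr() can be sketched like this. It is a simplified model with a single PIC; md_setup_intr(), md_config_intr(), md_pic_attach() and the counters standing in for register writes are all hypothetical.

```c
#define MAX_PENDING 16

enum req_kind { REQ_SETUP, REQ_CONFIG };

struct pending_req {
    enum req_kind kind;
    int irq;
    int level;        /* REQ_CONFIG only: 1 = level, 0 = edge */
    int active_high;  /* REQ_CONFIG only */
};

static int pic_attached;
static struct pending_req pending[MAX_PENDING];
static int npending;

/* Stand-ins for hardware register writes, so the effect is observable. */
int hw_unmasked;
int hw_configured;

static void
pic_apply(const struct pending_req *r)
{
    if (r->kind == REQ_SETUP)
        hw_unmasked++;      /* unmask and configure the interrupt */
    else
        hw_configured++;    /* set polarity and trigger mode */
}

/* Apply now if the PIC is there, otherwise queue and report success. */
static int
submit(struct pending_req r)
{
    if (pic_attached) {
        pic_apply(&r);
        return (0);
    }
    if (npending == MAX_PENDING)
        return (-1);
    pending[npending++] = r;
    return (0);
}

int
md_setup_intr(int irq)
{
    struct pending_req r = { REQ_SETUP, irq, 0, 0 };

    return (submit(r));
}

int
md_config_intr(int irq, int level, int active_high)
{
    struct pending_req r = { REQ_CONFIG, irq, level, active_high };

    return (submit(r));
}

/* Called when the PIC finally attaches: drain the queue in order. */
void
md_pic_attach(void)
{
    int i;

    pic_attached = 1;
    for (i = 0; i < npending; i++)
        pic_apply(&pending[i]);
    npending = 0;
}
```

Callers see success in both cases, which is what lets a bus parent finish initialization before the PIC that serves it exists.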
One PIC is designated the "root PIC". On PowerPC, this is determined by the PIC driver itself and is the one PIC that attaches and itself has no interrupts on other PICs. When the MD interrupt code receives a CPU interrupt, it calls a method on the root PIC to dispatch any pending interrupts. The PIC driver examines the registers on the PIC and consults its internal mapping table established during enumeration (on some PICs, like OpenPIC, this is done in hardware) to obtain the corresponding allocated IRQ number. Then it signals an interrupt on that IRQ back to the MD interrupt code.
If several interrupt controllers are cascaded (i.e. the output pin from one controller is attached to an input pin on another), the lower-level PIC registers an interrupt filter handler on its interrupt on the higher-level one using normal newbus mechanisms. This filter runs in the same context as the dispatch routine on the root PIC (and can be, and usually is, the exact same code) and is responsible for doing the same things: check registers, look up IRQ, signal interrupt on IRQ.