Power/Performance Management Session
- Friday 11th July 2014.
Chair: Robin Randhawa from ARM
- Chair's impression: FreeBSD doesn't currently have a power-performance focused set of subsystems the way Linux does. Would be good to know what exists. Perhaps a wiki page tracking all existing support would be the way to go.
- Having grepped the FreeBSD kernel _very_ briefly, the impression is that the x86 kernel port has some support for ACPI based power management (DVFS) but probably focused on some specific hardware implementations.
- Chair's opinion: In this session, we'll get an overview of what support Linux has for power-perf. The hope is that FreeBSD can have a 'clean slate' approach to try and learn from Linux's mistakes/experience.
- The discussion here had contributions from all present but used Robin's description as a central theme.
The situation with Linux
- Power-perf focus has traditionally been more towards performance. Understandable - the original idea was to run fast on uniprocessor, then introduce SMP support, then scalability improvements, multi-socket support, NUMA, HPC etc. So power was always less of a concern. Things evolved first and foremost for the CPU. System aspects weren't considered until much later.
- Consequently, the emphasis on energy awareness has been secondary. The furthest things went was to support ACPI power management for CPU-specific DVFS (dynamic voltage and frequency scaling) as well as CPU-specific idle state management. This was considered adequate in the non-embedded context.
- Over time, especially with the explosion of embedded support and targets, the situation began to change. DVFS and Idle management emerged for embedded devices in a non-ACPI context. Eventually, non CPU specific aspects were also incrementally addressed.
- Unfortunately, power-perf management has evolved in silos and although each silo improves things in the general case over and above the status quo, the silos don't talk to each other well (or at all), which results in lost potential for energy savings and performance. There isn't a truly holistic system-wide view. No consideration is given to system aspects that can impact even what is currently addressed (such as the role of the various interconnects in the system - memory, bus interconnects etc). Even on the CPU side, key micro-architectural scenarios are ignored (the role of the cache hierarchy - cache footprinting, inter-task data dependency detection and management etc). The counter-argument is that a holistic solution that addresses even some of these problem areas can quickly become unmanageable in complexity.
- Ultimately, what is needed is a modular strategy that aims first and foremost for a general system-wide view but allows special-casing/tuning at a per-subsystem/feature level. That way you have a 'big picture' solution and yet have ways to localise things as needed: something that can either support energy-efficient operation (with minimal performance impact) or maximum performance with no intended power saving, just outright performance. Complexity is a trade-off; ideally you want something with few tunables.
Prominent Linux power-perf features/subsystems
- cpufreq is the primary DVFS strategy and is focused entirely on the CPU. The idea is to track the prevailing CPU load by modulating the CPU instruction issue rate. The load is estimated using coarse CPU busy-ness stats [%age time the CPU was active in a given time quantum]. The issue rate is modified using platform-specific mechanisms that aim to change the CPU clock frequency. The more important aspect is that the voltage is modulated as well, and that results in the appreciable potential power savings. Platform designers select a range of voltage and frequency tuples, known as OPPs (Operating Performance Points), after considering the system's capabilities.
- The cpufreq architecture has a platform specific back-end driver that abstracts the platform specific mechanisms used to change the frequency (and the voltage). There is a generic 'front-end' that has configurable policy aspects.
- Cpufreq has several policy 'governors' that govern how load is translated into a choice of OPP. These governors can be switched at run-time. Some aim to track load as closely as possible, some bias towards higher OPPs for performance, some are conservative, and some mix in user-interaction hints to guide OPP selection for improved interactive response.
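- As an illustration, a minimal user-space sketch of an ondemand-style governor decision mapping CPU busy-ness to an OPP. The OPP table, thresholds and headroom factor are hypothetical; this is not the actual cpufreq code:

```c
#include <stdio.h>

struct opp { unsigned int freq_khz; unsigned int microvolts; };

/* Hypothetical OPPs, ordered from slowest to fastest. */
static const struct opp opps[] = {
    {  500000,  900000 },
    { 1000000, 1000000 },
    { 1500000, 1150000 },
};
#define NR_OPPS (sizeof(opps) / sizeof(opps[0]))

/*
 * Pick an OPP from the CPU's busy percentage over the last sampling
 * period: jump straight to the top OPP above an "up" threshold,
 * otherwise pick the lowest OPP that covers the estimated demand.
 */
static const struct opp *select_opp(unsigned int busy_pct,
                                    unsigned int cur_freq_khz)
{
    unsigned int demand_khz;
    size_t i;

    if (busy_pct >= 80)                         /* hypothetical threshold */
        return &opps[NR_OPPS - 1];

    /* Demand in kHz: busy fraction of the current frequency plus ~25% headroom. */
    demand_khz = cur_freq_khz / 100 * busy_pct * 125 / 100;

    for (i = 0; i < NR_OPPS; i++)
        if (opps[i].freq_khz >= demand_khz)
            return &opps[i];
    return &opps[NR_OPPS - 1];
}

int main(void)
{
    const struct opp *o = select_opp(35, 1500000);
    printf("35%% busy at 1.5 GHz -> %u kHz\n", o->freq_khz);
    return 0;
}
```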
- Cpufreq has a topology input requirement. CPUs may be 'coupled' in their DVFS architecture in implementation-specific ways: in some implementations each CPU in a cluster can traverse the range of OPPs independently of its siblings, while in others all CPUs in a cluster can only traverse OPPs together. The DVFS management subsystem needs to know this so that it can aggregate the load across coupled CPUs and request OPP transitions accordingly.
- Load calculation on CPUs doesn't take into account the nature of the work the CPU is doing. All that is usually considered is how busy the CPU was (roughly, how long its runqueue was non-empty). In many cases this results in wasteful operation. For example, if the CPUs are running a memory-bound workload (like a large memset/memcpy operation), cpufreq assumes the load on the CPU is high and raises the frequency, pushing operation into the higher OPPs. This is wasteful: in such a scenario the CPU is likely to stall on the memory interface and will not make forward progress any faster even though the instruction issue rate has been raised.
- Other non-CPU root causes can also skew task residency on CPUs (and consequently load), for example data-dependent tasks running on separate packages/clusters. There is always a strong implementation-specific factor in such cases that needs to be taken into account.
- Load-average based calculations may not always result in the most appropriate OPP selection for tasks. Some governors play it safe by selecting higher OPPs by default before falling back to a more appropriate OPP for the CPU load, which can waste power. Given that reasonably accurate per-task load stats are available, it would be worth evaluating whether per-task load history can be used to select OPPs for tasks being scheduled in. There is a tradeoff between scheduling intervals and the time taken to physically change OPPs.
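- A rough sketch of that idea: sum the tracked load of the runnable tasks on a CPU at schedule-in time and request a matching frequency immediately, rather than waiting for the next sampling period. The 0..1024 load scale and the interface are assumptions for illustration:

```c
#include <stdio.h>

/* Hypothetical per-task tracked load on a 0..1024 scale. */
struct task_hint { unsigned int load; };

static unsigned int freq_for_runqueue(const struct task_hint *tasks, int nr,
                                      unsigned int max_freq_khz)
{
    unsigned int sum = 0;
    for (int i = 0; i < nr; i++)
        sum += tasks[i].load;
    if (sum > 1024)
        sum = 1024;
    /* Request the fraction of the maximum frequency that the queued load implies. */
    return (unsigned int)((unsigned long long)max_freq_khz * sum / 1024);
}

int main(void)
{
    struct task_hint rq[] = { { 300 }, { 200 } };   /* two lightish tasks */
    printf("%u kHz\n", freq_for_runqueue(rq, 2, 2000000));
    return 0;
}
```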
- Modern CPUs usually have idle states that are ordered by decreasing power cost and/or by increasing entry and exit latency. Shallow states are quickly entered and exited but save less power than deeper states (which take more time to enter and exit). Examples of typical states in an ARM context would be clock gate on WFI, CPU power gate, cluster power gate, L2 retention.
- cpuidle translates idle opportunities (when the CPUs have nothing to run) into implementation-specific CPU idle state selection. The aim is to enter the deepest state possible while minimising performance penalties. This is done by weighing each state's entry and exit latencies against the expected idling opportunity. The idling opportunity is deduced from known CPU wakeup instances (queued timers) combined with a prediction of the likelihood of asynchronous wakeups (using various techniques, such as keeping track of past wakeups from asynchronous sources).
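- An illustrative sketch of that selection step (not the actual cpuidle governor code): pick the deepest state whose target residency fits the predicted idle duration and whose exit latency respects any latency limit. The state table and numbers are hypothetical:

```c
#include <stdio.h>

struct idle_state {
    const char  *name;
    unsigned int exit_latency_us;      /* time to become available again */
    unsigned int target_residency_us;  /* minimum stay for a net energy win */
};

static const struct idle_state states[] = {
    { "WFI (clock gate)",     1,    5 },
    { "CPU power gate",     150,  500 },
    { "Cluster power gate", 800, 3000 },
};
#define NR_STATES (sizeof(states) / sizeof(states[0]))

static int pick_state(unsigned int predicted_idle_us,
                      unsigned int latency_limit_us)
{
    int best = 0;   /* the shallowest state is always acceptable */
    for (int i = 1; i < (int)NR_STATES; i++)
        if (states[i].target_residency_us <= predicted_idle_us &&
            states[i].exit_latency_us <= latency_limit_us)
            best = i;
    return best;
}

int main(void)
{
    /* 1.2 ms of predicted idle time, 400 us wakeup latency limit. */
    printf("selected: %s\n", states[pick_state(1200, 400)].name);
    return 0;
}
```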
- cpuidle has a platform/CPU specific back-end driver and a generic front-end that has policy governors.
- cpuidle has topological input requirements too. CPUs within a cluster tend to be coupled in that certain system-specific deeper sleep states cannot be attained unless all coupled CPUs co-ordinate their entry into the state. Usually this coupling exists at the level-2 cache boundary, with CPUs sharing an L2 being coupled. Each coupled CPU needs to synchronise its L1 caches with memory while the 'last man standing' is charged with the L2 synchronisation.
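- A sketch of that 'last man standing' co-ordination, simulated sequentially in user space. Real implementations involve firmware and carefully ordered cache maintenance; the function names here are stand-ins:

```c
#include <stdatomic.h>
#include <stdio.h>

#define CPUS_PER_CLUSTER 4

static atomic_int cpus_in_idle;

/* Stand-ins for arch/platform specific operations. */
static void flush_l1(int cpu) { printf("cpu%d: clean/invalidate L1\n", cpu); }
static void flush_l2(void)    { printf("last CPU: clean/invalidate L2, gate cluster\n"); }

static void enter_cluster_idle(int cpu)
{
    flush_l1(cpu);   /* every CPU synchronises its own L1 with memory */

    /* The last CPU to arrive handles the shared L2 and the cluster gate. */
    if (atomic_fetch_add(&cpus_in_idle, 1) + 1 == CPUS_PER_CLUSTER)
        flush_l2();
    else
        printf("cpu%d: power gate self, wait for cluster\n", cpu);
}

int main(void)
{
    for (int cpu = 0; cpu < CPUS_PER_CLUSTER; cpu++)
        enter_cluster_idle(cpu);
    return 0;
}
```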
- Thanks to silo-isation, the idle state of CPUs isn't available to other subsystems that can benefit from this information. For example, the task scheduler might assume availability of a CPU for running a performance critical task which will suffer if the CPU is in a deep sleep state.
- CPU hotplug: a subsystem that was originally intended as a fault-tolerance/fail-safe strategy but often gets abused for power management.
- Some PowerPC silicon could give system software a brown-out indication. The idea was that given such an indication, the OS could safely migrate all essential state away from the CPU in question. This would involve moving tasks, interrupts, timers, deferred work etc away from this CPU. Also essential context synchronisation would be needed (cache maintenance etc) to get context safely transferred away from this CPU to memory. Essential OS data structures would be modified to essentially make this CPU unavailable thereafter. Once done, the CPU could be physically powered off.
- Over time, this framework was extended to be available more generically. Although it may be possible to come up with a way to plug out CPUs from within the kernel itself (based on some sensible utilisation metric) the job of instigating a plug-out action or a plug-in action is left to user-space.
- CPU hotplug has a topological input requirement. Plugging out the last CPU in a physical cluster or socket should power down that cluster.
- The Linux kernel wasn't very good at keeping CPUs quiet. This has improved of late, but vendors paranoid about power savings have traditionally bypassed cpuidle and used CPU hotplug as a sledgehammer to quiesce CPUs.
- Hotplug is costly. Of late this has improved, but there is room for more. Overall, CPU idle state management should be left to cpuidle and the system tuned enough to root out unnecessary wakeup sources. That way cpuidle can provide a far more fine-grained idle management strategy than hotplug can.
- The key scheduling class is CFS (the Completely Fair Scheduler). There are others (the real-time class serving SCHED_FIFO/SCHED_RR and the deadline class serving SCHED_DEADLINE). The class is selected via the scheduling policy and priority, with different priority ranges served by different classes.
- The scheduler needs topological inputs and this is complicated to get right: cache topology (which cores share an L2 - for beneficial task co-location or spreading), power topology (which cores are able to power off/on independently of others - for idle management and for the scheduler to place tasks efficiently), and voltage-domain topology (which cores are part of the same voltage domain and must scale voltage and frequency together - needed for DVFS and efficient task placement). This is all platform/arch specific, but a good implementation should allow these topologies to be suitably expressed, thereby facilitating a common scheduler-centric power-perf management architecture.
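- As a sketch only, the kind of platform-neutral description a scheduler might consume. The structure and flag names are invented for illustration and are not an existing FreeBSD or Linux interface:

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t cpumask_t;               /* bit n set => CPU n is a member */

/* Domain kinds a power-aware scheduler would want to know about. */
#define TD_SHARE_L2     0x1u   /* cache topology: co-locate or spread tasks */
#define TD_POWER_GATE   0x2u   /* power topology: unit can be powered off */
#define TD_SAME_VOLTAGE 0x4u   /* voltage domain: OPPs move together (DVFS) */

struct topo_domain {
    const char  *name;
    cpumask_t    cpus;                     /* CPUs in this domain */
    unsigned int flags;
};

/* Example: two 4-CPU clusters, each sharing an L2 and forming one power
 * and one voltage domain. */
static const struct topo_domain domains[] = {
    { "cluster0", 0x0F, TD_SHARE_L2 | TD_POWER_GATE | TD_SAME_VOLTAGE },
    { "cluster1", 0xF0, TD_SHARE_L2 | TD_POWER_GATE | TD_SAME_VOLTAGE },
};

int main(void)
{
    for (size_t i = 0; i < sizeof(domains) / sizeof(domains[0]); i++)
        printf("%s: cpus=0x%llx flags=0x%x\n", domains[i].name,
               (unsigned long long)domains[i].cpus, domains[i].flags);
    return 0;
}
```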
- I suppose it is appropriate to consider the scheduler in the context of CPU specific PM but the irony is that despite being the best positioned to know what the CPU utilisation is likely to be, the scheduler in Linux isn't wired up to be aware of the energy cost of its decisions.
- As a result, load-balancing decisions often place tasks on CPUs that are not suitable for them. For example, a low-intensity task gets scheduled on a high-performance CPU, or on a CPU already running at a high OPP, resulting in energy wastage. Conversely, a high-intensity task may land on a CPU that is currently in an idle state: the task takes a hit because the CPU isn't immediately available, and even when it does become available, depending on its state, task performance may suffer because of cold caches. (SCHED_MC was a feature that attempted to use CPU topology intermixed with cache topology to pack tasks to save energy or spread them across CPUs to maximise performance. The feature was eventually removed owing to the complexity increase and a difficult-to-prove net benefit.)
- Generally, frameworks such as cpufreq and cpuidle make decisions without consulting the scheduler, which opens things up to miscalculation and misstepping. Load calculation done by the scheduler needs to be scale invariant: 10% load should be 10% at any frequency the CPU is running at. This connection doesn't exist, which means that load calculation is imprecise. The scale invariance needs to consider micro-architectural scales as well. For example, the mix of instructions used by a given workload has its own contribution to the performance scale (and, of course, a consequent power impact based on which functional units in the processor - NEON etc - are exercised).
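- A worked sketch of scale invariance: the same busy time at a lower frequency (or on a lower-capacity CPU) represents proportionally less work, so the raw busy fraction is scaled by frequency and CPU capacity. The 0..1024 scale mirrors common practice; the helper itself is hypothetical:

```c
#include <stdio.h>

#define SCALE 1024u

static unsigned int scaled_load(unsigned int busy_us, unsigned int period_us,
                                unsigned int cur_freq_khz,
                                unsigned int max_freq_khz,
                                unsigned int cpu_capacity /* 0..1024 */)
{
    unsigned long long load = (unsigned long long)busy_us * SCALE / period_us;
    load = load * cur_freq_khz / max_freq_khz;   /* frequency invariance */
    load = load * cpu_capacity / SCALE;          /* CPU/microarch invariance */
    return (unsigned int)load;
}

int main(void)
{
    /* 10% busy at half the maximum frequency on a full-capacity CPU ... */
    printf("%u\n", scaled_load(1000, 10000, 500000, 1000000, 1024));
    /* ... is half the load of 10% busy at full frequency. */
    printf("%u\n", scaled_load(1000, 10000, 1000000, 1000000, 1024));
    return 0;
}
```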
- The scheduler assumes that all processing elements are symmetric. This is a problem for new designs like big.LITTLE but also for conventional scenarios with DVFS affecting compute differently for different CPUs. Or SMT for that matter.
- Code structure is complex - a huge barrier to entry for newcomers. Features have been added over time as things evolved. Despite well-meaning partitions between scheduler classes and features, there is no clear boundary, with features such as NUMA support and throttling spread across areas.
RTPM (Run-Time Power Management)
- A framework that aims to manage the power states of capable devices. Some devices offer a choice between improved service (e.g. latencies on wifi) and power cost.
- Allows bus hierarchies with power-management capability to power off when their child devices are off, etc.
- Interconnect DVFS: born out of sensitivity to off-CPU entities, such as bus and memory interconnects, affecting system power.
- On some systems interconnect frequencies can be manipulated by system software.
- The idea is that the interconnect frequency can track activity on the CPUs using some sensible metric. So this is quite literally cpufreq for non CPU elements.
- Relatively new and not used widely.
- From a design standpoint, it would probably make more sense to have a generic DVFS management framework with instantiations for CPUs, interconnects.
- Not straightforward to characterise and relate CPU activity to utilisation at the interconnect level.
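- Even so, a toy sketch of the 'cpufreq for non-CPU elements' idea: derive a bus frequency from an observed traffic metric (e.g. bytes moved per sampling period, perhaps from a bus PMU counter). The frequency table, headroom and bytes-per-kHz figure are made-up illustrative numbers:

```c
#include <stdio.h>

static const unsigned int bus_freqs_khz[] = { 100000, 200000, 400000 };
#define NR_BUS_FREQS (sizeof(bus_freqs_khz) / sizeof(bus_freqs_khz[0]))

/* Assume each kHz of bus clock can move roughly this many bytes per
 * sampling period (a made-up characterisation figure). */
#define BYTES_PER_KHZ 1000ull

/* Pick the lowest bus frequency that covers the traffic with ~20% headroom. */
static unsigned int pick_bus_freq(unsigned long long bytes_per_period)
{
    unsigned long long needed = bytes_per_period * 120 / 100 / BYTES_PER_KHZ;
    for (size_t i = 0; i < NR_BUS_FREQS; i++)
        if (bus_freqs_khz[i] >= needed)
            return bus_freqs_khz[i];
    return bus_freqs_khz[NR_BUS_FREQS - 1];
}

int main(void)
{
    printf("%u kHz\n", pick_bus_freq(150000000ull));   /* ~150 MB per period */
    return 0;
}
```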
- Memory power management: a framework to permit powering off unused banks of memory on capable systems.
- Implements biased placement of allocations in physical memory to allow more unused memory to be powered off.
- In effect, the amount of usable physical memory can be modified dynamically.
- Relatively new and not used widely.
- A constraint-expression framework that allows device drivers to express constraints (usually latency) which can then be translated into miscellaneous system-specific actions. E.g. a device latency constraint can result in limiting CPU idle states or attainable frequencies (in order to try to guarantee service to the device).
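- In Linux this role is played by the PM QoS framework; the sketch below uses an invented interface just to show the core idea: drivers register the worst-case wakeup latency they can tolerate, and the idle path asks for the strictest (minimum) request before choosing a state:

```c
#include <limits.h>
#include <stdio.h>

#define MAX_REQUESTS 8

static unsigned int latency_req_us[MAX_REQUESTS];
static int nr_requests;

/* A driver registers the worst-case wakeup latency it can tolerate. */
static int latency_constraint_add(unsigned int usec)
{
    if (nr_requests == MAX_REQUESTS)
        return -1;
    latency_req_us[nr_requests] = usec;
    return nr_requests++;
}

/* The effective constraint is the strictest (smallest) request. */
static unsigned int latency_constraint_get(void)
{
    unsigned int limit = UINT_MAX;
    for (int i = 0; i < nr_requests; i++)
        if (latency_req_us[i] < limit)
            limit = latency_req_us[i];
    return limit;
}

int main(void)
{
    latency_constraint_add(2000);   /* e.g. a storage driver */
    latency_constraint_add(300);    /* e.g. a network driver */
    /* cpuidle would now skip any state with exit latency above 300 us. */
    printf("effective limit: %u us\n", latency_constraint_get());
    return 0;
}
```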
- It may be desired to reserve sets of CPUs for work that should not be interrupted by the OS. For example, latency sensitive work that can be affected by the OS' general book-keeping (which may perturb caches etc apart from pre-empting the work in question).
- The kernel supports cpu isolation mechanisms that aim to allow user-space tasks to run largely without the kernel stepping in. The expectation is that the task should not end up invoking kernel services that in turn could result in more kernel involvement on the CPUs in question.
- The problem with the existing CPU isolation support was the kernel being 'noisy' even when told to keep off isolated CPUs. The situation has seemingly improved with the introduction of NOHZ and adaptive NOHZ.
- Power measurement usually requires specialist knowledge: identifying the power rails of interest on target platforms and installing shunt resistors that permit measuring voltage drops, and therefore current.
- Some platforms, such as ARM's TC2 (big.LITTLE), have software readable energy counters for some power domains. These can be scraped at run-time to get pretty accurate power measurements.
- Intel has RAPL (Running Average Power Limit) which exposes energy consumption stats via counters.
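- Whatever the source (TC2 energy counters, RAPL), the usual pattern is to sample a cumulative energy counter around a measurement window and divide by the elapsed time. A sketch, with read_energy_uj() standing in for whatever platform mechanism actually exposes the counter (it is not a real API):

```c
#include <stdio.h>
#include <unistd.h>

/* Placeholder: pretend the platform reports 250 mJ consumed per read. */
static unsigned long long read_energy_uj(void)
{
    static unsigned long long fake;
    return fake += 250000;
}

int main(void)
{
    unsigned long long e0 = read_energy_uj();
    sleep(1);                                   /* 1 s measurement window */
    unsigned long long e1 = read_energy_uj();

    /* microjoules consumed per second == microwatts */
    printf("average power: %.3f W\n", (double)(e1 - e0) / 1e6);
    return 0;
}
```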
Energy aware scheduling
- Questions such as
- How many cores do I need for this workload to run efficiently?
- Can I save more power by running more cores at reduced frequency or fewer cores at higher frequency?
- are difficult to answer without providing the task scheduler with upfront energy costs. The trick, as always, is to keep things simple enough while still being useful. One approach we are taking with Linux is to have a platform-specific energy model whose information is provided up front (such as the energy cost of running CPUs at various OPPs, the energy cost of waking up a CPU, the energy cost of waking up a cluster). This is then used in key load-balance pathways to help the scheduler reason about the likely consequence of a given load-balancing choice (a sketch follows after this group of bullets).
- So when creating new tasks, waking tasks up, periodically balancing load between CPUs, the scheduler can reason about a given task's load profile and suitability of running this task on available CPUs.
- The idea is to try and find a suitable formula that allows the most efficient set of CPUs to run without impacting performance.
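- The sketch referred to above: per-OPP power figures plus CPU wake-up costs, used to compare candidate placements for a chunk of work. All numbers, structures and the cost formula are hypothetical, and the sketch deliberately ignores the fact that the same work takes different amounts of time on different CPUs and OPPs; it only shows the shape of the reasoning:

```c
#include <stdio.h>

struct opp_cost {
    unsigned int freq_khz;
    unsigned int busy_power_mw;   /* power while running at this OPP */
};

struct cpu_model {
    const struct opp_cost *opps;
    int nr_opps;
    unsigned int wakeup_cost_uj;  /* energy to bring the CPU out of idle */
    int idle;                     /* is the CPU currently idle? */
};

/* Estimated energy (uJ) to run busy_us microseconds of work on 'cpu' at
 * OPP index 'opp', including the wake-up cost if the CPU is idle.
 * mW x us / 1000 -> uJ. */
static unsigned long long placement_cost(const struct cpu_model *cpu,
                                         int opp, unsigned int busy_us)
{
    unsigned long long uj =
        (unsigned long long)cpu->opps[opp].busy_power_mw * busy_us / 1000;
    if (cpu->idle)
        uj += cpu->wakeup_cost_uj;
    return uj;
}

int main(void)
{
    static const struct opp_cost little[] = { {  600000,  120 }, { 1000000,  300 } };
    static const struct opp_cost big[]    = { {  800000,  450 }, { 1800000, 1500 } };
    struct cpu_model cpu_little = { little, 2,  200, 0 };   /* already awake */
    struct cpu_model cpu_big    = { big,    2, 1500, 1 };   /* currently idle */

    /* 2 ms of work: keep it on the little CPU at its top OPP, or wake the
     * big CPU and run at its lowest OPP? */
    printf("little: %llu uJ\n", placement_cost(&cpu_little, 1, 2000));
    printf("big:    %llu uJ\n", placement_cost(&cpu_big, 0, 2000));
    return 0;
}
```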
- Generally speaking, mobile systems seem to have ~30% of the energy budget going to the CPU subsystem. There are arguments about whether that is worth optimising for; it almost always is. It's worth noting that optimising for the CPU has a high probability of resulting in incremental improvements system-wide.
- Server designs aimed at networking have good a priori knowledge for partitioning processing-element resources between control and data planes. The usual requirement is to optimise for race-to-idle (run at top speed to complete work). The expectation is that any power-perf management is simple given the relatively less dynamic workloads that run in the control and data partitions.
- Server designs aimed at datacenter type applications tend to run more dynamic workloads but the requirement is usually to not compromise compute so power management shouldn't 'intrude'. Performance biased DVFS is usually considered enough.
- Embedded system workloads especially in the mobile context tend to be more dynamic requiring more attention for power-perf. There are interactive inputs needed here as well (touch boosting - I/O driven hinting to maximise OPPs briefly etc).
- Linux has timer coalescing to reduce wakeups. It also has NOHZ which tries to keep scheduler ticks away from sleeping CPUs. The converse is also possible where sched ticks are kept away from busy CPUs to reduce perturbation.
- ARM is working on DT bindings for cache topologies, power-domain topologies and voltage-domain topologies. (Related question: can Linux DT bindings be reused by FreeBSD? The answer was that it should be possible, but there are semantic differences that need someone to step in and resolve. There really shouldn't be any OS-isms going into DT.)
- Android doesn't have a very good pervasive multi-threading story yet. VM serialisation is a problem.
- The gorilla in the room with power-perf is thermal management. Usual solutions in the ARM space tend to be reactive (if critical, cut power; otherwise clip compute using DVFS on a thermal trigger). This tends to be very jerky and inefficient. The power team at ARM is working on a more intelligent and proactive scheme that uses a model containing a priori thermal characterisation to grant compute to the various actors such that the thermal limit is never breached and performance is maximised. This extends the existing thermal management framework.
- FreeBSD's SCHED_ULE was written by Jeff Roberson. Seemingly well documented and an easy read.
- ARM doesn't have an architected way to get power measurements at run-time from various parts of a machine.
- There are a lot of interesting boards coming up which are good for power-management. Exynos based boards from hardkernel such as the Odroid Xu3E.
- ARM implementations usually have performance counters available even on production systems. Lots of interest in using these for the CPU side as well as general fabric. ARM has tools such as Development Studio 5 with Streamline that are great at helping visualise PMU counters and OS events. There are open-source tools such as kernelshark available that work with Linux kernel ftrace to provide plots of activity on CPUs.
Proposed next steps
- SCHED_ULE: how does it stack up against Linux's CFS in terms of per-task load tracking and predicting per-task load variation? What does it currently know about CPU power-management aspects? Jeff Roberson is the point man.
- Proposed future discussion with David C and others aimed at helping architect a suitable and practical power-perf design for FreeBSD.
- Some things that come to mind: An ACPI/non-ACPI agnostic abstraction for working with the power control plane. Encourage the implementation and re-use of frameworks between all architectures (not just ACPI for x86* and arbitrary stuff for ARM etc). (More later).