Reworking the call out API: towards a tickless kernel

Being worked on by PrashantVaibhav during GSoC 2009. Mentored by EdMaste, co-mentored by AttilioRao. Project blog: Prashant's FreeBSD blog

Certain details of the project could be modified over its course. This Wiki page is work-in-progress and currently describes the tentative plan.


This project seeks to redesign the callout/timeout API and implementation in the kernel, to address certain long standing issues and make it more scalable for the future. An additional goal would be to take advantage of these changes to make the kernel tickless, and able to make use of higher-resolution timers in a hardware independent way.

Description & Rationale

The current callout scheme suffers from several issues including being periodic at a fixed rate, doing a lot of things during each interrupt, polluting the callout data structures with a lot of unnecessary callouts which either do not fire or get rearmed frequently. The project redesigns the API and implementation to address these issues. In particular, I plan to make the entire callout wheel processing tickless-capable and capable of using any available hardware timer on the platform, through the use of deadline-capable timers in addition to supporting periodic timers.

Making the kernel tickless has several advantages, the most prominent being more efficient CPU usage because of less clock cycles being wasted in servicing periodic timer interrupts even when there is no callout to expire at that instant. Consequently, the processor can stay quiescent for longer durations and only be woken up if there is an actual timer to be expired, resulting in lesser power consumption. This also has good implications for virtual machine performance, as it avoids having to switch to the VM context unnecessarily.

Timer ticks

The primary means of keeping track of the time at which to expire each callout would be a pseudo-opaque data type hwclocktick_t (pseudo-opaque because it should allow equality and ordering operations). This will represent the number of hardware-dependent clock ticks. Each available hardware clock would expose functions to convert between hwclocktick_t and wall time, for applications which need timers to be expired at a given wall time. Frequently used hwclocktick_t values could be cached at boot-time or adaptively while the kernel is running. For clocks known to always run at a fixed rate, these values can also be cached at compile time. This data structure would most likely be a 64bit unsigned integer (but depends on arch).

Hardware clocks

To make the callout implementation hardware-independent to a reasonable degree, each available hardware clock will be represented as a structure containing function pointers. Each clock will expose functions to convert between wall time and hwclocktick_t, functions to return the accuracy range for given deadline, functions to initialize the timer (for periodic timers), and arm the timer to a given hwclocktick_t (for deadline timers), stop and deinit the timer, and a function to set the callback consumer function to call when the timer has expired (or periodically). This scheme will allow us to use most kinds of timers, whether periodic or deadline-capable, to drive the callout wheel.

The callout "wheel"

This is the function which would get called by the HW timer when the deadline has expired, or periodically for periodic timers. The current hwclocktick_t will be passed as an argument. During this phase, for periodic timers, the passed hwclocktick_t will be compared to the next-in-line timer to expire, which will be called out if it is within the deadline. For deadline timers, the next-in-line timer will be called out directly, as the deadline would have already expired.

Arming, canceling, draining and rescheduling timers

This part is where significant implementation details would surface. Based on discussions during 2006 [1], my observations and conclusions are:

A binary heap could be used to store the callouts to expire within a short duration, while an unordered (or weakly ordered) data structure such as a linked list is used to store callouts with long waiting times. These timers (e.g. timers used in TCP) frequently never expire or get rescheduled, thus it would be useless effort to store them in the main binary heap and reorder them frequently. Once per second (or two), callouts from this secondary list would be merged into the main binary heap of short-term callouts. The major disadvantage with this scheme, in addition to added complexity, is that it would cause CPU burstiness, potentially undesirable in high performance/realtime configurations.

Fibonacci heaps which have O(1) amortized complexity for the major operations (insert, access top, merge) and O(log n) amortized complexity for other operations are an ideal candidate to consolidate the multiple lists. However, they suffer from the same burstiness in worst-case scenarios (as high as O(n)).

Hence, the 2nd option is using a binary heap for storing all callouts. Average-case complexity of most operations on the binary heap is O(1), with worst case being O(log n). This would reduce the overall code complexity as well. Although this is my present conclusion, the particular data structure to use still needs to be decided to maximize performance and minimize code complexity (I have not done any profiling yet to check the characteristics of callout queueing/expiring on a live system.)

The user-visible API

This API needs special thought as it should ideally not change (much) during the next several years. Borrowing from discussions made on the mailing list in 2006, my tentative proposal is as follows:


Conversion functions

Callout functions

The callout_set_within() function could enable applications to set a non-precise timer, while at the same time potentially help us optimize its insertion in and removal from the callout list (depending on the data structure chosen). E.g. for a binary heap, insertion could be stopped as soon as the parent index is within the (potentially large) range, avoiding further iterations to locate it at a precise location in the heap.

Runtime Control

Some sysctl keys will be added to select which hardware timer to use, whether to run them in deadline-mode (tickless) or periodic mode (and at what frequency). Exact keys to add will be decided during the course of the project.

Benchmarking / testing / further work

The minimum goal would be no regression. The API could co-exist with the current API to ease transition.

The benchmarking would primarily be done by measuring impact on ethernet/TCPIP performance, specially for gigabit connections. (Needs discussion)

Further work could include auditing sources of other parts to identify potential problem areas. For example, the current NFS rpc code requests periodic interrupts and implements its own naive callout scheme. Changes to such components might be needed to make the tickless feature more pervasive.

Milestones / deliverables

Since some details of the plan are still tentative, the exact goals and milestones could be revised. I mention current expected deliverables and weekly plan below. The most current version can be found in Perforce at [0].

Project:        Callout API redesign
Programmer:     Prashant Vaibhav <>
Mentors:        Ed Maste <>
                Attilio Rao <>
Wiki page:      [[SOC2009PrashantVaibhav]]

Weekly Plan

20 Apr  - 26 Apr        Proposal accepted, wiki/perforce account setup etc.
27 Apr  - 03 May        Setup development environment, get oriented around the
04 May  - 10 May        Understand current implementation fully, devise plan of
11 May  - 17 May        ..continue + discuss with mentors to finalize the
18 May  - 22 May        ..continue + revise weekly plan if needed

23 May                  [[ GSoC 2009 coding phase begins ]]

25 May  - 31 May        Implement callout list data structure (binary heap)
01 Jun  - 07 Jun        Add functions for user apps to arm/cancel/etc. callout
                        timers (user-facing changes should be completed)
08 Jun  - 14 Jun        Implement hardware timer interface for the PIT (init,
                        deinit, arm, conversion functions etc.)
15 Jun  - 20 Jun        Modify softclock to expire next-in-line timer. By this
                        time a prototype implementation should be working.
21 Jun  - 28 Jun        (vacation) - tentative
29 Jun  - 05 Jul        Continue softclock() modification, make PIT robust. At
                        this stage, the prototype implementation should be on
                        par with current scheme, in terms of functionality
06 Jul  - 12 Jul        Start work with HPET abstraction / deadline mode

13 Jul                  [[ Midterm evaluations due ]]

13 Jul  - 19 Jul        Finish off implementation of deadline mode (and get rid
                        of PIT's limited range), add TSC support also
20 Jul  - 26 Jul        Implement automatic choice of best timer (between PIT
                        and HPET, depending on availability), fail-safe mode
27 Jul  - 03 Aug        Overall code review, bug squash & polish, devise the
                        benchmarking plan
04 Aug  - 10 Aug        Continue with code review and benchmarking, start to
                        write API documentation, man pages

10 Aug                  [[ Suggested pencils down date ]]

10 Aug  - 18 Aug        Catch-up if needed, finish off documentation etc.

18 Aug                  [[ GSoC 2009 ends ]]


[0] Project milestones and weekly plan -- most current version

[1] a proposed callout API - Poul-Henning Kamp, et al

[2] Clockevents and dyntick in Linux kernel

[3] ''Redesigning the BSD callout and timer facilities'': A Costello, G Varghese, OB Drive - Washington Univ., St. Louis, MO, Tech. Rep. WUCS, 1995

[4] ''Hashed and Hierarchical Timing Wheels: Efficient Data Structures for Implementing a Timer Facility'', George Varghese, Tony Lauck, 1996

[5] Presentation on ''Redesigning the BSD UNIX Callout and Timer Facilities'', A Costello, 1996

[6] kern_timeout.c from FreeBSD CURRENT

PrashantVaibhav/SOC2009 (last edited 2023-09-03T22:35:54+0000 by GrahamPerrin)