Reworking the call out API: towards a tickless kernel
Being worked on by PrashantVaibhav during GSoC 2009. Mentored by EdMaste, co-mentored by AttilioRao. Project blog: Prashant's FreeBSD blog
Certain details of the project could be modified over its course. This Wiki page is work-in-progress and currently describes the tentative plan.
This project seeks to redesign the callout/timeout API and implementation in the kernel, to address certain long standing issues and make it more scalable for the future. An additional goal would be to take advantage of these changes to make the kernel tickless, and able to make use of higher-resolution timers in a hardware independent way.
- Reworking the call out API: towards a tickless kernel
- Description & Rationale
- Timer ticks
- Hardware clocks
- The callout "wheel"
- Arming, canceling, draining and rescheduling timers
- The user-visible API
- Runtime Control
- Benchmarking / testing / further work
- Milestones / deliverables
Description & Rationale
The current callout scheme suffers from several issues including being periodic at a fixed rate, doing a lot of things during each interrupt, polluting the callout data structures with a lot of unnecessary callouts which either do not fire or get rearmed frequently. The project redesigns the API and implementation to address these issues. In particular, I plan to make the entire callout wheel processing tickless-capable and capable of using any available hardware timer on the platform, through the use of deadline-capable timers in addition to supporting periodic timers.
Making the kernel tickless has several advantages, the most prominent being more efficient CPU usage because of less clock cycles being wasted in servicing periodic timer interrupts even when there is no callout to expire at that instant. Consequently, the processor can stay quiescent for longer durations and only be woken up if there is an actual timer to be expired, resulting in lesser power consumption. This also has good implications for virtual machine performance, as it avoids having to switch to the VM context unnecessarily.
The primary means of keeping track of the time at which to expire each callout would be a pseudo-opaque data type hwclocktick_t (pseudo-opaque because it should allow equality and ordering operations). This will represent the number of hardware-dependent clock ticks. Each available hardware clock would expose functions to convert between hwclocktick_t and wall time, for applications which need timers to be expired at a given wall time. Frequently used hwclocktick_t values could be cached at boot-time or adaptively while the kernel is running. For clocks known to always run at a fixed rate, these values can also be cached at compile time. This data structure would most likely be a 64bit unsigned integer (but depends on arch).
To make the callout implementation hardware-independent to a reasonable degree, each available hardware clock will be represented as a structure containing function pointers. Each clock will expose functions to convert between wall time and hwclocktick_t, functions to return the accuracy range for given deadline, functions to initialize the timer (for periodic timers), and arm the timer to a given hwclocktick_t (for deadline timers), stop and deinit the timer, and a function to set the callback consumer function to call when the timer has expired (or periodically). This scheme will allow us to use most kinds of timers, whether periodic or deadline-capable, to drive the callout wheel.
The callout "wheel"
This is the function which would get called by the HW timer when the deadline has expired, or periodically for periodic timers. The current hwclocktick_t will be passed as an argument. During this phase, for periodic timers, the passed hwclocktick_t will be compared to the next-in-line timer to expire, which will be called out if it is within the deadline. For deadline timers, the next-in-line timer will be called out directly, as the deadline would have already expired.
Arming, canceling, draining and rescheduling timers
This part is where significant implementation details would surface. Based on discussions during 2006 , my observations and conclusions are:
A binary heap could be used to store the callouts to expire within a short duration, while an unordered (or weakly ordered) data structure such as a linked list is used to store callouts with long waiting times. These timers (e.g. timers used in TCP) frequently never expire or get rescheduled, thus it would be useless effort to store them in the main binary heap and reorder them frequently. Once per second (or two), callouts from this secondary list would be merged into the main binary heap of short-term callouts. The major disadvantage with this scheme, in addition to added complexity, is that it would cause CPU burstiness, potentially undesirable in high performance/realtime configurations.
Fibonacci heaps which have O(1) amortized complexity for the major operations (insert, access top, merge) and O(log n) amortized complexity for other operations are an ideal candidate to consolidate the multiple lists. However, they suffer from the same burstiness in worst-case scenarios (as high as O(n)).
Hence, the 2nd option is using a binary heap for storing all callouts. Average-case complexity of most operations on the binary heap is O(1), with worst case being O(log n). This would reduce the overall code complexity as well. Although this is my present conclusion, the particular data structure to use still needs to be decided to maximize performance and minimize code complexity (I have not done any profiling yet to check the characteristics of callout queueing/expiring on a live system.)
The user-visible API
This API needs special thought as it should ideally not change (much) during the next several years. Borrowing from discussions made on the mailing list in 2006, my tentative proposal is as follows:
hwclocktick_t — semi-opaque hw-dependent measure of time
callout_handle_t — opaque handle used to represent the callout timer (this would actually be an index to help us locate the callout within our binary heap)
callout_flags_t — enum containing certain flags to specify if this callout is likely to expire, whether to repeat it, and flags for future expansion
callout_set(callout_handle_t* handle, hwclocktick_t whentoexpire, func*, arg*, mtx* = 0, callout_flags_t = 0)
callout_set_within(callout_handle_t* handle, hwclocktick_t mintime, hwclocktick_t maxtime, func*, arg*, mtx* = 0, callout_flags_t = 0) — expire the timer any time within the given range
callout_reset(callout_handle_t* handle, hwclocktick_t newtime)
The callout_set_within() function could enable applications to set a non-precise timer, while at the same time potentially help us optimize its insertion in and removal from the callout list (depending on the data structure chosen). E.g. for a binary heap, insertion could be stopped as soon as the parent index is within the (potentially large) range, avoiding further iterations to locate it at a precise location in the heap.
Some sysctl keys will be added to select which hardware timer to use, whether to run them in deadline-mode (tickless) or periodic mode (and at what frequency). Exact keys to add will be decided during the course of the project.
Benchmarking / testing / further work
The minimum goal would be no regression. The API could co-exist with the current API to ease transition.
The benchmarking would primarily be done by measuring impact on ethernet/TCPIP performance, specially for gigabit connections. (Needs discussion)
Further work could include auditing sources of other parts to identify potential problem areas. For example, the current NFS rpc code requests periodic interrupts and implements its own naive callout scheme. Changes to such components might be needed to make the tickless feature more pervasive.
Milestones / deliverables
Since some details of the plan are still tentative, the exact goals and milestones could be revised. I mention current expected deliverables and weekly plan below. The most current version can be found in Perforce at .
Project: Callout API redesign Programmer: Prashant Vaibhav <email@example.com> Mentors: Ed Maste <firstname.lastname@example.org> Attilio Rao <email@example.com> Wiki page: [[SOC2009PrashantVaibhav]] Weekly Plan ----------- 20 Apr - 26 Apr Proposal accepted, wiki/perforce account setup etc. 27 Apr - 03 May Setup development environment, get oriented around the codebase 04 May - 10 May Understand current implementation fully, devise plan of attack 11 May - 17 May ..continue + discuss with mentors to finalize the implementation 18 May - 22 May ..continue + revise weekly plan if needed 23 May [[ GSoC 2009 coding phase begins ]] 25 May - 31 May Implement callout list data structure (binary heap) 01 Jun - 07 Jun Add functions for user apps to arm/cancel/etc. callout timers (user-facing changes should be completed) 08 Jun - 14 Jun Implement hardware timer interface for the PIT (init, deinit, arm, conversion functions etc.) 15 Jun - 20 Jun Modify softclock to expire next-in-line timer. By this time a prototype implementation should be working. 21 Jun - 28 Jun (vacation) - tentative 29 Jun - 05 Jul Continue softclock() modification, make PIT robust. At this stage, the prototype implementation should be on par with current scheme, in terms of functionality 06 Jul - 12 Jul Start work with HPET abstraction / deadline mode 13 Jul [[ Midterm evaluations due ]] 13 Jul - 19 Jul Finish off implementation of deadline mode (and get rid of PIT's limited range), add TSC support also 20 Jul - 26 Jul Implement automatic choice of best timer (between PIT and HPET, depending on availability), fail-safe mode 27 Jul - 03 Aug Overall code review, bug squash & polish, devise the benchmarking plan 04 Aug - 10 Aug Continue with code review and benchmarking, start to write API documentation, man pages 10 Aug [[ Suggested pencils down date ]] 10 Aug - 18 Aug Catch-up if needed, finish off documentation etc. 18 Aug [[ GSoC 2009 ends ]]
 Project milestones and weekly plan -- most current version
 a proposed callout API - Poul-Henning Kamp, et al
 Clockevents and dyntick in Linux kernel
 ''Redesigning the BSD callout and timer facilities'': A Costello, G Varghese, OB Drive - Washington Univ., St. Louis, MO, Tech. Rep. WUCS, 1995
 ''Hashed and Hierarchical Timing Wheels: Efficient Data Structures for Implementing a Timer Facility'', George Varghese, Tony Lauck, 1996
 Presentation on ''Redesigning the BSD UNIX Callout and Timer Facilities'', A Costello, 1996