Project Title
Pluggable disk scheduler for FreeBSD
Synopsis
The FreeBSD disk scheduler is the standard "elevator", also known as C-LOOK, which is designed to maximize throughput but cannot guarantee fairness among clients of the disk subsystem. In fact, large sequential I/O requests, or certain access patterns from one or a few processes, can almost completely starve other processes. This is particularly critical for applications with soft real-time requirements such as audio/video players (or, to some degree, recorders).
The problem could be solved by:
1. Implementing support for pluggable (i.e. run-time selectable) disk schedulers, and
2. Implementing some of the disk schedulers presented in the literature that give better fairness guarantees.
This would not only allow users with specific requirements to fulfill them, but also make it easier to experiment with new proposals in this area. Our choice is a scheduler called Hybrid, which is a variant of a proportional-share scheduler.
STATUS
01/09/2005 Added RELENG_6 support. Fixed a bug on RELENG_5.
26/08/2005 RELENG_4 and RELENG_5 were under heavy testing during August.
The following things were made:
- Added credentials to requests (bio) for the ATA, MFS and AIO drivers (see the "credentials" section below).
- A lot of bugs have been fixed.
- Added the "credentials" section to the architectural overview below.
26/07/2005 Improved the architectural overview below. Fixed some bugs.
Current status:
- completed coding of the scheduler switch and one sample scheduler, HYBRID;
- modified 'nice' to implement per-process I/O priorities;
- implemented a sysctl interface to provide per-uid and per-gid priorities;
- tested under RELENG_4 and RELENG_5.
SIGNIFICANT SYSTEM CHANGES:
- struct bioq changed to support extra fields;
- struct proc changed to add per-proc I/O weights (temporary);
- nice -d x ... runs a process with I/O weight x;
- sysctl vfs.scheduler.name=hybrid selects a new scheduler;
- sysctl vfs.scheduler.classifier={PID|UID|GID} selects the way I/O is classified.
- ...
TODO:
- add manpages;
- remove debugging messages
18/07/2005 First draft of the architectural overview (matching the code).
[Note that most code is common to all versions of FreeBSD. The 4.x branch could use some simplifications due to the different locking model in the kernel, but we decided to use the same architecture in all branches to ease debugging.]
ARCHITECTURAL OVERVIEW
In short, the goal of the project is to implement pluggable (i.e. run-time selectable) disk schedulers for the FreeBSD kernel, so that device drivers can pick one or another depending on the requirements (e.g. favouring throughput, or fairness among classes of requests). The final goal is to allow different disk schedulers to be loaded and possibly switched at runtime.
Background
At a high level, disk I/O requests are issued by user programs in response to system calls (read, write, ...) or by kernel components (e.g. pager, swapper, exec loader...). All parameters describing a request (including the block address) are eventually stored in a structure (bio) which is passed, more or less directly, to the actual device driver.
Each driver typically stores requests in a queue (bioq), one per device. In FreeBSD 5.x and above, the queue is manipulated through the following API:
- bioq_init(bioq) initializes the queue. Typically called at device attach time.
- bioq_flush(bioq, ...) flushes the queue. Typically called at device detach time.
- bioq_disksort(bioq, bio) adds the new request to the queue.
- bioq_first(bioq) returns the address of the next request to serve, or NULL if the queue is empty.
- bioq_remove(bioq, bio) removes the specified request from the queue.
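As a rough illustration, this is how a hypothetical driver would use the API above (names such as foo_softc and foo_start are made up for this sketch; error handling and the actual hardware I/O are omitted, and the declarations are assumed to come from <sys/bio.h> as in 5.x):

#include <sys/param.h>
#include <sys/bio.h>

struct foo_softc {
    struct bio_queue_head sc_bioq;      /* per-device request queue */
};

static void
foo_attach(struct foo_softc *sc)
{
    bioq_init(&sc->sc_bioq);            /* set up the queue at attach time */
}

static void
foo_strategy(struct foo_softc *sc, struct bio *bp)
{
    bioq_disksort(&sc->sc_bioq, bp);    /* enqueue, sorted by the scheduler */
    /* ... start the hardware if it is idle ... */
}

static void
foo_start(struct foo_softc *sc)
{
    struct bio *bp;

    bp = bioq_first(&sc->sc_bioq);      /* peek at the next request to serve */
    if (bp == NULL)
        return;                         /* queue is empty */
    bioq_remove(&sc->sc_bioq, bp);      /* dequeue it... */
    /* ... and issue the transfer described by bp to the hardware */
}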
This interface is used in a reasonably consistent way by practically all device drivers, though some (RAID-related?) bypass the API and do direct head/tail insertions and removals on the internal queue. There are also a few version-dependent exceptions to remember:
- in FreeBSD 6.x and above, the ATA driver does not sort bios through this API, but does the sorting at a lower level inside the driver;
- in FreeBSD 4.x, the init and flush wrappers are expanded inline in the drivers; also, the data structures roughly corresponding to bio and bioq are called buf and bufq respectively.
To the best of our knowledge there is no specification of what the above API should do, so the functions just do something that is reasonable for the 'standard' disk scheduler (C-LOOK), which implements a one-way scan of the requests. This lack of specification may lead to some uncertainty when trying to replace C-LOOK with other schedulers.
Examples:
- once bioq_first() returns a non-null value, all subsequent calls must return the same value until bioq_remove() is called, even if there are bioq_disksort() calls in the middle. This is because there is no way to notify the scheduler that a block is under service. With the native disk scheduler this is not an issue because the request at the head cannot be changed by subsequent requests, but different scheduling algorithms could have different effects on the ordering of requests.
- bioq_disksort() is supposed to insert a request in the queue, so a subsequent bioq_first() is expected to return a non-NULL value, and some drivers depend on this behaviour. Once again this is guaranteed by the default C-LOOK scheduler, but there is a class of schedulers (non work-conserving ones, e.g. anticipatory scheduling) which may decide to defer the insertion of the request to a later time.
These and other issues must be kept in mind when implementing new schedulers.
There is another major issue to be considered. C-LOOK uses, as a sorting key, the disk block address. Other algorithms might require additional information to classify requests, e.g. the user/process/group ID, to implement some kind of fairness in scheduling requests. In the current framework, this information is missing.
Virtualizing the API
The first step in virtualizing the interface is to replace the single instance of the scheduler API with one that supports switching. To this end, we have provided wrappers around bioq_init(), bioq_first(), bioq_remove(), and bioq_disksort(), which eventually call the equivalent function of the selected scheduler. Note that bioq_flush() is not modified because it is simply implemented in terms of bioq_first()/bioq_remove() calls.
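Conceptually, each wrapper just forwards the call to the scheduler attached to the queue. A minimal sketch (shown with the 4.x bufq naming, to match the descriptor in the next subsection; "bq_sched" is an illustrative name for the per-queue scheduler pointer, and the real wrappers also handle pending switches and locking):

#include <sys/param.h>
#include <sys/buf.h>

void
bufqdisksort(struct buf_queue_head *head, struct buf *bp)
{
    head->bq_sched->disksort(head, bp);         /* scheduler-specific insert */
}

struct buf *
bufq_first(struct buf_queue_head *head)
{
    return (head->bq_sched->get_first(head));   /* scheduler-specific peek */
}

void
bufq_remove(struct buf_queue_head *head, struct buf *bp)
{
    head->bq_sched->remove(head, bp);           /* scheduler-specific removal */
}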
It is important to note that some implementation details are a bit convoluted due to locking requirements (discussed in the next few sections).
Scheduler descriptor
Each disk scheduling algorithm implements the scheduler API, which is exported through the following structure:
struct _disk_sched_interface {
    struct _disk_sched_interface *next;
    char *name;                 /* symbolic name */
    int refcount;               /* users of this scheduler */

    void (*disksort)(struct buf_queue_head *head, struct buf *bp);
    void (*remove)(struct buf_queue_head *head, struct buf *bp);
    struct buf *(*get_first)(struct buf_queue_head *head);
    void (*init)(struct buf_queue_head *head);
    void (*delete)(struct buf_queue_head *head);
    void (*load)(void);         /* load scheduler */
    void (*unload)(void);       /* unload scheduler */
};
Note that, in addition to the implementation of the standard API, we have a few additional functions, namely delete(), used to release per-queue resources when a scheduler is no longer used, and load()/unload(), which perform per-scheduler operations at load/unload time. Also, the descriptor contains a link field, a refcount, and a symbolic name used to select the scheduler.
Typically, for each *init() call, the scheduler will allocate a structure containing extra device information to complement that in the bioq. The link to this structure is a direct pointer in the bioq, which also contains a pointer to the _disk_sched_interface in use for that queue (the sysctl vfs.scheduler.name=foo selects the disk scheduler named foo). The extra info is released by calling *delete() at appropriate times, typically when a scheduler switch is performed. *disksort(), *get_first() and *remove() are simply pointers to the scheduler-specific parts of the bioq_*() calls.
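As a sketch of the per-queue hooks for a hypothetical scheduler "foo", assuming the private data is allocated with malloc(9) and that the per-queue pointer added to the bioq is called bq_sched_info (an illustrative name):

#include <sys/param.h>
#include <sys/malloc.h>

struct foo_queue_info {
    int pending;        /* scheduler-private per-queue state, e.g. per-flow lists */
};

static void
foo_init(struct buf_queue_head *head)
{
    /* allocate and zero the private state for this queue */
    head->bq_sched_info = malloc(sizeof(struct foo_queue_info),
        M_DEVBUF, M_NOWAIT | M_ZERO);
}

static void
foo_delete(struct buf_queue_head *head)
{
    free(head->bq_sched_info, M_DEVBUF);        /* release it on switch/unload */
    head->bq_sched_info = NULL;
}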
Each new disk scheduler requires a properly initialized descriptor, followed by a SYSINIT call, e.g.
static struct _disk_sched_interface foo_sched = {
    .name =      "foo",
    .disksort =  foo_disksort,
    .remove =    foo_remove,
    .get_first = foo_first,
};

SYSINIT(fooload, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, disk_sched_load, &foo_sched);
Scheduler switches
Scheduler switch requests are issued through a sysctl. Not knowing how or whether the queues are in use, all the sysctl handler can do is record the request by incrementing a version number and recording a pointer to the requested scheduler. At the next useful bioq_*() request(s), individual queues will detect the change, try to clean up pending requests, and switch to the new scheduler. At the moment the process is implemented independently on each queue, by diverting all requests to a 'suspend' queue until the main queue is empty, then performing the switch and requeueing all pending requests with the new scheduler.
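A sketch of the per-queue switch check, which would run at the top of the wrappers ("disk_sched_version", "disk_sched_current" and the bq_* fields are illustrative names):

extern int disk_sched_version;                           /* bumped by the sysctl handler */
extern struct _disk_sched_interface *disk_sched_current; /* requested scheduler */

static void
bufq_check_switch(struct buf_queue_head *head)
{
    if (head->bq_version == disk_sched_version)
        return;                                 /* no switch pending for this queue */
    if (head->bq_sched->get_first(head) != NULL)
        return;                                 /* keep draining the main queue */

    /* The main queue is empty: switch this queue to the new scheduler. */
    if (head->bq_sched->delete != NULL)
        head->bq_sched->delete(head);           /* drop the old per-queue state */
    head->bq_sched = disk_sched_current;
    head->bq_version = disk_sched_version;
    if (head->bq_sched->init != NULL)
        head->bq_sched->init(head);             /* set up the new per-queue state */
    /* ... finally, re-insert any requests parked on the suspend queue ... */
}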
Changes to struct bioq
To support the pluggable schedulers, we needed to add a few fields to the bioq structure (in 4.x it was possible, although not terribly efficient, to avoid these modifications; but in 5.x and above, due to the different locking model, that solution could not be used).
The bioq is extended with a pointer to the _disk_sched_interface currently in use for that queue; a version number (to detect pending switch requests); a pointer to the scheduler-specific info; and an additional queue to store requests suspended due to scheduler switches.
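Roughly, the extra fields look like this (field names are illustrative, and the 4.x buf naming is used to match the descriptor above; the bioq version is analogous):

#include <sys/queue.h>

struct buf_queue_head {
    /* ... original queue fields ... */
    struct _disk_sched_interface *bq_sched;     /* scheduler in use on this queue */
    int bq_version;                             /* last switch version seen */
    void *bq_sched_info;                        /* scheduler-private per-queue data */
    TAILQ_HEAD(, buf) bq_suspend;               /* requests parked during a switch */
};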
Locking
Normal queue operations are simply wrappers around the scheduler-specific bioq_*() calls, so these operations are already protected by the individual drivers' locks.
The approach followed to support scheduler switches (just record the request and defer the actual switch to the next useful bioq_*() requests on the same queue) does not require additional locks for the queues.
We do need, however, a global disk scheduler lock to protect against contention when accessing the shared data structures, namely the list of available schedulers, the version number, and the current scheduler in use. This (non-recursive) lock is wrapped into SCHED_LOCK() and SCHED_UNLOCK() calls in sys/kern/subr_disk_sched.c, and all that is needed is to make sure that it is always a leaf lock, to avoid deadlocks.
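A minimal sketch of such a lock on 5.x and above, using the standard mutex(9) API (the actual names in sys/kern/subr_disk_sched.c may differ):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx disk_sched_mtx;
MTX_SYSINIT(disk_sched, &disk_sched_mtx, "disk scheduler", MTX_DEF);

#define SCHED_LOCK()    mtx_lock(&disk_sched_mtx)
#define SCHED_UNLOCK()  mtx_unlock(&disk_sched_mtx)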
Classifier
Some schedulers require a classifier module to annotate requests with an identifier of the entity they should be charged to (e.g. the process/user/group ID). This is necessary in case the scheduler wants to provide some kind of fairness among different entities.
To this end, we needed to add a new field to the bio structure containing classifier information.
Unfortunately, the classification cannot be done in the scheduler itself (upon the call to bioq_disksort()) because the necessary information may no longer be available. As an example, process/user information about the original requester is lost at that point because curthread may refer to a kernel thread rather than the user thread that issued the system call.
As a consequence, we need to run the classifier much earlier, when the bio itself is created and initialized. Unfortunately, there is no unified mechanism to build the bio, and each driver (e.g. the ATA, MFS and AIO drivers) does it in its own way.
Some work has been done in this area to identify the (driver-specific) places to annotate the bio with classifier info. We have modified the ATA, MFS and AIO drivers. We have also provided some extra fields in the classifier to identify unmarked requests.
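The general idea, sketched for a single creation point: while still running in the context of the requesting thread, record the identifiers the classifier may need ("struct bio_classif" and "classify_bio" are illustrative names; the real patch stores credentials in the bio for the ATA, MFS and AIO drivers):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/ucred.h>

struct bio_classif {
    pid_t pid;          /* flow id for the PID classifier */
    uid_t uid;          /* flow id for the UID classifier */
    gid_t gid;          /* flow id for the GID classifier */
};

static void
classify_bio(struct bio_classif *cl)
{
    struct thread *td = curthread;      /* must still be the requesting thread */

    cl->pid = td->td_proc->p_pid;       /* requesting process */
    cl->uid = td->td_ucred->cr_uid;     /* effective user */
    cl->gid = td->td_ucred->cr_gid;     /* effective group (cr_groups[0]) */
}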
NOTE: classification might be imprecise: multiple requests for the same block will be merged and only one will result in an actual request to the driver. So it might well happen that a process gets I/O done without being 'charged' for it.
Available classifiers
At the moment we support classification based on PID, UID, GID. The type of classifier in use is selected by a sysctl variable vfs.scheduler.classifier. Depending on the choice, all requests with the same identifier are considered part of the same flow. Each flow is assigned an I/O weight (which some schedulers may use) using the following mechanisms:
- for the PID-based classifier, nice -d n ... runs a process with I/O weight n; renice -d m changes the I/O weight;
- for the UID- and GID-based classifiers, the I/O weight is not part of the proc structure, but is instead stored in the kernel. As an interim solution we have used a sysctl interface, so that
sysctl vfs.scheduler.set=UID X Y sets the weight for UID X equal to Y, and a similar syntax is used to set the per-GID weight.
The range for weights, similar to user priorities, is -20 to +20. Internally these are mapped with a log-like scale to values between 1 and 1000.
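The exact mapping is not reproduced here; as an illustration only, a log-like curve from [-20, +20] to [1, 1000] could look like the following (which end of the range gets the largest share is a policy choice of the scheduler):

#include <math.h>
#include <stdio.h>

static int
nice_to_weight(int n)   /* n in [-20, +20] */
{
    double w = exp((n + 20) * log(1000.0) / 40.0);      /* 1 ... 1000 */

    return ((int)(w + 0.5));
}

int
main(void)
{
    int n;

    for (n = -20; n <= 20; n += 10)
        printf("%3d -> %4d\n", n, nice_to_weight(n));
    return (0);
}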
Userland and kernel changes
nice and renice have been modified to support the -d N argument to set the I/O weight. The latter is stored in an extra field in struct proc.
Sample schedulers
In addition to the default scheduler we have implemented two more:
- dumb, a very dumb disk scheduler which serves requests in FIFO order, leading to a very simple implementation and very poor performance;
- hybrid, an experimental scheduler which tries to implement proportional fair sharing while retaining the performance of C-LOOK.
Testing
Generating suitable disk I/O patterns to test the disk scheduler is not straightforward, as a simple process looping on reads or writes will only have one outstanding request at a time, which is not enough (e.g. with a PID-based classifier) to exercise certain schedulers.
To generate traffic we used the aio_*() routines, which allow a process to have multiple outstanding requests. In our tests, we ran a number of these processes, each with several (10-20) outstanding requests and different I/O weights, to verify that disk bandwidth was shared according to the configured weights.
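A sketch of the kind of traffic generator used, keeping several aio_read() requests outstanding and reissuing each one as soon as it completes (device path, request size and counts are illustrative; the program runs until killed):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NREQ    16              /* outstanding requests per process */
#define BSIZE   65536           /* bytes per request */
#define NBLK    100000          /* number of request-sized slots to pick from */

int
main(void)
{
    struct aiocb cb[NREQ];
    int fd, i;

    fd = open("/dev/ad0", O_RDONLY);    /* illustrative target device */
    if (fd < 0)
        return (1);
    for (i = 0; i < NREQ; i++) {
        memset(&cb[i], 0, sizeof(cb[i]));
        cb[i].aio_fildes = fd;
        cb[i].aio_buf = malloc(BSIZE);
        cb[i].aio_nbytes = BSIZE;
        cb[i].aio_offset = (off_t)(random() % NBLK) * BSIZE;
        aio_read(&cb[i]);               /* queue the initial request */
    }
    for (;;) {                          /* keep the queue full forever */
        for (i = 0; i < NREQ; i++) {
            if (aio_error(&cb[i]) == EINPROGRESS)
                continue;               /* still being served */
            aio_return(&cb[i]);         /* reap the completed request */
            cb[i].aio_offset = (off_t)(random() % NBLK) * BSIZE;
            aio_read(&cb[i]);           /* reissue at a new random offset */
        }
        usleep(1000);                   /* avoid a pure busy-wait */
    }
}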
Summary
The final implementation supports pluggable schedulers for 4.x, 5.x (tested on both) and 6.x (untested due to lack of hardware, but basically the same code as 5.x). On-the-fly switching of schedulers is supported. The code is Giant-free and SMP-safe. We have provided sample classifiers (PID-, UID- and GID-based) and two sample schedulers in addition to the default one.
Missing or incomplete parts
The current implementation has the following limitations:
- we could not provide support for non work-conserving schedulers, for a couple of reasons:
- the assumption, in some drivers, that bioq_disksort() will make requests immediately available (so a subsequent bioq_first() will not return NULL).
- the fact that there is no bioq_lock()/bioq_unlock(), so the scheduler does not have a safe way to generate requests for a given queue.
- since there is no centralized place where bios are created, we are not sure we have caught all instances, especially for drivers (e.g. RAID) which we could not test. In the long term, it would probably be worthwhile to provide a bio_init() routine that the various clients could use to allocate/initialize the data structure, as opposed to doing the work inline.
- as said, the ATA driver in 6.x/7.x moves the disksort one layer below the one we are working at, so this particular work won't help on ATA-based 6.x machines.
We should figure out how to address this, because the work done at that layer is mostly a replica of the bioq_*() API.
Fixing these limitations is possible but requires a fair amount of work to clean up the individual pieces of the system.
- the selection of the scheduler is still global. It would be useful to implement per-device scheduler selection (not hard to do, all the support is already there, it suffices to extend the sysctl interface);
- documentation, as usual, is a bit incomplete;
Old, stale, working notes
17/07/2005: The Hybrid scheduler is now implemented and works on FreeBSD 4.x (performance testing is in progress). The implementation can only ensure fairness among processes for now; I'm currently working on fairness among users and groups, too (an accounting system).
Scheduler switching is the major issue now: we need a way to move requests from the old scheduler to the new one. On FreeBSD 4.x it is easy, because it does not use semaphores: we just disable interrupts and we can access the disk queues without issues.
On FreeBSD 5.x and 6.x, however, things are not so easy. Queues are protected by semaphores which are used by the ATA driver when needed. The bio_* functions are always called with the queue locked, so we can do what we want there. But during scheduler switching we need a way to obtain the lock to work on the queue. However, this is neither easy nor clean: the semaphore is a driver "private" member. The driver knows when to lock and unlock it, and direct access to the semaphore would break encapsulation; this means that we could cause deadlocks and other bad things.
One way to address the problem may be the following:
we know that each request is inserted by calling bioq_disksort and removed by calling bioq_remove. When a scheduler switch is requested, we put all incoming requests in a support queue, which is different from the one used by the scheduler. Eventually, the scheduler queue will become empty after successive calls to bioq_remove. When this happens, we can perform the actual scheduler switch and move the requests from the support queue to the new scheduler's queue. An important thing to note is that inserting requests is NOT done by calling bioq_disksort (as previously stated, we would have to obtain a lock before calling disksort, and we can't); instead we call the "strategy" function. strategy is generally called by the filesystem to read/write a request from/to the disk. Its job is to prepare the driver to accept requests (which, among other things, means obtaining the lock on the driver queue) and then call bioq_disksort. If we call strategy to insert requests into the new queue, we do not need to worry about the semaphore because the strategy function will handle it for us.
07/07/2005: What was stated in the previous report is now implemented on FreeBSD 4.x. The 'id' is a string which is "cscan" for the default scheduler and will be "hybrid" for the new scheduler which I'm implementing now. Scheduler switching can be done with a sysctl call like "sysctl vfs.scheduler.name=foo". On FreeBSD 5.x, the default functions have slightly different names:
- bioq_disksort, called by the filesystem to schedule disk requests
- bioq_first, called by the physical device driver to get the next request to serve
- bioq_remove, called by the physical device driver to remove a request from the queue (this is always called AFTER bioq_first).
Changes to FreeBSD 4.x should work in the same way in the 5.x branch, too, but have not been implemented yet.
Note: I still have to decide what happens to pending requests when a scheduler switch occurs. The basic idea is to add two function pointers to the elements of the linked list, named "init" and "flush" respectively. When a scheduler switch occurs, we can call the "flush" function for the exiting scheduler and "init" for the starting one. There are two ways to manage pending requests:
- The exiting scheduler puts all pending requests in a data structure which is returned by "flush" and then passed to the starting scheduler via "init". This method requires a bit of programming.
- When exiting, a scheduler can put pending requests into the default queue (the one used by cscan) and the new one can read requests from there. There is no need to write code for this solution, but it is not a "clean" way to manage the issue.
01/07/2005: The FreeBSD 4.x scheduling interface exports three functions called by the OS:
- bufqdisksort, called by the filesystem to schedule disk requests
- bufq_first, called by the physical device driver to get the next request to serve
- bufq_remove, called by the physical device driver to remove a request from the queue (this is always called AFTER bufq_first).
Whatever scheduler you'd like to implement, you need to rewrite the previous three functions to do the "dirty" work.
Currently, FreeBSD has no support for scheduler switching, so the first thing to do is to add it.
The basic idea is to turn bufqdisksort, bufq_first and bufq_remove into function pointers so that they can "point" to the scheduler-specific procedures.
We will create a linked list, in which each element represents a scheduler. An element should have the following fields:
- An 'id', which is unique among all schedulers
- Function pointers to the scheduler-specific bufqdisksort, bufq_first and bufq_remove
The default FreeBSD scheduler (a.k.a. C-LOOK) shall be added to this list, too. Using the 'sysctl' command (and passing it the 'id'), it is possible to select the current scheduler at runtime. Each time a scheduler is selected, we copy the function pointers contained in the list element to the "global" bufqdisksort, bufq_first and bufq_remove. This also has a big advantage: we don't need to change the parts of the kernel which contain calls to those functions.