Hierarchical_Resource_Limits

RCTL

This used to be called HRL, then "Resource Containers", and was finally renamed to RCTL. Note that it is unrelated to the Solaris command of the same name.

Purpose

Resource limits in FreeBSD, as implemented currently with setrlimit(2), have a number of drawbacks:

1. They are per process, except for a few that are per user. This is hardcoded and cannot be changed by the system administrator. However, in many situations, one is not really interested in per-process limits - what one wants is to prevent a _user_ from doing Bad Things, which means what is required is per-_user_ limits. It is possible to do this, kind of - set limits per process and then limit the number of processes. This is suboptimal, however - setting the maximum number of file descriptors to less than 100 will affect many applications (servers, torrent clients...); also, some users may legitimately need more than 50 processes (e.g. compilation of some large software package, with all the make(1) and sh(1) instances), and 5000 (a per-process limit of 100 times a maximum of 50 processes) is only slightly less than kern.maxfiles on the machine I'm typing this on right now. A much better way of solving this would be to set a per-user limit of e.g. 300.

2. The system administrator cannot choose what should happen when a limit is exceeded. For example, it is impossible to set a memory limit so that the process gets restarted (via SIGHUP) when the limit is exceeded.

3. There is no way to change limits without restarting processes. As it is now, after changing limits in /etc/login.conf, they won't apply until the user logs out and then logs in again.

This is what RCTL provides:

1. Resource limits apply to whatever the system administrator wants them to apply to. What a resource limit applies to is determined by the "subject" field of the RCTL rule. For example, the subject may be "process:1234" or "user:trasz" or "jail:42".

2. Resource limits do what the administrator wants them to do. The action taken when a limit is exceeded is determined by the "action" field of the RCTL rule. Example actions are "deny" (which means denying the allocation), "sighup" (which means sending SIGHUP to the offending process) and "log" (which means logging a message to syslog). There may be several rules with the same subject and resource, differing by action - for example, one may set rules so that when the offending process exceeds 500MB of memory, a warning is logged to syslog; when it exceeds 1GB of memory, SIGHUP is sent to the process; and when it exceeds 2GB of memory, it is killed with SIGKILL.

3. Resource limits are stored as RCTL rules, kind of similar to firewall rules. Rules look like this: "process:613:stacksize:deny=536870912", and may be added and removed by the system administrator at any time. When a rule is added, it is enforced immediately, without the need to restart any process.
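The rule syntax shown above can be illustrated with a small user-space sketch. This is Python written for illustration only, not the kernel's parser; the Rule type and the parse_rule() function are hypothetical names, and the field layout (subject:subject-id:resource:action=amount, with an optional "/per" suffix) is inferred from the example rules on this page.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    subject: str            # e.g. "process", "user", "jail", "loginclass"
    subject_id: str         # e.g. "613" or "trasz"
    resource: str           # e.g. "stacksize"
    action: str             # e.g. "deny", "sighup", "log"
    amount: int             # the limit, in the resource's own units
    per: Optional[str]      # optional "/per" suffix, e.g. "user"


def parse_rule(text: str) -> Rule:
    """Split 'subject:id:resource:action=amount[/per]' into its fields."""
    subject, subject_id, resource, rest = text.split(":", 3)
    action, _, limit = rest.partition("=")
    amount, _, per = limit.partition("/")
    return Rule(subject, subject_id, resource, action, int(amount), per or None)


rule = parse_rule("process:613:stacksize:deny=536870912")
```

Here rule.subject is "process", rule.action is "deny", and rule.amount is 536870912 (512MB).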

User interface

The manual page is available via "man rctl".

There is an rc(8) script that loads the ruleset from /etc/rctl.conf at system startup.

Implementation details

The rctl(8) tool communicates with the kernel via newly introduced system calls: rctl_get_racct(), rctl_get_rules(), rctl_get_limits(), rctl_add_rule() and rctl_remove_rule(). Instead of binary structures or parameters, these syscalls transmit rules to and from the kernel as text. This avoids possible incompatibilities between the kernel and userland when new resource types or actions are added. These interfaces are used only for administrative purposes, so their performance is not critical. The userland tool doesn't do much more than resolve names and shorthands, e.g. replacing 'u:trasz:maxprocesses:deny=100' with 'user:1001:maxprocesses:deny=100' before passing the rule to the kernel.
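The name and shorthand resolution described above might look roughly like the following sketch. It is not the rctl(8) source: resolve_rule() and both lookup tables are hypothetical, only the "u" shorthand is taken from the text, and uid 1001 is simply the value used in the example.

```python
# Hypothetical lookup tables; a real tool would consult the passwd database.
SUBJECT_SHORTHANDS = {"u": "user"}     # only the "u" shorthand appears in the text
UIDS = {"trasz": 1001}                 # assumed uid, taken from the example above


def resolve_rule(text: str) -> str:
    """Expand subject shorthands and replace user names with numeric uids."""
    subject, subject_id, rest = text.split(":", 2)
    subject = SUBJECT_SHORTHANDS.get(subject, subject)
    if subject == "user" and not subject_id.isdigit():
        subject_id = str(UIDS[subject_id])
    return ":".join((subject, subject_id, rest))


resolved = resolve_rule("u:trasz:maxprocesses:deny=100")
# resolved == "user:1001:maxprocesses:deny=100"
```

Because the result is still plain text, the kernel interface does not need to change when a new shorthand is added in userland.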

Other system calls introduced with RCTL are setloginclass(2) and getloginclass(2). These are required to let the kernel know the current login class of a process, so that rules such as 'loginclass:users:maxprocesses:deny=100/user' can work. The login class is set by setusercontext(3). The id(1) utility was modified to show the current login class.

Each rule ('struct rctl_rule') is attached to the entities it applies to. For example, the rule 'user:trasz:maxprocesses:deny=100/process', when added, will first be attached to the 'struct uidinfo' for that user. The refcount of that uidinfo will be increased so that it doesn't go away even when the last process of that user exits. There is no global list or tree of rules that would have to be maintained in performance-critical code paths; however, to implement ruleset management operations, such as listing, adding or removing rules, the kernel needs to walk through all the uidinfo, loginclass, jail and proc structures. There is an optimization that avoids this walk for rules whose subject is a process; removing these efficiently is important, as they might be temporary rules added by setrlimit(2), and temporary rules are destroyed when a process exits.

Rules are attached to _all_ of the entities they apply to. In the example above, if there are any processes owned by that user, the rule will be attached to these processes as well as to the uidinfo. Rules are refcounted. Apart from the refcount, rules are immutable - changing a rule requires removing it and adding a new one. There are no relationships between the rules themselves.
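The attachment scheme can be pictured with a toy model. The sketch below is Python, not kernel code; the Entity class stands in for both 'struct uidinfo' and 'struct proc', add_user_rule() is a hypothetical helper, and it assumes one reference is taken per attachment.

```python
class Rule:
    def __init__(self, text):
        self.text = text
        self.refcount = 0       # assumed: one reference per entity holding the rule


class Entity:
    """Stand-in for struct uidinfo or struct proc: a list of attached rules."""
    def __init__(self, name):
        self.name = name
        self.rules = []

    def attach(self, rule):
        self.rules.append(rule)
        rule.refcount += 1

    def detach(self, rule):
        self.rules.remove(rule)
        rule.refcount -= 1


def add_user_rule(rule, uidinfo, procs_of_user):
    """Attach a per-user rule to the uidinfo and to every process of that user."""
    uidinfo.attach(rule)
    for proc in procs_of_user:
        proc.attach(rule)


uidinfo = Entity("uidinfo:trasz")
procs = [Entity("proc:100"), Entity("proc:101")]
rule = Rule("user:trasz:maxprocesses:deny=100/process")
add_user_rule(rule, uidinfo, procs)
# The rule is now referenced three times: once by the uidinfo, twice by processes.

# When a process exits, only its own reference is dropped; the rule stays
# attached to the uidinfo and to the remaining process.
procs[0].detach(rule)
```

Since each entity carries its own list of rules, enforcement for a process only has to look at that process, which is why no global rule list is needed on hot paths.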

Containers ('struct racct') are embedded in the structures describing entities for which resource usage may be accounted. For example, there is 'struct racct p_racct;' in 'struct proc'. Each container has a table that counts the amount of resources allocated, with one table entry per resource type. Containers hold pointers to their parent containers. They form a hierarchy, but it is not a tree - a container may have several parents. Containers in 'struct uidinfo' or 'struct loginclass' have no parents. Containers in 'struct prison' may have at most one parent container - the jail "above" them in the jail hierarchy. Containers in 'struct proc' may have at most three parent containers - one for uidinfo, one for loginclass, and one for jail.

When a resource usage counter in a container is changed, it is changed by the same amount in all parent containers, and, in the case of hierarchical jails, in their parent containers, recursively.
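A toy model of this propagation, written in Python for illustration, is shown below. The Racct class here is a simplified stand-in for 'struct racct', with hypothetical method names; it only demonstrates that one change is applied to the container itself and to every container reachable through parent pointers.

```python
class Racct:
    """Toy model of a container: per-resource counters plus parent pointers."""
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self.usage = {}

    def ancestors(self):
        """Return every container reachable through parent pointers, once each."""
        seen, stack = [], list(self.parents)
        while stack:
            container = stack.pop()
            if container not in seen:
                seen.append(container)
                stack.extend(container.parents)
        return seen

    def change(self, resource, delta):
        """Apply delta to this container and to all of its ancestors."""
        for container in [self] + self.ancestors():
            container.usage[resource] = container.usage.get(resource, 0) + delta


user = Racct("user:trasz")
loginclass = Racct("loginclass:users")
outer_jail = Racct("jail:outer")
inner_jail = Racct("jail:inner", parents=[outer_jail])
proc = Racct("process:1234", parents=[user, loginclass, inner_jail])

proc.change("stacksize", 4096)
```

After the call, the process, its user, its login class, the inner jail and the outer jail all show 4096 for "stacksize".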

RCTL requires its accounting routines to be called whenever resource allocation or freeing takes place. For the most part, this means using two routines: racct_add(proc, resource, amount) and racct_sub(proc, resource, amount). The racct_add() routine returns 0 on success, and EDOOFUS if the allocation should fail, i.e. the call would exceed a limit. Actions other than "deny" - logging a message or sending a signal - are handled transparently by RCTL; the code calling racct_add() doesn't need to worry about them. Both routines update resource usage in the containers.
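The calling convention can be sketched as follows. This is a user-space Python model, not the kernel implementation: the Proc class and its fields are hypothetical, rules are reduced to (resource, action, limit) tuples, signals are not modelled, and EDOOFUS is hardcoded to the value used in FreeBSD's <sys/errno.h>.

```python
EDOOFUS = 88    # value from FreeBSD's <sys/errno.h>; not defined on other systems


class Proc:
    def __init__(self):
        self.usage = {}      # resource -> amount currently accounted
        self.rules = []      # (resource, action, limit) tuples attached to this process
        self.log = []        # messages produced by non-"deny" actions


def racct_add(proc, resource, amount):
    """Account for an allocation; return 0, or EDOOFUS if a deny rule forbids it."""
    new_usage = proc.usage.get(resource, 0) + amount
    denied = False
    for rule_resource, action, limit in proc.rules:
        if rule_resource != resource or new_usage <= limit:
            continue
        if action == "deny":
            denied = True
        else:
            # Non-deny actions are handled here, invisibly to the caller.
            proc.log.append(f"{action}: {resource} would reach {new_usage} > {limit}")
    if denied:
        return EDOOFUS       # usage is left unchanged; the caller fails the allocation
    proc.usage[resource] = new_usage
    return 0


def racct_sub(proc, resource, amount):
    """Account for freeing a previously added amount."""
    proc.usage[resource] = proc.usage.get(resource, 0) - amount


p = Proc()
p.rules = [("stacksize", "log", 100), ("stacksize", "deny", 200)]
first = racct_add(p, "stacksize", 150)    # exceeds only the "log" limit, so it succeeds
second = racct_add(p, "stacksize", 100)   # would reach 250, over the "deny" limit
racct_sub(p, "stacksize", 50)
```

The caller only ever inspects the return value: first is 0, second is EDOOFUS, and the log entries were produced without the caller's involvement.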

One problem with racct_add()/racct_sub() is that they are unsuitable for e.g. file descriptor limits. A file descriptor table can be shared between several processes; if one process opened a file and another closed it, the file descriptor usage for the latter process would go below zero. For that reason, there is also a racct_set() call, which sets the usage of a specified resource to a given amount. In the file descriptor case, it sets the highest open file descriptor number for a process. This means file descriptor accounting is still somewhat imprecise, but it's not fatal.
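The difference between relative and absolute accounting can be seen in a small Python model. Everything here is a simplification for illustration: the usage dictionaries, the "nofile" key and this racct_set() signature are hypothetical and do not mirror the kernel's.

```python
def racct_set(usage, resource, amount):
    """Overwrite the accounted usage of a resource with an absolute amount."""
    usage[resource] = amount


# Two processes sharing one descriptor table, each with its own usage counters.
shared_fds = set()
usage_a, usage_b = {"nofile": 0}, {"nofile": 0}

# Relative accounting: A opens descriptor 3, B closes it.
shared_fds.add(3)
usage_a["nofile"] += 1          # what racct_add() would do on behalf of A
shared_fds.discard(3)
usage_b["nofile"] -= 1          # what racct_sub() would do on behalf of B: now -1
relative_result = (usage_a["nofile"], usage_b["nofile"])

# Absolute accounting: after the table changes, record the highest open
# descriptor number (0 when the table is empty in this sketch).
shared_fds.update({0, 1, 2, 7})
for usage in (usage_a, usage_b):
    racct_set(usage, "nofile", max(shared_fds, default=0))
```

With relative accounting, B ends up at -1 although it never held more descriptors than it released. With racct_set(), both processes record 7 even though only four descriptors are open, which is the imprecision mentioned above.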

XXX: More to be added later.

Code

The code shipped with 9.0-RELEASE; important reliability fixes were committed before 9.1-RELEASE. The version in 10-CURRENT also supports CPU percentage limits.

To use RCTL, add

options RACCT
options RCTL

to the kernel config file and rebuild your kernel.

Hierarchical_Resource_Limits (last edited 2014-05-17T15:33:45+0000 by EdwardTomaszNapierala)