We now look at the various system calls constituting the realtime process scheduling API. These system calls allow us to control process scheduling policies and priorities.
Although realtime scheduling has been a part of Linux since version 2.0 of the kernel, several problems persisted for a long time in the implementation. A number of features of the implementation remained broken in the 2.2 kernel, and even in early 2.4 kernels. Most of these problems were rectified by about kernel 2.4.20.
The sched_get_priority_min() and sched_get_priority_max() system calls return the available priority range for a scheduling policy.
    #include <sched.h>

    int sched_get_priority_min(int policy);
    int sched_get_priority_max(int policy);
Both return nonnegative integer priority on success, or -1 on error
For both system calls, policy specifies the scheduling policy about which we wish to obtain information. For this argument, we specify either SCHED_RR or SCHED_FIFO. The sched_get_priority_min() system call returns the minimum priority for the specified policy, and sched_get_priority_max() returns the maximum priority. On Linux, these system calls return the numbers 1 and 99, respectively, for both the SCHED_RR and SCHED_FIFO policies. In other words, the priority ranges of the two realtime policies completely coincide, and SCHED_RR and SCHED_FIFO processes with the same priority are equally eligible for scheduling. (Which one is scheduled first depends on their order in the queue for that priority level.)
The range of realtime priorities differs from one UNIX implementation to another. Therefore, instead of hard-coding priority values into an application, we should specify priorities relative to the return value from one of these functions. Thus, the lowest SCHED_RR priority would be specified as sched_get_priority_min(SCHED_RR), the next higher priority as sched_get_priority_min(SCHED_RR) + 1, and so on.
SUSv3 doesn’t require that the SCHED_RR and SCHED_FIFO policies use the same priority ranges, but they do so on most UNIX implementations. For example, on Solaris 8, the priority range for both policies is 0 to 59, and on FreeBSD 6.1, it is 0 to 31.
In this section, we look at the system calls that modify and retrieve scheduling policies and priorities.
The sched_setscheduler() system call changes both the scheduling policy and the priority of the process whose process ID is specified in pid. If pid is specified as 0, the attributes of the calling process are changed.
    #include <sched.h>

    int sched_setscheduler(pid_t pid, int policy,
                           const struct sched_param *param);
Returns 0 on success, or -1 on error
The param argument is a pointer to a structure of the following form:
    struct sched_param {
        int sched_priority;         /* Scheduling priority */
    };
SUSv3 defines the param argument as a structure to allow an implementation to include additional implementation-specific fields, which may be useful if an implementation provides additional scheduling policies. However, like most UNIX implementations, Linux provides just the sched_priority field, which specifies the scheduling priority. For the SCHED_RR and SCHED_FIFO policies, this must be a value in the range indicated by sched_get_priority_min() and sched_get_priority_max(); for other policies, the priority must be 0.
The policy argument determines the scheduling policy for the process. It is specified as one of the policies shown in Table 35-1.
Table 35-1. Linux realtime and nonrealtime scheduling policies
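Policy | Description | SUSv3 |
---|---|---|
SCHED_FIFO | Realtime first-in first-out | • |
SCHED_RR | Realtime round-robin | • |
SCHED_OTHER | Standard round-robin time-sharing | • |
SCHED_BATCH | Similar to SCHED_OTHER, but intended for batch execution | |
SCHED_IDLE | Similar to SCHED_OTHER, but with very low priority | |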
A successful sched_setscheduler() call moves the process specified by pid to the back of the queue for its priority level.
SUSv3 specifies that the return value of a successful sched_setscheduler() call should be the previous scheduling policy. However, Linux deviates from the standard in that a successful call returns 0. A portable application should test for success by checking that the return status is not -1.
The scheduling policy and priority are inherited by a child created via fork(), and they are preserved across an exec().
The sched_setparam() system call provides a subset of the functionality of sched_setscheduler(). It modifies the scheduling priority of a process while leaving the policy unchanged.
    #include <sched.h>

    int sched_setparam(pid_t pid, const struct sched_param *param);
Returns 0 on success, or -1 on error
The pid and param arguments are the same as for sched_setscheduler().
A successful sched_setparam() call moves the process specified by pid to the back of the queue for its priority level.
The program in Example 35-2 uses sched_setscheduler() to set the policy and priority of the processes specified by its command-line arguments. The first argument is a letter specifying a scheduling policy, the second is an integer priority, and the remaining arguments are the process IDs of the processes whose scheduling attributes are to be changed.
Example 35-2. Modifying process scheduling policies and priorities
procpri/sched_set.c
    #include <sched.h>
    #include "tlpi_hdr.h"

    /* Policy letters accepted on the command line; 'b' and 'i' are
       accepted only where the corresponding policy is defined */
    static const char policyLetters[] = "rfo"
    #ifdef SCHED_BATCH              /* Linux-specific */
            "b"
    #endif
    #ifdef SCHED_IDLE               /* Linux-specific */
            "i"
    #endif
            ;

    int
    main(int argc, char *argv[])
    {
        int j, pol;
        struct sched_param sp;

        if (argc < 3 || strchr(policyLetters, argv[1][0]) == NULL)
            usageErr("%s policy priority [pid...]\n"
                    "    policy is 'r' (RR), 'f' (FIFO), "
    #ifdef SCHED_BATCH              /* Linux-specific */
                    "'b' (BATCH), "
    #endif
    #ifdef SCHED_IDLE               /* Linux-specific */
                    "'i' (IDLE), "
    #endif
                    "or 'o' (OTHER)\n",
                    argv[0]);

        pol = (argv[1][0] == 'r') ? SCHED_RR :
                    (argv[1][0] == 'f') ? SCHED_FIFO :
    #ifdef SCHED_BATCH
                    (argv[1][0] == 'b') ? SCHED_BATCH :
    #endif
    #ifdef SCHED_IDLE
                    (argv[1][0] == 'i') ? SCHED_IDLE :
    #endif
                    SCHED_OTHER;
        sp.sched_priority = getInt(argv[2], 0, "priority");

        /* Apply the same policy and priority to each listed process */
        for (j = 3; j < argc; j++)
            if (sched_setscheduler(getLong(argv[j], 0, "pid"), pol, &sp) == -1)
                errExit("sched_setscheduler");

        exit(EXIT_SUCCESS);
    }
In kernels before 2.6.12, a process generally must be privileged (CAP_SYS_NICE) to make changes to scheduling policies and priorities. The one exception to this requirement is that an unprivileged process can change the scheduling policy of a process to SCHED_OTHER if the effective user ID of the caller matches either the real or effective user ID of the target process.
Since kernel 2.6.12, the rules about setting realtime scheduling policies and priorities have changed with the introduction of a new, nonstandard resource limit, RLIMIT_RTPRIO. As with older kernels, privileged (CAP_SYS_NICE) processes can make arbitrary changes to the scheduling policy and priority of any process. However, an unprivileged process can also change scheduling policies and priorities, according to the following rules:
If the process has a nonzero RLIMIT_RTPRIO soft limit, then it can make arbitrary changes to its scheduling policy and priority, subject to the constraint that the upper limit on the realtime priority that it may set is the maximum of its current realtime priority (if the process is currently operating under a realtime policy) and the value of its RLIMIT_RTPRIO soft limit.
If the value of a process’s RLIMIT_RTPRIO soft limit is 0, then the only change that it can make is to lower its realtime scheduling priority or to switch from a realtime policy to a nonrealtime policy.
The SCHED_IDLE policy is special. A process that is operating under this policy can’t make any changes to its policy, regardless of the value of the RLIMIT_RTPRIO resource limit.
Policy and priority changes can also be performed from another unprivileged process, as long as the effective user ID of that process matches either the real or effective user ID of the target process.
A process’s soft RLIMIT_RTPRIO limit determines only what changes can be made to its own scheduling policy and priority, either by the process itself or by another unprivileged process. A nonzero limit doesn’t give an unprivileged process the ability to change the scheduling policy and priority of other processes.
Starting with kernel 2.6.25, Linux adds the concept of realtime scheduling groups, configurable via the CONFIG_RT_GROUP_SCHED kernel option, which also affect the changes that can be made when setting realtime scheduling policies. See the kernel source file Documentation/scheduler/sched-rt-group.txt for details.
The sched_getscheduler() and sched_getparam() system calls retrieve the scheduling policy and priority of a process.
    #include <sched.h>

    int sched_getscheduler(pid_t pid);
Returns scheduling policy, or -1 on error
    int sched_getparam(pid_t pid, struct sched_param *param);
Returns 0 on success, or -1 on error
For both of these system calls, pid specifies the ID of the process about which information is to be retrieved. If pid is 0, information is retrieved about the calling process. Both system calls can be used by an unprivileged process to retrieve information about any process, regardless of credentials.
The sched_getparam() system call returns the realtime priority of the specified process in the sched_priority field of the sched_param structure pointed to by param.
Upon successful execution, sched_getscheduler() returns one of the policies shown earlier in Table 35-1.
The program in Example 35-3 uses sched_getscheduler() and sched_getparam() to retrieve the policy and priority of all of the processes whose process IDs are given as command-line arguments. The following shell session demonstrates the use of this program, as well as the program in Example 35-2:
    $ su                          Assume privilege so we can set realtime policies
    Password:
    # sleep 100 &                 Create a process
    [1] 2006
    # ./sched_view 2006           View initial policy and priority of sleep process
    2006: OTHER  0
    # ./sched_set f 25 2006       Switch process to SCHED_FIFO policy, priority 25
    # ./sched_view 2006           Verify change
    2006: FIFO  25
Example 35-3. Retrieving process scheduling policies and priorities
procpri/sched_view.c
    #include <sched.h>
    #include "tlpi_hdr.h"

    int
    main(int argc, char *argv[])
    {
        int j, pol;
        struct sched_param sp;

        for (j = 1; j < argc; j++) {
            pol = sched_getscheduler(getLong(argv[j], 0, "pid"));
            if (pol == -1)
                errExit("sched_getscheduler");

            if (sched_getparam(getLong(argv[j], 0, "pid"), &sp) == -1)
                errExit("sched_getparam");

            printf("%s: %-5s %2d\n", argv[j],
                    (pol == SCHED_OTHER) ? "OTHER" :
                    (pol == SCHED_RR) ? "RR" :
                    (pol == SCHED_FIFO) ? "FIFO" :
    #ifdef SCHED_BATCH              /* Linux-specific */
                    (pol == SCHED_BATCH) ? "BATCH" :
    #endif
    #ifdef SCHED_IDLE               /* Linux-specific */
                    (pol == SCHED_IDLE) ? "IDLE" :
    #endif
                    "???", sp.sched_priority);
        }

        exit(EXIT_SUCCESS);
    }
Since SCHED_RR and SCHED_FIFO processes preempt any lower-priority processes (e.g., the shell under which the program is run), when developing applications that use these policies, we need to be aware that a runaway realtime process could lock up the system by hogging the CPU. Programmatically, there are a few ways to avoid this possibility:
Establish a suitably low soft CPU time resource limit (RLIMIT_CPU, described in Details of Specific Resource Limits) using setrlimit(). If the process consumes too much CPU time, it will be sent a SIGXCPU signal, which kills the process by default.
Set an alarm timer using alarm(). If the process continues running for a wall clock time that exceeds the number of seconds specified in the alarm() call, then it will be killed by a SIGALRM signal.
Create a watchdog process that runs with a high realtime priority. This process can loop repeatedly, sleeping for a specified interval, and then waking and monitoring the status of other processes. Such monitoring could include measuring the value of the CPU time clock for each process (see the discussion of the clock_getcpuclockid() function in Obtaining the Clock ID of a Specific Process or Thread) and checking its scheduling policy and priority using sched_getscheduler() and sched_getparam(). If a process is deemed to be misbehaving, the watchdog process could lower the process’s priority, or stop or terminate it by sending an appropriate signal.
Since kernel 2.6.25, Linux provides a nonstandard resource limit, RLIMIT_RTTIME, for controlling the amount of CPU time that can be consumed in a single burst by a process running under a realtime scheduling policy. Specified in microseconds, RLIMIT_RTTIME limits the amount of CPU time that the process may consume without performing a system call that blocks. When the process does perform such a call, the count of consumed CPU time is reset to 0. The count of consumed CPU time is not reset if the process is preempted by a higher-priority process, is scheduled off the CPU because its time slice expired (for a SCHED_RR process), or calls sched_yield() (Relinquishing the CPU). If the process reaches its limit of CPU time, then, as with RLIMIT_CPU, it will be sent a SIGXCPU signal, which kills the process by default.
The realtime scheduling group changes in kernel 2.6.25 can also help prevent runaway realtime processes from locking up the system. For details, see the kernel source file Documentation/scheduler/sched-rt-group.txt.
Linux 2.6.32 added SCHED_RESET_ON_FORK as a value that can be specified in policy when calling sched_setscheduler(). This is a flag value that is ORed with one of the policies in Table 35-1. If this flag is set, then children that are created by this process using fork() do not inherit privileged scheduling policies and priorities. The rules are as follows:
If the calling process has a realtime scheduling policy (SCHED_RR or SCHED_FIFO), then the policy in child processes is reset to the standard round-robin time-sharing policy, SCHED_OTHER.
If the process has a negative (i.e., high) nice value, then the nice value in child processes is reset to 0.
The SCHED_RESET_ON_FORK flag was designed to be used in media-playback applications. It permits the creation of single processes that have realtime scheduling policies that can’t be passed to child processes. Using the SCHED_RESET_ON_FORK flag prevents the creation of fork bombs that try to evade the ceiling set by the RLIMIT_RTTIME resource limit by creating multiple children running under realtime scheduling policies.
Once the SCHED_RESET_ON_FORK flag has been enabled for a process, only a privileged process (CAP_SYS_NICE) can disable it. When a child process is created, its reset-on-fork flag is disabled.
A realtime process may voluntarily relinquish the CPU in two ways: by invoking a system call that blocks the process (e.g., a read() from a terminal) or by calling sched_yield().
    #include <sched.h>

    int sched_yield(void);
Returns 0 on success, or -1 on error
The operation of sched_yield() is simple. If there are any other queued runnable processes at the same priority level as the calling process, then the calling process is placed at the back of the queue, and the process at the head of the queue is scheduled to use the CPU. If no other runnable processes are queued at this priority, then sched_yield() does nothing; the calling process simply continues using the CPU.
Although SUSv3 permits a possible error return from sched_yield(), this system call always succeeds on Linux, as well as on many other UNIX implementations. Portable applications should nevertheless always check for an error return.
The use of sched_yield() for nonrealtime processes is undefined.
The sched_rr_get_interval() system call enables us to find out the length of the time slice allocated to a SCHED_RR process each time it is granted use of the CPU.
    #include <sched.h>

    int sched_rr_get_interval(pid_t pid, struct timespec *tp);
Returns 0 on success, or -1 on error
As with the other process scheduling system calls, pid identifies the process about which we want to obtain information, and specifying pid as 0 means the calling process. The time slice is returned in the timespec structure pointed to by tp:
    struct timespec {
        time_t tv_sec;              /* Seconds */
        long   tv_nsec;             /* Nanoseconds */
    };