Realtime Process Scheduling API

We now look at the various system calls constituting the realtime process scheduling API. These system calls allow us to control process scheduling policies and priorities.

Note

Although realtime scheduling has been a part of Linux since version 2.0 of the kernel, several problems persisted for a long time in the implementation. A number of features of the implementation remained broken in the 2.2 kernel, and even in early 2.4 kernels. Most of these problems were rectified by about kernel 2.4.20.

The sched_get_priority_min() and sched_get_priority_max() system calls return the available priority range for a scheduling policy.

#include <sched.h>

int sched_get_priority_min(int policy);
int sched_get_priority_max(int policy);

Note

Both return a nonnegative integer priority on success, or -1 on error

For both system calls, policy specifies the scheduling policy about which we wish to obtain information. For this argument, we specify either SCHED_RR or SCHED_FIFO. The sched_get_priority_min() system call returns the minimum priority for the specified policy, and sched_get_priority_max() returns the maximum priority. On Linux, these system calls return the numbers 1 and 99, respectively, for both the SCHED_RR and SCHED_FIFO policies. In other words, the priority ranges of the two realtime policies completely coincide, and SCHED_RR and SCHED_FIFO processes with the same priority are equally eligible for scheduling. (Which one is scheduled first depends on their order in the queue for that priority level.)

The range of realtime priorities differs from one UNIX implementation to another. Therefore, instead of hard-coding priority values into an application, we should specify priorities relative to the return value from one of these functions. Thus, the lowest SCHED_RR priority would be specified as sched_get_priority_min(SCHED_RR), the next higher priority as sched_get_priority_min(SCHED_RR) + 1, and so on.

Note

SUSv3 doesn’t require that the SCHED_RR and SCHED_FIFO policies use the same priority ranges, but they do so on most UNIX implementations. For example, on Solaris 8, the priority range for both policies is 0 to 59, and on FreeBSD 6.1, it is 0 to 31.

In this section, we look at the system calls that modify and retrieve scheduling policies and priorities.

The sched_setscheduler() system call changes both the scheduling policy and the priority of the process whose process ID is specified in pid. If pid is specified as 0, the attributes of the calling process are changed.

#include <sched.h>

int sched_setscheduler(pid_t pid, int policy,
                       const struct sched_param *param);

Note

Returns 0 on success, or -1 on error

The param argument is a pointer to a structure of the following form:

struct sched_param {
     int sched_priority;        /* Scheduling priority */
};

SUSv3 defines the param argument as a structure to allow an implementation to include additional implementation-specific fields, which may be useful if an implementation provides additional scheduling policies. However, like most UNIX implementations, Linux provides just the sched_priority field, which specifies the scheduling priority. For the SCHED_RR and SCHED_FIFO policies, this must be a value in the range indicated by sched_get_priority_min() and sched_get_priority_max(); for other policies, the priority must be 0.

The policy argument determines the scheduling policy for the process. It is specified as one of the policies shown in Table 35-1.

A successful sched_setscheduler() call moves the process specified by pid to the back of the queue for its priority level.

SUSv3 specifies that the return value of a successful sched_setscheduler() call should be the previous scheduling policy. However, Linux deviates from the standard in that a successful call returns 0. A portable application should test for success by checking that the return status is not -1.

The scheduling policy and priority are inherited by a child created via fork(), and they are preserved across an exec().

The sched_setparam() system call provides a subset of the functionality of sched_setscheduler(). It modifies the scheduling priority of a process while leaving the policy unchanged.

#include <sched.h>

int sched_setparam(pid_t pid, const struct sched_param *param);

Note

Returns 0 on success, or -1 on error

The pid and param arguments are the same as for sched_setscheduler().

A successful sched_setparam() call moves the process specified by pid to the back of the queue for its priority level.

The program in Example 35-2 uses sched_setscheduler() to set the policy and priority of the processes specified by its command-line arguments. The first argument is a letter specifying a scheduling policy, the second is an integer priority, and the remaining arguments are the process IDs of the processes whose scheduling attributes are to be changed.

In kernels before 2.6.12, a process generally must be privileged (CAP_SYS_NICE) to make changes to scheduling policies and priorities. The one exception to this requirement is that an unprivileged process can change the scheduling policy of a process to SCHED_OTHER if the effective user ID of the caller matches either the real or effective user ID of the target process.

Since kernel 2.6.12, the rules about setting realtime scheduling policies and priorities have changed with the introduction of a new, nonstandard resource limit, RLIMIT_RTPRIO. As with older kernels, privileged (CAP_SYS_NICE) processes can make arbitrary changes to the scheduling policy and priority of any process. However, an unprivileged process can also change scheduling policies and priorities, according to the following rules:

The sched_getscheduler() and sched_getparam() system calls retrieve the scheduling policy and priority of a process.

#include <sched.h>

int sched_getscheduler(pid_t pid);

Note

Returns scheduling policy, or -1 on error

int sched_getparam(pid_t pid, struct sched_param *param);

Note

Returns 0 on success, or -1 on error

For both of these system calls, pid specifies the ID of the process about which information is to be retrieved. If pid is 0, information is retrieved about the calling process. Both system calls can be used by an unprivileged process to retrieve information about any process, regardless of credentials.

The sched_getparam() system call returns the realtime priority of the specified process in the sched_priority field of the sched_param structure pointed to by param.

Upon successful execution, sched_getscheduler() returns one of the policies shown earlier in Table 35-1.

The program in Example 35-3 uses sched_getscheduler() and sched_getparam() to retrieve the policy and priority of all of the processes whose process IDs are given as command-line arguments. The following shell session demonstrates the use of this program, as well as the program in Example 35-2:

$ su                          Assume privilege so we can set realtime policies
Password:
# sleep 100 &                 Create a process
[1] 2006
# ./sched_view 2006           View initial policy and priority of sleep process
2006: OTHER  0
# ./sched_set f 25 2006       Switch process to SCHED_FIFO policy, priority 25
# ./sched_view 2006           Verify change
2006: FIFO  25

Since SCHED_RR and SCHED_FIFO processes preempt any lower-priority processes (e.g., the shell under which the program is run), when developing applications that use these policies, we need to be aware of the possibility that a runaway realtime process could lock up the system by hogging the CPU. Programmatically, there are a few ways to avoid this possibility:

Note

The changes in kernel 2.6.25 can also help prevent runaway realtime processes from locking up the system. For details, see the kernel source file Documentation/scheduler/sched-rt-group.txt.

Linux 2.6.32 added SCHED_RESET_ON_FORK as a value that can be specified in policy when calling sched_setscheduler(). This is a flag value that is ORed with one of the policies in Table 35-1. If this flag is set, then children that are created by this process using fork() do not inherit privileged scheduling policies and priorities. The rules are as follows:

  • If the calling process has a realtime scheduling policy (SCHED_RR or SCHED_FIFO), then the policy in child processes is reset to the standard round-robin time-sharing policy, SCHED_OTHER.

  • If the process has a negative (i.e., high) nice value, then the nice value in child processes is reset to 0.

The SCHED_RESET_ON_FORK flag was designed to be used in media-playback applications. It permits the creation of single processes that have realtime scheduling policies that can’t be passed to child processes. Using the SCHED_RESET_ON_FORK flag prevents the creation of fork bombs that try to evade the ceiling set by the RLIMIT_RTTIME resource limit by creating multiple children running under realtime scheduling policies.

Once the SCHED_RESET_ON_FORK flag has been enabled for a process, only a privileged process (CAP_SYS_NICE) can disable it. When a child process is created, its reset-on-fork flag is disabled.

A realtime process may voluntarily relinquish the CPU in two ways: by invoking a system call that blocks the process (e.g., a read() from a terminal) or by calling sched_yield().

#include <sched.h>

int sched_yield(void);

Note

Returns 0 on success, or -1 on error

The operation of sched_yield() is simple. If there are any other queued runnable processes at the same priority level as the calling process, then the calling process is placed at the back of the queue, and the process at the head of the queue is scheduled to use the CPU. If no other runnable processes are queued at this priority, then sched_yield() does nothing; the calling process simply continues using the CPU.

Although SUSv3 permits a possible error return from sched_yield(), this system call always succeeds on Linux, as well as on many other UNIX implementations. Portable applications should nevertheless always check for an error return.

The use of sched_yield() for nonrealtime processes is undefined.

The sched_rr_get_interval() system call enables us to find out the length of the time slice allocated to a SCHED_RR process each time it is granted use of the CPU.

#include <sched.h>

int sched_rr_get_interval(pid_t pid, struct timespec *tp);

Note

Returns 0 on success, or -1 on error

As with the other process scheduling system calls, pid identifies the process about which we want to obtain information, and specifying pid as 0 means the calling process. The time slice is returned in the timespec structure pointed to by tp:

struct timespec {
    time_t tv_sec;          /* Seconds */
    long   tv_nsec;         /* Nanoseconds */
};