The standard kernel scheduling algorithm usually provides adequate performance and responsiveness for the mixture of interactive and background processes typically run on a system. However, realtime applications have more stringent requirements of a scheduler, including the following:
A realtime application must provide a guaranteed maximum response time for external inputs. In many cases, these guaranteed maximum response times must be quite small (e.g., of the order of a fraction of a second). For example, a slow response by a vehicle navigation system could be catastrophic. To satisfy this requirement, the kernel must provide the facility for a high-priority process to obtain control of the CPU in a timely fashion, preempting any process that may currently be running.
A time-critical application may need to take other steps to avoid unacceptable delays. For example, to avoid being delayed by a page fault, an application can lock all of its virtual memory into RAM using mlock() or mlockall() (described in Memory Locking: mlock() and mlockall()).
A high-priority process should be able to maintain exclusive access to the CPU until it completes or voluntarily relinquishes the CPU.
A realtime application should be able to control the precise order in which its component processes are scheduled.
SUSv3 specifies a realtime process scheduling API (originally defined in POSIX.1b) that partly addresses these requirements. This API provides two realtime scheduling policies: SCHED_RR and SCHED_FIFO. Processes operating under either of these policies always have priority over processes scheduled using the standard round-robin time-sharing policy described in Process Priorities (Nice Values), which the realtime scheduling API identifies using the constant SCHED_OTHER.
Each of the realtime policies allows for a range of priority levels. SUSv3 requires that an implementation provide at least 32 discrete priorities for the realtime policies. In each scheduling policy, runnable processes with higher priority always have precedence over lower-priority processes when seeking access to the CPU.
The statement that runnable processes with higher priority always have precedence over lower-priority processes needs to be qualified for multiprocessor Linux systems (including hyperthreaded systems). On a multiprocessor system, each CPU has a separate run queue (this provides better performance than a single system-wide run queue), and processes are prioritized only per CPU run queue. For example, on a dual-processor system with three processes, process A with realtime priority 20 could be queued waiting for CPU 0, which is currently running process B with priority 30, even though CPU 1 is running process C with a priority of 10.
Realtime applications that employ multiple processes (or threads) can use the CPU affinity API described in CPU Affinity to avoid any problems that might result from this scheduling behavior. For example, on a four-processor system, all noncritical processes could be isolated onto a single CPU, leaving the other three CPUs available for use by the application.
Linux provides 99 realtime priority levels, numbered 1 (lowest) to 99 (highest), and this range applies in both realtime scheduling policies. The priorities in each policy are equivalent. This means that, given two processes with the same priority, one operating under the SCHED_RR policy and the other under SCHED_FIFO, either may be the next one eligible for execution, depending on the order in which they were scheduled. In effect, each priority level maintains a queue of runnable processes, and the next process to run is selected from the front of the highest-priority nonempty queue.
Applications with all of the requirements listed at the start of this section are sometimes referred to as hard realtime applications. However, the POSIX realtime process scheduling API doesn’t satisfy all of these requirements. In particular, it provides no way for an application to guarantee response times for handling input. To make such guarantees requires operating system features that are not part of the mainline Linux kernel (nor most other standard operating systems). The POSIX API merely provides us with so-called soft realtime, allowing us to control which processes are scheduled for use of the CPU.
Adding support for hard realtime applications is difficult to achieve without imposing an overhead on the system that conflicts with the performance requirements of the time-sharing applications that form the majority of applications on typical desktop and server systems. This is why most UNIX kernels—including, historically, Linux—have not natively supported realtime applications. Nevertheless, starting from around version 2.6.18, various features have been added to the Linux kernel with the eventual aim of allowing Linux to natively provide full support for hard realtime applications, without imposing the aforementioned overhead for time-sharing operation.
Under the SCHED_RR (round-robin) policy, processes of equal priority are executed in a round-robin time-sharing fashion. A process receives a fixed-length time slice each time it uses the CPU. Once scheduled, a process employing the SCHED_RR policy maintains control of the CPU until either:
it reaches the end of its time slice;
it voluntarily relinquishes the CPU, either by performing a blocking system call or by calling the sched_yield() system call (described in Relinquishing the CPU);
it terminates; or
it is preempted by a higher-priority process.
For the first two events above, when a process running under the SCHED_RR policy loses access to the CPU, it is placed at the back of the queue for its priority level. In the final case, when the higher-priority process has ceased execution, the preempted process continues execution, consuming the remainder of its time slice (i.e., the preempted process remains at the head of the queue for its priority level).
In both the SCHED_RR and the SCHED_FIFO policies, the currently running process may be preempted for one of the following reasons:
a higher-priority process that was blocked became unblocked (e.g., an I/O operation on which it was waiting completed);
the priority of another process was raised to a higher level than the currently running process; or
the priority of the currently running process was decreased to a lower value than that of some other runnable process.
The SCHED_RR policy is similar to the standard round-robin time-sharing scheduling algorithm (SCHED_OTHER), in that it allows a group of processes with the same priority to share access to the CPU. The most notable difference is the existence of strictly distinct priority levels, with higher-priority processes always taking precedence over lower-priority processes. By contrast, a low nice value (i.e., high priority) doesn't give a process exclusive access to the CPU; it merely gives the process a favorable weighting in scheduling decisions. As noted in Process Priorities (Nice Values), a process with a low priority (i.e., high nice value) always receives at least some CPU time. The other important difference is that the SCHED_RR policy allows us to precisely control the order in which processes are scheduled.
The SCHED_FIFO (first-in, first-out) policy is similar to the SCHED_RR policy. The major difference is that there is no time slice. Once a SCHED_FIFO process gains access to the CPU, it executes until either:
it voluntarily relinquishes the CPU (in the same manner as described above for the SCHED_RR policy);
it terminates; or
it is preempted by a higher-priority process (in the same circumstances as described above for the SCHED_RR policy).
In the first case, the process is placed at the back of the queue for its priority level. In the last case, when the higher-priority process has ceased execution (by blocking or terminating), the preempted process continues execution (i.e., the preempted process remains at the head of the queue for its priority level).
The Linux 2.6 kernel series added two nonstandard scheduling policies: SCHED_BATCH and SCHED_IDLE. Although these policies are set via the POSIX realtime scheduling API, they are not actually realtime policies.
The SCHED_BATCH policy, added in kernel 2.6.16, is similar to the default SCHED_OTHER policy. The difference is that the SCHED_BATCH policy causes jobs that frequently wake up to be scheduled less often. This policy is intended for batch-style execution of processes.
The SCHED_IDLE policy, added in kernel 2.6.23, is also similar to SCHED_OTHER, but provides functionality equivalent to a very low nice value (i.e., lower than +19). The process nice value has no meaning for this policy. It is intended for running low-priority jobs that will receive a significant proportion of the CPU only if no other job on the system requires the CPU.