This chapter covers various topics that are prerequisites for system programming. We begin by introducing system calls and detailing the steps that occur during their execution. We then consider library functions and how they differ from system calls, and couple this with a description of the (GNU) C library.
Whenever we make a system call or call a library function, we should always check the return status of the call in order to determine if it was successful. We describe how to perform such checks, and present a set of functions that are used in most of the example programs in this book to diagnose errors from system calls and library functions.
We conclude by looking at various issues related to portable programming, specifically the use of feature test macros and the standard system data types defined by SUSv3.
A system call is a controlled entry point into the kernel, allowing a process to request that the kernel perform some action on the process’s behalf. The kernel makes a range of services accessible to programs via the system call application programming interface (API). These services include, for example, creating a new process, performing I/O, and creating a pipe for interprocess communication. (The syscalls(2) manual page lists the Linux system calls.)
Before going into the details of how a system call works, we note some general points:
A system call changes the processor state from user mode to kernel mode, so that the CPU can access protected kernel memory.
The set of system calls is fixed. Each system call is identified by a unique number. (This numbering scheme is not normally visible to programs, which identify system calls by name.)
Each system call may have a set of arguments that specify information to be transferred from user space (i.e., the process’s virtual address space) to kernel space and vice versa.
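Although programs normally identify system calls by name, the numbering scheme can be observed from user space. The following minimal sketch assumes a Linux system with glibc, whose syscall() function accepts a system call number directly. The helper name getppid_by_number() is hypothetical and exists only for illustration.

```c
#define _GNU_SOURCE             /* Ensure syscall() is declared in <unistd.h> */
#include <sys/syscall.h>        /* SYS_getppid: the number for getppid() */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: invoke getppid() by number rather than through
   its named wrapper function. */
static pid_t
getppid_by_number(void)
{
    return (pid_t) syscall(SYS_getppid);
}
```

Calling getppid_by_number() should yield the same value as calling getppid(), since both reach the same kernel service routine. The numeric value of SYS_getppid differs between hardware architectures, which is one reason programs use names rather than numbers.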
From a programming point of view, invoking a system call looks much like calling a C function. However, behind the scenes, many steps occur during the execution of a system call. To illustrate this, we consider the steps in the order that they occur on a specific hardware implementation, the x86-32. The steps are as follows:
The application program makes a system call by invoking a wrapper function in the C library.
The wrapper function must make all of the system call arguments available to the system call trap-handling routine (described shortly). These arguments are passed to the wrapper via the stack, but the kernel expects them in specific registers. The wrapper function copies the arguments to these registers.
Since all system calls enter the kernel in the same way, the kernel needs some method of identifying the system call. To permit this, the wrapper function copies the system call number into a specific CPU register (%eax).
The wrapper function executes a trap machine instruction (int 0x80), which causes the processor to switch from user mode to kernel mode and execute code pointed to by location 0x80 (128 decimal) of the system’s trap vector.
More recent x86-32 architectures implement the sysenter instruction, which provides a faster method of entering kernel mode than the conventional int 0x80 trap instruction. The use of sysenter is supported in the 2.6 kernel and from glibc 2.3.2 onward.
In response to the trap to location 0x80, the kernel invokes its system_call() routine (located in the assembler file arch/i386/entry.S) to handle the trap. This handler:
Saves register values onto the kernel stack (see The Stack and Stack Frames).
Checks the validity of the system call number.
Invokes the appropriate system call service routine, which is found by using the system call number to index a table of all system call service routines (the kernel variable sys_call_table). If the system call service routine has any arguments, it first checks their validity; for example, it checks that addresses point to valid locations in user memory. Then the service routine performs the required task, which may involve modifying values at addresses specified in the given arguments and transferring data between user memory and kernel memory (e.g., in I/O operations). Finally, the service routine returns a result status to the system_call() routine.
Restores register values from the kernel stack and places the system call return value on the stack.
Returns to the wrapper function, simultaneously returning the processor to user mode.
If the return value of the system call service routine indicated an error, the wrapper function sets the global variable errno (see Handling Errors from System Calls and Library Functions) using this value. The wrapper function then returns to the caller, providing an integer return value indicating the success or failure of the system call.
On Linux, system call service routines follow a convention of returning a nonnegative value to indicate success. In case of an error, the routine returns a negative number, which is the negated value of one of the errno constants. When a negative value is returned, the C library wrapper function negates it (to make it positive), copies the result into errno, and returns -1 as the function result of the wrapper to indicate an error to the calling program.
This convention relies on the assumption that system call service routines don’t return negative values on success. However, for a few of these routines, this assumption doesn’t hold. Normally, this is not a problem, since the range of negated errno values doesn’t overlap with valid negative return values. However, this convention does cause a problem in one case: the F_GETOWN operation of the fcntl() system call, which we describe in Section 63.3.
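The return-value convention described above can be sketched in a few lines of C. This is an illustration only, not the actual glibc code. The function name wrapper_result() and the constant HYPOTHETICAL_MAX_ERRNO are invented here, and the bound of 4095 is an assumption about how large a negated errno value can be; the text states only that the error range does not normally overlap with valid negative results.

```c
#include <errno.h>

/* Assumption: negated errno values lie in the range -4095..-1. */
#define HYPOTHETICAL_MAX_ERRNO 4095

/* Hypothetical sketch of the final step of a wrapper function: 'raw' is
   the value handed back by the system call service routine. */
static long
wrapper_result(long raw)
{
    if (raw < 0 && raw >= -HYPOTHETICAL_MAX_ERRNO) {
        errno = (int) -raw;     /* Negate to obtain a positive errno value */
        return -1;              /* Report failure to the caller */
    }

    return raw;                 /* Success: pass the result through unchanged */
}
```

With this sketch, wrapper_result(-ENOENT) returns -1 and leaves ENOENT in errno, while a nonnegative value is returned as is. A legitimate negative result that happened to fall inside the assumed error range would be misreported as an error, which is the kind of ambiguity noted above for F_GETOWN.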
Figure 3-1 illustrates the above sequence using the example of the execve() system call. On Linux/x86-32, execve() is system call number 11 (__NR_execve). Thus, in the sys_call_table vector, entry 11 contains the address of sys_execve(), the service routine for this system call. (On Linux, system call service routines typically have names of the form sys_xyz(), where xyz() is the system call in question.)
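The idea of indexing a table of service routines by system call number can be imitated in ordinary user-space C. The following toy sketch is an analogy rather than kernel code: every name in it (toy_call_table, toy_dispatch(), and the two toy routines) is hypothetical, and the choice to reject an out-of-range number with -ENOSYS is an assumption made for the sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Each toy service routine takes one argument and returns a result,
   or a negated errno value on failure. */
typedef long (*service_routine)(long arg);

static long toy_sys_double(long arg) { return arg * 2; }
static long toy_sys_fail(long arg)   { (void) arg; return -EINVAL; }

/* The table index plays the role of the system call number. */
static const service_routine toy_call_table[] = {
    toy_sys_double,             /* Toy call number 0 */
    toy_sys_fail,               /* Toy call number 1 */
};

#define TOY_NR_CALLS (sizeof(toy_call_table) / sizeof(toy_call_table[0]))

/* Check the validity of the call number, then invoke the routine
   found at that index in the table. */
static long
toy_dispatch(long nr, long arg)
{
    if (nr < 0 || (size_t) nr >= TOY_NR_CALLS)
        return -ENOSYS;         /* Assumed response to an invalid number */

    return toy_call_table[nr](arg);
}
```

Here toy_dispatch(0, 21) returns 42, while toy_dispatch(7, 0) is rejected before any routine runs, mirroring the validity check that precedes the table lookup.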
The information given in the preceding paragraphs is more than we’ll usually need to know for the remainder of this book. However, it illustrates the important point that, even for a simple system call, quite a bit of work must be done, and thus system calls have a small but appreciable overhead.
As an example of the overhead of making a system call, consider the getppid() system call, which simply returns the process ID of the parent of the calling process. On one of the author’s x86-32 systems running Linux 2.6.25, 10 million calls to getppid() required approximately 2.2 seconds to complete. This amounts to around 0.2 microseconds per call. By comparison, on the same system, 10 million calls to a C function that simply returns an integer required 0.11 seconds, or around one-twentieth of the time required for calls to getppid(). Of course, most system calls have significantly more overhead than getppid().
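A comparison of this kind can be reproduced with a short timing harness. The sketch below is not the program used to obtain the figures quoted above; the function names are hypothetical, and it assumes a system that provides clock_gettime() with CLOCK_MONOTONIC. The ordinary function is called through a volatile function pointer so that the compiler cannot remove or inline the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <unistd.h>

static volatile long sink;      /* Keeps results from being optimized away */

static int return_int(void) { return 42; }

/* Convert the interval between two timestamps to seconds. */
static double
elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double) (end.tv_sec - start.tv_sec) +
           (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Time 'iterations' calls to the getppid() system call. */
static double
time_getppid(long iterations)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        sink = getppid();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_seconds(start, end);
}

/* Time 'iterations' calls to a function that simply returns an integer. */
static double
time_function_call(long iterations)
{
    int (*volatile fn)(void) = return_int;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        sink = fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_seconds(start, end);
}
```

Calling time_getppid(10000000) and time_function_call(10000000) and dividing each result by the iteration count gives a per-call estimate. The absolute numbers depend heavily on the hardware, kernel version, and compiler options, so they should be expected to differ from the figures given above.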
Since, from the point of view of a C program, calling the C library wrapper function is synonymous with invoking the corresponding system call service routine, in the remainder of this book, we use wording such as “invoking the system call xyz()” to mean “calling the wrapper function that invokes the system call xyz().”
Appendix A describes the strace command, which can be used to trace the system calls made by a program, either for debugging purposes or simply to investigate what a program is doing.
More information about the Linux system call mechanism can be found in [Love, 2010], [Bovet & Cesati, 2005], and [Maxwell, 1999].