It's said that C is a low-level language. Part of what is meant by this is that much of the memory management for an application program is left to the programmer to implement. Although this approach can be quite powerful, it also places a great responsibility on the programmer.
It's also said that C is a relatively small language and an easy one to learn. However, C is only small if you don't consider a typical implementation of the standard C library, which is huge—and many programmers find C to be an easy language to use only up until they encounter pointers.
In general, a program bug can cause one of two things to happen:
It can cause the program to do something that the programmer doesn't intend. Such bugs are often due to flaws in logic, as in the number-sorting program in Chapter 3, where we put a node into the wrong branch of a tree. We've concentrated on this type of bug up until now.
It can cause the program to "bomb" or "crash." These bugs are often associated with the mishandling or misuse of pointers. This is the type of bug we'll deal with in this chapter.
What really happens when a program crashes? We'll explain it here and show how it relates to finding the bug that produces the crash.
In the vernacular of the programming world, a program crashes when an error causes it to cease to execute, abruptly and abnormally. By far the most common cause of a crash is for a program to attempt to access a memory location without having the permission to do so. The hardware will sense this and execute a jump to the operating system (OS). On Unix-family platforms, which are our focus here and in most of this book, the OS will normally announce that the program has caused a segmentation fault, commonly referred to as a seg fault, and discontinue execution of the program. On Microsoft Windows systems, the corresponding term is general protection fault. Whatever the name, the hardware must support virtual memory and the OS must make use of it in order for this error to occur. Although this is standard for today's general-purpose computers, the reader should keep in mind that it is often not the case for small, special-purpose computers, such as the embedded computers used to control machines.
In order to effectively use GDB/DDD to deal with seg faults, it is important to understand exactly how memory access errors occur. In the next few pages, we will present a brief tutorial on the role played by virtual memory (VM) during the execution of programs. Our specific focus will be on how VM issues relate to seg faults. Thus, even if you have studied VM in computing courses, the focus here may give you some new insights that will help you deal with seg faults in your debugging work.
As mentioned earlier, a seg fault occurs when your program has a memory access problem. To discuss this, it is important to first understand how a program is laid out in memory.
On Unix platforms, a program's set of allocated virtual addresses typically is laid out something like the diagram in Figure 4-1.
Here virtual address 0 is at the bottom, and the arrows show the direction of growth of two of the components, the heap and the stack, eating up the free area as they grow. The roles of the various pieces are as follows:
The text section consists of the machine instructions produced by the compiler from your program's source code. Each line of C code, for instance, will typically translate into two or three machine instructions, and the collection of all the resulting instructions makes up the text section of the executable. The formal name for this section is .text.
This component includes statically linked code, including /usr/lib/crt0.o, system code that does some initialization and then calls your main().
The data section contains all the program variables that are allocated at compile time—that is, your global variables.
Actually, this section consists of various subsections. The first is called .data and consists of your initialized variables, that is, those given in declarations like
int x = 5;
There is also a .bss section for uninitialized data, given in declarations like
int y;
When your program requests additional memory from the operating system at run time (for example, when you call malloc() in C, or invoke the new construct in C++), the requested memory is allocated in an area called the heap. If you run out of heap space, a call to brk() can be used to expand the heap (which is precisely what malloc() and friends do).
The stack section is space for dynamically allocated data. The data for function calls—including arguments, local variables, and return addresses—are stored on the stack. The stack grows each time a function call is made and shrinks each time a function returns to its caller.
Your program's dynamically linked code is not shown in the picture above due to the platform dependence of its location, but it is somewhere in there.
Let's explore this a bit. Consider the following code:
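#include <stdio.h>
#include <stdlib.h>

int q[200];

int main( void )
{
   int i, n, *p;

   p = malloc(sizeof(int));
   scanf("%d", &n);
   for (i = 0; i < 200; i++)
      q[i] = i;
   /* %x matches the 32-bit session shown below; %p is the portable way to print pointers */
   printf("%x %x %x %x %x\n", main, q, p, &i, scanf);
   return 0;
}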
The program itself doesn't do much, but we've written it as a tool to informally explore the layout of virtual address space. To that end, let's run it:
% a.out
5
80483f4 80496a0 9835008 bfb3abec 8048304
You can see that the approximate locations of the text section, data section, heap, stack, and dynamically linked functions are 0x080483f4, 0x080496a0, 0x09835008, 0xbfb3abec, and 0x08048304, respectively.
You can get a precise account of the program's memory layout on Linux by looking at the process's maps file. The process number happens to be 21111, so we'll look at the corresponding file, /proc/21111/maps:
$ cat /proc/21111/maps
009f1000-009f2000 r-xp 009f1000 00:00 0          [vdso]
009f2000-00a0b000 r-xp 00000000 08:01 4116750    /lib/ld-2.4.so
00a0b000-00a0c000 r-xp 00018000 08:01 4116750    /lib/ld-2.4.so
00a0c000-00a0d000 rwxp 00019000 08:01 4116750    /lib/ld-2.4.so
00a0f000-00b3c000 r-xp 00000000 08:01 4116819    /lib/libc-2.4.so
00b3c000-00b3e000 r-xp 0012d000 08:01 4116819    /lib/libc-2.4.so
00b3e000-00b3f000 rwxp 0012f000 08:01 4116819    /lib/libc-2.4.so
00b3f000-00b42000 rwxp 00b3f000 00:00 0
08048000-08049000 r-xp 00000000 00:16 18815309   /home/matloff/a.out
08049000-0804a000 rw-p 00000000 00:16 18815309   /home/matloff/a.out
09835000-09856000 rw-p 09835000 00:00 0          [heap]
b7ef8000-b7ef9000 rw-p b7ef8000 00:00 0
b7f14000-b7f16000 rw-p b7f14000 00:00 0
bfb27000-bfb3c000 rw-p bfb27000 00:00 0          [stack]
You needn't understand all of this. The point is that in this display, you can see your text and data sections (from the file a.out), as well as the heap and stack. You can also see where the C library (for calls to scanf(), malloc(), and printf()) has been placed (from the file /lib/libc-2.4.so). You should also recognize a permissions field whose format is similar to the familiar file permissions displayed by ls, indicating privileges such as rw-p, for example. The latter will be explained shortly.
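If you want to inspect the address range and permissions field of a maps line from your own code, a small parser is enough. The following is a minimal sketch, not part of the original program; the type map_region and the functions parse_maps_line() and region_is_writable() are hypothetical names introduced here for illustration, and only the first three fields of each line are examined.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of a maps file (hypothetical helper type). */
struct map_region {
    unsigned long start;   /* first virtual address in the region */
    unsigned long end;     /* one past the last virtual address */
    char perms[5];         /* permissions field, e.g. "r-xp" */
};

/* Parse the "start-end perms" prefix of one maps line.
   Returns 1 on success, 0 if the line does not have that shape. */
int parse_maps_line(const char *line, struct map_region *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%lx-%lx %4s",
                  &out->start, &out->end, out->perms) == 3;
}

/* The second character of the permissions field is 'w' for writable regions. */
int region_is_writable(const struct map_region *r)
{
    assert(r != NULL);
    return r->perms[1] == 'w';
}
```

Applied to the a.out text-section line in the listing above, this yields start 0x08048000, end 0x08049000, and permissions r-xp, so region_is_writable() reports that the region cannot be written.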
The virtual address space shown in Figure 4-1 conceptually extends from 0 to 2^w - 1, where w is the word size of your machine in bits. Of course, your program will typically use only a tiny fraction of that space, and the OS may reserve part of the space for its own work. But your code, through pointers, could generate an address anywhere in that range. Often such addresses will be incorrect due to "entomological conditions," that is, because of bugs in your program!
This virtual address space is viewed as organized into chunks called pages. On Pentium hardware, the default page size is 4,096 bytes. Physical memory (both RAM and ROM) is also viewed as divided into pages. When a program is loaded into memory for execution, the OS arranges for some of the pages of the program to be stored in pages of physical memory. These pages are said to be resident, and the rest are stored on disk.
At various times during execution, some program page that is not currently resident will be needed. When this occurs, it will be sensed by the hardware, which transfers control to the OS. The latter brings the required page into memory, possibly replacing another program page that is currently resident (if there are no free pages of memory available), and then returns control to our program. The evicted program page, if any, becomes nonresident and will be stored on disk.
To manage all of this, the OS maintains a page table for each process. (The Pentium's page tables have a hierarchical structure, but here we assume just one level for simplicity, and most of this discussion will not be Pentium-specific.) Each of the process's virtual pages has an entry in the table, which includes the following information:
The current physical location of this page in memory or on disk. In the latter case, the entry will indicate that the page is nonresident and may consist of a pointer to a list which ultimately leads to a physical location on disk. It may show, for instance, that a particular virtual page of the program is resident and is located in a particular physical page of memory.
Permissions—read, write, execute—for this page.
Note that the OS will not allocate partial pages to a program. For example, if the program to be run has a total size of about 10,000 bytes, it would occupy three pages of memory if fully loaded. It would not merely occupy about 2.5 pages, as pages are the smallest unit of memory manipulated by the VM system. This is an important point to understand when debugging, because it implies that some erroneous memory accesses by the program will not trigger seg faults, as you will see below. In other words, during your debugging session, you cannot say something like, "This line of source code must be okay, since it didn't cause a seg fault."
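The whole-page rule is easy to check with a little arithmetic. The sketch below uses two hypothetical helper functions, pages_needed() and slack_bytes(), which are our own names and not part of any system API. The slack is the space past the end of the program that still lies on an allocated page, which is exactly where an erroneous access can go unnoticed.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole pages needed to hold nbytes (rounds up). */
size_t pages_needed(size_t nbytes, size_t page_size)
{
    assert(page_size > 0);
    return (nbytes + page_size - 1) / page_size;
}

/* Bytes that are allocated on the last page but not used by the program. */
size_t slack_bytes(size_t nbytes, size_t page_size)
{
    return pages_needed(nbytes, page_size) * page_size - nbytes;
}
```

For the 10,000-byte program and 4,096-byte pages, pages_needed() gives 3, and slack_bytes() gives 2,288 bytes of allocated but unused space.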
Keep the virtual address space in Figure 4-1 in mind, and continue to assume that the page size is 4,096 bytes. Then virtual page 0 comprises bytes 0 through 4,095 of the virtual address space, virtual page 1 comprises bytes 4,096 through 8,191, and so on.
As mentioned, when we run a program, the OS creates a page table that it uses to manage the virtual memory of the process that executes the program code. (A review of OS processes is presented in the material on threads in Chapter 5.) Whenever that process runs, the hardware's page table register will point to that table.
Conceptually speaking, each page of the virtual address space of the process has an entry in the page table (in practice, various tricks can be used to compress the table). This page table entry stores various pieces of information related to the page. The data of interest in relation to seg faults are the access permissions for the page, which are similar to file access permissions: read, write, and execute. For example, the page table entry for a given virtual page will indicate whether your process has the right to read data from that page, the right to write data to it, and the right to execute instructions on it (if the page contains machine code).
As the program executes, it will continually access its various sections, described above, which causes the page table to be consulted by the hardware as follows:
Each time the program uses one of its global variables, read/write access to the data section is required.
Each time the program accesses a local variable, the program accesses the stack, requiring read/write access to the stack section.
Each time the program enters or leaves a function, it makes one or more accesses to the stack, requiring read/write access to the stack section.
Each time the program accesses storage that had been created by a call to malloc() or new, a heap access occurs, again requiring read/write access.
Each machine instruction that the program executes will be fetched from the text section (or from the area for dynamically linked code), thus requiring read and execute permission.
During the execution of the program, the addresses it generates will be virtual. When the program attempts to access memory at a certain virtual address, say y, the hardware will convert that to a virtual page number v, which equals y divided by 4,096 (where the division uses integer arithmetic, discarding the remainder). The hardware will then check entry v in the page table to see whether the permissions for the page match the operation to be performed. If they do match, the hardware will get the desired location's actual physical page number from this table entry and then carry out the requested memory operation. But if the table entry shows that the requested operation does not have the proper permission, the hardware will execute an internal interrupt. This will cause a jump to the OS's error-handling routine. The OS will normally then announce a memory access violation and discontinue execution of the program (i.e., remove it from the process table and from memory).
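The lookup just described can be mimicked in software. The following is a toy model only, assuming a one-level page table held in an ordinary array; real hardware does this work in the memory management unit, and the names struct pte, PERM_R, PERM_W, PERM_X, and translate() are hypothetical. A return value of 0 corresponds to the case in which the hardware would trap to the OS.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Permission bits for a page (hypothetical encoding). */
enum { PERM_R = 4, PERM_W = 2, PERM_X = 1 };

/* Toy page table entry: permissions plus physical page number. */
struct pte {
    unsigned perms;
    unsigned long phys_page;
};

/* Translate virtual address vaddr for an access needing the `wanted`
   permissions. Returns 1 and stores the physical address in *paddr if the
   access is allowed; returns 0 if it would fault. */
int translate(const struct pte *table, size_t npages,
              unsigned long vaddr, unsigned wanted, unsigned long *paddr)
{
    unsigned long v = vaddr / PAGE_SIZE;       /* virtual page number */
    unsigned long offset = vaddr % PAGE_SIZE;  /* position within the page */

    assert(table != NULL && paddr != NULL);
    if (v >= npages)
        return 0;                              /* no entry for this page */
    if ((table[v].perms & wanted) != wanted)
        return 0;                              /* permissions mismatch */
    *paddr = table[v].phys_page * PAGE_SIZE + offset;
    return 1;
}
```

With a table whose page 1 is readable and executable but not writable, a fetch from virtual address 4,112 succeeds, while a write to the same address is rejected.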
A bug in your program could result in a permissions mismatch and generate a seg fault during any of the types of memory access listed above. For instance, suppose your program contains a global declaration
int x[100];
and suppose your code contains a statement
x[i] = 3;
Recall that in C/C++, the expression x[i] is equivalent to (and really means) *(x+i), that is, the contents of the memory location pointed to by the address x+i. If the offset i is, say, 200000, then this will likely produce a virtual memory address y that is outside the set of pages that the OS has assigned for the program's data section, where the compiler and linker arranged for the array x[] to be stored. A seg fault will then occur when the write operation is attempted.
If x were instead a local variable, then the same problem would occur in your stack section.
Violations related to execute permission can occur in more subtle ways. In an assembly language program, for instance, you might have a data item named sink and a function named sunk(). When calling the function, you may accidentally write
call sink
instead of
call sunk
This would cause a seg fault because the program would attempt to execute an instruction at the address of sink, which lies in the data section, and the pages of the data section do not have execute permission enabled.
The exact analog of this coding error would not lead to a seg fault in C, since the compiler would object to a line like
z = sink(5);
when sink has been declared as a variable. But this bug could easily occur when pointers to functions are used. Consider code like this:
#include <stdio.h>

int f(int x)
{
   return x*x;
}

int (*p)(int);

int main( void )
{
   int u;

   p = f;
   u = (*p)(5);
   printf("%d\n", u);
   return 0;
}
If you were to forget the statement p = f; then p would be 0, and you would attempt to execute instructions lying in page 0, a page for which you would not have execute (or other) permission (recall Figure 4-1).
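One defensive habit is to test a function pointer before calling through it. The following is a minimal sketch; call_checked() and square() are hypothetical names used only for illustration. Note that this catches only a null pointer, not a pointer holding some other bad address.

```c
#include <assert.h>
#include <stddef.h>

int square(int x)
{
    return x * x;
}

/* Call fp(arg) only if fp has been set.
   Returns 1 and stores the value in *result on success;
   returns 0 and leaves *result untouched if fp is NULL. */
int call_checked(int (*fp)(int), int arg, int *result)
{
    assert(result != NULL);
    if (fp == NULL)
        return 0;          /* calling through fp would jump to address 0 */
    *result = fp(arg);
    return 1;
}
```

A caller can then report the missing assignment instead of crashing when call_checked() returns 0.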
In order to deepen your understanding of how seg faults occur, consider the following code, whose behavior when executed shows that seg faults do not always occur in situations where you might expect them to:
int q[200];

int main( void )
{
   int i;

   for (i = 0; i < 2000; i++) {
      q[i] = i;
   }
   return 0;
}
Notice that the programmer has apparently made a typographical error in the loop, setting up 2,000 iterations instead of 200. The C compiler will not catch this at compile time, nor will the machine code generated by the compiler check at execution time whether the array index is out of bounds. (This is GCC's default. GCC releases before 4.9 also offered a -fmudflap option that provided such run-time index checking; that option has since been removed, and -fsanitize=address fills a similar role in current releases.)
At execution time, a seg fault is quite likely to occur. However, the timing of the error may surprise you. The error is not likely to appear at the "natural" time, that is, when i = 200; rather, it is likely to happen much later than that.
To illustrate this, we ran this program on a Linux PC under GDB, in order to conveniently query addresses of variables. It turned out that the seg fault occurred not at i = 200, but at i = 728. (Your system may give different results, but the principles will be the same.) Let's see why.
From queries to GDB we found that the array q[] ended at address 0x80497bf; that is, the last byte of q[199] was at that memory location. Taking into account the Intel page size of 4,096 bytes and the 32-bit word size of this machine, a virtual address breaks down into a 20-bit page number and a 12-bit offset. In our case, q[] ended in virtual page number 0x8049 = 32841, offset 0x7bf = 1983. Offsets 0 through 1983 of that page, or 1,984 bytes, were therefore in use, so there were still 4,096 - 1,984 = 2,112 bytes on the page of memory on which q was allocated. That space can hold 2,112 / 4 = 528 integer variables (since each is 4 bytes wide on the machine used here), and our code treated it as if it contained elements of q at "positions" 200 through 727.
Those elements of q[] don't exist, of course, but the compiler did not complain. Neither did the hardware, since the writes were still being performed to a page for which we certainly had write permission (because some of the actual elements of q[] lay on it, and q[] is allocated in the data segment). Only when i became 728 did q[i] refer to an address on a different page. In this case, it was a page for which we didn't have write (or any other) permission; the virtual memory hardware detected this and triggered a seg fault.
Since each integer variable is stored in 4 bytes, this page then contains 528 (2,112 / 4) additional "phantom" elements that the code treats as belonging to the array q[]. So, although we didn't intend that it should be done, it is still legal to access q[200], q[201], and so on, all the way up to element 199 + 528 = 727, that is, q[727], without triggering a seg fault! Only when you try to access q[728] do you encounter a new page, for which you may or may not have the required access permissions. Here, we did not, and so the program seg faulted. However, the next page might, by sheer luck, actually have had the proper privileges assigned to it, and then there would have been even more phantom array elements.
The moral: As stated earlier, we can't conclude from the absence of a seg fault that a memory operation is correct.
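The page arithmetic used in this example can be reproduced in a few lines of C. This is a sketch under the same assumptions as the text (4,096-byte pages, 4-byte integers); the function names page_number(), page_offset(), and phantom_elements() are our own and hypothetical.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Virtual page number containing addr. */
unsigned long page_number(unsigned long addr)
{
    return addr / PAGE_SIZE;
}

/* Offset of addr within its page. */
unsigned long page_offset(unsigned long addr)
{
    return addr % PAGE_SIZE;
}

/* Given the address of an array's last byte, return how many extra
   elements of elem_size bytes fit on the rest of that same page. */
unsigned long phantom_elements(unsigned long last_byte, unsigned long elem_size)
{
    unsigned long bytes_left;

    assert(elem_size > 0);
    bytes_left = PAGE_SIZE - (page_offset(last_byte) + 1);
    return bytes_left / elem_size;
}
```

For the address 0x80497bf, page_number() gives 0x8049, page_offset() gives 1983, and phantom_elements() with an element size of 4 gives 528, so the first index that lands on the next page is 200 + 528 = 728.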
In the discussion above, we said that a seg fault normally results in the termination of the program. That is correct, but for serious debugging, there is a bit more you should be aware of, in connection with Unix signals.
Signals indicate exceptional conditions and are reported during program execution to allow the OS (or your own code) to react to a variety of events. A signal can be raised on a process by the underlying hardware of the system (as with SIGSEGV or SIGFPE), by the operating system (as with SIGTERM or SIGABRT), or by another process (as with SIGUSR1 or SIGUSR2), or it can even be self-sent by the process itself (via the raise() library call).
The simplest example of a signal results from hitting CTRL-C on your keyboard while a program is running. Pressing (or releasing) any key on your keyboard generates a hardware interrupt that causes an OS routine to run. When you hit CTRL-C, the OS recognizes this key combination as a special pattern and raises a signal called SIGINT for the process on the controlling terminal. In common parlance, it's said that the OS "sends a signal to the process." We will use that phrase, but it's important to realize that nothing is actually "sent" to the process. All that happens is that the OS records the signal in its process table, so that the next time the process receiving the signal gets a timeslice on the CPU, the appropriate signal handler function will be executed, as explained below. (However, given the presumed urgency of signals, the OS may also decide to give the receiving process its next timeslice sooner than it would have otherwise.)
There are many different types of signals that can be raised on a process. In Linux, you can view the entire list of signals by typing
man 7 signal
at the shell prompt. Signals have been defined under various standards, such as POSIX.1, and these signals will be present on all operating systems that are compliant. There are also signals that are unique to individual operating systems.
Each signal has its own signal handler, which is a function that is called when that particular signal is raised on a process. Going back to our CTRL-C example, when SIGINT is raised, the OS sets the current instruction of the process to the beginning of the signal handler for that particular signal. Thus, when the process resumes, it will execute the handler.
There is a default signal handler for each type of signal, which conveniently frees you from having to write them yourself unless you need to. Most harmless signals are ignored by default. More serious types of signals, like ones arising from violations of memory-access permissions, indicate conditions that make it inadvisable or even impossible for the program to continue to execute. In such cases, the default signal handler simply terminates the program.
Some signal handlers cannot be overridden, but in many cases you can write your own handler to replace the default handler provided by the OS. This is done in Unix by using either the signal() or sigaction() system calls.[15] Your custom handler function may, for instance, ignore the signal or even ask the user to choose a course of action.
Just for fun, we wrote a program that illustrates how you can write your own signal handler and, using signal(), invoke or override the default OS handler, or ignore the signal. We picked SIGINT, but you can do the same thing for any signal that can be caught. The program also demonstrates how raise() is used.
#include <signal.h>
#include <stdio.h>

void my_sigint_handler( int signum )
{
   printf("I received signal %d (that's 'SIGINT' to you).\n", signum);
   puts("Tee Hee! That tickles!\n");
}

int main(void)
{
   char choicestr[20];
   int choice;

   while ( 1 )
   {
      puts("1. Ignore control-C");
      puts("2. Custom handle control-C");
      puts("3. Use the default handler control-C");
      puts("4. Raise a SIGSEGV on myself.");
      printf("Enter your choice: ");

      fgets(choicestr, 20, stdin);
      sscanf(choicestr, "%d", &choice);

      if ( choice == 1 )
         signal(SIGINT, SIG_IGN);  // Ignore control-C
      else if ( choice == 2 )
         signal(SIGINT, my_sigint_handler);
      else if ( choice == 3 )
         signal(SIGINT, SIG_DFL);
      else if ( choice == 4 )
         raise(SIGSEGV);
      else
         puts("Whatever you say, guv'nor.\n\n");
   }

   return 0;
}
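For comparison, here is how a handler might be installed with sigaction() rather than signal(). This is a minimal sketch, not a rewrite of the program above: the names on_sigint(), install_sigint_handler(), and raise_and_check() are hypothetical, and the handler only sets a flag, which keeps it safe to run at any point in the program. The feature-test macro on the first line is needed only when compiling in a strict ISO C mode.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t got_sigint = 0;

static void on_sigint(int signum)
{
    (void)signum;
    got_sigint = 1;
}

/* Install on_sigint() as the SIGINT handler. Returns 0 on success, -1 on error. */
int install_sigint_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);   /* block no additional signals in the handler */
    sa.sa_flags = 0;
    return sigaction(SIGINT, &sa, NULL);
}

/* Install the handler, send SIGINT to ourselves, and report whether
   the handler ran (1) or not (0). */
int raise_and_check(void)
{
    int rc = install_sigint_handler();

    assert(rc == 0);
    got_sigint = 0;
    raise(SIGINT);              /* returns after the handler has run */
    return got_sigint != 0;
}
```

Passing a non-NULL third argument to sigaction() would return the previous disposition, which is useful if you want to restore it later.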
When a program commits a memory-access violation, a SIGSEGV signal is raised on the process. The default seg fault handler terminates the process and writes a "core file" to disk, which we will explain shortly.
If you wish to keep the program alive, instead of allowing it to be terminated, you can write a custom handler for SIGSEGV. Indeed, you may want to deliberately cause seg faults in order to get some kind of work done. For example, some parallel-processing software packages use artificial seg faults, to which a special handler responds, to maintain consistency between the various nodes of the system, as you will see in Section 5. Another use for specialized handlers for SIGSEGV, to be discussed in Chapter 7, involves tools for detecting and gracefully reacting to seg faults.
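One common pattern for surviving a SIGSEGV is to jump out of the handler to a recovery point saved earlier with sigsetjmp(). The following is a rough sketch under stated assumptions, not a recommended production technique: run_guarded(), fake_fault(), and harmless() are hypothetical names, and the fault here is simulated with raise() rather than a real bad memory access. After a genuine fault, program state may be damaged, so continuing is risky.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf recovery_point;

/* Handler: abandon the faulting code and jump back to run_guarded(). */
static void on_sigsegv(int signum)
{
    (void)signum;
    siglongjmp(recovery_point, 1);
}

/* Run risky() with a temporary SIGSEGV handler installed.
   Returns 0 if risky() finished normally, 1 if SIGSEGV was raised,
   and -1 if the handler could not be installed. */
int run_guarded(void (*risky)(void))
{
    struct sigaction sa, old;
    int faulted;

    assert(risky != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigsegv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    /* The second argument 1 saves the signal mask, so SIGSEGV is
       unblocked again after siglongjmp(). */
    if (sigsetjmp(recovery_point, 1) == 0) {
        risky();
        faulted = 0;
    } else {
        faulted = 1;
    }

    sigaction(SIGSEGV, &old, NULL);   /* restore the previous handler */
    return faulted;
}

/* Stand-ins for real work: one simulates a seg fault, one does nothing. */
void fake_fault(void) { raise(SIGSEGV); }
void harmless(void)   { }
```

Calling run_guarded(fake_fault) returns 1 with the process still alive, while run_guarded(harmless) returns 0.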
However, custom signal handlers may cause complications when using GDB/DDD/Eclipse. Whether it is used on its own or through the DDD GUI, GDB stops a process whenever any signal occurs. In the case of applications that operate like the parallel-processing software just mentioned, this means that GDB will halt very frequently for reasons not related to your debugging work. In order to deal with this, you will need to tell GDB not to stop when certain signals occur, using the handle command.
There are other sources of crashes besides segmentation faults. Floating-point exceptions (FPEs) cause a SIGFPE signal to be raised. Although it's called a "floating-point" exception, this signal covers integer arithmetic exceptions as well, like overflow and divide-by-zero conditions. On GNU and BSD systems, FPE handlers are passed a second argument that gives the reason for the FPE. The default handler will ignore a SIGFPE under some circumstances, such as floating point overflow, and terminate the process in other circumstances, such as integer divide-by-zero.
A bus error occurs when the CPU detects an anomalous condition on its bus while executing machine instructions. Different architectures have different requirements for what should be happening on the bus, and the exact cause of the anomaly is architecture dependent. Some examples of situations that might cause a bus error include the following:
Accessing a physical address that does not exist. This is distinct from a seg fault, in that a seg fault involves access to memory for which there is insufficient privilege. Seg faults are a matter of permissions; bus errors are a matter of an invalid address being presented to the processor.
On many architectures, machine instructions that access 32-bit quantities are required to be word aligned, meaning that the memory address of the quantity must be a multiple of 4. A pointer error that results in trying to access a 4-byte number at an odd-numbered address might cause a bus error:
int main(void)
{
   char *char_ptr;
   int *int_ptr;
   int int_array[2];

   // char_ptr points to first array element
   char_ptr = (char *) int_array;

   // Causes int_ptr to point one byte past the start of an existing int.
   // Since an int can't be only one byte, int_ptr is no longer aligned.
   int_ptr = (int *) (char_ptr+1);

   *int_ptr = 1;   // And this might cause a bus error.

   return 0;
}
This program will not cause a bus error under Linux running on the x86 architecture, because on these processors nonaligned memory accesses are legal; they just execute more slowly than aligned accesses do.
In any event, a bus error is a processor-level exception that causes a SIGBUS signal to be raised on a Unix system. By default, SIGBUS will cause a process to dump core and terminate.
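When you suspect an alignment problem, it can help to test a pointer's alignment explicitly and to copy bytes with memcpy() instead of dereferencing a possibly misaligned pointer. The helpers below are a minimal sketch; is_aligned(), store_int_unaligned(), and load_int_unaligned() are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Is the address held in p a multiple of alignment? */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment > 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Store an int at dst without requiring dst to be aligned. */
void store_int_unaligned(void *dst, int value)
{
    memcpy(dst, &value, sizeof value);
}

/* Load an int from src without requiring src to be aligned. */
int load_int_unaligned(const void *src)
{
    int value;

    memcpy(&value, src, sizeof value);
    return value;
}
```

In the program above, is_aligned(int_ptr, sizeof(int)) would report 0 for the pointer one byte past the start of int_array, flagging the problem before the store is attempted.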
[15] There are two functions that are used to override default signal handlers because Linux, as with other Unixes, conforms to multiple standards. The signal() function, which is easier to use than sigaction(), conforms to the ANSI standard, whereas the sigaction() function is more complicated, but also more versatile, and conforms to the POSIX standard.