Debugging Dynamically Allocated Memory

As you may know, dynamically allocated memory (DAM) is memory that a program requests from the heap with functions like malloc() and calloc().^[25] Dynamically allocated memory is typically used for data structures like binary trees and linked lists, and it is also at work behind the scenes when you create an object in object-oriented programming. Even the standard C library uses DAM for its own internal purposes. You may also recall that dynamic memory must be freed when you're done with it.^[26]

DAM problems are notoriously difficult to find and fall into a few general categories:

Dynamically allocated memory is not freed.
The call to malloc() fails (this is easy to detect by checking the return value of malloc()).
A read or write is performed to an address outside the DAM segment.
A read or write is performed to memory within a DAM region after the segment has been freed.
free() is called twice on the same segment of dynamic memory.

These errors may not cause your program to crash in an obvious way. Let's discuss this a bit further. To make the discussion more concrete, here's an example illustrating these problems:

Example Listing 7-6. memprobs.c

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
   int *a = (int *) malloc( 3*sizeof(int) );  // malloc return not checked
   int *b = (int *) malloc( 3*sizeof(int) );  // malloc return not checked

   for (int i = -1; i <= 3; ++i)
      a[i] = i; // a bad write for i = -1 and 3

   free(a);
   printf("%d\n", a[1]); // a read from freed memory
   free(a); // a double free on pointer a

   return 0; // program ends without freeing *b.
}

The first problem is called a memory leak. For example, consider the following code:

Example Listing 7-7. Example of a memory leak

int main( void )
{
       ... lots of previous code ...

       myFunction();

       ... lots of future code ...
}


void myFunction( void )
{
       char *name = (char *) malloc( 10*sizeof(char) );
}

When myFunction() executes, you allocate memory for 10 char values. The only way you can refer to this memory is by using the address returned by malloc(), which you've stored in the pointer variable name. If, for some reason, you lose track of the address—for example, name goes out of scope when myFunction() exits, and you haven't saved a copy of its value elsewhere—then you have no way to access the allocated memory, and, in particular, you have no way to free it.

But this is exactly what happens in the code. Dynamically allocated memory doesn't simply disappear or go out of scope the way that storage for a stack-allocated variable like name does, and so each time myFunction() is called, it gobbles up memory for 10 chars, which is then never released. The net result is that the available heap space gets smaller and smaller. That's why this bug is called a memory leak.

Memory leaks decrease the amount of memory available to the program. On most modern systems, like GNU/Linux, this memory is reclaimed by the operating system when the application with the memory leak terminates. On ancient systems, like Microsoft DOS and Microsoft Windows 3.1, leaked memory is lost until the operating system is rebooted. In either case, memory leaks cause degradation in system performance due to increased paging. Over time, they can cause the program with the leak, or even the entire system, to crash.
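
One way to avoid the leak in Listing 7-7 is to make sure some live pointer always holds the address until free() is called. The following is only a sketch of that idea: make_name() and use_name() are hypothetical names that don't appear in the original listing, and the string stored in the buffer is arbitrary.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of myFunction(): instead of dropping the
   address when it returns, it hands the address to the caller. */
char *make_name(void)
{
   char *name = malloc(10 * sizeof *name);
   if (name != NULL)
      strcpy(name, "anonymous");   /* 9 chars plus the NUL fit in 10 bytes */
   return name;
}

/* Hypothetical caller: it now owns the block, so it must free it. */
size_t use_name(void)
{
   char *name = make_name();
   if (name == NULL)
      return 0;

   size_t len = strlen(name);
   free(name);                     /* released exactly once, by the owner */
   return len;
}
```

The design choice here is simply to decide which function owns the block. Whoever ends up holding the last copy of the address is responsible for calling free() on it.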

The second problem encountered with dynamically allocated memory is that the call to malloc() may fail. There are lots of ways this can happen. For example, a bug in a computation might cause a request for an amount of DAM that is too large or is negative. Or perhaps the system really is out of memory. If you don't realize this has happened and continue to try to read and write to what you mistakenly believe to be valid DAM, complications develop in an already unpleasant situation. This is a type of access violation that we'll discuss shortly. However, to avoid it, you should always check whether or not malloc() returns a non-NULL pointer, and try to exit the program gracefully if it does not.
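
As a rough illustration of this advice, here is a small wrapper that refuses suspicious requests and reports a failed allocation instead of handing back a pointer that the caller might blindly use. The name alloc_ints() is hypothetical, and rejecting a request for zero elements is a policy chosen for this sketch, not something the C standard requires.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: allocate an array of n ints. Returns NULL,
   after printing a diagnostic, if the request cannot be satisfied. */
int *alloc_ints(size_t n)
{
   /* A buggy computation can produce zero, or a "negative" count that
      turns into a huge size_t. Reject anything whose byte count
      would overflow. */
   if (n == 0 || n > SIZE_MAX / sizeof(int)) {
      fprintf(stderr, "alloc_ints: bad element count %zu\n", n);
      return NULL;
   }

   int *p = malloc(n * sizeof *p);
   if (p == NULL)
      perror("alloc_ints: malloc");   /* the system really is out of memory */
   return p;
}
```

A caller would test the result before touching it, for example `int *a = alloc_ints(3); if (a == NULL) return EXIT_FAILURE;`, which is one way to exit gracefully.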

The third and fourth problems are called access errors. Both are basically versions of the same thing: The program tries to read from or write to a memory address that is not available to it. The third problem involves accessing a memory address above or below the DAM segment. The fourth problem involves accessing a memory address that used to be available, but was freed prior to the access attempt.
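
To make the two kinds of access error concrete, the sketch below keeps the element count next to the pointer and refuses any access that falls outside the block or that arrives after the block has been freed. Everything in it (struct int_buf and the three int_buf_* functions) is hypothetical and is meant only to show what "inside the segment" and "still allocated" mean in code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: a heap block that remembers its element count. */
struct int_buf {
   int *data;
   size_t len;
};

/* Allocate room for len ints; returns 0 on success, -1 on failure. */
int int_buf_init(struct int_buf *b, size_t len)
{
   b->data = calloc(len, sizeof *b->data);
   b->len = (b->data != NULL) ? len : 0;
   return (b->data != NULL) ? 0 : -1;
}

/* Store v at index i only if i lies inside the block. */
int int_buf_set(struct int_buf *b, long i, int v)
{
   if (b->data == NULL || i < 0 || (size_t) i >= b->len)
      return -1;                 /* below, above, or already freed */
   b->data[i] = v;
   return 0;
}

/* Release the block and mark the wrapper empty, so that later
   accesses through it are rejected rather than performed. */
void int_buf_free(struct int_buf *b)
{
   free(b->data);
   b->data = NULL;
   b->len = 0;
}
```

With a three-element buffer, a loop that runs i from -1 to 3 (as in Listing 7-6) gets a -1 return value for i = -1 and i = 3 instead of silently writing outside the block.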

Calling free() on the same segment of DAM twice is colloquially known as a double free. The C library has internal memory management structures that describe the boundaries of each allocated DAM segment. When you call free() twice on the same pointer to dynamic memory, the program's memory management structure becomes corrupted, which can lead to the program crashing, or in some cases, can allow a malicious programmer to exploit the bug to produce a buffer overflow. In a sense, this is also an access violation, but for the C library itself, rather than the program.
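
A common defensive habit, sketched here under the assumption that only one copy of the pointer exists, is to reset a pointer to NULL as soon as it has been freed. The C standard defines free(NULL) as a no-op, so a second call through the same variable becomes harmless. The helper name free_and_clear() is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: free the block that *pp points to, then reset
   *pp to NULL so a repeated call does nothing. */
void free_and_clear(int **pp)
{
   if (pp != NULL) {
      free(*pp);     /* free(NULL) is defined to do nothing */
      *pp = NULL;
   }
}
```

This does nothing for other copies of the address that may be stored elsewhere in the program, so it reduces the risk of a double free without eliminating it.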

Access violations cause one of two things to happen: The program can crash, possibly writing a core file^[27] (usually after receiving the segmentation fault signal), or, much worse, it can continue to execute, leading to data corruption.

Of these two consequences, the former is infinitely more desirable. In fact, there are a number of tools available that will cause your program to seg fault and dump core whenever any problem with DAM is detected, rather than risk the alternative!

You may wonder, "Why in the world would I want my program to seg fault?" You should get comfortable with the idea that if your code's handling of DAM is buggy, it is a Very Good Thing^TM when it crashes, because the other option is intermittent, puzzling, and irreproducible bad behavior. Memory corruption can go unnoticed for a long time before its effects are felt. Often, the problem manifests itself in parts of your program that are quite far from the bug, and tracking it down can be a nightmare. If that weren't bad enough, memory corruption can give rise to breaches of security. Applications known to cause buffer overruns and double frees can, in some instances, be exploited by malicious crackers to run arbitrary code and are responsible for many operating system security vulnerabilities.

On the other hand, when your program seg faults and dumps core, you can perform a postmortem on the core file and learn the precise source file and line number of the code that caused the seg fault. And that, dear reader, is preferable to bug hunting.

In short, catching DAM problems as soon as possible is of extreme importance.

Strategies for Detecting DAM Problems

In this section we'll discuss Electric Fence, a library that enforces a "fence" around allocated memory addresses. Access to memory outside of these fences typically results in a seg fault and core dump. We'll also discuss two GNU tools, mtrace() and MALLOC_CHECK_, that add hooks into the standard libc allocation functions to keep records about currently allocated memory. This allows libc to perform checks on memory you're about to read, write, or free. Keep in mind that care is needed when using several software tools, each of which uses hooks to heap-related function calls, because one facility may install one of its hooks over a previously installed hook.^[28]

Electric Fence

Electric Fence, or EFence, is a library written by Bruce Perens in 1988 and released under the GNU GPL license while he worked at Pixar. When linked into your code, it causes the program to immediately seg fault and dump core^[29] when any of the following occur:

A read or write is performed outside the boundary of DAM.
A read or write is performed to DAM that has already been freed.
A free() is performed on a pointer that doesn't point to DAM allocated by malloc() (this includes double frees as a special case).

Let's see how to use Electric Fence to track down malloc() problems. Consider the program outOfBound.c:

Example Listing 7-8. outOfBound.c

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
   int *a = (int *) malloc( 2*sizeof(int) );

   for (int i=0; i<=2; ++i) {
      a[i] = i;   // a bad write for i = 2
      printf("%d\n", a[i]);
   }

   free(a);
   return 0;
}

Although the program contains an archetypal malloc() bug, it will probably compile without warnings. Chances are, it will even run without problems:^[30]

$ gcc -g3 -Wall -std=c99 outOfBound.c -o outOfBound_without_efence
$ ./outOfBound_without_efence
0
1
2

We were able to write beyond the last element of the array a[]. Everything looks fine now, but that just means this bug manifests itself unpredictably and will be difficult to nail down later on.

Now we'll link outOfBound with EFence and run it. By default, EFence only catches reads or writes beyond the last element of a dynamically allocated region. That means outOfBound should seg fault when you try to write to a[2]:

$ gcc -g3 -Wall -std=c99 outOfBound.c -o outOfBound_with_efence -lefence
$ ./outOfBound_with_efence
  Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.
0
1
Segmentation fault (core dumped)

Sure enough, EFence found the write operation past the last element of the array.

Accidentally accessing memory before the first element of an array (for example, specifying the "element" a[-1]) is less common, but it can certainly occur as a result of buggy index calculations. EFence provides a global int named EF_PROTECT_BELOW. When you set this variable to 1, EFence catches only array underruns and does not check for array overruns:

extern int EF_PROTECT_BELOW;


double myFunction( void )
{
   EF_PROTECT_BELOW = 1;  // Check from below

   int *a = (int *) malloc( 2*sizeof(int) );

   for (int i=-2; i<2; ++i) {
      a[i] = i;
      printf("%d\n", a[i]);
   }
   ...
}

Because of the way EFence works, you can catch either attempts to access memory beyond dynamically allocated blocks or attempts to access memory before allocated blocks, but not both types of access errors at the same time.

To be thorough, you should run your program twice using EFence: once in the default mode to check for dynamic memory overruns and a second time with EF_PROTECT_BELOW set to 1 to check for underruns.^[31]

In addition to EF_PROTECT_BELOW, EFence has a few other global integer variables that you can set in order to control its behavior:

EF_DISABLE_BANNER: Setting this variable to 1 hides the banner that is displayed when you run a program linked with EFence. Doing this is not recommended. The banner warns you that EFence is linked into the application and that the executable should not be used for a production release: executables linked to EFence are larger, run more slowly, and produce very large core files.
EF_PROTECT_BELOW: As discussed, EFence checks for DAM overruns by default. Setting this variable to 1 will cause EFence to check for memory underruns.
EF_PROTECT_FREE: By default, EFence will not check for access to DAM that has already been freed. Setting this variable to 1 enables protection of freed memory.
EF_FREE_WIPES: By default, EFence will not change the values stored in memory that is freed. Setting this variable to a nonzero value causes EFence to fill segments of dynamically allocated memory with 0xbd before they are released. This makes improper accesses to freed memory easier for EFence to detect.
EF_ALLOW_MALLOC_0: By default, EFence will trap any call to malloc() you make that has an argument of 0 (i.e., any request for zero bytes of memory). The rationale is that writing something like char *p = (char *) malloc(0); is probably a bug. However, if for some reason you really do mean to pass zero to malloc(), then setting this variable to a nonzero value will cause EFence to ignore such calls.

As an exercise, try writing a program that accesses DAM that has already been freed, and use EFence to catch the error.

Whenever you change one of these global variables, you need to recompile the program, which can be inconvenient. Thankfully, there's an easier way. You can also set shell environment variables with the same names as EFence's global variables. EFence will detect the shell variables and take the appropriate action.

As a demonstration, we'll set the environment variable EF_DISABLE_BANNER to suppress the printing of the EFence banner. (As mentioned before, you shouldn't do this; do as I say, not as I do!) If you use Bash, execute

$ export EF_DISABLE_BANNER=1

C shell users should execute

% setenv EF_DISABLE_BANNER 1

Then re-run Example Listing 7-8, and verify that the banner is disabled.

Another trick is to set the EFence variables from within GDB during a debugging session. This works because the EFence variables are global; however, it also means that the program needs to be executing, but paused.

Debugging DAM Problems with GNU C Library Tools

If you're working on a GNU platform, such as GNU/Linux, there are some GNU C library-specific features, similar to EFence, that you can use to catch and recover from dynamic memory problems. We'll discuss them briefly here.

The MALLOC_CHECK_ Environment Variable

The GNU C library provides a shell environment variable named MALLOC_CHECK_ that can be used, like EFence, to catch DAM access violations, but you don't need to recompile your program to use it. The settings and their effects are as follows:

0: All DAM checking is turned off (this is also the case if the variable is undefined).
1: A diagnostic message is printed on stderr when heap corruption is detected.
2: The program aborts immediately and dumps core when heap corruption is detected.
3: The combined effects of 1 and 2.^[32]

Since MALLOC_CHECK_ is an environment variable, using it to find heap-related problems is as simple as typing:

$ export MALLOC_CHECK_=3

Although MALLOC_CHECK_ is more convenient to use than EFence, it has a few serious drawbacks. First, MALLOC_CHECK_ reports a dynamic memory problem only upon the next execution of a heap-related function (such as malloc(), realloc(), or free()) following an illegal memory access. This means that not only do you not know the source file and line number of the problematic code, you often don't even know which pointer is the problem variable. To illustrate, consider this code:

Example Listing 7-9. malloc-check-0.c

 1  int main(void)
 2  {
 3     int *p = (int *) malloc(sizeof(int));
 4     int *q = (int *) malloc(sizeof(int));
 5
 6     for (int i=0; i<400; ++i)
 7       p[i] = i;
 8
 9     q[0] = 0;
10
11     free(q);
12     free(p);
13     return 0;
14  }

The program aborts at line 11 when the problem really occurs on line 7. Examining the core file might lead you to believe the problem lies with q, not p:

$ MALLOC_CHECK_=3 ./malloc-check-0
malloc: using debugging hooks
free(): invalid pointer 0x8049680!
Aborted (core dumped)
$ gdb malloc-check-0 core
Core was generated by `./malloc-check-0'.
Program terminated with signal 6, Aborted.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0  0x40046a51 in kill () from /lib/libc.so.6
(gdb) bt
#0  0x40046a51 in kill () from /lib/libc.so.6
#1  0x40046872 in raise () from /lib/libc.so.6
#2  0x40047986 in abort () from /lib/libc.so.6
#3  0x400881d2 in _IO_file_xsputn () from /lib/libc.so.6
#4  0x40089278 in free () from /lib/libc.so.6
#5  0x080484bc in main () at malloc-check-0.c:13

You might be able to live with this drawback when debugging 14-line programs, but it can be a serious issue when working with a real application. Nevertheless, knowing that a DAM problem exists at all is useful information.

Second, this implies that if no heap-related function is called after an access error occurs, MALLOC_CHECK_ will not report the error at all.

Third, the MALLOC_CHECK_ error messages don't seem to be very meaningful. Although the previous program listing had an array overrun error, the error message was simply "invalid pointer." Technically true, but not useful.

Finally, MALLOC_CHECK_ is disabled for setuid and setgid programs, because this combination of features could be used in a security exploit. It can be re-enabled by creating the file /etc/suid-debug. The contents of this file aren't important; only the file's existence matters.

In conclusion, MALLOC_CHECK_ is a convenient tool to use during code development to catch heap-related programming blunders. However, if you suspect a DAM problem or want to carefully scan your code for possible DAM problems, you should use another utility.

Using the mcheck() Facility

An alternative to the MALLOC_CHECK_ facility for catching DAM problems is the mcheck() facility. We've found this method to be more satisfactory than MALLOC_CHECK_. The prototype for mcheck() is

#include <mcheck.h>
int mcheck (void (*ABORTHANDLER) (enum mcheck_status STATUS));

You must call mcheck() before calling any heap-related functions, otherwise the call to mcheck() will fail. Therefore, this function should be invoked very early in your program. A call to mcheck() returns a 0 upon success and a -1 if it is called too late.

The argument, ABORTHANDLER, is a pointer to a user-supplied function that's called when an inconsistency in DAM is detected. If you pass NULL to mcheck(), then the default handler is used. Like MALLOC_CHECK_, this default handler prints an error message to stderr and calls abort() to produce a core file. Unlike MALLOC_CHECK_, the error message is useful. For instance, trampling past the end of a dynamically allocated segment in the following example:

Example Listing 7-10. mcheckTest.c

#include <mcheck.h>
#include <stdlib.h>

int main(void)
{
   mcheck(NULL);
   int *p = (int *) malloc(sizeof(int));
   p[1] = 0;
   free(p);
   return 0;
}

produces the error message shown here:

$ gcc -g3 -Wall -std=c99 mcheckTest.c -o mcheckTest -lmcheck
$ ./mcheckTest
memory clobbered past end of allocated block
Aborted (core dumped)

Other types of problems have similarly descriptive error messages.
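
If you would rather supply a handler of your own than pass NULL, it has to match the signature in the mcheck() prototype. The sketch below is one possible shape for such a handler. The names status_text() and report_and_abort() are hypothetical and the message wording is ours, but the enum mcheck_status constants are the ones declared in the glibc header mcheck.h.

```c
#include <mcheck.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: translate an mcheck status code into text. */
const char *status_text(enum mcheck_status status)
{
   switch (status) {
   case MCHECK_DISABLED: return "consistency checking is not enabled";
   case MCHECK_OK:       return "block is fine";
   case MCHECK_HEAD:     return "memory before the block was clobbered";
   case MCHECK_TAIL:     return "memory after the block was clobbered";
   case MCHECK_FREE:     return "block was already freed";
   default:              return "unknown status";
   }
}

/* Hypothetical handler with the signature that mcheck() expects.
   It reports the problem and then aborts, so a core file is still
   produced. */
void report_and_abort(enum mcheck_status status)
{
   fprintf(stderr, "heap check failed: %s\n", status_text(status));
   abort();
}
```

To use it, call `mcheck(report_and_abort)` before any other heap-related function in main() and check that it returns 0.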

Using mtrace() to Catch Memory Leaks and Double Frees

The mtrace() facility is part of the GNU C library and is used to catch memory leaks and double frees in C and C++ programs. Using it involves five steps:

Set the environment variable MALLOC_TRACE to a valid filename. This is the name of the file in which mtrace() places its messages. If this variable isn't set to a valid filename or write permissions are not set for the file, mtrace() will do nothing.
Include the mcheck.h header file.
Call mtrace() at the top of your program. Its prototype is
```
#include <mcheck.h>
void mtrace(void);
```
Run the program. If any problems are detected, they'll be documented, in a form that is not human-readable, in the file named by MALLOC_TRACE. Also, for security reasons, mtrace() won't do anything for setuid or setgid executables.
The mtrace() facility comes with a Perl script called mtrace that's used to parse the log file and print the contents to standard output in human-readable form.

Note that there's also a muntrace() call, which is used to stop memory tracing, but the glibc info page recommends not using it. The C library, which may also use DAM for your program, is notified that your program has terminated only after main() has returned or a call to exit() has been made. Memory that the C library uses for your program is not released until this happens. A call to muntrace() before this memory is released may lead to false positives.

Let's take a look at a simple example. Here's some code that illustrates both of the problems that mtrace() catches. In the following code, we never free the memory allocated on line 6 and pointed to by p, and on line 10 we call free() on the pointer q, even though it doesn't point to dynamically allocated memory.

Example Listing 7-11. mtrace1.c

 1  int main(void)
 2  {
 3     int *p, *q;
 4
 5     mtrace();
 6     p = (int *) malloc(sizeof(int));
 7     printf("p points to %p\n", p);
 8     printf("q points to %p\n", q);
 9
10     free(q);
11     return 0;
12  }

We compile this program and run it, after setting the MALLOC_TRACE variable.

$ gcc -g3 -Wall -Wextra -std=c99 -o mtrace1 mtrace1.c
$ MALLOC_TRACE="./mtrace.log" ./mtrace1
p points to 0x8049a58
q points to 0x804968c

If you look at the contents of mtrace.log, it makes no sense at all. However, running the Perl script mtrace on it produces understandable output:

$ cat mtrace.log
= Start
@ ./mtrace1:(mtrace+0x120)[0x80484d4] + 0x8049a58 0x4
@ ./mtrace1:(mtrace+0x157)[0x804850b] - 0x804968c
$ mtrace mtrace.log
- 0x0804968c Free 3 was never alloc'd 0x804850b

Memory not freed:
-----------------
   Address     Size     Caller
0x08049a58      0x4  at 0x80484d4

However, this is only slightly helpful, because although mtrace found the problems, it reported them as pointer addresses. Fortunately, it can do better. The mtrace script also takes the executable's filename as an optional argument, given before the name of the log file. Using this option, you get line numbers along with the associated problems.

$ mtrace mtrace1 mtrace.log
- 0x0804968c Free 3 was never alloc'd
/home/p/codeTests/mtrace1.c:15

Memory not freed:
-----------------
   Address     Size     Caller
0x08049a58      0x4  at /home/p/codeTests/mtrace1.c:11

Now this is what we wanted to see!

Like the MALLOC_CHECK_ and mcheck() utilities, mtrace() won't prevent your program from crashing. It simply checks for problems. If your program crashes, some of the output of mtrace() may become lost or garbled, which could produce puzzling error reports. The best way to cope with this is to catch and handle seg faults in order to give mtrace() a shot at shutting down gracefully. The following example illustrates how to do so.

Example Listing 7-12. mtrace2.c

#include <mcheck.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

void sigsegv_handler(int signum);

int main(void)
{
   int *p;

   signal(SIGSEGV, sigsegv_handler);
   mtrace();
   p = (int *) malloc(sizeof(int));


   raise(SIGSEGV);
   return 0;
}

void sigsegv_handler(int signum)
{
   printf("Caught sigsegv: signal %d. Shutting down gracefully.\n", signum);
   muntrace();
   abort();
}

^[25] For the rest of this section, we'll refer only to malloc(), but we really mean malloc() and friends, like calloc() and realloc().

^[26]One notable exception is the alloca() function, which requests dynamic memory from the current stack frame rather than from the heap. The memory in the frame is automatically freed when the function returns. Thus, you don't have to free memory allocated by alloca().

^[27]This is colloquially known as dumping core.

^[28]Actually, you can use mtrace() and MALLOC_CHECK_ together safely, because mtrace() is careful to preserve any existing hooks it finds.

^[29]If you run a program linked with EFence from within GDB, rather than invoking it on the command line, the program will seg fault without dumping core. This is desirable because core files of executables linked to EFence can be quite large, and you don't need the core file anyway, because you'll already be inside of GDB and staring at the source code file and line number where the seg fault occurred.

^[30]That doesn't mean malloc() overruns won't wreak havoc with your code! This example is contrived to show how you use EFence. In a real program, writing beyond an array's bounds can cause some serious problems!

^[31]If you want to be really careful, read the "Word-Alignment and Overrun Detection" and "Instructions for Debugging Your Program" sections of the EFence man page.

^[32]This is undocumented on the authors' system. Thanks to Gianluca Insolvibile for reading the glibc sources and finding this option!