There are two main types of parallel programming architectures—shared memory and message passing.
The term shared memory means exactly that: Multiple CPUs all have access to some common physical memory. Code running on one CPU communicates with code running on the others by reading from and writing to this shared memory, much as threads in a multithreaded application communicate with one another through a shared address space. (Indeed, threaded programming has become the standard way to write application code for shared memory systems.)
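As a minimal sketch of this style of communication (the variable and function names here are ours, purely for illustration, and are not part of the example program below), two POSIX threads might cooperate simply by updating a single global variable:

#include <pthread.h>
#include <stdio.h>

int shared_total = 0;                          // visible to all threads
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *adder(void *arg)
{  int i;
   for (i = 0; i < 1000; i++) {
      pthread_mutex_lock(&lock);               // serialize access to shared_total
      shared_total++;
      pthread_mutex_unlock(&lock);
   }
   return NULL;
}

int main()
{  pthread_t t1,t2;
   pthread_create(&t1,NULL,adder,NULL);
   pthread_create(&t2,NULL,adder,NULL);
   pthread_join(t1,NULL);
   pthread_join(t2,NULL);
   printf("total = %d\n",shared_total);        // both threads wrote the same memory
   return 0;
}

Both threads read and write shared_total directly; no explicit data transfer is ever performed.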
By contrast, in a message passing environment, code running on each CPU can only access that CPU's local memory, and it communicates with the others by sending strings of bytes called messages over a communication medium. Typically this is some kind of network, running either a general-purpose protocol like TCP/IP or a specialized software infrastructure that is tailored to message-passing applications.
We will discuss message passing first, using the popular Message Passing Interface (MPI) package as an example. We use the MPICH implementation here, but the same principles apply to LAM and other MPI implementations.
Let us again consider a prime-number finding program:
 1   #include <mpi.h>
 2
 3   // MPI sample program; not intended to be efficient; finds and reports
 4   // the number of primes less than or equal to n
 5
 6   // Uses a pipeline approach: node 0 looks at all the odd numbers (i.e.,
 7   // we assume multiples of 2 are already filtered out) and filters out
 8   // those that are multiples of 3, passing the rest to node 1; node 1
 9   // filters out the multiples of 5, passing the rest to node 2; node 2
10   // filters out the rest of the composites and then reports the number
11   // of primes
12
13   // the command-line arguments are n and debugwait
14
15   #define PIPE_MSG 0   // type of message containing a number to
16                        // be checked
17   #define END_MSG 1    // type of message indicating no more data will
18                        // be coming
19
20   int nnodes,   // number of nodes in computation
21       n,        // find all primes from 2 to n
22       me;       // my node number
23
24   init(int argc,char **argv)
25   {  int debugwait;   // if 1, then loop around until the
26                       // debugger has been attached
27
28      MPI_Init(&argc,&argv);
29      n = atoi(argv[1]);
30      debugwait = atoi(argv[2]);
31
32      MPI_Comm_size(MPI_COMM_WORLD,&nnodes);
33      MPI_Comm_rank(MPI_COMM_WORLD,&me);
34
35      while (debugwait) ;
36   }
37
38   void node0()
39   {  int i,dummy,
40          tocheck;   // current number to check for passing on to next node
41      for (i = 1; i <= n/2; i++) {
42         tocheck = 2 * i + 1;
43         if (tocheck > n) break;
44         if (tocheck % 3 > 0)
45            MPI_Send(&tocheck,1,MPI_INT,1,PIPE_MSG,MPI_COMM_WORLD);
46      }
47      MPI_Send(&dummy,1,MPI_INT,1,END_MSG,MPI_COMM_WORLD);
48   }
49
50   void node1()
51   {  int tocheck,   // current number to check from node 0
52          dummy;
53      MPI_Status status;   // see below
54
55      while (1) {
56         MPI_Recv(&tocheck,1,MPI_INT,0,MPI_ANY_TAG,
57            MPI_COMM_WORLD,&status);
58         if (status.MPI_TAG == END_MSG) break;
59         if (tocheck % 5 > 0)
60            MPI_Send(&tocheck,1,MPI_INT,2,PIPE_MSG,MPI_COMM_WORLD);
61      }
62      // now send our end-of-data signal, which is conveyed in the
63      // message type, not the message itself
64      MPI_Send(&dummy,1,MPI_INT,2,END_MSG,MPI_COMM_WORLD);
65   }
66
67   void node2()
68   {  int tocheck,   // current number to check from node 1
69          primecount,i,iscomposite;
70      MPI_Status status;
71
72      primecount = 3;   // must account for the primes 2, 3 and 5, which
73                        // won't be detected below
74      while (1) {
75         MPI_Recv(&tocheck,1,MPI_INT,1,MPI_ANY_TAG,
76            MPI_COMM_WORLD,&status);
77         if (status.MPI_TAG == END_MSG) break;
78         iscomposite = 0;
79         for (i = 7; i*i <= tocheck; i += 2)
80            if (tocheck % i == 0) {
81               iscomposite = 1;
82               break;
83            }
84         if (!iscomposite) primecount++;
85      }
86      printf("number of primes = %d\n",primecount);
87   }
88
89   main(int argc,char **argv)
90   {  init(argc,argv);
91      switch (me) {
92         case 0: node0();
93            break;
94         case 1: node1();
95            break;
96         case 2: node2();
97      };
98      MPI_Finalize();
99   }
As explained in the comments at the beginning of the program, here our Sieve of Eratosthenes runs on three nodes of a parallel system and works in a pipelined manner. The first node starts with odd numbers and removes all multiples of 3, passing on the remaining values; the second node takes the output of the first and removes all multiples of 5; and the third node takes the output of the second, removes the rest of the nonprimes, and reports the number of primes that are left.
Here the pipelining is achieved by having each node pass one number at a time to the next. (Much greater efficiency could be attained by passing groups of numbers in each MPI message, thus reducing communications overhead.) When sending a number on to the next node, a node sends a message of type PIPE_MSG. When a node has no more numbers to send, it indicates this by sending a message of type END_MSG.
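For instance, a batched version of node0()'s send loop might look something like the following sketch (BATCHSIZE and outbuf are hypothetical names of ours, not part of the listing above):

// hypothetical batched variant of the send loop in node0(); collects
// candidates in outbuf and ships them in groups of up to BATCHSIZE
#define BATCHSIZE 100
void node0batched()
{  int i,tocheck,dummy,nbuffered = 0;
   int outbuf[BATCHSIZE];
   for (i = 1; 2*i+1 <= n; i++) {
      tocheck = 2*i + 1;
      if (tocheck % 3 > 0) {
         outbuf[nbuffered++] = tocheck;
         if (nbuffered == BATCHSIZE) {  // a full batch; send it as one message
            MPI_Send(outbuf,nbuffered,MPI_INT,1,PIPE_MSG,MPI_COMM_WORLD);
            nbuffered = 0;
         }
      }
   }
   if (nbuffered > 0)  // send whatever is left over
      MPI_Send(outbuf,nbuffered,MPI_INT,1,PIPE_MSG,MPI_COMM_WORLD);
   MPI_Send(&dummy,1,MPI_INT,1,END_MSG,MPI_COMM_WORLD);
}

On the receiving side, node1() would then call MPI_Get_count() on the returned status to learn how many numbers arrived in each message.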
As a debugging example here, suppose we forget to include the latter notification at the first node; that is, suppose we forget line 47 in the code for node0():

MPI_Send(&dummy,1,MPI_INT,1,END_MSG,MPI_COMM_WORLD);
The program will hang at the "downstream" nodes. Let's see how we can track down this bug. (Keep in mind that some line numbers in the GDB sessions below differ by 1 from those in the listing above; in the text we refer to the listing's numbering.)
You run an MPICH application program by invoking a script named mpirun on one node of the system. The script then starts the application program on each node, via SSH. Here we did this on a network of three machines, which we'll call Node 0, Node 1, and Node 2, with n equal to 100. The bug causes the program to hang at the latter two nodes. The program also hangs at the first node, because no instance of an MPI program will exit until all of them have executed the MPI_Finalize() function.
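For example, a typical launch of our three-node pipeline with n equal to 100 and no debug wait might look like this (the exact options accepted by mpirun vary somewhat among MPICH versions and installations):

$ mpirun -np 3 primepipe 100 0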
We would like to use GDB, but because we used mpirun to invoke the application at each of the three nodes, rather than running the instances directly ourselves, we cannot start them under GDB. However, GDB allows you to dynamically attach the debugger to an already-running process, given its process number. So let's run ps on Node 1 to determine the number of the process that is executing our application there:
$ ps ax
...
 2755 ?  S  0:00 tcsh -c /home/matloff/primepipe node 1 3
 2776 ?  S  0:00 /home/matloff/primepipe node1 32812 4
 2777 ?  S  0:00 /home/matloff/primepipe node1 32812 4
The MPI program is running as process 2776, so we attach GDB to the program at Node 1:
$ gdb primepipe 2776
...
0xffffe002 in ?? ()
This is not very informative! So, let's see where we are:
(gdb) bt
#0  0xffffe002 in ?? ()
#1  0x08074a76 in recv_message ()
#2  0x080748ad in p4_recv ()
#3  0x0807ab46 in MPID_CH_Check_incoming ()
#4  0x08076ae9 in MPID_RecvComplete ()
#5  0x0806765f in MPID_RecvDatatype ()
#6  0x0804a29f in PMPI_Recv ()
#7  0x08049ce8 in node1 () at PrimePipe.c:56
#8  0x08049e19 in main (argc=8, argv=0xbffffb24) at PrimePipe.c:94
#9  0x420156a4 in __libc_start_main () from /lib/tls/libc.so.6
We see from frame 7 that the program is hanging at line 56, waiting to receive from Node 0.
Next, it would be useful to know how much work has been done by the function running at Node 1, node1(). Has it just started, or is it almost done? We can gauge its progress by determining the last value processed for the variable tocheck:
(gdb) frame 7
#7  0x08049ce8 in node1 () at PrimePipe.c:56
56         MPI_Recv(&tocheck,1,MPI_INT,0,MPI_ANY_TAG,
(gdb) p tocheck
$1 = 97
This indicates that Node 1 is near the end of its execution, as 97 should be the last number that Node 0 passes to it for prime checking. So at this point we would be expecting a message from Node 0 of type END_MSG. The fact that the program is hanging suggests that Node 0 never sent such a message, which in turn leads us to check whether it did. In this manner, we would hopefully zero in quickly on the bug, the accidental omission of line 47.
By the way, keep in mind that when GDB is invoked with the command
$ gdb primepipe 2776
as we did above, GDB's command-line processing first checks for a core file named 2776. In the unlikely event that such a file exists, GDB will load it instead of attaching to the intended process. Alternatively, GDB also has an attach
command.
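For example, instead of supplying the process number on the command line, we could have started GDB on the executable alone and then attached from within the session:

$ gdb primepipe
(gdb) attach 2776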
In this example, the bug caused the program to hang. The approach to debugging a parallel program like this one is somewhat different when the symptom is incorrect output. Suppose, for example, that in line 72 we had incorrectly initialized primecount to 2 instead of 3. If we tried to follow the same debugging procedure, the programs running on the nodes would finish execution and exit too quickly for us to attach GDB. (True, we could use a very large value of n, but it is usually better to debug with small, simple cases first.) We need some device that makes the programs wait, giving us a chance to attach GDB. This is the purpose of line 35 in the init() function.
As can be seen in the source code, the value of debugwait is taken from the command line, with 1 meaning wait and 0 meaning no wait. If we specify 1 for debugwait, then when each invocation of the program reaches line 35, it stays there, spinning in the loop. This gives us time to attach GDB. We can then break out of the infinite loop and proceed to debug. Here is what we do at Node 0:
node1:~$ gdb primepipe 3124
...
0x08049c53 in init (argc=3, argv=0xbfffe2f4) at PrimePipe.c:34
34         while (debugwait) ;
(gdb) set debugwait = 0
(gdb) c
Continuing.
Ordinarily, we dread infinite loops, but here we deliberately set one up in order to facilitate debugging. We do the same thing at Node 1 and Node 2, and at the latter we also take the opportunity to set a breakpoint at line 77 before continuing:
[matloff@node3 ~]$ gdb primepipe 2944
34         while (debugwait) ;
(gdb) b 77
Breakpoint 1 at 0x8049d7d: file PrimePipe.c, line 77.
(gdb) set debugwait = 0
(gdb) c
Continuing.

Breakpoint 1, node2 () at PrimePipe.c:77
77         if (status.MPI_TAG == END_MSG) break;
(gdb) p tocheck
$1 = 7
(gdb) n
78         iscomposite = 0;
(gdb) n
79         for (i = 7; i*i <= tocheck; i += 2)
(gdb) n
84         if (!iscomposite) primecount++;
(gdb) n
75         MPI_Recv(&tocheck,1,MPI_INT,1,MPI_ANY_TAG,
(gdb) p primecount
$2 = 3
At this point, we notice that primecount should be 4, not 3 (the primes up through 7 are 2, 3, 5, and 7), and thus we have found the location of the bug.
Now, what about the shared-memory type of parallel programming? Here we have separate cases for true shared-memory machines and software-distributed shared-memory settings.
As mentioned earlier, in a true shared-memory environment, application programs are often developed using threads. The material in Debugging Threaded Code, on debugging threaded programs with GDB/DDD, then applies here as well.
OpenMP has become a popular programming environment on such machines. OpenMP supplies the programmer with high-level parallel programming constructs, which in turn make use of threads. The programmer still has thread-level access if needed, but the threaded implementation of the OpenMP directives is largely transparent.
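As a small, self-contained sketch of the kind of construct OpenMP provides (this example is ours, not part of the prime-finding program above), a single directive parallelizes a loop while the compiler and runtime manage the underlying threads:

#include <stdio.h>

int main()
{  int i,total = 0;
   // the directive divides the iterations among threads; the reduction
   // clause gives each thread a private copy of total and combines them
   #pragma omp parallel for reduction(+:total)
   for (i = 1; i <= 1000; i++)
      total += i;
   printf("total = %d\n",total);   // prints 500500
   return 0;
}

With GCC, for example, this is compiled with the -fopenmp flag; without that flag the directive is ignored and the loop simply runs serially.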
We present an extended example of debugging an OpenMP application in Section 5.4.
Prices of machines with dual-core CPUs are now within reach of ordinary consumers, but large-scale shared-memory systems with many processors still cost hundreds of thousands of dollars. A popular, inexpensive alternative is a network of workstations (NOW). In this setting, an underlying library gives application programs the illusion of shared memory. The library, which is largely transparent to the application programmer, engages in network transactions that keep the copies of shared variables at the different nodes consistent.
This approach is called software distributed shared memory (SDSM). The most widely used SDSM library is Treadmarks, developed and maintained by Rice University. Another excellent package is JIAJIA, available from the Chinese Academy of Sciences (http://www-users.cs.umn.edu/~tianhe/paper/dist.htm).
SDSM applications exhibit certain kinds of behavior that may baffle the unwary programmer. These are highly dependent on the particular system, so a general treatment cannot be given here, but we will briefly discuss a few issues common to many of them.
Many SDSMs are page based, meaning that they rely on the underlying virtual memory hardware at the nodes. The actions are complex, but we can give a quick overview. Consider a variable X that is to be shared among the NOW nodes. The programmer indicates this intention by making a call to the SDSM library, which in turn makes a Unix system call requesting that the OS replace its default seg fault handling with a function in the SDSM library for page faults involving the page containing X. The SDSM library sets things up so that only NOW nodes with valid copies of X have the corresponding memory page marked as resident. When X is accessed at some other node, a page fault results, and the underlying SDSM software fetches the current value from a node that has it.
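As a rough, generic sketch of that mechanism (this is not the code of Treadmarks, JIAJIA, or any other particular SDSM library, and the names here are ours), the library might install its own handler and later adjust page protections along these lines:

#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/mman.h>

// generic sketch only; a real SDSM library's fault handling is far more
// elaborate and differs from one system to another
static void sdsm_fault_handler(int sig, siginfo_t *info, void *ctx)
{
   void *addr = info->si_addr;   // the address whose page is not locally valid
   (void)sig; (void)ctx; (void)addr;
   // ... fetch the page's current contents from the node that owns it,
   // install them locally, then mark the page accessible so the faulting
   // instruction can be retried, e.g. with a call such as
   //    mprotect(page_base, pagesize, PROT_READ | PROT_WRITE);
}

void sdsm_init(void)
{
   struct sigaction sa;
   sigemptyset(&sa.sa_mask);
   sa.sa_flags = SA_SIGINFO;              // get fault details via siginfo_t
   sa.sa_sigaction = sdsm_fault_handler;  // replace the default seg fault action
   sigaction(SIGSEGV, &sa, NULL);
}

It is precisely faults fielded by a handler like this one that GDB notices during an SDSM debugging session.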
Again, it's not essential to know the precise workings of the SDSM system; the important thing is simply to understand that an underlying VM-based mechanism is being used to maintain consistency of local copies of shared data across the NOW nodes. If you don't, you will be mystified when you try to debug SDSM application code: the debugger will seem to stop for nonexistent seg faults, because the SDSM infrastructure deliberately generates seg faults, and a debugging tool running the application senses them. Once you realize this, there is no problem at all; in GDB, you merely issue a continue command to resume execution when one of these odd pauses occurs.
You may be tempted to order GDB not to stop or issue warning messages whenever any seg faults occur, using the GDB command
handle SIGSEGV nostop noprint
You should use this approach with caution, though, as it may cause you to miss genuine seg faults arising from bugs in the application program itself.
Yet another, related difficulty with debugging applications that run on page-based SDSMs arises as follows. If a node on the network changes the value of a shared variable, then any other node that needs that value must obtain the updated copy through a network transaction. Once again, the details of how this happens depend on the SDSM system, but the upshot is that if you are single-stepping through the code on one node, GDB may seem to mysteriously hang because that node is waiting for an update to its local copy of a variable recently modified by another node. If you happen to be running a separate GDB session to step through the code on that other node as well, the update will not reach the first node until the second node's debugging session progresses far enough. In other words, a programmer who is not alert and careful while debugging an SDSM application can create a deadlock through the debugging process itself.
In one sense, the SDSM situation is similar to the message-passing case: here, too, you need a variable like debugwait in the MPI example above, which allows you to pause the program at all nodes, giving you a chance to attach GDB at each node and step through the program from the beginning.