Debugging is a difficult task, and it becomes even more challenging when the misbehaving application is trying to coordinate multiple, simultaneous activities; client/server network programming, programming with threads, and parallel processing are examples of this paradigm. This chapter presents an overview of the most commonly used multiprogramming techniques and offers some tips on how to deal with bugs in these kinds of programs, focusing on the use of GDB/DDD/Eclipse in the debugging process.
Computer networks are extremely complex systems, and rigorous debugging of networked software applications can sometimes require the use of hardware monitors to collect detailed information about the network traffic. An entire book could be written on this debugging topic alone. Our goal here is to simply introduce the subject.
Our example consists of the following client/server pair. The client application allows a user to check the load on the machine on which the server application runs, even if the user does not have an account on the latter machine. The client sends a request for information to the server (here, a query about the load on the server's system, via the Unix w command) over a network connection. The server then processes the request and returns the results, capturing the output of w and sending it back over the connection. In general, a server can accept requests from multiple remote clients; to keep things simple in our example, let's assume there is only one instance of the client.
The code for the server is shown below:
     1  // srvr.c
     2
     3  // a server to remotely run the w command
     4  // user can check load on machine without login privileges
     5  // usage: svr
     6
     7  #include <stdio.h>
     8  #include <sys/types.h>
     9  #include <sys/socket.h>
    10  #include <netinet/in.h>
    11  #include <netdb.h>
    12  #include <fcntl.h>
    13  #include <string.h>
    14  #include <unistd.h>
    15  #include <stdlib.h>
    16
    17  #define WPORT 2000
    18  #define BUFSIZE 1000  // assumed sufficient here
    19
    20  int clntdesc,  // socket descriptor for individual client
    21      svrdesc;   // general socket descriptor for server
    22
    23  char outbuf[BUFSIZE];  // messages to client
    24
    25  void respond()
    26  {  int fd,nb;
    27
    28     memset(outbuf,0,sizeof(outbuf));  // clear buffer
    29     system("w > tmp.client");  // run 'w' and save results
    30     fd = open("tmp.client",O_RDONLY);
    31     nb = read(fd,outbuf,BUFSIZE);  // read the entire file
    32     write(clntdesc,outbuf,nb);  // write it to the client
    33     unlink("tmp.client");  // remove the file
    34     close(clntdesc);
    35  }
    36
    37  int main()
    38  {  struct sockaddr_in bindinfo;
    39
    40     // create socket to be used to accept connections
    41     svrdesc = socket(AF_INET,SOCK_STREAM,0);
    42     bindinfo.sin_family = AF_INET;
    43     bindinfo.sin_port = WPORT;
    44     bindinfo.sin_addr.s_addr = INADDR_ANY;
    45     bind(svrdesc,(struct sockaddr *) &bindinfo,sizeof(bindinfo));
    46
    47     // OK, listen in loop for client calls
    48     listen(svrdesc,5);
    49
    50     while (1) {
    51        // wait for a call
    52        clntdesc = accept(svrdesc,0,0);
    53        // process the command
    54        respond();
    55     }
    56  }
Here is the code for the client:
     1  // clnt.c
     2
     3  // usage: clnt server_machine
     4
     5  #include <stdio.h>
     6  #include <sys/types.h>
     7  #include <sys/socket.h>
     8  #include <netinet/in.h>
     9  #include <netdb.h>
    10  #include <string.h>
    11  #include <unistd.h>
    12
    13  #define WPORT 2000  // server port number
    14  #define BUFSIZE 1000
    15
    16  int main(int argc, char **argv)
    17  {  int sd,msgsize;
    18
    19     struct sockaddr_in addr;
    20     struct hostent *hostptr;
    21     char buf[BUFSIZE];
    22
    23     // create socket
    24     sd = socket(AF_INET,SOCK_STREAM,0);
    25     addr.sin_family = AF_INET;
    26     addr.sin_port = WPORT;
    27     hostptr = gethostbyname(argv[1]);
    28     memcpy(&addr.sin_addr.s_addr,hostptr->h_addr_list[0],hostptr->h_length);
    29
    30     // OK, now connect
    31     connect(sd,(struct sockaddr *) &addr,sizeof(addr));
    32
    33     // read and display response
    34     msgsize = read(sd,buf,BUFSIZE);
    35     if (msgsize > 0)
    36        write(1,buf,msgsize);
    37     printf("\n");
    38     return 0;
    39  }
For those unfamiliar with client/server programming, here is an overview of how the programs work:
On line 41 of the server code, you create a socket, which is an abstraction similar to a file descriptor; just as one uses a file descriptor to perform I/O operations on a filesystem object, one reads from and writes to a network connection via a socket. On line 45, the socket is bound to a specific port number, arbitrarily chosen to be 2000. (User-level applications such as this one are restricted to port numbers of 1024 and higher.) This number identifies a "mailbox" on the server's system to which clients send requests to be processed for this particular application.
The server "opens for business" by calling listen() on line 48. It then waits for a client request to come in by calling accept() on line 52. That call blocks until a request arrives. It then returns a new socket for communicating with the client. (When there are multiple clients, the original socket continues to accept new requests even while an existing request is being serviced, hence the need for separate sockets. This would require the server to service clients concurrently, for example in a threaded fashion.) The server processes the client request with the respond() function and sends the machine load information to the client by locally invoking the w command and writing the results to the socket on line 32.
The client creates a socket on line 24 and then uses it on line 31 to connect to the server's port 2000. On line 34, it reads the load information sent by the server and then prints it out.
Here is what the output of the client should look like:
    $ clnt laura.cs.ucdavis.edu
     13:00:15 up 13 days, 39 min,  7 users,  load average: 0.25, 0.13, 0.09
    USER     TTY      FROM     LOGIN@   IDLE    JCPU   PCPU  WHAT
    matloff  :0       -        14Jun07  ?xdm?   25:38  0.15s -/bin/tcsh -c /
    matloff  pts/1    :0.0     14Jun07  17:34   0.46s  0.46s -csh
    matloff  pts/2    :0.0     14Jun07  18:12   0.39s  0.39s -csh
    matloff  pts/3    :0.0     14Jun07  58.00s  2.18s  2.01s /usr/bin/mutt
    matloff  pts/4    :0.0     14Jun07  0.00s   1.85s  0.00s clnt laura.cs.u
    matloff  pts/5    :0.0     14Jun07  20.00s  1.88s  0.02s script
    matloff  pts/7    :0.0     19Jun07  4days   22:17  0.16s -csh
Now suppose the programmer had forgotten line 26 in the client code, which specifies the port on the server's system to connect to:
addr.sin_port = WPORT;
Let's pretend we don't know what the bug is and see how we might track it down.
The client's output would now be
    $ clnt laura.cs.ucdavis.edu
    $
It appears that the client received nothing at all back from the server. This of course could be due to a variety of causes in either the server or the client, or both.
Let's take a look around, using GDB. First, check to see that the client actually did succeed in connecting to the server. Set a breakpoint at the call to connect(), and run the program:
    (gdb) b 31
    Breakpoint 1 at 0x8048502: file clnt.c, line 31.
    (gdb) r laura.cs.ucdavis.edu
    Starting program: /fandrhome/matloff/public_html/matloff/public_html/Debug
    /Book/DDD/clnt laura.cs.ucdavis.edu

    Breakpoint 1, main (argc=2, argv=0xbf81a344) at clnt.c:31
    31         connect(sd,(struct sockaddr *) &addr,sizeof(addr));
Use GDB to execute the connect() and check the return value for an error condition:
    (gdb) p connect(sd,&addr,sizeof(addr))
    $1 = -1
It is indeed -1, the code for failure. That is a big hint. (Of course, as a matter of defensive programming, when we wrote the client code, we would have checked the return value of connect() and handled the case of failure to connect.)
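As a rough illustration of that defensive style, here is a minimal sketch of a wrapper around connect(). The name checked_connect() is our own invention and does not appear in clnt.c; the sketch assumes only that reporting the failure on stderr is acceptable.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

// Hypothetical helper (not part of clnt.c): call connect() and, on failure,
// report the reason on stderr. Returns connect()'s own result: 0 or -1.
int checked_connect(int sd, const struct sockaddr *addr, socklen_t len)
{
   int rc = connect(sd, addr, len);
   if (rc < 0) {
      int saved = errno;  // save errno before any call that might change it
      fprintf(stderr, "connect failed: %s (errno %d)\n",
              strerror(saved), saved);
      errno = saved;      // leave errno intact for the caller
   }
   return rc;
}
```

In the client, the call on line 31 could then be wrapped in a test such as if (checked_connect(sd,(struct sockaddr *) &addr,sizeof(addr)) < 0) return 1; so that a failed connection is announced instead of silently producing empty output.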
By the way, note that in manually executing the call to connect(), you have to remove the cast. With the cast retained, you'd get an error:
    (gdb) p connect(sd,(struct sockaddr *) &addr,sizeof(addr))
    No struct type named sockaddr.
This is due to a quirk in GDB, and it arises because we haven't used the struct elsewhere in the program.
Also note that if the connect() attempt had succeeded in the GDB session, you could not have then gone ahead and executed line 31. Calling connect() again on an already-connected socket is an error. You would have had to skip over line 31 and go directly to line 34. You could do this using GDB's jump command, issuing jump 34, but in general you should use this command with caution, as it might result in skipping some machine instructions that are needed further down in the code. So, if the connection attempt had succeeded, you would probably want to rerun the program.
Let's try to track down the cause of the failure by checking the argument addr in the call to connect():
    (gdb) p addr
    ...
    connect(3, {sa_family=AF_INET, sin_port=htons(1032),
    sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
    ...
Aha! The value htons(1032) indicates port 2052 (see below), not the 2000 we expect. This suggests that you either misspecified the port or forgot to specify it altogether. If you check, you'll quickly discover that the latter was the case.
Again, it would have been prudent to include a bit of machinery in the source code to help the debugging process, such as checking the return values of system calls. Another helpful step is inclusion of the line
#include <errno.h>
which, on our system, creates a global variable errno, whose value can be printed out from within the code or from within GDB:
    (gdb) p errno
    $1 = 111
From the file /usr/include/linux/errno.h, you find that this error number corresponds to ECONNREFUSED, the connection refused error.
However, the implementation of the errno library may differ from platform to platform. For example, the header file may have a different name, or errno may be implemented as a macro call instead of a variable.
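Because the numeric values can vary across platforms, code that reports errors is usually better off comparing against the symbolic constants than against a literal such as 111. The following is a minimal sketch of that idea; errno_name() and report_errno() are hypothetical helpers of our own, and only a handful of socket-related codes are covered.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: map a few socket-related errno values to their
// symbolic names. Codes not listed here yield "other".
const char *errno_name(int err)
{
   switch (err) {
      case ECONNREFUSED: return "ECONNREFUSED";
      case ETIMEDOUT:    return "ETIMEDOUT";
      case EHOSTUNREACH: return "EHOSTUNREACH";
      case ENETUNREACH:  return "ENETUNREACH";
      default:           return "other";
   }
}

// Print the symbolic name alongside the system's own message text.
void report_errno(const char *what, int err)
{
   fprintf(stderr, "%s: %s (%s)\n", what, errno_name(err), strerror(err));
}
```

A call such as report_errno("connect", errno) placed right after a failed connect() would name the error without requiring a trip to the header files.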
Another approach would be to use strace, which traces all system calls made by a program:
    $ strace clnt laura.cs
    ...
    connect(3, {sa_family=AF_INET, sin_port=htons(1032),
    sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
    ...
This gives you two important pieces of information for the price of one. First, you see right away that there was an ECONNREFUSED error. Second, you also see that the port was htons(1032), which has the value 2052. You can check this latter value by issuing a command like
(gdb) p htons(1032)
from within GDB, which shows the value to be 2052 rather than the expected 2000.
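The arithmetic behind that conversion is a simple byte swap: 1032 is 0x0408 in hexadecimal, and exchanging its two bytes gives 0x0804, which is 2052. The sketch below reproduces the swap by hand. The function swap16() is our own illustration, not a library routine, and it matches htons() only on a little-endian machine, which is what the strace output above suggests we are running on.

```c
#include <stdint.h>

// Swap the two bytes of a 16-bit value. On a little-endian host this is
// the same transformation that htons() applies; on a big-endian host
// htons() returns its argument unchanged.
uint16_t swap16(uint16_t v)
{
   return (uint16_t) ((v << 8) | (v >> 8));
}
```

Here swap16(1032) yields 2052, in agreement with GDB. The same swap applied to 2000 (0x07D0) yields 53255 (0xD007). Portable socket code normally wraps the port number in htons() before storing it in sin_port. Our listings store WPORT directly in both programs, so the client and server still agree with each other.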
You will find strace to be a handy tool in many contexts (networked and otherwise) for checking the results of system calls.
As another example, suppose that you accidentally omit the write to the client in the server code (line 32):
write(clntdesc,outbuf,nb); // write it to the client
In this case, the client program would hang, waiting for a reply that is not forthcoming. Of course, in this simpleminded example you'd immediately suspect a problem with the call to write() in the server and quickly find that we had forgotten it. But in more complex programs the cause may not be so obvious. In such cases, you would probably set up two simultaneous GDB sessions, one for the client and one for the server, stepping through both of the programs in tandem. You would find that at some point in their joint operation the client hangs, waiting to hear from the server, and thus obtain a clue to the likely location of the bug within the server. You'd then focus your attention on the server GDB session, trying to figure out why it did not send to the client at that point.
In really complex network debugging cases, the open source ethereal program (since renamed Wireshark) can be used to track individual TCP/IP packets.