Chapter 11. Troubleshooting

Troubleshooting is a topic that is near and dear to me. While there are many other areas of system administration that I enjoy, I don’t think anything compares to the excitement of tracking down the root cause of an obscure problem. Good troubleshooting is a combination of Sherlock Holmes–style detective work, intuition, and a little luck. You might even argue that some people have a knack for troubleshooting while others struggle with it, but in my mind it’s something that all sysadmins get better at the more problems they run into.

While this chapter discusses troubleshooting, there are a number of common problems that can cause your Ubuntu system to not boot or to run in an incomplete state. I have moved all of these topics into their own chapter on rescue and recovery and have provided specific steps to fix common problems with the Ubuntu rescue CD. So if you are trying to solve a problem at the moment, check Chapter 12, Rescue and Recovery, first to see if I have already outlined a solution. If not, come back here to get the more general steps to isolate the cause of your problem and work out its solution.

In this chapter I discuss some aspects of my general philosophy on troubleshooting that could be applied to a wide range of problems. Then I cover a few common problems that you might run into and introduce some tools and techniques to help solve them. By the end of the chapter you should have a head start the next time a problem turns up. After all, in many organizations downtime is measured in dollars, not minutes, so there is a lot to be said for someone who can find a root cause quickly.

General Troubleshooting Philosophy

While there are specific steps you can take to address certain computer problems, most troubleshooting techniques rely on the same set of rules. Here I discuss some of these rules that will help make you a better troubleshooter.

Divide the Problem Space

When I’m faced with an unknown issue, I apply the same techniques as when I have to pick a number between 1 and 100. If you have ever played this game, you know that most people fall into one of two categories: the random guessers and the narrowers. The random guessers might start by choosing 15, then hear that the number is higher and pick 23, then hear it is still higher. Eventually they might either luck into the right number or pick so many numbers that only the right number remains. In either case they use far more guesses than they need to. Many people approach troubleshooting the same way: They choose solutions randomly until one happens to work. Such a person might eventually find the problem, but it takes way longer than it should.

In contrast to the random guessers, the narrowers strategically choose numbers that narrow the problem in half each time. Let’s say the number is 80, for instance; their guesses would go as follows: 50, 75, 88, 82, 78, 80. With each guess, the list of numbers that could contain the answer is reduced by half. When people like this troubleshoot a computer problem, their time is spent finding ways to divide the problem space in half as much as possible. As I go through specific problems in this chapter, you will see this methodology in practice.

Favor Quick, Simple Tests over Slow, Complex Tests

What I mean here is that as you narrow down the possible causes of a problem, you will often end up with a few hypotheses that are equally likely, but some can be tested quickly while others take more time. For instance, if a machine can’t seem to communicate with the network, a quick test could be to see if the network cable is plugged in, while a longer test would involve more elaborate software tests on the host. If the quick test isolates the problem, you get the solution that much faster. If you still need to try the longer test, you aren’t out that much extra time.

Favor Past Solutions

Unless you absolutely prevent a problem from ever happening again, it’s likely that when a symptom that you’ve seen before pops up, it could have the same solution. Over the years you’ll find that you develop a common list of things you try first when you see a particular problem to rule out all of the common causes before you move on to more exotic hypotheses. Of course, you will have problems you’ve never seen before, too—that’s part of the fun of troubleshooting—but when you test some of your past solutions first, you will find you solve problems faster.

Good Communication Is Critical When Collaborating

If you are part of a team that is troubleshooting a problem, you absolutely must have good communication among team members. That could be as simple as yelling across cubicle walls, or it could mean setting up a chat room. A common problem when a team works an issue is multiple members testing the same hypothesis. With good communication each person can tackle a different hypothesis and report the results. These results can then lead to new hypotheses that can be divided among the team members. One final note: Favor communication methods that allow multiple people to communicate at the same time. This means that often chat rooms work much better than phones for problem solving, since over the phone everyone has to wait for a turn to speak; in a chat room multiple people can communicate at once.

Understand How Systems Work

The more deeply you understand how a system works, the faster you can rule out causes of problems. Over the years I’ve noticed that when a problem occurs, people first tend to blame the technology they understand the least. At one point in my career, every time a network problem occurred, everyone immediately blamed DNS, even when it appeared obvious (at least to me) that not only was DNS functioning correctly, it never had actually been the cause of any of the problems. One day we decided to hold a lecture to explain how DNS worked and traced an ordinary DNS request from the client to every DNS server and back. Afterward everyone who attended the class stopped jumping to DNS as the first cause of network problems. There are core technologies with which every sysadmin deals on a daily basis, such as TCP/IP networking, DNS, Linux processes, programming, and memory management; it is crucial that you learn about these in as much depth as possible if you want to find a solution to a problem quickly.

Document Your Problems and Solutions

Many organizations have as part of their standard practice a postmortem meeting after every production issue. A postmortem allows the team to document the troubleshooting steps they took to arrive at a root cause as well as what solution ultimately fixed the issue. Not only does this help make sure that there is no disagreement about what the root cause is, but when everyone is introduced to each troubleshooting step, it helps make all the team members better problem solvers going forward. When you document your problem-solving steps, you have a great guide you can go to the next time a similar problem crops up so it can be solved that much faster.

Use the Internet, but Carefully

The Internet is an incredibly valuable resource when you troubleshoot a problem, especially if you are able to articulate it in search terms. After all, you are rarely the only person to face a particular problem, and in many cases other people have already come up with the solution. Be careful with your Internet research, though. Often your results are only as good as your understanding of the problem. I’ve seen many people go off on completely wrong paths to solve a problem because of a potential solution they found on the Internet. After all, a search for “Ubuntu server not on network” will turn up all sorts of completely different problems irrelevant to your issue.

Resist Rebooting

OK, so those of us who have experience with Windows administration have learned over the years that when you have a weird problem, a reboot often fixes it. Resist this “technique” on your Ubuntu servers! I’ve had servers with uptimes measured in years because most problems found on a Linux machine can be solved without a reboot. The problem with rebooting a machine (besides ruining your uptime) is that if the problem does go away, you may never know what actually caused it. That means you can’t solve it for good and will ultimately see the problem again. As attractive as rebooting might be, keep it as your last resort.

Localhost Troubleshooting

While I would say that a majority of problems you will find on a server have some basis in networking, there is still a class of issues that involves only the localhost. What makes this tricky is that some local and networking problems often create the same set of symptoms, and in fact local problems can create network problems and vice versa. In this section I will cover problems that occur specifically on a host and leave issues that impact the network to the next section.

Host Is Sluggish or Unresponsive

Probably one of the most common problems you will find on a host is that it is sluggish or completely unresponsive. Often this can be caused by network issues, but here I will discuss some local troubleshooting tools you can use to tell the difference between a loaded network and a loaded machine.

When a machine is sluggish, it is often because you have consumed all of a particular resource on the system. The main resources are CPU, RAM, disk I/O, and network (which I will leave to the next section). Overuse of any of these resources can cause a system to bog down to the point that often the only recourse is your last resort—a reboot. If you can log in to the system, however, there are a number of tools you can use to identify the cause.

System Load

System load average is probably the fundamental metric you start from when troubleshooting a sluggish system. One of the first commands I run when I’m troubleshooting a slow system is uptime:

$ uptime
13:35:03 up 103 days, 8 min, 5 users, load average: 2.03, 20.17, 15.09

The three numbers after the load average, 2.03, 20.17, and 15.09, represent the 1-, 5-, and 15-minute load averages on the machine, respectively. A system load average is equal to the average number of processes in a runnable or uninterruptible state. Runnable processes are either currently using the CPU or waiting to do so, and uninterruptible processes are waiting for I/O. A single-CPU system with a load average of 1 means the single CPU is under constant load. If that single-CPU system has a load average of 4, there is 4 times the load on the system that it can handle, so three out of four processes are waiting for resources. The load average reported on a system is not tweaked based on the number of CPUs you have, so if you have a two-CPU system with a load average of 1, one of your two CPUs is loaded at all times—i.e., you are 50% loaded. So a load of 1 on a single-CPU system is the same as a load of 4 on a four-CPU system in terms of the amount of available resources used.
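
How you read these numbers also depends on how many CPUs the machine has, so before deciding whether a particular load is high, it is worth confirming the CPU count. One quick way to do that is to count the processor entries the kernel reports (the output shown here is simply what you would expect on a four-CPU machine):

$ grep -c processor /proc/cpuinfo
4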

The 1-, 5-, and 15-minute load averages describe the average amount of load over that respective period of time and are valuable when you try to determine the current state of a system. The 1-minute load average will give you a good sense of what is currently happening on a system, so in my previous example you can see that I most recently had a load of 2 over the last minute, but the load had spiked over the last 5 minutes to an average of 20. Over the last 15 minutes the load was an average of 15. This tells me that the machine had been under high load for at least 15 minutes and the load appeared to increase around 5 minutes ago, but it appears to have subsided. Let’s compare this with a completely different load average:

$ uptime
05:11:52 up 20 days, 55 min, 2 users, load average: 17.29, 0.12, 0.01

In this case both the 5- and 15-minute load averages are low, but the 1-minute load average is high, so I know that this spike in load is relatively recent. Often in this circumstance I will run uptime multiple times in a row (or use a tool like top, which I will discuss in a moment) to see whether the load is continuing to climb or is on its way back down.
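
If you would rather not rerun uptime by hand, the watch command will do it for you; the following redraws the load averages every five seconds until you hit Ctrl-C (the five-second interval is just an example):

$ watch -n 5 uptime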

What Is a High Load Average?

A fair question to ask is what counts as a high load average. The short answer is “It depends on what is causing it.” Since the load describes the average number of active processes that are using resources, a spike in load could mean a few things. What is important to determine is whether the load is CPU-bound (processes waiting on CPU resources), RAM-bound (specifically, high RAM usage that has moved into swap), or I/O-bound (processes fighting for disk or network I/O).

For instance, if you run an application that generates a high number of simultaneous threads at different points, and all of those threads are launched at once, you might see your load spike to 20, 40, or higher as they all compete for system resources. As they complete, the load might come right back down. In my experience systems seem to be more responsive when under CPU-bound load than when under I/O-bound load. I’ve seen systems with loads in the hundreds that were CPU-bound, and I could run diagnostic tools on those systems with pretty good response times. On the other hand, I’ve seen systems with relatively low I/O-bound loads on which just logging in took a minute, since the disk I/O was completely saturated. A system that runs out of RAM resources often appears to have I/O-bound load, since once the system starts using swap storage on the disk, it can consume disk resources and cause a downward spiral as processes slow to a halt.

top

One of the first tools I turn to when I need to diagnose high load is top. I discussed the basics of how to use the top command in Chapter 2, so here I focus more on how to use its output to diagnose load. The basic steps are to examine the top output to identify what resources you are running out of (CPU, RAM, disk I/O). Once you have figured that out, you can try to identify what processes are consuming those resources the most. First let’s examine some standard top output from a system:

top - 14:08:25 up 38 days,  8:02,  1 user,  load average: 1.70, 1.77, 1.68
Tasks: 107 total,   3 running, 104 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id,  0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   1024176k total,  997408k used,    26768k free,    85520k buffers
Swap:  1004052k total,    4360k used,   999692k free,   286040k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9463 mysql    16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
18749 nagios   16   0  140m 134m 1868 S   12  6.6   1345:01 nagios2db_status
24636 nagios   17   0 34660  10m  712 S    8  0.5   1195:15 nagios
22442 nagios   24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl     

The first line of output is the same as you would see from the uptime command. As you can see in this case, the machine isn’t too heavily loaded for a four-CPU machine:

top - 14:08:25 up 38 days,  8:02,  1 user,  load average: 1.70, 1.77, 1.68

top provides you with extra metrics beyond standard system load, though. For instance, the Cpu(s) line gives you information about what the CPUs are currently doing:

Cpu(s): 11.4%us, 29.6%sy,  0.0%ni, 58.3%id,  0.7%wa,  0.0%hi,  0.0%si, 0.0%st

These abbreviations may not mean much if you don’t know what they stand for, so I break down each of them next.

us: user CPU time

This is the percentage of CPU time spent running users’ processes that aren’t niced (nicing a process allows you to change its priority in relation to other processes).

sy: system CPU time

This is the percentage of CPU time spent running the kernel and kernel processes.

ni: nice CPU time

If you have user processes that have been niced, this metric will tell you the percentage of CPU time spent running them.

id: CPU idle time

This is one of the metrics that you want to be high. It represents the percentage of CPU time that is spent idle. If you have a sluggish system but this number is high, you know the cause isn’t high CPU load.

wa: I/O wait

This number represents the percentage of CPU time that is spent waiting for I/O. It is a particularly valuable metric when you are tracking down the cause of a sluggish system, because if this value is low, you can pretty safely rule out disk or network I/O as the cause.

hi: hardware interrupts

This is the percentage of CPU time spent servicing hardware interrupts.

si: software interrupts

This is the percentage of CPU time spent servicing software interrupts.

st: steal time

If you are running virtual machines, this metric will tell you the percentage of CPU time that was stolen from you for other tasks.

In my previous example, you can see that the system is over 50% idle, which matches a load of 1.70 on a four-CPU system. When I diagnose a slow system, one of the first values I look at is I/O wait so I can rule out disk I/O. If I/O wait is low, then I can look at the idle percentage. If I/O wait is high, then the next step is to diagnose what is causing high disk I/O, which I cover shortly. If I/O wait and idle times are low, then you will likely see a high user time percentage, so you must diagnose what is causing high user time. If the I/O wait is low and the idle percentage is high, you then know any sluggishness is not because of CPU resources and will have to start troubleshooting elsewhere. This might mean looking for network problems, or in the case of a Web server looking at slow queries to MySQL, for instance.

Diagnose High User Time

A common and relatively simple problem to diagnose is high load due to a high percentage of user CPU time. This is common since the services on your server are likely to take the bulk of the system load and they are user processes. If you see high user CPU time but low I/O wait times, you simply need to identify which processes on the system are consuming the most CPU. By default, top will sort all of the processes by their CPU usage:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9463 mysql     16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
18749 nagios    16   0  140m 134m 1868 S   12  6.6   1345:01 nagios2db_status
24636 nagios    17   0 34660  10m  712 S    8  0.5   1195:15 nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl     

In this example the mysqld process is consuming 53% of the CPU and the nagios2db_status process is consuming 12%. Note that this is the percentage of a single CPU, so if you have a four-CPU machine you could possibly see more than one process consuming 99% CPU.

The most common high-CPU-load situations you will see are all of the CPUs being consumed either by one or two processes or by a large number of processes. Either case is easy to identify, since in the first case the top process or two will have a very high percentage of CPU and the rest will be relatively low. In that case, to solve the issue you could simply kill the process that is using the CPU (press k and then type in the PID of the process).
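
If you would rather work outside of top, a ps one-liner gives you the same ranking; the following sorts every process by CPU usage and shows the top of the list, and the PID in the kill command is simply an example taken from the earlier top output:

$ ps -eo pid,user,pcpu,comm --sort=-pcpu | head
$ sudo kill 9463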

In the case of multiple processes, you might simply have a case of one system doing too many things. You might, for instance, have a large number of Apache processes running on a Web server along with some log parsing scripts that run from cron. All of these processes might be consuming more or less the same amount of CPU. The solution to problems like this can be trickier for the long term, as in the Web server example you do need all of those Apache processes to run, yet you might need the log parsing programs as well. In the short term you can kill (or possibly postpone) some processes until the load comes down, but in the long term you might need to consider increasing the resources on the machine or splitting some of the functions across more than one server.

Diagnose Out-of-Memory Issues

The next two lines in the top output provide valuable information about RAM usage. Before diagnosing specific system problems, it’s important to be able to rule out memory issues.

Mem:   1024176k total,   997408k used,    26768k free,    85520k buffers
Swap:  1004052k total,     4360k used,   999692k free,   286040k cached

The first line tells me how much physical RAM is available, used, free, and buffered. The second line gives me similar information about swap usage, along with how much RAM is used by the Linux file cache. At first glance it might look as if the system is almost out of RAM since the system reports that only 26,768k is free. A number of beginner sysadmins are misled by the used and free lines in the output because of the Linux file cache. Once Linux loads a file into RAM, it doesn’t necessarily remove it from RAM when a program is done with it. If there is RAM available, Linux will cache the file in RAM so that if a program accesses the file again, it can do so much more quickly. If the system does need RAM for active processes, it won’t cache as many files.

To find out how much RAM is really being used by processes, you must subtract the file cache from the used RAM. In the preceding example, out of the 997,408k RAM that is used, 286,040k is being used by the Linux file cache, so that means that only 711,368k is actually being used.

In my example the system still has plenty of available memory and is barely using any swap at all. Even if you do see some swap being used, it is not necessarily an indicator of a problem. If a process becomes idle, Linux will often page its memory to swap to free up RAM for other processes. A good way to tell whether you are running out of RAM is to look at the file cache. If your actual used memory minus the file cache is high, and the swap usage is also high, you probably do have a memory problem.
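
If you don’t want to do that arithmetic yourself, the free command does it for you. On Ubuntu releases of this vintage its output includes a -/+ buffers/cache line that shows used and free memory with the file cache already subtracted (newer versions of free present the same information in an “available” column):

$ free -m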

If you do find you have a memory problem, the next step is to identify which processes are consuming RAM. top sorts processes by their CPU usage by default, so you will want to change this to sort by RAM usage instead. To do this, keep top open and hit the M key on your keyboard. This will cause top to sort all of the processes on the page by their RAM usage:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18749 nagios    16   0  140m 134m 1868 S   12  6.6   1345:01 nagios2db_status
 9463 mysql     16   0  686m 111m 3328 S   53  5.5    569:17 mysqld
24636 nagios    17   0 34660  10m  712 S    8  0.5   1195:15 nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl  

Look at the %MEM column and see if the top processes are consuming a majority of the RAM. If you do find the processes that are causing high RAM usage, you can decide to kill them, or, depending on the program, you might need to perform specific troubleshooting to find out what is making that process use so much RAM.


Tip

top can actually sort its output by any of the columns. To change which column top sorts by, hit the F key to change to a screen where you can choose the sort column. After you hit a key that corresponds to a particular column (for instance, K for the CPU column), you can hit Enter to return to the main top screen.


OOM Killer

The Linux kernel also has an out-of-memory (OOM) killer that can kick in if the system runs dangerously low on RAM. When a system is almost out of RAM, the OOM killer will start killing processes. In some cases this might be the process that is consuming all of the RAM, but this isn’t guaranteed. I’ve seen the OOM killer end up killing programs like sshd or other processes instead of the real culprit. In many cases the system is unstable enough after one of these events that you find you have to reboot it to ensure that all of the system processes are running. If the OOM killer does kick in, you will see lines like the following in your /var/log/syslog:

1228419127.32453_1704.hostname:2,S:Out of Memory: Killed process
  21389 (java).
1228419127.32453_1710.hostname:2,S:Out of Memory: Killed process
  21389 (java).
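
The exact format of these messages varies between kernel versions, so rather than scanning the whole log by eye I usually just search for them. Either of the following (the syslog path assumes the default logging configuration) will pull out any recent OOM kills:

$ grep -i "out of memory" /var/log/syslog
$ dmesg | grep -i "killed process"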

Diagnose High I/O Wait

When I see high I/O wait, one of the first things I check is whether the machine is using a lot of swap. Since a hard drive is much slower than RAM, when a system runs out of RAM and starts using swap, the performance of almost any machine suffers. Anything that wants to access the disk has to compete with swap for disk I/O. So first diagnose whether you are out of memory and, if so, manage the problem there. If you do have plenty of RAM, you will need to figure out which program is consuming the most I/O.
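
A quick way to check for that kind of swap activity is vmstat. The si and so columns in its output show how much memory is being swapped in and out during each interval (two seconds in the example below; the interval is arbitrary), and numbers consistently above zero there mean swap is in play:

$ vmstat 2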

It can sometimes be difficult to figure out exactly which process is using the I/O, but if you have multiple partitions on your system, you can narrow it down by figuring out which partition most of the I/O is on. To do this you will need the iostat program, which is provided by the sysstat Ubuntu package, so type

$ sudo apt-get install sysstat

Preferably you will have this program installed before you need to diagnose an issue. Once the program is installed, you can run iostat without any arguments to see an overall glimpse of your system:

$ sudo iostat
Linux 2.6.24-19-server (hostname)  01/31/2009

avg-cpu:  %user  %nice %system %iowait  %steal   %idle
          5.73   0.07    2.03    0.53   0.00   91.64

Device:          tps   Blk_read/s  Blk_wrtn/s   Blk_read   Blk_wrtn
sda             9.82       417.96       27.53   30227262    1990625
sda1            6.55       219.10        7.12   15845129     515216
sda2            0.04         0.74        3.31      53506     239328
sda3            3.24       198.12       17.09   14328323    1236081

The first bit of output gives CPU information similar to what you would see in top. Below it are I/O stats on all of the disk devices on the system as well as their individual partitions. Here is what each of the columns represents:

tps

This lists the transfers per second to the device. “Transfers” is another way to say I/O requests sent to the device.

Blk_read/s

This is the number of blocks read from the device per second.

Blk_wrtn/s

This is the number of blocks written to the device per second.

Blk_read

In this column is the total number of blocks read from the device.

Blk_wrtn

In this column is the total number of blocks written to the device.

When you have a system under heavy I/O load, the first step is to look at each of the partitions and identify which partition is getting the heaviest I/O load. Say, for instance, that I have a database server and the database itself is stored on /dev/sda3. If I see that the bulk of the I/O is coming from there, I have a good clue that the database is likely consuming the I/O. Once you figure that out, the next step is to identify whether the I/O is mostly from reads or writes. Let’s say that I suspect that a backup job is causing the increase in I/O. Since the backup job is mostly concerned with reading files from the file system and writing them over the network to the backup server, I could possibly rule that out if I see that the bulk of the I/O is due to writes, not reads.


Note: Auto-Refresh iostat

You will probably have to run iostat more than one time to get an accurate sense of the current I/O on your system. If you specify a number on the command line as an argument, iostat will continue to run and give you new output after that many seconds. For instance, if I wanted to see iostat output every two seconds, I could type sudo iostat 2. Another useful argument to iostat if you have any NFS shares is -n. When you specify -n, iostat will give you I/O statistics about all of your NFS shares.


In addition to iostat, these days we have a much simpler tool available in Ubuntu called iotop. In effect it is a blend of top and iostat in that it shows you all of the running processes on the system sorted by their I/O statistics. The program isn’t installed by default but is provided by the iotop Ubuntu package, so type

$ sudo apt-get install iotop

Once the package is installed, you can run iotop as root and see output like the following:

$ sudo iotop
Total DISK READ: 189.52 K/s | Total DISK WRITE: 0.00 B/s
  TID  PRIO  USER      DISK READ   DISK WRITE  SWAPIN     IO>  COMMAND
 8169  be/4  root    189.52 K/s     0.00 B/s   0.00 %  0.00 %  rsync --server --se
 4243  be/4  kyle      0.00 B/s     3.79 K/s   0.00 %  0.00 %  cli /usr/lib/gnome-
 4244  be/4  kyle      0.00 B/s     3.79 K/s   0.00 %  0.00 %  cli /usr/lib/gnome-
    1  be/4  root      0.00 B/s     0.00 B/s   0.00 %  0.00 %  init

In this case, I can see that there is an rsync process tying up my read I/O.

Out of Disk Space

Another common problem system administrators run into is a system that has run out of free disk space. If your monitoring is set up to catch such a thing, you might already know which file system is out of space, but if not, then you can use the df tool to check:

$ sudo df -h

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  541M  7.0G   8% /
varrun                189M   40K  189M   1% /var/run
varlock               189M     0  189M   0% /var/lock
udev                  189M   44K  189M   1% /dev
devshm                189M     0  189M   0% /dev/shm
/dev/sda3              20G   15G  5.9G  71% /home

The df command lets you know how much space is used by each file system, but after you know that, you still need to figure out what is consuming all of that disk space. The similarly named du command is invaluable for this purpose. This command with the right arguments can scan through a file system and report how much disk space is consumed by each directory. If you pipe it to a sort command, you can then easily see which directories consume the most disk space. What I like to do is save the results in /tmp (if there’s enough free space, that is) so I can refer to the output multiple times and not have to rerun du. I affectionately call this the “duck command”:

$ cd /
$ sudo du -ckx | sort -n > /tmp/duck-root

This command won’t output anything to the screen but instead creates a sorted list of what directories consume the most space and outputs the list to /tmp/duck-root. If I then use tail on that file, I can see the top ten directories that use space:

$ sudo tail /tmp/duck-root
67872  /lib/modules/2.6.24-19-server
67876  /lib/modules
69092  /var/cache/apt
69448  /var/cache
76924  /usr/share
82832  /lib
124164 /usr
404168 /
404168 total

In this case I can see that /usr takes up the most space, followed by /lib, /usr/share, and then /var/cache. Note that the output separates out /var/cache/apt and /var/cache so I can tell that /var/cache/apt is the subdirectory that consumes the most space under /var/cache. Of course, I might have to open the duck-root file with a tool like less or a text editor so I can see more than the last ten directories.

So what can you do with this output? In some cases the directory that takes up the most space can’t be touched (as with /usr), but often when the free space disappears quickly it is because of log files growing out of control. If you do see /var/log consuming a large percentage of your disk, you could then go to the directory and type sudo ls -lS to list all of the files sorted by their size. At that point you could truncate (basically erase the contents of) a particular file:

$ sudo sh -c "> /var/log/messages"

Alternatively, if one of the large files has already been rotated (it ends in something like .1 or .2), you could either gzip it if it isn’t already gzipped, or you could simply delete it if you don’t need the log anymore.
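
If the space seems to be disappearing somewhere other than /var/log, a find command can hunt down individual large files anywhere on a file system. The example below sticks to the root file system and uses an arbitrary 100MB threshold, so adjust both to fit your situation:

$ sudo find / -xdev -type f -size +100M -exec ls -lh {} \;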


Note: Full / due to /tmp

I can’t count how many times I’ve been alerted about a full / file system (a dangerous situation that can often cause the system to freeze up) only to find out that it was caused by large files in /tmp. Specifically, these were large .swp files. When vim opens a file, it copies the entire contents into a .swp file. Certain versions of vim store this .swp file in /tmp, others in /var/tmp, and still others in ~/tmp. In any case, what had happened was that a particular user on the system decided to view an Apache log file that was gigabytes in size. When the user opened the file, it created a multigigabyte .swp file in /tmp and filled up the root file system. To solve the issue I had to locate and kill the offending vim process.


Out of Inodes

Another less common but tricky situation in which you might find yourself is the case of a file system that claims it is full, yet when you run df you see that there is more than enough space. If this ever happens to you, the first thing you should check is whether you have run out of inodes. When you format a file system, the mkfs tool decides at that point the maximum number of inodes to use as a function of the size of the partition. Each new file that is created on that file system gets its own unique inode, and once you run out of inodes, no new files can be created. Generally speaking, you never get close to that maximum; however, certain servers store millions of files on a particular file system, and in those cases you might hit the upper limit. The df -i command will give you information on your inode usage:

$ df -i
Filesystem       Inodes   IUsed   IFree IUse% Mounted on
/dev/sda         520192   17539  502653    4% /

In this example my root partition has 520,192 total inodes but only 17,539 are used. That means I can create another 502,653 files on that file system. In the case where 100% of your inodes are used, you have only a few options at your disposal: identify a large number of files you can delete or move to another file system, archive a group of files into a tar archive, or back up the files on your current file system, reformat it with a higher inode count, and copy the files back.
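
df -i tells you that you are out of inodes but not where all of those files live. A crude but effective way to find the directories responsible is simply to count files. The loop below walks the top-level directories of /home purely as an example; point it at whatever file system is actually full:

$ for d in /home/*; do echo -n "$d: "; sudo find "$d" -xdev | wc -l; done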

Network Troubleshooting

Most servers these days are attached to some sort of network and generally use the network to provide some sort of service. Many different problems can creep up on a network, so network troubleshooting skills become crucial for any system administrator. Linux provides a large set of network troubleshooting tools, and next I discuss a few common network problems along with how to use some of the tools available for Ubuntu to track down the root cause.

Server A Can’t Talk to Server B

Probably the most common network troubleshooting scenario involves one server being unable to communicate with another server on the network. I use an example in which a server named ubuntu1 can’t access the Web service (port 80) on a second server named web1. There are any number of different problems that could cause this, so I run step by step through tests you can perform to isolate the cause of the problem. Normally when troubleshooting a problem like this, I might skip a few of these initial steps (such as checking link), since tests further down the line will also rule them out. For instance, if I test and confirm that DNS works, I’ve proven that my host can communicate on the local network. For this guide, though, I walk through each intermediary step to illustrate how you might test each level.

Client or Server Problem

One quick test you can perform to narrow down the cause of your problem is to go to another host on the same network and try to access the server. In my example, I would find another server on the same network as ubuntu1, such as ubuntu2, and try to access web1. If ubuntu2 also can’t access web1, then I know the problem is more likely on web1 or on the network between ubuntu1’s subnet and web1. If ubuntu2 can access web1, then I know the problem is more likely on ubuntu1. To start, let’s assume that ubuntu2 can access web1, so we will focus our troubleshooting on ubuntu1.
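
When I run that quick test from ubuntu2, I don’t even need a Web browser; something as simple as the following (nc comes from the netcat package if it isn’t already installed) will tell me whether web1 answers on port 80:

$ nc -zv web1 80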

Is It Plugged In?

The first troubleshooting steps to perform are on the client. You first want to verify that your client’s connection to the network is healthy. To do this you can use the ethtool program (installed via the ethtool package) to verify that your link is up (that is, that the Ethernet device is physically connected to the network). If your Ethernet device were eth0, you would run:

$ sudo ethtool eth0
Settings for eth0:
     Supported ports: [ TP ]
     Supported link modes:   10baseT/Half 10baseT/Full
                             100baseT/Half 100baseT/Full
                             1000baseT/Half 1000baseT/Full
     Supports auto-negotiation: Yes
     Advertised link modes:  10baseT/Half 10baseT/Full
                             100baseT/Half 100baseT/Full
                             1000baseT/Half 1000baseT/Full
     Advertised auto-negotiation: Yes
     Speed: 100Mb/s
     Duplex: Full
     Port: Twisted Pair
     PHYAD: 0
     Transceiver: internal
     Auto-negotiation: on
     Supports Wake-on: pg
     Wake-on: d
     Current message level: 0x000000ff (255)
     Link detected: yes

Here on the final line you can see that Link detected is set to yes so ubuntu1 is physically connected to the network. If this were set to no you would need to physically inspect ubuntu1’s network connection and make sure it is connected. Since it is physically connected, I can move on.


Note: Slow Network Speeds

ethtool has uses beyond simply checking for link. It can also be used to diagnose and correct duplex issues. When a Linux server connects to a network, typically it autonegotiates with the network to see what speeds it can use and whether the network supports full duplex. The Speed and Duplex lines in the example ethtool output illustrate what a 100Mb/s, full duplex network should report. If you notice slow network speeds on a host, its speed and duplex settings are a good place to look. Run ethtool as in the preceding example, and if you notice Duplex set to Half, then run:

$ sudo ethtool -s eth0 autoneg off duplex full

Replace eth0 with your Ethernet device.


Is My Interface Up?

Once you have established that you are physically connected to the network, the next step is to confirm that the network interface is configured correctly on your host. The best way to check this is to run the ifconfig command with your interface as an argument, so to test eth0’s settings I would run

$ sudo ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:17:42:1f:18:be
          inet addr:10.1.1.7  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::217:42ff:fe1f:18be/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:229 (229.0 B)  TX bytes:2178 (2.1 KB)
          Interrupt:10

Probably the most important line in this output is the second line, which tells us our host has an IP address (10.1.1.7) and subnet mask (255.255.255.0) configured. Now whether these are the right settings for this host is something you will need to confirm. If the interface is not configured, try running sudo ifup eth0 and then run ifconfig again to see if the interface comes up. If the settings are wrong or the interface won’t come up, inspect /etc/network/interfaces. There you can correct any errors in the network settings. Now if the host gets its IP through DHCP, you will need to move your troubleshooting to the DHCP host to find out why you aren’t getting a lease.
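
For reference, if the host uses a static address, the relevant stanza in /etc/network/interfaces looks something like the following; the addresses here are simply the example ones used throughout this chapter, so substitute your own:

auto eth0
iface eth0 inet static
    address 10.1.1.7
    netmask 255.255.255.0
    gateway 10.1.1.1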

Is It on the Local Network?

Once you see that the interface is up, the next step is to see if a default gateway has been set and whether you can access it. The route command will display your current routing table, including your default gateway:

$ sudo route -n
Kernel IP routing table
Destination     Gateway      Genmask          Flags Metric Ref     Use Iface
10.1.1.0        *            255.255.255.0    U     0      0       0 eth0
default         10.1.1.1     0.0.0.0          UG    100    0       0 eth0

The line you are interested in is the last line that starts with default. Here you can see that my host has a gateway of 10.1.1.1. Note that I used the -n option with route so it wouldn’t try to resolve any of these IP addresses into hostnames. For one thing, the command runs more quickly, but more important, I don’t want to cloud my troubleshooting with any potential DNS errors. Now if you don’t see a default gateway configured here, and the host you want to reach is on a different subnet (say, web1, which is on 10.1.2.5), that is the likely cause of your problem. Either be sure to set the gateway in /etc/network/interfaces, or if you get your IP via DHCP, be sure it is set correctly on the DHCP server and then reset your interface with sudo service networking restart.

Once you have identified the gateway, use the ping command to confirm that you can communicate with the gateway:

$ ping -c 5 10.1.1.1
PING 10.1.1.1 (10.1.1.1) 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=1 ttl=64 time=3.13 ms
64 bytes from 10.1.1.1: icmp_seq=2 ttl=64 time=1.43 ms
64 bytes from 10.1.1.1: icmp_seq=3 ttl=64 time=1.79 ms
64 bytes from 10.1.1.1: icmp_seq=5 ttl=64 time=1.50 ms

--- 10.1.1.1 ping statistics ---
5 packets transmitted, 4 received, 20% packet loss, time 4020ms
rtt min/avg/max/mdev = 1.436/1.966/3.132/0.686 ms

As you can see, I was able to successfully ping the gateway, which means that I can at least communicate with the 10.1.1.0 network. If you couldn’t ping the gateway, it could mean a few things. It could mean that your gateway is blocking ICMP packets. If so, tell your network administrator that blocking ICMP is an annoying practice with negligible security benefits and then try to ping another Linux host on the same subnet. If ICMP isn’t being blocked, then it’s possible that the switch port on your host is set to the wrong VLAN, so you will need to further inspect the switch to which it is connected.

Is DNS Working?

Once you have confirmed that you can speak to the gateway, the next thing to test is whether DNS functions. The nslookup and dig tools both can be used to troubleshoot DNS issues, but since I need to perform only basic testing at this point, I just use nslookup to see if I can resolve web1 into an IP:

$ nslookup web1
Server:      10.1.1.3
Address:     10.1.1.3#53

Name:   web1.example.net
Address: 10.1.2.5

In this example DNS is working. The web1 host expands into web1.example.net and resolves to the address 10.1.2.5. Of course, make sure that this IP matches the IP that web1 is supposed to have! In this case DNS works, so we can move on to the next section; however, there are also a number of ways DNS could fail.

No Name Server Configured or Inaccessible Name Server

This is the most obvious error message you might get from nslookup when name servers can’t be reached:

$ nslookup web1
;; connection timed out; no servers could be reached

If you see this error, it could mean either you have no name servers configured for your host, or they are inaccessible. In either case you will need to inspect /etc/resolv.conf and see if any name servers are configured there. If you don’t see any IP addresses configured there, you will need to add a name server to the file. Otherwise, if you see something like

search example.net
nameserver 10.1.1.3

you now need to start troubleshooting your connection with your name server, starting off with ping. If you can’t ping the name server and its IP address is in the same subnet (in this case 10.1.1.3 is within my subnet), the name server itself could be completely down. If you can’t ping the name server and its IP address is in a different subnet, then skip ahead to the Can I Route to the Remote Host? section, only apply those troubleshooting steps to the name server’s IP. If you can ping the name server but it isn’t responding, skip ahead to the Is the Remote Port Open? section.

Missing Search Path or Name Server Problem

It is also possible that you will get the following error for your nslookup command:

$ nslookup web1
Server:      10.1.1.3
Address:     10.1.1.3#53

** server can't find web1: NXDOMAIN

Here you see that the server did respond, since it gave the response server can't find web1. This could mean two different things. One, it could mean that web1’s domain name is not in your DNS search path. This is set in /etc/resolv.conf in the line that begins with search. A good way to test this is to perform the same nslookup command, only use the fully qualified domain name (in this case web1.example.net). If it does resolve, then either always use the fully qualified domain name, or if you want to be able to use just the hostname, add the domain name to the search path in /etc/resolv.conf.

If even the fully qualified domain name doesn’t resolve, then the problem is on the name server. The complete method to troubleshoot all DNS issues is a bit beyond the scope of this chapter, but here are some basic pointers. If the name server is supposed to have that record, then that zone’s configuration needs to be examined. If it is a recursive name server, then you will have to test whether recursion is working on the name server by looking up some other domain. If you can look up other domains, then you must check whether the problem is on the remote name server that contains the zone.
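
dig makes those last two checks easy because it lets you query a particular name server directly. The first command below asks 10.1.1.3 (this chapter’s example name server) for the record itself, and the second tests recursion by asking it for an outside domain:

$ dig @10.1.1.3 web1.example.net
$ dig @10.1.1.3 www.ubuntu.com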

Can I Route to the Remote Host?

After you have ruled out DNS issues and see that web1 is resolved into its IP 10.1.2.5, you must test whether you can route to the remote host. Assuming ICMP is enabled on your network, one quick test might be to ping web1. If you can ping the host, you know your packets are being routed there and you can move to the next section, Is the Remote Port Open? If you can’t ping web1, try to identify another host on that network and see if you can ping it. If you can, then it’s possible web1 is down or blocking your requests, so move to the next section.

If you can’t ping any hosts on the network, packets aren’t being routed correctly. One of the best tools to test routing issues is traceroute. Once you provide traceroute a host, it will test each hop between you and the host. For example, a successful traceroute between ubuntu1 and web1 would look like the following:

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
 1  10.1.1.1 (10.1.1.1)  5.432 ms  5.206 ms  5.472 ms
 2  web1 (10.1.2.5)  8.039 ms  8.348 ms  8.643 ms

Here you can see that packets go from ubuntu1 to its gateway (10.1.1.1), and then the next hop is web1. This means it’s likely that 10.1.1.1 is the gateway for both subnets. On your network you might see a slightly different output if there are more routers between you and your host. If you can’t ping web1, your output would look more like the following:

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
 1  10.1.1.1 (10.1.1.1) 5.432 ms  5.206 ms  5.472 ms
 2  * * *
 3  * * *

Once you start seeing asterisks in your output, you know that the problem is on your gateway. You will need to go to that router and investigate why it can’t route packets between the two networks. If instead you see something more like

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
 1  10.1.1.1 (10.1.1.1)  5.432 ms 5.206 ms  5.472 ms
 1  10.1.1.1 (10.1.1.1)  3006.477 ms !H  3006.779 ms !H  3007.072 ms

then you know that the ping timed out at the gateway, so the host is likely down or inaccessible even from the same subnet. At this point if I hadn’t tried to access web1 from a machine on the same subnet as web1, I would try pings and other tests now.


Tip

If you have one of those annoying networks that block ICMP, don’t worry, you can still troubleshoot routing issues. You will just need to install the tcptraceroute package (sudo apt-get install tcptraceroute), then run the same commands as for traceroute, only substitute tcptraceroute for traceroute.


Is the Remote Port Open?

So you can route to the machine but you still can’t access the Web server on port 80. The next test is to see whether the port is even open. There are a number of different ways to do this. For one, you could try telnet:

$ telnet 10.1.2.5 80
Trying 10.1.2.5...
telnet: Unable to connect to remote host: Connection refused

If you see Connection refused, then either the port is down (likely Apache isn’t running on the remote host or isn’t listening on that port) or the firewall is blocking your access. If telnet can connect, then, well, you don’t have a networking problem at all. If the Web service isn’t working the way you suspected, you need to investigate your Apache configuration on web1. Instead of telnet, I prefer to use nmap to test ports because it can often detect firewalls for me. If nmap isn’t installed, run sudo apt-get install nmap to install it. To test web1 I would type the following:

$ nmap -p 80 10.1.2.5

Starting Nmap 4.62 ( http://nmap.org ) at 2009-02-05 18:49 PST
Interesting ports on web1 (10.1.2.5):
PORT   STATE  SERVICE
80/tcp filtered http

Aha! nmap is smart enough that it can often tell the difference between a port that is truly closed and one that is being blocked by a firewall. Now normally when a port is actually down, nmap will report it as closed. Here it reported it as filtered. What this tells me is that there is some firewall in the way that is dropping my packets to the floor. This means I need to investigate any firewall rules on my gateway (10.1.1.1) and on web1 itself to see if port 80 is being blocked.

Test the Remote Host Locally

At this point we have either been able to narrow the problem down to a network issue or we believe the problem is on the host itself. If we think the problem is on the host itself, there are a few things we can do to test whether port 80 is available.

Test for Listening Ports

One of the first things I would do on web1 is test whether port 80 is listening. The netstat -lnp command will list all ports that are listening along with the process that has the port open. I could just run that and parse through the output for anything that is listening on port 80, or I could use grep to show me only things listening on port 80:

$ sudo netstat -lnp | grep :80
tcp     0      0 0.0.0.0:80      0.0.0.0:*     LISTEN     919/apache

The first column tells you what protocol the port is using. The second and third columns are the receive and send queues (both set to 0 here). The column you want to pay attention to is the fourth column, as it lists the local address on which the host is listening. Here the 0.0.0.0:80 tells us that the host is listening on all of its IPs for port 80 traffic. If Apache were listening only on web1’s Ethernet address, I would see 10.1.2.5:80 here. The final column will tell you which process has the port open. Here I can see that Apache is running and listening. If you do not see this in your netstat output, you need to start your Apache server.

Firewall Rules

If the process is running and listening on port 80, it’s possible that web1 has some sort of firewall in place. Use the ufw command to list all of your firewall rules. If your firewall is disabled, your output would look like this:

$ sudo ufw status
Status: inactive

If your firewall is enabled but has no rules, it might look like this:

$ sudo ufw status
Status: active

It’s possible, though, that your firewall is set to deny all packets by default even if it doesn’t list any rules. A good way to test whether a firewall is in the way is to simply disable ufw temporarily if it is enabled and see if you can connect:

$ sudo ufw disable

On the other hand, if you had a firewall rule that blocked port 80, it might look like this:

$ sudo ufw status
Status: active

To          Action   From
--          ------   ----
80/tcp      DENY     Anywhere

Clearly in the latter case I would need to modify my firewall rules to allow port 80 traffic from my host. To find out more about firewall rules, review the Firewalls section of Chapter 6, Security.
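
As a quick example of the sort of change you would make, either of the following ufw commands should open the port; the second is a more restrictive variant that limits access to this chapter’s example subnet:

$ sudo ufw allow 80/tcp
$ sudo ufw allow proto tcp from 10.1.1.0/24 to any port 80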

Hardware Troubleshooting

For the most part you will probably spend your time troubleshooting host or network issues. After all, hardware is usually pretty obvious when it fails. A hard drive will completely crash; a CPU will likely take the entire system down. There are, however, a few circumstances when hardware doesn’t completely fail and as a result causes random strange behavior. Here I describe how to test a few hardware components for errors.

Network Card Errors

When a network card starts to fail, it can be rather unnerving as you will try all sorts of network troubleshooting steps to no real avail. Often when a network card or some other network component to which your host is connected starts to fail, you can see it in packet errors on your system. The ifconfig command we used for network troubleshooting before can also tell you about TX (transmit) or RX (receive) errors for a card. Here’s an example from a healthy card:

$ sudo ifconfig eth0
eth0     Link encap:Ethernet  HWaddr 00:17:42:1f:18:be
         inet addr:10.1.1.7  Bcast:10.1.1.255  Mask:255.255.255.0
         inet6 addr: fe80::217:42ff:fe1f:18be/64 Scope:Link
         UP BROADCAST MULTICAST  MTU:1500  Metric:1
         RX packets:1 errors:0 dropped:0 overruns:0 frame:0
         TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:229 (229.0 B)  TX bytes:2178 (2.1 KB)
         Interrupt:10

The lines you are most interested in are

RX packets:1 errors:0 dropped:0 overruns:0 frame:0
TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000

These lines will tell you about any errors on the device. If you start to see lots of errors here, then it’s worth troubleshooting your physical network components. It’s possible a network card, cable, or switch port is going bad.

Test Hard Drives

Of all of the hardware on your system, your hard drives are the components most likely to fail. Most hard drives these days support SMART, a system that can predict when a hard drive failure is imminent. To test your drives, first install the smartmontools package (sudo apt-get install smartmontools). Next, to test a particular drive’s health, pass the smartctl tool the -H option along with the device to scan. Here’s an example from a healthy drive:

$ sudo smartctl -H /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK

This can be useful when a particular drive is suspect, but generally speaking, it would be nice to have your drives’ health monitored constantly and any problems reported to you. The smartmontools package is already set up for this purpose. All you need to do is open the /etc/default/smartmontools file in a text editor and uncomment the line that says

#start_smartd=yes

so that it looks like

start_smartd=yes

Then the next time the system reboots, smartd will launch automatically. Any errors will be e-mailed to the root user on the system. If you want to manually start the service, you can type sudo service smartmontools start or sudo /etc/init.d/smartmontools start.
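
Beyond the quick health check, smartctl can also tell a drive to run its own self-test and report the results afterward. The device below is just the same example drive as before, and a long test can take an hour or more to finish:

$ sudo smartctl -t long /dev/sda
$ sudo smartctl -l selftest /dev/sda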

Test RAM

Some of the most irritating types of errors to troubleshoot are those caused by bad RAM. Often errors in RAM cause random mayhem on your machine, with programs crashing for no good reason or even random kernel panics. Ubuntu ships with an easy-to-use RAM testing tool called Memtest86+ that is not only installed by default but also available as a boot option. At boot time, hit the Esc key to see the full boot menu. One of the options in the GRUB menu is Memtest86+. Select that option and Memtest86+ will immediately launch and start scanning your RAM, as shown in Figure 11-1.

Figure 11-1 Memtest86+ RAM scan

Memtest86+ runs through a number of exhaustive tests that can identify different types of RAM errors. On the top right-hand side you can see which test is currently being run along with its progress, and in the Pass field you can see how far along you are with the complete test. A thorough memory test can take hours to run, and I know some administrators with questionable RAM who let the test run overnight or over multiple days if necessary to get more than one complete test through. If Memtest86+ does find any errors, they will be reported in the results output at the bottom of the screen.