Chapter 9. Troubleshooting

Throughout this book, the notion that Hadoop is a distributed system made up of layered services has been a repeating theme. The interaction between these layers is what makes a system like this so complex and so difficult to troubleshoot. With complex moving parts, interdependent systems, sensitivity to environmental conditions and external factors, and numerous potential causes for numerous potential conditions, Hadoop starts to look like the human body. We’ll treat it as such (with my apologies to the medical field as a whole).

Differential Diagnosis Applied to Systems

A significant portion of problems encountered with systems, in general, remain so because of improper diagnosis.^[21] You cannot fix something when you don’t know what’s truly wrong. The medical field commonly uses a differential diagnosis process as a way of investigating symptoms and their likelihood in order to properly diagnose a patient with a condition. Differential diagnoses, for those of us without an MD, is essentially a knowledge- and data-driven process of elimination whereby tests are performed to confirm or reject a potential condition. In fact, this isn’t as exotic a concept as it initially sounds when you think about how our brains attack these kinds of problems. It does, however, help to formalize such an approach and follow it, especially when things go wrong and your boss is standing over your shoulder asking you every five minutes whether it’s fixed yet. When things go wrong, take a long, deep, breath and put on your white coat.

Develop a patient history.
Each patient (a host, cluster, network) has a story to tell. Gather a history of what has occurred most recently to the various components of the system that may help in diagnosis. Recent maintenance operations, configuration changes, new hardware, and changes in load are all points of interest.
Develop a list of potential diagnoses.
Write them down on a whiteboard. Nothing is dumb, but each diagnosis should pass the sniff test. Nothing comes off the list without a test that disproves the diagnosis as a possibility.
Sort the list by likelihood.
Given what you know about the probability that a condition occurs in the wild, but also in the context of the patient and the environment, sort the list of diagnoses. The most common ailments should float to the top of the list; they’re common for a reason. Examine the patient history for anything that would change your mind about the probability something is the root cause of the problem. Hold tight to Occam’s razor, and when you hear hooves, look for horses, not zebras—unless you’re in an area where zebras are incredibly common.
Test!
Systematically work down the list, performing tests that either confirm or reject a given condition. Tests that eliminate multiple potential conditions should be performed sooner rather than later. Update the list if you find new information that would indicate you missed a potential condition. If you do update the list, go back to step 3 and repeat the process.
Diagnosis.
As you work through various tests, you’ll either find the problem or come out with no diagnosis. If you eliminate all possibilities, you’ve either missed a possible diagnosis in step 2 or incorrectly eliminated one based on a bad test (or misinterpretation of the results).

The tests performed when diagnosing systems are actually faster (and certainly cheaper) than what doctors need to do. The complete battery usually includes OS utilities such as top, vmstat, sar, iostat, and netstat, but also Hadoop-specific tools such as hadoop dfsadmin, and hadoop fsck. Log files are how machines communicate with us, so make sure you’re always listening to what they have to say.

All of this sounds silly—and we’re not necessarily saving lives. That said, the process of critical evaluation of data to arrive at an informed decision is nothing to laugh at. This method of looking at problems and solutions works, but it’s only as good as the information you have. We’ll use this approach to diagnosing problems encountered with Hadoop and point out some specific examples where it was used to find rather unintuitive problems in production systems.

Common Failures and Problems

There’s an abundance of things that can go wrong with Hadoop. Like other systems, it’s subject to the host environment on which its daemons run, but additionally so on the union of all hosts within the cluster. It’s this latter part that complicates any system to such a degree. All of the standard things that can happen on a single host are magnified when hosts and services become dependent upon one another. The management and configuration of any distributed system dramatically increases the universe of things that can go wrong, and Hadoop is no exception to that.

Just as doctors have the Diagnostic and Statistical Manual of Mental Disorders, or DSM, to describe and codify the known universe of disorders, administrators too need a set of known conditions and criteria for classification. What follows is a short list of some of the more prevalent conditions found when diagnosing Hadoop.

Humans (You)

More than anything, humans tend to cause the most havoc when it comes to the health of systems and machines. Even the most innocuous, mundane tasks can easily result in downtime. This isn’t specifically a Hadoop problem as much as it is a general system administration issue. I’ll put my money where my mouth is on this one and share a story with you.

I was once dispatched to a data center to fix a redundant loop that failed on a large SAN backing a production relational database. The system was up and running but had simply lost a redundant link; it was degraded. After all due diligence and planning, I left the office and went to the data center, laptop in hand, to fix the problem. Once in the data center, I carefully opened the racks, checked the physical state of everything, and confirmed that there was sufficient capacity on one of power distribution units in the racks with the storage before I plugged in my laptop…or so I thought. I plugged in the laptop and powered it on and immediately tripped a circuit in the rack. I panicked for a second until I realized that all power was redundant, fed by different circuits. Unfortunately, everything switching over at once must have caused a spike in load, or maybe some fans spun up, but within a few seconds, the redundant circuit popped as well and in the cabinet containing the SAN controller. Everything went quiet in the aisle, as if in observance of a moment of silence for what I had just done. Turns out, I just wasn’t careful enough when I read the PDU display on power consumption and someone else had run a couple power cables from the next rack over when installing some machines a few days earlier. We worked fast and fixed the problem, but we all learned a couple of lessons that day (especially me) about the real cause of failures.

In retrospect, it’s easy to see all of the things that went wrong and how silly the whole situation was. It was entirely preventable (in fact, nothing like it ever happened again). You can never be careful enough. Every system administrator has a horror story. What’s truly important is recognizing the likelihood of such a situation so that it takes a proper place in the diagnostic process.

Misconfiguration

Kathleen Ting, a manager on Cloudera’s support team gave a talk at HadoopWorld 2011 in New York City, supported by research done by Ari Rabkin, where she talked about common failures and their cause. She revealed that 35% of tickets handled by the Cloudera support team were due to some kind of misconfiguration, within either Hadoop or the operating system. Further, these tickets accounted for 40% of the time spent by the team resolving issues with customers. This is not to say that the users were necessarily twiddling configuration parameters randomly—in fact, many are advanced users of multiple projects in the Hadoop ecosystem—but that the interaction between the numerous parameters leads to all kinds of unintended behavior.

It’s possible to dismiss that argument and say that Hadoop is still young technology, that this situation will improve in time. Take the Oracle relational database, one of the most widely deployed relational database systems today. Many agree that with its myriad parameters, it can be incredibly difficult to configure optimally. In fact, it’s so specialized that it even has its own job title: Oracle Database Administrator. Now consider that Oracle, as a product, is about 32 years old. This kind of story is not unique (although maybe Oracle is an extreme example) in large, complex systems.

So what can be done to mitigate misconfiguration? Here are a few tips:

Develop a solid understanding of the precedence of configuration value overrides.
Start with the basic parameters required for operation, such as storage locations, hostnames, and resource controls. Make sure that the cluster works with the minimal necessary set before graduating to the performance- and security-related parameters.
Before setting anything, make sure you have a good idea about what it does.
Double-check the unit of each parameter you set. Many parameters are expressed in bytes; a few are in megabytes.
With each release, look for any parameters that changed in meaning or scope. Audit your configuration files frequently.
Use a configuration management and deployment system to ensure that all files are up-to-date on all hosts and that daemons are properly restarted to affect changes.
Before and after making configuration changes, run a subset of MapReduce jobs to evaluate the performance and resource consumption impact of the changes.

Beyond Hadoop proper, operating system misconfiguration is an equal source of pain in maintaining health clusters. Common problems include incorrect permissions on log and data directories and resource limits such as the maximum allowed number of simultaneously opened files being set to their defaults. This problem tends to happen when the initial setup and configuration of machines is not automated and the cluster is expanded. Clusters configured by hand run a significant risk of a single configuration step being accidentally missed. Of course, those who read through the recommendations in Chapters 4 and 5 should be able to dodge most of the fun and excitement of misconfiguration, but it happens to the best of us. Always have it on your list of potential culprits when generating your list of potential conditions.

Hardware Failure

As sure as the sun will rise, hardware will fail. It’s not unheard of to have problems with memory, motherboards, and disk controllers, but the shining gem of the problematic component kingdom is the hard drive. One of the few components with moving parts in the modern server, the hard drive suffers from physical wear and tear during normal operations. All drive manufacturers advertise one of a number of different measures of drive failure rates: mean time to failure (MTTR), mean time between failures (MTBF), or annualized failure rate (AFR). The way these numbers are calculated can be confusing, and it’s worth noting that they all apply to averages over a given period of time for a specific population, not the specific devices living in your machines. In other words, expect failures at all times, regardless of advertised metrics. Have spare components at the ready whenever possible, especially for mission-critical clusters.

Unfortunately, hardware rarely fails outright. Instead, it tends to degrade over time, leading to subtle, temporary failures that are compensated for by software components. Hadoop is excellent at masking, by way of compensation, impending hardware failures. HDFS will detect corrupt data blocks and automatically create new, correct replicas from other healthy copies of the data without human intervention. MapReduce automatically retries failed tasks, temporarily blacklists misbehaving hosts, and uses speculative execution to compensate for under-performing hardware. All of these features double as masking agents, so although this functionality is critical, it is not a panacea. You shouldn’t need to wake up in the middle of the night for a corrupt HDFS block, but you should always track the rate of anomalous conditions in an effort to root out bad hardware.

Resource Exhaustion

CPU cycles, memory, disk space and IO, and network bandwidth are all finite resources for which various processes contend in a cluster. Resource exhaustion can be seen as a specialized subclass of misconfiguration. After all, an administrator is responsible for controlling resource allocation by way of configuration. Either way, it does occur and it tends to be high on the list of things that go wrong in the wild.

Resource allocation can be seen as a hierarchy. A cluster contains many hosts, which contain various resources that are divided amongst any tasks that need to run. Orthogonally, both the Hadoop daemons and user tasks consume these resources. It’s possible that a disk running out of space causes a Hadoop daemon to fail or a task hits its maximum JVM heap size and is killed as a result; these are both equal examples of resource exhaustion. Because Hadoop accepts arbitrary code from users, it’s extremely difficult to know in advance what they might do. This is one of the reasons Hadoop has so many parameters to control and sandbox user tasks. The framework inherently does not trust user-supplied code and for good reason, as task failures due to bugs or job-level misconfiguration are extremely common.

You should measure and track task failures to help users identify and correct misbehaving processes. Repetitive task failures occupy task slots and take resources away from other jobs and should be seen as a drain on overall capacity. Conversely, starving the Hadoop daemons for resources is detrimental to all users and can negatively affect throughput and SLAs. Proper allocation of system resources to the framework and your users is just as critical in Hadoop as it is any other service in the data center.

Host Identification and Naming

Like resource exhaustion, host identification and naming must be explicitly called out as a special kind of misconfiguration. The way that a worker host identifies itself to the namenode and jobtracker is the same way clients will attempt to contact it, which leads to interesting types of failures where a datanode, because of a misconfigured /etc/hosts entry, reports its loopback device IP address to namenode and successfully heartbeats, only to create a situation in which clients can never communicate with it. These types of problems commonly occur at initial cluster setup time, but they can burn countless hours for administrators while they try and find the root cause of the problem. The Hadoop mailing lists, for instance, are full of problems that can ultimately be traced back to identification and name resolution problems.

Network Partitions

A network partition is (informally) described as any case in which hosts on one segment of a network cannot communicate with hosts on another segment of the network. Trivially, this can mean that host A on switch 1 cannot send messages to host B on switch 2. The reason why they cannot communicate is purposefully left undefined here because it usually doesn’t matter;^[22] a switch, cable, NIC, or even host failure of the recipient all look the same to the sender participating in the connection (or vice versa). More subtly, a delay in delivering messages from one host to another above a certain threshold is functionality identical to not delivering the message at all. In other words, if a host is unable to get a response within its acceptable timeout from another machine, this case is indistinguishable from its partner simply being completely unavailable. In many cases, that’s exactly how Hadoop will treat such a condition.

Network partitions are dubious in that they can just as easily be the symptom or the condition. For example, if one were to disable the switch port to which a host were connected, it would certainly be the root cause of a number of failures in communication between clients and the host in question. On the other hand, if the process that should receive messages from a client were to garbage collect for an excessive period of time, it might appear to the client as if the host became unavailable through some fault of the network. Further, short of absolute hardware failure, many network partitions are transient, leaving you with only symptoms remaining by the time you investigate. An historical view of the system, with proper collection of various metrics, is the only chance one has to distinguish these cases from one another.

“Is the Computer Plugged In?”

Let’s be honest: there’s no simple script to follow when performing tests to confirm or reject a diagnosis. There are different frameworks you can use, though, to make sure you cover all bases. It isn’t feasible to walk through a multihundred-line-long call-center-style script when you’re in the middle of a crisis. Again, in contrast to the medical field, doctors are taught silly little mnemonics to help them remember the various systems to consider when troubleshooting human beings. Given that they (arguably) have a lot more on the line than most administrators dealing with a problematic cluster, we’ll build on their wisdom and experience. Hang on—things are about to get a little cheesy.

E-SPORE

In the midst of all the action when a failure does occur, it’s important to follow some kind of process to make sure you don’t miss performing a critical test. Just as coming up with the complete list of potential causes of a failure is important, you must be able to correctly confirm or reject each using the appropriate tests. It’s not unusual that administrators forget to check something, and it’s not always clear where to start. When you’re at a loss or when you want to make sure you’ve thought of everything, try E-SPORE. E-SPORE is a mnemonic device to help you remember to examine each part of a distributed system while troubleshooting:

Environment: Look at what’s currently happening in the environment. Are there any glaring, obvious issues? Usually something has drawn your attention to a failure, such as a user filing a ticket or a monitoring system complaining about something. Is this unusual, given the history of the system? What is different about the environment now from the last time everything worked?
Stack: Consider the dependencies in the stack. The MapReduce service depends on HDFS being up, running, and healthy. Within each service lives another level of dependencies. All of HDFS depends on the namenode, which depends on its host OS, for example. There’s also the more specific dependencies within the services like the jobtracker’s dependency on the namenode for discovering block locations for scheduling, or an HBase region server’s dependency on the ZooKeeper quorum. The entire cluster also has shared dependency on data center infrastructure such as the network, DNS, and other services. If the jobtracker appears to be failing to schedule jobs, maybe it’s failing to communicate with the namenode.
Patterns: Look for a pattern in the failure. When MapReduce tasks begin to fail in a seemingly random way, look closer. Are the tasks from the same job? Are they all assigned to the same tasktracker? Do they all use a shared library that was changed recently? Patterns exist at various levels within the system. If you don’t see one within a failed job, zoom out to the larger ETL process, then the cluster.
Output: Hadoop communicates its ailments to us by way of its logs. Always check log output for exceptions. Sometimes the error is logically far from the original cause (or the symptom is not immediately indicative of the disease). For instance, you might see a Java NullPointerException that caused a task to fail, but that was only the side effect of something that happened earlier that was the real root cause of the error. In fact, in distributed systems like Hadoop, it’s not uncommon for the cause of the error to be on the other side of the network. If a datanode can’t connect to the namenode, have you tried looking in the namenode’s logs as well as the datanode?
Resources: All daemons need resources to operate, and as you saw earlier, resource exhaustion is far too common. Make sure local disks have enough capacity (and don’t forget about /var/log), the machine isn’t swapping, the network utilization looks normal, and the CPU utilization looks normal given what the machine is currently doing. This process extends to intra-process resources such as the occupied heap within a JVM and the time spent garbage collecting versus providing service.
Event correlation: When none of these steps reveal anything interesting, follow the series of events that led to the failure, which usually involves intermachine, interprocess event correlation. For example, did a switch fail that caused additional replication traffic due to lost datanodes, or did something happen to the datanodes that caused a switch to be overrun with replication traffic, causing its failure? Knowing the correct order of events and having this kind of visibility into the system can help you understand byzantine failures.

There’s nothing cheesier than a retrofit mnemonic device, but some form of regimented approach is necessary to properly evaluate the state of a system when troubleshooting. At each stage, there are some obvious test cases to be performed. Resource consumption is simple and relatively easy to understand, for instance, with tools such as df, du, sar, vmstat, iostat, and so forth. At the other end of the spectrum are more advanced techniques such as event correlation that require far more Hadoop-specific knowledge about the noteworthy events that occur within a system and some infrastructure to be able to extract and visualize that data in various ways. At its simplest, this can be cluster topology changes (nodes joining and leaving the cluster), rapid changes in the observed metric values (when standard deviation of the last N samples is greater than some threshold), and user actions such as job submission or HDFS file changes. Exactly which tool you use is less important than having an understanding of the type of data to look at and how.

Treatment and Care

Once the root cause of a problem has been identified, only then should a specific course of action be taken. Far too often, so-called shotgun debugging^[23] occurs in moments of crisis. This approach usually ends in a series of corrective actions being taken that aren’t necessarily needed, which increases the risk of additional disruption or damage to the system. You should absolutely avoid the urge to make changes until you know there’s a high likelihood you’ve found the root cause of a particular issue. It’s also important that detailed notes are kept about what actions are eventually taken and why. Always ask yourself if you or another member of the team would be able to solve the same problem if it happened again in a month.

Often there are multiple options for treatment, each with associated risk and side effects. Clearly the best scenario is one in which a problem is detected, diagnosed, and “cured” permanently. That, of course, is not always possible. Resolution can be classified into a number of different buckets:

Permanent eradication: The ideal case, in which a problem is simply solved. Usually, these are trivial, expected issues such as disk failures or an overly verbose log consuming disk space. An administrator fixes the issue, makes a note of what was done, and moves on with life. We should all be so lucky.
Mitigation by configuration: Some problems have no immediate solution and must be mitigated until the root cause can be stamped out. These tend to be slightly more complicated cases in which the risk of making a permanent change is either too risky or would take too long. A common instance of this is a user’s MapReduce job that repeatedly runs out of memory while processing a given dataset. Even if it’s decided that there is a better way of writing the job so as to not consume so much RAM, doing so at 3:00 in the morning without proper code review and testing probably isn’t the time to do so. It may, however, be possible to mitigate the problem by temporarily increasing the amount of memory allotted to MapReduce jobs until such time the code can be fixed.
Mitigation by architecture: Similar to mitigating a problem by changing configuration, it’s sometimes better to change the architecture of a particular system to solve a problem. One slightly more advanced instance of this being the case is in solving slot utilization and job runtime problems. Let’s assume that a MapReduce job processing a large dataset runs hourly, and every time it runs, it produces tens of thousands of tasks that would otherwise monopolize the cluster. As you learned in Chapter 7, we can use either the Fair or Capacity Scheduler to prevent the job from taking over. The problem is that the job still may not complete within a desired window of time because it takes a lot of time to churn through all the tasks. Rather than try to figure out how to handle so many tasks, we can try to mitigate the problem by looking at ways of reducing the number of tasks. We can’t very well magically reduce the size of the data, but we can look at how much data is processed by each task. Remember that in many cases, there’s roughly one map task per HDFS block of data. It’s possible that there are so many tasks because there are a large number of very small, one-block files. The solution to this problem is architectural and involves changing the way data is written to involve a smaller number of larger files. Look at the size of the input split processed by each map task in the job (contained in the task logs) and work with the development team to find a more efficient way of storing and processing the data.
Mitigation by process: Another way to look at the previous problem is to say that it’s not a code or architecture problem but a process problem. Maybe the job and its input data is as optimized as things are going to get and there’s no remaining room for improvement. Another possibility to consider is shifting the time at which the offending job executes. Maybe there’s simply a better time it can run. If it’s an ad hoc job, maybe there should be an upper bound on how much data a single job can process at once before it loses an SLA. Alternatively, it could be that a summarized or highly compressed version of the dataset should be built to facilitate the offending job, assuming that it’s so critical that it must run. Some of these options are as much architectural changes as they are about changing the process by which the cluster is managed.

Detecting, reacting to, and correcting an issue is generally seen as the problem-solving process. Unfortunately, this is a problem in and of itself because it misses the real opportunity: to learn from the failure. Incorporating new experiences can, and should, influence future behavior and ability. This approach means detecting related problems faster and reducing the time to resolution, of course, but even that isn’t the true win.

Imagine you’re on a bike going down an enormous hill without brakes. You lose control of the bike, crash, and wind up in the hospital. Don’t worry, though, because eventually you make a complete recovery. Months later, you find yourself at the top of a hill that looks remarkably similar to the one where you were injured. Worse, the bike has an odd familiarity to it as well. You begin to realize something. The uneasy feeling in your stomach is experience, and it’s screaming at you to not go down the hill. That is the true value of experience: preventative maintenance. Being able to recognize and prevent a similar situation from occurring is what saves us from the repetition of unpleasant situations. It’s true that you could buy a helmet or learn to roll when you hit the ground, but you could also not ride a bike with no brakes down a hill.

With administration of complex production systems, it’s not as simple as deciding to not go down the hill. By the time you’re on the bike at the hill, it’s too late. Instead, you need to take action long before you find yourself in such a situation. Action, in this context, means updating processes, code, and monitoring to incorporate what you’ve learned as a result of the incident. Taken in aggregate, most agree that preventative maintenance is cheaper (expressed in terms of code, time, money, and availability) than emergency maintenance and is well worth the investment. Of course, this line of reasoning can be taken too far, to the point at which overreaction can mean that heavyweight processes are put in place that impede productivity and flexibility. Like most things, preventative care can be overdone, so care should be exercised. Each change should be evaluated in the context of the likelihood that the given problem would repeat in the future, or if it were truly annomolous. In other words, do not build systems and processes around the most uncommon case you have encountered. Every decision should be a function of probability of occurrence and the associated impact when it does occur.

One incredibly useful tool in the preventative maintenance toolbox is the postmortem. It doesn’t need to be a formal process, beyond actually holding it, and it should be open to anyone who wishes to attend. Anyone involved in the incident should attend, and the person with the most knowledge of the situation should lead it. Start by summarizing the problem and the end result, and then walk through the timeline of events, calling out anything interesting along the way. If the incident involved multiple systems, have the person with the most knowledge of each system handle that part of the walkthrough. Allow attendees to ask questions along the way about the decisions that were made or comment on an approach that might have worked better. At the end, write a synopsis of what happened, why, what was done about it, and what you’re doing to help prevent it in the future. Everyone should ask questions about the likelihood of a repeat occurrence, the resultant damage (in terms of time, money, data loss, and so on), and any other pertinent points that would be useful to know in the future.

Worth noting is that the postmortem is commonly avoided because it tends to make people feel like they’re being singled out. This is a valid concern: no one likes to think they’ve done a bad job. But it’s also extremely unfortunate, as the postmortem is one of the best ways to prevent similar situations going forward. Walking through the timeline can help everyone, including anyone who maybe made a less-than-stellar decision along the way, understand where they can change their thinking and improve in the future. The postmortem is absolutely not, nor should it ever be permitted to be, a blame session. It’s critical that those participating be both respectful and cognizant of the fact that others are putting it out there for all the group to see, so to speak. Create an environment where making mistakes is forgivable, honesty is praised, and blame and derision are not tolerated, and you will have a stronger team and better systems as a result.

At the time of this writing, Amazon Web Services, a popular infrastructure as a service cloud provider, had a large power outage that impacted a large swath of EC2 users. One of these customers was Netflix, a rather public and large-scale customer of the service. Although the incident itself was short-lived (comparatively), Netflix had an extended outage due to the way some of their systems were built. In fact, one of the reasons they had such an outage was a fix they had made to their code as the result of a different failure they previously observed. The Netflix team rather bravely posted a fantastic example of a postmortem of the incident on their public tech blog.

War Stories

Talking about the theory and process of troubleshooting is useful, even necessary. However, it’s equally important to see how these techniques apply to real-world scenarios. What follows is a number of cases of real production problems with some depth on what it took to detect and resolve each case. In certain cases, details have been changed or omitted to protect the innocent, but the crux of the issue remains.

A Mystery Bottleneck

A cluster of a few hundred nodes running HDFS and MapReduce, responsible for streaming data ingest of all user activity from various web applications, presented with “intermittent slow HDFS write rates.” Most streams would write “fast” (later learned to be at a rate greater than 50MB per second), but occasionally, the write rate would drop for some but not all streams to less than 5MB per second. All nodes were connected via a near-line rate, nonblocking 1GbE network, with each 48-port top-of-rack switch connected to a core via four trunked 10GbE fibre links for a mild over-subscription rate of 1.2. Commonly used in large clusters, the core switch in use theoretically supported the necessary fabric bandwidth to not be a bottleneck. Each host in the cluster contained 8 JBOD-configured, 2TB SATA II drives, connected via a single SAS controller to a standard dual-socket motherboard sporting 48GB of memory.

What was interesting about the problem was the intermittent nature of the degraded write performance. This was a new cluster, which always brings a fair amount of suspicion, mostly because there’s no existing data with which to compare. However, the hardware itself was a known quantity and the intermittent nature immediately suggested an outlier somewhere in the network. Discussing the problem with the cluster administrator, it was revealed that it wasn’t actually that an entire HDFS write stream would be slow, but that within a write, the speed would appear to drop for a short period of time and then resolve itself. At a high level, it immediately sounded like a network problem, but where and why were the open questions.

Larger organizations, at which system and network administrators are in two separate groups, tend to suffer from an us and them problem, each claiming that various ephemeral issues are the fault of the other. Network administrators view the network as a service, happy to remain ignorant of the content of the packets that fly back and forth, and systems folks assuming that the network is a given. Things are further complicated by the opacity of the network. It’s not always clear what the physical topology is, or what’s shared versus dedicated bandwidth. Without a view of the network from the switch, it’s impossible to see what the overall traffic pattern looks like, so black box−style testing is usually necessary.

Once the hardware, configuration, and cluster history information was available, a list of possible problems was made. Obviously some kind of network issue could be causing this kind of behavior. It could be a global network problem in which all traffic was affected, possibly due to dropped packets and retransmission. Of course, this seemed unlikely, given the aggregate traffic that the cluster was handling with the given equipment. This could really be the case only if the core switch were shared with some other, extremely spiky application, and even then, the drop in transfer rate seemed to affect only some streams. It could be that it was local to a single top-of-rack switch. That would account for why only some streams were affected. We’d need to know more about which datanodes were involved to say whether it was contained on a single rack or if it spanned the whole network. As much as the network was possible, it could be a problem with a specific group of datanodes on the host side. A group of datanodes could be misconfigured somehow, causing any write that included one or more of those nodes in the replication pipeline to be affected. The rule around network partitions—that a machine failure and a network failure are indistinguishable from one another when it comes to message loss—could easily be generalized to a situation such as this, in which a pattern of failure could be a switch or a group of machines, and that a slow machine is the same as a slow segment of the network (where the segment could be a small as a single port). Either way, it seemed like the pattern was key. If we could find when and where the degraded performance occurred, only then would it even be possible to find out why.

Unfortunately, there was no per-host network monitoring in place, so it wasn’t a simple matter of looking at the distribution of traffic to machines. Even if it was available, it’s still difficult to distinguish artificially limited traffic from an underutilized machine in the cluster. Luckily, the issue was easy to observe and replicate. The cluster administrator pointed out that he could perform an HDFS transfer with the command hadoop -put some_file.log /user/joe and with some luck, hit the issue. Using the output of the hadoop fsck command, it’s possible to find all hosts involved in an HDFS write operation, which would be critical. Remember from Chapter 2 that when clients write to HDFS, a replication pipeline is formed for each block. This means that a different set of datanodes is selected for each block written to HDFS, which would easily explain the intermittent nature of the problem. Together, we constructed a test to help us find the pattern, shown in Example 9-1; we would run the command in a loop, writing to numbered files in HDFS, timing each iteration of the command, as there was a noticeable difference in speed when the issue occurred. Note that we were interested in the results for all commands rather than just slow commands, because if a node appeared in both a slow and a normal speed write, that would show us that it probably wasn’t a problem on that machine, or that the problem wasn’t a specific machine, but possibly a component within a machine, such as a single disk. If we looked at just the hosts involved in the slow iterations, we might incorrectly focus on a machine. It was also entirely possible that there would be no pattern here and the problem wasn’t related to a specific machine (but I admit I was pretty convinced).

To simplify the test, we used a file that was just under the block size. This meant there was a single block per file and a single triplet of datanodes as a result.

Example 9-1. Test script used to generate HDFS traffic

#!/bin/sh

opt_source_file="$1"
opt_dest="$2"
opt_iterations="$3"

[ -f "$opt_source_file" ] || {
  echo "Error: source file not provided or doesn't exist: $opt_source_file"
  exit 1
}

[ -n "$opt_dest" ] || {
  echo "Error: destination path not provided"
  exit 2
}

[ -n "$opt_iterations" ] || {
  echo "Error: number of iterations not provided"
  exit 3
}

filename=$(basename $opt_source_file)
i=0

while [ "$i" -lt "$opt_iterations" ] ; do
  echo "Writing $opt_dest/$filename.$i"
  time hadoop fs -put "$opt_source_file" "$opt_dest/$filename.$i"
  i=$(($i + 1))
done

We ran our test, writing the output to a file for later reference. The next step was to build a list of which datanodes participated in each replication pipeline. The information we used was produced by the hadoop fsck utility. The output from this specific incident isn’t available, but Example 9-2 gives you an idea of how the output can be used for this purpose. Note the options passed to the command and the full listing of each block and the three datanodes on which a replica resides.

Example 9-2. HDFS fsck output displaying block locations per file

[esammer@hadoop01 ~]$ hadoop fsck /user/esammer -files -blocks -locations
FSCK started by hdfs (auth:SIMPLE) from /10.0.0.191 for path ↵
  /user/esammer at Mon Jul 09 15:35:41 PDT 2012
/user/esammer <dir>
...
/user/esammer/1tb/part-00004 10995116200 bytes, 1 block(s):  OK
0. BP-875753299-10.1.1.132-1340735471649:blk_6272257110614848973_83827 ↵
  len=134217728 repl=3 [10.1.1.134:50010, 10.1.1.135:50010, 10.1.1.139:50010]
...

The fsck output was parsed, and the list of datanodes for each single-block file was sorted to make comparisons simpler, as in Example 9-3.

Example 9-3. Parsed fsck output with sorted datanodes

/user/esammer/1tb/part-00004  10.1.1.134:50010  10.1.1.135:50010  10.1.1.139:50010
/user/esammer/1tb/part-00005  10.1.1.135:50010  10.1.1.137:50010  10.1.1.140:50010
/user/esammer/1tb/part-00006  10.1.1.131:50010  10.1.1.133:50010  10.1.1.134:50010
...

Performing some basic counts of which datanodes were involved in slow files versus fast files, it became obvious that a single datanode was present in all slow writes and no normal speed writes. Direct scp file copies to the other two machines that appeared with the problematic host in each instance also ran at an expected rate, while the host in question was significantly slower. Using scp and bypassing Hadoop altogether eliminated it as a potential contributor to the problem. It was clearly the single host that was the issue.

All host-level checks looked normal. Resource consumption, hardware, and software health all looked normal on the host. Eventually, we convinced the network team to take one last look at the switch to which the host was connected. As it turned out, either a human being or a bad cable (we never got an answer) had caused the single switch port to negotiate at 100Mb, rather than 1Gb—a simple problem to fix.

This was a situation in which a lot of lessons were learned, and there was obviously room for improvement. Here are some of the take-aways this particular company had, as a result:

Always assume that the problem is yours. If each group had made such an assumption and spent a few extra minutes when performing initial troubleshooting, this problem would have been caught much, much sooner. Specifically, when the networking group confirmed that there was no issue, they checked only a subset of the total number of ports that made up the cluster, missing the misconfigured port. The same may have also been true on the host side of the link (we never actually saw this).
Hadoop is, as promised, really good at covering up odd degradations in hardware performance. The problem went unnoticed for some time before it was clear that it was even an issue.
Implement comprehensive monitoring. Having all monitoring in place, and under a single pane of glass, would have saved quite a bit of time writing ad hoc scripts to locate the problem.

There’s No Place Like 127.0.0.1

Mike (not his real name) was tasked with setting up a new Hadoop cluster. He used the standard golden CentOS image used for all other machines in the data center as a base from which to start, installed the software, and configured Hadoop as he had done before. Starting with HDFS, he formatted the namenode and proceeded to fire up the namenode daemon. As expected, the web user interface immediately responded, showing zero datanodes in the cluster. Next, he started a datanode process on the same machine that, within a few seconds, showed up in the user interface. Everything looked like it was on track.

The next step was to start each datanode, in turn, checking that the total available HDFS capacity grew with each new node that connected. At this point, some strange behavior was observed. No other datanodes seemed to be able to connect to the namenode. Mike started making a list of things that could cause this result:

It’s possible that the machine running the namenode was running a firewall that prevented other hosts from connecting to it. This firewall would explain why the local datanode could connect but no other process could.
It’s possible that the namenode was binding to the wrong IP address. We know that the namenode uses the value of fs.default.name is used to decide which IP to bind to. This case seemed unlikely because Mike used the correct hostname by which the machine should be addressed. (In our examples, we’ll use es-op-n1 as the namenode and es-op-n2 as a separate worker machine.)
The datanodes may have received incorrect configuration in fs.default.name, instructing them to communicate with the wrong machine. This case was even more unlikely because the same configuration files were deployed to all nodes.
Some network problem could have partitioned the namenode from the rest of the new workers. This possibility was easily rejected out of the gate because Mike was able to make ssh connections from one machine to the other.
The dfs.hosts and dfs.hosts.exclude lists could have been denying datanode registration. Again, this option was easily rejected because these files had not yet been configured and the default behavior of the namenode is to allow all datanodes to connect.

It was common for new CentOS images to have a default set of firewall rules configured, which seemed possible. Additionally, it was trivial to verify it, so Mike decided to knock it off the list right away:

[root@es-op-n1 ~]# iptables -nvL
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in  out source     destination
   34  2504 ACCEPT all  --  *   *   0.0.0.0/0  0.0.0.0/0 state RELATED,ESTABLISHED
    0     0 ACCEPT icmp --  *   *   0.0.0.0/0  0.0.0.0/0
    1   382 ACCEPT all  --  lo  *   0.0.0.0/0  0.0.0.0/0
    0     0 ACCEPT tcp  --  *   *   0.0.0.0/0  0.0.0.0/0 state NEW tcp dpt:22
    4   240 REJECT all  --  *   *   0.0.0.0/0  0.0.0.0/0 reject-with
                                                         icmp-host-prohibited

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in  out source     destination
    0     0 REJECT all  --  *   *   0.0.0.0/0  0.0.0.0/0 reject-with
                                                         icmp-host-prohibited

Chain OUTPUT (policy ACCEPT 24 packets, 2618 bytes)
 pkts bytes target prot opt in  out source     destination

Sure enough, the standard set of rules allowing only established and related connections, ICMP, loopback device, and SSH traffic were permitted. All other traffic was being rejected. Also, the packet counter column for the REJECT rule clearly showed that it was increasing at a steady rate. Without even checking the logs, Mike knew this wouldn’t work. He decided to temporarily disable iptables while working on the new configuration:

[root@es-op-n1 ~]# /etc/init.d/iptables stop
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@es-op-n1 conf]# iptables -nvL
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in  out source     destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in  out source     destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in  out source     destination

With that issue resolved, the user interface was checked again. After checking the datanode processes’ logs and confirming that they were still alive and retrying, Mike checked the namenode user interface again. Still, though, no new datanodes connected. It was obvious that this would have been a problem, but it wasn’t actually the sole issue. Something was still wrong, so Mike moved on to the next possibility in the list: binding to the wrong IP address. Again, this possibility was also easy to check using the netstat command:

root@es-op-n1 ~]# netstat -nlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address    Foreign Address   State       PID/Program name
tcp        0      0 127.0.0.1:8020   0.0.0.0:*         LISTEN      5611/java
tcp        0      0 0.0.0.0:50070    0.0.0.0:*         LISTEN      5611/java
tcp        0      0 0.0.0.0:22       0.0.0.0:*         LISTEN      4848/sshd
...

Process id 5611 had the two ports open he expected—the namenode RPC port (8020) and the web user interface (50070)—however, the RPC port was listening only on the loopback IP address 127.0.0.1. That was definitely going to be a problem. If the hostname specified by fs.default.name was, in fact, the proper hostname to use—after all, Mike had used it when creating an SSH connection to the machine—why was this happening? As a first step, Mike checked on what the machine thought its hostname was:

[root@es-op-n1 ~]# hostname
es-op-n1

Not the fully qualified name, but that should work just fine. Next, he checked what the data center DNS had to say about the hostname:

[root@es-op-n1 ~]# host -t A -v es-op-n1
Trying "es-op-n1.xyz.com"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36912
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;es-op-n1.xyz.com.    IN      A

;; ANSWER SECTION:
es-op-n1.xyz.com. 0   IN      A       10.20.194.222

Received 60 bytes from 10.20.76.73#53 in 0 ms

According to DNS, es-op-n1 lived at 10.20.194.222, which was correct. It was possible that something in /etc/hosts (which normally appears before DNS in the list of sources of name resolution) was to blame:

[root@es-op-n1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 es-op-n1

Aha! The name es-op-n1 was listed as an alias for localhost, which resolves to 127.0.0.1. That could easily cause the issue. Mike fixed this, after some grumbling about why anyone would configure a machine this way, restarted the namenode process, and checked netstat again:

[root@es-op-n1 ~]# netstat -nlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address        Foreign Address   State    PID/Program name
tcp        0      0 10.20.194.222:8020   0.0.0.0:*         LISTEN   6282/java
tcp        0      0 0.0.0.0:50070        0.0.0.0:*         LISTEN   6282/java
tcp        0      0 0.0.0.0:22           0.0.0.0:*         LISTEN   4848/sshd
...

Everything was looking much better now. Checking the namenode web user interface revealed two datanodes connected, which made sense: one from the local machine and the first worker that had been retrying all along. Just to make sure, Mike started copying some files into HDFS to confirm that everything worked properly before moving on:

[root@es-op-n1 ~]# sudo -u hdfs hadoop fs -put /etc/hosts /hosts1
[root@es-op-n1 ~]# sudo -u hdfs hadoop fs -put /etc/hosts /hosts2
[root@es-op-n1 ~]# sudo -u hdfs hadoop fs -put /etc/hosts /hosts3
[root@es-op-n1 ~]# sudo -u hdfs hadoop fs -ls /
Found 3 items
-rw-r--r--   3 hdfs supergroup        158 2012-07-10 15:17 /hosts1
-rw-r--r--   3 hdfs supergroup        158 2012-07-10 15:18 /hosts2
-rw-r--r--   3 hdfs supergroup        158 2012-07-10 15:18 /hosts3
[root@es-op-n1 ~]# sudo -u hdfs hadoop fs -rm /hosts\*
Deleted /hosts1
Deleted /hosts2
Deleted /hosts3

Once again, everything appeared to be working, and Mike was happy.

^[21] I have no empirical data to back this argument up. Anecdotally, however, it has been true in teams I’ve been a part of, managed, or talked to, and I’d put money on it being universally so.

^[22]Cloudera engineer Henry Robinson wrote a fantastic blog post covering this in the context of the CAP theorem.

^[23]Shotgun debugging describes an unfocused series of corrective actions being attempted in the hope that you hit the target. Like a shotgun, the lack of precision usually has significant unintended consequences.