Chapter 3. MapReduce

MapReduce refers to two distinct things: the programming model (covered here) and the specific implementation of the framework (covered later in Introducing Hadoop MapReduce). Designed to simplify the development of large-scale, distributed, fault-tolerant data processing applications, MapReduce is foremost a way of writing applications. In MapReduce, developers write jobs that consist primarily of a map function and a reduce function, and the framework handles the gory details of parallelizing the work, scheduling parts of the job on worker machines, monitoring for and recovering from failures, and so forth. Developers are shielded from having to implement complex and repetitious code and instead focus on algorithms and business logic. User-provided code is invoked by the framework rather than the other way around. This is much like Java application servers that invoke servlets upon receiving an HTTP request; the container is responsible for setup and teardown as well as for providing a runtime environment for user-supplied code. Just as servlet authors need not implement the low-level details of socket I/O, event handling loops, and complex thread coordination, MapReduce developers program to a well-defined, simple interface and the “container” does the heavy lifting.

The idea of MapReduce was defined in a paper written by two Google engineers in 2004, titled "MapReduce: Simplified Data Processing on Large Clusters" (J. Dean, S. Ghemawat). The paper describes both the programming model and (parts of) Google’s specific implementation of the framework. Hadoop MapReduce is an open source implementation of the model described in this paper and tracks the implementation closely.

Specifically developed to deal with large-scale workloads, MapReduce provides the following features:

Simplicity of development

MapReduce is dead simple for developers: no socket programming, no threading or fancy synchronization logic, no management of retries, no special techniques to deal with enormous amounts of data. Developers use functional programming concepts to build data processing applications that operate on one record at a time. Map functions operate on these records and produce intermediate key-value pairs. The reduce function then operates on the intermediate key-value pairs, processing all values that have the same key together and outputting the result. These primitives can be used to implement filtering, projection, grouping, aggregation, and other common data processing functions.

Scale

Since tasks do not communicate with one another explicitly and do not share state, they can execute in parallel and on separate machines. Additional machines can be added to the cluster, and applications immediately take advantage of the additional hardware with no changes at all. MapReduce is designed to be a shared-nothing system.

Automatic parallelization and distribution of work

Developers focus on the map and reduce functions that process individual records (where “record” is an abstract concept—it could be a line of a file or a row from a relational database) in a dataset. MapReduce does not prescribe how the dataset is stored, although, as we’ll see later, files on a distributed filesystem make an excellent pairing. The framework is responsible for splitting a MapReduce job into tasks, which are then executed on worker nodes or (less pleasantly) slaves.

Fault tolerance

Failure is not an exception; it’s the norm. MapReduce treats failure as a first-class citizen and supports reexecution of failed tasks on healthy worker nodes in the cluster. Should a worker node fail, all tasks are assumed to be lost, in which case they are simply rescheduled elsewhere. The unit of work is always the task, and it either completes successfully or it fails completely.
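
For the curious, the number of attempts the framework makes before giving up on a task entirely is configurable. In Hadoop MapReduce (covered in detail later in this chapter), the MRv1 properties involved are mapred.map.max.attempts and mapred.reduce.max.attempts; a mapred-site.xml snippet showing them with their stock default values:

<!-- Maximum attempts per map task before the task (and job) fails. -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>

<!-- Maximum attempts per reduce task before the task (and job) fails. -->
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>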

In MapReduce, users write a client application that submits one or more jobs to a cluster of machines. A job contains user-supplied map and reduce code along with configuration information that controls various aspects of its execution. The framework handles breaking the job into tasks, scheduling tasks to run on machines, monitoring each task’s health, and performing any necessary retries of failed tasks. A job processes an input dataset specified by the user and usually outputs one as well. Commonly, the input and output datasets are one or more files on a distributed filesystem. This is one of the ways in which Hadoop MapReduce and HDFS work together, but we’ll get into that later.

A MapReduce job is made up of four distinct stages, executed in order: client job submission, map task execution, shuffle and sort, and reduce task execution. Client applications can really be any type of application the developer desires, from command-line tools to services. The MapReduce framework provides a set of APIs for submitting jobs and interacting with the cluster. The job itself is made up of code written by a developer against the MapReduce APIs and the configuration which specifies things such as the input and output datasets.

As described earlier, the client application submits a job to the cluster using the framework APIs. A master process, called the jobtracker in Hadoop MapReduce, is responsible for accepting these submissions (more on the role of the jobtracker later). Job submission occurs over the network, so clients may be running on one of the cluster nodes or not; it doesn’t matter. The framework gets to decide how to split the input dataset into chunks, or input splits, of data that can be processed in parallel. In Hadoop MapReduce, the component that does this is called an input format, and Hadoop comes with a small library of them for common file formats. We’re not going to get too deep into the APIs of input formats or even MapReduce in this book. For that, check out Hadoop: The Definitive Guide by Tom White (O’Reilly).

In order to better illustrate how MapReduce works, we’ll use a simple application log processing example where we count all events of each severity within a window of time. If you’re allergic to writing or reading code, don’t worry. We’ll use just enough pseudocode for you to get the idea. Let’s assume we have 100 GB of logs in a directory in HDFS. A sample of log records might look something like this:

2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main] Application started!
2012-02-13 00:32:02-0800 [WARN - com.company.app1.Main] Something hinky↵
    is going down...
2012-02-13 00:32:19-0800 [INFO - com.company.app1.Main] False alarm. No worries.
...
2012-02-13 09:00:00-0800 [DEBUG - com.company.app1.Main] coffee units remaining:zero↵
    - triggering coffee time.
2012-02-13 09:00:00-0800 [INFO - com.company.app1.Main] Good morning. It's↵
    coffee time.

For each input split, a map task is created that runs the user-supplied map function on each record in the split. Map tasks are executed in parallel. This means each chunk of the input dataset is being processed at the same time by various machines that make up the cluster. It’s fine if there are more map tasks to execute than the cluster can handle. They’re simply queued and executed in whatever order the framework deems best. The map function takes a key-value pair as input and produces zero or more intermediate key-value pairs.

The input format is responsible for turning each record into its key-value pair representation. For now, trust that one of the built-in input formats will turn each line of the file into a value with the byte offset into the file provided as the key. Getting back to our example, we want to write a map function that will filter records for those within a specific timeframe, and then count all events of each severity. The map phase is where we’ll perform the filtering. We’ll output the severity and the number 1 for each record that we see with that severity.

function map(key, value) {
  // Example key: 12345 - the byte offset in the file (not really interesting).
  // Example value: 2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main]↵
  //   Application started!

  // Do the nasty record parsing to get dateTime, severity,
  // className, and message.
  (dateTime, severity, className, message) = parseRecord(value);

  // If the date is today...
  if (dateTime.date() == '2012-02-13') {
    // Emit the severity and the number 1 to say we saw one of these records.
    emit(severity, 1);
  }
}
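
For those who want more than pseudocode, here is a rough Java sketch of the same logic written against Hadoop’s classic org.apache.hadoop.mapred API (the class name is ours, and the daemons and APIs involved are discussed later in this chapter). The parsing is deliberately naive and assumes each record is a single line shaped like the samples above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SeverityCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text severity = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // key is the byte offset into the file (not really interesting);
    // value is one line of the log.
    String line = value.toString();

    // Naive parsing: the date is the first ten characters, and the
    // severity sits between '[' and the following space.
    int start = line.indexOf('[') + 1;
    int end = start > 0 ? line.indexOf(' ', start) : -1;
    if (line.length() < 10 || start <= 0 || end < 0) {
      return; // malformed record; a real job might count these instead
    }

    // If the date is today, emit the severity and the number 1.
    if ("2012-02-13".equals(line.substring(0, 10))) {
      severity.set(line.substring(start, end));
      output.collect(severity, ONE);
    }
  }
}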

Notice how we used an if statement to filter the data by date so that we got only the records we wanted. It’s just as easy to output multiple records in a loop. A map function can do just about whatever it wants with each record. Reducers, as we’ll see later, operate on the intermediate key-value data we output from the mapper.

Given the sample records earlier, our intermediate data would look as follows:

DEBUG, 1
INFO, 1
INFO, 1
INFO, 1
WARN, 1

A few interesting things are happening here. First, we see that the key INFO repeats, which makes sense because our sample contained three INFO records that matched the date 2012-02-13. It’s perfectly legal to output the same key or value multiple times. The other notable effect is that the output records are not in the order we would expect. In the original data, the first record was an INFO record, followed by a WARN, but that’s clearly not the case here. This is because the framework sorts the output of each map task by its key. As with outputting the value 1 for each record, the rationale behind sorting the data will become clear in a moment.

Further, each key is assigned to a partition using a component called the partitioner. In Hadoop MapReduce, the default partitioner implementation is a hash partitioner that takes a hash of the key, modulo the number of configured reducers in the job, to get a partition number. Because the hash implementation used by Hadoop ensures the hash of the key INFO is always the same on all machines, all INFO records are guaranteed to be placed in the same partition. The intermediate data isn’t physically partitioned, only logically so. For all intents and purposes, you can picture a partition number next to each record; it would be the same for all records with the same key. See Figure 3-1 for a high-level overview of the execution of the map phase.
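
The default behavior amounts to only a few lines of code. The sketch below (ours, simplified) mirrors what Hadoop’s stock hash partitioner does: mask off the sign bit of the key’s hash so the result is never negative, then take it modulo the number of reducers:

// A simplified rendition of default hash partitioning logic.
public class SimpleHashPartitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    // Equal keys hash identically on every machine, so all records
    // with the same key always land in the same partition.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}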

Ultimately, we want to run the user’s reduce function on this intermediate output data. First, however, the framework must fulfill a number of guarantees it makes to the developer about the reducers: most notably, that each reducer sees all values for a given key together, and that it receives its keys in sorted order.

The next phase of processing, called the shuffle and sort, is responsible for enforcing these guarantees. The shuffle and sort phase is actually performed by the reduce tasks before they run the user’s reduce function. When started, each reducer is assigned one of the partitions on which it should work. First, it copies the intermediate key-value data for its assigned partition from each worker. It’s possible that tens of thousands of map tasks have run on various machines throughout the cluster, each having output key-value pairs for each partition. The reducer assigned partition 1, for example, would need to fetch its piece of the partition data from potentially every other worker in the cluster. A logical view of the intermediate data across all machines in the cluster might look like this:

worker 1, partition 2, DEBUG, 1
worker 1, partition 1, INFO, 1
worker 2, partition 1, INFO, 1
worker 2, partition 1, INFO, 1
worker 3, partition 2, WARN, 1

Copying the intermediate data across the network can take a fair amount of time, depending on how much data there is. To minimize the total runtime of the job, the framework is permitted to begin copying intermediate data from completed map tasks as soon as they finish. Remember that the shuffle and sort is performed by the reduce tasks, each of which takes up resources in the cluster. We want to start the copy early enough that most of the intermediate data has been transferred before the final map task completes, but not so early that the reducers finish copying and then sit idle, occupying resources that could be used by other jobs’ reduce tasks. Knowing when to start the copy process can be tricky, and it’s largely a function of the available network bandwidth. See mapred.reduce.slowstart.completed.maps for information about how to configure when the copy is started.
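
For example, to hold off starting the copy until 80% of a job’s map tasks have completed, you could set the following in mapred-site.xml. The value 0.80 is purely illustrative, not a recommendation; the stock default is 0.05:

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>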

Once the reducer has received its data, it is left with many small bits of its partition, each of which is sorted by key. What we want is a single list of key-value pairs, still sorted by key, so that all values for each key are together. The easiest way to accomplish this is with a merge sort: given a number of already sorted lists, a merge sort combines them into a single fully sorted list using a minimal amount of memory.
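
The merge itself is simple enough to sketch. The following toy example (ours, not Hadoop’s actual merge code) keeps only the head element of each sorted run in a priority queue and repeatedly emits the smallest:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

  /** Merges several individually sorted runs into one sorted list. */
  public static <T extends Comparable<T>> List<T> merge(List<List<T>> runs) {
    // Seed the heap with the first element of each non-empty run.
    PriorityQueue<RunHead<T>> heap = new PriorityQueue<>();
    for (List<T> run : runs) {
      Iterator<T> it = run.iterator();
      if (it.hasNext()) {
        heap.add(new RunHead<>(it.next(), it));
      }
    }

    List<T> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      RunHead<T> head = heap.poll(); // smallest remaining element
      merged.add(head.value);
      if (head.rest.hasNext()) {     // advance that run and re-add it
        heap.add(new RunHead<>(head.rest.next(), head.rest));
      }
    }
    return merged;
  }

  // Pairs a run's current head element with the rest of that run.
  private static class RunHead<T extends Comparable<T>>
      implements Comparable<RunHead<T>> {
    final T value;
    final Iterator<T> rest;

    RunHead(T value, Iterator<T> rest) {
      this.value = value;
      this.rest = rest;
    }

    public int compareTo(RunHead<T> other) {
      return value.compareTo(other.value);
    }
  }
}

With the partition data now combined into a single, fully sorted list, the user’s reducer code can be executed: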

# Logical data input to the reducer assigned partition 1:
INFO, [ 1, 1, 1 ]

# Logical data input to the reducer assigned partition 2:
DEBUG, [ 1 ]
WARN, [ 1 ]

The reducer code in our example is hopefully clear at this point:

function reduce(key, values) { // values is an iterator over the key's values
  // Initialize a total event count.
  totalEvents = 0;

  // For each value (a number one)...
  foreach (value in values) {
    // Add the number one to the total.
    totalEvents += value;
  }

  // Emit the severity (the key) and the total events we saw.
  // Example key: INFO
  // Example value: 3
  emit(key, totalEvents);
}
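
A Java sketch of the same logic against the classic API looks much the same (again, the class name is ours). The framework hands the function each key along with an iterator over all of that key’s values:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SeverityCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int totalEvents = 0;
    while (values.hasNext()) {
      totalEvents += values.next().get(); // each value is the number 1
    }
    // Emit the severity (the key) and the total events we saw.
    output.collect(key, new IntWritable(totalEvents));
  }
}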

Each reducer produces a separate output file, usually in HDFS (see Figure 3-2). Separate files are written so that reducers do not have to coordinate access to a shared file. This greatly reduces complexity and lets each reducer run at whatever speed it can. The format of the file depends on the output format specified by the author of the MapReduce job in the job configuration. Unless the job does something special (and most don’t), each reducer output file is named part-<XXXXX>, where <XXXXX> is the number of the reduce task within the job, starting from zero. Sample reducer output for our example job would look as follows:

# Reducer for partition 1:
INFO, 3

# Reducer for partition 2:
DEBUG, 1
WARN, 1
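
Putting it all together, a minimal driver that wires up the mapper and reducer sketched earlier might look like the following. This is a sketch against the classic API, with the class name and the input and output paths invented for the example:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class SeverityCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SeverityCountDriver.class);
    conf.setJobName("severity-count");

    conf.setMapperClass(SeverityCountMapper.class);
    conf.setReducerClass(SeverityCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // TextInputFormat presents each line as a (byte offset, line) pair;
    // TextOutputFormat writes one "key<TAB>value" line per record.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("/logs/app1"));
    FileOutputFormat.setOutputPath(conf, new Path("/reports/severity-counts"));

    JobClient.runJob(conf); // submit and block until the job completes
  }
}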

For those who are familiar with SQL and relational databases, we could view the logs as a table with the following schema:

CREATE TABLE logs (
  EVENT_DATE DATE,
  SEVERITY   VARCHAR(8),
  SOURCE     VARCHAR(128),
  MESSAGE    VARCHAR(1024)
)

We would, of course, have to parse the data to get it into a table with this schema, but that’s beside the point. (In fact, the ability to deal with semi-structured data as well as act as a data processing engine are two of Hadoop’s biggest benefits.) To produce the same output, we would use the following SQL statement. In the interest of readability, we’re ignoring the fact that this doesn’t yield identically formatted output; the data is the same.

SELECT SEVERITY, COUNT(*)
  FROM logs
  WHERE EVENT_DATE = '2012-02-13'
  GROUP BY SEVERITY
  ORDER BY SEVERITY

As exciting as all of this is, MapReduce is not a silver bullet. It is just as important to know how MapReduce works and what it’s good for, as it is to understand why MapReduce is not going to end world hunger or serve you breakfast in bed.

MapReduce is a batch data processing system

The design of MapReduce assumes that jobs will run on the order of minutes, if not hours. It is optimized for full table scan style operations. Consequently, it underwhelms when attempting to mimic low-latency, random access patterns found in traditional online transaction processing (OLTP) systems. MapReduce is not a relational database killer, nor does it purport to be.

MapReduce is overly simplistic

One of its greatest features is also one of its biggest drawbacks: MapReduce is simple. In cases where developers know something special about the data and want to make certain optimizations, they may find the model limiting. This usually manifests as complaints that, while the job is faster in terms of wall clock time, it’s far less efficient in MapReduce than in other systems. This can be very true. Some have said MapReduce is like a sledgehammer driving a nail; in some cases, it’s more like a wrecking ball.

MapReduce is too low-level

Compared to higher-level data processing languages (notably SQL), MapReduce seems extremely low-level. Certainly for basic query-like functionality, no one wants to write map and reduce functions. Higher-level languages built atop MapReduce exist to simplify life, and unless you truly need the ability to touch terabytes (or more) of raw data, raw MapReduce can be overkill.

Not all algorithms can be parallelized

There are entire classes of problems that cannot easily be parallelized. The act of training a model in machine learning, for instance, cannot be parallelized for many types of models. This is true of many algorithms where there is shared state or dependent variables that must be maintained and updated centrally. Sometimes it’s possible to restructure problems that are traditionally solved using shared state so that they fit the MapReduce model, but at the cost of efficiency (shortest-path-finding algorithms in graph processing are excellent examples of this). Other times, while this is possible, it may not be ideal for a host of reasons. Knowing how to identify these kinds of problems and create alternative solutions is far beyond the scope of this book and an art in its own right. This is the same problem as the “mythical man-month,” but it is most succinctly expressed by stating, “If one woman can have a baby in nine months, nine women should be able to have a baby in one month,” which, in case it wasn’t clear, is decidedly false.

Hadoop MapReduce is a specific implementation of the MapReduce programming model, and the computation component of the Apache Hadoop project. The combination of HDFS and MapReduce is incredibly powerful, in much the same way that Google’s GFS and MapReduce complement each other. Hadoop MapReduce is inherently aware of HDFS and can use the namenode during the scheduling of tasks to decide the best placement of map tasks with respect to machines where there is a local copy of the data. This avoids a significant amount of network overhead during processing, as workers do not need to copy data over the network to access it, and it removes one of the primary bottlenecks when processing huge amounts of data.

Hadoop MapReduce is similar to traditional distributed computing systems in that there is a framework and there is the user’s application or job. A master node coordinates cluster resources while workers simply do what they’re told, which in this case is to run a map or reduce task on behalf of a user. Client applications written against the Hadoop APIs can submit jobs either synchronously and block for the result, or asynchronously and poll the master for job status. Cluster daemons are long-lived while user tasks are executed in ephemeral child processes. Although executing a separate process incurs the overhead of launching a separate JVM, it isolates the framework from untrusted user code that could—and in many cases does—fail in destructive ways. Since MapReduce is specifically targeting batch processing tasks, the additional overhead, while undesirable, is not necessarily a showstopper.
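
As a quick sketch of the two submission styles mentioned above, using the classic JobClient API (the class name and the five-second polling interval are ours):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitStyles {
  public static void run(JobConf conf) throws Exception {
    // Synchronous: submit the job and block, reporting progress,
    // until it completes (or fails).
    JobClient.runJob(conf);

    // Asynchronous: submit the job and poll the returned handle.
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);
    while (!job.isComplete()) {
      Thread.sleep(5000); // poll every five seconds (arbitrary)
    }
    System.out.println(job.isSuccessful() ? "Succeeded" : "Failed");
  }
}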

One of the ingredients in the secret sauce of MapReduce is the notion of data locality, by which we mean the ability to execute computation on the same machine where the data being processed is stored. Many traditional high-performance computing (HPC) systems have a similar master/worker model, but computation is generally distinct from data storage. In the classic HPC model, data is usually stored on a large shared centralized storage system such as a SAN or NAS. When a job executes, workers fetch the data from the central storage system, process it, and write the result back to the storage device. The problem is that this can lead to a storm effect when a large number of workers attempt to fetch the same data at the same time and, for large datasets, quickly causes bandwidth contention. MapReduce flips this model on its head. Instead of using a central storage system, a distributed filesystem is used where each worker is usually[6] both a storage node and a compute node. Blocks that make up files are distributed to nodes when they are initially written, and when computation is performed, the user-supplied code is pushed to a machine where a block is stored locally rather than the data being pulled over the network to the code. Remember that HDFS stores multiple replicas of each block. This is not just for data availability in the face of failures, but also to increase the chance that a machine with a copy of the data has available capacity to run a task.

There are two major daemons in Hadoop MapReduce: the jobtracker and the tasktracker.

The jobtracker is the master process, responsible for accepting job submissions from clients, scheduling tasks to run on worker nodes, and providing administrative functions such as worker health and task progress monitoring to the cluster. There is one jobtracker per MapReduce cluster and it usually runs on reliable hardware since a failure of the master will result in the failure of all running jobs. Clients and tasktrackers (see Tasktracker) communicate with the jobtracker by way of remote procedure calls (RPC).

Just like the relationship between datanodes and the namenode in HDFS, tasktrackers inform the jobtracker as to their current health and status by way of regular heartbeats. Each heartbeat contains the total number of map and reduce task slots available (see Tasktracker), the number occupied, and detailed information about any currently executing tasks. After a configurable period of no heartbeats, a tasktracker is assumed dead. The jobtracker uses a thread pool to process heartbeats and client requests in parallel.

When a job is submitted, information about each task that makes up the job is stored in memory. This task information updates with each tasktracker heartbeat while the tasks are running, providing a near real-time view of task progress and health. After the job completes, this information is retained for a configurable window of time or until a specified number of jobs have been executed. On an active cluster where many jobs, each with many tasks, are running, this information can consume a considerable amount of RAM. It’s difficult to estimate memory consumption without knowing how big each job will be (measured by the number of tasks it contains) or how many jobs will run within a given timeframe. For this reason, monitoring jobtracker memory utilization is absolutely critical.

The jobtracker provides an administrative web interface that, while a charming flashback to web (anti-)design circa 1995, is incredibly information-rich and useful. As tasktrackers all must report in to the jobtracker, a complete view of the available cluster resources is available via the administrative interface. Each job that is submitted has a job-level view that offers links to the job’s configuration, as well as data about progress, the number of tasks, various metrics, and task-level logs. If you are to be responsible for a production Hadoop cluster, you will find yourself checking this interface constantly throughout the day.

The act of deciding which tasks of a job should be executed on which worker nodes is referred to as task scheduling. This is not scheduling in the way that the cron daemon executes jobs at given times, but instead is more like the way the OS kernel schedules process CPU time. Much like CPU time sharing, tasks in a MapReduce cluster share worker node resources, but instead of context switching—that is, pausing the execution of a task to give another task time to run—when a task executes, it runs to completion. Understanding task scheduling—and by extension, resource allocation and sharing—is so important that an entire chapter (Chapter 7) is dedicated to the subject.

The second daemon, the tasktracker, accepts task assignments from the jobtracker, instantiates the user code, executes those tasks locally, and periodically reports progress back to the jobtracker. There is always a single tasktracker on each worker node. Tasktrackers and datanodes run on the same machines, which makes each node both a compute node and a storage node. Each tasktracker is configured with a specific number of map and reduce task slots that indicate how many of each type of task it is capable of executing in parallel. A task slot is exactly what it sounds like: an allocation of available resources on a worker node to which a task may be assigned and in which it executes. A tasktracker executes some number of map and reduce tasks in parallel, so there is concurrency both within a worker, where many tasks run, and at the cluster level, where many workers exist. Map and reduce slots are configured separately because they consume resources differently. It is common for tasktrackers to allow more map tasks than reduce tasks to execute in parallel, for reasons described in MapReduce. You may have picked up on the idea that deciding the number of map and reduce task slots is extremely important to making full use of the worker node hardware, and you would be correct.
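
Slot counts are configured per tasktracker in mapred-site.xml. The values below are purely illustrative; appropriate numbers depend entirely on the worker hardware:

<!-- Map tasks this worker may run in parallel. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>

<!-- Reduce tasks this worker may run in parallel. -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>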

Upon receiving a task assignment from the jobtracker, the tasktracker executes an attempt of the task in a separate process. The distinction between a task and a task attempt is important: a task is the logical unit of work, while a task attempt is a specific, physical instance of that task being executed. Since an attempt may fail, it is possible that a task has multiple attempts, although it’s common for tasks to succeed on their first attempt when everything is in proper working order. As this implies, each task in a job will always have at least one attempt, assuming the job wasn’t administratively killed. Communication between the task attempt (usually called the child, or child process) and the tasktracker is maintained via an RPC connection over the loopback interface called the umbilical protocol. The task attempt itself is a small application that acts as the container in which the user’s map or reduce code executes. As soon as the task completes, the child exits and the slot becomes available for assignment.

The tasktracker uses a list of user-specified directories (each of which is assumed to be on a separate physical device) to hold the intermediate map output and reducer input during job execution. This is required because this data is usually too large to fit exclusively in memory for large jobs or when many jobs are running in parallel.
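
The list is supplied as a comma-separated set of paths via the mapred.local.dir property; the paths below are examples only, with one directory per physical disk:

<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>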

Tasktrackers, like the jobtracker, also have an embedded web server and user interface. It’s rare, however, for administrators to access this interface directly, since you rarely know which machine to look at without first consulting the jobtracker interface, which already links to the relevant tasktracker pages.

Rather than panic when things go wrong, MapReduce is designed to treat failures as common and has very well-defined semantics for dealing with the inevitable. With tens, hundreds, or even thousands of machines making up a Hadoop cluster, machines—and especially hard disks—fail at a significant rate. It’s not uncommon to find that approximately 2% to 5% of the nodes in a large Hadoop cluster have some kind of fault, meaning they are operating either suboptimally or simply not at all. In addition to faulty servers, there can sometimes be errant user MapReduce jobs, network failures, and even errors in the data.

For jobs whose input or output dataset is on HDFS, it’s possible that HDFS could experience a failure. This is the equivalent of the filesystem used by a relational database experiencing a failure while the database is running. In other words, it’s bad. If a datanode process fails, any task that is currently reading from or writing to it will follow the HDFS error handling described in Chapter 2. Unless all datanodes containing a block fail during a read, or the namenode cannot find any datanodes on which to place a block during a write, this is a recoverable case and the task will complete. When the namenode fails, tasks will fail the next time they try to make contact with it. The framework will retry these tasks, but if the namenode doesn’t return, all attempts will be exhausted and the job will eventually fail. Additionally, if the namenode isn’t available, new jobs cannot be submitted to the cluster since job artifacts (such as the JAR file containing the user’s code) cannot be written to HDFS, nor can input splits be calculated.

Hadoop MapReduce is not without its flaws. The team at Yahoo! ran into a number of scalability limitations that were difficult to overcome given Hadoop’s existing architecture and design. In large-scale deployments such as Yahoo!’s “Hammer” cluster—a single, 4,000-plus node Hadoop cluster that powers various systems—the team found that the resource requirements on a single jobtracker were just too great. Further, operational issues such as dealing with upgrades and the single point of failure of the jobtracker were painful. YARN (or “Yet Another Resource Negotiator”) was created to address these issues.

Rather than have a single daemon that both tracks and assigns resources such as CPU and memory and handles MapReduce-specific job tracking, these functions are separated into two parts. The resource management aspect of the jobtracker is handled by a new daemon called the resource manager, which is responsible for creating and allocating resources to multiple applications. Each application is an individual MapReduce job, but rather than have a single jobtracker, each job now has its own jobtracker-equivalent, called an application master, that runs on one of the workers of the cluster. This is very different from having a centralized jobtracker in that the application master of one job is now completely isolated from that of any other. This means that if some catastrophic failure were to occur within one job’s application master, other jobs are unaffected. Further, because each application master is dedicated to a specific job, multiple application masters can be running on the cluster at once. Taken one step further, each can be a different version of the software, which enables simple rolling upgrades and multiversion support. When an application completes, its application master and other resources are returned to the cluster. As a result, there’s no central jobtracker daemon in YARN.

Worker nodes in YARN also run a new daemon called the node manager in place of the traditional tasktracker. While the tasktracker expressly handled MapReduce-specific functionality such as launching and managing tasks, the node manager is more generic: it launches any type of process, as dictated by the application, inside an application container. In the case of a MapReduce application, for instance, the node manager manages both the application master (the jobtracker-equivalent) and the individual map and reduce tasks.

With the ability to run arbitrary applications, each with its own application master, it’s even possible to write non-MapReduce applications that run on YARN. This is not entirely an accident: YARN provides a compute-model-agnostic resource management framework for any type of distributed computing framework. Members of the Hadoop community have already started to look at alternative processing systems that can be built on top of YARN for specific problem domains such as graph processing, as well as at more traditional HPC systems such as MPI.

The flexibility of YARN is enticing, but it’s still a new system. At the time of this writing, YARN is still considered alpha-level software and is not intended for production use. Initially introduced in the Apache Hadoop 2.0 branch, YARN hasn’t yet been battle-tested in large clusters. Unfortunately, while the Apache Hadoop 2.0 lineage includes highly desirable HDFS features such as high availability, the old-style jobtracker and tasktracker daemons (now referred to as MapReduce version one, or MRv1) have been removed in favor of YARN. This creates a potential conflict for Apache Hadoop users who want these features along with the tried-and-true MRv1 daemons. CDH4, however, includes the HDFS features as well as both MRv1 and YARN. For more information on Hadoop distributions, versions, and features, see Picking a Distribution and Version of Hadoop. Since YARN is not yet stable, and the goal of this book is to provide pragmatic operational advice, the remainder of the content will focus exclusively on the MRv1 daemons and their configuration.



[6] While it’s possible to separate them, this rarely makes sense because you lose the data locality features of Hadoop MapReduce. Those that wish to run only Apache HBase, on the other hand, very commonly run just the HDFS daemons along with their HBase counterparts.