GLOSSARY

activities
A logical grouping or classification of one or more jobs running on a cluster.
balancer
A service that ensures all nodes in the cluster store approximately the same amount of data, within a set range. Data is balanced across the nodes of the cluster, not across the disks within a node.
cluster (Hadoop)
A set of nodes configured to work together based on a common Hadoop component stack, with HDFS and MapReduce as the foundation.
components (Hadoop architecture)
The individual installed software products that together compose a complete Hadoop cluster. Some components are active and include servers, such as HDFS, while others are passive libraries. The servers of active components provide a service.
A component consists of roles that represent the different configurations the component requires, and each host server runs one or more of these roles. For example, HDFS roles include NameNode, secondary NameNode, and DataNode.
DataNode
A server and component role of HDFS that stores data. A DataNode performs filesystem operations assigned by the NameNode and stores the data within a Hadoop cluster. It is a slave node to the NameNode, which submits filesystem operation requests to all of the nodes within a cluster.
distributed metadata
An architecture in which the NameNode is eliminated by storing the metadata throughout the DataNodes in the cluster. This type of Hadoop architecture was developed to resolve the single point of failure in a standard Hadoop system: the NameNode.
Hadoop
A batch processing infrastructure that stores files and distributes work across a group of servers. The infrastructure is composed of HDFS and MapReduce components. Hadoop is an open source software platform designed to store and process quantities of data that are too large for a single device or server. Hadoop's strength lies in its ability to scale across thousands of commodity servers that don't share memory or disk space.
Hadoop assigns tasks across servers (called “worker nodes” or “slave nodes”), running them essentially simultaneously. This gives it the ability to analyze large quantities of data: by balancing tasks across different locations, it allows bigger jobs to be completed faster.
Hadoop can be thought of as an ecosystem; it's composed of many different components that all work together to create a single platform. There are two key functional components within this ecosystem: the storage of data (Hadoop Distributed File System, or HDFS) and the framework for running parallel computations on this data (MapReduce).
Hadoop Common
Usually only referred to by programmers, Hadoop Common is a common utilities library that contains code to support some of the other modules within the Hadoop ecosystem. When Hive and HBase want to access HDFS, for example, they do so using JARs (Java archives), which are libraries of Java code stored in Hadoop Common.
HBase
HBase is a columnar database management system that is built on top of Hadoop and runs on HDFS. Like MapReduce applications, HBase applications are written in Java, and other languages can be used via HBase's Thrift API, a framework that allows cross-language services development. The key difference between MapReduce and HBase is that HBase is intended to work with random, real-time read/write workloads.
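The following is a minimal sketch of the kind of random read/write access HBase is built for, using the HBase Java client API. The table name (users), column family (info), and row key are illustrative assumptions, not part of any particular deployment.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath for cluster settings.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

                // Random write: put a single cell into the row keyed "user123".
                Put put = new Put(Bytes.toBytes("user123"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                              Bytes.toBytes("user123@example.com"));
                table.put(put);

                // Random read: fetch that row back directly by key.
                Result result = table.get(new Get(Bytes.toBytes("user123")));
                byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println("email = " + Bytes.toString(email));
            }
        }
    }

Because single-row reads and writes like these are served directly rather than through a batch job, HBase suits the random workloads mentioned above in a way that MapReduce does not.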
HCatalog
A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored.
HDFS
An open source filesystem designed to store very large data files (megabytes to petabytes in size) with streaming data access patterns. HDFS splits these files into data blocks and distributes the blocks across hosts (DataNodes) in a cluster, which is what enables Hadoop to store huge files. It's a scalable filesystem that distributes and stores data across all machines in a Hadoop cluster. Each HDFS cluster contains the following (a brief usage sketch follows the list):
  • NameNode: Runs on a “master node” that tracks and directs the storage of the cluster.
  • DataNode: Runs on “slave nodes,” which make up the majority of the machines within a cluster. The NameNode instructs data files to be split into blocks, each of which is replicated three times and stored on machines across the cluster. These replicas ensure the entire system won't go down if one server fails or is taken offline, a property known as “fault tolerance.”
  • Client: Client machines have Hadoop installed on them. They're responsible for loading data into the cluster, submitting MapReduce jobs, and viewing the results once the jobs are complete.
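As a brief sketch of how a client interacts with HDFS from Java, the following writes and then reads a small file through the org.apache.hadoop.fs.FileSystem API; the path and file contents are illustrative, and the block splitting and replication described above happen transparently.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/example.txt"); // hypothetical path

            // Write: the client streams bytes; HDFS splits them into blocks and replicates each block.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode tells the client which DataNodes hold each block.
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }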
Hive
A data warehouse built on top of Hadoop providing data summarization, query, and analysis. Hive includes a SQL-like language called Hive Query Language (HiveQL), whose queries are compiled into jobs that run on the cluster just as handwritten MapReduce programs would. In a very general sense, Hive is used for complex, long-running tasks and analyses on large sets of data. It provides a mechanism to project structure onto that data and query it using HiveQL, while still allowing traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL.
HiveQL
A SQL-like programming language used with Hive.
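The following is a minimal sketch of running a HiveQL query from Java over JDBC against HiveServer2; the connection URL, credentials, and the sales table are illustrative assumptions. The same query could be run from the Hive shell or Beeline; JDBC is used here only to keep the example in Java.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlExample {
        public static void main(String[] args) throws Exception {
            // The Hive JDBC driver; requires the hive-jdbc jar on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 endpoint and database.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL looks like SQL; Hive compiles it into jobs that run on the cluster.
                ResultSet rs = stmt.executeQuery(
                    "SELECT product, SUM(amount) AS total " +
                    "FROM sales GROUP BY product");          // sales is a hypothetical table
                while (rs.next()) {
                    System.out.println(rs.getString("product") + "\t" + rs.getLong("total"));
                }
            }
        }
    }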
hosts
A device, such as a computer or a switch, attached to a computer or telecommunications network; also, a point in a network topology where lines intersect or branch.
Impala
Like Hive, Impala uses SQL syntax rather than Java to access data. The difference between Hive and Impala is speed: a query using Hive may take minutes, hours, or longer, whereas a query using Impala usually takes seconds (or less).
Impala is used for analyses that you want to run and return quickly on a small subset of your data, for example, analyzing the sales of a large warehouse company for a single product. It is used as an analytic tool on top of prepared, more structured data.
job
A mapper or reducer execution across a dataset. A job may split the data to be processed across mapper tasks for parallel processing, with a master (the JobTracker) scheduling and monitoring jobs across slaves (the TaskTrackers).
JobTracker
A service that assigns MapReduce tasks to specific nodes in the cluster, preferably nodes that are also functioning as DataNodes. The JobTracker schedules map and reduce tasks among the TaskTrackers with an awareness of data location.
MapReduce
A process of distributing work across a cluster, used by the MapReduce engine. Mappers process input dataset records, transforming input key-value pairs into a set of intermediate key-value pairs. Reducers merge the intermediate values that share a key into a smaller set of values, and combiners perform local (on the same host) aggregation of intermediate output, reducing the amount of data transferred from mapper to reducer.
MapReduce is the process used to process the large amount of data Hadoop stores in HDFS. Originally created by Google, its strength lies in the ability to divide a single large data processing job into smaller tasks.
Once the tasks have been created, they're spread across multiple nodes and run simultaneously. The “reduce” phase then combines the results. The following nodes are used in this process:
  • JobTracker: The JobTracker oversees how MapReduce jobs are split up into tasks and divided among nodes within the cluster.
  • TaskTracker: The TaskTracker accepts tasks from the JobTracker, performs the work, and alerts the JobTracker once it's done. TaskTrackers and DataNodes are located on the same nodes to improve performance.
  • Data locality: Executing map code on the node where the data resides. All clusters should have the appropriate topology, and Hadoop must be aware of the topology of the nodes where tasks are executed, because map code must be able to read its data locally. TaskTracker nodes are used to execute map tasks, so the Hadoop scheduler needs information about node topology for proper task assignment. In other words, whenever you run a MapReduce program on a particular part of HDFS data, you want to run that program on the node, or machine, that actually stores the data in HDFS. Doing so makes processing much faster, since it avoids moving large amounts of data around.

    When a MapReduce job is executed, part of what the JobTracker does is determine which machines hold the data required for each task. Recall that when the data was loaded, the NameNode had each file split into blocks and each block replicated three times: one copy is stored on one machine, and the second and third copies are stored on two other machines. This is part of Hadoop's distribution process.

    Storing the data across three machines thus gives you a much higher chance of achieving data locality, since it's likely that at least one of those machines will be free enough to process the data stored at that location.
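The following is a minimal word-count sketch using the Hadoop MapReduce Java API: a Mapper emits (word, 1) pairs, a Reducer (also used as a combiner for local aggregation) sums them, and a small driver configures and submits the job. The input and output paths are taken from the command line and are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in each input line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word; also usable as a combiner.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configures the job and submits it to the cluster for scheduling.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local (on-host) aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

When the job is submitted, the map tasks are scheduled, where possible, on the nodes that already hold the input blocks (the data locality described above), and the reduce phase merges the per-word counts into the final result.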

NameNode
A service that maintains a directory of all files in HDFS and tracks where data is stored in the cluster. The NameNode is the master that directs the slave DataNodes.
nodes
An abstract unit that composes a cluster; a vertex in a graph.
Pig (Apache Pig)
A programming language designed to handle any type of data. Pig helps users focus more on analyzing large datasets and spend less time writing map programs and reduce programs.
Like Hive and Impala, Pig is a high-level platform used for creating MapReduce programs more easily. The programming language Pig uses is called Pig Latin, and it allows you to extract, transform, and load (ETL) data at a very high level; a Pig Latin script is typically only a fraction of the size of the equivalent Java code.
While Hive and Impala require data to be more structured in order to be analyzed, Pig allows you to work with unstructured data. In other words, while Hive and Impala are essentially query engines used for more straightforward analysis, Pig's ETL capability means it can work on unstructured data, cleaning it up and organizing it so that queries can be run against it.
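The following is a minimal sketch of an ETL-style Pig Latin flow embedded in Java through Pig's PigServer API; the input file, its field layout, and the use of local execution mode are illustrative assumptions, and the same statements could equally be typed into Pig's Grunt shell.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigEtlSketch {
        public static void main(String[] args) throws Exception {
            // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Pig Latin: load raw, loosely structured log lines, then clean and aggregate them.
            pig.registerQuery(
                "logs = LOAD 'access_log.txt' USING PigStorage(' ') " +   // hypothetical input file
                "AS (ip:chararray, ts:chararray, url:chararray);");
            pig.registerQuery("valid = FILTER logs BY url IS NOT NULL;");
            pig.registerQuery("by_url = GROUP valid BY url;");
            pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(valid) AS n;");

            // Store the cleaned, aggregated result for downstream queries.
            pig.store("hits", "url_hits");

            pig.shutdown();
        }
    }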
slot
A map or reduce computation unit on a node. Each active map or reduce task occupies one slot, which could be a map or a reduce slot. A TaskTracker has a configured number of slots available, and JobTracker allocates work to the TaskTracker with available slots nearest to the data.
Stack (Hadoop)
Hadoop software layers; applications that interact directly with Hadoop.
  • Data processing layer: encapsulates the MapReduce framework
  • Data storage layer: the filesystem (HDFS)
Sqoop
An ETL tool that supports the transfer of data between Hadoop and structured data sources; a connection and transfer mechanism that moves data between Hadoop and relational databases.
task
A mapper or reducer instance operating on a slice of data. Tasks are executed by the Hadoop TaskTracker on nodes that have resources available for executing them. Each active map or reduce task occupies one slot.
TaskAttempt
An instance of a map or reduce task, identified by a task attempt ID. The JobTracker may run a task on more than one node, either because the task failed or to obtain faster results from another node; each additional run adds to the number of attempts.
TaskWaiting
A task state indicating that the task is waiting to be launched.
YARN
YARN is a resource manager that was created by separating the processing engine and resource management capabilities of MapReduce. It is an updated way of handling the delegation of resources for MapReduce jobs. It takes the place of the JobTracker and TaskTracker. YARN supports multiple processing models in addition to MapReduce. It is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop.