GLOSSARY

activities
A logical grouping or classification of one or more jobs running on a cluster.
balancer
A service that ensures all nodes in the cluster store approximately the same amount of data, within a set range. Data is balanced across the nodes of the cluster, not across the disks within a node.
cluster (Hadoop)
A set of nodes configured to work together based on a common Hadoop component stack, with HDFS and MapReduce as the foundation.
components (Hadoop architecture)
The individual installed software products that together compose a complete Hadoop cluster. Some components are active and include servers, such as HDFS, while others are passive libraries. The servers of active components provide a service.
A component consists of roles that represent the different configurations the component requires, and each host server runs one or more of these roles. For example, HDFS roles include NameNode, secondary NameNode, and DataNode.
DataNode
A server and component role of HDFS that stores data. A DataNode performs filesystem operations assigned by the NameNode and stores the data within a Hadoop cluster. It is a slave node to the NameNode, which submits filesystem operation requests to all of the nodes within a cluster.
distributed metadata
An architecture in which the NameNode is eliminated by storing the metadata throughout the DataNodes in the cluster. This type of Hadoop architecture was developed to resolve the single point of failure in a standard Hadoop system: the NameNode.
Hadoop
A batch processing infrastructure that stores files and distributes work across a group of servers. The infrastructure is composed of HDFS and MapReduce components. Hadoop is an open source software platform designed to store and process quantities of data that are too large for a single device or server. Hadoop's strength lies in its ability to scale across thousands of commodity servers that don't share memory or disk space.
Hadoop assigns tasks across servers (called “worker nodes” or “slave nodes”), running them essentially simultaneously. This gives it the ability to analyze large quantities of data: by balancing tasks across different locations, it allows bigger jobs to be completed faster.
Hadoop can be thought of as an ecosystem; it's composed of many different components that all work together to create a single platform. There are two key functional components within this ecosystem: the storage of data (Hadoop Distributed File System, or HDFS) and the framework for running parallel computations on this data (MapReduce).
Hadoop Common
Usually only referred to by programmers, Hadoop Common is a common utilities library that contains code to support some of the other modules within the Hadoop ecosystem. When Hive and HBase want to access HDFS, for example, they do so using JARs (Java archives), which are libraries of Java code stored in Hadoop Common.
HBase
HBase is a columnar database management system that is built on top of Hadoop and runs on HDFS. Like MapReduce applications, HBase applications are written in Java, and other languages can be used via HBase's Thrift API, a framework that allows cross-language services development. The key difference between MapReduce and HBase is that HBase is intended to work with random, real-time read/write workloads.
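The following is a minimal sketch of the kind of random read/write access HBase is built for, using the HBase Java client API. The table name (users), column family (info), and row key are illustrative assumptions, not part of any particular deployment.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath for cluster settings.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

                // Random write: put a single cell into the row keyed "user123".
                Put put = new Put(Bytes.toBytes("user123"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                              Bytes.toBytes("user123@example.com"));
                table.put(put);

                // Random read: fetch that row back directly by key.
                Result result = table.get(new Get(Bytes.toBytes("user123")));
                byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println("email = " + Bytes.toString(email));
            }
        }
    }

Because single-row reads and writes like these are served directly rather than through a batch job, HBase suits the random workloads mentioned above in a way that MapReduce does not.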
HCatalog
A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored.
HDFS
An open source filesystem designed to store very large data files (megabytes to petabytes in size) with streaming data access patterns. HDFS splits these files into data blocks and distributes the blocks across hosts (DataNodes) in a cluster, which is what enables Hadoop to store huge files. It's a scalable filesystem that distributes and stores data across all machines in a Hadoop cluster. Each HDFS cluster contains the following (a brief usage sketch follows the list):
  • NameNode: Runs on a “master node” that tracks and directs the storage of the cluster.
  • DataNode: Runs on “slave nodes,” which make up the majority of the machines within a cluster. The NameNode instructs data files to be split into blocks, each of which is replicated three times and stored on machines across the cluster. These replicas ensure the entire system won't go down if one server fails or is taken offline, a property known as “fault tolerance.”
  • Client: Client machines have Hadoop installed on them. They're responsible for loading data into the cluster, submitting MapReduce jobs, and viewing the results once the jobs are complete.
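As a brief sketch of how a client interacts with HDFS from Java, the following writes and then reads a small file through the org.apache.hadoop.fs.FileSystem API; the path and file contents are illustrative, and the block splitting and replication described above happen transparently.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/example.txt"); // hypothetical path

            // Write: the client streams bytes; HDFS splits them into blocks and replicates each block.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode tells the client which DataNodes hold each block.
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }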
Hive
A data warehouse built on top of Hadoop providing data summarization, query, and analysis. Hive includes a SQL-like language called Hive Query Language (HiveQL), whose queries are compiled into jobs that run on the cluster just as handwritten MapReduce programs would. In a very general sense, Hive is used for complex, long-running tasks and analyses on large sets of data. It provides a mechanism to project structure onto that data and query it using HiveQL, while still allowing traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL.
HiveQL
A SQL-like programming language used with Hive.
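The following is a minimal sketch of running a HiveQL query from Java over JDBC against HiveServer2; the connection URL, credentials, and the sales table are illustrative assumptions. The same query could be run from the Hive shell or Beeline; JDBC is used here only to keep the example in Java.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlExample {
        public static void main(String[] args) throws Exception {
            // The Hive JDBC driver; requires the hive-jdbc jar on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 endpoint and database.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL looks like SQL; Hive compiles it into jobs that run on the cluster.
                ResultSet rs = stmt.executeQuery(
                    "SELECT product, SUM(amount) AS total " +
                    "FROM sales GROUP BY product");          // sales is a hypothetical table
                while (rs.next()) {
                    System.out.println(rs.getString("product") + "\t" + rs.getLong("total"));
                }
            }
        }
    }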
hosts
A device, such as a computer or a switch, attached to a computer or telecommunications network; also, a point in a network topology where lines intersect or branch.
Impala
Like Hive, Impala uses SQL syntax rather than Java to access data. The difference between Hive and Impala is speed: a query using Hive may take minutes, hours, or longer, whereas a query using Impala usually takes seconds (or less).
Impala is used for analyses that you want to run and return quickly on a small subset of your data, for example, analyzing the sales of a large warehouse company for a single product. It is used as an analytic tool on top of prepared, more structured data.
job
A mapper or reducer execution across a dataset. A job may split the data to be processed across mapper tasks for parallel processing, with a master (the JobTracker) scheduling and monitoring jobs across slaves (the TaskTrackers).
JobTracker
A service that assigns MapReduce tasks to specific nodes in the cluster, preferably nodes that are also functioning as DataNodes. The JobTracker schedules map and reduce tasks among the TaskTrackers with an awareness of data location.
MapReduce
A process of distributing work across a cluster, used by the MapReduce engine. Mappers process input dataset records, transforming input key-value pairs into a set of intermediate key-value pairs. Reducers merge the intermediate values that share a key into a smaller set of values, and combiners perform local (on the same host) aggregation of intermediate output, reducing the amount of data transferred from mapper to reducer.
MapReduce is the process used to process the large amount of data Hadoop stores in HDFS. Originally created by Google, its strength lies in the ability to divide a single large data processing job into smaller tasks.
Once the tasks have been created, they're spread across multiple nodes and run simultaneously. The “reduce” phase then combines the results. The following nodes are used in this process:
  • JobTracker: The JobTracker oversees how MapReduce jobs are split up into tasks and divided among nodes within the cluster.
  • TaskTracker: The TaskTracker accepts tasks from the JobTracker, performs the work, and alerts the JobTracker once it's done. TaskTrackers and DataNodes are located on the same nodes to improve performance.
  • Data locality: Executing map code on the node where the data resides. All clusters should have the appropriate topology, and Hadoop must be aware of the topology of the nodes where tasks are executed, because map code must be able to read its data locally. TaskTracker nodes are used to execute map tasks, so the Hadoop scheduler needs information about node topology for proper task assignment. In other words, whenever you run a MapReduce program on a particular part of HDFS data, you want to run that program on the node, or machine, that actually stores the data in HDFS. Doing so makes processing much faster, since it avoids moving large amounts of data around.

    When a MapReduce job is executed, part of what the JobTracker does is determine which machines hold the data required for each task. Recall that when the data was loaded, the NameNode had each file split into blocks and each block replicated three times: one copy is stored on one machine, and the second and third copies are stored on two other machines. This is part of Hadoop's distribution process.

    Storing the data across three machines thus gives you a much higher chance of achieving data locality, since it's likely that at least one of those machines will be free enough to process the data stored at that location.
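The following is a minimal word-count sketch using the Hadoop MapReduce Java API: a Mapper emits (word, 1) pairs, a Reducer (also used as a combiner for local aggregation) sums them, and a small driver configures and submits the job. The input and output paths are taken from the command line and are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in each input line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word; also usable as a combiner.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configures the job and submits it to the cluster for scheduling.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local (on-host) aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

When the job is submitted, the map tasks are scheduled, where possible, on the nodes that already hold the input blocks (the data locality described above), and the reduce phase merges the per-word counts into the final result.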

NameNode
A service that maintains a directory of all files in HDFS and tracks where data is stored in the cluster. The NameNode is the master that directs the slave DataNodes.
nodes
An abstract unit that composes a cluster; a vertex in a graph.
Pig (Apache Pig)
A programming language designed to handle any type of data. Pig helps users focus more on analyzing large datasets and spend less time writing map programs and reduce programs.
Like Hive and Impala, Pig is a high-level platform used for creating MapReduce programs more easily. The programming language Pig uses is called Pig Latin, and it allows you to extract, transform, and load (ETL) data at a very high level; a Pig Latin script is typically only a fraction of the size of the equivalent Java code.
While Hive and Impala require data to be more structured in order to be analyzed, Pig allows you to work with unstructured data. In other words, while Hive and Impala are essentially query engines used for more straightforward analysis, Pig's ETL capability means it can work on unstructured data, cleaning it up and organizing it so that queries can be run against it.
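The following is a minimal sketch of an ETL-style Pig Latin flow embedded in Java through Pig's PigServer API; the input file, its field layout, and the use of local execution mode are illustrative assumptions, and the same statements could equally be typed into Pig's Grunt shell.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigEtlSketch {
        public static void main(String[] args) throws Exception {
            // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Pig Latin: load raw, loosely structured log lines, then clean and aggregate them.
            pig.registerQuery(
                "logs = LOAD 'access_log.txt' USING PigStorage(' ') " +   // hypothetical input file
                "AS (ip:chararray, ts:chararray, url:chararray);");
            pig.registerQuery("valid = FILTER logs BY url IS NOT NULL;");
            pig.registerQuery("by_url = GROUP valid BY url;");
            pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(valid) AS n;");

            // Store the cleaned, aggregated result for downstream queries.
            pig.store("hits", "url_hits");

            pig.shutdown();
        }
    }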
slot
A map or reduce computation unit on a node. Each active map or reduce task occupies one slot, which could be a map or a reduce slot. A TaskTracker has a configured number of slots available, and JobTracker allocates work to the TaskTracker with available slots nearest to the data.
Stack (Hadoop)
Hadoop software layers; applications that interact directly with Hadoop.
  • Data processing layer: encapsulates the MapReduce framework
  • Data storage layer: the filesystem (HDFS)
Sqoop
An ETL tool that supports the transfer of data between Hadoop and structured data sources; a connection and transfer mechanism that moves data between Hadoop and relational databases.
task
A mapper or reducer instance operating on a slice of data. Tasks are executed by the Hadoop TaskTracker on nodes that have resources available for executing them. Each active map or reduce task occupies one slot.
TaskAttempt
An instance of a map or reduce task, identified by a task attempt ID. The JobTracker may run a task on more than one node, either because the task failed or to obtain faster results from another node; each additional run adds to the number of attempts.
TaskWaiting
A task state indicating that the task is waiting to be launched.
YARN
YARN is a resource manager that was created by separating the processing engine and resource management capabilities of MapReduce. It is an updated way of handling the delegation of resources for MapReduce jobs. It takes the place of the JobTracker and TaskTracker. YARN supports multiple processing models in addition to MapReduce. It is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop.