Multitenancy is a fancy word for supporting many independent entities within a single larger system. The ability to support multiple discrete entities in a system is generally useful when it is costly or complex to operate an instance of that system for each entity. An example of this is the ability to run multiple databases within a single database server, a feature supported by almost all RDBMS vendors. By bringing multiple users of a service together, we can take advantage of economies of scale and offer greater service as a whole. A simple example of this in the context of Hadoop is that, if a large cluster is built to run hourly production MapReduce jobs, there are generally free resources between executions of those jobs. If we were to silo users by group or use case, these lulls in resource usage would be lost across all groups. Instead, it often (but admittedly not always) makes sense to combine smaller silo clusters into a single large cluster. Not only does this simplify operations, but it increases the available capacity to service consumers, on average, which improves system resource utilization.
Unfortunately, running multitenant systems comes with some obvious challenges. Controlling access to data and resources immediately becomes a point of concern, especially when the data is sensitive in nature. It may be that two different groups of users should not see each other’s data, or should not even know that the other exists. Controlling access to resources also becomes critical. It shouldn’t be possible for one group to monopolize the resources of a shared system. Administrators and especially security and compliance staff expect to be able to monitor and audit user activity in multitenant systems to ensure policies are enforced. The degree of isolation within a multitenant system can be seen as the gating factor between silo and shared deployments. This is even more true of data storage and processing systems, and is the impetus for a deep review of identity, access, and authorization controls in a system such as Hadoop.
When a user performs an action in Hadoop, there are three significant questions:
Who does this user claim to be?
The identity of the entity interacting with the cluster (where entity means a human user or another system) is who they purport to be. As humans, we identify ourselves using our names. In Linux, we use usernames, whereas the relational database MySQL, for instance, has its own notion of a user. The identity is an arbitrary label that is unique to an entity, and something to which we can attach meaning.
Can this user prove they are who they say they are?
Anyone can claim to be your Uncle Larry or Aunt Susan, but can they prove it? We authenticate one another by confirming an identity using some kind of system. To enter a country, an individual must present a valid and authentic passport bearing a photo of the person, for instance (although some may say this is a weak form of subjective authentication). Linux provides multiple forms of authentication via plug-ins, although passwords are probably the most common. Authentication mechanisms vary in strength (the rigor with which they confirm a user’s identity).
Is this user allowed to do what they’re asking to do?
Once a user has identified themselves and we are reasonably sure they are who they claim to be, only then does it make sense to ensure they have been authorized to perform the requested action. It never makes sense for a system to support authorization without first authenticating users; a person could simply lie about who they are to gain privileges they wouldn’t otherwise have.
Hadoop operates in either the default, so-called simple mode, or in secure mode, which provides strong authentication support via Kerberos. For many, the simple security mode is sufficient and offers reasonable protection from mistakes in a trusted environment. As its name implies, it’s simple to configure and manage, relying primarily on the host for authentication. If, however, you are running Hadoop in an untrusted, multitenant environment, or where accidental data exposure would be catastrophic, secure mode is the appropriate option. In secure mode, Hadoop uses the well-known Kerberos protocol to authenticate users and the daemons themselves during all operations. Additionally, MapReduce tasks in secure mode are executed as the OS user who submitted the job, whereas in simple mode, they are executed as the user running the tasktracker.
The most important aspect to understand is that, regardless of whether simple or secure mode is configured, this setting controls only how users are authenticated with the system. Authorization is inherently service specific. The evaluation of the authenticated user’s privileges in the context of the action they are asking to perform is controlled entirely by the service. In the case of HDFS, this means deciding whether a user is permitted to read from or write to a file, for example. Authentication must always be performed before authorization is considered, and because it is commonly the same for all services, it can be built as a separate, generic service.
In Hadoop, there is a strong relationship between who a user is in the host operating system and who they are in HDFS or MapReduce. Furthermore, since there are many machines involved in a cluster, it may not be immediately obvious what is actually required in order to execute a MapReduce job. Hadoop, like most systems, uses the concepts of users and groups to organize identities. However—and this is the root of quite a bit of confusion—it uses the identity of the user according to the operating system. That is, there is no such thing as a Hadoop user or group. When an OS user, in my case user esammer, executes a command using the hadoop executable, or uses any of the Java APIs, Hadoop accepts this username as the identity with no further checks. Versions of Apache Hadoop prior to 0.20.200 or CDH3u0 also allowed users to specify their identity by setting a configuration parameter when performing an action in HDFS or even running a MapReduce job, although this is no longer possible.
In simple mode, the Hadoop library on the client sends the username of the running process with each command to either the namenode or jobtracker, depending on the command executed. When in secure mode, the primary component of the Kerberos principal name is used as the identity of the user. The user must already have a valid Kerberos ticket in their cache, otherwise the command will fail with an incredibly cryptic message like Example 6-1.
Example 6-1. A typical failed authentication attempt with security enabled
WARN ipc.Client: Exception encountered while connecting to the server: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
As mentioned earlier, Hadoop supports strong authentication using the Kerberos protocol. Kerberos was developed by a team at MIT to provide strong authentication of clients to a server and is well-known to many enterprises. When operating in secure mode, all clients must provide a valid Kerberos ticket that can be verified by the server. In addition to clients being authenticated, daemons are also verified. In the case of HDFS, for instance, a datanode is not permitted to connect to the namenode unless it provides a valid ticket within each RPC. All of this amounts to an environment where every daemon and client application can be cryptographically verified as a known entity prior to allowing any operations to be performed, a desirable feature of any data storage and processing system.
To say Kerberos is “well-known” could be an overstatement. For many, Kerberos is shrouded in dense, intimidating terminology and requires specific knowledge to configure properly. Many implementations of Kerberos exist, and though there are RFCs[15] that describe the Kerberos protocol itself, management tools and methods have traditionally been vendor-specific. In the Linux world, one of the most popular implementations is MIT Kerberos version 5 (or MIT krb5 for short), an open source software package that includes the server, client, and admin tools. Before we dive into the details of configuring Hadoop to use Kerberos for authentication, let’s first take a look at how Kerberos works, as well as the MIT implementation.
A user in Kerberos is called a principal, which is made up of three distinct components: the primary, instance, and realm. The first component of the principal is called the primary, or sometimes the user component. The primary component is an arbitrary string and may be the operating system username of the user or the name of a service. The primary component is followed by an optional section called the instance, which is used to create principals for users in special roles or to define the host on which a service runs, for example. An instance, if present, is separated from the primary by a slash and is used to disambiguate multiple principals for a single user or service. The final component of the principal is the realm. The realm is similar to a domain in DNS in that it logically defines a related group of objects, although rather than hostnames as in DNS, the Kerberos realm defines a group of principals (see Table 6-1). Each realm can have its own settings, including the location of the KDC on the network and supported encryption algorithms. Large organizations commonly create distinct realms to delegate administration of a realm to a group within the enterprise. Realms, by convention, are written in uppercase characters.
Table 6-1. Example Kerberos principals
Principal | Description |
---|---|
esammer@MYREALM.CLOUDERA.COM | A standard user principal. User esammer in realm MYREALM.CLOUDERA.COM. |
esammer/admin@MYREALM.CLOUDERA.COM | The admin instance of the user esammer in the realm MYREALM.CLOUDERA.COM. |
hdfs/hadoop01.cloudera.com@MYREALM.CLOUDERA.COM | The hdfs service on the host hadoop01.cloudera.com in the realm MYREALM.CLOUDERA.COM. |
At its core, Kerberos provides a central, trusted service called the Key Distribution Center or KDC. The KDC is made up of two distinct services: the authentication server (AS), which is responsible for authenticating a client and providing a ticket granting ticket (TGT), and the ticket granting service (TGS), which, given a valid TGT, can grant a ticket that authenticates a user when communicating with a Kerberos-enabled (or Kerberized) service. The KDC contains a database of principals and their keys, very much like /etc/passwd and some KDC implementations (including MIT Kerberos) support storing this data in centralized systems like LDAP. That’s a lot of verbiage, but the process of authenticating a user is relatively simple.
Consider the case where user esammer wants to execute the command hadoop fs -get /user/esammer/data.txt. When operating in secure mode, the HDFS namenode and datanode will not permit any communication that does not contain a valid Kerberos ticket. We also know that at least two (and frequently many more) services must be contacted: one is the namenode to get the file metadata and check permissions, and the rest are the datanodes to retrieve the blocks of the file. To obtain any tickets from the KDC, we first retrieve a TGT from the AS by providing our principal name. The TGT, which is only valid for an administrator-defined period of time, is encrypted with our password and sent back to the client. The client prompts us for our password and attempts to decrypt the TGT. If it works, we’re ready to request a ticket from the TGS; otherwise, we’ve failed to decrypt the TGT and we’re unable to request tickets. It’s important to note that our password has never left the local machine; the system works because the KDC has a copy of the password, which has been shared in advance. This is a standard shared secret or symmetric key encryption model.
Even with a valid TGT, it is not yet possible to speak to the namenode or datanode; we need a ticket for each specific service. Now that we have a TGT, we can request service-specific tickets from the TGS. To do so, using our TGT, we ask the TGS for a ticket for a specific service, identified by its service principal (such as that of the cluster’s namenode). The TGS, which is part of the KDC, can verify that the TGT we provide is valid because it was encrypted with a special key called the TGT key. If the TGT can be validated and it hasn’t yet expired, the TGS provides us a ticket for the service, which is also only valid for a finite amount of time. Within the returned ticket is a session key, a shared secret that both the client and the service can use; the service can trust it because the ticket is encrypted with the service’s own key, which it shares with the KDC in advance. Using this ticket, we can now contact the namenode and request metadata for /user/esammer/data.txt. The namenode validates the ticket and, assuming everything checks out, performs the operation we originally requested. Additionally, for operations that involve access to block data, the namenode generates a block token for each block returned to the client. The block token is then provided by the client to the datanode, which validates its authenticity before providing access to the block data.
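To make this concrete, the ticket cache can be inspected before and after an HDFS operation. The hostname, realm, and exact output below are illustrative only (they follow the MIT Kerberos klist format and the example realm used elsewhere in this chapter), so expect minor differences in your environment.

[esammer@hadoop01 ~]$ hadoop fs -get /user/esammer/data.txt
[esammer@hadoop01 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: esammer@MYREALM.CLOUDERA.COM

Valid starting     Expires            Service principal
03/22/12 15:35:50  03/23/12 15:35:50  krbtgt/MYREALM.CLOUDERA.COM@MYREALM.CLOUDERA.COM
03/22/12 15:36:02  03/23/12 15:35:50  hdfs/hadoop01.cloudera.com@MYREALM.CLOUDERA.COM

The krbtgt entry is the TGT obtained from the AS, while the hdfs entry is the service ticket obtained from the TGS for the namenode's principal; access to the datanodes is covered by block tokens rather than additional Kerberos tickets.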
The TGT received from the KDC’s AS usually has a lifetime of 8 to 24 hours, meaning it is only necessary to provide a password once per time period. The TGT is cached locally on the client machine and reused during subsequent requests to the TGS. The MIT Kerberos implementation, for instance, caches ticket information in the temporary file /tmp/krb5cc_uid, where uid is the Linux user’s uid. To perform the initial authentication and retrieve a TGT from the KDC with MIT Kerberos, use the kinit command; to list cached credentials, use the klist command as in Example 6-2.
Example 6-2. Obtaining a ticket granting ticket with kinit
[esammer@hadoop01 ~]$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_500)
[esammer@hadoop01 ~]$ kinit
Password for esammer@MYREALM.CLOUDERA.COM:
[esammer@hadoop01 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: esammer@MYREALM.CLOUDERA.COM

Valid starting     Expires            Service principal
03/22/12 15:35:50  03/23/12 15:35:50  krbtgt/MYREALM.CLOUDERA.COM@MYREALM.CLOUDERA.COM
        renew until 03/22/12 15:35:50
[esammer@hadoop01 ~]$ hadoop fs -get /user/esammer/data.txt
...
Kerberos is an enormous topic, complex in its own right. Prior to embarking on a Kerberos deployment, it’s critical to understand how hosts and services are accessed by users as well as other services. Without a coherent understanding of a system, it’s likely that you will find that services that used to be accessible no longer work. For a detailed explanation of Kerberos, see Kerberos: The Definitive Guide by Jason Garman (O'Reilly Media).
Now that we have some understanding of how Kerberos works conceptually, it’s worth looking at how this applies to Hadoop. There are two primary forms of authentication that occur in Hadoop with respect to Kerberos: nodes within the cluster authenticating with one another to ensure that only trusted machines are part of the cluster, and users, both human and system, that access the cluster to interact with services. Since many of the Hadoop daemons also have embedded web servers, they too must be secured and authenticated.
Within each service, both users and worker nodes are verified by their Kerberos credentials. HDFS and MapReduce follow the same general architecture; the worker daemons are each given a unique principal that identifies each daemon, they authenticate, and include a valid ticket in each RPC to their respective master daemon. The workers authenticate by using a keytab stored on the local disk. Though tedious, the act of creating a unique principal for each daemon, for each host, generating the keytab, and getting it to the proper machine, is absolutely necessary when configuring a secure Hadoop cluster. Workers must have their own unique principals because if they didn’t, the KDC would issue a similar TGT (based on the principal’s key and timestamp) to all nodes, and services would see potentially hundreds of clients all attempting to authenticate with the same ticket, falsely characterizing it as a replay attack.
Multiple principals are used by the system when Hadoop is operating in secure mode. They take the form service-name/hostname@KRB.REALM.COM, where service-name is hdfs in the case of the HDFS daemons and mapred in the case of the MapReduce daemons. Since worker nodes run both a datanode as well as a tasktracker, each node requires two principals to be generated: one for the datanode and one for the tasktracker. The namenode and jobtracker also have principals, although in smaller clusters where one or both of these daemons run on a node that is also a slave, it is not necessary to create a separate principal, as the namenode and datanode can share a principal and the tasktracker and jobtracker can share a principal.
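To see which daemon principals already exist in the KDC database, MIT Kerberos's kadmin.local can list them. The realm and hostnames below are placeholders, and the command must be run on the KDC host by a user with access to the KDC database.

[root@kdc01 ~]# kadmin.local -q "listprincs" | egrep '^(hdfs|mapred)/'
hdfs/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM
hdfs/hadoop02.mycompany.com@MYREALM.MYCOMPANY.COM
mapred/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM
mapred/hadoop02.mycompany.com@MYREALM.MYCOMPANY.COM
...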
Since it isn’t feasible to log into each machine and execute kinit as both user hdfs and mapred and provide a password, the keys for the service principals are exported to files and placed in a well-known location. These files are referred to as key tables, or just keytabs. Exporting the keys to files may seem dangerous, but if the contents of the files are properly protected by filesystem permissions (that is, owned by the user the daemon runs as, with permissions set to 0400), the integrity of the key is not compromised. When the daemons start up, they use this keytab to authenticate with the KDC and get a ticket so they can connect to the namenode or jobtracker, respectively. When operating in secure mode, it is not possible for a datanode or tasktracker to connect to its constituent master daemon without a valid ticket.
With MIT Kerberos, exporting a key to a keytab will invalidate any previously exported copies of that same key unless the -norandkey option is used. It’s absolutely critical that you do not export a key that has already been exported unless that’s what you mean to do. This should only be necessary if you believe a keytab has been compromised or is otherwise irrevocably lost or destroyed.
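One way to check whether an existing keytab has been invalidated by a later export is to compare key version numbers (KVNOs): klist -k shows the KVNO recorded in the keytab, and the MIT kvno utility reports the current version known to the KDC (it requires a valid TGT). The path, principal, and output shown here are illustrative only.

[esammer@hadoop01 ~]$ sudo klist -e -k -t /etc/hadoop/conf/hdfs.keytab
Keytab name: FILE:/etc/hadoop/conf/hdfs.keytab
KVNO Timestamp         Principal
   3 03/22/12 14:58:01 hdfs/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM (aes256-cts-hmac-sha1-96)
[esammer@hadoop01 ~]$ kvno hdfs/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM
hdfs/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM: kvno = 3

If the two version numbers differ, the keytab no longer matches the key in the KDC and the daemon will fail to authenticate.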
Users performing HDFS operations and running MapReduce jobs also must authenticate prior to those operations being allowed (or, technically, checked for authorization). When an application uses the Hadoop library to communicate with one of the services and is running in secure mode, the identity of the user to Hadoop is the primary component of the Kerberos principal. This is different from simple mode, where the effective uid of the process is the identity of the user. Additionally, the tasks of a MapReduce job execute as the authenticated user that submitted the job. What this means is that, in secure mode, each user must have a principal in the KDC database and a user account on every machine in the cluster. See Table 6-2.
Table 6-2. Comparison of secure and simple mode identity
 | Simple | Secure |
---|---|---|
Identity comes from: | Effective uid of client process | Kerberos principal |
MapReduce tasks run as: | Tasktracker user (e.g., mapred) | Kerberos principal |
The requirement that all users have a principal can complicate otherwise simple tasks. For instance, assuming the HDFS super user is hdfs, it would normally be possible to perform administrative activities using sudo as in Example 6-3.
Example 6-3. Performing HDFS administrative commands with sudo
# Creating a new user's home directory in HDFS. Since /user is owned
# by user hdfs, it is necessary to become that user or the super user (which
# also happens to be hdfs).
[esammer@hadoop01 ~]$ sudo -u hdfs hadoop fs -mkdir /user/jane
Unfortunately, this doesn’t work in secure mode because the uid of the process doesn’t make us hdfs. Instead, it is necessary to authenticate as user hdfs with Kerberos. This is normally done using kinit, as we saw earlier, but it has the unpleasant side effect of requiring that we share the password for the HDFS principal. Rather than share the HDFS principal password with all the cluster administrators, we can export the HDFS principal key to a keytab protected by restrictive filesystem permissions, and then use sudo to allow selected users to access it when they authenticate with kinit. HDFS also supports the notion of a super group whose members can perform administrative commands as themselves.
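A sketch of that keytab-based approach follows. It assumes the hdfs service keytab has already been placed at /etc/hadoop/conf/hdfs.keytab (owned by hdfs, mode 0400) as described later in this chapter, and that the administrator has been granted sudo access to run commands as the hdfs user; the realm and paths are placeholders.

# Obtain a ticket for the hdfs principal from its keytab (no password is
# shared), then perform the administrative operation as user hdfs.
[esammer@hadoop01 ~]$ sudo -u hdfs kinit -kt /etc/hadoop/conf/hdfs.keytab \
    hdfs/$(hostname -f)@MYREALM.MYCOMPANY.COM
[esammer@hadoop01 ~]$ sudo -u hdfs hadoop fs -mkdir /user/jane
[esammer@hadoop01 ~]$ sudo -u hdfs hadoop fs -chown jane /user/jane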
Running tasks as the user who submitted the MapReduce job solves a few potential problems. The first is that, if we were to allow all tasks to run as user mapred, each map task would produce its intermediate output as the same user. A malicious user would be able to simply scan through the directories specified by mapred.local.dir and read or modify the output of another, unrelated task. This lack of isolation is a non-starter for security-sensitive deployments.
Since the tasktracker runs as an unprivileged user (user mapred by default in the case of CDH, and whatever user the administrator configures in Apache Hadoop), it isn’t possible for it to launch task JVMs as a different user. One way to solve this problem is to simply run the tasktracker process as root. While this would solve the immediate problem of permissions, any vulnerability in the tasktracker would open the entire system to compromise. Worse, since the tasktracker’s job is to execute user-supplied code as a user indicated by the jobtracker, an attacker would trivially have full control over all worker nodes. Instead of running the tasktracker as root, when operating in secure mode, the tasktracker relies on a small setuid executable called the task-controller.
The task-controller is a standalone binary implemented in C that sanity-checks its environment and immediately drops privileges to the proper user before launching the task JVM. Configured by a small key-value configuration file called taskcontroller.cfg in the Hadoop configuration directory, the task-controller is restricted to executing tasks for users with a uid above a certain value (as privileged accounts usually have low-numbered uids). Specific users can also be explicitly prevented from running tasks, regardless of their uid, which is useful for preventing the Hadoop daemon users from executing tasks. For the task-controller to execute tasks as the user who submitted the job, each user must have an account on all machines of the cluster. Administrators are expected to maintain these accounts, and because of the potentially large number of machines to keep in sync, admins are encouraged to either use centralized account management such as LDAP or an automated system to keep password files up-to-date.
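Account drift can be spot-checked with a small script. The sketch below assumes passwordless SSH to every worker and a file of fully qualified hostnames, one per line (like the hostnames.txt built later in this chapter); it prints the uid of a given user on each node so that missing accounts or mismatched uids stand out.

#!/bin/sh
# Usage: ./check-uid.sh username
# Prints the uid of the given user on every host listed in hostnames.txt.
user=${1:?usage: $0 username}
for host in $(cat hostnames.txt); do
  printf '%s: ' "${host}"
  ssh -o BatchMode=yes "${host}" id -u "${user}" 2>/dev/null || echo "MISSING"
done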
Configuring Hadoop to operate in secure mode can be a daunting task with a number of external dependencies. Detailed knowledge of Linux, Kerberos, SSL/TLS, and JVM security constructs is required. At the time of this writing, there are also some known gotchas in certain Linux distributions and versions of the JVM that can cause you grief. Some of these are called out below.
The high-level process for enabling security is as follows.
Audit all services to ensure enabling security will not break anything.
Hadoop security is all or nothing; enabling it will prevent all non-Kerberos authenticated communication. It is absolutely critical that you first take an inventory of all existing processes, both automated and otherwise, and decide how each will work once security is enabled. Don’t forget about administrative scripts and tools!
Configure a working non-security enabled Hadoop cluster.
Before embarking on enabling Hadoop’s security features, get a simple mode cluster up and running. You’ll want to iron out any kinks in DNS resolution, network connectivity, and simple misconfiguration early. Debugging network connectivity issues and supported encryption algorithms within the Kerberos KDC at the same time is not a position that you want to find yourself in.
Configure a working Kerberos environment.
Basic Kerberos operations such as authenticating and receiving a ticket-granting ticket from the KDC should work before you continue. You are strongly encouraged to use MIT Kerberos with Hadoop; it is, by far, the most widely tested. If you have existing Kerberos infrastructure (such as that provided by Microsoft Active Directory) that you wish to authenticate against, it is recommended that you configure a local MIT KDC with one-way cross-realm trust, so that Hadoop daemon principals exist in the MIT KDC and user authentication requests are forwarded to Active Directory. This is usually far safer, as large Hadoop clusters can accidentally create distributed denial-of-service attacks against shared infrastructure when they become active.
Ensure host name resolution is sane.
As discussed earlier, each Hadoop daemon has its own principal that it must know in order to authenticate. Since the hostname of the machine is part of the principal, all hostnames must be consistent and known at the time the principals are created. Once the principals are created, the hostnames may not be changed without recreating all of the principals! It is common that administrators run dedicated, caching-only, DNS name servers for large clusters.
Create Hadoop Kerberos principals.
Each daemon on each host of the cluster requires a distinct Kerberos principal when enabling security. Additionally, the Web user interfaces must also be given principals before they will function correctly. Just as the first point says, security is all or nothing.
Export principal keys to keytabs and distribute them to the proper cluster nodes.
With principals generated in the KDC, each key must be exported to a keytab, and copied to the proper host securely. Doing this by hand is incredibly laborious for even small clusters and, as a result, should be scripted.
Update Hadoop configuration files.
With all the principals generated and in their proper places, the Hadoop configuration files are then updated to enable security. The full list of configuration properties related to security is described later.
Restart all services.
To activate the configuration changes, all daemons must be restarted. The first time security is configured, it usually makes sense to start the first few daemons to make sure they authenticate correctly and are using the proper credentials before firing up the rest of the cluster.
Test!
It’s probably clear by now that enabling security is complex and requires a fair bit of effort. The truly difficult part of configuring a security environment is testing that everything is working correctly. It can be particularly difficult on a large production cluster with existing jobs to verify that everything is functioning properly, but no assumptions should be made. Kerberos does not, by definition, afford leniency to misconfigured clients.
Creating principals for each of the Hadoop daemons and distributing their respective keytabs is the most tedious part of enabling Hadoop security. Doing this for each daemon by hand would be rather error prone, so instead, we’ll create a file of host names and use a script to execute the proper commands. These examples assume MIT Kerberos 1.9 on CentOS 6.2.[16]
First, build a list of fully qualified hostnames, either by exporting them from an inventory system or generating them based on a well-known naming convention. For example, if all hosts follow the naming convention of hadoopN.mycompany.com, where N is a zero-padded sequential number, a simple shell script will do:
[esammer@hadoop01 ~]$ for n in $(seq -f "%02g" 1 10) ; do
echo "hadoop${n}.mycompany.com"
done > hostnames.txt
[esammer@hadoop01 ~]$ cat hostnames.txt
hadoop01.mycompany.com
hadoop02.mycompany.com
hadoop03.mycompany.com
hadoop04.mycompany.com
hadoop05.mycompany.com
hadoop06.mycompany.com
hadoop07.mycompany.com
hadoop08.mycompany.com
hadoop09.mycompany.com
hadoop10.mycompany.com
Using our host list as input, we can write a script to create the necessary principals, export the keys to keytabs, and bucket them by machine name.
This script will regenerate keys of any existing principals of the same name, which will invalidate any existing keytabs or passwords. Always measure twice and cut once when running scripts that affect the KDC!
#!/bin/sh

[ -r "hostnames.txt" ] || {
  echo "File hostnames.txt doesn't exist or isn't readable."
  exit 1
}

# Set this to the name of your Kerberos realm.
krb_realm=MYREALM.MYCOMPANY.COM

for name in $(cat hostnames.txt); do
  install -o root -g root -m 0700 -d ${name}

  kadmin.local <<EOF
addprinc -randkey host/${name}@${krb_realm}
addprinc -randkey hdfs/${name}@${krb_realm}
addprinc -randkey mapred/${name}@${krb_realm}
ktadd -k ${name}/hdfs.keytab -norandkey \
  hdfs/${name}@${krb_realm} host/${name}@${krb_realm}
ktadd -k ${name}/mapred.keytab -norandkey \
  mapred/${name}@${krb_realm} host/${name}@${krb_realm}
EOF

done
This script relies on a properly configured Kerberos KDC and assumes it is being run on the same machine as the KDC database. It also assumes /etc/krb5.conf is correctly configured and that the current user, root, has privileges to write to the KDC database files. It’s also important to use the -norandkey option to ktadd; otherwise, each time you export the key, it changes, invalidating all previously created keytabs containing that key. Also tricky is that the -norandkey option to ktadd works only when using kadmin.local (rather than kadmin). This is because kadmin.local never transports the key over the network, since it works on the local KDC database. If you are not using MIT Kerberos, consult your vendor’s documentation to ensure keys are protected at all times.
You should now have a directory for each hostname, each of which contains two keytab files: one named hdfs.keytab and one named mapred.keytab. Each keytab contains its respective service principal (for example, hdfs/hostname@realm) and a copy of the host principal’s key. Next, using a secure copy utility like scp or rsync tunnelled over ssh, copy the keytabs to the proper machines and place them in the Hadoop configuration directory. The owner of the hdfs.keytab file must be the user the namenode, secondary namenode, and datanodes run as, whereas the mapred.keytab file must be owned by the user the jobtracker and tasktrackers run as. Keytab files must be protected at all times and, as such, should have the permissions 0400 (owning user read-only).
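Distribution can be scripted in the same spirit as the principal-creation script. The following sketch assumes the per-host directories produced by that script, root SSH access to each node, and CDH-style hdfs and mapred daemon users; adjust the configuration directory, users, and copy mechanism to match your environment.

#!/bin/sh
# Copy each host's keytabs into its Hadoop configuration directory and set
# restrictive ownership and permissions (0400, owned by the daemon user).
conf_dir=/etc/hadoop/conf

for host in $(cat hostnames.txt); do
  scp ${host}/hdfs.keytab ${host}/mapred.keytab root@${host}:${conf_dir}/
  ssh root@${host} "chown hdfs ${conf_dir}/hdfs.keytab && \
    chown mapred ${conf_dir}/mapred.keytab && \
    chmod 0400 ${conf_dir}/hdfs.keytab ${conf_dir}/mapred.keytab"
done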
With the keytabs distributed to the proper machines, the next step is to update the Hadoop configuration files to enable secure mode. First, Kerberos security is enabled in core-site.xml.
hadoop.security.authentication

The hadoop.security.authentication parameter defines the authentication mechanism to use within Hadoop. By default, it is set to simple, which simply trusts that the client is who they claim to be, whereas setting it to the string kerberos enables Kerberos support. In the future, other authentication schemes may be supported, but at the time of this writing, these are the only two valid options.

Example value: kerberos
hadoop.security.authorization

Enabling hadoop.security.authorization causes Hadoop to authorize the client when it makes remote procedure calls to a server. The access control lists that affect these permissions are configured via the hadoop-policy.xml file and allow per-service level control. For instance, it is possible to permit only users placed in the mapred-admin Linux group to invoke APIs that are part of the administration service (the security.admin.operations.protocol.acl policy). When enabling security, this feature should be enabled as well, and meaningful ACLs configured.

Example: true
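As a sketch of what a service-level ACL in hadoop-policy.xml looks like, the following fragment restricts the administration APIs to members of a hypothetical mapred-admin group; the leading space in the value denotes an empty user list, following the ACL format described later in this chapter.

<property>
  <name>security.admin.operations.protocol.acl</name>
  <!-- No individual users; only members of the mapred-admin group. -->
  <value> mapred-admin</value>
</property>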
Example 6-4 shows core-site.xml with the proper security properties set.
Example 6-4. Updating core-site.xml to enable Hadoop security
<?xml version="1.0"?>
<configuration>

  <!-- Add these properties to the existing core-site.xml configuration. -->

  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>

</configuration>
Next, hdfs-site.xml must be configured so HDFS knows the principals to use and the location of the keytabs. Hadoop datanodes will also refuse to start in secure mode unless the data transceiver port is below 1024 (a privileged port) so that must also be changed.
dfs.block.access.token.enable

Block access tokens are temporary keys that allow an HDFS block to be read, written, or deleted, among other internal actions, by an authenticated user. This mechanism allows Hadoop to ensure that only the intended users are able to access data in HDFS. While disabled (false) by default, this parameter should be enabled (set to true) in a secure deployment.

Example: true
dfs.namenode.keytab.file

The dfs.namenode.keytab.file parameter specifies the location of the keytab that contains the Kerberos principal key for the namenode. This is the file uploaded to each host and, by convention, is placed in the Hadoop configuration directory.

Example: /etc/hadoop/conf/hdfs.keytab
dfs.namenode.kerberos.principal

The Kerberos principal the namenode should use to authenticate. The key for this principal must exist in the keytab specified by dfs.namenode.keytab.file. The special token _HOST can be used for the instance portion of the principal, in which case the fully qualified domain name will be interpolated. Note that the _HOST token cannot be used anywhere else in the principal.

Example: hdfs/_HOST@MYREALM.MYCOMPANY.COM
dfs.namenode.kerberos.https.principal

Similar to dfs.namenode.kerberos.principal (see previous entry), this parameter specifies the Kerberos principal that should be used by the embedded HTTPS server. The key for this principal must also be in the keytab specified by dfs.namenode.keytab.file. Note that the primary component of this principal must be host, as shown in the example.

Example: host/_HOST@MYREALM.MYCOMPANY.COM
dfs.https.address
The hostname or IP address on which the embedded HTTPS server should be bound. It is valid to specify the wild card IP 0.0.0.0 to indicate the HTTPS server should listen on all interfaces.
Example: 0.0.0.0
dfs.https.port
The port on which the embedded HTTPS server should listen for requests.
Example: 50470
dfs.datanode.keytab.file

Exactly like dfs.namenode.keytab.file, the dfs.datanode.keytab.file parameter specifies the keytab file containing the principal keys used by the datanode process. This can be, and usually is, the same file as dfs.namenode.keytab.file.

Example: /etc/hadoop/conf/hdfs.keytab
dfs.datanode.kerberos.principal

The Kerberos principal the datanode should use to authenticate. The key for this principal must exist in the keytab specified by dfs.datanode.keytab.file. The special token _HOST can be used for the instance portion of the principal, in which case the fully qualified domain name will be interpolated. Note that the _HOST token cannot be used anywhere else in the principal. This is commonly the same principal as dfs.namenode.kerberos.principal.

Example: hdfs/_HOST@MYREALM.MYCOMPANY.COM
dfs.datanode.kerberos.https.principal

Similar to dfs.datanode.kerberos.principal, this parameter specifies the Kerberos principal that should be used by the embedded HTTPS server. The key for this principal must also be in the keytab specified by dfs.datanode.keytab.file.

Example: host/_HOST@MYREALM.MYCOMPANY.COM
dfs.datanode.address
The hostname or IP address and port, separated by a colon, on which the data transceiver RPC server should be bound. It is valid to specify the wild card IP 0.0.0.0 to indicate the server should listen on all interfaces. With security enabled, this port must be below 1024 or the datanode will not start.
Example: 0.0.0.0:1004
dfs.datanode.http.address
The hostname or IP address and port, separated by a colon, on which the embedded HTTP server should be bound. It is valid to specify the wild card IP 0.0.0.0 to indicate the HTTP server should listen on all interfaces.
Example: 0.0.0.0:1006
dfs.datanode.data.dir.perm

When security is enabled, Hadoop performs extra checks to ensure HDFS block data cannot be read by unauthorized users. One of these checks involves making sure the directories specified by dfs.data.dir are set to restrictive permissions. This prevents user code from simply opening and reading block data directly from the local disk rather than using the HDFS APIs, which require a valid Kerberos ticket and perform authorization checks on the file. If the permissions are incorrect, the datanode will change the permissions to the value specified by this parameter.

HDFS supports a feature called local read short-circuiting (as implemented by HDFS-2246) in which a client application running on the same machine as the datanode can completely bypass the datanode server and read block files directly from the local filesystem. This can dramatically increase the speed of read operations, but at the cost of opening access to the underlying block data of all blocks to the client. When this feature is enabled, clients must be running as the same user as the datanode or be in a group that has read access to block data. Both scenarios break some of the invariants assumed by the security model and can inadvertently expose data to malicious applications. Take great care when enabling this feature on a secure cluster or setting dfs.datanode.data.dir.perm to anything other than 0700.

Example: 0700
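A quick way to confirm that the block directories are locked down as expected is to inspect them directly on a worker node; the paths below are placeholders for whatever dfs.data.dir is set to in your deployment, and the expected mode is drwx------ owned by the HDFS user.

[esammer@hadoop01 ~]$ ls -ld /data/1/hadoop/dfs/dn /data/2/hadoop/dfs/dn
drwx------ 5 hdfs hdfs 4096 Mar 22 15:40 /data/1/hadoop/dfs/dn
drwx------ 5 hdfs hdfs 4096 Mar 22 15:40 /data/2/hadoop/dfs/dn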
See Example 6-5 for a sample hdfs-site.xml file configured for security.
Example 6-5. Updating hdfs-site.xml to enable Hadoop security
<?xml version="1.0"?>
<configuration>

  <property>
    <name>dfs.block.access.token.enable</name>
    <value>true</value>
  </property>

  <!-- NameNode security config -->
  <property>
    <name>dfs.namenode.keytab.file</name>
    <value>hdfs.keytab</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>hdfs/_HOST@MYREALM.MYCOMPANY.COM</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.https.principal</name>
    <value>host/_HOST@MYREALM.MYCOMPANY.COM</value>
  </property>

  <!-- DataNode security config -->
  <property>
    <name>dfs.datanode.keytab.file</name>
    <value>hdfs.keytab</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.principal</name>
    <value>hdfs/_HOST@MYREALM.MYCOMPANY.COM</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.https.principal</name>
    <value>host/_HOST@MYREALM.MYCOMPANY.COM</value>
  </property>

</configuration>
Lastly, mapred-site.xml must be configured; in many ways, it has similar parameters to hdfs-site.xml.
mapreduce.jobtracker.kerberos.principal

Just like dfs.namenode.kerberos.principal, this specifies the Kerberos principal the jobtracker uses during authentication. Again, the key for this principal must be in the keytab specified by mapreduce.jobtracker.keytab.file. The _HOST token may be used in the instance part of the principal, in which case it will be interpolated with the fully qualified domain name of the machine on which the jobtracker is running.

Example: mapred/_HOST@MYREALM.MYCOMPANY.COM
mapreduce.jobtracker.kerberos.https.principal
This parameter specifies the Kerberos principal for the embedded HTTPS server, just like its HDFS siblings.
Example: host/_HOST@MYREALM.MYCOMPANY.COM
mapreduce.jobtracker.keytab.file

The keytab file containing the jobtracker principal key. See dfs.namenode.keytab.file for more information.

Example: /etc/hadoop/conf/mapred.keytab
mapreduce.tasktracker.kerberos.principal
The Kerberos principal used by the tasktracker. See mapreduce.jobtracker.kerberos.principal for more information.
Example: mapred/_HOST@MYREALM.MYCOMPANY.COM
mapreduce.tasktracker.kerberos.https.principal
The Kerberos principal used by the embedded HTTPS server. See mapreduce.jobtracker.kerberos.https.principal for more information.
Example: host/_HOST@MYREALM.MYCOMPANY.COM
mapreduce.tasktracker.keytab.file
The keytab file containing the tasktracker principal key. See mapreduce.jobtracker.keytab.file
for more information.
Example: /etc/hadoop/conf/mapred.keytab
mapred.task.tracker.task-controller

As we’ve seen, the tasktracker has a few strategies for launching child task JVMs. In simple security mode, the tasks are launched as the same user as the tasktracker itself, while in secure mode, the setuid task-controller executable is used. The tasktracker uses a Java plug-in that controls which strategy is used, which is defined by this parameter. By default, mapred.task.tracker.task-controller specifies the class name org.apache.hadoop.mapred.DefaultTaskController, which simply executes the child task with the same permissions as the parent tasktracker. In the case of a secure cluster, this should instead be set to org.apache.hadoop.mapred.LinuxTaskController, which is the implementation that knows about the setuid task-controller executable.

Example: org.apache.hadoop.mapred.LinuxTaskController
mapreduce.tasktracker.group

In secure mode, the setuid task-controller, after sanitizing its environment, changes its effective user id and group id to the user running the MapReduce job and a specified group, respectively. This parameter specifies the group that should be used when this occurs. It’s important that the effective group be controlled so the user task has the proper permissions when it creates files (for example, in the Hadoop log directory). This value should always match the effective group of the tasktracker itself. If it doesn’t, the task-controller will exit with a non-zero status. This isn’t important in simple security mode because the task-controller isn’t used.

Example: mapred
The task-controller’s configuration is separate from the three primary XML files and is required for it to function properly. Located in the Hadoop configuration directory, taskcontroller.cfg is a plain-text file that works very much like a Java property file: one key-value pair per line, separated by an equals sign. Unlike a property file, however, whitespace is not permitted between the equals sign and its key or value. See Example 6-6 for a sample taskcontroller.cfg. The following parameters are supported.
mapred.local.dir (required)

A comma-separated list of directories used for temporary data during MapReduce jobs. This should be the same list as specified by mapred.local.dir in the mapred-site.xml file. See mapred.local.dir for more information.
hadoop.log.dir (required)

The directory in which log data should be written. This should be the same path as specified by HADOOP_LOG_DIR in the hadoop-env.sh file. See Logging Configuration for more information.
mapreduce.tasktracker.group (required)

The group that the tasktracker process runs as. In CDH, this is mapred; in Apache Hadoop, it is whatever group the administrator chooses. This group should not contain any users other than the user the tasktracker runs as. If it does, those users will be able to impersonate any user with a uid greater than what is specified in min.user.id.
min.user.id (required)

The minimum uid of a user allowed to run tasks. If a user with a uid below this value submits a MapReduce job, the task-controller will refuse to execute the tasks, exiting with a non-zero status code, which will cause all tasks to fail. The default value is 1000 in CDH, while Apache Hadoop has no default.
banned.users

Specific users can be banned from executing tasks on the cluster. By default, CDH includes users mapred, hdfs, and bin in this list, whereas Apache Hadoop leaves it empty. At a minimum, you should always include user hdfs. By doing so, you can avoid the potential attack vector where a task running as hdfs could read or modify HDFS blocks stored on the local disks, circumventing the permission checks performed by the namenode.
Example 6-6. Sample taskcontroller.cfg
mapred.local.dir=/data/1/hadoop/mapred/local,/data/2/hadoop/mapred/local
hadoop.log.dir=/var/log/hadoop
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=1000
Finally, after distributing the changed configuration files to all hosts in the cluster, you are now ready to restart all daemons and begin testing. It is strongly recommended that you start with the namenode and a single datanode first to make sure everything is in working order. Watch the log files for errors, especially those related to authentication, and resolve any issues you find. Given the complexity of setting up security, it is common to miss a step the first time and have a little debugging to do. Take it slow, one daemon at a time, and for production clusters, always have a rollback plan.
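A minimal smoke test, assuming a user principal already exists, is to confirm that HDFS access fails without a Kerberos ticket and succeeds with one; the realm below is a placeholder and the error text is abbreviated.

# Without a ticket, the request should be rejected with a GSS/SASL error
# similar to the one shown in Example 6-1.
[esammer@hadoop01 ~]$ kdestroy
[esammer@hadoop01 ~]$ hadoop fs -ls /
... GSSException: No valid credentials provided ...

# With a ticket, the same command should succeed.
[esammer@hadoop01 ~]$ kinit
Password for esammer@MYREALM.MYCOMPANY.COM:
[esammer@hadoop01 ~]$ hadoop fs -ls /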
So far, we’ve discussed only how clients identify themselves and how Hadoop authenticates them. Once a client is authenticated, though, that client is still subject to authorization when it attempts to perform an action. The actions that can be performed vary from service to service. An action in the context of HDFS, for example, may be reading a file, creating a directory, or renaming a filesystem object. MapReduce actions, on the other hand, could be submitting or killing a job. Evaluating whether or not a user is permitted to perform a specific action is the process of authorization.
Every filesystem operation in HDFS is subject to authorization. In an effort to exploit existing knowledge, HDFS uses the same authorization model as most POSIX filesystems. Each filesystem object (such as a file or directory) has three classes of user: an owner, a group, and “other,” which indicates anyone who isn’t in one of the two previous classes. The available permissions that can be granted to each of the three classes on an object are read, write, and execute, just as with Linux or other Unix-like systems. For example, it is possible to grant the owner of a file both read and write privileges. These permissions are represented by a single octal (base-8) integer that is calculated by summing permission values (see Table 6-3).
Using our previous example, to indicate that the file owner has both read and write privileges, we would sum the read permission value of 4 and the write permission value of 2, giving us 6, which represents the combination of the two privileges. This representation is always unambiguous; that is, it is impossible for two combinations to yield the same sum. The special value of 0 (zero) means that a user has no permissions. Note that while it is not illegal for a file to be executable in HDFS, it is not meaningful. Directories, on the other hand, use the execute permission to indicate that a user may access their contents and metadata if they already know the name of the contained file or directory; the read permission is additionally required to list the children’s names.
To represent the three classes of user—owner, group, and other—we use three integers, one for each class, in that order. For instance, a file that allows the owner to read and write, the group to read, and other users to read, would have the permissions 644; the 6 indicates read and write for the owner (4 + 2), whereas the subsequent fields are both 4, indicating only the read permission is available. To indicate the owner of a directory has read, write, and execute permissions, but no one else has access, the permissions 700 would be used.
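For example, the two modes just described can be applied and inspected with the hadoop fs -chmod and -ls commands; the paths and listing details below are illustrative.

[esammer@hadoop01 ~]$ hadoop fs -chmod 644 /user/esammer/data.txt
[esammer@hadoop01 ~]$ hadoop fs -chmod 700 /user/esammer/private
[esammer@hadoop01 ~]$ hadoop fs -ls /user/esammer
Found 2 items
-rw-r--r--   3 esammer hadoop    1048576 2012-03-22 15:40 /user/esammer/data.txt
drwx------   - esammer hadoop          0 2012-03-22 15:40 /user/esammer/private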
In addition to the above permissions, three other special permissions exist: setuid, setgid, and sticky mode (or “the sticky bit”). The setuid permission in Linux changes the process effective user id to that of the file owner when the file is executed. Since files in HDFS cannot be executed, granting this means nothing. The setgid permission, when set on a directory in Linux, forces the group of all immediate child files and directories to that of the parent directory. However, this is the default behavior in HDFS, so there’s no need to explicitly set the setgid permission on a directory. Like setuid, there is no meaning to setting setgid on a file in HDFS because files cannot be executed. The final permission is called the sticky bit and, when set on a directory, means that only the owner of a file in that directory may delete or rename the file, even if another user would otherwise be allowed to do so (as granted by the write permission on the directory itself). The exception to this rule is that the HDFS super user and the owner of the directory always have the ability to perform these actions. This is exactly the behavior one desires for temporary directories where all users need to be able to write data, but only the owner of that data should be able to remove or change it. These permissions are granted by using an optional fourth octal number placed to the left of the standard set of three. Just like the read, write, and execute permissions, setuid, setgid, and sticky bit each have a designated value that may be summed to grant multiple permissions. (See Table 6-4.)
A value of 0 (zero) removes these permissions, if they are set. Example 6-7 shows typical HDFS interaction with Kerberos authentication and permissions set.
Example 6-7. Using the sticky bit permission in HDFS
# Authenticate as user esammer
[esammer@hadoop01 ~]$ kinit esammer
Password for esammer@PE.HADOOP.CLOUDERA.COM:

# Create a directory called test
[esammer@hadoop01 ~]$ hadoop fs -mkdir /user/esammer/test

# Change the permissions to enable the sticky bit and set:
#   owner: read, write, execute
#   group: read, write, execute
#   other: read, write, execute
[esammer@hadoop01 ~]$ hadoop fs -chmod 1777 /user/esammer/test

# Create an empty file as esammer
[esammer@hadoop01 ~]$ hadoop fs -touchz /user/esammer/test/foo

# Authenticate as user esammer2
[esammer@hadoop01 ~]$ kinit esammer2
Password for esammer2@PE.HADOOP.CLOUDERA.COM:

# Attempt to remove the file foo.
[esammer@hadoop01 ~]$ hadoop fs -rmr /user/esammer/test/foo
rmr: org.apache.hadoop.security.AccessControlException: Permission denied by sticky
bit setting: user=esammer2, inode="/user/esammer/test/foo":esammer:hadoop:-rw-r--r--

# Oops. The sticky bit stopped us from removing the file. Let's create a file of
# our own called bar.
[esammer@hadoop01 ~]$ hadoop fs -touchz /user/esammer/test/bar

# Switch back to user esammer.
[esammer@hadoop01 ~]$ kinit esammer
Password for esammer@PE.HADOOP.CLOUDERA.COM:

# We can see both files...
[esammer@hadoop01 ~]$ hadoop fs -ls /user/esammer/test/
Found 2 items
-rw-r--r--   3 esammer2 hadoop          0 2012-04-25 13:53 /user/esammer/test/bar
-rw-r--r--   3 esammer  hadoop          0 2012-04-25 13:52 /user/esammer/test/foo

# But because user esammer owns the directory, we can delete esammer2's file!
[esammer@hadoop01 ~]$ hadoop fs -rmr /user/esammer/test/bar
Deleted hdfs://hadoop01.cloudera.com/user/esammer/test/bar
Hadoop MapReduce, like HDFS, has a few different classes of users (four, to be exact).
The cluster owner is the OS user that started the cluster. In other words, this is the user the jobtracker daemon is running as, normally user hadoop for Apache Hadoop and mapred for CDH. Like the HDFS super user, the MapReduce cluster owner is granted all permissions implicitly and should be used rarely, if ever, by administrators.
One or more users may be specified as cluster administrators. These users have all of the same powers as the cluster owner, but do not need to be able to authenticate as the Linux user that started the cluster. Granting users this power allows them to perform administrative operations while still retaining the ability to audit their actions individually, rather than having them use a shared account. Use of a shared account is also discouraged as it creates the need to share authentication credentials.
When a job is submitted to the jobtracker, the user specifies a queue. The queue has an access control list (ACL) associated with it that defines which users and groups may submit jobs, but also which users may administer the queue. Administrative actions may be changing the priority of a job, killing tasks, or even killing entire jobs.
Finally, the job owner is the user who submitted a job. A job owner always has the ability to perform administrative actions on their own jobs.
Defining the cluster owner is a side effect of simply starting the jobtracker, so there’s no explicit configuration of this user, per se. Cluster and queue administrators, on the other hand, are defined by the mapred-site.xml and mapred-queue-acls.xml files, respectively. Since these files define access control, it’s important to make sure the files and their parent directory are writable only by the cluster owner or root. If users will be submitting jobs on the same machine as the jobtracker, the files and their parent directory must be world-readable to allow users to submit jobs, because they contain information used by the client as well as the server.
An access control list in Hadoop is a space-separated pair of comma-separated lists. The first list contains the users that are allowed to perform a given action; the second is the list of groups that are allowed to perform that action. It is possible for either or both of the lists to be empty, in which case no users or groups, respectively, may perform the action. If the ACL is simply an asterisk, everyone is allowed to perform the action. The table in Example 6-8 shows a few examples of access control lists for services.
Example 6-8. Sample Hadoop access control lists and their meanings
ACL | Description |
---|---|
“*” | Everyone is permitted. |
" " (a single space) | No one is permitted. |
“user1,user2 group1,group2,group3” | Users 1 and 2, and groups 1, 2, and 3 are permitted. |
“user1 " (a user followed by a single space) | User 1 is permitted, but no groups. |
" group1” (a single space followed by a group) | Group 1 is permitted, but no users. |
The quotes are not part of the ACL and should not be typed. They are shown here to highlight the space character.
By default, Hadoop MapReduce is wide open; any user may submit a job to any queue, and all users may perform administrative actions. For single-purpose clusters in a trusted environment, this is fine, but for multitenant clusters, this doesn’t work. There are a number of parameters and extra files that need to be configured to define and enable queue ACLs. The first step is to modify mapred-site.xml to enable ACLs, define cluster administrators, and specify the queues that should exist.
mapred.acls.enabled
Access control lists must be globally enabled prior to use. Hadoop does not enforce ACLs, even if they’re defined, if this parameter is not set to true.
Example: true. Default: false
mapred.cluster.administrators
An ACL that defines users and groups that are cluster administrators.
Example: “esammer mradmins,ops” (without quotes). Default: undefined.
mapred.queue.names

Before attaching ACLs to queues, all queues must be predefined. This is done by simply providing a comma-separated list of queue names to the mapred.queue.names parameter. There should be no spaces between names. Some scheduler plug-ins—notably the FIFO scheduler, which is the default—do not support multiple queues. Both the capacity scheduler and fair scheduler plug-ins support multiple queues.

Example: research,production-etl,adhoc. Default: default.
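Putting the three parameters together, a mapred-site.xml fragment enabling queue ACLs might look like the following sketch; the administrator and queue names are the illustrative values from the descriptions above and should be replaced with your own.

<!-- Add these properties to the existing mapred-site.xml configuration. -->
<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>
<property>
  <name>mapred.cluster.administrators</name>
  <value>esammer mradmins,ops</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>research,production-etl,adhoc</value>
</property>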
While the mapred-queue-acls.xml file is reloaded dynamically, changes to mapred-site.xml require daemons be restarted before changes take effect.
The next step is to define ACLs for the queues named in mapred.queue.names. This must be done in the mapred-queue-acls.xml file, also located in the Hadoop configuration directory. Like the primary three configuration files, mapred-queue-acls.xml follows the same format: an XML file with a top-level configuration element, which contains zero or more property elements, each of which is a name-value pair. Each MapReduce queue is configured with an ACL, as described above, in a separate configuration property where the name of the property is mapred.queue.queue-name.privilege and the value is the ACL. The queue-name should be one of the queue names from the mapred.queue.names property, and privilege is one of the available queue privileges (see Table 6-5).
Table 6-5. MapReduce queue privileges
Privilege | Description |
---|---|
acl-submit-job | An ACL that controls who may submit MapReduce jobs to a queue. |
acl-administer-jobs | An ACL that controls who may perform administrative actions on jobs in a queue. |
If, for example, you want to create a production-etl queue and grant user jane and anyone in the prod group access to submit jobs to it, production-etl must be listed in mapred.queue.names in mapred-site.xml, and mapred.queue.production-etl.acl-submit-job should be set to "jane prod" (without quotes) in mapred-queue-acls.xml, as in Example 6-9.
Example 6-9. Sample mapred-queue-acls.xml file
<?xml version="1.0"?>
<configuration>

  <!-- Allow user jane and group prod to submit jobs to the
       production-etl queue. -->
  <property>
    <name>mapred.queue.production-etl.acl-submit-job</name>
    <value>jane prod</value>
  </property>

  <!-- Allow no users, and groups prod and hadoop-admins, to administer
       jobs in the production-etl queue. Note the space after the opening
       value element. This means the user list is empty. The cluster owner,
       cluster admins, and job owner can always administer their jobs. -->
  <property>
    <name>mapred.queue.production-etl.acl-administer-jobs</name>
    <value> prod,hadoop-admins</value>
  </property>

  <!-- Many queues may be defined... -->

</configuration>
The queues described thus far are independent of the capacity and fair scheduler plug-ins, although they are related. None of the three scheduler plug-ins (including the default FIFO scheduler) provides access control over which jobs are submitted to which queues. If you need this kind of functionality, the expectation is that the queue names, as defined by mapred.queue.names, will match the queues configured in the scheduler plug-in. This allows job submission and queue ACLs to be implemented generically, while the order and machines on which tasks are executed are controlled by the plug-in. In Chapter 7, we’ll cover how each of the schedulers may be configured to suit different environments and use cases.
Many of the projects commonly used with Hadoop (the so-called ecosystem projects) rely simply on the HDFS and MapReduce access controls. This makes a lot of sense because some of these tools are just clients of the two Hadoop services. Pig, for instance, is a good example of this. A user writes a Pig script, which is interpreted on the client machine, logical and physical execution plans are created, and a series of MapReduce jobs are submitted to the cluster. The data accessed by the job is commonly a directory of files in HDFS and is therefore subject to the filesystem permissions of the user who submitted the job. In other words, there’s nothing else to which access must be controlled.
But that, of course, is not always the case. Other ecosystem projects do have additional information or higher level constructs that must be secured in multiuser environments. Hive is a good example of this. While also a client-side application, Hive stores additional metadata about the data in HDFS and presents a relational database style abstraction[17] to the end user. This presents an interesting case where, even if we were to restrict a user from accessing the data contained in HDFS, they would still be able to see the metadata—the table names, field names and types, and so forth—managed by Hive. In multitenant environments, this may not be acceptable. Disclosing information about users’ datasets to one another may be sufficient to reveal what they are doing or how. Because of this risk, each of the ecosystem projects, including Hive, is responsible for protecting any metadata it maintains.
Each ecosystem project has different controls for protecting metadata and must be configured independently. It isn’t feasible to cover all of the nuances of each of these projects in detail here; instead, this section provides references for further reading.
Because HiveQL bakes down into MapReduce jobs, operations are subject to all of the access controls of MapReduce queue ACLs and HDFS file permissions. Additionally, Hive provides RDBMS-style users, groups, and roles, as well as GRANT statements to control access to its so-called database objects. This enables fine-grained access control over higher-order constructs like tables, columns, or even views. Of course, this is enforced only at the time the HiveQL query is written and submitted; the files in HDFS have no knowledge of this metadata and, as a result, it can’t be enforced further down the stack. Remember that for MapReduce jobs to work, the user submitting the job must be able to access the files in HDFS. This creates an interesting problem where a user may not have permission to view the data in a column of a table from Hive’s perspective, but the user must have access to the complete files in HDFS, thereby allowing them to circumvent the object-level policy. We’ll see possible solutions to this problem in Tying It Together.
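As a brief illustration, the legacy Hive authorization syntax looks roughly like the following sketch (the role, user, and table names here are hypothetical, and exact syntax varies by Hive version):

CREATE ROLE prod_readers;
GRANT ROLE prod_readers TO USER jane;
-- Grant read access to a single table; object-level grants like this
-- are enforced by Hive, not by HDFS.
GRANT SELECT ON TABLE production_etl_events TO ROLE prod_readers;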
For more information about configuring Hive object-level access control, see http://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization.
HBase is a distributed, column-oriented, low-latency system for storing enormous tables, both in terms of rows and columns. HBase is a huge topic in its own right, but as of version 0.92.x (also included in CDH4), it has support for performing access control checks on each operation issued to a table. These operations include get (reading a record), put (storing a record), scan (reading a range of records), and delete, as well as the administrative or data definition operations like creating tables. HBase security is implemented as a coprocessor—a plug-in that, in many ways, works like a database trigger, firing authorization-checking code before each operation is performed.
Users and groups may be granted permission to perform any combination of the above operations at the table, column family, or column qualifier granularity. In other words, it is possible to restrict access to tables, groups of columns, or individual columns within a group.
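For example, from the HBase shell, grants at each granularity might look roughly like the following (the user, table, column family, and qualifier names are hypothetical, and exact shell syntax varies slightly between versions):

grant 'jane', 'RW', 'orders'                      # read/write on the entire table
grant 'jane', 'R',  'orders', 'summary'           # read-only on one column family
grant 'jane', 'R',  'orders', 'details', 'total'  # read-only on a single column qualifier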
For more information about configuring HBase security, see http://hbase.apache.org/book/security.html or https://ccp.cloudera.com/display/CDH4DOC/HBase+Security+Configuration for CDH users.
Oozie is a workflow system specifically built to work with Hadoop and MapReduce. Users (usually developers) write workflows in an XML language that define one or more MapReduce jobs, their interdependencies, and what to do in the case of failures. These workflows are uploaded to the Oozie server where they are scheduled to run or executed immediately. When Oozie executes a MapReduce job as part of a workflow, it is run by the Oozie server, which keeps track of job-level failures and status.
In a secure environment, the Oozie server must authenticate via Kerberos to submit MapReduce jobs to the cluster. It’s important, however, that these jobs execute as the user that created the workflow and not as the user running the Oozie server. In order to launch MapReduce jobs with the proper identity, each workflow defines the user and group with which the jobs should run. Additionally, in a secure environment, the value of the specified user and group are checked against the identity of the authenticated user submitting the workflow. In other words, if I—authenticated as user esammer—attempt to submit a workflow to Oozie that specifies user janedoe, Oozie will refuse to accept the workflow.
Oozie has a very simple security model based on the notion of users and groups. All users can see all workflows, but may modify only those owned by them or by a group to which they belong. Any number of users may be designated as administrators who, additionally, may modify any workflow and perform administrative functions. With its security features disabled, all users are effectively administrators in Oozie.
For more information about Oozie security, see http://incubator.apache.org/oozie/docs/3.1.3/docs/AG_Install.html or, for CDH, https://ccp.cloudera.com/display/CDH4DOC/Oozie+Security+Configuration.
Hue is a web-based, end user-facing application that can be used to interact with a Hadoop cluster. Out of the box, Hue comes with a job browser, job designer, filesystem browser, and Beeswax, a Hive user interface from which users can execute HiveQL queries. Like most web applications, Hue provides username and password authentication, as well as (Hue) application-level access control lists. Users are assigned to groups, and groups are given permission to launch specific Hue applications. Additionally, users may be marked as superusers, in which case they may launch all applications as well as administer users and privileges. As of version 2.x (included with CDH4), Hue supports synchronization with LDAP services (OpenLDAP and Active Directory, specifically) for both users and groups.
For more information about Hue configuration, see https://ccp.cloudera.com/display/CDH4DOC/Hue+Security+Configuration.
Sqoop is a tool that facilitates bidirectional exchange of data between HDFS and relational databases. With Sqoop, it’s possible to import the contents of a table, the result of a SQL query, or even an entire database, to files in HDFS so they may be used in MapReduce jobs. The inverse operation of exporting data residing in HDFS to tables in an RDBMS is also fully supported for files with a known schema. Sqoop does this by running a MapReduce job with a controlled number of map tasks, each making a number of parallel connections to a database and performing parallel SELECT or INSERT/UPDATE statements (for import to Hadoop and export from Hadoop, respectively). One of Sqoop’s more interesting features is its ability to accept vendor-specific plug-ins for high-performance import and export. These plug-ins take advantage of native features of the RDBMS for faster operations and drop in with no coding required by the Sqoop user. At this time, high-performance plug-ins are available for MySQL, PostgreSQL, Oracle, Teradata, Netezza, Vertica, and SQL Server.
Since Sqoop jobs run as MapReduce jobs, they can be controlled in all the same ways as any other MapReduce job. This includes managing HDFS permissions on files and directories to control access to data. The database credentials must be supplied with each invocation of the sqoop command-line tool.
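A basic import, for instance, might look roughly like the following (the connection string, username, table, and target directory are hypothetical); the -P flag prompts for the database password rather than placing it on the command line:

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_reader -P \
  --table orders \
  --target-dir /data/sales/orders \
  -m 4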
Frequently, users want to perform incremental imports or exports of data. Sqoop supports a number of strategies for determining which rows have changed since the last run, but all of them require maintaining some kind of state between runs. Users may decide to manually preserve this information somewhere, but for recurring jobs in production, this usually isn’t realistic. Another option is to use Sqoop’s saved job feature, which stores both connection credentials and incremental state information in a directory on the local filesystem of the client machine in what is called the metastore. Administrators and users can then execute Sqoop jobs by their saved job name rather than requiring all the command-line options upon each invocation. If saved jobs are used, the metastore directory must be protected using standard filesystem permissions as you would with other sensitive files on the system, especially if you decide to allow Sqoop to also store database passwords.
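A sketch of creating a saved incremental job and protecting the local metastore might look like this (the job, table, and column names are hypothetical; by default the metastore lives under the ~/.sqoop directory):

# Create a saved job that remembers its incremental state between runs.
sqoop job --create orders-incremental -- import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_reader \
  --table orders \
  --incremental append --check-column id --last-value 0

# Execute it by name, then lock down the metastore directory.
sqoop job --exec orders-incremental
chmod -R 700 ~/.sqoop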
To learn more about Sqoop, saved jobs, and managing Sqoop’s metastore, see the full documentation at http://sqoop.apache.org/docs/1.4.1-incubating/index.html or http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html for CDH users.
The Sqoop community is currently working on a proposal for Sqoop 2, which maintains the functionality of Sqoop, but realizes it as a persistent service rather than a command-line tool. For more information on Sqoop 2, be sure to check out https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2.
Apache Flume is a distributed, fault-tolerant, scale-out streaming data collection system commonly used for collecting log events. Unlike many of the other Hadoop ecosystem projects, Apache Flume does not strictly require Hadoop, although it complements HDFS very well. Flume uses the abstraction of sources, channels, and sinks to wire together data flows that transport data from one endpoint to another. Some of the more common sources are an Avro RPC source to which applications can speak directly, syslog, a raw netcat-style source, and a source that executes a command and reads from its standard output. Sources put data into channels, which act as queues that support different strategies affecting durability. Sinks take events from channels and, depending on the implementation, store them in HDFS, forward them to another Flume agent over the network, or insert them into HBase or any other data store. It is also simple to implement custom sources, channels, sinks, and other components within Flume for special use cases.
Flume’s role in the ecosystem is one of data ingestion, and for most users, this means writing to HDFS or HBase. Since data is written to these kinds of systems, it’s important that administrators be able to control the permissions on the resultant files. The permissions with which data is written are dictated by the configured sink and will vary from implementation to implementation. The HDFS sink, for example, uses the Hadoop library and, just as any other HDFS client, takes on the identity of the effective user ID of the process. Flume’s HDFS sink also fully supports Kerberos authentication using a keytab, in which case the principal becomes the owner of files in HDFS.
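A sketch of an HDFS sink configured for Kerberos might look like the following (the agent, channel, path, principal, and keytab names are hypothetical):

# flume.conf (excerpt)
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode.example.com/data/logs
agent1.sinks.hdfs1.hdfs.kerberosPrincipal = flume/collector01.example.com@HADOOP.EXAMPLE.COM
agent1.sinks.hdfs1.hdfs.kerberosKeytab = /etc/flume-ng/conf/flume.keytab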
Flume supports a wide array of data ingest configuration options. For more information, see http://flume.apache.org/FlumeUserGuide.html or http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html for CDH users.
Like Flume, Apache ZooKeeper doesn’t have a direct dependency on Hadoop, but a number of ecosystem projects do make use of it. ZooKeeper is a quorum-based, highly available, distributed coordination service used to build centralized configuration storage, naming, synchronization, leader election, and group membership services. Those building specialized distributed systems may find ZooKeeper appealing, but for many, it is seen as infrastructure for HBase and some features of HDFS.
ZooKeeper organizes data into ZNodes. Like files in a traditional filesystem, ZNodes can store arbitrary bytes, have access control lists associated with them, and can be organized hierarchically. In ZooKeeper, there is no distinction between a “file” and a “directory” ZNode; a single ZNode may have data (like a file) but also have child ZNodes (like a directory). Since ZNodes are very commonly used to store internal state of other applications and critical coordination information, they can be protected by ACLs, as mentioned earlier. In keeping with the filesystem metaphor, ZooKeeper offers a shell where administrators can execute the equivalent of the chmod and chown commands.
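From the ZooKeeper CLI (zkCli.sh), for example, inspecting and setting an ACL might look roughly like this (the znode path and identity are hypothetical; here the sasl scheme ties the ACL to a Kerberos-authenticated user):

getAcl /app/config
setAcl /app/config sasl:jane:cdrwa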
For more information about ZooKeeper, see http://zookeeper.apache.org/doc/r3.4.3/ or http://archive.cloudera.com/cdh4/cdh/4/zookeeper/ for CDH users.
Many of the Hadoop ecosystem projects are client-side tools or languages that bake down to MapReduce jobs. This class of projects, of which Pig, Cascading, and Crunch are all members, runs entirely on a client machine and maintains no external metadata. These projects can be seen as a facade for Hadoop MapReduce, simplifying the programming model and providing a higher level language in which to perform operations. Each with a slightly different way of interpreting the problem of making MapReduce accessible, they are fundamentally similar in their implementation. For Hadoop administrators, these projects are trivial to support as they provide no additional operational overhead beyond MapReduce, and by extension, no additional security constructs.
With an ever-growing ecosystem of projects forming around Hadoop, each with its own requirements, it’s hard to understand how these pieces come together to form a single data processing platform. Beyond simply making these tools function, administrators are responsible for ensuring a singular identification, authentication, and authorization scheme is applied consistently and in accordance with data handling policies. This is a significant challenge, to use a well-worn, albeit applicable, cliche. However, there are a few things one can do to reduce the pain of building a secure, shared Hadoop platform.
Do your homework and make an informed decision about whether or not you need Hadoop security. It is not as simple as setting hadoop.security.authentication to kerberos and getting on with life. The repercussions of Kerberizing a cluster are significant in that every client of the cluster must be able to properly handle authentication. It’s easy to say and difficult to make a reality. Can you trust clients to identify themselves truthfully? If not, you have your work cut out for you. Create a map of all access points to the cluster, understand which must support multiple users through a single process, and create a strategy to handle authentication. There’s no universal answer, although there are some established patterns you can use.
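For reference, the switch itself amounts to a pair of core-site.xml properties, shown here only as a sketch; a real deployment also involves per-daemon principals, keytabs, and many related settings:

<!-- core-site.xml (excerpt) -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>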
Enabling secure mode means that every (OS) user that submits a MapReduce job must have an account on every machine in the cluster. The account may be locked so that it isn’t possible to log in to the machine, but the account must exist and have a valid shell. One way to do this is by brute force: maintain user accounts by hand on each machine. This simply doesn’t scale and will cause trouble for you down the road. Instead, use a configuration management system like Puppet or Chef, use LDAP for centralized account management, or use a commercial tool to maintain accounts. You can also save yourself some heartache by ensuring users have the same UIDs on all machines.
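For example, a locked account with a fixed UID could be created like this on each node (the username and UID are hypothetical; in practice you would drive this from LDAP or a configuration management tool rather than by hand):

useradd -u 1450 -s /bin/bash -c "analytics batch user" jane
passwd -l jane   # lock the password; the account exists but cannot log in interactively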
Set up a local installation of MIT Kerberos and configure the Hadoop cluster as its own realm. Configure each machine in the cluster to use the realm you’ve chosen as its default realm (this will save you endless pain later, as it obviates the need to configure principal rewrite rules). If you have a global identity management system like Active Directory, configure a one-way cross-realm trust from the local KDC to the global system so users can successfully authenticate. While this seems odd at first, Hadoop has proven its ability to perform a distributed denial-of-service attack on corporate identity infrastructure simply by running a MapReduce job. Remember that every RPC requires communication with the KDC in secure mode, and with a large enough cluster running many tasks, it’s possible to create a noticeable impact on shared infrastructure (one such incident rendered users incapable of authenticating for email for a short time during an unusual spike in Hadoop activity).
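A minimal krb5.conf sketch for such a setup might look like the following (the realm and KDC hostnames are hypothetical; the cross-realm trust itself additionally requires matching krbtgt principals in both realms):

# /etc/krb5.conf (excerpt)
[libdefaults]
  default_realm = HADOOP.EXAMPLE.COM

[realms]
  HADOOP.EXAMPLE.COM = {
    kdc = kdc01.hadoop.example.com
    admin_server = kdc01.hadoop.example.com
  }

[domain_realm]
  .hadoop.example.com = HADOOP.EXAMPLE.COM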
Remember that the HDFS directory structure, like any other shared filesystem, is not just an organizational construct, but the way you enforce access to data. Think about applications and access, and design accordingly. Like a relational database schema, the filesystem directory structure can be difficult to change for production systems. This is an area where administrators and developers must work in tandem.
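As a simple sketch (the directory, user, and group names are hypothetical), a per-tenant area might be laid out with standard HDFS permissions:

hadoop fs -mkdir /data/production-etl
hadoop fs -chown jane:prod /data/production-etl
hadoop fs -chmod 750 /data/production-etl    # owner and group only; no access for others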
In scenarios where fine-grained access control is required, it is useful to create gateway services. A gateway (or sometimes, proxy) service is a daemon that receives a (usually higher-level) request and performs an action on behalf of a user when allowing the user to directly perform the action wouldn’t provide enough control.
A gateway, for instance, can be used to simulate column-level access to files. Because all columns are stored in a single file, and the file is the level of granularity at which access can be controlled, it is necessary to intercept operations, apply the necessary checks, execute a modified version of the user’s request, and place the result in a location the user can access. It’s imperative that the user not be given direct access to the source file, as this would allow access control to be circumvented. If, for instance, the preferred method of access was Hive, the gateway service could take a Hive query, rewrite it, and submit it to Hive proper as a privileged user, on the user’s behalf.[18] In many ways, this is the same idea as the sudo command, which is familiar to many administrators. This does require custom development and application-specific knowledge, so it’s hard to make a recommendation that’s generally applicable, but the technique is what is important.
[15] IETF RFC 4120 - The Kerberos Network Authentication Service (version 5) - http://tools.ietf.org/html/rfc4120
[16] You can install the MIT Kerberos 1.9 client and server packages on CentOS 6.2 using the commands yum install krb5-workstation and yum install krb5-server, respectively.
[17] This does not mean Hive makes Hadoop a relational database. Instead, it presents a familiar SQL-like interface to users to query data, but its execution engine is still MapReduce and subject to all the inherent functional and performance characteristics therein.
[18] Some may note that Hive has basic constructs for controlling table level access as well as creating views, so it should be possible to create a view that doesn’t select the restricted column and give access to that instead. The problem with this is that since the user’s MapReduce job generated by Hive runs as them, they still must have access to the underlying file, which means if they were to go to the command line, they could simply read the original file, circumventing the higher level controls.