Multitenancy is a fancy word for supporting many independent entities within a single larger system. The ability to support multiple discrete entities in a system is generally useful when it is costly or complex to operate an instance of that system for each entity. An example of this is the ability to run multiple databases within a single database server, a feature supported by almost all RDBMS vendors. By bringing multiple users of a service together, we can take advantage of economies of scale and offer greater service as a whole. A simple example of this in the context of Hadoop is that, if a large cluster is built to run hourly production MapReduce jobs, there are generally free resources between executions of those jobs. If we were to silo users by group or use case, these lulls in resource usage would be lost across all groups. Instead, it often (but admittedly not always) makes sense to combine smaller silo clusters into a single large cluster. Not only does this simplify operations, but it increases the available capacity to service consumers, on average, which improves system resource utilization.
Unfortunately, running multitenant systems comes with some obvious challenges. Controlling access to data and resources immediately becomes a point of concern, especially when the data is sensitive in nature. It may be that two different groups of users should not see each other’s data, or should not even know that the other exists. Controlling access to resources also becomes critical. It shouldn’t be possible for one group to monopolize the resources of a shared system. Administrators and especially security and compliance staff expect to be able to monitor and audit user activity in multitenant systems to ensure policies are enforced. The degree of isolation within a multitenant system can be seen as the gating factor between silo and shared deployments. This is even more true of data storage and processing systems, and is the impetus for a deep review of identity, access, and authorization controls in a system such as Hadoop.
When a user performs an action in Hadoop, there are three significant questions:
Who does this user claim to be?
The identity of the entity interacting with the cluster (where entity means a human user or another system) is who they purport to be. As humans, we identify ourselves using our names. In Linux, we use usernames, whereas the relational database MySQL, for instance, has its own notion of a user. The identity is an arbitrary label that is unique to an entity, and something to which we can attach meaning.
Can this user prove they are who they say they are?
Anyone can claim to be your Uncle Larry or Aunt Susan, but can they prove it? We authenticate one another by confirming an identity using some kind of system. To enter a country, an individual must present a valid and authentic passport bearing a photo of the person, for instance (although some may say this is a weak form of subjective authentication). Linux provides multiple forms of authentication via plug-ins, although passwords are probably the most common. Authentication mechanisms vary in strength (the rigor with which they confirm a user’s identity).
Is this user allowed to do what they’re asking to do?
Once a user has identified themselves and we are reasonably sure they are who they claim to be, only then does it make sense to ensure they have been authorized to perform the requested action. It never makes sense for a system to support authorization without first authenticating users; a person could simply lie about who they are to gain privileges they wouldn’t otherwise have.
Hadoop operates in either the default, so-called simple mode, or in secure mode, which provides strong authentication support via Kerberos. For many, the simple security mode is sufficient and offers reasonable protection from mistakes in a trusted environment. As its name implies, it’s simple to configure and manage, relying primarily on the host for authentication. If, however, you are running Hadoop in an untrusted, multitenant environment, or where accidental data exposure would be catastrophic, secure mode is the appropriate option. In secure mode, Hadoop uses the well-known Kerberos protocol to authenticate users and the daemons themselves during all operations. Additionally, MapReduce tasks in secure mode are executed as the OS user who submitted the job, whereas in simple mode, they are executed as the user running the tasktracker.
The most important aspect to understand is that, regardless of whether simple or secure mode is configured, this setting controls only how users are authenticated with the system. Authorization is inherently service specific. The evaluation of the authenticated user’s privileges in the context of the action they are asking to perform is controlled entirely by the service. In the case of HDFS, this means deciding whether a user is permitted to read from or write to a file, for example. Authentication must always be performed before authorization is considered, and because it is commonly the same for all services, it can be built as a separate, generic service.
In Hadoop, there is a strong relationship between who a user is in the host operating system and who they are in HDFS or MapReduce. Furthermore, since there are many machines involved in a cluster, it may not be immediately obvious what is actually required in order to execute a MapReduce job. Hadoop, like most systems, uses the concepts of users and groups to organize identities. However—and this is the root of quite a bit of confusion—it uses the identity of the user according to the operating system. That is, there is no such thing as a Hadoop user or group. When an OS user, in my case user esammer, executes a command using the hadoop executable, or uses any of the Java APIs, Hadoop accepts this username as the identity with no further checks. Versions of Apache Hadoop prior to 0.20.200 or CDH3u0 also allowed users to specify their identity by setting a configuration parameter when performing an action in HDFS or even running a MapReduce job, although this is no longer possible.
In simple mode, the Hadoop library on the client sends the username of the running process with each command to either the namenode or jobtracker, depending on the command executed. When in secure mode, the primary component of the Kerberos principal name is used as the identity of the user. The user must already have a valid Kerberos ticket in their cache, otherwise the command will fail with an incredibly cryptic message like Example 6-1.
Example 6-1. A typical failed authentication attempt with security enabled
WARN ipc.Client: Exception encountered while connecting to the server: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
As mentioned earlier, Hadoop supports strong authentication using the Kerberos protocol. Kerberos was developed by a team at MIT to provide strong authentication of clients to a server and is well-known to many enterprises. When operating in secure mode, all clients must provide a valid Kerberos ticket that can be verified by the server. In addition to clients being authenticated, daemons are also verified. In the case of HDFS, for instance, a datanode is not permitted to connect to the namenode unless it provides a valid ticket within each RPC. All of this amounts to an environment where every daemon and client application can be cryptographically verified as a known entity prior to allowing any operations to be performed, a desirable feature of any data storage and processing system.
To say Kerberos is “well-known” could be an overstatement. For many, Kerberos is shrouded in dense, intimidating terminology and requires specific knowledge to configure properly. Many implementations of Kerberos exist, and though there are RFCs[15] that describe the Kerberos protocol itself, management tools and methods have traditionally been vendor-specific. In the Linux world, one of the most popular implementations is MIT Kerberos version 5 (or MIT krb5 for short), an open source software package that includes the server, client, and admin tools. Before we dive into the details of configuring Hadoop to use Kerberos for authentication, let’s first take a look at how Kerberos works, as well as the MIT implementation.
A user in Kerberos is called a principal, which is made up of three distinct components: the primary, instance, and realm. The first component of the principal is called the primary, or sometimes the user component. The primary component is an arbitrary string and may be the operating system username of the user or the name of a service. The primary component is followed by an optional section called the instance, which is used to create principals for users in special roles or to define the host on which a service runs, for example. An instance, if present, is separated from the primary by a slash and is used to disambiguate multiple principals for a single user or service. The final component of the principal is the realm. The realm is similar to a domain in DNS in that it logically defines a related group of objects, although rather than hostnames as in DNS, the Kerberos realm defines a group of principals (see Table 6-1). Each realm can have its own settings, including the location of the KDC on the network and supported encryption algorithms. Large organizations commonly create distinct realms to delegate administration of a realm to a group within the enterprise. Realms, by convention, are written in uppercase characters.
Table 6-1. Example Kerberos principals
Principal | Description |
---|---|
esammer@MYREALM.CLOUDERA.COM | A standard user principal. User esammer in realm MYREALM.CLOUDERA.COM. |
esammer/admin@MYREALM.CLOUDERA.COM | The admin instance of the user esammer in the realm MYREALM.CLOUDERA.COM. |
hdfs/hadoop01.cloudera.com@MYREALM.CLOUDERA.COM | The hdfs service on the host hadoop01.cloudera.com in the realm MYREALM.CLOUDERA.COM. |
At its core, Kerberos provides a central, trusted service called the Key Distribution Center or KDC. The KDC is made up of two distinct services: the authentication server (AS), which is responsible for authenticating a client and providing a ticket granting ticket (TGT), and the ticket granting service (TGS), which, given a valid TGT, can grant a ticket that authenticates a user when communicating with a Kerberos-enabled (or Kerberized) service. The KDC contains a database of principals and their keys, very much like /etc/passwd and some KDC implementations (including MIT Kerberos) support storing this data in centralized systems like LDAP. That’s a lot of verbiage, but the process of authenticating a user is relatively simple.
Consider the case where user esammer wants to execute the command hadoop fs -get /user/esammer/data.txt. When operating in secure mode, the HDFS namenode and datanode will not permit any communication that does not contain a valid Kerberos ticket. We also know that at least two (and frequently many more) services must be contacted: one is the namenode to get the file metadata and check permissions, and the rest are the datanodes to retrieve the blocks of the file. To obtain any tickets from the KDC, we first retrieve a TGT from the AS by providing our principal name. The TGT, which is only valid for an administrator-defined period of time, is encrypted with our password and sent back to the client. The client prompts us for our password and attempts to decrypt the TGT. If it works, we’re ready to request a ticket from the TGS; otherwise, we’ve failed to decrypt the TGT and we’re unable to request tickets. It’s important to note that our password has never left the local machine; the system works because the KDC has a copy of the password, which has been shared in advance. This is a standard shared secret or symmetric key encryption model.
Even with a valid TGT, it is not yet possible to speak to the namenode or datanode; we need a ticket for each specific service. Now that we have a TGT, we can request service-specific tickets from the TGS. To do so, using our TGT, we ask the TGS for a ticket for a specific service, identified by its service principal (such as that of the cluster’s namenode). The TGS, which is part of the KDC, can verify that the TGT we provide is valid because it was encrypted with a special key called the TGT key. If the TGT can be validated and it hasn’t yet expired, the TGS provides us a ticket for the service, which is also only valid for a finite amount of time. Within the returned ticket is a session key, a shared secret that both the client and the service can use; the service can trust it because the ticket is encrypted with the service’s own key, which it shares with the KDC in advance. Using this ticket, we can now contact the namenode and request metadata for /user/esammer/data.txt. The namenode validates the ticket and, assuming everything checks out, performs the operation we originally requested. Additionally, for operations that involve access to block data, the namenode generates a block token for each block returned to the client. The block token is then provided by the client to the datanode, which validates its authenticity before providing access to the block data.
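To make this concrete, the ticket cache can be inspected before and after an HDFS operation. The hostname, realm, and exact output below are illustrative only (they follow the MIT Kerberos klist format and the example realm used elsewhere in this chapter), so expect minor differences in your environment.

[esammer@hadoop01 ~]$ hadoop fs -get /user/esammer/data.txt
[esammer@hadoop01 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: esammer@MYREALM.CLOUDERA.COM

Valid starting     Expires            Service principal
03/22/12 15:35:50  03/23/12 15:35:50  krbtgt/MYREALM.CLOUDERA.COM@MYREALM.CLOUDERA.COM
03/22/12 15:36:02  03/23/12 15:35:50  hdfs/hadoop01.cloudera.com@MYREALM.CLOUDERA.COM

The krbtgt entry is the TGT obtained from the AS, while the hdfs entry is the service ticket obtained from the TGS for the namenode's principal; access to the datanodes is covered by block tokens rather than additional Kerberos tickets.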
The TGT received from the KDC’s AS usually has a lifetime of 8 to 24 hours, meaning it is only necessary to provide a password once per time period. The TGT is cached locally on the client machine and reused during subsequent requests to the TGS. The MIT Kerberos implementation, for instance, caches ticket information in the temporary file /tmp/krb5cc_uid, where uid is the Linux user’s uid. To perform the initial authentication and retrieve a TGT from the KDC with MIT Kerberos, use the kinit command; to list cached credentials, use the klist command as in Example 6-2.
Example 6-2. Obtaining a ticket granting ticket with kinit
[esammer@hadoop01 ~]$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_500)
[esammer@hadoop01 ~]$ kinit
Password for esammer@MYREALM.CLOUDERA.COM:
[esammer@hadoop01 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: esammer@MYREALM.CLOUDERA.COM

Valid starting     Expires            Service principal
03/22/12 15:35:50  03/23/12 15:35:50  krbtgt/MYREALM.CLOUDERA.COM@MYREALM.CLOUDERA.COM
        renew until 03/22/12 15:35:50
[esammer@hadoop01 ~]$ hadoop fs -get /user/esammer/data.txt
...
Kerberos is an enormous topic, complex in its own right. Prior to embarking on a Kerberos deployment, it’s critical to understand how hosts and services are accessed by users as well as other services. Without a coherent understanding of a system, it’s likely that you will find that services that used to be accessible no longer work. For a detailed explanation of Kerberos, see Kerberos: The Definitive Guide by Jason Garman (O'Reilly Media).
Now that we have some understanding of how Kerberos works conceptually, it’s worth looking at how this applies to Hadoop. There are two primary forms of authentication that occur in Hadoop with respect to Kerberos: nodes within the cluster authenticating with one another to ensure that only trusted machines are part of the cluster, and users, both human and system, that access the cluster to interact with services. Since many of the Hadoop daemons also have embedded web servers, they too must be secured and authenticated.
Within each service, both users and worker nodes are verified by their Kerberos credentials. HDFS and MapReduce follow the same general architecture; the worker daemons are each given a unique principal that identifies each daemon, they authenticate, and include a valid ticket in each RPC to their respective master daemon. The workers authenticate by using a keytab stored on the local disk. Though tedious, the act of creating a unique principal for each daemon, for each host, generating the keytab, and getting it to the proper machine, is absolutely necessary when configuring a secure Hadoop cluster. Workers must have their own unique principals because if they didn’t, the KDC would issue a similar TGT (based on the principal’s key and timestamp) to all nodes, and services would see potentially hundreds of clients all attempting to authenticate with the same ticket, falsely characterizing it as a replay attack.
Multiple principals are used by the system when Hadoop is operating in secure mode. They take the form service-name/hostname@KRB.REALM.COM, where service-name is hdfs in the case of the HDFS daemons and mapred in the case of the MapReduce daemons. Since worker nodes run both a datanode as well as a tasktracker, each node requires two principals to be generated: one for the datanode and one for the tasktracker. The namenode and jobtracker also have principals, although in smaller clusters where one or both of these daemons run on a node that is also a slave, it is not necessary to create a separate principal, as the namenode and datanode can share a principal and the tasktracker and jobtracker can share a principal.
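To see which daemon principals already exist in the KDC database, MIT Kerberos's kadmin.local can list them. The realm and hostnames below are placeholders, and the command must be run on the KDC host by a user with access to the KDC database.

[root@kdc01 ~]# kadmin.local -q "listprincs" | egrep '^(hdfs|mapred)/'
hdfs/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM
hdfs/hadoop02.mycompany.com@MYREALM.MYCOMPANY.COM
mapred/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM
mapred/hadoop02.mycompany.com@MYREALM.MYCOMPANY.COM
...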
Since it isn’t feasible to log into each machine and execute kinit as both user hdfs and mapred and provide a password, the keys for the service principals are exported to files and placed in a well-known location. These files are referred to as key tables, or just keytabs. Exporting the keys to files may seem dangerous, but if the contents of the files are properly protected by filesystem permissions (that is, owned by the user the daemon runs as, with permissions set to 0400), the integrity of the key is not compromised. When the daemons start up, they use this keytab to authenticate with the KDC and get a ticket so they can connect to the namenode or jobtracker, respectively. When operating in secure mode, it is not possible for a datanode or tasktracker to connect to its constituent master daemon without a valid ticket.
With MIT Kerberos, exporting a key to a keytab will invalidate any previously exported copies of that same key unless the -norandkey option is used. It’s absolutely critical that you do not export a key that has already been exported unless that’s what you mean to do. This should only be necessary if you believe a keytab has been compromised or is otherwise irrevocably lost or destroyed.
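One way to check whether an existing keytab has been invalidated by a later export is to compare key version numbers (KVNOs): klist -k shows the KVNO recorded in the keytab, and the MIT kvno utility reports the current version known to the KDC (it requires a valid TGT). The path, principal, and output shown here are illustrative only.

[esammer@hadoop01 ~]$ sudo klist -e -k -t /etc/hadoop/conf/hdfs.keytab
Keytab name: FILE:/etc/hadoop/conf/hdfs.keytab
KVNO Timestamp         Principal
   3 03/22/12 14:58:01 hdfs/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM (aes256-cts-hmac-sha1-96)
[esammer@hadoop01 ~]$ kvno hdfs/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM
hdfs/hadoop01.mycompany.com@MYREALM.MYCOMPANY.COM: kvno = 3

If the two version numbers differ, the keytab no longer matches the key in the KDC and the daemon will fail to authenticate.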
Users performing HDFS operations and running MapReduce jobs also must authenticate prior to those operations being allowed (or, technically, checked for authorization). When an application uses the Hadoop library to communicate with one of the services and is running in secure mode, the identity of the user to Hadoop is the primary component of the Kerberos principal. This is different from simple mode, where the effective uid of the process is the identity of the user. Additionally, the tasks of a MapReduce job execute as the authenticated user that submitted the job. What this means is that, in secure mode, each user must have a principal in the KDC database and a user account on every machine in the cluster. See Table 6-2.
Table 6-2. Comparison of secure and simple mode identity
 | Simple | Secure |
---|---|---|
Identity comes from: | Effective uid of client process | Kerberos principal |
MapReduce tasks run as: | Tasktracker user (e.g., mapred) | Kerberos principal |
The requirement that all users have a principal can complicate otherwise simple tasks. For instance, assuming the HDFS super user is hdfs, it would normally be possible to perform administrative activities using sudo as in Example 6-3.
Example 6-3. Performing HDFS administrative commands with sudo
# Creating a new user's home directory in HDFS. Since /user is owned
# by user hdfs, it is necessary to become that user or the super user (which
# also happens to be hdfs).
[esammer@hadoop01 ~]$ sudo -u hdfs hadoop fs -mkdir /user/jane
Unfortunately, this doesn’t work in secure mode because the uid of the process doesn’t make us hdfs. Instead, it is necessary to authenticate as user hdfs with Kerberos. This is normally done using kinit, as we saw earlier, but it has the unpleasant side effect of requiring that we share the password for the HDFS principal. Rather than share the HDFS principal password with all the cluster administrators, we can export the HDFS principal key to a keytab protected by restrictive filesystem permissions, and then use sudo to allow selected users to access it when they authenticate with kinit. HDFS also supports the notion of a super group whose members can perform administrative commands as themselves.
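A sketch of that keytab-based approach follows. It assumes the hdfs service keytab has already been placed at /etc/hadoop/conf/hdfs.keytab (owned by hdfs, mode 0400) as described later in this chapter, and that the administrator has been granted sudo access to run commands as the hdfs user; the realm and paths are placeholders.

# Obtain a ticket for the hdfs principal from its keytab (no password is
# shared), then perform the administrative operation as user hdfs.
[esammer@hadoop01 ~]$ sudo -u hdfs kinit -kt /etc/hadoop/conf/hdfs.keytab \
    hdfs/$(hostname -f)@MYREALM.MYCOMPANY.COM
[esammer@hadoop01 ~]$ sudo -u hdfs hadoop fs -mkdir /user/jane
[esammer@hadoop01 ~]$ sudo -u hdfs hadoop fs -chown jane /user/jane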
Running tasks as the user who submitted the MapReduce job solves a few potential problems. The first is that, if we were to allow all tasks to run as user mapred, each map task would produce its intermediate output as the same user. A malicious user would be able to simply scan through the directories specified by mapred.local.dir and read or modify the output of another, unrelated task. This lack of isolation is a non-starter for security-sensitive deployments.
Since the tasktracker runs as an unprivileged user (user mapred by default in the case of CDH, and whatever user the administrator configures in Apache Hadoop), it isn’t possible for it to launch task JVMs as a different user. One way to solve this problem is to simply run the tasktracker process as root. While this would solve the immediate problem of permissions, any vulnerability in the tasktracker would open the entire system to compromise. Worse, since the tasktracker’s job is to execute user-supplied code as a user indicated by the jobtracker, an attacker would trivially have full control over all worker nodes. Instead of running the tasktracker as root, when operating in secure mode, the tasktracker relies on a small setuid executable called the task-controller.
The task-controller is a standalone binary implemented in C that sanity-checks its environment and immediately drops privileges to the proper user before launching the task JVM. Configured by a small key-value configuration file called taskcontroller.cfg in the Hadoop configuration directory, the task-controller is restricted to executing tasks for users with a uid above a certain value (as privileged accounts usually have low-numbered uids). Specific users can also be explicitly prevented from running tasks, regardless of their uid, which is useful for preventing the Hadoop daemon users from executing tasks. For the task-controller to execute tasks as the user who submitted the job, each user must have an account on all machines of the cluster. Administrators are expected to maintain these accounts, and because of the potentially large number of machines to keep in sync, admins are encouraged to either use centralized account management such as LDAP or an automated system to keep password files up-to-date.
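Account drift can be spot-checked with a small script. The sketch below assumes passwordless SSH to every worker and a file of fully qualified hostnames, one per line (like the hostnames.txt built later in this chapter); it prints the uid of a given user on each node so that missing accounts or mismatched uids stand out.

#!/bin/sh
# Usage: ./check-uid.sh username
# Prints the uid of the given user on every host listed in hostnames.txt.
user=${1:?usage: $0 username}
for host in $(cat hostnames.txt); do
  printf '%s: ' "${host}"
  ssh -o BatchMode=yes "${host}" id -u "${user}" 2>/dev/null || echo "MISSING"
done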
Configuring Hadoop to operate in secure mode can be a daunting task with a number of external dependencies. Detailed knowledge of Linux, Kerberos, SSL/TLS, and JVM security constructs is required. At the time of this writing, there are also some known gotchas in certain Linux distributions and versions of the JVM that can cause you grief. Some of these are called out below.
The high-level process for enabling security is as follows.
Audit all services to ensure enabling security will not break anything.
Hadoop security is all or nothing; enabling it will prevent all non-Kerberos authenticated communication. It is absolutely critical that you first take an inventory of all existing processes, both automated and otherwise, and decide how each will work once security is enabled. Don’t forget about administrative scripts and tools!
Configure a working non-security enabled Hadoop cluster.
Before embarking on enabling Hadoop’s security features, get a simple mode cluster up and running. You’ll want to iron out any kinks in DNS resolution, network connectivity, and simple misconfiguration early. Debugging network connectivity issues and supported encryption algorithms within the Kerberos KDC at the same time is not a position that you want to find yourself in.
Configure a working Kerberos environment.
Basic Kerberos operations such as authenticating and receiving a ticket-granting ticket from the KDC should work before you continue. You are strongly encouraged to use MIT Kerberos with Hadoop; it is, by far, the most widely tested. If you have existing Kerberos infrastructure (such as that provided by Microsoft Active Directory) that you wish to authenticate against, it is recommended that you configure a local MIT KDC with one-way cross-realm trust, so that Hadoop daemon principals exist in the MIT KDC and user authentication requests are forwarded to Active Directory. This is usually far safer, as large Hadoop clusters can accidentally create distributed denial-of-service attacks against shared infrastructure when they become active.
Ensure host name resolution is sane.
As discussed earlier, each Hadoop daemon has its own principal that it must know in order to authenticate. Since the hostname of the machine is part of the principal, all hostnames must be consistent and known at the time the principals are created. Once the principals are created, the hostnames may not be changed without recreating all of the principals! It is common that administrators run dedicated, caching-only, DNS name servers for large clusters.
Create Hadoop Kerberos principals.
Each daemon on each host of the cluster requires a distinct Kerberos principal when enabling security. Additionally, the Web user interfaces must also be given principals before they will function correctly. Just as the first point says, security is all or nothing.
Export principal keys to keytabs and distribute them to the proper cluster nodes.
With principals generated in the KDC, each key must be exported to a keytab, and copied to the proper host securely. Doing this by hand is incredibly laborious for even small clusters and, as a result, should be scripted.
Update Hadoop configuration files.
With all the principals generated and in their proper places, the Hadoop configuration files are then updated to enable security. The full list of configuration properties related to security is described later.
Restart all services.
To activate the configuration changes, all daemons must be restarted. The first time security is configured, it usually makes sense to start the first few daemons to make sure they authenticate correctly and are using the proper credentials before firing up the rest of the cluster.
Test!
It’s probably clear by now that enabling security is complex and requires a fair bit of effort. The truly difficult part of configuring a security environment is testing that everything is working correctly. It can be particularly difficult on a large production cluster with existing jobs to verify that everything is functioning properly, but no assumptions should be made. Kerberos does not, by definition, afford leniency to misconfigured clients.
Creating principals for each of the Hadoop daemons and distributing their respective keytabs is the most tedious part of enabling Hadoop security. Doing this for each daemon by hand would be rather error prone, so instead, we’ll create a file of host names and use a script to execute the proper commands. These examples assume MIT Kerberos 1.9 on CentOS 6.2.[16]
First, build a list of fully qualified hostnames, either by exporting them from an inventory system or generating them based on a well-known naming convention. For example, if all hosts follow the naming convention of hadoopN.mycompany.com, where N is a zero-padded sequential number, a simple shell script will do:
[esammer@hadoop01 ~]$ for n in $(seq -f "%02g" 1 10) ; do
echo "hadoop${n}.mycompany.com"
done > hostnames.txt
[esammer@hadoop01 ~]$ cat hostnames.txt
hadoop01.mycompany.com
hadoop02.mycompany.com
hadoop03.mycompany.com
hadoop04.mycompany.com
hadoop05.mycompany.com
hadoop06.mycompany.com
hadoop07.mycompany.com
hadoop08.mycompany.com
hadoop09.mycompany.com
hadoop10.mycompany.com
Using our host list as input, we can write a script to create the necessary principals, export the keys to keytabs, and bucket them by machine name.
This script will regenerate keys of any existing principals of the same name, which will invalidate any existing keytabs or passwords. Always measure twice and cut once when running scripts that affect the KDC!
#!/bin/sh

[ -r "hostnames.txt" ] || {
  echo "File hostnames.txt doesn't exist or isn't readable."
  exit 1
}

# Set this to the name of your Kerberos realm.
krb_realm=MYREALM.MYCOMPANY.COM

for name in $(cat hostnames.txt); do
  install -o root -g root -m 0700 -d ${name}

  kadmin.local <<EOF
addprinc -randkey host/${name}@${krb_realm}
addprinc -randkey hdfs/${name}@${krb_realm}
addprinc -randkey mapred/${name}@${krb_realm}
ktadd -k ${name}/hdfs.keytab -norandkey \
  hdfs/${name}@${krb_realm} host/${name}@${krb_realm}
ktadd -k ${name}/mapred.keytab -norandkey \
  mapred/${name}@${krb_realm} host/${name}@${krb_realm}
EOF

done
This script relies on a properly configured Kerberos KDC and assumes it is being run on the same machine as the KDC database. It also assumes /etc/krb5.conf is correctly configured and that the current user, root, has privileges to write to the KDC database files. It’s also important to use the -norandkey option to ktadd; otherwise, each time you export the key, it changes, invalidating all previously created keytabs containing that key. Also tricky is that the -norandkey option to ktadd works only when using kadmin.local (rather than kadmin). This is because kadmin.local never transports the key over the network, since it works on the local KDC database. If you are not using MIT Kerberos, consult your vendor’s documentation to ensure keys are protected at all times.
You should now have a directory for each hostname, each of which contains two keytab files: one named hdfs.keytab and one named mapred.keytab. Each keytab contains its respective service principal (for example, hdfs/hostname@realm) and a copy of the host principal’s key. Next, using a secure copy utility like scp or rsync tunnelled over ssh, copy the keytabs to the proper machines and place them in the Hadoop configuration directory. The owner of the hdfs.keytab file must be the user the namenode, secondary namenode, and datanodes run as, whereas the mapred.keytab file must be owned by the user the jobtracker and tasktrackers run as. Keytab files must be protected at all times and, as such, should have the permissions 0400 (owning user read-only).
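Distribution can be scripted in the same spirit as the principal-creation script. The following sketch assumes the per-host directories produced by that script, root SSH access to each node, and CDH-style hdfs and mapred daemon users; adjust the configuration directory, users, and copy mechanism to match your environment.

#!/bin/sh
# Copy each host's keytabs into its Hadoop configuration directory and set
# restrictive ownership and permissions (0400, owned by the daemon user).
conf_dir=/etc/hadoop/conf

for host in $(cat hostnames.txt); do
  scp ${host}/hdfs.keytab ${host}/mapred.keytab root@${host}:${conf_dir}/
  ssh root@${host} "chown hdfs ${conf_dir}/hdfs.keytab && \
    chown mapred ${conf_dir}/mapred.keytab && \
    chmod 0400 ${conf_dir}/hdfs.keytab ${conf_dir}/mapred.keytab"
done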
With the keytabs distributed to the proper machines, the next step is to update the Hadoop configuration files to enable secure mode. First, Kerberos security is enabled in core-site.xml.
hadoop.security.authentication

The hadoop.security.authentication parameter defines the authentication mechanism to use within Hadoop. By default, it is set to simple, which simply trusts that the client is who they claim to be, whereas setting it to the string kerberos enables Kerberos support. In the future, other authentication schemes may be supported, but at the time of this writing, these are the only two valid options.

Example value: kerberos
hadoop.security.authorization

Enabling hadoop.security.authorization causes Hadoop to authorize the client when it makes remote procedure calls to a server. The access control lists that affect these permissions are configured via the hadoop-policy.xml file and allow per-service level control. For instance, it is possible to permit only users placed in the mapred-admin Linux group to invoke APIs that are part of the administration service (the security.admin.operations.protocol.acl policy). When enabling security, this feature should be enabled as well, and meaningful ACLs configured.

Example: true
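As a sketch of what a service-level ACL in hadoop-policy.xml looks like, the following fragment restricts the administration APIs to members of a hypothetical mapred-admin group; the leading space in the value denotes an empty user list, following the ACL format described later in this chapter.

<property>
  <name>security.admin.operations.protocol.acl</name>
  <!-- No individual users; only members of the mapred-admin group. -->
  <value> mapred-admin</value>
</property>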
Example 6-4 shows core-site.xml with the proper security properties set.
Example 6-4. Updating core-site.xml to enable Hadoop security
<?xml version="1.0"?>
<configuration>

  <!-- Add these properties to the existing core-site.xml configuration. -->

  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>

</configuration>
Next, hdfs-site.xml must be configured so HDFS knows the principals to use and the location of the keytabs. Hadoop datanodes will also refuse to start in secure mode unless the data transceiver port is below 1024 (a privileged port) so that must also be changed.
dfs.block.access.token.enable

Block access tokens are temporary keys that allow an HDFS block to be read, written, or deleted, among other internal actions, by an authenticated user. This mechanism allows Hadoop to ensure that only the intended users are able to access data in HDFS. While disabled (false) by default, this parameter should be enabled (set to true) in a secure deployment.

Example: true
dfs.namenode.keytab.file

The dfs.namenode.keytab.file parameter specifies the location of the keytab that contains the Kerberos principal key for the namenode. This is the file uploaded to each host and, by convention, is placed in the Hadoop configuration directory.

Example: /etc/hadoop/conf/hdfs.keytab
dfs.namenode.kerberos.principal

The Kerberos principal the namenode should use to authenticate. The key for this principal must exist in the keytab specified by dfs.namenode.keytab.file. The special token _HOST can be used for the instance portion of the principal, in which case the fully qualified domain name will be interpolated. Note that the _HOST token cannot be used anywhere else in the principal.

Example: hdfs/_HOST@MYREALM.MYCOMPANY.COM
dfs.namenode.kerberos.https.principal

Similar to dfs.namenode.kerberos.principal (see previous entry), this parameter specifies the Kerberos principal that should be used by the embedded HTTPS server. The key for this principal must also be in the keytab specified by dfs.namenode.keytab.file. Note that the primary component of this principal must be host, as shown in the example.

Example: host/_HOST@MYREALM.MYCOMPANY.COM
dfs.https.address
The hostname or IP address on which the embedded HTTPS server should be bound. It is valid to specify the wild card IP 0.0.0.0 to indicate the HTTPS server should listen on all interfaces.
Example: 0.0.0.0
dfs.https.port
The port on which the embedded HTTPS server should listen for requests.
Example: 50470
dfs.datanode.keytab.file

Exactly like dfs.namenode.keytab.file, the dfs.datanode.keytab.file parameter specifies the keytab file containing the principal keys used by the datanode process. This can be, and usually is, the same file as dfs.namenode.keytab.file.

Example: /etc/hadoop/conf/hdfs.keytab
dfs.datanode.kerberos.principal

The Kerberos principal the datanode should use to authenticate. The key for this principal must exist in the keytab specified by dfs.datanode.keytab.file. The special token _HOST can be used for the instance portion of the principal, in which case the fully qualified domain name will be interpolated. Note that the _HOST token cannot be used anywhere else in the principal. This is commonly the same principal as dfs.namenode.kerberos.principal.

Example: hdfs/_HOST@MYREALM.MYCOMPANY.COM
dfs.datanode.kerberos.https.principal

Similar to dfs.datanode.kerberos.principal, this parameter specifies the Kerberos principal that should be used by the embedded HTTPS server. The key for this principal must also be in the keytab specified by dfs.datanode.keytab.file.

Example: host/_HOST@MYREALM.MYCOMPANY.COM
dfs.datanode.address
The hostname or IP address and port, separated by a colon, on which the data transceiver RPC server should be bound. It is valid to specify the wild card IP 0.0.0.0 to indicate the server should listen on all interfaces. With security enabled, this port must be below 1024 or the datanode will not start.
Example: 0.0.0.0:1004
dfs.datanode.http.address
The hostname or IP address and port, separated by a colon, on which the embedded HTTP server should be bound. It is valid to specify the wild card IP 0.0.0.0 to indicate the HTTP server should listen on all interfaces.
Example: 0.0.0.0:1006
dfs.datanode.data.dir.perm

When security is enabled, Hadoop performs extra checks to ensure HDFS block data cannot be read by unauthorized users. One of these checks involves making sure the directories specified by dfs.data.dir are set to restrictive permissions. This prevents user code from simply opening and reading block data directly from the local disk rather than using the HDFS APIs, which require a valid Kerberos ticket and perform authorization checks on the file. If the permissions are incorrect, the datanode will change the permissions to the value specified by this parameter.

HDFS supports a feature called local read short-circuiting (as implemented by HDFS-2246) in which a client application running on the same machine as the datanode can completely bypass the datanode server and read block files directly from the local filesystem. This can dramatically increase the speed of read operations, but at the cost of opening access to the underlying block data of all blocks to the client. When this feature is enabled, clients must be running as the same user as the datanode or be in a group that has read access to block data. Both scenarios break some of the invariants assumed by the security model and can inadvertently expose data to malicious applications. Take great care when enabling this feature on a secure cluster or setting dfs.datanode.data.dir.perm to anything other than 0700.

Example: 0700
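A quick way to confirm that the block directories are locked down as expected is to inspect them directly on a worker node; the paths below are placeholders for whatever dfs.data.dir is set to in your deployment, and the expected mode is drwx------ owned by the HDFS user.

[esammer@hadoop01 ~]$ ls -ld /data/1/hadoop/dfs/dn /data/2/hadoop/dfs/dn
drwx------ 5 hdfs hdfs 4096 Mar 22 15:40 /data/1/hadoop/dfs/dn
drwx------ 5 hdfs hdfs 4096 Mar 22 15:40 /data/2/hadoop/dfs/dn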
See Example 6-5 for a sample hdfs-site.xml file configured for security.
Example 6-5. Updating hdfs-site.xml to enable Hadoop security
<?xml version="1.0"?>
<configuration>

  <property>
    <name>dfs.block.access.token.enable</name>
    <value>true</value>
  </property>

  <!-- NameNode security config -->
  <property>
    <name>dfs.namenode.keytab.file</name>
    <value>hdfs.keytab</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>hdfs/_HOST@MYREALM.MYCOMPANY.COM</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.https.principal</name>
    <value>host/_HOST@MYREALM.MYCOMPANY.COM</value>
  </property>

  <!-- DataNode security config -->
  <property>
    <name>dfs.datanode.keytab.file</name>
    <value>hdfs.keytab</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.principal</name>
    <value>hdfs/_HOST@MYREALM.MYCOMPANY.COM</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.https.principal</name>
    <value>host/_HOST@MYREALM.MYCOMPANY.COM</value>
  </property>

</configuration>
Lastly, mapred-site.xml must be configured; in many ways, it has similar parameters to hdfs-site.xml.
mapreduce.jobtracker.kerberos.principal

Just like dfs.namenode.kerberos.principal, this specifies the Kerberos principal the jobtracker uses during authentication. Again, the key for this principal must be in the keytab specified by mapreduce.jobtracker.keytab.file. The _HOST token may be used in the instance part of the principal, in which case it will be interpolated with the fully qualified domain name of the machine on which the jobtracker is running.

Example: mapred/_HOST@MYREALM.MYCOMPANY.COM
mapreduce.jobtracker.kerberos.https.principal
This parameter specifies the Kerberos principal for the embedded HTTPS server, just like its HDFS siblings.
Example: host/_HOST@MYREALM.MYCOMPANY.COM
mapreduce.jobtracker.keytab.file

The keytab file containing the jobtracker principal key. See dfs.namenode.keytab.file for more information.

Example: /etc/hadoop/conf/mapred.keytab
mapreduce.tasktracker.kerberos.principal
The Kerberos principal used by the tasktracker. See mapreduce.jobtracker.kerberos.principal for more information.
Example: mapred/_HOST@MYREALM.MYCOMPANY.COM
mapreduce.tasktracker.kerberos.https.principal
The Kerberos principal used by the embedded HTTPS server. See mapreduce.jobtracker.kerberos.https.principal for more information.
Example: host/_HOST@MYREALM.MYCOMPANY.COM
mapreduce.tasktracker.keytab.file
The keytab file containing the tasktracker principal key. See mapreduce.jobtracker.keytab.file
for more information.
Example: /etc/hadoop/conf/mapred.keytab
mapred.task.tracker.task-controller

As we’ve seen, the tasktracker has a few strategies for launching child task JVMs. In simple security mode, the tasks are launched as the same user as the tasktracker itself, while in secure mode, the setuid task-controller executable is used. The tasktracker uses a Java plug-in that controls which strategy is used, which is defined by this parameter. By default, mapred.task.tracker.task-controller specifies the class name org.apache.hadoop.mapred.DefaultTaskController, which simply executes the child task with the same permissions as the parent tasktracker. In the case of a secure cluster, this should instead be set to org.apache.hadoop.mapred.LinuxTaskController, which is the implementation that knows about the setuid task-controller executable.

Example: org.apache.hadoop.mapred.LinuxTaskController
mapreduce.tasktracker.group

In secure mode, the setuid task-controller, after sanitizing its environment, changes its effective user id and group id to the user running the MapReduce job and a specified group, respectively. This parameter specifies the group that should be used when this occurs. It’s important that the effective group be controlled so the user task has the proper permissions when it creates files (for example, in the Hadoop log directory). This value should always match the effective group of the tasktracker itself. If it doesn’t, the task-controller will exit with a non-zero status. This isn’t important in simple security mode because the task-controller isn’t used.

Example: mapred
The task-controller’s configuration is separate from the three primary XML files and is required for it to function properly. Located in the Hadoop configuration directory, taskcontroller.cfg is a plain-text file that works very much like a Java property file: one key-value pair per line, separated by an equals sign. Unlike a property file, however, whitespace is not permitted between the equals sign and its key or value. See Example 6-6 for a sample taskcontroller.cfg. The following parameters are supported.
mapred.local.dir (required)

A comma-separated list of directories used for temporary data during MapReduce jobs. This should be the same list as specified by mapred.local.dir in the mapred-site.xml file. See mapred.local.dir for more information.
hadoop.log.dir (required)

The directory in which log data should be written. This should be the same path as specified by HADOOP_LOG_DIR in the hadoop-env.sh file. See Logging Configuration for more information.
mapreduce.tasktracker.group (required)

The group that the tasktracker process runs as. In CDH, this is mapred; in Apache Hadoop, it is whatever group the administrator chooses. This group should not contain any users other than the user the tasktracker runs as. If it does, those users will be able to impersonate any user with a uid greater than what is specified in min.user.id.
min.user.id (required)

The minimum uid of a user allowed to run tasks. If a user with a uid below this value submits a MapReduce job, the task-controller will refuse to execute the tasks, exiting with a non-zero status code, which will cause all tasks to fail. The default value is 1000 in CDH, while Apache Hadoop has no default.
banned.users

Specific users can be banned from executing tasks on the cluster. By default, CDH includes users mapred, hdfs, and bin in this list, whereas Apache Hadoop leaves it empty. At a minimum, you should always include user hdfs. By doing so, you can avoid the potential attack vector where a task running as hdfs could read or modify HDFS blocks stored on the local disks, circumventing the permission checks performed by the namenode.
Example 6-6. Sample taskcontroller.cfg
mapred.local.dir=/data/1/hadoop/mapred/local,/data/2/hadoop/mapred/local
hadoop.log.dir=/var/log/hadoop
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=1000
Finally, after distributing the changed configuration files to all hosts in the cluster, you are now ready to restart all daemons and begin testing. It is strongly recommended that you start with the namenode and a single datanode first to make sure everything is in working order. Watch the log files for errors, especially those related to authentication, and resolve any issues you find. Given the complexity of setting up security, it is common to miss a step the first time and have a little debugging to do. Take it slow, one daemon at a time, and for production clusters, always have a rollback plan.
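A minimal smoke test, assuming a user principal already exists, is to confirm that HDFS access fails without a Kerberos ticket and succeeds with one; the realm below is a placeholder and the error text is abbreviated.

# Without a ticket, the request should be rejected with a GSS/SASL error
# similar to the one shown in Example 6-1.
[esammer@hadoop01 ~]$ kdestroy
[esammer@hadoop01 ~]$ hadoop fs -ls /
... GSSException: No valid credentials provided ...

# With a ticket, the same command should succeed.
[esammer@hadoop01 ~]$ kinit
Password for esammer@MYREALM.MYCOMPANY.COM:
[esammer@hadoop01 ~]$ hadoop fs -ls /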
So far, we’ve discussed only how clients identify themselves and how Hadoop authenticates them. Once a client is authenticated, though, that client is still subject to authorization when it attempts to perform an action. The actions that can be performed vary from service to service. An action in the context of HDFS, for example, may be reading a file, creating a directory, or renaming a filesystem object. MapReduce actions, on the other hand, could be submitting or killing a job. Evaluating whether or not a user is permitted to perform a specific action is the process of authorization.
Every filesystem operation in HDFS is subject to authorization. In an effort to exploit existing knowledge, HDFS uses the same authorization model as most POSIX filesystems. Each filesystem object (such as a file or directory) has three classes of user: an owner, a group, and “other,” which indicates anyone who isn’t in one of the two previous classes. The available permissions that can be granted to each of the three classes on an object are read, write, and execute, just as with Linux or other Unix-like systems. For example, it is possible to grant the owner of a file both read and write privileges. These permissions are represented by a single octal (base-8) integer that is calculated by summing permission values (see Table 6-3).
Using our previous example, to indicate that the file owner has both read and write privileges, we would sum the read permission value of 4 and the write permission value of 2, giving us 6, which represents the combination of the two privileges. This representation is always unambiguous; that is, it is impossible for two combinations to yield the same sum. The special value of 0 (zero) means that a user has no permissions. Note that while it is not illegal for a file to be executable in HDFS, it is not meaningful. Directories, on the other hand, use the execute permission to indicate that a user may access their contents and metadata if they already know the name of the contained file or directory; the read permission is additionally required to list the children’s names.
To represent the three classes of user—owner, group, and other—we use three integers, one for each class, in that order. For instance, a file that allows the owner to read and write, the group to read, and other users to read, would have the permissions 644; the 6 indicates read and write for the owner (4 + 2), whereas the subsequent fields are both 4, indicating only the read permission is available. To indicate the owner of a directory has read, write, and execute permissions, but no one else has access, the permissions 700 would be used.
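For example, the two modes just described can be applied and inspected with the hadoop fs -chmod and -ls commands; the paths and listing details below are illustrative.

[esammer@hadoop01 ~]$ hadoop fs -chmod 644 /user/esammer/data.txt
[esammer@hadoop01 ~]$ hadoop fs -chmod 700 /user/esammer/private
[esammer@hadoop01 ~]$ hadoop fs -ls /user/esammer
Found 2 items
-rw-r--r--   3 esammer hadoop    1048576 2012-03-22 15:40 /user/esammer/data.txt
drwx------   - esammer hadoop          0 2012-03-22 15:40 /user/esammer/private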
In addition to the above permissions, three other special permissions exist: setuid, setgid, and sticky mode (or “the sticky bit”). The setuid permission in Linux changes the process effective user id to that of the file owner when the file is executed. Since files in HDFS cannot be executed, granting this means nothing. The setgid permission, when set on a directory in Linux, forces the group of all immediate child files and directories to that of the parent directory. However, this is the default behavior in HDFS, so there’s no need to explicitly set the setgid permission on a directory. Like setuid, there is no meaning to setting setgid on a file in HDFS because files cannot be executed. The final permission is called the sticky bit and, when set on a directory, means that only the owner of a file in that directory may delete or rename the file, even if another user would otherwise be allowed to do so (as granted by the write permission on the directory itself). The exception to this rule is that the HDFS super user and the owner of the directory always have the ability to perform these actions. This is exactly the behavior one desires for temporary directories where all users need to be able to write data, but only the owner of that data should be able to remove or change it. These permissions are granted by using an optional fourth octal number placed to the left of the standard set of three. Just like the read, write, and execute permissions, setuid, setgid, and sticky bit each have a designated value that may be summed to grant multiple permissions. (See Table 6-4.)
A value of 0 (zero) removes these permissions, if they are set. Example 6-7 shows typical HDFS interaction with Kerberos authentication and permissions set.
Example 6-7. Using the sticky bit permission in HDFS
# Authenticate as user esammer
[esammer@hadoop01 ~]$ kinit esammer
Password for esammer@PE.HADOOP.CLOUDERA.COM:

# Create a directory called test
[esammer@hadoop01 ~]$ hadoop fs -mkdir /user/esammer/test

# Change the permissions to enable the sticky bit and set:
#   owner: read, write, execute
#   group: read, write, execute
#   other: read, write, execute
[esammer@hadoop01 ~]$ hadoop fs -chmod 1777 /user/esammer/test

# Create an empty file as esammer
[esammer@hadoop01 ~]$ hadoop fs -touchz /user/esammer/test/foo

# Authenticate as user esammer2
[esammer@hadoop01 ~]$ kinit esammer2
Password for esammer2@PE.HADOOP.CLOUDERA.COM:

# Attempt to remove the file foo.
[esammer@hadoop01 ~]$ hadoop fs -rmr /user/esammer/test/foo
rmr: org.apache.hadoop.security.AccessControlException: Permission denied by sticky
bit setting: user=esammer2, inode="/user/esammer/test/foo":esammer:hadoop:-rw-r--r--

# Oops. The sticky bit stopped us from removing the file. Let's create a file of
# our own called bar.
[esammer@hadoop01 ~]$ hadoop fs -touchz /user/esammer/test/bar

# Switch back to user esammer.
[esammer@hadoop01 ~]$ kinit esammer
Password for esammer@PE.HADOOP.CLOUDERA.COM:

# We can see both files...
[esammer@hadoop01 ~]$ hadoop fs -ls /user/esammer/test/
Found 2 items
-rw-r--r--   3 esammer2 hadoop          0 2012-04-25 13:53 /user/esammer/test/bar
-rw-r--r--   3 esammer  hadoop          0 2012-04-25 13:52 /user/esammer/test/foo

# But because user esammer owns the directory, we can delete esammer2's file!
[esammer@hadoop01 ~]$ hadoop fs -rmr /user/esammer/test/bar
Deleted hdfs://hadoop01.cloudera.com/user/esammer/test/bar
Hadoop MapReduce, like HDFS, has a few different classes of users (four, to be exact).
The cluster owner is the OS user that started the cluster. In other words, this is the user the jobtracker daemon is running as, normally user hadoop for Apache Hadoop and mapred for CDH. Like the HDFS super user, the MapReduce cluster owner is granted all permissions implicitly and should be used rarely, if ever, by administrators.
One or more users may be specified as cluster administrators. These users have all of the same powers as the cluster owner, but do not need to be able to authenticate as the Linux user that started the cluster. Granting users this power allows them to perform administrative operations while still retaining the ability to audit their actions individually, rather than having them use a shared account. Use of a shared account is also discouraged as it creates the need to share authentication credentials.
When a job is submitted to the jobtracker, the user specifies a queue. The queue has an access control list (ACL) associated with it that defines which users and groups may submit jobs, but also which users may administer the queue. Administrative actions may be changing the priority of a job, killing tasks, or even killing entire jobs.
Finally, the job owner is the user who submitted a job. A job owner always has the ability to perform administrative actions on their own jobs.
Defining the cluster owner is a side effect of simply starting the jobtracker, so there’s no explicit configuration of this user, per se. Cluster and queue administrators, on the other hand, are defined by the mapred-site.xml and mapred-queue-acls.xml files, respectively. Since these files define access control, it’s important to make sure the files and their parent directory are writable only by the cluster owner or root. If users will be submitting jobs on the same machine as the jobtracker, the files and their parent directory must be world-readable to allow users to submit jobs, because they contain information used by the client as well as the server.
An access control list in Hadoop is a space-separated pair of comma-separated lists. The first list contains the users that are allowed to perform a given action; the second is the list of groups that are allowed to perform that action. It is possible for either or both of the lists to be empty, in which case no users or groups, respectively, may perform the action. If the ACL is simply an asterisk, everyone is allowed to perform the action. The table in Example 6-8 shows a few examples of access control lists for services.
Example 6-8. Sample Hadoop access control lists and their meanings
ACL | Description |
---|---|
“*” | Everyone is permitted. |
" " (a single space) | No one is permitted. |
“user1,user2 group1,group2,group3” | Users 1 and 2, and groups 1, 2, and 3 are permitted. |
“user1 " (a user followed by a single space) | User 1 is permitted, but no groups. |
" group1” (a single space followed by a group) | Group 1 is permitted, but no users. |
The quotes are not part of the ACL and should not be typed. They are shown here to highlight the space character.
By default, Hadoop MapReduce is wide open; any user may submit a job to any queue, and all users may perform administrative actions. For single-purpose clusters in a trusted environment, this is fine, but for multitenant clusters, this doesn’t work. There are a number of parameters and extra files that need to be configured to define and enable queue ACLs. The first step is to modify mapred-site.xml to enable ACLs, define cluster administrators, and specify the queues that should exist.
mapred.acls.enabled
Access control lists must be globally enabled prior to use. Hadoop does not enforce ACLs, even if they’re defined, if this parameter is not set to true.
Example: true. Default: false
mapred.cluster.administrators
An ACL that defines users and groups that are cluster administrators.
Example: “esammer mradmins,ops” (without quotes). Default: undefined.
mapred.queue.names

Before attaching ACLs to queues, all queues must be predefined. This is done by simply providing a comma-separated list of queue names to the mapred.queue.names parameter. There should be no spaces between names. Some scheduler plug-ins—notably the FIFO scheduler, which is the default—do not support multiple queues. Both the capacity scheduler and fair scheduler plug-ins support multiple queues.

Example: research,production-etl,adhoc. Default: default.
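Putting the three parameters together, a mapred-site.xml fragment enabling queue ACLs might look like the following sketch; the administrator and queue names are the illustrative values from the descriptions above and should be replaced with your own.

<!-- Add these properties to the existing mapred-site.xml configuration. -->
<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>
<property>
  <name>mapred.cluster.administrators</name>
  <value>esammer mradmins,ops</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>research,production-etl,adhoc</value>
</property>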
While the mapred-queue-acls.xml file is reloaded dynamically, changes to mapred-site.xml require daemons be restarted before changes take effect.
The next step is to define ACLs for the queues named in mapred.queue.names. This must be done in the mapred-queue-acls.xml file, also located in the Hadoop configuration directory. Like the primary three configuration files, mapred-queue-acls.xml follows the same format: an XML file with a top-level configuration element, which contains zero or more property elements, each of which is a name-value pair. Each MapReduce queue is configured with an ACL, as described above, in a separate configuration property where the name of the property is mapred.queue.queue-name.privilege and the value is the ACL. The queue-name should be one of the queue names from the mapred.queue.names property, and privilege is one of the available queue privileges (see Table 6-5).
Table 6-5. MapReduce queue privileges
Privilege | Description |
---|---|
acl-submit-job | An ACL that controls who may submit MapReduce jobs to a queue. |
acl-administer-jobs | An ACL that controls who may perform administrative actions on jobs in a queue. |
If, for example, you want to create a production-etl queue and grant user jane and anyone in the prod group access to submit jobs to it, production-etl must be listed in mapred.queue.names in mapred-site.xml, and mapred.queue.production-etl.acl-submit-job should be set to "jane prod" (without quotes) in mapred-queue-acls.xml, as in Example 6-9.
Example 6-9. Sample mapred-queue-acls.xml file
<?xml version="1.0"?>
<configuration>

  <!-- Allow user jane and group prod to submit jobs to the
       production-etl queue. -->
  <property>
    <name>mapred.queue.production-etl.acl-submit-job</name>
    <value>jane prod</value>
  </property>

  <!-- Allow no users, and groups prod and hadoop-admins, to administer
       jobs in the production-etl queue. Note the space after the opening
       value element. This means the user list is empty. The cluster owner,
       cluster admins, and job owner can always administer their jobs. -->
  <property>
    <name>mapred.queue.production-etl.acl-administer-jobs</name>
    <value> prod,hadoop-admins</value>
  </property>

  <!-- Many queues may be defined... -->

</configuration>
The queues described thus far are independent of the capacity and fair scheduler plug-ins, although they are related. None of the three scheduler plug-ins (including the default FIFO scheduler) provides access control over which jobs are submitted to which queues. If you need this kind of functionality, the expectation is that the queue names, as defined by mapred.queue.names, will match the queues configured in the scheduler plug-in. This allows job submission and queue ACLs to be implemented generically, while the order and machines on which tasks are executed are controlled by the plug-in. In Chapter 7, we’ll cover how each of the schedulers may be configured to suit different environments and use cases.
Many of the projects commonly used with Hadoop (the so-called ecosystem projects) rely simply on the HDFS and MapReduce access controls. This makes a lot of sense because some of these tools are just clients of the two Hadoop services. Pig, for instance, is a good example of this. A user writes a Pig script, which is interpreted on the client machine, logical and physical execution plans are created, and a series of MapReduce jobs are submitted to the cluster. The data accessed by the job is commonly a directory of files in HDFS and is therefore subject to the filesystem permissions of the user who submitted the job. In other words, there’s nothing else to which access must be controlled.
But that, of course, is not always the case. Other ecosystem projects do have additional information or higher level constructs that must be secured in multiuser environments. Hive is a good example of this. While also a client-side application, Hive stores additional metadata about the data in HDFS and presents a relational database style abstraction[17] to the end user. This presents an interesting case where, even if we were to restrict a user from accessing the data contained in HDFS, they would still be able to see the metadata—the table names, field names and types, and so forth—managed by Hive. In multitenant environments, this may not be acceptable. Disclosing information about users’ datasets to one another may be sufficient to reveal what they are doing or how. Because of this risk, each of the ecosystem projects, including Hive, is responsible for protecting any metadata it maintains.
Each ecosystem project has different controls for protecting metadata and must be configured independently. It isn’t feasible to cover all of the nuances of each of these projects in detail here; instead, this section provides references for further reading.
Because HiveQL bakes down into MapReduce jobs, operations are subject to all of the access controls of MapReduce queue ACLs and HDFS file permissions. Additionally, Hive provides RDBMS-style users, groups, and roles, as well as GRANT statements to control access to its so-called database objects. This enables fine-grained access control over higher-order constructs like tables, columns, or even views. Of course, this is enforced only at the time the HiveQL query is written and submitted; the files in HDFS have no knowledge of this metadata and, as a result, it can’t be enforced further down the stack. Remember that for MapReduce jobs to work, the user submitting the job must be able to access the files in HDFS. This creates an interesting problem where a user may not have permission to view the data in a column of a table from Hive’s perspective, but the user must have access to the complete files in HDFS, thereby allowing them to circumvent the object-level policy. We’ll see possible solutions to this problem in Tying It Together.
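As a brief illustration, the legacy Hive authorization syntax looks roughly like the following sketch (the role, user, and table names here are hypothetical, and exact syntax varies by Hive version):

CREATE ROLE prod_readers;
GRANT ROLE prod_readers TO USER jane;
-- Grant read access to a single table; object-level grants like this
-- are enforced by Hive, not by HDFS.
GRANT SELECT ON TABLE production_etl_events TO ROLE prod_readers;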
For more information about configuring Hive object-level access control, see http://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization.
HBase is a distributed, column-oriented, low-latency system for storing enormous tables, both in terms of rows and columns. HBase is a huge topic in its own right, but as of version 0.92.x (also included in CDH4), it has support for performing access control checks on each operation issued to a table. These operations include get (reading a record), put (storing a record), scan (reading a range of records), and delete, as well as the administrative or data definition operations like creating tables. HBase security is implemented as a coprocessor—a plug-in that, in many ways, works like a database trigger, firing authorization-checking code before each operation is performed.
Users and groups may be granted permission to perform any combination of the above operations at the table, column family, or column qualifier granularity. In other words, it is possible to restrict access to tables, groups of columns, or individual columns within a group.
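For example, from the HBase shell, grants at each granularity might look roughly like the following (the user, table, column family, and qualifier names are hypothetical, and exact shell syntax varies slightly between versions):

grant 'jane', 'RW', 'orders'                      # read/write on the entire table
grant 'jane', 'R',  'orders', 'summary'           # read-only on one column family
grant 'jane', 'R',  'orders', 'details', 'total'  # read-only on a single column qualifier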
For more information about configuring HBase security, see http://hbase.apache.org/book/security.html or https://ccp.cloudera.com/display/CDH4DOC/HBase+Security+Configuration for CDH users.
Oozie is a workflow system specifically built to work with Hadoop and MapReduce. Users (usually developers) write workflows in an XML language that define one or more MapReduce jobs, their interdependencies, and what to do in the case of failures. These workflows are uploaded to the Oozie server where they are scheduled to run or executed immediately. When Oozie executes a MapReduce job as part of a workflow, it is run by the Oozie server, which keeps track of job-level failures and status.
In a secure environment, the Oozie server must authenticate via Kerberos to submit MapReduce jobs to the cluster. It’s important, however, that these jobs execute as the user that created the workflow and not as the user running the Oozie server. In order to launch MapReduce jobs with the proper identity, each workflow defines the user and group with which the jobs should run. Additionally, in a secure environment, the value of the specified user and group are checked against the identity of the authenticated user submitting the workflow. In other words, if I—authenticated as user esammer—attempt to submit a workflow to Oozie that specifies user janedoe, Oozie will refuse to accept the workflow.
Oozie has a very simple security model based on the notion of users and groups. All users can see all workflows, but may modify only those owned by them or by a group to which they belong. Any number of users may be designated as administrators who, additionally, may modify any workflow and perform administrative functions. With its security features disabled, all users are effectively administrators in Oozie.
For more information about Oozie security, see http://incubator.apache.org/oozie/docs/3.1.3/docs/AG_Install.html or, for CDH, https://ccp.cloudera.com/display/CDH4DOC/Oozie+Security+Configuration.
Hue is a web-based, end user-facing application that can be used to interact with a Hadoop cluster. Out of the box, Hue comes with a job browser, job designer, filesystem browser, and Beeswax, a Hive user interface from which users can execute HiveQL queries. Like most web applications, Hue provides username and password authentication, as well as (Hue) application-level access control lists. Users are assigned to groups, and groups are given permission to launch specific Hue applications. Additionally, users may be marked as superusers, in which case they may launch all applications as well as administer users and privileges. As of version 2.x (included with CDH4), Hue supports synchronization with LDAP services (OpenLDAP and Active Directory, specifically) for both users and groups.
For more information about Hue configuration, see https://ccp.cloudera.com/display/CDH4DOC/Hue+Security+Configuration.
Sqoop is a tool that facilitates bidirectional exchange of data between HDFS and relational databases. With Sqoop, it’s possible to import the contents of a table, the result of a SQL query, or even an entire database, to files in HDFS so they may be used in MapReduce jobs. The inverse operation of exporting data residing in HDFS to tables in an RDBMS is also fully supported for files with a known schema. Sqoop does this by running a MapReduce job with a controlled number of map tasks, each making a number of parallel connections to a database and performing parallel SELECT or INSERT/UPDATE statements (for import to Hadoop and export from Hadoop, respectively). One of Sqoop’s more interesting features is its ability to accept vendor-specific plug-ins for high-performance import and export. These plug-ins take advantage of native features of the RDBMS for faster operations and drop in with no coding required by the Sqoop user. At this time, high-performance plug-ins are available for MySQL, PostgreSQL, Oracle, Teradata, Netezza, Vertica, and SQL Server.
Since Sqoop jobs run as MapReduce jobs, they can be controlled in all the same ways as any other MapReduce job. This includes managing HDFS permissions on files and directories to control access to data. The database credentials must be supplied with each invocation of the sqoop command-line tool.
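A basic import, for instance, might look roughly like the following (the connection string, username, table, and target directory are hypothetical); the -P flag prompts for the database password rather than placing it on the command line:

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_reader -P \
  --table orders \
  --target-dir /data/sales/orders \
  -m 4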
Frequently, users want to perform incremental imports or exports of data. Sqoop supports a number of strategies for determining which rows have changed since the last run, but all of them require maintaining some kind of state between runs. Users may decide to manually preserve this information somewhere, but for recurring jobs in production, this usually isn’t realistic. Another option is to use Sqoop’s saved job feature, which stores both connection credentials and incremental state information in a directory on the local filesystem of the client machine in what is called the metastore. Administrators and users can then execute Sqoop jobs by their saved job name rather than requiring all the command-line options upon each invocation. If saved jobs are used, the metastore directory must be protected using standard filesystem permissions as you would with other sensitive files on the system, especially if you decide to allow Sqoop to also store database passwords.
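A sketch of creating a saved incremental job and protecting the local metastore might look like this (the job, table, and column names are hypothetical; by default the metastore lives under the ~/.sqoop directory):

# Create a saved job that remembers its incremental state between runs.
sqoop job --create orders-incremental -- import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_reader \
  --table orders \
  --incremental append --check-column id --last-value 0

# Execute it by name, then lock down the metastore directory.
sqoop job --exec orders-incremental
chmod -R 700 ~/.sqoop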
To learn more about Sqoop, saved jobs, and managing Sqoop’s metastore, see the full documentation at http://sqoop.apache.org/docs/1.4.1-incubating/index.html or http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html for CDH users.
The Sqoop community is currently working on a proposal for Sqoop 2, which maintains the functionality of Sqoop, but realizes it as a persistent service rather than a command-line tool. For more information on Sqoop 2, be sure to check out https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2.
Apache Flume is a distributed, fault-tolerant, scale-out streaming data collection system commonly used for collecting log events. Unlike many of the other Hadoop ecosystem projects, Apache Flume does not strictly require Hadoop, although it complements HDFS very well. Flume uses the abstraction of sources, channels, and sinks to wire together data flows that transport data from one endpoint to another. Some of the more common sources are an Avro RPC source to which applications can speak directly, syslog, a raw netcat-style source, and a source that executes a command and reads from its standard output. Sources put data into channels, which act as queues that support different strategies affecting durability. Sinks take events from channels and, depending on the implementation, store them in HDFS, forward them to another Flume agent over the network, or insert them into HBase or any other data store. It is also simple to implement custom sources, channels, sinks, and other components within Flume for special use cases.
Flume’s role in the ecosystem is one of data ingestion, and for most users, this means writing to HDFS or HBase. Since data is written to these kinds of systems, it’s important that administrators be able to control the permissions on the resultant files. The permissions with which data is written are dictated by the configured sink and will vary from implementation to implementation. The HDFS sink, for example, uses the Hadoop library and, just as any other HDFS client, takes on the identity of the effective user ID of the process. Flume’s HDFS sink also fully supports Kerberos authentication using a keytab, in which case the principal becomes the owner of files in HDFS.
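A sketch of an HDFS sink configured for Kerberos might look like the following (the agent, channel, path, principal, and keytab names are hypothetical):

# flume.conf (excerpt)
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode.example.com/data/logs
agent1.sinks.hdfs1.hdfs.kerberosPrincipal = flume/collector01.example.com@HADOOP.EXAMPLE.COM
agent1.sinks.hdfs1.hdfs.kerberosKeytab = /etc/flume-ng/conf/flume.keytab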
Flume supports a wide array of data ingest configuration options. For more information, see http://flume.apache.org/FlumeUserGuide.html or http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html for CDH users.
Like Flume, Apache ZooKeeper doesn’t have a direct dependency on Hadoop, but a number of ecosystem projects do make use of it. ZooKeeper is a quorum-based, highly available, distributed coordination service used to build centralized configuration storage, naming, synchronization, leader election, and group membership services. Those building specialized distributed systems may find ZooKeeper appealing, but for many, it is seen as infrastructure for HBase and some features of HDFS.
ZooKeeper organizes data into ZNodes. Like files in a traditional filesystem, ZNodes can store arbitrary bytes, have access control lists associated with them, and can be organized hierarchically. In ZooKeeper, there is no distinction between a “file” and a “directory” ZNode; a single ZNode may have data (like a file) but also have child ZNodes (like a directory). Since ZNodes are very commonly used to store internal state of other applications and critical coordination information, they can be protected by ACLs, as mentioned earlier. In keeping with the filesystem metaphor, ZooKeeper offers a shell where administrators can execute the equivalent of the chmod and chown commands.
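From the ZooKeeper CLI (zkCli.sh), for example, inspecting and setting an ACL might look roughly like this (the znode path and identity are hypothetical; here the sasl scheme ties the ACL to a Kerberos-authenticated user):

getAcl /app/config
setAcl /app/config sasl:jane:cdrwa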
For more information about ZooKeeper, see http://zookeeper.apache.org/doc/r3.4.3/ or http://archive.cloudera.com/cdh4/cdh/4/zookeeper/ for CDH users.
Many of the Hadoop ecosystem projects are client-side tools or languages that bake down to MapReduce jobs. This class of projects, of which Pig, Cascading, and Crunch are all members, runs entirely on a client machine and maintains no external metadata. These projects can be seen as a facade for Hadoop MapReduce, simplifying the programming model and providing a higher level language in which to perform operations. Each with a slightly different way of interpreting the problem of making MapReduce accessible, they are fundamentally similar in their implementation. For Hadoop administrators, these projects are trivial to support as they provide no additional operational overhead beyond MapReduce, and by extension, no additional security constructs.
With an ever-growing ecosystem of projects forming around Hadoop, each with its own requirements, it’s hard to understand how these pieces come together to form a single data processing platform. Beyond simply making these tools function, administrators are responsible for ensuring a singular identification, authentication, and authorization scheme is applied consistently and in accordance with data handling policies. This is a significant challenge, to use a well-worn, albeit applicable, cliche. However, there are a few things one can do to reduce the pain of building a secure, shared Hadoop platform.
Do your homework and make an informed decision about whether or not you need Hadoop security. It is not as simple as setting hadoop.security.authentication to kerberos and getting on with life. The repercussions of Kerberizing a cluster are significant in that every client of the cluster must be able to properly handle authentication. It’s easy to say and difficult to make a reality. Can you trust clients to identify themselves truthfully? If not, you have your work cut out for you. Create a map of all access points to the cluster, understand which must support multiple users through a single process, and create a strategy to handle authentication. There’s no universal answer, although there are some established patterns you can use.
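For reference, the switch itself amounts to a pair of core-site.xml properties, shown here only as a sketch; a real deployment also involves per-daemon principals, keytabs, and many related settings:

<!-- core-site.xml (excerpt) -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>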
Enabling secure mode means that every (OS) user that submits a MapReduce job must have an account on every machine in the cluster. The account may be locked so that it isn’t possible to log in to the machine, but the account must exist and have a valid shell. One way to do this is by brute force: maintain user accounts by hand on each machine. This simply doesn’t scale and will cause trouble for you down the road. Instead, use a configuration management system like Puppet or Chef, use LDAP for centralized account management, or use a commercial tool to maintain accounts. You can also save yourself some heartache by ensuring users have the same UIDs on all machines.
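For example, a locked account with a fixed UID could be created like this on each node (the username and UID are hypothetical; in practice you would drive this from LDAP or a configuration management tool rather than by hand):

useradd -u 1450 -s /bin/bash -c "analytics batch user" jane
passwd -l jane   # lock the password; the account exists but cannot log in interactively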
Set up a local installation of MIT Kerberos and configure the Hadoop cluster as its own realm. Configure each machine in the cluster to use the realm you’ve chosen as its default realm (this will save you endless pain later, as it obviates the need to configure principal rewrite rules). If you have a global identity management system like Active Directory, configure a one-way cross-realm trust from the local KDC to the global system so users can successfully authenticate. While this seems odd at first, Hadoop has proven its ability to perform a distributed denial-of-service attack on corporate identity infrastructure simply by running a MapReduce job. Remember that every RPC requires communication with the KDC in secure mode, and with a large enough cluster running many tasks, it’s possible to create a noticeable impact on shared infrastructure (one such incident rendered users incapable of authenticating for email for a short time during an unusual spike in Hadoop activity).
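A minimal krb5.conf sketch for such a setup might look like the following (the realm and KDC hostnames are hypothetical; the cross-realm trust itself additionally requires matching krbtgt principals in both realms):

# /etc/krb5.conf (excerpt)
[libdefaults]
  default_realm = HADOOP.EXAMPLE.COM

[realms]
  HADOOP.EXAMPLE.COM = {
    kdc = kdc01.hadoop.example.com
    admin_server = kdc01.hadoop.example.com
  }

[domain_realm]
  .hadoop.example.com = HADOOP.EXAMPLE.COM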
Remember that the HDFS directory structure, like any other shared filesystem, is not just an organizational construct, but the way you enforce access to data. Think about applications and access, and design accordingly. Like a relational database schema, the filesystem directory structure can be difficult to change for production systems. This is an area where administrators and developers must work in tandem.
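As a simple sketch (the directory, user, and group names are hypothetical), a per-tenant area might be laid out with standard HDFS permissions:

hadoop fs -mkdir /data/production-etl
hadoop fs -chown jane:prod /data/production-etl
hadoop fs -chmod 750 /data/production-etl    # owner and group only; no access for others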
In scenarios where fine-grained access control is required, it is useful to create gateway services. A gateway (or sometimes, proxy) service is a daemon that receives a (usually higher-level) request and performs an action on behalf of a user when allowing the user to directly perform the action wouldn’t provide enough control.
A gateway, for instance, can be used to simulate column-level access to files. Because all columns are stored in a single file, and the file is the level of granularity at which access can be controlled, it is necessary to intercept operations, apply the necessary checks, execute a modified version of the user’s request, and place the result in a location the user can access. It’s imperative that the user not be given direct access to the source file, as this would allow access control to be circumvented. If, for instance, the preferred method of access was Hive, the gateway service could take a Hive query, rewrite it, and submit it to Hive proper as a privileged user, on the user’s behalf.[18] In many ways, this is the same idea as the sudo command, which is familiar to many administrators. This does require custom development and application-specific knowledge, so it’s hard to make a recommendation that’s generally applicable, but the technique is what is important.
[15] IETF RFC 4120 - The Kerberos Network Authentication Service (version 5) - http://tools.ietf.org/html/rfc4120
[16] You can install the MIT Kerberos 1.9 client and server packages on CentOS 6.2 using the commands yum install krb5-workstation and yum install krb5-server, respectively.
[17] This does not mean Hive makes Hadoop a relational database. Instead, it presents a familiar SQL-like interface to users to query data, but its execution engine is still MapReduce and subject to all the inherent functional and performance characteristics therein.
[18] Some may note that Hive has basic constructs for controlling table level access as well as creating views, so it should be possible to create a view that doesn’t select the restricted column and give access to that instead. The problem with this is that since the user’s MapReduce job generated by Hive runs as them, they still must have access to the underlying file, which means if they were to go to the command line, they could simply read the original file, circumventing the higher level controls.