== Flume and HDFS Security Integration
NOTE: This section is only required if you are using a Kerberized HDFS
cluster. If you are running CDH3b2, or Hadoop version 0.21.x or
earlier, you can safely skip this section.
Flume's datapath needs to be able to interact with "secured" Hadoop
and HDFS. The Hadoop and HDFS designers have chosen to use the
Kerberos V5 system and protocols to authenticate communications
between clients and services. Hadoop clients include users and MR jobs
acting on behalf of users; services include HDFS and MapReduce.
In this section we will describe how to set up a Flume node to be a
client, as user 'flume', of a kerberized HDFS service. This section
will *not* talk about securing the communications between Flume nodes
and Flume masters, or the communications between Flume nodes in a
Flume flow. The current implementation does not support writing
individual isolated flows as different users.
NOTE: This has only been tested with the security enhanced betas of
CDH (CDH3b3+), and the MIT Kerberos 5 implementation.
=== Basics
Flume will act as a particular Kerberos principal (user) and needs
credentials. The Kerberos credentials are needed in order to interact
with the kerberized service.
There are two ways you can get credentials. The first is used by
interactive users because it requires an interactive logon. The
second is generally used by services (like a Flume daemon) and uses a
specially protected key table file called a 'keytab'.
Interactively using the +kinit+ program to contact the Kerberos KDC
(key distribution center) is one way to prove your identity. This
approach requires a user to enter a password. To do this you need a
two-part principal set up in the KDC, which is generally of the form
+user@REALM.COM+. Logging in via +kinit+ will grant a ticket granting
ticket (TGT) which can be used to authenticate with other services.
NOTE: This user needs to have an account on the namenode machine as
well -- Hadoop uses the user and group information from that machine
when authorizing access.
Authenticating a user or a service can alternately be done using a
specially protected 'keytab' file. This file contains the principal's
keys, which are used to obtain a ticket granting ticket (TGT) and
mutually authenticate the client and the service via the Kerberos KDC.
NOTE: The keytab approach is similar to a "password-less" ssh
connection. In this case, instead of an id_rsa private key file, the
service has a keytab entry with its private key.
Because a Flume node daemon is usually started unattended (via service
script), it needs to log in using the keytab approach. When using a
keytab, the Hadoop services require a three-part principal. This has
the form +user/host.com@REALM.COM+. We recommend using +flume+ as the
user and the hostname of the machine as the service. Assuming that
Kerberos and kerberized Hadoop have been properly set up, you just need
to add a few parameters to the Flume node's property file
(flume-site.xml).
----
<property>
  <name>flume.kerberos.user</name>
  <value>flume/host1.com@REALM.COM</value>
</property>

<property>
  <name>flume.kerberos.keytab</name>
  <value>/etc/flume/conf/keytab.krb5</value>
</property>
----
In this case, +flume+ is the user, +host1.com+ is the service, and
+REALM.COM+ is the Kerberos realm. The +/etc/flume/conf/keytab.krb5+
file contains the keys necessary for +flume/host1.com@REALM.COM+ to
authenticate with other services.
Flume and Hadoop provide a simple keyword (+_HOST+) that gets expanded
to the host name of the machine the service is running on. This
allows you to have one flume-site.xml file with the same
flume.kerberos.user property on all of your machines.
----
<property>
  <name>flume.kerberos.user</name>
  <value>flume/_HOST@REALM.COM</value>
</property>
----
You can test whether your Flume node is properly set up by running
the following command.
----
flume node_nowatch -1 -n dump -c 'dump: console | collectorSink("hdfs://kerb-nn/user/flume/%Y%m%D-%H/","testkerb");'
----
This should write data entered at the console to a kerberized HDFS
with a namenode named kerb-nn, into a +/user/flume/YYmmDD-HH/+
directory.
If this fails, you may need to check whether Flume's Hadoop settings
(in core-site.xml and hdfs-site.xml) are consistent with your
kerberized Hadoop cluster's settings.
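For comparison, a client configuration for a kerberized cluster typically has Kerberos authentication enabled in core-site.xml. The property names below are standard Hadoop security settings; the values shown are placeholders to check against your site's files:

```xml
<!-- core-site.xml fragment: standard Hadoop security properties.
     Values are placeholders; match them to your cluster. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>  <!-- the default is "simple" (no Kerberos) -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>      <!-- enable service-level authorization -->
</property>
```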
=== Setting up Flume users on Kerberos
NOTE: These instructions are for MIT Kerb5.
There are several requirements for a "properly set up" Kerberos +
HDFS + Flume deployment.

* You need a principal for the Flume user on each machine.
* You need a keytab that has keys for each principal on each machine.
Much of this setup can be done by using the +kadmin+ program, and
verified using the +kinit+, +kdestroy+, and +klist+ programs.
==== Administering Kerberos principals
First you need to have permissions to use the +kadmin+ program and the
ability to add principals to the KDC.
----
$ kadmin -p -w
----
If you entered this correctly, it will drop you into the kadmin prompt:
----
kadmin:
----
Here you can add a Flume principal to the KDC:
----
kadmin: addprinc flume
WARNING: no policy specified for flume@REALM.COM; defaulting to no policy
Enter password for principal "flume@REALM.COM":
Re-enter password for principal "flume@REALM.COM":
Principal "flume@REALM.COM" created.
kadmin:
----
You also need to add principals with hosts for each Flume node that
will directly write to HDFS. Since you will be exporting the key to a
keytab file, you can use the +-randkey+ option to generate a random key.
----
kadmin: addprinc -randkey flume/host.com
WARNING: no policy specified for flume/host.com@REALM.COM; defaulting to no policy
Principal "flume/host.com@REALM.COM" created.
kadmin:
----
NOTE: Hadoop's Kerberos implementation requires a three-part principal
name -- +user/host@REALM.COM+. As an interactive user you usually only
need the two-part name, +user@REALM.COM+.
You can verify that the principal has been added by using the +kinit+
program and entering the password you selected. Next you can verify
that you have your ticket granting ticket (TGT) loaded.
----
$ kinit flume/host.com
Password for flume/host.com@REALM.COM:
$ klist
Ticket cache: FILE:/tmp/krb5cc_1016
Default principal: flume/host.com@REALM.COM
Valid starting Expires Service principal
09/02/10 18:59:38 09/03/10 18:59:38 krbtgt/REALM.COM@REALM.COM
Kerberos 4 ticket cache: /tmp/tkt1016
klist: You have no tickets cached
$
----
You can ignore the Kerberos 4 info. To "logout" you can use the
+kdestroy+ command, and then verify that credentials are gone by
running +klist+.
----
$ kdestroy
$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1016)
Kerberos 4 ticket cache: /tmp/tkt1016
klist: You have no tickets cached
$
----
Next, to enable automatic logins, we can create a keytab file so that
a password does not need to be entered manually.
WARNING: This keytab file contains secret credentials that should be
protected so that only the proper user can read it. After it is
created, it should be in 0400 mode (-r--------) and owned by the user
running the Flume process.
You can then generate a keytab file (in this example called
+flume.keytab+) and add the principal +flume/host.com+ to it.
----
kadmin: ktadd -k flume.keytab flume/host.com
----
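After exporting, lock the file down as described in the warning above. This sketch assumes the keytab sits in the current directory; in production, use the real path and chown it to the account that runs the Flume node:

```shell
# Sketch: protect the exported keytab, assuming it is in the current
# directory. In production, use the real path (e.g. /etc/flume/conf/)
# and chown the file to the Flume account (chown flume flume.keytab).
touch flume.keytab        # stand-in for the file kadmin's ktadd wrote
chmod 0400 flume.keytab   # owner read-only; no access for group/others
ls -l flume.keytab        # permissions column should read -r--------
```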
NOTE: This will invalidate the ability of flume/host.com to log in
manually, because exporting the keys changes them. You could, however,
have a separate Flume principal that does not use a keytab and can
still log in interactively.
WARNING: +ktadd+ can add keytab entries for multiple principals into a
single file, allowing a single keytab file with many keys. This
however weakens the security stance and may make revoking credentials
from misbehaving machines difficult. Please consult with your
security administrator when assessing this risk.
You can verify the names and the version (KVNO) of the keys by running
the following command.
----
$ klist -Kk flume.keytab
Keytab name: FILE:flume.keytab
KVNO Principal
---- --------------------------------------------------------------------------
5 flume/host.com@REALM.COM (0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa)
5 flume/host.com@REALM.COM (0xbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb)
5 flume/host.com@REALM.COM (0xcccccccccccccccc)
5 flume/host.com@REALM.COM (0xdddddddddddddddd)
----
You should see a few entries and your corresponding keys in hex after
your principal names.
Finally, you can use +kinit+ with the +flume@REALM.COM+ principal to
interactively do a Kerberos login and use the Hadoop commands to browse HDFS.
----
$ kinit flume
Password for flume@REALM.COM: <-- enter password
$ hadoop dfs -ls /user/flume/
----
////
Windows instructions.
----
ktpass // windows generate keytab file
----
////