Kerberos and HDFS Encryption

User security has three parts: authentication, authorization, and audit. Authentication simply means verifying that users are who they claim to be. There are three factors of authentication: who you are, what you know, and what you have. You may have heard the term two-factor authentication everywhere. The more factors you use, the more secure the authentication is. More factors also mean more inconvenience; otherwise, three-factor authentication would be used everywhere.
Let's understand it with a few examples. Say you go to an ATM to withdraw money. How many factors are used? You pull out your ATM card (what you have), insert it, and enter your PIN (what you know). This is two-factor authentication. How about online banking? You enter your username/password (what you know) and you are logged in, so only one factor is used. This is why, for commercial banking, banks give you a mobile token (what you have) that generates a unique code each time, called a one-time password (OTP). These days, banks also send a code to your mobile device via text message to enable the second factor of authentication. Hadoop and Spark use MIT Kerberos for authentication.

How Kerberos Works
Let's understand Kerberos with an example.
Let's say you (the client) want to go to a multiplex and watch a movie. It is a special type of multiplex where you may be allowed to watch one movie, two movies, n movies, or all the movies, depending on your special ticket. You get this special ticket from a counter outside called the authentication service (AS). Since this special ticket gives you the power to get an actual ticket, let's call it a ticket-granting ticket (TGT). To get a regular ticket to watch a movie, you show the TGT at a special counter called the ticket-granting server (TGS), and the TGS issues you a ticket (or service ticket). You present the service ticket at the specific theater it is valid for, and you are allowed in. The combination of AS and TGS is called a key distribution center (KDC). The beauty of the TGT is that you do not have to go outside the multiplex, stand in line, and show your credit card every time.

Kerberos also needs an authentication realm, which is simply the domain name fully capitalized. So for voidio.com, the realm is VOIDIO.COM. Each server in the Kerberos authentication realm should have an FQDN, for example dn5.voidio.com, and it should be both forward resolvable (FQDN resolving to IP address) and reverse resolvable (IP address resolving to FQDN).

Setting Up Kerberos for Authentication

The first step in setting up Kerberos is setting up the key distribution center (KDC), as described below:

1. Make sure the /etc/hosts file reflects the correct fully qualified domain name (FQDN) (for example, sandbox.voidio.com):

$ cat /etc/hosts
...
127.0.0.1 localhost sandbox.voidio.com
...

2. Install the KDC and the admin server:
$ sudo apt-get install krb5-kdc krb5-admin-server

3. During the installation, you will be prompted for the following (example values shown):

Default Kerberos version 5 realm: VOIDIO.COM
Kerberos servers for your realm: voidio.com
Administrative server for your Kerberos realm: voidio.com

4. Check/validate krb5.conf:
$ cat /etc/krb5.conf
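
As a rough sketch, for the realm used in this walkthrough the relevant portions of /etc/krb5.conf would look something like the following (the KDC and admin server host names are assumptions based on the values entered during installation):

[libdefaults]
    default_realm = VOIDIO.COM

[realms]
    VOIDIO.COM = {
        kdc = voidio.com
        admin_server = voidio.com
    }

[domain_realm]
    .voidio.com = VOIDIO.COM
    voidio.com = VOIDIO.COM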

5. Configure the Kerberos server. Before you begin, a new realm must be created, and this step can take a long time because it needs to gather enough entropy to generate the realm's master key. You can expedite the process by feeding the kernel's random pool from a pseudo-random source (do not do this in production):

$ sudo apt-get install rng-tools -y
$ sudo rngd -r /dev/urandom -o /dev/random #not for production though

6. Create a new realm:
$ sudo krb5_newrealm
#enter a master key password and keep it safe or remember

7. The Kerberos realm is administered using the kadmin utility.
Running kadmin.local as the root user on the KDC allows the administrator to authenticate without an existing principal. To add a new principal:

$ sudo kadmin.local

8. Add the infouser principal:

kadmin.local: addprinc infouser
Enter password for principal "infouser@VOIDIO.COM":
Re-enter password for principal "infouser@VOIDIO.COM":
Principal "infouser@VOIDIO.COM" created.
kadmin.local: quit

9. To test the newly created principal, use the kinit command. If the
Kerberos setup is correct, the following command prompts for the password and returns no error:

$ kinit infouser@VOIDIO.COM
Password for infouser@VOIDIO.COM:
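
To confirm that a ticket-granting ticket (TGT) was actually obtained, list the credentials cache with klist; you should see an entry for krbtgt/VOIDIO.COM@VOIDIO.COM along with its validity period:

$ klist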

Secure Mode:
When Hadoop is configured to run in secure mode, each Hadoop service and each user must be authenticated by Kerberos. Forward and reverse host lookup for all service hosts must be configured correctly to allow services to authenticate with each other. Host lookups may be configured using either DNS or /etc/hosts files. 
When service level authentication is turned on, end users must authenticate themselves before interacting with Hadoop services. The simplest way is for a user to authenticate interactively using the Kerberos kinit command. Programmatic authentication using Kerberos keytab files (which store long-term keys for one or more principals) may be used when interactive login with kinit is infeasible.
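
As a sketch, a service or headless user can authenticate non-interactively from a keytab; the keytab path below is a placeholder for this example, while the principal matches the one created earlier:

$ kinit -k -t /etc/security/keytab/infouser.keytab infouser@VOIDIO.COM

The ticket can then be renewed or re-acquired from cron or a startup script without any password prompt.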

Ensure that the HDFS and YARN daemons run as different Unix users, e.g. hdfs and yarn. Also ensure that the MapReduce JobHistory server runs as a separate user, such as mapred. A typical arrangement is shown below, followed by a sketch of how these accounts can be created.

User:Group       Daemons
hdfs:hadoop      NameNode, Secondary NameNode, JournalNode, DataNode
yarn:hadoop      ResourceManager, NodeManager
mapred:hadoop    MapReduce JobHistory Server
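
One way to create these accounts on a Linux host is sketched below; the user and group names match the table above, though your distribution packages or cluster management tooling may already create them:

$ sudo groupadd hadoop
$ sudo useradd -g hadoop hdfs
$ sudo useradd -g hadoop yarn
$ sudo useradd -g hadoop mapred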

Each Hadoop service instance must be configured with its Kerberos principal and keytab file location; a sample configuration sketch follows the keytab listings below.
The NameNode keytab file, on each NameNode host, should look like the following:

$ klist -e -k -t /etc/security/keytab/nn.service.keytab

KVNO Timestamp         Principal
   4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
   4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)

The Secondary NameNode keytab file, on that host:
$ klist -e -k -t /etc/security/keytab/sn.service.keytab

The DataNode keytab file, on each DataNode host:
$ klist -e -k -t /etc/security/keytab/dn.service.keytab

The ResourceManager keytab file, on the ResourceManager host:
$ klist -e -k -t /etc/security/keytab/rm.service.keytab

The NodeManager keytab file, on each NodeManager host:
$ klist -e -k -t /etc/security/keytab/nm.service.keytab

The MapReduce JobHistory Server keytab file, on that host:
$ klist -e -k -t /etc/security/keytab/jhs.service.keytab
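
As a sketch of how a service is pointed at its principal and keytab, the NameNode entries in hdfs-site.xml look roughly like the following (_HOST is expanded to the local FQDN at runtime; the keytab path matches the listings above, and analogous properties exist for the other daemons):

<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>nn/_HOST@REALM.TLD</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/security/keytab/nn.service.keytab</value>
</property>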

KDiag: Diagnose Kerberos Problems
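Hadoop ships with a KDiag utility that helps diagnose Kerberos setup problems by dumping the client's Kerberos environment and validating basic settings. A minimal invocation, run as the user whose setup you want to check, looks like this:

$ hadoop org.apache.hadoop.security.KDiag
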
HDFS Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol.

Client <-> NameNode

The connection between NameNode and the client is governed by the Client Protocol documented in …\hdfs\protocol\ClientProtocol.java.  A client establishes a connection to a configurable TCP port on the NameNode machine. This is an RPC connection.
By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients. The data transferred between Hadoop services and clients can be encrypted on the wire; setting hadoop.rpc.protection to privacy in core-site.xml activates this encryption.
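
For example, the relevant core-site.xml entry is sketched below (valid values are authentication, integrity, and privacy):

<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>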

NameNode <-> DataNode

All communication between the NameNode and a DataNode is initiated by the DataNode and responded to by the NameNode. The NameNode never initiates communication to a DataNode, although its responses may include commands that cause the DataNode to send further communications. The DataNode sends information to the NameNode through four major interfaces defined in the DataNodeProtocol. These four are:

1) DataNode registration. The DataNode informs the NameNode of its existence, and the NameNode returns a registration ID. This registration ID is a parameter of other DataNode functions.
Registration is triggered when a new DataNode is started, an existing one is restarted, or a new NameNode is started.

2) DataNode sends heartbeat. The DataNode sends a heartbeat message every few seconds, which includes statistics about its capacity and current activity. The NameNode returns a list of block-oriented commands for the DataNode to execute. These commands primarily consist of instructions to transfer blocks to other DataNodes for replication, or instructions to delete blocks. The NameNode can also request an immediate block report from the DataNode, but this is only done to recover from severe problems.

3) DataNode sends block report. The DataNode periodically reports the blocks contained in its storage. The reporting period is typically configured to be hourly.

4) DataNode notifies BlockReceived. The DataNode reports that it has received a new block, either from a client (during a file write) or from another DataNode (during replication). It reports each block immediately upon receipt.

Client <-> DataNode

A client communicates with a DataNode directly to transfer (send/receive) data using the DataTransferProtocol, defined in DataTransferProtocol.java. For performance reasons this protocol is a streaming protocol, not RPC. The client buffers data until a full block (64 MB in older releases, 128 MB by default in current Hadoop releases) has been accumulated, and the block is then streamed to the DataNode.

The DataTransferProtocol defines operations to read a block (opReadBlock()), write a block (opWriteBlock()), replace a block (opReplaceBlock()), copy a block (opCopyBlock()), and to get a block’s Checksum (opBlockChecksum()).
Because the DataNode data transfer protocol does not use the Hadoop RPC framework, DataNodes must authenticate themselves using privileged ports, which are specified by dfs.datanode.address and dfs.datanode.http.address. This authentication is based on the assumption that an attacker won't be able to get root privileges on DataNode hosts. To activate encryption for the DataNode data transfer protocol (block data transfer), set dfs.encrypt.data.transfer to true in hdfs-site.xml. As noted above, RPC traffic between Hadoop services and clients is encrypted on the wire by setting hadoop.rpc.protection to privacy in core-site.xml. Data transfer between the web consoles and clients is protected using SSL (HTTPS); SSL configuration is recommended but not required when configuring Hadoop security with Kerberos.
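
A sketch of the corresponding hdfs-site.xml entry:

<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>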

By default, the Hadoop HTTP web consoles (ResourceManager, NameNode, NodeManagers, and DataNodes) allow access without any form of authentication. They can be configured to require Kerberos authentication using the HTTP SPNEGO protocol (supported by browsers such as Firefox and Internet Explorer).
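
As a sketch, SPNEGO for the web consoles is enabled in core-site.xml roughly as follows (the principal and keytab path are placeholders; in addition, org.apache.hadoop.security.AuthenticationFilterInitializer must be listed in hadoop.http.filter.initializers):

<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@VOIDIO.COM</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <value>/etc/security/keytab/spnego.service.keytab</value>
</property>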

Transparent Encryption in HDFS

HDFS implements transparent, end-to-end encryption. Once configured, data read from and written to special HDFS directories is transparently encrypted and decrypted without requiring changes to user application code. This encryption is also end-to-end, which means the data can only be encrypted and decrypted by the client. HDFS never stores or has access to unencrypted data or unencrypted data encryption keys. This satisfies two typical requirements for encryption: at-rest encryption (meaning data on persistent media, such as a disk) as well as in-transit encryption (e.g. when data is travelling over the network).
Data encryption is required by a number of different government, financial, and regulatory entities. For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations. Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.

For transparent encryption, we introduce a new abstraction to HDFS: the encryption zone. An encryption zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read. Each encryption zone is associated with a single encryption zone key which is specified when the zone is created. Each file within an encryption zone has its own unique data encryption key (DEK). DEKs are never handled directly by HDFS. Instead, HDFS only ever handles an encrypted data encryption key (EDEK). Clients decrypt an EDEK, and then use the subsequent DEK to read and write data. HDFS datanodes simply see a stream of encrypted bytes.
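
As a sketch, assuming a KMS is already configured and running, an encryption zone is created by generating a zone key and then marking an empty directory as a zone; the key name and path below are placeholders:

$ hadoop key create mykey
$ hdfs dfs -mkdir /secure_zone
$ hdfs crypto -createZone -keyName mykey -path /secure_zone
$ hdfs crypto -listZones

Files written under /secure_zone (for example with hdfs dfs -put) are then encrypted and decrypted transparently for authorized clients.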

A new cluster service is required to manage encryption keys: the Hadoop Key Management Server (KMS). The Hadoop KMS is a cryptographic key management server based on Hadoop's KeyProvider API. It provides client and server components that communicate over HTTP using a REST API. The client is a KeyProvider implementation that interacts with the KMS using the KMS HTTP REST API. The KMS backing KeyProvider properties are configured in the etc/hadoop/kms-site.xml configuration file.
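
A sketch of that configuration, using the file-based Java keystore provider as the backing store:

<property>
  <name>hadoop.kms.key.provider.uri</name>
  <value>jceks://file@/${user.home}/kms.keystore</value>
</property>

Clients locate the KMS through a KeyProvider URI, for example in core-site.xml (the host and port below are placeholders and depend on your KMS deployment):

<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://http@kms.voidio.com:9600/kms</value>
</property>
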
When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone’s key. The EDEK is then stored persistently as part of the file’s metadata on the NameNode.
When reading a file within an encryption zone, the NameNode provides the client with the file’s EDEK and the encryption zone key version used to encrypt the EDEK. The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version. Assuming that is successful, the client uses the DEK to decrypt the file’s contents.
All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.
