Security contains three parts: Authentication, Authorization, and Audit. Authentication simply means verifying that users are who they claim to be. There are three factors of authentication: who you are, what you know, and what you have. You may have heard the term two-factor authentication everywhere. The more factors you use, the more secure the authentication is. More factors also mean more inconvenience; otherwise, three-factor authentication would be used everywhere.
Let's understand it with a few examples. Say you go to an ATM to withdraw money. How many factors are used? You pull out your ATM card (what you have), insert it, and enter your PIN (what you know). This is two-factor authentication. How about online banking? You enter your username/password (what you know) and you are logged in, so only one factor is used. This is why, for commercial banking, banks give you a mobile token (what you have) that generates a unique code each time, called a one-time password (OTP). These days, banks also send a code to your mobile device via text message to enable the second factor of authentication. Hadoop and Spark use MIT Kerberos for authentication.
Let's understand Kerberos with an example. Say you (the client) want to go to a multiplex and watch a movie. It is a special type of multiplex where you may be allowed to watch one movie, two movies, n movies, or all the movies, depending on your special ticket. You get this special ticket from a counter outside called the authentication service (AS). Since this special ticket gives you the power to get an actual ticket, let's call it a ticket-granting ticket (TGT). To get a regular ticket to watch a movie, you show the TGT at a special counter called the ticket-granting server (TGS), and the TGS issues you a ticket (or service ticket). You present the service ticket at the particular theater it is valid for and you are allowed in. The combination of AS and TGS is called a key distribution center (KDC). The beauty of the TGT is that you do not have to go outside the multiplex, stand in line, and show your credit card every time. Kerberos also needs an authentication realm, which is simply the domain name fully capitalized; for voidio.com, it will be VOIDIO.COM. Each server in the Kerberos authentication realm should have an FQDN, for example dn5.voidio.com, and it should be forward resolvable (FQDN resolving to IP address) and reverse resolvable (IP address resolving to FQDN).
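A quick way to sanity-check this requirement is to test both lookups from each node. This is only a sketch: dn5.voidio.com comes from the example above, and the IP address shown is a placeholder for whatever the forward lookup actually returns.
$ getent hosts dn5.voidio.com   # forward lookup: FQDN -> IP (honours both DNS and /etc/hosts)
$ getent hosts 192.168.1.5      # reverse lookup: IP -> FQDN (use the address returned above)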
Setting up Kerberos to do authentication
The first step in setting up Kerberos is setting up the key distribution center (KDC), as described below:
1. Make sure the /etc/hosts file reflects the correct fully qualified domain name (FQDN) (for example, sandbox.voidio.com):
$ cat /etc/hosts
...
127.0.0.1 localhost voidio.com
...
2. Install the KDC and admin server packages:
$ sudo apt-get install krb5-kdc krb5-admin-server
3. During the installation, the installer will ask for:
Default Kerberos version 5 realm: VOIDIO.COM
Kerberos servers for your realm: voidio.com
Administrative server for your Kerberos realm: voidio.com
4. Check/validate krb5.conf:
$ cat /etc/krb5.conf
5. Configure the Kerberos server. Before you begin, a new realm must be created; this step normally takes a long time because it needs to gather enough entropy. You can expedite the process by feeding the random number generator from /dev/urandom:
$ sudo apt-get install rng-tools -y
$ sudo rngd -r /dev/urandom -o /dev/random   # not for production though
6. Create a new realm:
$ sudo krb5_newrealm
# enter a master key password and keep it safe
7. The Kerberos realm is administered using the kadmin utility. Running kadmin.local as the root user on the KDC allows the administrator to authenticate without having an existing principal. To add a new principal:
$ sudo kadmin.local
8. Add the infouser principal:
kadmin.local: addprinc infouser
Enter password for principal "infouser@VOIDIO.COM":
Re-enter password for principal "infouser@VOIDIO.COM":
Principal "infouser@VOIDIO.COM" created.
kadmin.local: quit
9. To test the newly created principal, use the kinit command. A successful Kerberos setup will return no error for the following command:
$ kinit infouser@VOIDIO.COM
Password for infouser@VOIDIO.COM:
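To confirm that a ticket-granting ticket was actually obtained, you can list the credential cache:
$ klist   # shows the default principal and the cached ticket-granting ticket with its validity times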
Secure Mode
When Hadoop is configured to run in secure mode, each Hadoop service and each user must be authenticated by Kerberos. Forward and reverse host lookup for all service hosts must be configured correctly to allow services to authenticate with each other. Host lookups may be configured using either DNS or /etc/hosts files.
When service-level authentication is turned on, end users must authenticate themselves before interacting with Hadoop services. The simplest way is for a user to authenticate interactively using the Kerberos kinit command. Programmatic authentication using Kerberos keytab files (which store long-term keys for one or more principals) may be used when interactive login with kinit is infeasible.
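As a sketch of such a non-interactive login (the keytab path is a placeholder, not from the source), kinit can obtain a ticket directly from a keytab:
$ kinit -kt /etc/security/keytab/infouser.keytab infouser@VOIDIO.COM
$ klist   # verify that the ticket was obtained from the keytab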
Ensure that the HDFS and YARN daemons run as different Unix users, e.g. hdfs and yarn. Also ensure that the MapReduce JobHistory Server runs as a different user, such as mapred.
User:Group    | Daemons
hdfs:hadoop   | NameNode, Secondary NameNode, JournalNode, DataNode
yarn:hadoop   | ResourceManager, NodeManager
mapred:hadoop | MapReduce JobHistory Server
Each Hadoop service instance must be configured with its Kerberos principal and keytab file location.
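For the NameNode, for example, this is typically done in hdfs-site.xml with the properties below. This is only a sketch: the keytab path and realm reuse the example values from this post, and _HOST is expanded by Hadoop to each node's FQDN.
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/security/keytab/nn.service.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>nn/_HOST@VOIDIO.COM</value>
</property>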
The NameNode keytab file, on each NameNode host, should look like the following:
$ klist -e -k -t /etc/security/keytab/nn.service.keytab
Keytab name: FILE:/etc/security/keytab/nn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
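Such a keytab might be produced on the KDC roughly as follows. This is a hedged sketch using the kadmin.local utility introduced earlier; the principal names and keytab path simply mirror the listing above.
$ sudo kadmin.local
kadmin.local: addprinc -randkey nn/full.qualified.domain.name@REALM.TLD
kadmin.local: addprinc -randkey host/full.qualified.domain.name@REALM.TLD
kadmin.local: xst -k /etc/security/keytab/nn.service.keytab nn/full.qualified.domain.name@REALM.TLD host/full.qualified.domain.name@REALM.TLD
kadmin.local: quit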
The Secondary NameNode keytab file:
$ klist -e -k -t /etc/security/keytab/sn.service.keytab
The DataNode keytab file, on each host:
$ klist -e -k -t /etc/security/keytab/dn.service.keytab
The ResourceManager keytab file, on the ResourceManager host:
$ klist -e -k -t /etc/security/keytab/rm.service.keytab
The NodeManager keytab file, on each host:
$ klist -e -k -t /etc/security/keytab/nm.service.keytab
The MapReduce JobHistory Server keytab file, on that host:
$ klist -e -k -t /etc/security/keytab/jhs.service.keytab
KDiag: Diagnose Kerberos Problems
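Hadoop ships a KDiag utility for debugging Kerberos setup problems. As a sketch, it can be invoked by class name on any node and will print the Kerberos-related environment and configuration it finds:
$ hadoop org.apache.hadoop.security.KDiag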
All HDFS communication protocols are layered on top of the TCP/IP protocol.
Client <-> NameNode
The connection between a client and the NameNode is governed by the ClientProtocol, documented in …\hdfs\protocol\ClientProtocol.java. A client establishes a connection to a configurable TCP port on the NameNode machine; this is an RPC connection. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients. The data transferred between Hadoop services and clients can be encrypted on the wire: setting hadoop.rpc.protection to privacy in core-site.xml activates data encryption.
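As a sketch, the corresponding core-site.xml entry looks like this (the privacy level layers encryption on top of authentication and integrity protection for RPC traffic):
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>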
NameNode <-> DataNode
All communication between the NameNode and a DataNode is initiated by the DataNode and responded to by the NameNode. The NameNode never initiates communication to the DataNode, although NameNode responses may include commands that cause the DataNode to send further communications. The DataNode sends information to the NameNode through four major interfaces defined in the DataNodeProtocol:
1) DataNode registration. The DataNode informs the NameNode of its existence. The NameNode returns its registration ID, which is a parameter of other DataNode functions. Registration is triggered when a new DataNode is initiated, an old one is re-initiated, or when a new NameNode is initiated.
2) DataNode sends heartbeat. The DataNode sends a heartbeat message every few seconds. This includes some statistics about capacity and current activity. The NameNode returns a list of block-oriented commands for the DataNode to execute. These commands primarily consist of instructions to transfer blocks to other DataNodes for replication purposes, or instructions to delete blocks. The NameNode can also command an immediate block report from the DataNode, but this is only done to recover from severe problems.
3) DataNode sends block report. The DataNode periodically reports the blocks contained in its storage. The period is typically configured to hourly. (A block report can also be triggered manually; see the sketch after this list.)
4) DataNode notifies BlockReceived. The DataNode reports that it has received a new block, either from a client (during file write) or from another DataNode (during replication). It reports each block immediately upon receipt.
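As a sketch of that manual trigger (the host:port is a placeholder for a DataNode's IPC address), the dfsadmin tool can ask a specific DataNode to send a block report immediately:
$ hdfs dfsadmin -triggerBlockReport dn5.voidio.com:9867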
Client <-> DataNode
A client communicates with a DataNode directly to transfer (send/receive) data using the DataTransferProtocol, defined in DataTransferProtocol.java. For performance purposes this protocol is a streaming protocol, not RPC. The client buffers data until a full block (the default is 64 MB) has been created and then the block is streamed to the DataNode. The DataTransferProtocol defines operations to read a block (opReadBlock()), write a block (opWriteBlock()), replace a block (opReplaceBlock()), copy a block (opCopyBlock()), and get a block's checksum (opBlockChecksum()).
Because the DataNode data transfer protocol does not use the Hadoop RPC framework, DataNodes must authenticate themselves using privileged ports, which are specified by dfs.datanode.address and dfs.datanode.http.address. This authentication is based on the assumption that the attacker won't be able to get root privileges on DataNode hosts. You need to set dfs.encrypt.data.transfer to true in hdfs-site.xml in order to activate data encryption for the DataNode data transfer protocol (block data transfer). As noted above, RPC traffic between Hadoop services and clients is encrypted on the wire by setting hadoop.rpc.protection to privacy in core-site.xml. Data transfer between the web consoles and clients is protected using SSL (HTTPS). SSL configuration is recommended but not required to configure Hadoop security with Kerberos.
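A minimal sketch of the matching hdfs-site.xml entry for block data transfer encryption:
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>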
By default, the Hadoop HTTP web consoles (ResourceManager, NameNode, NodeManagers, and DataNodes) allow access without any form of authentication. They can be configured to require Kerberos authentication using the HTTP SPNEGO protocol (supported by browsers such as Firefox and Internet Explorer).
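A sketch of how this might be enabled in core-site.xml; the principal and keytab values follow the conventions used earlier in this post and are placeholders:
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@VOIDIO.COM</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <value>/etc/security/keytab/spnego.service.keytab</value>
</property>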
Transparent Encryption in HDFS
HDFS implements transparent, end-to-end encryption. Once configured, data read from and written to special HDFS directories is transparently encrypted and decrypted without requiring changes to user application code. This encryption is also end-to-end, which means the data can only be encrypted and decrypted by the client. HDFS never stores or has access to unencrypted data or unencrypted data encryption keys. This satisfies two typical requirements for encryption: at-rest encryption (meaning data on persistent media, such as a disk) as well as in-transit encryption (e.g. when data is travelling over the network).
Data encryption is required by a number of different government, financial, and regulatory entities. For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations. Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.
For transparent encryption, we introduce a new abstraction to HDFS: the encryption zone. An encryption zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read. Each encryption zone is associated with a single encryption zone key which is specified when the zone is created. Each file within an encryption zone has its own unique data encryption key (DEK). DEKs are never handled directly by HDFS. Instead, HDFS only ever handles an encrypted data encryption key (EDEK). Clients decrypt an EDEK, and then use the subsequent DEK to read and write data. HDFS datanodes simply see a stream of encrypted bytes.
A new cluster service is required to manage encryption keys: the Hadoop Key Management Server (KMS). Hadoop KMS is a cryptographic key management server based on Hadoop’s KeyProvider API.
It provides client and server components which communicate over HTTP using a REST API. The client is a KeyProvider implementation that interacts with the KMS using the KMS HTTP REST API. Configure the KMS backing KeyProvider properties in the etc/hadoop/kms-site.xml configuration file.
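With the KMS in place, an encryption zone is typically created in two steps: first create an encryption zone key through the KMS, then mark an empty HDFS directory as a zone using that key. This is only a sketch; the key name mykey and the path /secure are placeholders.
$ hadoop key create mykey                       # create the encryption zone key in the KMS
$ hdfs dfs -mkdir /secure                       # the directory must exist and be empty
$ hdfs crypto -createZone -keyName mykey -path /secure
$ hdfs crypto -listZones                        # verify the new encryption zone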
When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone's key. The EDEK is then stored persistently as part of the file's metadata on the NameNode.
When reading a file within an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version used to encrypt the EDEK. The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version. Assuming that is successful, the client uses the DEK to decrypt the file's contents.
All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.
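From the user's point of view the zone behaves like any other directory. As a sketch (file names and paths are placeholders continuing the example above), ordinary HDFS commands work unchanged while the stored bytes remain encrypted:
$ hdfs dfs -put report.csv /secure/             # data is encrypted transparently on write
$ hdfs dfs -cat /secure/report.csv              # and decrypted transparently on read, given access to the zone key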