Tuesday, December 17, 2013

Wire Encryption in Hadoop

By Vinay Shukla
Encryption is applied to electronic information to ensure its privacy and confidentiality. Typically, we think of protecting data at rest or in motion. Wire Encryption protects the latter as data moves through Hadoop over RPC, HTTP, Data Transfer Protocol (DTP), and JDBC.
Let’s cover the configuration required to encrypt each of these protocols. For step-by-step instructions, see the HDP 2.0 documentation.

RPC Encryption

The most common way for a client to interact with a Hadoop cluster is through RPC. A client connects to a NameNode (NN) over RPC to read or write a file. RPC connections in Hadoop use Java’s Simple Authentication and Security Layer (SASL), which supports encryption. When the hadoop.rpc.protection property is set to privacy, data sent over RPC is encrypted with symmetric keys. Please refer to this post for more details on the hadoop.rpc.protection setting.
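
As a minimal sketch, the setting could look like the fragment below. It assumes the property is placed in core-site.xml, which the text above does not state; the other accepted values, authentication and integrity, do not encrypt the payload.

```xml
<!-- core-site.xml (assumed location): encrypt RPC traffic through SASL -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
```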

Data Transfer Protocol

The NN gives the client the address of the first DataNode (DN) to read or write the block. The actual data transfer between the client and a DN is over Data Transfer Protocol. To encrypt data transfer you must set dfs.encrypt.data.transfer=true on the NN and all DNs. The encryption algorithm can be customized by setting dfs.encrypt.data.transfer.algorithm to either 3des or rc4. If nothing is set, the system default is used (usually 3DES). While 3DES is more cryptographically secure, RC4 is substantially faster.
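
The two settings above might be written as follows. This is a sketch that assumes they live in hdfs-site.xml on the NN and every DN; the algorithm entry is optional and is shown only to illustrate the choice.

```xml
<!-- hdfs-site.xml (assumed location), on the NN and all DNs -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <!-- optional: 3des or rc4; omit to use the system default -->
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <value>3des</value>
</property>
```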

HTTPS Encryption

Encryption over HTTP is implemented with SSL support across the Hadoop cluster. For example, to make the NN UI listen for HTTP over SSL, configure SSL on the NN and all DNs by setting dfs.https.enable=true in hdfs-site.xml. Typically SSL is configured to authenticate only the server; this is called 1-way SSL. SSL can also be configured to authenticate the client; this is called mutual authentication or 2-way SSL. To configure 2-way SSL, set dfs.client.https.need-auth=true in hdfs-site.xml on each NN and DN. For 1-way SSL, only the keystore needs to be configured on the NN and DNs. The keystore and truststore configuration goes in the ssl-server.xml and ssl-client.xml files on the NN and each DN. The truststore configuration is only needed when using a self-signed certificate or a certificate that is not in the JVM’s truststore.
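
A sketch of the hdfs-site.xml entries described above, for a cluster that wants 2-way SSL. For 1-way SSL, the second property would be left out or set to false.

```xml
<!-- hdfs-site.xml on the NN and each DN -->
<property>
  <name>dfs.https.enable</name>
  <value>true</value>
</property>
<property>
  <!-- only for 2-way (mutual) SSL -->
  <name>dfs.client.https.need-auth</name>
  <value>true</value>
</property>
```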
The following configuration properties need to be specified in ssl-server.xml:
  • The type of the keystore (JKS = Java Keystore, the de facto standard in Java)
  • The location of the keystore file
  • The password to open the keystore file
  • The type of the truststore
  • The location of the truststore file
  • The password to open the truststore
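
For illustration, an ssl-server.xml covering these six settings might look like the sketch below. The property names are the ones commonly found in Hadoop's ssl-server.xml template and are an assumption here, since the names are not spelled out above. The file paths and passwords are placeholders.

```xml
<!-- ssl-server.xml: property names assumed; paths and passwords are placeholders -->
<property>
  <name>ssl.server.keystore.type</name>
  <value>jks</value>
</property>
<property>
  <name>ssl.server.keystore.location</name>
  <value>/path/to/server-keystore.jks</value>
</property>
<property>
  <name>ssl.server.keystore.password</name>
  <value>CHANGE_ME</value>
</property>
<property>
  <name>ssl.server.truststore.type</name>
  <value>jks</value>
</property>
<property>
  <name>ssl.server.truststore.location</name>
  <value>/path/to/truststore.jks</value>
</property>
<property>
  <name>ssl.server.truststore.password</name>
  <value>CHANGE_ME</value>
</property>
```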

Encryption during Shuffle

Starting with HDP 2.0, encryption during shuffle is supported.
Data moves between the Mappers and the Reducers over HTTP; this step is called the shuffle. The Reducer initiates the connection to the Mapper to ask for data and acts as the SSL client. Enabling HTTPS to encrypt shuffle traffic involves the following steps:
  • Set mapreduce.shuffle.ssl.enabled to true in mapred-site.xml
  • Set the keystore properties and, optionally (for 2-way SSL), the truststore properties described above.
Here is an example configuration from mapred-site.xml
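
A minimal sketch of such a configuration follows. Only mapreduce.shuffle.ssl.enabled is named in the steps above. The hadoop.ssl.server.conf and hadoop.ssl.client.conf entries are assumed names for the properties that point at the two SSL files, and some deployments set them in core-site.xml instead.

```xml
<!-- mapred-site.xml: enable SSL for shuffle traffic -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
<!-- assumed property names for locating the SSL configuration files -->
<property>
  <name>hadoop.ssl.server.conf</name>
  <value>ssl-server.xml</value>
</property>
<property>
  <name>hadoop.ssl.client.conf</name>
  <value>ssl-client.xml</value>
</property>
```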
The above configuration refers to an ssl-server.xml and an ssl-client.xml file. These files contain the properties described above. Make sure to put ssl-server.xml and ssl-client.xml in the default ${HADOOP_CONF_DIR}.


JDBC

HiveServer2 implements encryption with the Java SASL protocol’s quality of protection (QOP) setting. With this, data moving between HiveServer2 and a JDBC client can be encrypted. On HiveServer2, set hive.server2.thrift.sasl.qop in hive-site.xml, and on the JDBC client specify sasl.qop as part of the JDBC Hive connection string, e.g. jdbc:hive2://hostname/dbname;sasl.qop=auth-int. Note that auth-int adds integrity checking only; use auth-conf on both sides to encrypt the traffic. HIVE-4911 provides more details on this enhancement.
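
On the server side, the setting might look like the sketch below. It assumes the goal is full encryption, so it uses the SASL value auth-conf (auth and auth-int are the other QOP levels). The client would then pass the same value through sasl.qop in its connection string.

```xml
<!-- hive-site.xml on HiveServer2: auth-conf = authentication + integrity + confidentiality -->
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth-conf</value>
</property>
```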

Closing Thoughts

Ensuring the confidentiality of data flowing in and out of a Hadoop cluster requires configuring encryption on each channel used to move the data. This post describes the encryption configuration required for each of these channels.
