Tuesday, December 17, 2013

Wire Encryption in Hadoop

By Vinay Shukla
Encryption is applied to electronic information in order to ensure its privacy and confidentiality. Typically, we think of protecting data either at rest or in motion. Wire encryption protects the latter as data moves through Hadoop over RPC, HTTP, Data Transfer Protocol (DTP), and JDBC.
Let's cover the configuration required to encrypt each of these protocols. For step-by-step instructions, please see the HDP 2.0 documentation.

RPC Encryption

The most common way for a client to interact with a Hadoop cluster is through RPC. A client connects to a NameNode (NN) over the RPC protocol to read or write a file. RPC connections in Hadoop use Java's Simple Authentication & Security Layer (SASL), which supports encryption. When the hadoop.rpc.protection property is set to privacy, the data over RPC is encrypted with symmetric keys. Please refer to this post for more details on the hadoop.rpc.protection setting.
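As a minimal sketch, the setting (which typically lives in core-site.xml) looks like this:
[xml]
<property>
  <name>hadoop.rpc.protection</name>
  <!-- privacy = authentication, integrity and encryption of RPC traffic -->
  <value>privacy</value>
</property>
[/xml]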

Data Transfer Protocol

The NN gives the client the address of the first DataNode (DN) to read or write the block. The actual data transfer between the client and a DN happens over the Data Transfer Protocol. To encrypt this transfer you must set dfs.encrypt.data.transfer=true on the NN and all DNs. The algorithm used for encryption can be customized by setting dfs.encrypt.data.transfer.algorithm to either 3des or rc4. If nothing is set, then the system default is used (usually 3DES). While 3DES is more cryptographically secure, RC4 is substantially faster.
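As an illustrative sketch, the corresponding hdfs-site.xml entries might look like the following; rc4 is shown only as an example, so choose the algorithm that fits your security requirements:
[xml]
<!-- hdfs-site.xml on the NN and all DNs -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <!-- 3des (slower, stronger) or rc4 (faster) -->
  <value>rc4</value>
</property>
[/xml]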

HTTPS Encryption

Encryption over the HTTP protocol is implemented with SSL support across a Hadoop cluster. For example, to enable the NN UI to listen for HTTP over SSL, you must configure SSL on the NN and all the DNs by setting dfs.https.enable=true in hdfs-site.xml. Typically SSL is configured to authenticate only the server; this is called 1-way SSL. SSL can also be configured to authenticate the client; this is called mutual authentication or 2-way SSL. To configure 2-way SSL, set dfs.client.https.need-auth=true in hdfs-site.xml on each NN and DN. For 1-way SSL only the keystore needs to be configured on the NN and DNs. The keystore and truststore configuration go in the ssl-server.xml and ssl-client.xml files on the NN and each DN. The truststore configuration is only needed when using a self-signed certificate or a certificate that is not in the JVM's truststore.
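As a sketch, the hdfs-site.xml settings described above look like this; the second property is optional and only needed for 2-way SSL:
[xml]
<!-- hdfs-site.xml on the NN and all DNs -->
<property>
  <name>dfs.https.enable</name>
  <value>true</value>
</property>
<!-- Optional: require client certificates (2-way / mutual SSL) -->
<property>
  <name>dfs.client.https.need-auth</name>
  <value>true</value>
</property>
[/xml]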
The following configuration properties need to be specified in ssl-server.xml.
Property                        | Default Value | Description
ssl.server.keystore.type        | JKS           | The type of the keystore; JKS (Java KeyStore) is the de facto standard in Java
ssl.server.keystore.location    | None          | The location of the keystore file
ssl.server.keystore.password    | None          | The password to open the keystore file
ssl.server.truststore.type      | JKS           | The type of the truststore
ssl.server.truststore.location  | None          | The location of the truststore file
ssl.server.truststore.password  | None          | The password to open the truststore

Encryption during Shuffle

Starting with HDP 2.0, encryption during shuffle is supported.
Data moves between the Mappers and the Reducers over the HTTP protocol; this step is called the shuffle. The Reducer initiates the connection to the Mapper to ask for data and acts as the SSL client. Enabling HTTPS to encrypt shuffle traffic involves the following steps.
  • Set mapreduce.shuffle.ssl.enabled to true in mapred-site.xml
  • Set the keystore properties and, optionally (for 2-way SSL), the truststore properties listed in the table above.
Here is an example configuration from mapred-site.xml:
[xml]
<property>
  <name>hadoop.ssl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.ssl.require.client.cert</name>
  <value>false</value>
  <final>true</final>
</property>
<property>
  <name>hadoop.ssl.hostname.verifier</name>
  <value>DEFAULT</value>
  <final>true</final>
</property>
<property>
  <name>hadoop.ssl.keystores.factory.class</name>
  <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
  <final>true</final>
</property>
<property>
  <name>hadoop.ssl.server.conf</name>
  <value>ssl-server.xml</value>
  <final>true</final>
</property>
<property>
  <name>hadoop.ssl.client.conf</name>
  <value>ssl-client.xml</value>
  <final>true</final>
</property>
[/xml]
The above configuration refers to ssl-server.xml and ssl-client.xml. These files contain the properties specified in the table above. Make sure to put ssl-server.xml and ssl-client.xml in the default ${HADOOP_CONF_DIR}.
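The table above only spells out the server-side properties. Assuming the client-side file mirrors it with the analogous ssl.client.* property names (keystore entries would only be needed for 2-way SSL), a minimal ssl-client.xml sketch would be:
[xml]
<!-- ssl-client.xml: truststore used to verify the server certificate; placeholder values -->
<property>
  <name>ssl.client.truststore.type</name>
  <value>jks</value>
</property>
<property>
  <name>ssl.client.truststore.location</name>
  <value>/etc/security/clientKeys/truststore.jks</value>
</property>
<property>
  <name>ssl.client.truststore.password</name>
  <value>changeit</value>
</property>
[/xml]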

JDBC

HiveServer2 implements encryption with the Java SASL protocol's quality of protection (QOP) setting. With this, the data moving between HiveServer2 and a JDBC client can be encrypted. On HiveServer2, set hive.server2.thrift.sasl.qop in hive-site.xml, and on the JDBC client specify sasl.qop as part of the JDBC Hive connection string, e.g. jdbc:hive2://hostname/dbname;sasl.qop=auth-conf (auth-conf is the QOP level that adds confidentiality; auth-int adds only integrity checking). HIVE-4911 provides more details on this enhancement.
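As a sketch, the server-side setting and a matching client connection string might look like this (the port 10000 in the comment is just the usual HiveServer2 default, shown for illustration):
[xml]
<!-- hive-site.xml on HiveServer2 -->
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <!-- auth | auth-int | auth-conf; auth-conf enables encryption -->
  <value>auth-conf</value>
</property>
<!-- JDBC client: jdbc:hive2://hostname:10000/dbname;sasl.qop=auth-conf -->
[/xml]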

Closing Thoughts

Ensuring confidentiality of the data flowing in and out of a Hadoop cluster requires configuring encryption on each channel used to move the data. This post described the encryption configuration required for each of these channels.

Tuesday, December 10, 2013

Announcing the Technical Preview of Apache Knox Gateway

By Vinay Shukla
December 10th, 2013
Just yesterday, we talked about our roadmap for Security in Enterprise Hadoop. At our Security labs page you can see in one place the security roadmap and efforts underway across Hadoop and their timelines.
Security is often described as rings of defense. Continuing this analogy, the Apache community has been working to create a perimeter security solution for Hadoop. This effort is the Apache Knox Gateway (Apache Knox), and we are happy to announce the Technical Preview of Apache Knox. With this technical preview, Apache Knox is one step closer to general availability.
Apache Knox is the Web/REST API Gateway solution for Hadoop. It provides a single point of access for all Hadoop resources over REST. It also enables the integration of enterprise identity management solutions and provides numerous perimeter security features for REST/HTTP access to Hadoop.

Monday, December 9, 2013

Hadoop Security: Today and Tomorrow

By Vinay Shukla
Security is a top agenda item and represents critical requirements for Hadoop projects. Over the years, Hadoop has evolved to address key concerns regarding authentication, authorization, accounting, and data protection natively within a cluster and there are many secure Hadoop clusters in production. Hadoop is being used securely and successfully today in sensitive financial services applications, private healthcare initiatives and in a range of other security-sensitive environments. As enterprise adoption of Hadoop grows, so do the security concerns and a roadmap to embrace and incorporate these enterprise security features has emerged.

An Open Roadmap for Security

We recently published an open roadmap for security in Hadoop. This effort outlines a community-based approach to delivering on the key requirements that are being asked for by some of the largest and most security-stringent organizations on the planet. Our goal is to incite and work within the community to deliver flexible, comprehensive, and integrated security controls for Hadoop. You can check out this roadmap in the labs section of our website, but let's see what we can do for security today.

Securing a Hadoop cluster today

The distributed nature of Hadoop, a key to its success, poses unique challenges in securing it. Securing a client-server system is often easier because security controls can be placed at the service, the single point of access. Securing a distributed system requires a redundant, layered approach, and with Hadoop this requirement is exacerbated by the complexity of the distribution. Many of these layers are in place today to help you secure a cluster. Let's review some of the tools available across the four pillars.
  • Authentication verifies the identity of a system or user accessing the system. Hadoop provides two modes of authentication. The first, Simple or Pseudo authentication, essentially places trust in the user's assertion about who they are. The second, Kerberos, provides a fully secure Hadoop cluster. In line with best practice, Hadoop provides these capabilities while relying on widely accepted corporate user stores (such as LDAP or Active Directory) so that a single source can serve as the credential catalog across Hadoop and existing systems.
  • Authorization specifies access privileges for a user or system. Hadoop provides fine-grained authorization via file permissions in HDFS, resource-level access control (via ACLs) for MapReduce, and coarser-grained access control at the service level. For data, HBase provides authorization with ACLs on tables and column families, and Accumulo extends this even further to cell-level control. Also, Apache Hive provides coarse-grained access control on tables.
  • Accounting provides the ability to track resource use within a system. Within Hadoop, insight into usage and data access is critical for compliance or forensics. As part of core Apache Hadoop, HDFS and MapReduce provide base audit support. Additionally, the Apache Hive metastore records audit (who/when) information for Hive interactions. Finally, Apache Oozie, the workflow engine, provides an audit trail for services.
  • Data Protection ensures privacy and confidentiality of information. Hadoop and HDP allow you to protect data in motion: HDP provides encryption capability for channels such as Remote Procedure Call (RPC), HTTP, JDBC/ODBC, and Data Transfer Protocol (DTP). Finally, HDFS and Hadoop support operating-system-level encryption.

Securing a Hadoop cluster tomorrow

There is a lot of innovation around security in Hadoop today. There is a lot of focus on making all these security frameworks work together and to make them simple to manage. Here are some of the ways Hortonworks is leading Hadoop security enhancements and is bringing enterprise security to Hadoop.
  • Perimeter level Security With Apache Knox
  • Apache Hadoop has Kerberos for authentication. However, some organizations require integration with their enterprise identity management and Single Sign-On (SSO) solutions. Hortonworks created the Apache Knox Gateway (Apache Knox) to provide Hadoop cluster security at the perimeter for REST/HTTP requests and to enable the integration of enterprise identity-management solutions. Apache Knox provides integration with corporate identity systems such as LDAP and Active Directory (AD), and will also integrate with SAML-based SSO and other SSO systems.
  • Apache Knox also protects a Hadoop cluster by hiding its network topology, eliminating the leak of network internals. A network firewall can be configured to deny all direct access to a Hadoop cluster and accept only connections coming from the Apache Knox Gateway over HTTP. These measures dramatically reduce the attack surface.
  • Finally, Apache Knox promotes the use of REST/HTTP for Hadoop access. REST is proven, scalable, and provides client interoperability across languages, operating systems, and computing devices. By using Hadoop REST/HTTP APIs through Knox, clients do not need a local Hadoop installation.
  • Improved Authentication
  • Hortonworks is working with the community on several ongoing projects to extend authentication beyond Kerberos and provide token-based authentication. The JIRAs tracking this work are HADOOP-9392, HADOOP-9479, HADOOP-9533 and HADOOP-9804.
  • Granular Authorization
  • Authorization mechanisms in Hadoop components are being enhanced across the board. Most Hive users want a familiar database-style authorization model. Hortonworks is working to enhance Hive authorization to bring SQL GRANT and REVOKE semantics to Hive and make it fully secure. This enhancement will also bring row- and column-level security to Hive. The JIRA for this work is HIVE-5837 and the enhancement is expected in Q1 2014. HDFS-4685 will bring ACL support to HDFS.
  • Accounting & Audit
  • There are enhancements planned to provide better reporting, with audit log correlation and an audit viewer capability in Hadoop. With audit log correlation, an auditor will be able to determine what sequence of operations John Doe performed across Hadoop components without requiring external tools. In the future, with Ambari, an auditor will be able to see John Doe's actions (such as reading HDFS files and submitting MapReduce jobs) as one logical operation. In addition, Apache Knox plans to provide a billing capability to record REST API usage for Hadoop.
  • Protecting data with Encryption
  • For encryption of data in motion, the Hadoop ecosystem will continue to add encryption to channels not already covered and to offer more effective encryption algorithms for all channels. In Q1 2014, HDP will provide SSL support for HiveServer2. Further enhancements are in the works to provide encryption in Hive, HDFS and HBase.

Conclusion

Hadoop is a secure system and offers key features for securely processing enterprise data. But the security work never ends. Hortonworks is working on several projects to enhance Hadoop security from the inside, shore up defenses from the outside with Apache Knox and to keep up with evolving requirements by providing more flexible authentication and authorization and by improving data protection. We are also working to improve integration with enterprise Identity Management and security systems.
We encourage you to review this roadmap for Hadoop security and to get involved. Also, over the next few months we will publish more best practices, covering each pillar of security in more detail. Stay tuned for the next post, about wire encryption.