Hadoop Security: Today and Tomorrow
By Vinay Shukla
Security is a top agenda item and represents critical requirements for Hadoop projects. Over the years, Hadoop has evolved to address key concerns regarding authentication, authorization, accounting, and data protection natively within a cluster and there are many secure Hadoop clusters in production. Hadoop is being used securely and successfully today in sensitive financial services applications, private healthcare initiatives and in a range of other security-sensitive environments. As enterprise adoption of Hadoop grows, so do the security concerns and a roadmap to embrace and incorporate these enterprise security features has emerged.
An Open Roadmap for Security
We recently published an open roadmap for security in Hadoop. This effort outlines a community-based approach to delivering on the key requirements that some of the largest and most security-stringent organizations on the planet are asking for. Our goal is to inspire and work within the community to deliver flexible, comprehensive, and integrated security controls for Hadoop. You can check out this roadmap in the labs section of our website, but first let’s see what we can do for security today.
Securing a Hadoop cluster today
The distributed nature of Hadoop, a key to its success, poses unique challenges in securing it. Securing a client-server system is often easier because security controls can be placed at the service, the single point of access. Securing any system requires a redundant and layered approach, and with Hadoop this requirement is amplified by the complexity of a distribution. Many of the layers are in place today to help you secure a cluster. Let’s review some of the tools available across the four pillars: authentication, authorization, accounting, and data protection.
- Authentication verifies the identity of a system or user accessing the system. Hadoop provides two modes of authentication. The first, Simple or Pseudo authentication, essentially trusts a user’s assertion about who they are. The second, Kerberos, provides strong authentication for a fully secured Hadoop cluster. In line with best practice, Hadoop provides these capabilities while relying on widely accepted corporate user stores (such as LDAP or Active Directory), so that a single source of credentials can be used across Hadoop and existing systems.
- Authorization specifies access privileges for a user or system. Hadoop provides fine-grained authorization via file permissions in HDFS and resource-level access control (via ACLs) for MapReduce, along with coarser-grained access control at the service level. For data, HBase provides authorization with ACLs on tables and column families, and Accumulo extends this even further to cell-level control. Apache Hive also provides coarse-grained access control on tables.
- Accounting provides the ability to track resource use within a system. Within Hadoop, insight into usage and data access is critical for compliance or forensics. As part of core Apache Hadoop, HDFS and MapReduce provide base audit support. Additionally, the Apache Hive metastore records audit (who/when) information for Hive interactions. Finally, Apache Oozie, the workflow engine, provides an audit trail for services.
- Data Protection ensures privacy and confidentiality of information. Hadoop and HDP allow you to protect data in motion: HDP provides encryption for channels such as Remote Procedure Call (RPC), HTTP, JDBC/ODBC, and Data Transfer Protocol (DTP). Finally, HDFS and Hadoop support operating-system-level encryption.
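As a rough illustration of how the authentication, authorization, and data-in-motion settings described above are typically switched on, here is a minimal Python sketch that renders Hadoop-style configuration XML. The property names are standard Apache Hadoop settings, but the values shown and the helper `render_site_xml` are assumptions made for this example, not a configuration taken from this article or a recommendation for any particular cluster.

```python
import xml.etree.ElementTree as ET

# Properties that normally live in core-site.xml.
# The values are illustrative assumptions for a secured cluster.
CORE_SITE = {
    "hadoop.security.authentication": "kerberos",  # the default is "simple"
    "hadoop.security.authorization": "true",       # enable service-level authorization
    "hadoop.rpc.protection": "privacy",            # encrypt RPC traffic
}

# Property that normally lives in hdfs-site.xml.
HDFS_SITE = {
    "dfs.encrypt.data.transfer": "true",  # encrypt the HDFS data transfer protocol
}


def render_site_xml(properties):
    """Render a dict of name/value pairs in the Hadoop *-site.xml layout.

    Hypothetical helper written for this example only.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(CORE_SITE))
    print(render_site_xml(HDFS_SITE))
```

Kerberos mode also needs principals and keytabs configured for each service, which this sketch leaves out.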
Securing a Hadoop cluster tomorrow
Security is an active area of innovation in Hadoop today, with much of the focus on making these security frameworks work together and making them simple to manage. Here are some of the ways Hortonworks is leading Hadoop security enhancements and bringing enterprise security to Hadoop.
- Perimeter-Level Security with Apache Knox
- Apache Hadoop has Kerberos for authentication. However, some organizations require integration with their enterprise identity management and Single Sign-On (SSO) solutions. Hortonworks created Apache Knox Gateway (Apache Knox) to provide Hadoop cluster security at the perimeter for REST/HTTP requests and to enable the integration of enterprise identity-management solutions. Apache Knox provides integration with corporate identity systems such as LDAP and Active Directory (AD), and will also integrate with SAML-based SSO and other SSO systems.
- Apache Knox also protects a Hadoop cluster by hiding its network topology to eliminate the leak of network internals. A network firewall may be configured to deny all direct access to a Hadoop cluster and accept only the connections coming from the Apache Knox Gateway over HTTP. These measures dramatically reduce the attack surface.
- Finally, Apache Knox promotes the use of REST/HTTP for Hadoop access. REST is proven, scalable, and provides client interoperability across languages, operating systems, and computing devices. By using Hadoop REST/HTTP APIs through Knox, clients do not need a local Hadoop installation.
- Improved Authentication
- Granular Authorization
- Authorization mechanisms in Hadoop components are being enhanced across the board. Most Hive users want a familiar database-style authorization model. Hortonworks is working to enhance Hive authorization to bring SQL GRANT and REVOKE semantics to Hive and make it fully secure. This enhancement will also bring row- and column-level security to Hive. The JIRA for this work is HIVE-5837, and the enhancement is expected in Q1 2014. HDFS-4685 will bring ACL support to HDFS.
- Accounting & Audit
- There are enhancements planned to provide better reporting, with audit log correlation and an audit viewer capability in Hadoop. With audit log correlation, an auditor will be able to determine what sequence of operations John Doe performed across Hadoop components without requiring external tools. In the future, with Ambari, an auditor will be able to see John Doe’s actions (such as reading HDFS files and submitting some MapReduce jobs) as one logical operation. In addition, Apache Knox plans to provide a billing capability to record REST API usage for Hadoop.
- Protecting Data with Encryption
- For encryption of data in motion, the Hadoop ecosystem will continue to add encryption for channels not already covered and to offer more effective encryption algorithms for all channels. In Q1 2014, HDP will provide SSL support for HiveServer2. Further enhancements are in the works to provide encryption in Hive, HDFS and HBase.
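To make the perimeter model described above more concrete, here is a small Python sketch of how a client might build a WebHDFS request URL that goes through an Apache Knox gateway instead of contacting cluster nodes directly. The host name, the cluster (topology) name `sandbox`, and the helper `knox_webhdfs_url` are hypothetical; the port `8443`, the `gateway` path segment, and the use of HTTPS are assumed to match a default Knox setup and may differ in your deployment.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host, cluster, hdfs_path, op,
                     gateway_port=8443, gateway_path="gateway", **params):
    """Build a WebHDFS URL routed through a Knox gateway.

    Hypothetical helper: the client only ever sees the gateway address,
    never the NameNode or DataNode hosts behind it.
    """
    query = urlencode({"op": op, **params})
    return (
        f"https://{gateway_host}:{gateway_port}/{gateway_path}/{cluster}"
        f"/webhdfs/v1{quote(hdfs_path)}?{query}"
    )


if __name__ == "__main__":
    # List a directory through the gateway (all names are made up).
    print(knox_webhdfs_url("knox.example.com", "sandbox", "/user/alice", "LISTSTATUS"))
```

Because only the gateway address appears in the URL, a firewall can block direct access to the cluster while clients with no local Hadoop installation still reach it over REST.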
Hadoop is a secure system and offers key features for securely processing enterprise data. But the security work never ends. Hortonworks is working on several projects to enhance Hadoop security from the inside, shore up defenses from the outside with Apache Knox and to keep up with evolving requirements by providing more flexible authentication and authorization and by improving data protection. We are also working to improve integration with enterprise Identity Management and security systems.
We encourage you to review this roadmap for Hadoop security and to get involved. Also, over the next few months we will publish more best practices, covering each pillar of security in more detail. Stay tuned for the next post, about wire encryption.