Thursday, October 23, 2014

Apache Spark Webinar Recap: Q&A Session

Apache Spark Webinar Recap: Q&A Session
By Vinay Shukla
October 23rd, 2014
We recently hosted a Spark webinar as part of the YARN Ready series, aimed at a technical audience including developers of applications for Apache Hadoop and Apache Hadoop YARN. During the event, a number of good questions surfaced that we wanted to share with our broader audience in this blog. Take a look at the video and slides along with these questions and answers below.
You can listen to the entire webinar recording here.
And peruse the slides from the webinar here.
What is the primary benefit of YARN?
How does Spark fit in?
YARN enables a modern data architecture whereby users have the ability to store data in a single location and interact with it in multiple ways – by using the right data processing engine at the right time.
Deeper integration of Spark with YARN allows it to become a more efficient tenant along side other engines, such as Hive, Storm, HBase and others, simultaneously, all on a single data platform. This avoids the need to create and manage dedicated Spark clusters to support that subset of applications for which Spark is ideally suited and more effectively share resources within a single cluster.
What is RDD?
RDD is a core Spark concept for describing datasets at runtime. More details can be found here.
Do all the Spark specific libraries reside in their own HDFS folder independent from the main Hadoop bin directory?
Spark provides all its dependencies in an assembled jar used in run-time. Thus, it is a client side jar and doesn’t need any jars in the Hadoop bin.
Why you need reducers’ variable? You mentioned that Spark can automatically determine the best number of reducers.
Spark cannot automatically adjust reducers. However, Tez can automatically adjust reducers if auto-parallelism is enabled.
Is there a Hortonworks Sandbox where YARN and Spark are already installed for download?
You can download HDP 2.1 Sandbox from and follow the Spark Technical Preview doc to get Spark on the Sandbox
How about load balance during the batch process?
In Spark, users can repartition the RDD to rebalance the workload, but it has to be done by users. Tez, on the other hand, can do it automatically when auto-parallelism is enabled.
What is the best way to store relational data in Spark inside HDFS?
It depends on each application. Some candidates may include a columnar format, e.g., ORCFormat, which retrieves required columns more efficiently.
If we want to use Spark but we have a separate cluster for it, do we need YARN, or can we just use Spark in standalone fashion? Is YARN only needed if we co-locate Hadoop and Spark on the same nodes?
YARN provides a multi-tenant environment, allowing it to share the same datasets with other data engines. Even though Spark can be run in standalone mode, it will require a dedicated cluster, depriving the users from the Hadoop features.
Can we use Hive analytical functions & UDFs in Spark API?
Spark does support Hive UDFs, but this support may not be comprehensive.
How do I set up a Spark cluster today in HDP?
Spark-on-YARN is a client-only install; you can follow the instructions from our tech preview here.
How do you handle single node Spark/YARN failures? Do we need to handle this in our code or are restarts automatic?
Tez manages this automatically without any action required on the user’s part.
How does Spark as a SQL server (for 3rd party reporting tools to connect) work in Hadoop cluster?
Spark 1.1.0 has Thrift server built in, which will work with beeline.
Do you support Spark Streaming? How do you compare Spark vs. Storm?
Spark Streaming is emerging and has only limited experience in production clusters. Many customers are using Storm and Kafka for stream processing.
What about Tez vs. Spark? Is Spark faster?
Spark is many things to many people: a developer facing API, a DAG execution engine and a cluster management system (in Spark Standalone.). Spark also comes with a library for ML, Streaming etc.
By contrast, Tez is a low level API not focused for application developers. Tez is an API for tool developers, whereas the Spark API allows developers to create both iterative in-memory and batch applications on Apache Hadoop YARN.
Currently, however, Spark under utilizes YARN resources particularly when large datasets are involved.
Spark does not behave like MapReduce or Tez—executing periodic jobs and releasing the compute resources once those jobs finish. In some ways, it behaves much more like a long running service, holding onto resources (such as memory) until the end of the entire workload. Using the experience we have already gained in building MapReduce, Apache Tez and other data processing engines, we believe similar concepts can be applied to Spark in order to optimize its resource utilization and be a good multi-tenant citizen within a YARN-based Hadoop cluster.
What are your opinions on Spark/YARN vs. Spark/Mesos?
Hortonworks is a Hadoop distribution, and our focus within the Spark community is to enable and leverage YARN for resource management so that our customers can use Spark along side other engines in a Hadoop cluster.
Are there advantages to run Spark on top of YARN as opposed to Mesos?
We have a solid understanding of how Spark on YARN is beneficial. Deeper integration of Spark with YARN allows it to become a more efficient tenant along side other engines, such as Hive, Storm, HBase and others, simultaneously, all on a single data platform.
This avoids the need to create and manage dedicated Spark clusters to support that subset of applications for which Spark is ideally suited. It also shares resources effectively within a single cluster. Our customers want to run multiple workloads on a single set of data all in a single cluster. Hadoop already has a very wide array of engines available.
Is Spark used only through Java and Python programming by calling APIs or can we use it with Hive or Pig etc?
Besides Scala, Java and Python, Spark SQL with JDBC Server interface can be used.
Is there JDBC interface for Spark SQL?
With Spark 1.1.0 JDBC thrift service is included. However HDP Tech preview of Spark 1.1 does not support it yet.
Which IDE is used for Spark development? Does Hortonworks provide one?
Eclipse, Intellj or others. It depends on developers’ preference. Hortonworks doesn’t provide any IDE.
Does Hortonworks supplement or compete with Hadoop?
We do Hadoop. We are Hadoop. We are the ONLY 100% open source Hadoop distribution. Our distribution is comprised wholly of Apache Hadoop and its related Apache projects.

Monday, October 20, 2014

End to End Wire Encryption with Apache Knox

End to End Wire Encryption with Apache Knox

By Vinay Shukla
Enterprise Apache Hadoop provides the fundamental data services required to deploy into existing architectures. These include security, governance and operations services, in addition to Hadoop’s original core capabilities for data management and data access. This post focuses on recent work completed in the open source community to enhance the Hadoop security component, with encryption and SSL certificates.
Last year I wrote a blog summarizing wire encryption options in Hortonworks Data Platform (HDP). Since that blog, encryption capabilities in HDP and Hadoop have expanded significantly.
One of these new layers of security for a Hadoop cluster is Apache Knox. With Knox, a Hadoop cluster can now be made securely accessible to a large number of users. Today, Knox allows secure connections toApache HBase, Apache Hive, Apache Oozie, WebHDFS and WebHCat. In the near future, it will also include support for Apache Hadoop YARN, Apache Ambari, Apache Falcon, and all the REST APIs offered by Hadoop components.
Without Knox, these clients would connect directly to a Hadoop cluster, and the large number of direct client connections poses security disadvantages. The main one is access.
In a typical organization, only a few DBAs connect directly to a database, and all the end-users are routed through a business application that then connects to the database. That intermediate application provides an additional layer of security checks.
Hadoop’s approach with Knox is no different. Many Hadoop deployments use Knox to allow more users to make use of Hadoop’s data and queries without compromising on security. Only a handful of admins can connect directly to their Hadoop clusters, while end-users are routed through Knox.
Apache Knox plays the role of reverse proxy between end-users and Hadoop, providing two connection hops between the client and Hadoop cluster. The first connection is between the client and Knox, and Knox offers out of the box SSL support for this connection. The second connection is between Knox and a given Hadoop component, which requires some configuration.
This blog walks through configuration and steps required to use SSL for the second connection, between Knox and a Hadoop component.

SSL Certificates

SSL connections require a certificate, either self-signed or signed by a Certificate Authority (CA). The process of obtaining self-signed certificates differs slightly from how one obtains CA-signed certificates. When the CA is well-known, there’s no need to import the signer’s certificate into the truststore. For this blog, I will use self-signed certificates; however, wire encryption can also be enabled with a CA-signed certificate. Recently, my colleague Haohui Moi blogged about HTTPS with HDFS and included instructions for a CA-signed certificate.

SSL Between Knox & HBase

The first step is to configure SSL on HBase’s REST server (Stargate). To configure SSL, we will need to create a keystore to hold the SSL certificate. This example uses a self-signed certificate, and a SSL certificate used by a Certificate Authority (CA) makes the configuration steps even easier.
As user HBase (su hbase) create the keystore.
export HOST_NAME=`hostname`
keytool -genkey -keyalg RSA -alias selfsigned -keystore hbase.jks -storepass password -validity 360 -keysize 2048 -dname "CN=$HOST_NAME, OU=Eng, O=Hortonworks, L=Palo Alto, ST=CA, C=US" -keypass password
Make sure the common name portion of the certificate matches the host where the certificate will be deployed. For example, the self-signed SSL certificate I created has the following CN, when the host running HBase is
“Owner:, OU=Eng, O=HWK, L=PA, ST=CA, C=US
Issuer:, OU=Eng, O=HWK, L=PA, ST=CA, C=US”
We just created a self-signed certificate for use with HBase. Self-signed certificates are rejected during SSL handshake. To get around this, export the certificate and put it in the cacerts file of the JRE used by Knox. (This step is unnecessary when using a certificate issued by a well known CA.)
On the machine running HBase, export HBase’s SSL certificate into a file hbase.crt:
keytool -exportcert -file hbase.crt -keystore hbase.jks -alias selfsigned -storepass password
Copy the hive.crt file to the Node running Knox and run:
keytool -import -file hbase.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts -storepass changeit -alias selfsigned
Make sure the path to cacerts file points to cacerts of JDK used to run Knox gateway.
The default cacerts password is “changeit.”
Configure HBase REST Server for SSL
Using Ambari or another tool used for editing Hadoop configuration properties:
[code language=”XML”]
Save the configuration and re-start the HBase REST server using either Ambari or with command line as in:
sudo /usr/lib/hbase/bin/ stop rest & sudo /usr/lib/hbase/bin/ start rest -p60080
Verify HBase REST server over SSL
Replace localhost with whatever is the hostname of your HBase rest server.
curl -H "Accept: application/json" -k https://localhost:60080/
It should output the tables in your HBase. e.g. name.
Configure Knox to point to HBase over SSL and re-start Knox
Change the URL of the HBase service for your Knox topology in sandbox.xml to HTTPS e.g, ensure Host matches the host of HBase rest server.
[code language=”XML”]
Verify end to end SSL to HBase REST server via Knox
curl -H "Accept: application/json" -iku guest:guest-password -X GET'https://localhost:8443/gateway/sandbox/hbase/'
Should have an output similar to below:
HTTP/1.1 200 OK
Set-Cookie: JSESSIONID=166l8e9qhpi95ty5le8hni0vf;Path=/gateway/sandbox;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Content-Type: application/json
Content-Length: 38
Server: Jetty(8.1.14.v20131031)

SSL Between Knox & HiveServer

Create Keystore for Hive
As a Hive user (su hive) create the keystore.
The first step is to configure SSL on HiveServer2. To configure SSL create a keystore to hold the SSL certificate. The example here uses self-signed certificate. Using SSL certificate issued by a Certificate Authority(CA) will make the configuration steps easier by eliminating the need to import the signer’s certificate into the truststore.
On the Node running HiveServer2 run commands:
keytool -genkey -keyalg RSA -alias hive -keystore hive.jks -storepass password -validity 360 -keysize 2048-dname ", OU=Eng, O=Hortonworks, L=Palo Alto, ST=CA, C=US" -keypass password
keytool -exportcert -file hive.crt -keystore hive.jks -alias hive -storepass password -keypass password
Copy the hive.crt file to the Node running Knox and run:
keytool -import -file hive.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts -storepass changeit -alias hive
Make sure the path to cacerts file points to cacerts of JDK used to run Knox gateway. The default cacerts password is “changeit.”
Configure HiveServer2 for SSL
Using Ambari or other tool used for editing Hadoop configuration, ensure that hive.jks is in a location readable by HiveServer such as /etc/hive/conf
[code language=”XML”]
path to keystore file
keystore password
Save the configuration and re-start HiveServer2
Validate HiveServer2 SSL Configuration
Use Beeline & connect directly to HiveServer2 over SSL
beeline> !connect jdbc:hive2://sandbox:10001/;ssl=true beeline> show tables;
Ensure this connection works.
Configure Knox connection to HiveServer2 over SSL
Change the URL of the Hive service for your Knox topology in sandbox.xml to HTTPS e.g, ensure host matches the host of Hive server.
[code language=”XML”]
Validate End to End SSL from Beeline > Knox > HiveServer2
Use Beeline and connect via Knox to HiveServer2 over SSL
beeline> !connect jdbc:hive2://sandbox:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=knox hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/sandbox/hive
beeline> show tables;>

SSL Between Knox & WebHDFS

Create Keystore for WebHDFS
execute this shell script or set up environment vairables:
[code langauge=”javaScript”]
export HOST_NAME=`hostname`
export SERVER_KEY_LOCATION=/etc/security/serverKeys
export CLIENT_KEY_LOCATION=/etc/security/clientKeys
export KEYSTORE_FILE=keystore.jks
export TRUSTSTORE_FILE=truststore.jks
export CERTIFICATE_NAME=certificate.cert
export ALL_JKS=all.jks
export YARN_USER=yarn
execute these commands:
keytool -genkey -alias $HOST_NAME -keyalg RSA -keysize 2048 -dname"CN=$HOST_NAME,OU=hw,O=hw,L=paloalto,ST=ca,C=us" -keypass $SERVER_KEYPASS_PASSWORD -keystore $KEYSTORE_FILE-storepass $SERVER_STOREPASS_PASSWORD
cd $SERVER_KEY_LOCATION ; keytool -export -alias $HOST_NAME -keystore $KEYSTORE_FILE -rfc -file $CERTIFICATE_NAME -storepass $SERVER_STOREPASS_PASSWORD
keytool -import -trustcacerts -file $CERTIFICATE_NAME -alias $HOST_NAME -keystore $TRUSTSTORE_FILE
Also, import the certificate to the truststore used by Knox which is the JDK’s default cacerts file.
keytool -import -trustcacerts -file certificate.cert -alias $HOST_NAME -keystore /usr/lib/jvm/jre-1.7.0openjdk.x86_64/lib/security/cacerts
Make sure to point the path to cacerts for the JDK used by Knox in your deployment.
Type yes when the asked to add the certificate to the truststore.
Configure HDFS for SSL
Copy example ssl-server.xml and edit it to use the ssl configuration created in previous step.
cp /etc/hadoop/conf.empty/ssl-server.xml.example /etc/hadoop/conf/ssl-server.xml
And and make sure the following properties are set in /etc/hadoop/conf/ssl-server.xml:
[code language=”XML”]
Use Ambari to set the following properties in core-site.xml.
[code language=”XML”]
Use Ambari to set the following properties in hdfs-site.xml.
[code language=”XML”]
The valid values for dfs.http.policy are HTTPS_ONLY & HTTP_AND_HTTPS.
The valid values for hadoop.ssl.hostname.verifier are DEFAULT, STRICT,STRICT_I6, DEFAULT_AND_LOCALHOST and ALLOW_ALL. Only use ALLOW_ALL in a controlled environment & with caution. And then use ambari to restart all hdfs services.
Configure Knox to connect over SSL to WebHDFS
Make sure /etc/knox/conf/topologies/sandbox.xml (or whatever is the topology for your Knox deployment is) has a valid service address with HTTPS protocol to point to WebHDFS.
[code language=”XML”]
Validate End to End SSL – Client > Knox > WebHDFS
curl -iku guest:guest-password -X GET ' op=LISTSTATUS'

SSL Between Knox & Oozie

By default, Oozie server runs with properties necessary for SSL configuration. For example,
do a ‘ps’ on your Oozie server (look for the process named Bootstrap) and you will see the following properties:
  • -Doozie.https.port=11443
  • -Doozie.https.keystore.file=/home/oozie/.keystore
  • -Doozie.https.keystore.pass=password
You can change these properties with Ambari in Oozie server config in the advance oozie-env config section. I changed them to point the keystore file to /etc/oozie/conf/keystore.jks. For this blog, I re-used the keystore I created earlier for HDFS and copied /etc/security/serverKeys/keystore.jks to /etc/oozie/conf/keystore.jks
Configure Knox to connect over SSL to Oozie
Make sure /etc/knox/conf/topologies/sandbox.xml (if whatever is the topology for your Knox deployment is) has a valid service address with HTTPS protocol to point to Oozie.
[code language=”XML”]
Validate End to End SSL – Client > Knox > Oozie
Apache Knox comes with a DSL client that makes operations against a Hadoop cluster trivial to use for exploration and one does not need to handle raw HTTP requests as I used in the previous examples.
Now run the following commands:
cd to /usr/lib/knox & java -jar bin/shell.jar samples/ExampleOozieWorkflow.groovy
Check out the /usr/lib/knox/samples/ExampleOozieWorkflow.groovy for details on what this script does.
Use higher strength SSL cipher suite
Often it is desirable to use a more secure cipher suite for SSL. For example, in the HDP 2.1 Sandbox, the cipher used for SSL between curl client and Knox in my environment is “EDH-RSA-DES-CBC3-SHA.”
JDK ships with ciphers that comply with US exports restricts and that limit the cipher strength. To use higher strength cipher, download UnlimitedJCE policy for your JDK vendor’s website and copy the files into the JRE. For Oracle JDK, you can get unlimited policy files from here.
cp UnlimitedJCEPolicy/* /usr/jdk64/jdk1.7.0_45/jre/lib/security/
If you use Ambari to setup your Hadoop cluster, you already have highest strength cipher allowed for the JDK, since Ambari setups the JDK with unlimited strength JCE policy.
Optional: Verify higher strength cipher with SSLScan
You can use a tool such as OpenSSL or SSLScan to verify that now AES256 is used for SSL.
For example, the command:
openssl s_client -connect localhost:8443
will print the cipher details:
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384


SSL is the backbone of wire encryption. In Hadoop, there are multiple channels to move the data and access the cluster. This blog covered key SSL concepts and walked through steps to configure a cluster for end-to-end SSL. These steps are validated on a single node Hadoop cluster. Following these instructions for a multi-node Hadoop cluster will require a few more steps.
This blog covered a lot of ground, and I want to thank my colleague Sumit Gupta for his excellent input. Please let me know you have questions or comments.