Apache Spark Webinar Recap: Q&A Session

By Vinay Shukla

October 23rd, 2014

We recently hosted a Spark webinar as part of the YARN Ready series, aimed at a technical audience including developers of applications for Apache Hadoop and Apache Hadoop YARN. During the event, a number of good questions surfaced that we wanted to share with our broader audience in this blog. Take a look at the video and slides along with these questions and answers below.

You can listen to the entire webinar recording here.

And peruse the slides from the webinar here.

Question	Answer
What is the primary benefit of YARN? How does Spark fit in?	YARN enables a modern data architecture whereby users have the ability to store data in a single location and interact with it in multiple ways – by using the right data processing engine at the right time. Deeper integration of Spark with YARN allows it to become a more efficient tenant along side other engines, such as Hive, Storm, HBase and others, simultaneously, all on a single data platform. This avoids the need to create and manage dedicated Spark clusters to support that subset of applications for which Spark is ideally suited and more effectively share resources within a single cluster.
What is RDD?	RDD is a core Spark concept for describing datasets at runtime. More details can be found here.
Do all the Spark specific libraries reside in their own HDFS folder independent from the main Hadoop bin directory?	Spark provides all its dependencies in an assembled jar used in run-time. Thus, it is a client side jar and doesn’t need any jars in the Hadoop bin.
Why you need reducers’ variable? You mentioned that Spark can automatically determine the best number of reducers.	Spark cannot automatically adjust reducers. However, Tez can automatically adjust reducers if auto-parallelism is enabled.
Is there a Hortonworks Sandbox where YARN and Spark are already installed for download?	You can download HDP 2.1 Sandbox from Hortonworks.com/downloads and follow the Spark Technical Preview doc to get Spark on the Sandbox
How about load balance during the batch process?	In Spark, users can repartition the RDD to rebalance the workload, but it has to be done by users. Tez, on the other hand, can do it automatically when auto-parallelism is enabled.
What is the best way to store relational data in Spark inside HDFS?	It depends on each application. Some candidates may include a columnar format, e.g., ORCFormat, which retrieves required columns more efficiently.
If we want to use Spark but we have a separate cluster for it, do we need YARN, or can we just use Spark in standalone fashion? Is YARN only needed if we co-locate Hadoop and Spark on the same nodes?	YARN provides a multi-tenant environment, allowing it to share the same datasets with other data engines. Even though Spark can be run in standalone mode, it will require a dedicated cluster, depriving the users from the Hadoop features.
Can we use Hive analytical functions & UDFs in Spark API?	Spark does support Hive UDFs, but this support may not be comprehensive.
How do I set up a Spark cluster today in HDP?	Spark-on-YARN is a client-only install; you can follow the instructions from our tech preview here.
How do you handle single node Spark/YARN failures? Do we need to handle this in our code or are restarts automatic?	Tez manages this automatically without any action required on the user’s part.
How does Spark as a SQL server (for 3rd party reporting tools to connect) work in Hadoop cluster?	Spark 1.1.0 has Thrift server built in, which will work with beeline.
Do you support Spark Streaming? How do you compare Spark vs. Storm?	Spark Streaming is emerging and has only limited experience in production clusters. Many customers are using Storm and Kafka for stream processing. See http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming for a more detailed comparison.
What about Tez vs. Spark? Is Spark faster?	Spark is many things to many people: a developer facing API, a DAG execution engine and a cluster management system (in Spark Standalone.). Spark also comes with a library for ML, Streaming etc. By contrast, Tez is a low level API not focused for application developers. Tez is an API for tool developers, whereas the Spark API allows developers to create both iterative in-memory and batch applications on Apache Hadoop YARN. Currently, however, Spark under utilizes YARN resources particularly when large datasets are involved. Spark does not behave like MapReduce or Tez—executing periodic jobs and releasing the compute resources once those jobs finish. In some ways, it behaves much more like a long running service, holding onto resources (such as memory) until the end of the entire workload. Using the experience we have already gained in building MapReduce, Apache Tez and other data processing engines, we believe similar concepts can be applied to Spark in order to optimize its resource utilization and be a good multi-tenant citizen within a YARN-based Hadoop cluster.
What are your opinions on Spark/YARN vs. Spark/Mesos?	Hortonworks is a Hadoop distribution, and our focus within the Spark community is to enable and leverage YARN for resource management so that our customers can use Spark along side other engines in a Hadoop cluster.
Are there advantages to run Spark on top of YARN as opposed to Mesos?	We have a solid understanding of how Spark on YARN is beneficial. Deeper integration of Spark with YARN allows it to become a more efficient tenant along side other engines, such as Hive, Storm, HBase and others, simultaneously, all on a single data platform. This avoids the need to create and manage dedicated Spark clusters to support that subset of applications for which Spark is ideally suited. It also shares resources effectively within a single cluster. Our customers want to run multiple workloads on a single set of data all in a single cluster. Hadoop already has a very wide array of engines available.
Is Spark used only through Java and Python programming by calling APIs or can we use it with Hive or Pig etc?	Besides Scala, Java and Python, Spark SQL with JDBC Server interface can be used.
Is there JDBC interface for Spark SQL?	With Spark 1.1.0 JDBC thrift service is included. However HDP Tech preview of Spark 1.1 does not support it yet.
Which IDE is used for Spark development? Does Hortonworks provide one?	Eclipse, Intellj or others. It depends on developers’ preference. Hortonworks doesn’t provide any IDE.
Does Hortonworks supplement or compete with Hadoop?	We do Hadoop. We are Hadoop. We are the ONLY 100% open source Hadoop distribution. Our distribution is comprised wholly of Apache Hadoop and its related Apache projects.

End to End Wire Encryption with Apache Knox

By Vinay Shukla

Enterprise Apache Hadoop provides the fundamental data services required to deploy into existing architectures. These include security, governance and operations services, in addition to Hadoop’s original core capabilities for data management and data access. This post focuses on recent work completed in the open source community to enhance the Hadoop security component, with encryption and SSL certificates.

Last year I wrote a blog summarizing wire encryption options in Hortonworks Data Platform (HDP). Since that blog, encryption capabilities in HDP and Hadoop have expanded significantly.

One of these new layers of security for a Hadoop cluster is Apache Knox. With Knox, a Hadoop cluster can now be made securely accessible to a large number of users. Today, Knox allows secure connections toApache HBase, Apache Hive, Apache Oozie, WebHDFS and WebHCat. In the near future, it will also include support for Apache Hadoop YARN, Apache Ambari, Apache Falcon, and all the REST APIs offered by Hadoop components.

Without Knox, these clients would connect directly to a Hadoop cluster, and the large number of direct client connections poses security disadvantages. The main one is access.

In a typical organization, only a few DBAs connect directly to a database, and all the end-users are routed through a business application that then connects to the database. That intermediate application provides an additional layer of security checks.

Hadoop’s approach with Knox is no different. Many Hadoop deployments use Knox to allow more users to make use of Hadoop’s data and queries without compromising on security. Only a handful of admins can connect directly to their Hadoop clusters, while end-users are routed through Knox.

Apache Knox plays the role of reverse proxy between end-users and Hadoop, providing two connection hops between the client and Hadoop cluster. The first connection is between the client and Knox, and Knox offers out of the box SSL support for this connection. The second connection is between Knox and a given Hadoop component, which requires some configuration.

This blog walks through configuration and steps required to use SSL for the second connection, between Knox and a Hadoop component.

SSL Certificates

SSL connections require a certificate, either self-signed or signed by a Certificate Authority (CA). The process of obtaining self-signed certificates differs slightly from how one obtains CA-signed certificates. When the CA is well-known, there’s no need to import the signer’s certificate into the truststore. For this blog, I will use self-signed certificates; however, wire encryption can also be enabled with a CA-signed certificate. Recently, my colleague Haohui Moi blogged about HTTPS with HDFS and included instructions for a CA-signed certificate.

SSL Between Knox & HBase

The first step is to configure SSL on HBase’s REST server (Stargate). To configure SSL, we will need to create a keystore to hold the SSL certificate. This example uses a self-signed certificate, and a SSL certificate used by a Certificate Authority (CA) makes the configuration steps even easier.

As user HBase (su hbase) create the keystore.

export HOST_NAME=`hostname`

keytool -genkey -keyalg RSA -alias selfsigned -keystore hbase.jks -storepass password -validity 360 -keysize 2048 -dname "CN=$HOST_NAME, OU=Eng, O=Hortonworks, L=Palo Alto, ST=CA, C=US" -keypass password

Make sure the common name portion of the certificate matches the host where the certificate will be deployed. For example, the self-signed SSL certificate I created has the following CN sandbox.hortonworks.com, when the host running HBase is sandbox.hortonworks.com.

“Owner: CN=sandbox.hortonworks.com, OU=Eng, O=HWK, L=PA, ST=CA, C=US

Issuer: CN=sandbox.hortonworks.com, OU=Eng, O=HWK, L=PA, ST=CA, C=US”

We just created a self-signed certificate for use with HBase. Self-signed certificates are rejected during SSL handshake. To get around this, export the certificate and put it in the cacerts file of the JRE used by Knox. (This step is unnecessary when using a certificate issued by a well known CA.)

On the machine running HBase, export HBase’s SSL certificate into a file hbase.crt:

keytool -exportcert -file hbase.crt -keystore hbase.jks -alias selfsigned -storepass password

Copy the hive.crt file to the Node running Knox and run:

keytool -import -file hbase.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts -storepass changeit -alias selfsigned

Make sure the path to cacerts file points to cacerts of JDK used to run Knox gateway.

The default cacerts password is “changeit.”

Configure HBase REST Server for SSL

Using Ambari or another tool used for editing Hadoop configuration properties:

[code language=”XML”]

hbase.rest.ssl.enabled

true

hbase.rest.ssl.keystore.store

/path/to/keystore/created/hbase.jks

hbase.rest.ssl.keystore.password

password

hbase.rest.ssl.keystore.keypassword

password

[/code]

Save the configuration and re-start the HBase REST server using either Ambari or with command line as in:

sudo /usr/lib/hbase/bin/hbase-daemon.sh stop rest & sudo /usr/lib/hbase/bin/hbase-daemon.sh start rest -p60080

Verify HBase REST server over SSL

Replace localhost with whatever is the hostname of your HBase rest server.

curl -H "Accept: application/json" -k https://localhost:60080/

It should output the tables in your HBase. e.g. name.

{“table”:[{“name”:”ambarismoketest”}]}

Configure Knox to point to HBase over SSL and re-start Knox

Change the URL of the HBase service for your Knox topology in sandbox.xml to HTTPS e.g, ensure Host matches the host of HBase rest server.

[code language=”XML”]

WEBHBASE

https://sandbox.hortonworks.com:60080

[/code]

Verify end to end SSL to HBase REST server via Knox

curl -H "Accept: application/json" -iku guest:guest-password -X GET'https://localhost:8443/gateway/sandbox/hbase/'

Should have an output similar to below:

HTTP/1.1 200 OK

Set-Cookie: JSESSIONID=166l8e9qhpi95ty5le8hni0vf;Path=/gateway/sandbox;Secure;HttpOnly

Expires: Thu, 01 Jan 1970 00:00:00 GMT

Cache-Control: no-cache

Content-Type: application/json

Content-Length: 38

Server: Jetty(8.1.14.v20131031)

{“table”:[{“name”:”ambarismoketest”}]}

SSL Between Knox & HiveServer

Create Keystore for Hive

As a Hive user (su hive) create the keystore.

The first step is to configure SSL on HiveServer2. To configure SSL create a keystore to hold the SSL certificate. The example here uses self-signed certificate. Using SSL certificate issued by a Certificate Authority(CA) will make the configuration steps easier by eliminating the need to import the signer’s certificate into the truststore.

On the Node running HiveServer2 run commands:

keytool -genkey -keyalg RSA -alias hive -keystore hive.jks -storepass password -validity 360 -keysize 2048-dname "CN=sandbox.hortonworks.com, OU=Eng, O=Hortonworks, L=Palo Alto, ST=CA, C=US" -keypass password

keytool -exportcert -file hive.crt -keystore hive.jks -alias hive -storepass password -keypass password

Copy the hive.crt file to the Node running Knox and run:

keytool -import -file hive.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts -storepass changeit -alias hive

Make sure the path to cacerts file points to cacerts of JDK used to run Knox gateway. The default cacerts password is “changeit.”

Configure HiveServer2 for SSL

Using Ambari or other tool used for editing Hadoop configuration, ensure that hive.jks is in a location readable by HiveServer such as /etc/hive/conf

[code language=”XML”]

hive.server2.use.SSL

true

hive.server2.keystore.path

/etc/hive/conf/hive.jks

path to keystore file

hive.server2.keystore.password

password

keystore password

[/code]

Save the configuration and re-start HiveServer2

Validate HiveServer2 SSL Configuration

Use Beeline & connect directly to HiveServer2 over SSL

beeline> !connect jdbc:hive2://sandbox:10001/;ssl=true beeline> show tables;

Ensure this connection works.

Configure Knox connection to HiveServer2 over SSL

Change the URL of the Hive service for your Knox topology in sandbox.xml to HTTPS e.g, ensure host matches the host of Hive server.

[code language=”XML”]

Hive

https://sandbox.hortonworks.com:10001/cliservice

[/code]

Validate End to End SSL from Beeline > Knox > HiveServer2

Use Beeline and connect via Knox to HiveServer2 over SSL

beeline> !connect jdbc:hive2://sandbox:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=knox hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/sandbox/hive

beeline> show tables;>

SSL Between Knox & WebHDFS

Create Keystore for WebHDFS

execute this shell script or set up environment vairables:

[code langauge=”javaScript”]

export HOST_NAME=`hostname`

export SERVER_KEY_LOCATION=/etc/security/serverKeys

export CLIENT_KEY_LOCATION=/etc/security/clientKeys

export SERVER_KEYPASS_PASSWORD=password

export SERVER_STOREPASS_PASSWORD=password

export KEYSTORE_FILE=keystore.jks

export TRUSTSTORE_FILE=truststore.jks

export CERTIFICATE_NAME=certificate.cert

export SERVER_TRUSTSTORE_PASSWORD=password

export CLIENT_TRUSTSTORE_PASSWORD=password

export ALL_JKS=all.jks

export YARN_USER=yarn

[/code]

execute these commands:

mkdir -p $SERVER_KEY_LOCATION ; mkdir -p $CLIENT_KEY_LOCATION

cd $SERVER_KEY_LOCATION;

keytool -genkey -alias $HOST_NAME -keyalg RSA -keysize 2048 -dname"CN=$HOST_NAME,OU=hw,O=hw,L=paloalto,ST=ca,C=us" -keypass $SERVER_KEYPASS_PASSWORD -keystore $KEYSTORE_FILE-storepass $SERVER_STOREPASS_PASSWORD

cd $SERVER_KEY_LOCATION ; keytool -export -alias $HOST_NAME -keystore $KEYSTORE_FILE -rfc -file $CERTIFICATE_NAME -storepass $SERVER_STOREPASS_PASSWORD

keytool -import -trustcacerts -file $CERTIFICATE_NAME -alias $HOST_NAME -keystore $TRUSTSTORE_FILE

Also, import the certificate to the truststore used by Knox which is the JDK’s default cacerts file.

keytool -import -trustcacerts -file certificate.cert -alias $HOST_NAME -keystore /usr/lib/jvm/jre-1.7.0openjdk.x86_64/lib/security/cacerts

Make sure to point the path to cacerts for the JDK used by Knox in your deployment.

Type yes when the asked to add the certificate to the truststore.

Configure HDFS for SSL

Copy example ssl-server.xml and edit it to use the ssl configuration created in previous step.

cp /etc/hadoop/conf.empty/ssl-server.xml.example /etc/hadoop/conf/ssl-server.xml

And and make sure the following properties are set in /etc/hadoop/conf/ssl-server.xml:

[code language=”XML”]

ssl.server.truststore.location

/etc/security/serverKeys/truststore.jks

ssl.server.truststore.password

password

ssl.server.truststore.type

jks

ssl.server.keystore.location

/etc/security/serverKeys/keystore.jks

ssl.server.keystore.password

password

ssl.server.keystore.type

jks

ssl.server.keystore.keypassword

password

[/code]

Use Ambari to set the following properties in core-site.xml.

[code language=”XML”]

hadoop.ssl.require.client.cert=false

hadoop.ssl.hostname.verifier=DEFAULT_AND_LOCALHOST

hadoop.ssl.keystores.factory.class=org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory

hadoop.ssl.server.conf=ssl-server.xml

[/code]

Use Ambari to set the following properties in hdfs-site.xml.

[code language=”XML”]

dfs.http.policy=HTTPS_ONLY

dfs.datanode.https.address=workshop.hortonworks.com:50475

[/code]

The valid values for dfs.http.policy are HTTPS_ONLY & HTTP_AND_HTTPS.

The valid values for hadoop.ssl.hostname.verifier are DEFAULT, STRICT,STRICT_I6, DEFAULT_AND_LOCALHOST and ALLOW_ALL. Only use ALLOW_ALL in a controlled environment & with caution. And then use ambari to restart all hdfs services.

Configure Knox to connect over SSL to WebHDFS

Make sure /etc/knox/conf/topologies/sandbox.xml (or whatever is the topology for your Knox deployment is) has a valid service address with HTTPS protocol to point to WebHDFS.

[code language=”XML”]

WEBHDFS

https://workshop.hortonworks.com:50470/webhdfs

[/code]

Validate End to End SSL – Client > Knox > WebHDFS

curl -iku guest:guest-password -X GET 'https://workshop.hortonworks.com:8443/gateway/sandbox/webhdfs/v1/ op=LISTSTATUS'

SSL Between Knox & Oozie

By default, Oozie server runs with properties necessary for SSL configuration. For example,

do a ‘ps’ on your Oozie server (look for the process named Bootstrap) and you will see the following properties:

-Doozie.https.port=11443
-Doozie.https.keystore.file=/home/oozie/.keystore
-Doozie.https.keystore.pass=password

You can change these properties with Ambari in Oozie server config in the advance oozie-env config section. I changed them to point the keystore file to /etc/oozie/conf/keystore.jks. For this blog, I re-used the keystore I created earlier for HDFS and copied /etc/security/serverKeys/keystore.jks to /etc/oozie/conf/keystore.jks

Configure Knox to connect over SSL to Oozie

Make sure /etc/knox/conf/topologies/sandbox.xml (if whatever is the topology for your Knox deployment is) has a valid service address with HTTPS protocol to point to Oozie.

[code language=”XML”]

OOZIE

https://workshop.hortonworks.com:11443/oozie

[/code]

Validate End to End SSL – Client > Knox > Oozie

Apache Knox comes with a DSL client that makes operations against a Hadoop cluster trivial to use for exploration and one does not need to handle raw HTTP requests as I used in the previous examples.

Now run the following commands:

cd to /usr/lib/knox & java -jar bin/shell.jar samples/ExampleOozieWorkflow.groovy

Check out the /usr/lib/knox/samples/ExampleOozieWorkflow.groovy for details on what this script does.

Use higher strength SSL cipher suite

Often it is desirable to use a more secure cipher suite for SSL. For example, in the HDP 2.1 Sandbox, the cipher used for SSL between curl client and Knox in my environment is “EDH-RSA-DES-CBC3-SHA.”

JDK ships with ciphers that comply with US exports restricts and that limit the cipher strength. To use higher strength cipher, download UnlimitedJCE policy for your JDK vendor’s website and copy the files into the JRE. For Oracle JDK, you can get unlimited policy files from here.

cp UnlimitedJCEPolicy/* /usr/jdk64/jdk1.7.0_45/jre/lib/security/

If you use Ambari to setup your Hadoop cluster, you already have highest strength cipher allowed for the JDK, since Ambari setups the JDK with unlimited strength JCE policy.

Optional: Verify higher strength cipher with SSLScan

You can use a tool such as OpenSSL or SSLScan to verify that now AES256 is used for SSL.

For example, the command:

openssl s_client -connect localhost:8443

will print the cipher details:

New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384

Conclusion

SSL is the backbone of wire encryption. In Hadoop, there are multiple channels to move the data and access the cluster. This blog covered key SSL concepts and walked through steps to configure a cluster for end-to-end SSL. These steps are validated on a single node Hadoop cluster. Following these instructions for a multi-node Hadoop cluster will require a few more steps.

This blog covered a lot of ground, and I want to thank my colleague Sumit Gupta for his excellent input. Please let me know you have questions or comments.

This and That

Thursday, October 23, 2014

Apache Spark Webinar Recap: Q&A Session

Monday, October 20, 2014