Announcing Apache Spark, Now GA on Hortonworks Data Platform
By Vinay Shukla
Hortonworks is pleased to announce the general availability of Apache Spark in the Hortonworks Data Platform (HDP), now available on our downloads page. With HDP 2.2.4, Hortonworks now offers support for Apache Spark 1.2.1 to your developers and data scientists.
HDP’s YARN-based architecture enables multiple applications to share a common cluster and dataset while ensuring consistent levels of service and response. Now Spark is one of the many data access engines that works with YARN and that is supported in an HDP enterprise data lake. Spark provides HDP subscribers yet another way to derive value from any data, any application, anywhere.
When we started our Apache Spark journey at Hortonworks, we laid out our roadmap for all to see. Phase 1 of that roadmap included improved integration with Apache Hive, enhanced security and streamlined Spark operations. We met those Phase 1 commitments, and now that Spark is GA on HDP, we are happy to report on that roadmap progress.
For improved Hive integration, HDP 2.2.4 offers ORC file support for Spark. This allows Spark to read data stored in ORC files and to leverage ORC's efficient columnar storage and predicate pushdown capabilities for even faster in-memory processing.
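As a sketch of what this looks like in practice, the snippet below uses Spark's HiveContext (the Spark SQL entry point for Hive data in the 1.2.x line) to query a Hive table stored as ORC. The table and column names here are hypothetical placeholders, and the code assumes a cluster with Hive and the ORC-backed table already in place.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object OrcReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OrcReadExample"))
    val hiveContext = new HiveContext(sc)

    // Query a Hive table stored as ORC (table/column names are examples).
    // Spark reads only the referenced columns, and the WHERE clause can be
    // pushed down into the ORC reader so non-matching stripes are skipped.
    val results = hiveContext.sql(
      "SELECT customer_id, total FROM orders WHERE total > 100")
    results.collect().foreach(println)

    sc.stop()
  }
}
```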
To simplify Hadoop deployment, HDP 2.2.4 also allows Spark to be installed and managed by Apache Ambari 2.0. Ambari allows the cluster administrator to manage both Spark configuration and the lifecycle of the Spark daemons. The screenshot below shows an example of Ambari managing Spark.
To enhance security, we worked within the community to ensure that Spark runs on a Kerberos-enabled cluster. This means that only authenticated users can submit Spark jobs on those secure clusters.
Spark for Developers
Hortonworks does Hadoop and developers do Hadoop. With Apache Hadoop YARN as the common architectural center of a Modern Data Architecture, developers everywhere can now create applications to exploit Spark’s in-memory processing power, derive insights from its iterative and analytic capabilities, and enrich their data science workloads with other types of data processing within a single, shared dataset in HDP.
At the Spark Summit East in New York last month, we talked with many of our customers about their innovative use cases that leverage Apache Spark. We find Spark’s in-memory data processing engine to be extremely useful in addressing use cases from ETL (Extract, Transform, Load) to stream processing to building recommendation and classification systems.
We share our customers' excitement, and Hortonworks has embraced Spark from the outset of its introduction into the community. Over the past year, we introduced our Hortonworks subscribers to Spark through a series of technical previews, to show its value in the context of an enterprise data lake.
And now, we encourage our customers to use Spark as a GA component in HDP 2.2. For developers new to Spark, our conversations typically revolve around two stages in their journey building Spark-based applications:
Stage 1 – Explore and Develop in Spark Local Mode
The first stage starts with Spark in local mode, where Spark runs on a single node. The developer uses this setup to learn Spark and to build a prototype of their application against the Spark API. Using the Spark shells (the Scala REPL and PySpark), a developer rapidly prototypes, then packages the Spark application with tools such as Maven or sbt. Even though the dataset is typically small (so that it fits on a developer machine), a developer can easily debug the application on a single node.
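A typical first session in the Scala shell might look like the word count below. This is only a sketch: the file path is a hypothetical local file, and `sc` (the SparkContext) is created automatically when the shell starts.

```scala
// Inside spark-shell, `sc` is already defined.
val lines = sc.textFile("data/sample.txt")   // hypothetical local file

// Classic word count: split on whitespace, count each word.
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
```

Because the shell gives immediate feedback on each transformation, this style of exploration is a natural way to converge on the logic before packaging it into an application.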
Stage 2 – Deploy Production Spark Applications
The second stage involves running the prototype application against a much larger dataset to fine-tune it and get it ready for production deployment. Typically, this means running Spark on YARN as another workload in the enterprise data lake, reading data from HDFS. The developer takes the application created against local-mode Spark and submits it as a Spark job to a staging or production cluster.
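Submission is done with the `spark-submit` script that ships with Spark. The command below is a sketch, assuming a packaged application jar; the class name, resource sizes, jar, and HDFS path are all placeholders to adapt to your cluster.

```
# Submit a packaged Spark application to YARN (Spark 1.2-era syntax,
# where yarn-cluster mode is selected via --master).
spark-submit \
  --class com.example.MyApp \
  --master yarn-cluster \
  --num-executors 10 \
  --executor-memory 4g \
  target/my-app-1.0.jar hdfs:///data/input
```

In yarn-cluster mode the driver itself runs inside the cluster, which is the usual choice for production jobs; on a Kerberos-enabled cluster, the submitting user must first authenticate (for example with `kinit`) before the job is accepted.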
Data Science with Spark
For data scientists, Spark is a highly effective data processing tool. It offers first-class support for machine learning algorithms and provides an expressive, higher-level API for transforming and iterating over datasets. Put simply, Apache Spark makes it easier to build machine learning pipelines compared to other approaches.
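To make that concrete, here is a minimal sketch using MLlib's k-means clustering from the Spark shell. The input path is a hypothetical file of comma-separated numeric features; the choice of three clusters and twenty iterations is illustrative only.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse comma-separated numeric features into MLlib vectors.
val data = sc.textFile("data/features.txt")   // hypothetical input
val vectors = data
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Cluster into 3 groups with up to 20 iterations of k-means.
val model = KMeans.train(vectors, 3, 20)
model.clusterCenters.foreach(println)
```

The entire pipeline, from parsing to model training, stays inside one API and one runtime, which is a large part of what makes Spark attractive for this kind of work.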
Data scientists often use notebook tools (e.g., IPython) to quickly create prototypes and share their work. Many data scientists love R, and the Spark community is hard at work delivering R integration through SparkR. We are excited about this emerging capability.
For ease of use, Apache Zeppelin is an emerging tool that provides Notebook features for Spark. We have been exploring Zeppelin and discovered that it makes Spark more accessible and useful. We will write about using Zeppelin with Spark in the coming weeks. Here is a screenshot that provides a view into the compelling user interface that Zeppelin can provide for Spark.
At Hortonworks, we embarked on our Spark journey early last year and we are honored to announce the general availability of Spark 1.2.1 on HDP. Typical of Apache open source innovation, the Spark community continues to move quite rapidly and it recently released Spark 1.3.
This new version of Spark brings new features such as the DataFrames API and Kafka support in Spark Streaming. We are fully committed to Spark and plan to revise our Spark Tech Preview to give customers access to the new capabilities in Spark 1.3. Given the pace at which these capabilities continue to appear, we plan to keep providing updates via tech previews between our major releases, so customers can keep up with the speed of innovation in Spark.