Spark & HDP Perfect Together

By Vinay Shukla and Ram Sriharsha

Apache Spark’s momentum continues to grow, and throughout 2015 we saw customers across all industries get real value from using it with the Hortonworks Data Platform (HDP). Examples include:

Insurance	Optimize the claims reimbursement process by using Spark’s machine learning capabilities to process and analyze all claims.
Healthcare	Build a Patient Care System using Spark Core, Streaming and SQL.
Retail	Use Spark to analyze point-of-sale data and coupon usage.
Internet	Use Spark’s ML capability to identify fake profiles and improve the product matches shown to customers.

Many customers are using Cloudbreak and Ambari to spin up clusters in the cloud for ad-hoc, self-service data science.

Spark as a technology made great progress in 2015 as we took it through numerous technology preview releases before its “GA” release with HDP 2.2.4. Below we will recap the progress made in the past year and also outline what to expect next.

Recap of progress since last year

Last year we outlined Hortonworks’ Spark roadmap, and much of that roadmap has been successfully delivered. We worked with the community to bring ORC support to Spark and now this work is GA with Spark 1.4.1 (SPARK-2883, SPARK-10623).

On the operations side, we created an Ambari stack definition for Spark to get it installed and managed in a manner consistent with the rest of the Hadoop cluster. We also improved Spark’s support for Kerberized clusters and began work on integrating the Spark History Server with the YARN Application Timeline Service (ATS). This work (pull request) is now making its way through the community.

Most of our customers who use Spark also use Hive. Last year we worked with the community to upgrade the version of Hive in Spark to 0.13.1 and this year we have upgraded Spark to use Hive 1.2.1. We have also streamlined the integration between Hive & Spark. This integration enables customers who have embraced the data lake concept to access their data from Hive or from Spark without running into Hive version incompatibility issues. In addition, this work lays the foundation for both Hive and Spark to evolve independently while also allowing Spark to leverage Hive.

We also contributed many Machine Learning (ML) improvements, including SPARK-7015, SPARK-8092, and SPARK-7014, and we continue to focus on improving ML support in PySpark (SPARK-7861, SPARK-7833, SPARK-7690, SPARK-7404).

Finally, we are adding a number of examples and user-facing documentation in the community (SPARK-6013, SPARK-7387, SPARK-7546, SPARK-7574, SPARK-7575, SPARK-9670).

What’s Next for Spark and HDP

Data science is one of the sweet spots for Spark, but finding unicorn data scientists remains a challenge. The skills gap is hindering many projects.

Spark provides primitives for Machine Learning algorithms (MLlib) and for modeling Machine Learning workflows (ML Pipeline), but there is still a need to stitch together higher-level libraries to solve common problems such as entity resolution or anomaly detection.

Writing a Spark program is easy, but debugging and tuning Spark jobs require a deeper understanding of how Spark works and how the data is distributed. Many of our customers struggle to find the optimal number and size of Spark executors for a workload.
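To make the executor sizing question concrete, here is a minimal sketch of one commonly cited rule of thumb for Spark on YARN: reserve a core and some memory per node for the OS and Hadoop daemons, cap each executor at a handful of cores, and leave headroom for off-heap overhead. The function name and every default value below are assumptions chosen for illustration, not a Hortonworks recommendation; only the three `spark.executor.*` property names are real Spark settings. Treat the result as a starting point to tune against the actual workload.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing (illustrative; all defaults are assumptions)."""
    # Cores left on each node after reserving some for the OS and daemons.
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for a single executor")

    # Leave one executor-sized slot in the cluster for the YARN application master.
    num_executors = max(1, nodes * executors_per_node - 1)

    # Split the usable memory evenly, then hold back a fraction for
    # off-heap overhead so the container fits within its YARN allocation.
    mem_per_slot_gb = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    heap_gb = int(mem_per_slot_gb * (1 - overhead_fraction))

    return {
        "spark.executor.instances": num_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
    }


if __name__ == "__main__":
    # Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB each.
    for key, value in size_executors(10, 16, 64).items():
        print(f"--conf {key}={value}")
```

The printed values can be passed to `spark-submit` as `--conf` settings. A heuristic like this ignores data skew, shuffle size and caching needs, which is exactly why tuning remains hard in practice.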

Spark is the fastest moving project in the big data ecosystem and its libraries remain at different levels of maturity. Our customers often ask us to guide them on their Spark journey, provide updates on the maturity of the Spark ecosystem and advise them on how to deploy their solutions into production. As more Spark projects move from pilot to production, this becomes even more important. Investigating, validating, certifying and then supporting each of the components in the Spark project is one of the key ways that we provide value to our end customers.

At Hortonworks our focus remains on addressing our customers’ challenges. We have identified the following three focus areas to address these challenges with Spark and Hadoop:

Accelerate Data Science
Seamless Data Access
Innovate at the Core

Accelerate Data Science

Notebooks enable data scientists to be more productive: they can develop and execute Spark code and visualize results without having to go to the Spark command line or worry about cluster details. There are many notebook choices available with Spark. iPython remains a mature choice and we have an Ambari stack definition available to quickly set up iPython on Hadoop clusters.

Apache Zeppelin is a new and upcoming notebook which brings data exploration, visualization, sharing and collaboration features to Spark. We are excited about this project and are working with the community to bring Zeppelin to maturity.

We added a Hive interpreter to Zeppelin and are working with the community to improve the editor and make it more stable. Some of the JIRAs tracking this work are ZEPPELIN-11, ZEPPELIN-16, ZEPPELIN-19, ZEPPELIN-33, ZEPPELIN-188, ZEPPELIN-223, and ZEPPELIN-299.

Last week we released our first tech preview of Zeppelin. We plan to make Zeppelin ready for production use by adding security and R support, improving stability and making the visualization more intuitive.

We are deepening our involvement in the Zeppelin community to help deliver features such as summary statistics and context-sensitive help that improve the data science experience.

Even with notebooks, data science remains hard. Data scientists often struggle with feature engineering, algorithm selection and tuning, sharing their work with others and putting that work into production.

Many data science problems are common across multiple industries and customers. For example, geospatial queries are useful to optimize search ads for location, to predict crime hotspots and to understand where people take Uber on weekends. Processing geospatial data is hard. There are some libraries out there, but they often miss metadata, aren’t language integrated or aren’t scalable. To address the common problem of geospatial queries, we recently published Magellan. This library brings geospatial queries to Spark.
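As an illustration of the kind of primitive a geospatial query is built on, the sketch below tests whether a point falls inside a polygon using the standard ray-casting method. This is plain Python and does not use or represent Magellan's API; the function name, the planar (x, y) coordinates and the sample "neighborhood" are assumptions made for the example.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses.

    `polygon` is a list of (x, y) vertices in order. Points lying exactly on an
    edge are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


if __name__ == "__main__":
    # Hypothetical neighborhood boundary and pickup locations.
    neighborhood = [(0, 0), (4, 0), (4, 4), (0, 4)]
    pickups = [(2, 2), (5, 2), (1, 3), (-1, -1)]
    hits = [p for p in pickups if point_in_polygon(p[0], p[1], neighborhood)]
    print(f"{len(hits)} of {len(pickups)} pickups fall inside the neighborhood")
```

Running this test for every point against every polygon does not scale, and it ignores metadata and real map projections. Those are the gaps a Spark-integrated library is meant to close.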

Another common data cleansing problem is entity disambiguation or de-duplication. For example, an entity could be the name of a person that is represented differently across multiple data pipelines. In such a case, it is important to know when two entities are the same, even though they may be misspelled or appear in a different format (for example, "last name, first name" instead of "first name last name").

Common solutions to entity disambiguation include string similarity techniques or other simple similarity measures. Based on these measures, two entities can be considered close or identical if the distance between them (1 – similarity) is small in some sense. Unfortunately, many of these techniques are ad hoc and, more importantly, do not learn from data and do not have high predictive power. Furthermore, any solution needs to provide a way to plug into private and public knowledge bases to mine relationships between entities.
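The simple baseline described above can be sketched in a few lines using `difflib.SequenceMatcher` from the Python standard library. The normalization rule, the function names and the 0.2 threshold are assumptions picked for illustration. Note how ad hoc they are: nothing here is learned from data, which is the weakness the paragraph points out.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, collapse whitespace, and turn "Last, First" into "first last"."""
    name = name.strip().lower()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.split())


def distance(a, b):
    """Distance = 1 - similarity, with similarity in [0, 1]."""
    similarity = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return 1.0 - similarity


def same_entity(a, b, threshold=0.2):
    """Treat two names as the same entity when their distance is small.

    The threshold is a hand-picked assumption, not a tuned value.
    """
    return distance(a, b) <= threshold


if __name__ == "__main__":
    pairs = [
        ("John Smith", "Smith, John"),   # different format
        ("John Smith", "Jon Smith"),     # misspelling
        ("John Smith", "Mary Jones"),    # different person
    ]
    for a, b in pairs:
        print(f"{a!r} vs {b!r}: distance={distance(a, b):.2f}, same={same_entity(a, b)}")
```

A fixed threshold like this will merge distinct people with similar names and miss nicknames or transliterations, so it serves only as a baseline to compare against.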

We are working to open source an entity disambiguation library that runs on top of Spark and Spark ML. This library will allow consumers to leverage best-in-class entity disambiguation algorithms and will also provide extension points for consuming private knowledge bases and for training algorithms that have high prediction accuracy.

Seamless Data Access

Many of our customers have embraced the data lake concept where they bring all their enterprise data from disparate sources under a single data management architecture. As the logical next step they want to process all that data within Spark. As part of the data lake architecture they have made significant commitments to HDFS, HBase, the ORC file format and other storage abstractions in the Hadoop ecosystem. Since Spark has first class support for external data sources, it can run directly on the cluster in YARN, and that is where customers want to perform their data analysis.

With the DataSource API, Spark provides a first-class way to bring in data from external sources while leveraging these systems for filtering, predicate pushdown and so on. We used the DataSource API to bring ORC data efficiently into Spark, and we are now working to bring HBase data efficiently into Spark as well. This will differ from the existing Spark + HBase connector in that it will leverage HBase to read only the data required for the query, rather than reading extraneous data. Look for a preview of this feature in the coming months.
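The benefit of pushing a predicate down to the storage layer can be shown with a toy model. The sketch below is plain Python and does not use the DataSource API or any HBase client; `SortedStore` is a hypothetical stand-in for a store that keeps rows sorted by key, and the `rows_read` counter exists only to make the difference visible.

```python
import bisect


class SortedStore:
    """Hypothetical key-sorted store standing in for an external data source."""

    def __init__(self, rows):
        self._rows = sorted(rows)                 # (key, value) pairs, ordered by key
        self._keys = [key for key, _ in self._rows]
        self.rows_read = 0                        # how many rows left the store

    def scan(self, start=None, stop=None):
        """Yield rows with start <= key < stop, seeking directly to the range."""
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect.bisect_left(self._keys, stop)
        for row in self._rows[lo:hi]:
            self.rows_read += 1
            yield row


def query_without_pushdown(store, start, stop):
    # The engine pulls every row and filters afterwards.
    return [row for row in store.scan() if start <= row[0] < stop]


def query_with_pushdown(store, start, stop):
    # The filter is handed to the store, which reads only the matching range.
    return list(store.scan(start, stop))


if __name__ == "__main__":
    rows = [(f"user{i:04d}", i) for i in range(1000)]

    full = SortedStore(rows)
    pushed = SortedStore(rows)
    a = query_without_pushdown(full, "user0100", "user0110")
    b = query_with_pushdown(pushed, "user0100", "user0110")

    print("same result:", a == b)
    print("rows read without pushdown:", full.rows_read)
    print("rows read with pushdown:", pushed.rows_read)
```

Both queries return the same rows, but the pushed-down version only touches the rows in the requested key range. A connector that translates query filters into range scans on the source aims for the same effect at cluster scale.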

The value of a data lake is that it brings more data under one roof and opens new opportunities to gain insights and drive efficiency. In such scenarios, Spark helps tap into this value.

Innovate at the Core

Many of our customers have asked for the ability to share the benefits of Spark across multiple users. We think that with RDD and SparkContext sharing, multiple users can access shared Spark resources. We have an effort making its way through the community (SPARK-6112) to allow RDDs to be persisted to the HDFS memory tier. This will allow RDDs to be shared across users.

Spark Streaming also continues to gain traction with our customers. They view the single programming model and execution engine for both batch and streaming workloads as a way to simplify their development and deployment architectures. Making Spark Streaming more robust and enterprise ready has been an area of focus for us and we are hard at work to deliver it as GA in our platform soon.

Spark is fast becoming a cornerstone of many enterprises’ data platform strategy. As such, it needs to meet the familiar enterprise requirements around DevOps, security, stability, installation and upgrades. We have already delivered an Ambari stack definition to provide an easy install and upgrade experience for Spark on HDP. We continue to work on improving the security and multi-tenancy story, both within Spark and in YARN. Customers can now take advantage of advanced YARN features like CPU scheduling, node labels, disk isolation, network isolation, and cluster capacity management across workloads.

Wrap up

We did much over the last year, but much more remains to be done. We are excited to be working on the next set of features to make our customers more successful with their Spark and Data Science projects.

We have seen rapid adoption of Spark in our customer base and we want to thank our customers for choosing Spark on HDP. We also want to thank our partners Microsoft, Databricks, HP and NFLabs, as well as the community, for sharing this journey with us.

If you are new to Spark or HDP and want to get started, download the free Hortonworks Sandbox and try some of the Spark Tutorials.

Tuesday, October 27, 2015