This and That: 2015

Tuesday, October 27, 2015

Spark & HDP Perfect Together

By Vinay Shukla and Ram Sriharsha

Apache Spark’s momentum continues to grow and throughout 2015 we saw customers across all industries get real value from using it with the Hortonworks Data Platform (HDP). Examples include:

Insurance	Optimize the claims reimbursement process by using Spark’s machine learning capabilities to process and analyze all claims.
Healthcare	Build a Patient Care System using Spark Core, Streaming and SQL.
Retail	Use Spark to analyze point-of-sale data and coupon usage.
Internet	Use Spark’s ML capability to identify fake profiles and improve the product matches shown to customers.

Many customers are using Cloudbreak and Ambari to spin up clusters in the cloud for ad-hoc, self-service data science.

Spark as a technology made great progress in 2015 as we took it through numerous technology preview releases before its “GA” release with HDP 2.2.4. Below we will recap the progress made in the past year and also outline what to expect next.

Recap of progress since last year

Last year we outlined Hortonworks’ Spark roadmap, and much of that roadmap has been successfully delivered. We worked with the community to bring ORC support to Spark and now this work is GA with Spark 1.4.1 (SPARK-2883, SPARK-10623).

On the operations side, we created an Ambari stack definition for Spark to get it installed and managed in a manner consistent with the rest of the Hadoop cluster. We also improved Spark’s support for Kerberized clusters and began work on integrating the Spark history server with the YARN Application Timeline Service (ATS). This work is now making its way through the community as a pull request.

Most of our customers who use Spark also use Hive. Last year we worked with the community to upgrade the version of Hive in Spark to 0.13.1 and this year we have upgraded Spark to use Hive 1.2.1. We have also streamlined the integration between Hive & Spark. This integration enables customers who have embraced the data lake concept to access their data from Hive or from Spark without running into Hive version incompatibility issues. In addition, this work lays the foundation for both Hive and Spark to evolve independently while also allowing Spark to leverage Hive.

We also contributed many Machine Learning (ML) improvements (including SPARK-7015, SPARK-8092 and SPARK-7014) and continue to focus on improving the ML support in PySpark (SPARK-7861, SPARK-7833, SPARK-7690, SPARK-7404).

Finally, we are adding a number of examples and user-facing documentation in the community (SPARK-6013, SPARK-7387, SPARK-7546, SPARK-7574, SPARK-7575, SPARK-9670).

What’s Next for Spark and HDP

Data science is one of the sweet spots for Spark, but the challenge remains finding unicorn data scientists. The skills gap is hindering many projects.

Although Spark provides primitives for Machine Learning algorithms (MLlib) and for modeling Machine Learning workflows (ML Pipeline), there is still a need to stitch together higher-level libraries to solve common problems such as entity resolution or anomaly detection.

Writing a Spark program is easy, but debugging and tuning Spark jobs require a deeper understanding of how Spark works and how the data is distributed. Many of our customers struggle to find the optimal number and size of Spark executors for a workload.
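To make the executor sizing question concrete, here is a minimal sketch of the kind of back-of-the-envelope arithmetic people often start from. It is a rule of thumb only, not Hortonworks guidance: the helper name, the default of five cores per executor, the one core and 1 GB reserved per node, and the 10% memory overhead are all assumptions chosen for illustration, and any real workload needs measurement.

```python
def suggest_executors(nodes, cores_per_node, mem_gb_per_node,
                      cores_per_executor=5, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing (hypothetical helper, not a Spark API)."""
    # Assumption: reserve one core and 1 GB per node for the OS and daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for a single executor")

    # Assumption: leave one executor slot cluster-wide for the YARN
    # application master, but never suggest fewer than one executor.
    num_executors = max(1, nodes * executors_per_node - 1)

    # Split node memory across executors, then hold back the overhead share.
    mem_per_slot_gb = usable_mem_gb / executors_per_node
    executor_memory_gb = int(mem_per_slot_gb * (1 - overhead_fraction))

    return {
        "num_executors": num_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": executor_memory_gb,
    }


# Example: 10 worker nodes, each with 16 cores and 64 GB of memory.
plan = suggest_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
print(plan)
```

The output is only a starting point; data skew, caching and shuffle behavior usually force further adjustment.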

Spark is the fastest moving project in the big data ecosystem and its libraries remain at different levels of maturity. Our customers often ask us to guide them on their Spark journey, provide updates on the maturity of the Spark ecosystem and advise them on how to deploy their solutions into production. As more Spark projects move from pilot to production, this becomes even more important. Investigating, validating, certifying and then supporting each of the components in the Spark project is one of the key ways that we provide value to our end customers.

At Hortonworks our focus remains on addressing our customers’ challenges. We have identified the following three focus areas to address these challenges with Spark and Hadoop:

Accelerate Data Science
Seamless Data Access
Innovate at the Core

Accelerate Data Science

Notebooks enable data scientists to be more productive: they can develop and execute Spark code and visualize results without having to go to the Spark command line or worry about cluster details. There are many notebook choices available with Spark. IPython remains a mature choice and we have an Ambari stack definition available to quickly set up IPython on Hadoop clusters.

Apache Zeppelin is a new, up-and-coming notebook that brings data exploration, visualization, sharing and collaboration features to Spark. We are excited about this project and are working with the community to bring Zeppelin to maturity.

We added a Hive interpreter to Zeppelin and are working with the community to improve the editor and make it more stable. Some of the JIRAs tracking the work are ZEPPELIN-11, ZEPPELIN-16, ZEPPELIN-19, ZEPPELIN-33, ZEPPELIN-188, ZEPPELIN-223, and ZEPPELIN-299.

Last week we released our first tech preview of Zeppelin. We plan to make Zeppelin ready for production use by adding security and R support, improving stability and making the visualization more intuitive.

We are deepening our involvement in the Zeppelin community to help deliver features such as summary statistics and context-sensitive help that improve the data science experience.

Even with notebooks, data science remains hard. Data scientists often struggle with feature engineering, algorithm selection and tuning, sharing their work with others and putting that work into production.

Many data science problems are common across multiple industries and customers. For example, geospatial queries are useful to optimize search ads for location, to predict crime hotspots and to understand where people take Uber on weekends. Processing geospatial data is hard. There are some libraries out there, but they often lack metadata support, are not language-integrated or do not scale. To address the common problem of geospatial queries, we recently published Magellan. This library brings geospatial queries to Spark.
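To give a sense of what a basic geospatial query involves, here is a minimal plain-Python sketch of a point-in-polygon test using the standard ray-casting technique. It is not Magellan's API and says nothing about how Magellan is implemented; the function name and the sample coordinates are made up for illustration.

```python
def point_in_polygon(point, polygon):
    """Return True if point (x, y) lies inside polygon, a list of (x, y) vertices.

    Ray casting: count how many polygon edges a horizontal ray from the
    point crosses; an odd count means the point is inside.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical "zone" and pickup locations, in arbitrary planar units.
zone = [(0, 0), (4, 0), (4, 4), (0, 4)]
pickups = [(2, 2), (5, 2), (1, 3)]
print(sum(point_in_polygon(p, zone) for p in pickups))  # pickups inside the zone
```

Points that fall exactly on an edge are not treated specially in this sketch, and real geospatial data also needs coordinate systems, indexing and distribution across a cluster, which is where a dedicated library helps.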

Another common data cleansing problem is entity disambiguation or de-duplication. For example, an entity could be the name of a person that is represented differently across multiple data pipelines. In such a case, it is important to know when two entities are the same, even though they may be misspelled or appear in a different format (last name, first name instead of first name, last name, for example).

Common solutions to entity disambiguation include string similarity techniques or other simple similarity measures. Based on these measures, two entities can be considered close or identical if the distance between them (1 – similarity) is small in some sense. Unfortunately, many of these techniques are ad hoc and, more importantly, do not learn from data and do not have high predictive power. Furthermore, any solution needs to provide a way to plug into private and public knowledge bases to mine relationships between entities.

We are working to open source an entity disambiguation library that runs on top of Spark and Spark ML. This library will allow consumers to leverage best-in-class entity disambiguation algorithms and will also provide extension points for consuming private knowledge bases and training algorithms that have high prediction accuracy.
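As an illustration of the simple baseline described above (and of its limits), here is a minimal sketch that uses Python's standard `difflib` to compute a distance of 1 – similarity between two names. The normalization step, the function names and the 0.2 threshold are assumptions made for this example; it is exactly the kind of ad hoc measure that does not learn from data.

```python
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase the name and turn 'Last, First' into 'first last'."""
    name = name.strip().lower()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.split())  # collapse repeated whitespace


def name_distance(a, b):
    """Distance = 1 - similarity, with difflib's ratio as the similarity."""
    similarity = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return 1.0 - similarity


def same_entity(a, b, threshold=0.2):
    """Treat two names as the same entity if their distance is small."""
    return name_distance(a, b) <= threshold


print(same_entity("Smith, John", "John Smith"))   # format difference
print(same_entity("John Smith", "Jon Smith"))     # misspelling
print(same_entity("John Smith", "Mary Jones"))    # different people
```

The threshold is a hand-picked constant here, which is the weakness the text points out: nothing in this approach adapts to the data or consults a knowledge base.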

Seamless Data Access

Many of our customers have embraced the data lake concept where they bring all their enterprise data from disparate sources under a single data management architecture. As the logical next step they want to process all that data within Spark. As part of the data lake architecture they have made significant commitments to HDFS, HBase, the ORC file format and other storage abstractions in the Hadoop ecosystem. Because Spark has first-class support for external data sources and can run directly on the cluster under YARN, that is where customers want to perform their data analysis.

With the DataSource API, Spark provides a first-class way to bring in data from external sources while leveraging those systems for filtering, predicate pushdown and similar optimizations. We used the DataSource API to bring ORC data efficiently into Spark, and we are now working to do the same for HBase data. This will differ from the existing Spark + HBase connector in that it will leverage HBase to read only the data required for the query, rather than reading extraneous data. Look for a preview of this feature in the coming months.

The value of a data lake is that it brings more data under one roof and opens new opportunities for insight and efficiency. Spark helps tap into this value.

Innovate at the Core

Many of our customers have asked for the ability to share the benefits of Spark across multiple users. We think that with RDD and SparkContext sharing, multiple users can access shared Spark resources. We have an effort making its way through the community (SPARK-6112) to allow RDDs to be persisted to the HDFS memory tier. This will allow RDDs to be shared across users.

Spark Streaming also continues to gain traction with our customers. They view the single programming model and execution engine for both batch and streaming workloads as a way to simplify their development and deployment architectures. Making Spark Streaming more robust and enterprise-ready has been an area of focus for us and we are hard at work to deliver it as GA in our platform soon.

Spark is fast becoming a cornerstone of many enterprises’ data platform strategy. As such, it needs to meet the familiar enterprise requirements around DevOps, security, stability, installation and upgrades. We have already delivered an Ambari stack definition to provide an easy install and upgrade experience for Spark on HDP. We continue to work on improving the security and multi-tenancy story, both within Spark and in YARN. Customers can now take advantage of advanced YARN features like CPU scheduling, node labels, disk isolation, network isolation, and cluster capacity management across workloads.

Wrap up

We did much over the last year, but much more remains to be done. We are excited to be working on the next set of features to make our customers more successful with their Spark and Data Science projects.

We have seen rapid adoption of Spark in our customer base and we want to thank our customers for choosing Spark on HDP. We also want to thank our partners, Microsoft, Databricks, HP and NFLabs, and the community for sharing this journey with us.

If you are new to Spark or HDP and want to get started, download the free Hortonworks Sandbox and try some of the Spark Tutorials.

Friday, July 10, 2015

A Spark is Lit in HDInsight

By Vinay Shukla

Apache Spark has garnered a lot of developer attention and is often at the top of the agenda in my customer interactions. Since we announced support for Spark in HDP, we have seen broad customer adoption of our Spark offering. Our customers love Spark for the simplicity of its API, its speed of development and its runtime performance. Spark is also democratizing Machine Learning and making it easier and more approachable for more developers.

Today Microsoft announced support for Spark in HDInsight – this is a big step towards driving customer adoption for Spark workloads on Hadoop clusters in Azure.

New Spark Features

The Apache Spark community moves fast and constantly delivers improvements. Some of the recent developments in Spark are:

DataFrame API
ML Pipelines API & Streaming API in PySpark
Spark Streaming support for direct Kafka with exactly-once semantics
SparkR support
ORC support in Spark

The DataFrame API represents tabular data. Here is an example:

[Figure: DataFrame API]

The table represents employees in various departments. Both the Python RDD example and the DataFrame example calculate the average age of employees in each department. DataFrames are more intuitive and much easier to write.

In my customer interactions, the most common use cases for Spark are ETL, data science and machine learning; interactive SQL and streaming also come up often.

Data Science with Spark

Before Spark, the data sitting in the Data Lake needed to be moved to an edge node for Machine Learning (ML) workloads. Many customers want to process data science and machine learning workloads in their YARN cluster to take advantage of all their data for ML. Spark on YARN enables this use case with its data-parallel ML algorithms.
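For readers who want to see the shape of that comparison in code, here is a plain-Python sketch (no Spark required) of the same average-age-by-department calculation written in both styles. The sample records and function names are invented for illustration. The first function mirrors the map and reduce-by-key steps an RDD program spells out by hand; the second mirrors the declarative group-and-average that a DataFrame call such as `df.groupBy("dept").avg("age")` expresses in a single line.

```python
from collections import defaultdict
from statistics import mean

# Invented sample records: (name, department, age)
employees = [
    ("Ann", "eng", 30),
    ("Bob", "eng", 40),
    ("Cid", "ops", 50),
]


def avg_age_rdd_style(rows):
    """Mimic map -> reduceByKey -> map: carry (sum, count) per department."""
    pairs = [(dept, (age, 1)) for _, dept, age in rows]        # map
    totals = {}
    for dept, (age, n) in pairs:                                # reduce by key
        running_sum, running_count = totals.get(dept, (0, 0))
        totals[dept] = (running_sum + age, running_count + n)
    return {dept: s / c for dept, (s, c) in totals.items()}     # final map


def avg_age_declarative(rows):
    """State the intent directly: group by department, then average the ages."""
    groups = defaultdict(list)
    for _, dept, age in rows:
        groups[dept].append(age)
    return {dept: mean(ages) for dept, ages in groups.items()}


print(avg_age_rdd_style(employees))
print(avg_age_declarative(employees))
```

Both functions return the same averages; the difference is how much bookkeeping (the sum and count pairs) the author has to manage explicitly.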

A data science workflow starts with asking a question. There are many steps in this non-linear workflow. The most common steps are finding the right data set, ETL, feature engineering, model generation, model tuning, and reporting.

Data scientists often use notebooks for writing code snippets, for data exploration and visualization, and for quickly framing the question and going for the answer.

Most data scientists are familiar with the IPython/Jupyter notebook, and Apache Zeppelin is an up-and-coming alternative. We recently started a blog series on data science with Spark & Apache Zeppelin.

Spark in Azure

Microsoft and Hortonworks have deep engineering ties across many Hadoop and HDP projects. We have worked closely together to support Apache Spark in HDInsight and are continuing this work to enable more capabilities.

Spark is a fast moving project with new features in each release. For some customers keeping up with this pace becomes a challenge. Spark for Azure HDInsight provides customers a quick means to take advantage of these new features, all with the ease of use and the guaranteed availability on Azure HDInsight.

Conclusion

We are deeply invested in Spark and continue to add new capabilities working with Microsoft and the community. Stay tuned for an upcoming blog where we will talk about adding ORC support to Spark.

We are excited to be working with Microsoft to help more customers realize the possibilities of big data with Apache Spark. For more information:

Announcing Spark for Azure HDInsight public preview

Learn more about Apache Spark

Tuesday, April 14, 2015

Announcing Apache Spark, Now GA on Hortonworks Data Platform

By Vinay Shukla

Hortonworks is pleased to announce the general availability of Apache Spark in Hortonworks Data Platform (HDP), now available on our downloads page. With HDP 2.2.4, Hortonworks now offers support for your developers and data scientists using Apache Spark 1.2.1.

HDP’s YARN-based architecture enables multiple applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of the many data access engines that work with YARN and are supported in an HDP enterprise data lake. Spark provides HDP subscribers yet another way to derive value from any data, any application, anywhere.

When we started our Apache Spark journey at Hortonworks, we laid out our roadmap for all to see. Phase 1 of that roadmap included improved integration with Apache Hive, enhanced security and streamlined Spark operations. We met those Phase 1 commitments, and now that Spark is GA on HDP, we are happy to report on that roadmap progress.

Hive Integration

For improved Hive integration, HDP 2.2.4 offers ORC file support for Spark. This allows Spark to read data stored in ORC files. Spark can leverage ORC file’s more efficient columnar storage and predicate pushdown capability for even faster in-memory processing.
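To illustrate why predicate pushdown speeds up reads, here is a toy plain-Python model of a columnar file that keeps minimum and maximum statistics for each block of rows and uses them to skip blocks a filter cannot match. It is a sketch of the general idea only, not ORC's on-disk layout or reader API; the function names and the stripe size are assumptions for the example.

```python
def build_stripes(values, stripe_size):
    """Split a column into stripes and record each stripe's min and max."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes


def scan_greater_than(stripes, threshold):
    """Return values > threshold, skipping stripes whose max rules them out."""
    matches = []
    stripes_read = 0
    for stripe in stripes:
        if stripe["max"] <= threshold:
            continue  # pushdown: no row in this stripe can match, so skip it
        stripes_read += 1
        matches.extend(v for v in stripe["rows"] if v > threshold)
    return matches, stripes_read


# A sorted "age" column of 100 values stored as 10 stripes of 10 rows each.
stripes = build_stripes(list(range(100)), stripe_size=10)
matches, stripes_read = scan_greater_than(stripes, threshold=94)
print(matches, stripes_read)
```

With this sorted sample data, only the final stripe needs to be read. How much a real query benefits depends on how the data is laid out, since unsorted values give wide min/max ranges that rule out fewer blocks.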

Ambari Integration

To simplify Hadoop deployment, HDP 2.2.4 also allows Spark to be installed and managed by Apache Ambari 2.0. Ambari allows the cluster administrator to manage both Spark configuration and the lifecycle of Spark daemons. The screenshot below shows an example of Ambari managing Spark.

Security

To enhance security, we worked within the community to ensure that Spark runs on a Kerberos-enabled cluster. This means that only authenticated users can submit Spark jobs on those secure clusters.

Spark for Developers

Hortonworks does Hadoop and developers do Hadoop. With Apache Hadoop YARN as the common architectural center of a Modern Data Architecture, developers everywhere can now create applications to exploit Spark’s in-memory processing power, derive insights from its iterative and analytic capabilities, and enrich their data science workloads with other types of data processing within a single, shared dataset in HDP.

At the Spark Summit East in New York last month, we talked with many of our customers about their innovative use cases that leverage Apache Spark. We find Spark’s in-memory data processing engine to be extremely useful in addressing use cases from ETL (Extract, Transform, Load) to stream processing to building recommendation and classification systems.

We share our customers’ excitement and from the outset of Spark’s introduction into the community, Hortonworks has embraced Spark. Over the past year, we introduced our Hortonworks subscribers to Spark through a series of technical previews, to show its value in the context of an enterprise data lake.

And now, we encourage our customers to use Spark as a GA component in HDP 2.2. For developers new to Spark, our conversations typically revolve around two stages in their journey building Spark-based applications:

Stage 1 – Explore and Develop in Spark Local Mode

The first stage starts with a local mode of Spark where Spark runs on a single node. The developer uses this system to learn Spark and starts to build a prototype of their application using the Spark API. Using the Spark shells (Scala REPL & PySpark), a developer rapidly prototypes a Spark application and then packages it with tools such as Maven or Scala Build Tool (SBT). Because the dataset is typically small (so that it fits on a developer machine), a developer can easily debug the application on a single node.

Stage 2 – Deploy Production Spark Applications

The second stage involves running the prototype application against a much larger dataset to fine tune it and get it ready for a production deployment. Typically, this means running Spark on YARN as another workload in the enterprise data lake and allowing it to read data from HDFS. The developer takes the custom application created against a local mode of Spark and submits the application as a Spark job to a staging or production cluster.
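The difference between the two stages often comes down to how the same application is submitted. The sketch below assembles `spark-submit` command lines for each stage; the helper function, the `my_app.py` file name and the resource numbers are hypothetical, while `--master`, `--num-executors`, `--executor-memory` and `--executor-cores` are standard `spark-submit` options. The `yarn-cluster` master value matches Spark 1.x releases; later releases express the same thing as `--master yarn --deploy-mode cluster`.

```python
def spark_submit_command(app, master, num_executors=None,
                         executor_memory=None, executor_cores=None):
    """Build a spark-submit argument list (hypothetical helper for illustration)."""
    cmd = ["spark-submit", "--master", master]
    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    if executor_cores is not None:
        cmd += ["--executor-cores", str(executor_cores)]
    cmd.append(app)  # the application file goes after the options
    return cmd


# Stage 1: local mode with two worker threads on the developer machine.
local_cmd = spark_submit_command("my_app.py", master="local[2]")

# Stage 2: the same application on YARN (Spark 1.x style master value).
yarn_cmd = spark_submit_command("my_app.py", master="yarn-cluster",
                                num_executors=4, executor_memory="4g",
                                executor_cores=2)

print(" ".join(local_cmd))
print(" ".join(yarn_cmd))
```

Keeping the application code identical and varying only the submission options is what makes the move from a laptop prototype to a staging or production cluster manageable.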

Data Science with Spark

For data scientists, Spark is a highly effective data processing tool. It offers first class support for machine learning algorithms and provides an expressive and higher-level API abstraction for transforming or iterating over datasets. Put simply, Apache Spark makes it easier to build machine learning pipelines compared to other approaches.

Data scientists often use tools such as notebooks (e.g. IPython) to quickly create prototypes and share their work. Many data scientists love R, and the Spark community is hard at work to deliver R integration with SparkR. We are excited about this emerging capability.

For ease of use, Apache Zeppelin is an emerging tool that provides Notebook features for Spark. We have been exploring Zeppelin and discovered that it makes Spark more accessible and useful. We will write about using Zeppelin with Spark in the coming weeks. Here is a screenshot that provides a view into the compelling user interface that Zeppelin can provide for Spark.

At Hortonworks, we embarked on our Spark journey early last year and we are honored to announce the general availability of Spark 1.2.1 on HDP. Typical of Apache open source innovation, the Spark community continues to move quite rapidly and it recently released Spark 1.3.

This new version of Spark brought new features such as the DataFrame API and Kafka support in Spark Streaming. We are fully committed to Spark and plan to revise our Spark Tech Preview to give customers access to the new capabilities now available in Spark 1.3. Given the pace at which these capabilities continue to appear, we plan to keep providing updates via tech previews between our major releases so that customers can keep up with the speed of innovation in Spark.

Stay tuned.