Friday, July 10, 2015

A Spark is Lit in HDInsight

By Vinay Shukla
Apache Spark has garnered a lot of developer attention and is often at the top of the agenda in my customer interactions. Since we announced support for Spark in HDP, we have seen broad customer adoption of our Spark offering. Our customers love Spark for the simplicity of its API, its speed of development, and its runtime performance. Spark is also democratizing machine learning, making it easier and more approachable for more developers.
Today Microsoft announced support for Spark in HDInsight – this is a big step towards driving customer adoption for Spark workloads on Hadoop clusters in Azure.

New Spark Features

The Apache Spark community moves fast and constantly delivers improvements. Some of the recent developments in Spark are:
  • DataFrame API
  • ML Pipelines API & Streaming API in PySpark
  • Spark Streaming support for direct Kafka with exactly-once semantics
  • SparkR support
  • ORC support in Spark
The DataFrame API represents tabular data. Here is an example:
[Figure: DataFrame API]
The table represents employees in various departments. Both the Python RDD example and the DataFrame example calculate the average age of employees in each department. The DataFrame version is more intuitive and much easier to write.
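To make the comparison concrete without needing a cluster, here is a minimal plain-Python sketch of the steps an RDD-style average-age computation typically goes through (map, reduceByKey, mapValues). The employee rows, column order, and function name are made up for illustration, and this is not the code from the original example.

```python
from collections import defaultdict

# Hypothetical rows: (department, employee name, age)
employees = [
    ("eng", "Ana", 30),
    ("eng", "Bo", 40),
    ("ops", "Cy", 50),
]

def avg_age_rdd_style(rows):
    # map step: emit (dept, (age, 1)) pairs, as rdd.map(...) would
    pairs = [(dept, (age, 1)) for dept, _name, age in rows]
    # reduceByKey step: add up ages and counts per department
    totals = defaultdict(lambda: (0, 0))
    for dept, (age, one) in pairs:
        age_sum, count = totals[dept]
        totals[dept] = (age_sum + age, count + one)
    # mapValues step: turn each (sum, count) into an average
    return {dept: age_sum / count for dept, (age_sum, count) in totals.items()}

result = avg_age_rdd_style(employees)
print(result)
```

With a PySpark DataFrame, the same query collapses to roughly one expression, `df.groupBy("dept").avg("age")`, assuming a DataFrame `df` with columns named `dept` and `age`. That is the sense in which the DataFrame version is easier to write.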
In my customer interactions, the most common use cases for Spark are ETL, data science, and machine learning. Interactive SQL and streaming also come up often.

Data Science with Spark

Before Spark, the data sitting in the Data Lake needed to be moved to an edge node for Machine Learning (ML) workloads. Many customers want to run data science and machine learning workloads in their YARN cluster to take advantage of all their data for ML. Spark on YARN enables this use case with its data-parallel ML algorithms.
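As a rough illustration of what "data-parallel" means here, the plain-Python sketch below fits a one-parameter model y = w * x by gradient descent. Each partition computes a partial gradient over its own rows, and only the small partial results are combined. The data, partition layout, function names, and hyperparameters are all hypothetical, and this is a sketch of the general idea rather than Spark's actual implementation.

```python
from functools import reduce

# Hypothetical data: points (x, y) with y = 2x, split into three "partitions"
# the way a cluster would spread a large dataset across workers.
partitions = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(4.0, 8.0), (5.0, 10.0)],
]

def partial_gradient(rows, w):
    """Per-partition work: summed squared-error gradient and the row count."""
    grad = sum((w * x - y) * x for x, y in rows)
    return grad, len(rows)

def combine(a, b):
    """Merge two partial results by adding gradients and counts."""
    return a[0] + b[0], a[1] + b[1]

def fit(partitions, steps=200, learning_rate=0.05):
    w = 0.0
    for _ in range(steps):
        # Each call below touches only one partition, so on a cluster
        # these could run on different workers at the same time.
        partials = [partial_gradient(rows, w) for rows in partitions]
        grad_sum, n = reduce(combine, partials)
        w -= learning_rate * grad_sum / n
    return w

w = fit(partitions)
print(round(w, 6))
```

The point of the design is that the rows never leave their partition. Only a pair of numbers per partition travels over the network on each step, which is why the data can stay in the cluster instead of being copied to an edge node.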
A data science workflow starts with asking a question. There are many steps in this non-linear workflow. The most common steps are finding the right data set, ETL, feature engineering, model generation, model tuning, and reporting.
Data scientists often use notebooks for writing code snippets, exploring and visualizing data, and quickly framing a question and working toward an answer.
Most data scientists are familiar with the IPython/Jupyter notebook, and Apache Zeppelin is an up-and-coming alternative. We recently started a blog series on data science with Spark and Apache Zeppelin.

Spark in Azure

Microsoft and Hortonworks have deep engineering ties across many Hadoop and HDP projects. We have worked closely together to support Apache Spark in HDInsight and are continuing this work to enable more capabilities.
Spark is a fast-moving project with new features in each release. For some customers, keeping up with this pace becomes a challenge. Spark for Azure HDInsight gives customers a quick way to take advantage of these new features, with the ease of use and guaranteed availability of Azure HDInsight.


We are deeply invested in Spark and continue to add new capabilities in collaboration with Microsoft and the community. Stay tuned for an upcoming blog post in which we will talk about adding ORC support to Spark.
We are excited to be working with Microsoft to help more customers realize the possibilities of big data with Apache Spark. For more information: