Wednesday, June 29, 2016

Apache Zeppelin: The Road Ahead

Authors: Vinay Shukla, Hortonworks; Moon So Lee, Apache Zeppelin PMC & NFLabs; Prabhjyot Singh, Apache Zeppelin PMC & Hortonworks


Recently, the Apache Software Foundation (ASF) announced Apache Zeppelin as a Top-Level Project.


This was a great milestone for both the Zeppelin and data science communities. Since Zeppelin entered incubation at the ASF in December 2014, the community around it has become larger, more diverse, more inclusive, and more vibrant. As of last week, the project has 126 contributors, 812 forks, and 3 releases.


Apache Zeppelin helps data analysts, data scientists, and business users get a better understanding of their data. With Apache Zeppelin, users can quickly explore data, create visualizations, and share their insights as web pages with various stakeholders.

Recent Improvements

Over the last year, a diverse group of developers has contributed several key improvements to Apache Zeppelin. Some of the highlights are:
  • Security features: authentication, access control, LDAP support
  • Sharing and collaboration: notebook import/export
  • Interpreters: notably the R interpreter, and others too numerous to list


The pluggable nature of the Apache Zeppelin interpreter architecture has made it easy to add support for new interpreters. There are now over 30 interpreters, supporting everything from Spark, Hive, MySQL, and R to Geode and HBase.
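To make the idea of pluggable interpreters concrete, here is a minimal sketch of the general pattern in Python. It is not Zeppelin's implementation or API: the `INTERPRETERS` registry, the `register` decorator, `run_paragraph`, and the toy `upper` and `wc` interpreters are all hypothetical names invented for illustration. The only detail borrowed from Zeppelin is that a paragraph selects its interpreter with a leading `%name` directive.

```python
from typing import Callable, Dict

# Hypothetical registry (not a Zeppelin API): interpreter name -> function
# that takes the paragraph body and returns its output as text.
INTERPRETERS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that plugs a new interpreter into the registry."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        INTERPRETERS[name] = fn
        return fn
    return wrap


@register("upper")
def upper_interpreter(body: str) -> str:
    # Toy interpreter: upper-cases the paragraph body.
    return body.upper()


@register("wc")
def word_count_interpreter(body: str) -> str:
    # Toy interpreter: counts whitespace-separated words.
    return str(len(body.split()))


def run_paragraph(text: str) -> str:
    """Dispatch a paragraph to an interpreter chosen by its %name directive."""
    parts = text.strip().split(None, 1)
    if not parts or not parts[0].startswith("%"):
        raise ValueError("paragraph must start with %<interpreter>")
    name = parts[0][1:]
    body = parts[1] if len(parts) > 1 else ""
    if name not in INTERPRETERS:
        raise KeyError(f"no interpreter registered for %{name}")
    return INTERPRETERS[name](body)


if __name__ == "__main__":
    print(run_paragraph("%upper\nhello zeppelin"))
    print(run_paragraph("%wc one two three"))
```

In this sketch, adding a new back end means writing one function and registering it under a name; the dispatch code does not change. That is the property that makes a plugin architecture easy to extend.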

The Road Ahead

Visualization
The Apache Zeppelin community has been working on Project Helium, which aims to seed growth in all kinds of visualizations. It follows the model established by pluggable interpreters and aims to make adding a new visualization simple and intuitive. With pluggable visualization, adding support for map-based visualization becomes easy; it is planned for Zeppelin later this year.
Collaboration
One of the most requested features among Zeppelin users is full support for sharing and collaboration. Data scientists and business analysts often collaborate on their work. They should be able to read notebooks stored on a Git server and write their notebooks back to Git.
Multi-User Support
Multi-user support in Zeppelin is another highly requested feature. It has multiple facets. The most basic is that a notebook should execute as the authenticated end user; we have added this feature to Zeppelin. Another facet is user-specific dependency management, which we plan to add.
Zeppelin is closely tied to Apache Spark. The Spark community is close to releasing Spark 2.0, and Zeppelin will support Spark 2.0 very shortly.
Notebook Organization
Another commonly requested feature is notebook organization, and the community is working to provide it.
Data Preparation
As Zeppelin’s adoption grows, so does its use in enterprises. The data scientist or data analyst workflow often starts with importing data sets, and a significant portion of time is spent improving the quality of a data set before it is used in analysis or machine learning.
We plan to make data set import easier and to provide basic features for checking and validating data quality.
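As an illustration of the kind of basic data quality check an analyst might run right after importing a data set, here is a small Python sketch that counts missing values per column in CSV text. It is only an example of such a check, not a planned or existing Zeppelin feature; the function name `missing_value_counts` and the sample data are made up for illustration.

```python
import csv
import io
from typing import Dict


def missing_value_counts(csv_text: str) -> Dict[str, int]:
    """Count empty or absent values per column in CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            value = row.get(name)
            # A value is treated as missing if the cell is absent or blank.
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data: one missing age and one missing city.
    sample = "id,age,city\n1,34,Paris\n2,,Seoul\n3,29,\n"
    print(missing_value_counts(sample))
```

A per-column count like this gives a quick signal of which columns need cleaning before analysis or model training.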

Community


It takes a community to create a compelling Apache project. We truly believe in the ASF’s motto of community over code. Today, developers and supporters from NFLabs, Twitter, Hortonworks, MapR, Pivotal, and IBM, among many others, are working together to deliver new features and fix issues in Apache Zeppelin. We are very thankful to this community, and we look forward to growing it and making Apache Zeppelin the best notebook there is.