Thursday, August 4, 2016

WELCOME TO APACHE SPARK 2.0

Apache Spark 2.0 was released yesterday by the Apache Spark community. This is a long-awaited release that delivers several key features. We are really excited about this release and sincerely thank the Apache Software Foundation and the Apache Spark community for making it possible. The most notable improvements in this release are in the areas of the API, performance, Structured Streaming, and SparkR. Let’s review some of these improvements:

API
The unification of DataFrame and Dataset is now complete: in Scala and Java, a DataFrame is simply a Dataset of Rows, and the DataFrame remains the primary interface in R and Python. Another improvement is the elimination of the need to juggle multiple contexts (SparkContext, SQLContext, HiveContext). The new SparkSession, represented by the variable ‘spark’, is the single entry point to all the awesome Spark features, and the older contexts have been deprecated.
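Here is a minimal sketch of the new entry point in Scala (the application name and JSON path are just illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark2Example")
  .enableHiveSupport()   // replaces HiveContext; requires Hive classes on the classpath
  .getOrCreate()

// DataFrames and SQL through one object; no separate SQLContext or HiveContext needed
val people = spark.read.json("examples/src/main/resources/people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()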

Performance
Project Tungsten has completed another major phase, and with the new whole-stage code generation it delivers significant performance improvements. Parquet and ORC file processing has also gotten noticeably faster.
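As a quick illustration (a minimal sketch in the Spark 2.0 shell, where the variable spark is predefined), operators compiled by whole-stage code generation show up in the physical plan with a leading asterisk:

val agg = spark.range(1000 * 1000).selectExpr("sum(id)")
agg.explain()   // operators covered by whole-stage codegen are prefixed with '*'
agg.show()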

Structured Streaming
The DataFrame is the preferred Spark abstraction because it delivers both ease of use and superior performance through the Catalyst optimizer. The new Structured Streaming API brings streaming to the same DataFrame API that we love.
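As a sketch of what this looks like in Scala (assuming a socket source feeding text on localhost:9999), the familiar DataFrame operations simply run continuously over the stream:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredStreamingSketch").getOrCreate()
import spark.implicits._

// Read a stream of lines, split them into words, and keep a running count
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

// Print the continuously updated counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()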

SparkR
With Spark 2.0, SparkR delivers new algorithms such as Naive Bayes, k-means clustering, and survival regression. Machine learning persistence has also improved: save and load are now supported for all models.

There are many other significant improvements; the full list is available from the Apache Spark project.

TRY SPARK 2.0 NOW
At Hortonworks we have always delivered the latest Apache Spark shortly after it is released in Apache, and this time is no different. We are going to deliver Apache Spark 2.0 in the following ways:

  • As a technical preview with an upcoming HDP release, shortly.
  • With Hortonworks Cloud, where you can take the Apache Spark 2.0 technical preview for a spin today.
We congratulate the Spark community on this major milestone, and we continue to participate deeply in the Spark community to deliver enterprise-ready Apache Spark. The best is yet to come; stay tuned.

Wednesday, June 29, 2016

Apache Zeppelin: The Road Ahead

Authors: Vinay Shukla, Hortonworks; Moon soo Lee, Apache Zeppelin PMC & NFLabs; Prabhjyot Singh, Apache Zeppelin PMC & Hortonworks


Recently the Apache Software Foundation (ASF) announced Apache Zeppelin as a top-level project.


This was a great milestone for both the Zeppelin and data science communities. Since entering the Apache Incubator in December 2014, the community around Zeppelin has become larger, more diverse, inclusive, and vibrant. As of last week, the project has 126 contributors, 812 forks, and 3 releases.


Apache Zeppelin helps data analysts, data scientists, and business users get a better understanding of their data. With Apache Zeppelin, users can quickly explore data, create visualizations, and share their insights as web pages with various stakeholders.

Recent Improvements

Over the last year, there have been several key improvements to Apache Zeppelin that have been contributed by a diverse group of developers. Some of the highlights are:
  • Security features: authentication, access control, LDAP support
  • Sharing & collaboration: notebook import/export
  • Interpreters: a notable R interpreter, and others too numerous to list


The pluggable nature of the Apache Zeppelin interpreter architecture has made it easy to add support for new interpreters. There are now over 30 interpreters, supporting everything from Spark, Hive, MySQL, and R to Geode and HBase.

The Road Ahead

Visualization
The Apache Zeppelin community has been working on Project Helium, which aims to seed growth in all kinds of visualizations, following the model created by pluggable interpreters. Helium aims to make adding a new visualization simple and intuitive. With pluggable visualizations, adding support for map-based visualization becomes easy, and it will be added to Zeppelin later this year.
Collaboration
One of the most requested features among Zeppelin users is full support for sharing and collaboration. Data scientists and business analysts often collaborate on their work. They should be able to read notebooks stored in a Git server and to write their notebooks back to Git.
Multi-User Support
Multi-user support in Zeppelin is another highly requested feature. There are multiple facets to multi-user support: the most basic is that a notebook should execute as the authenticated end user, and we have already added this to Zeppelin. Another facet is user-specific dependency management, which we plan to add.
Zeppelin is closely tied to Apache Spark. The Spark community is close to releasing Spark 2.0. Zeppelin will very shortly start to support Spark 2.0.
Notebook Organization
Another commonly requested feature is notebook organization and the community is working to provide this feature.
Data Preparation
As Zeppelin’s adoption grows, so does its use in enterprises. Often the data scientist/data analyst workflow starts with importing some data sets. A significant portion of time is spent on improving the quality of this data set before the data set is used in analysis or machine learning.
We plan to make data set import easier and allow basic features to check and validate the data quality.

Community


It takes a community to create a compelling Apache project, and we truly believe in the ASF’s motto of community over code. Developers and supporters from NFLabs, Twitter, Hortonworks, MapR, Pivotal, and IBM, among many others, are working together to deliver new features and fix issues in Apache Zeppelin. We are very thankful to this community and look forward to growing it and making Apache Zeppelin the best notebook there is.

Thursday, May 19, 2016

Apache Zeppelin with HDP 2.4.2



In March 2016 we delivered the second technical preview of Apache Zeppelin on HDP 2.4. Since then, we and the Zeppelin community have continued to add new features to Zeppelin. We now give you the final technical preview of Zeppelin, based on a snapshot of Apache Zeppelin 0.6.0. The main features in this Zeppelin technical preview are:

  • Ambari Managed Installation
  • Zeppelin Livy integration
  • Security
    • Execute jobs as authenticated user
    • Zeppelin authentication against LDAP
    • Notebook Authorization
Prerequisites:
  • HDP 2.4.2 is installed
  • The cluster contains Spark 1.6 or 1.5
  • Git is installed on the node running Ambari Server
    • You can install Git with: sudo yum install git
  • Java 8 is installed on the node where Zeppelin is installed

This document provides instructions for:
  • Setting up Zeppelin on HDP 2.4.2 with Spark 1.6
    • Ambari Managed Install
  • Running Zeppelin against Spark on YARN with Livy interpreter
  • Authentication with Zeppelin
  • Authenticate users against LDAP
  • Access Control on Notebooks
Note: while both Ambari-managed and manual install instructions are provided, you only need to follow one of them to get Zeppelin set up in your cluster.
HDP Cluster Requirement
This technical preview can be installed on any HDP 2.4.2 cluster, whether it is a multi-node cluster or a single-node HDP Sandbox. The following instructions assume that Spark (version 1.6) is already installed on the HDP cluster.
Ambari Managed Zeppelin Install
Step 1: Download the Zeppelin Ambari Stack Definition
On the node running the Ambari server, run the following:

VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git   /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
Step 2: Re-start Ambari Server
sudo service ambari-server restart
Step 3: Add Zeppelin Service with Ambari
Make sure to install the Zeppelin service on a node where the Spark clients are installed.
Once Ambari comes back up and the services turn green, you can click 'Add Service' from the 'Actions' dropdown menu at the bottom left of the Ambari dashboard.
Make a note of the node selected to run the Zeppelin service; call it ZEPPELIN_HOST.
On the bottom left -> Actions -> Add Service -> check the Zeppelin service -> Next -> Next -> Next -> Deploy.
Accept all the default values and hit the Deploy button.

Step 4: Launch Zeppelin
Once Zeppelin is deployed, launch http://ZEPPELIN_HOST:9995 in your browser.
Try out the included Zeppelin tutorial. A few more Zeppelin notebooks are available in the Hortonworks Zeppelin Gallery; please try them out as well.
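As a quick sanity check (a minimal sketch; any small snippet will do), create a new note and run a %spark paragraph:

%spark
// Should print the Spark version and 100 if the Spark interpreter is healthy
println(sc.version)
println(sc.parallelize(1 to 100).count())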
The rest of the steps described in this doc are optional and are only needed for additional security functionality.
Optional: Enable Zeppelin for Security
This section shows the configuration required for Zeppelin to authenticate end users. Zeppelin uses Livy to execute jobs with Spark on YARN as the end user.

These are the high level steps to enable Zeppelin Security:
  1. Configure Zeppelin for Authentication
  2. Install Livy Server and Configure Livy with Zeppelin
  3. Optionally, enable access control on Zeppelin notebook.

Configure Zeppelin for Authentication

Note: when Zeppelin authenticates end users and Livy propagates the end-user identity to Hadoop, the end user needs to exist on all nodes. In production you can leverage SSSD or PAM for this, but for now manually add user1 to all hosts in your cluster.

E.g., to run as the sample user “user1”, run the following as root (or equivalent) on all your worker nodes.

useradd user1 -g hadoop

As the HDFS admin, create an HDFS home directory for user1:
su hdfs
hdfs dfs -mkdir /user/user1
hdfs dfs -chown user1 /user/user1

Note: if you configure Zeppelin to run as another user, you need to add that user to the OS and create an HDFS home directory for that user as well.
Edit Zeppelin’s shiro config
On the node where the Zeppelin server is installed, edit /usr/hdp/current/zeppelin-server/lib/conf/shiro.ini and ensure the following in the [urls] section:
[urls]
/api/version = anon
#/** = anon
/** = authcBasic
You can use users defined in shiro.ini for authentication. For example, enable the [users] section below to authenticate as user1/password2.
[users]
admin = password1
user1 = password2
user2 = password3
Alternatively, use LDAP as the identity store by configuring the [main] section below for your LDAP server.
[main]
#ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm
#ldapRealm.userDnTemplate = cn={0},cn=engg,ou=testdomain,dc=testdomain,dc=com
#ldapRealm.contextFactory.url = ldap://ldaphost:389
#ldapRealm.contextFactory.authenticationMechanism = SIMPLE

Restart Zeppelin server

You can use Ambari to restart the Zeppelin server. Ignore any error reported in Ambari during the Zeppelin restart; Zeppelin starts fine.
Access the Zeppelin Tutorial and log in as user1/password2 (or any user defined in your LDAP).
Note: logout functionality is not available in this technical preview but is being added.

On the Zeppelin node, install Livy

sudo yum install livy
Configure Livy Server
Create /etc/livy/conf/livy-env.sh with the following values. Ensure the path to Java is accurate for that node.
export SPARK_HOME=/usr/hdp/current/spark-client
export JAVA_HOME=/usr/jdk64/jdk1.8.0_60
export PATH=/usr/jdk64/jdk1.8.0_60/bin:$PATH
export HADOOP_CONF_DIR=/etc/hadoop/conf
export LIVY_SERVER_JAVA_OPTS="-Xmx2g"

Create /etc/livy/conf/livy-defaults.conf with the following content.
livy.impersonation.enabled = true


On the node where Livy is installed, create a ‘livy’ user so that the Livy process runs as user livy.

useradd livy -g hadoop


Create Livy’s logs directory and grant user ‘livy’ permission to write to it.

mkdir /usr/hdp/current/livy-server/logs
chmod 777 /usr/hdp/current/livy-server/logs

On Livy node, edit /etc/spark/conf/spark-defaults.conf to add the following

spark.master yarn-client
Step 6: Grant user livy the ability to proxy users in Hadoop’s core-site.xml
Use Ambari to add the following properties to /etc/hadoop/conf/core-site.xml, then restart HDFS with Ambari.

  <property>
    <name>hadoop.proxyuser.livy.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.livy.hosts</name>
    <value>*</value>
  </property>
Step 7: Start Livy server
Launch Livy server as user ‘livy’
cd /usr/hdp/current/livy-server
su livy
./bin/livy-server start
Step 8: Configure Zeppelin to use Livy
In Zeppelin, notebooks run against the configured interpreters. Go to your notebook and click on the interpreter bindings.
[Screenshot: interpreter bindings in a Zeppelin notebook]
On the next page, select the interpreters you want to use. Clicking an interpreter toggles its selection; unselected interpreters appear in white. You can reorder the interpreters available to your notebook by dragging and dropping them.
For example, the screenshot below shows the Livy Spark interpreter selected ahead of Spark, so it is launched with %lspark.
[Screenshot: Livy Spark interpreter selected ahead of Spark]
Step 9: Confirm Livy Interpreter setting
Note the Livy interpreter settings below. If you have Livy installed on another node, replace localhost in the Livy URL with the Livy host name.
[Screenshot: Livy interpreter settings]
If you make any changes to the Livy interpreter settings, make sure to restart the Livy interpreter.
Step 10: Run Notebooks with the Livy Interpreter
Livy supports Spark, SparkSQL, PySpark, and SparkR. To run notes with Livy, make sure to use the corresponding magic string at the top of your note.
E.g., use %lspark for Scala code run via Livy, or %lspark.sql to run against SparkSQL via Livy.
To use SQLContext with Livy, make sure not to create a SQLContext explicitly, since one is created by default. That is, remove the following lines from your SparkSQL note:
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//import sqlContext.implicits._
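For example, a minimal %lspark note (the computation is just a placeholder) might look like this:

%lspark
// Runs through Livy on Spark-on-YARN as the authenticated end user; sc is provided
val nums = sc.parallelize(1 to 100)
println(nums.sum())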

Configure Zeppelin to authorize end-user access to Zeppelin notebooks

Zeppelin now provides access control for each notebook. Click the lock icon on the notebook to configure access to that notebook.
[Screenshot: lock icon for configuring notebook access]
In the popup that appears, add the users who should have access. Refer to the screenshot below.

[Screenshot: adding users to the notebook access list]

Note that with identity propagation enabled through Livy, data access is controlled by the data source being accessed. E.g., when you access HDFS as user1, data access is governed by HDFS permissions.


Import External Libraries

Often in a notebook you will want to use one or more libraries. For example, to run Magellan you need to import its dependencies. To create a notebook that explores Magellan, you will need to include the Magellan library in your environment.
There are three ways in Zeppelin to include an external dependency.
    1. Use the %dep interpreter. Note: this only works for libraries that are published to Maven.
       %dep
       z.load("group:artifact:version")
       %spark
       import ...
       For example, here is how to import the dependencies for Magellan:
       %dep
       z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
       z.load("com.esri.geometry:esri-geometry-api:1.2.1")
       z.load("harsha2010:magellan:1.0.3-s_2.10")
    2. When you have a jar on the node where Zeppelin is running, add the spark.files property to SPARK_HOME/conf/spark-defaults.conf; for example:
       spark.files  /path/to/my.jar
    3. When you have a jar on the node where Zeppelin is running, this approach can also be useful: add the SPARK_SUBMIT_OPTIONS environment variable to the ZEPPELIN_HOME/conf/zeppelin-env.sh file; for example:
       export SPARK_SUBMIT_OPTIONS="--packages group:artifact:version"

Stopping Zeppelin or Livy Server

To stop the Zeppelin server, use Ambari. To stop Livy:
su livy; cd /usr/hdp/current/livy-server; ./bin/livy-server stop