Apache Spark Webinar Recap: Q&A Session
By Vinay Shukla
October 23rd, 2014
We recently hosted a Spark webinar as part of the YARN Ready series, aimed at a technical audience including developers of applications for Apache Hadoop and Apache Hadoop YARN. During the event, a number of good questions surfaced that we wanted to share with our broader audience in this blog. Take a look at the video and slides along with these questions and answers below.
You can listen to the entire webinar recording here.
And peruse the slides from the webinar here.
What is the primary benefit of YARN?
How does Spark fit in?
YARN enables a modern data architecture whereby users have the ability to store data in a single location and interact with it in multiple ways – by using the right data processing engine at the right time.
Deeper integration of Spark with YARN allows it to become a more efficient tenant alongside other engines, such as Hive, Storm and HBase, all running simultaneously on a single data platform. This avoids the need to create and manage dedicated Spark clusters for the subset of applications for which Spark is ideally suited, and lets those applications share resources more effectively within a single cluster.
What is an RDD?
An RDD (Resilient Distributed Dataset) is a core Spark concept for describing datasets at runtime. More details can be found here.
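To make the idea more concrete, here is a toy model in plain Python of three properties usually associated with RDDs: the data is split into partitions, transformations are recorded lazily, and any single partition can be recomputed from the recorded steps. This is only a sketch and is not Spark's API; the class `ToyRDD` and its methods are hypothetical names chosen for illustration.

```python
from typing import Callable, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitioned data plus recorded steps."""

    def __init__(self, partitions: List[list], lineage: tuple = ()):
        self.partitions = partitions  # source data, split into partitions
        self.lineage = lineage        # transformations recorded but not yet run

    def map(self, fn: Callable) -> "ToyRDD":
        # Lazy: only record the step, do not touch the data.
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Lazy as well: the predicate runs only when a partition is computed.
        return ToyRDD(self.partitions, self.lineage + (("filter", pred),))

    def compute_partition(self, index: int) -> list:
        # Replay the recorded steps on one partition. A lost partition
        # could be rebuilt the same way without touching the others.
        data = list(self.partitions[index])
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self) -> list:
        # Action: compute every partition and concatenate the results.
        out = []
        for i in range(len(self.partitions)):
            out.extend(self.compute_partition(i))
        return out


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()
```

Nothing is computed until `collect()` is called; `evens_squared.compute_partition(1)` rebuilds only the second partition.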
Do all the Spark specific libraries reside in their own HDFS folder independent from the main Hadoop bin directory?
Spark provides all its dependencies in an assembly jar that is used at runtime. It is therefore a client-side jar and doesn’t need any jars in the Hadoop bin directory.
Why do you need the reducers variable? You mentioned that Spark can automatically determine the best number of reducers.
Spark cannot automatically adjust the number of reducers. However, Tez can adjust it automatically if auto-parallelism is enabled.
Is there a Hortonworks Sandbox where YARN and Spark are already installed for download?
How about load balancing during batch processing?
In Spark, users can repartition an RDD to rebalance the workload, but they have to do so explicitly. Tez, on the other hand, can do it automatically when auto-parallelism is enabled.
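As a rough illustration of what rebalancing means, the plain-Python sketch below redistributes records from skewed partitions round-robin into a chosen number of new partitions. It is a toy model, not Spark code: the function `repartition` here is a hypothetical helper, and in Spark itself the user would call `repartition(numPartitions)` on the RDD, which shuffles the data across the cluster.

```python
from typing import List


def repartition(partitions: List[list], num_partitions: int) -> List[list]:
    """Spread all records round-robin across num_partitions new partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    balanced: List[list] = [[] for _ in range(num_partitions)]
    position = 0
    for part in partitions:
        for record in part:
            balanced[position % num_partitions].append(record)
            position += 1
    return balanced


# Hypothetical skewed input: one large partition and two tiny ones.
skewed = [list(range(10)), [10], [11]]
balanced = repartition(skewed, 3)

sizes_before = [len(p) for p in skewed]
sizes_after = [len(p) for p in balanced]
```

With the skewed input above, one task would process ten records while the other two process one each; after repartitioning, each partition holds the same number of records.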
What is the best way to store relational data in Spark inside HDFS?
It depends on the application. Candidates include columnar formats such as ORC, which can retrieve just the required columns more efficiently.
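The benefit of a columnar layout can be sketched in plain Python by counting how many values have to be read to sum one column. This is a toy model with made-up data and hypothetical helper names; it says nothing about how ORC is actually implemented on disk.

```python
# Hypothetical relational data: three rows with three columns each.
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]


def to_columnar(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_row_oriented(rows, column):
    """Sum one column from row storage; every value of every row is read."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)
        total += row[column]
    return total, values_read


def sum_columnar(columns, column):
    """Sum one column from columnar storage; only that column is read."""
    values = columns[column]
    return sum(values), len(values)


columns = to_columnar(rows)
row_total, row_reads = sum_row_oriented(rows, "amount")
col_total, col_reads = sum_columnar(columns, "amount")
```

Both approaches return the same total, but the row-oriented scan touches all nine values while the columnar one touches only the three in the requested column.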
If we want to use Spark but we have a separate cluster for it, do we need YARN, or can we just use Spark in standalone fashion? Is YARN only needed if we co-locate Hadoop and Spark on the same nodes?
YARN provides a multi-tenant environment, allowing Spark to share the same datasets with other data engines. Although Spark can run in standalone mode, doing so requires a dedicated cluster and deprives users of these Hadoop features.
Can we use Hive analytical functions and UDFs in the Spark API?
Spark does support Hive UDFs, but this support may not be comprehensive.
How do I set up a Spark cluster today in HDP?
Spark-on-YARN is a client-only install; you can follow the instructions from our tech preview here.
How do you handle single-node Spark/YARN failures? Do we need to handle this in our code or are restarts automatic?
Tez manages this automatically without any action required on the user’s part.
How does Spark as a SQL server (for third-party reporting tools to connect to) work in a Hadoop cluster?
Spark 1.1.0 has a Thrift server built in, which works with Beeline.
Do you support Spark Streaming? How do you compare Spark vs. Storm?
Spark Streaming is still emerging and has seen only limited use in production clusters. Many customers are using Storm and Kafka for stream processing.
See http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming for a more detailed comparison.
What about Tez vs. Spark? Is Spark faster?
Spark is many things to many people: a developer-facing API, a DAG execution engine and a cluster management system (in Spark Standalone). Spark also comes with libraries for machine learning, streaming and more.
By contrast, Tez is a low-level API that is not aimed at application developers. Tez is an API for tool developers, whereas the Spark API allows developers to create both iterative in-memory and batch applications on Apache Hadoop YARN.
Currently, however, Spark underutilizes YARN resources, particularly when large datasets are involved.
Spark does not behave like MapReduce or Tez, which execute periodic jobs and release the compute resources once those jobs finish. In some ways, it behaves much more like a long-running service, holding onto resources (such as memory) until the end of the entire workload. Using the experience we have already gained in building MapReduce, Apache Tez and other data processing engines, we believe similar concepts can be applied to Spark to optimize its resource utilization and make it a good multi-tenant citizen within a YARN-based Hadoop cluster.
What are your opinions on Spark/YARN vs. Spark/Mesos?
Hortonworks is a Hadoop distribution, and our focus within the Spark community is to enable and leverage YARN for resource management so that our customers can use Spark alongside other engines in a Hadoop cluster.
Are there advantages to running Spark on top of YARN as opposed to Mesos?
We have a solid understanding of how Spark on YARN is beneficial. Deeper integration of Spark with YARN allows it to become a more efficient tenant alongside other engines, such as Hive, Storm and HBase, all running simultaneously on a single data platform.
This avoids the need to create and manage dedicated Spark clusters to support that subset of applications for which Spark is ideally suited. It also shares resources effectively within a single cluster. Our customers want to run multiple workloads on a single set of data all in a single cluster. Hadoop already has a very wide array of engines available.
Is Spark used only through Java and Python programming by calling APIs, or can we use it with Hive, Pig, etc.?
Besides the Scala, Java and Python APIs, Spark SQL can be used through its JDBC server interface.
Is there a JDBC interface for Spark SQL?
Spark 1.1.0 includes a JDBC Thrift service. However, the HDP tech preview of Spark 1.1 does not support it yet.
Which IDE is used for Spark development? Does Hortonworks provide one?
Eclipse, IntelliJ or others, depending on the developer’s preference. Hortonworks doesn’t provide an IDE.
Does Hortonworks supplement or compete with Hadoop?
We do Hadoop. We are Hadoop. We are the ONLY 100% open source Hadoop distribution. Our distribution is composed wholly of Apache Hadoop and its related Apache projects.