Spark has become popular among data scientists and big data enthusiasts. If you are looking for the best collection of Apache Spark Interview Questions for your data analyst, big data or machine learning job, you have come to the right place.
In this Spark Tutorial, we shall go through some of the frequently asked Spark Interview Questions.
- Entry Level Spark Interview Questions
- Medium Level Spark Interview Questions
- Advanced Spark Interview Questions
Q1 - What is Apache Spark?
Apache Spark is an open-source project from the Apache Software Foundation. It is a data processing engine used for data processing and data analytics, with built-in libraries for machine learning, graph processing, and SQL querying. Spark is horizontally scalable and, for many workloads, considerably faster than Hadoop MapReduce.
Q2 - When should you choose Apache Spark?
Choose Apache Spark when the application :
- needs to scale.
- needs both batch and real-time processing of records.
- needs to connect to multiple data stores such as Apache Cassandra, Apache HBase, and SQL databases.
- should be able to query structured datasets spread across different database platforms.
Q3 - Which built-in libraries does Spark have?
Spark has four built-in libraries. They are :
- SQL and DataFrames
- Spark Streaming
- MLlib (Machine Learning Library)
- GraphX (Graph Processing Library)
Q4 - How fast is Apache Spark when compared to Hadoop? Give us an example.
Apache Spark can run up to about 100 times faster than Hadoop MapReduce, mainly for workloads that fit in memory. For example, for a typical logistic regression job, if Hadoop takes about 110 seconds to complete, then Spark would take around 1 second.
Q5 - Why is Spark faster than Hadoop?
Spark is fast because it keeps intermediate data in memory where possible and uses a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Q6 - Which programming languages could be used for Spark Application Development?
One can use the following programming languages :
- Java
- Scala
- Python
- R
Q7 - On which platforms can Spark run?
Spark can run in its own standalone cluster mode, or on a cluster manager such as :
- Hadoop YARN
- Apache Mesos
Q8 - Which data sources can Spark access?
Spark can access data from hundreds of sources. Some of them are :
- Apache Cassandra
- Apache HBase
- Apache Hive
Q9 - Can structured data be queried in Spark? If so, how?
Yes, structured data can be queried using :
- SQL queries (Spark SQL)
- DataFrame API
Q10 - What is Spark MLlib?
MLlib is Spark’s scalable built-in machine learning library. It contains many machine learning algorithms, along with utilities to transform data and extract useful information or inferences from it.
Q11 - How is MLlib scalable?
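MLlib is built on top of Spark’s DataFrames API, so its algorithms run in parallel over data partitioned across the cluster and scale out as nodes are added.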
Q12 - What kinds of machine learning use cases does MLlib solve?
MLlib contains common learning algorithms that can solve problems like :
- Topic Modelling
- Frequent itemsets
- Association rules
- Sequential pattern mining
Q13 - What is Spark Context?
A SparkContext instance sets up internal services for a Spark application and establishes a connection to a Spark execution environment. The SparkContext is created by the Spark driver application.
Q14 - What is an RDD?
RDD, short for Resilient Distributed Dataset, is an immutable collection of elements partitioned across the nodes of a cluster. An RDD is fault tolerant and can be operated on in parallel. It provides the abstraction for distributed computing across the nodes of a Spark cluster.
Q15 - What are RDD transformations?
Transformations are operations which create a new RDD from an existing RDD.
Some of the RDD transformations are :
- map()
- filter()
- flatMap()
- union()
- distinct()
Q16 - What are RDD actions?
Actions are operations that trigger the computation on an RDD and return a value to the driver program or write data to external storage.
Some of the RDD actions are :
- reduce()
- collect()
- count()
- first()
- take()
Q17 - What do you mean by ‘RDD Transformations are lazy’?
Transformations on an RDD are not actually executed until an action is encountered. Hence, RDD Transformations are called lazy.
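The same deferred behaviour can be seen without a cluster. The following is a minimal plain-Java sketch that uses java.util.stream as an analogy for lazy RDD transformations. It is not Spark code, and the class name LazyPipelineSketch and the touched list are hypothetical names introduced for illustration. Building the pipeline (the "transformations") touches no element; only the terminal collect() call (the "action") makes the work happen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: Java streams stand in for RDDs (assumption, not Spark API).
class LazyPipelineSketch {
    // Records every element that the map step actually processes.
    static final List<Integer> touched = new ArrayList<>();

    // Declares the pipeline; like RDD transformations, nothing is executed here.
    static Stream<Integer> buildPipeline(List<Integer> source) {
        return source.stream()
                .map(x -> { touched.add(x); return x * 2; })
                .filter(x -> x > 2);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline(Arrays.asList(1, 2, 3));
        System.out.println("touched before the terminal operation: " + touched.size());

        // The terminal operation plays the role of an RDD action.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result + ", touched after: " + touched.size());
    }
}
```

Running main should report that no element was touched before collect() and that all three were touched afterwards, with [4, 6] as the result.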
Q18 - What do you know about RDD Persistence?
Persistence means caching. When an RDD is persisted, each node in the cluster stores the partitions (of the RDD) that it computes in memory (RAM). When there are multiple transformations or actions on an RDD, persistence cuts down the latency because the data is neither reloaded from file storage nor recomputed for each of them.
Q19 - What is the difference between cache() and persist() for an RDD?
cache() uses default storage level, i.e., MEMORY_ONLY.
persist() can be provided with any of the possible storage levels.
Q20 - What do you mean by the default storage level: MEMORY_ONLY?
The default storage level, MEMORY_ONLY, means that the RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed.
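How partial caching plays out can be sketched in plain Java. This is an illustrative model only, not Spark's implementation: the class MemoryOnlyCacheSketch, the capacity limit counted in partitions, and the compute() stand-in for recomputing a partition are all assumptions made for the example. Partitions that fit are served from the cache on later passes, while the one that did not fit is recomputed each time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of MEMORY_ONLY caching (hypothetical names, not Spark API).
class MemoryOnlyCacheSketch {
    private final int capacity; // assumed limit: how many partitions fit in memory
    private final Map<Integer, List<Integer>> cached = new HashMap<>();
    private int computeCount = 0; // number of times any partition was computed

    MemoryOnlyCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    // Stand-in for an expensive computation of one partition.
    private List<Integer> compute(int partitionId) {
        computeCount++;
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            data.add(partitionId * 10 + i);
        }
        return data;
    }

    // Returns the cached partition if present; otherwise computes it
    // and caches it only when there is still room.
    List<Integer> getPartition(int partitionId) {
        List<Integer> hit = cached.get(partitionId);
        if (hit != null) {
            return hit;
        }
        List<Integer> data = compute(partitionId);
        if (cached.size() < capacity) {
            cached.put(partitionId, data);
        }
        return data;
    }

    int getComputeCount() {
        return computeCount;
    }

    public static void main(String[] args) {
        MemoryOnlyCacheSketch rdd = new MemoryOnlyCacheSketch(2);
        // Two passes over three partitions; only two partitions fit in the cache.
        for (int pass = 0; pass < 2; pass++) {
            for (int p = 0; p < 3; p++) {
                rdd.getPartition(p);
            }
        }
        System.out.println("computations: " + rdd.getComputeCount());
    }
}
```

With room for two of the three partitions, the two passes need 4 computations instead of the 6 that would be needed with no caching at all.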
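Q21 - What is Sliding Window Operation?
In Spark Streaming, a sliding window operation applies a transformation to all the batches that fall within a window of time, and the window slides forward over the stream at a fixed interval. It is defined by two parameters, the window length and the sliding interval, both of which must be multiples of the batch interval.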
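One way to picture a window sliding over a stream of batches is the plain-Java sketch below. It is a simplified model rather than Spark Streaming code: the class SlidingWindowSketch and the method windowedSums are hypothetical, the window length and slide interval are counted in batches, and only complete windows are emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of a windowed sum over per-batch values (not Spark API).
class SlidingWindowSketch {
    // Sums the batches covered by each complete window.
    // windowLength and slideInterval are both expressed in number of batches.
    static List<Integer> windowedSums(List<Integer> batches, int windowLength, int slideInterval) {
        if (windowLength <= 0 || slideInterval <= 0) {
            throw new IllegalArgumentException("windowLength and slideInterval must be positive");
        }
        List<Integer> sums = new ArrayList<>();
        // 'end' is the exclusive index of the last batch inside the current window.
        for (int end = windowLength; end <= batches.size(); end += slideInterval) {
            int sum = 0;
            for (int i = end - windowLength; i < end; i++) {
                sum += batches.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Integer> batches = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        // A window of 3 batches that slides forward by 2 batches.
        System.out.println(windowedSums(batches, 3, 2)); // prints [6, 12, 18]
    }
}
```

Here the windows cover batches 1 to 3, 3 to 5, and 5 to 7, so consecutive windows overlap by one batch because the slide interval is shorter than the window length.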
Q22 - How does Spark Context in Spark Application pick the value for Spark Master?
That can be done in two ways.
1. Create a new SparkConf object and set the master using its setMaster() method. This Spark Configuration object is passed as an argument while creating the new Spark Context.
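    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Set the master (and other properties) on the configuration object
    SparkConf conf = new SparkConf()
            .setAppName("JavaKMeansExample")
            .setMaster("local")
            .set("spark.executor.memory", "3g")
            .set("spark.driver.memory", "3g");
    // Pass the configuration while creating the Spark Context
    JavaSparkContext jsc = new JavaSparkContext(conf);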
2. The <apache-installation-directory>/conf/spark-env.sh file, located locally on the machine, contains the Spark environment configuration. Spark Master is one of the parameters that can be provided in this configuration file.
Q23 - How do you configure Spark Application?
A Spark Application can be configured using properties set directly on a SparkConf object, which is passed during SparkContext initialization.
The following properties can be configured for a Spark Application :
- Spark Application Name
- Number of Spark Driver Cores
- Spark Driver’s Maximum Result Size
- Spark Driver’s Memory
- Spark Executors’ Memory
- Spark Extra Listeners
- Spark Local Directory
- Log Spark Configuration
- Spark Master
- Deploy Mode of Spark Driver
- Log Application Information
- Spark Driver Supervise Action
Reference : Configure Spark Application
Q24 - What is the use of Spark Environment Parameters? How do you configure those?
Spark Environment Parameters affect the behavior, working and memory usage of nodes in a cluster.
These parameters could be configured using the local config file spark-env.sh located at <apache-installation-directory>/conf/spark-env.sh.
Reference: Configure Spark Ecosystem
Q25 - How do you establish a connection from Apache Mesos to Apache Spark?
A connection to Mesos can be established in two ways.
- Configure the Spark Driver program to establish a connection to Mesos. The location of the Spark binaries must also be accessible to Apache Mesos.
- Install Spark in the same location as Mesos, and configure the property spark.mesos.executor.home to point to that location.
Q26 - How do you minimize data transfers between nodes in a cluster?
Minimizing shuffles reduces the data transfer between nodes and thus makes the job faster. Some practices reduce the need for shuffle operations: use a broadcast variable to join a small RDD with a large one, so that the small dataset is shipped to every node once instead of shuffling the large one, and use accumulators to aggregate values from the tasks back to the driver.
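The idea behind a broadcast join can be sketched in plain Java. This is an analogy under stated assumptions, not Spark code: the class BroadcastJoinSketch, the method joinPartition, and the order and country data are hypothetical. Each partition of the large dataset is joined locally against a read-only copy of the small lookup table, so no records of the large dataset need to move between partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Analogy of a map-side (broadcast) join; all names are illustrative.
class BroadcastJoinSketch {
    // Joins one partition of {orderId, countryCode} records against the
    // small lookup table; records without a matching code are dropped.
    static List<String> joinPartition(List<String[]> partition, Map<String, String> broadcastLookup) {
        List<String> joined = new ArrayList<>();
        for (String[] order : partition) {
            String country = broadcastLookup.get(order[1]);
            if (country != null) {
                joined.add(order[0] + ":" + country);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Small dataset: copied once to every "node" as a read-only map.
        Map<String, String> countries = new HashMap<>();
        countries.put("IN", "India");
        countries.put("US", "United States");
        Map<String, String> broadcastLookup = Collections.unmodifiableMap(countries);

        // Large dataset: already split into partitions that stay where they are.
        List<List<String[]>> partitions = Arrays.asList(
                Arrays.asList(new String[] {"o1", "IN"}, new String[] {"o2", "US"}),
                Arrays.asList(new String[] {"o3", "XX"}, new String[] {"o4", "IN"}));

        for (List<String[]> partition : partitions) {
            System.out.println(joinPartition(partition, broadcastLookup));
        }
    }
}
```

The design choice is that only the small table is replicated; the cost of copying it to each node is paid once, in exchange for avoiding a shuffle of the large dataset.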
Q27 - To run Spark Applications, should we install Spark on all the nodes of a YARN cluster?
Spark programs can be executed on top of YARN. So, there is no need to install Spark on all the nodes of a YARN cluster to run Spark applications.
Q28 - To run Spark Applications, should we install Spark on all the nodes of a Mesos cluster?
Spark programs can be executed on top of Mesos. So, there is no need to install Spark on all the nodes of a Mesos cluster to run Spark applications.
Q29 - What is the usage of GraphX module in Spark?
GraphX is Spark's graph processing library. It can be used to build and transform graphs interactively. The GraphX library ships with many graph algorithms; PageRank is one of them.
Q30 - How does Spark handle distributed processing?
Spark abstracts distributed processing through the Spark RDD API, so a general user does not need to worry about how data is processed across the cluster. There are some exceptions, though. When you optimize an application for performance, you should understand which transformations and actions require data transfer between nodes.
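What such an abstraction hides can be illustrated with a small plain-Java sketch. It is an analogy and not Spark code: the class PartitionedSumSketch and its methods are hypothetical, and a parallel stream stands in for the cluster. The caller only asks for a sum of squares; splitting the data into partitions, processing each partition independently, and combining the partial results happen behind the method call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Analogy of partitioned processing; a parallel stream stands in for a cluster.
class PartitionedSumSketch {
    // Splits the data into at most numPartitions contiguous chunks.
    static List<List<Integer>> partition(List<Integer> data, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<List<Integer>> parts = new ArrayList<>();
        int chunk = (data.size() + numPartitions - 1) / numPartitions; // ceiling division
        for (int start = 0; start < data.size(); start += chunk) {
            int end = Math.min(start + chunk, data.size());
            parts.add(new ArrayList<>(data.subList(start, end)));
        }
        return parts;
    }

    // Computes a partial result per partition, then combines the partial results.
    static int sumOfSquares(List<Integer> data, int numPartitions) {
        return partition(data, numPartitions).parallelStream()
                .mapToInt(p -> p.stream().mapToInt(x -> x * x).sum())
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(sumOfSquares(data, 3)); // prints 91
    }
}
```

Only the final combining step needs the partial results to come together, which is the kind of cross-partition data movement worth knowing about when tuning performance.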