
How to Setup an Apache Spark Cluster

Apache Spark Tutorial – We shall learn to set up an Apache Spark cluster with a master node and multiple slave (worker) nodes. You can set up a computer running Windows/Linux/MacOS as a master or as a slave.

Setup an Apache Spark Cluster

To set up an Apache Spark cluster, we need to know how to set up the master node and how to set up worker nodes.

Setup Master Node

Following is a step-by-step guide to set up the master node for an Apache Spark cluster. Execute the following steps on the node that you want to be the master.

  1. Navigate to Spark Configuration Directory

    Go to the SPARK_HOME/conf/ directory.

    SPARK_HOME is the complete path to the root directory of Apache Spark on your computer.

  2. Edit the file spark-env.sh – Set SPARK_MASTER_HOST

    Note : If spark-env.sh is not present, spark-env.sh.template would be present. Make a copy of spark-env.sh.template with the name spark-env.sh, and add/edit the field SPARK_MASTER_HOST. The relevant part of the file, with the SPARK_MASTER_HOST addition, is shown below.

    Replace the IP with the IP address assigned to your computer (the one you would like to make the master).
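
    For example, assuming the master's IP address is 192.168.0.102 (the address used in the examples later in this tutorial), the line added to spark-env.sh would be:

    SPARK_MASTER_HOST=192.168.0.102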

  3. Start spark as master

    Go to SPARK_HOME/sbin and execute the following command.

    $ ./start-master.sh
  4. Verify the log file.

    You would see, in the log file, the IP address of the master node, the port on which Spark has been started, the port number on which the Web UI has been started, etc.

Setting up Master Node is complete.

 

Setup Slave(Worker) Node

Following is a step-by-step guide to set up a slave (worker) node for an Apache Spark cluster. Execute the following steps on all of the nodes that you want to be worker nodes.

  1. Navigate to Spark Configuration Directory

    Go to the SPARK_HOME/conf/ directory.

    SPARK_HOME is the complete path to the root directory of Apache Spark on your computer.

  2. Edit the file spark-env.sh – Set SPARK_MASTER_HOST

    Note : If spark-env.sh is not present, spark-env.sh.template would be present. Make a copy of spark-env.sh.template with the name spark-env.sh, and add/edit the field SPARK_MASTER_HOST. The relevant part of the file, with the SPARK_MASTER_HOST addition, is shown below.

    Replace the IP with the IP address assigned to your master (the one you used while setting up the master node).
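
    Continuing the same example, with the master at 192.168.0.102, the line added to spark-env.sh on each worker node would be:

    SPARK_MASTER_HOST=192.168.0.102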

  3. Start spark as slave

    Go to SPARK_HOME/sbin and execute the following command.

    $ ./start-slave.sh spark://<your.master.ip.address>:7077
  4. Verify the log

    You would find in the log that this worker node has been successfully registered with the master running at spark://192.168.0.102:7077 on the network.

The setup of the worker node is complete.

To add more worker nodes to the Apache Spark cluster, simply repeat this worker setup process on the other nodes.

Once you have added some slaves to the cluster, you can view the workers connected to the master via the master's Web UI.

Hit the URL http://<your.master.ip.address>:<web-ui-port-number>/ (for example, http://192.168.0.102:8081/) in a browser. Following would be the output, with the connected slaves listed under Workers.


Master WEB UI – Setup an Apache Spark Cluster

 

Conclusion :

In this Apache Spark Tutorial, we have successfully set up an Apache Spark cluster.

What are the cluster managers supported in Apache Spark

The agenda of this tutorial is to understand what a cluster manager is, what its role is, and which cluster managers are supported in Apache Spark.

What is a cluster ?

A cluster is a set of tightly or loosely coupled computers connected through a LAN (Local Area Network). The computers in the cluster are usually called nodes. Each node in the cluster can have its own hardware and operating system, or the nodes can share hardware and an operating system among them. Resource (node) management and task execution on the nodes are controlled by software called the cluster manager.

What does a cluster manager do in Apache Spark cluster ?

A Spark application contains a main program (the main method in a Java Spark application), which is called the driver program. The driver program contains a SparkContext object, which can be configured with information such as the executors' memory, the number of executors, etc. The cluster manager keeps track of the resources (nodes) available in the cluster. When the SparkContext object is created, it connects to the cluster manager to negotiate for executors. From the available nodes, the cluster manager allocates some or all of the executors to the SparkContext, based on the demand. Note that multiple Spark applications can run on a single cluster; the procedure is the same, with the SparkContext of each Spark application requesting executors from the cluster manager. In a nutshell, the cluster manager allocates executors on nodes for a Spark application to run.
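
As a rough sketch (not part of the original tutorial), the following driver program configures its SparkContext with a master URL and executor memory; the master address reuses the standalone master from the cluster setup section above, and the memory value is an arbitrary example.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverProgramSketch {
  public static void main(String[] args) {
    // The driver program configures and creates the SparkContext.
    SparkConf conf = new SparkConf()
        .setAppName("Driver program example")
        .setMaster("spark://192.168.0.102:7077")  // URL of the Spark Standalone cluster manager
        .set("spark.executor.memory", "1g");      // executors' memory requested from the cluster manager

    // Creating the context connects to the cluster manager and negotiates for executors.
    JavaSparkContext sc = new JavaSparkContext(conf);
    System.out.println("Application id : " + sc.sc().applicationId());

    sc.stop();
  }
}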


Role of Cluster Manager in Apache Spark

Cluster managers supported in Apache Spark

Following are the cluster managers available in Apache Spark :

Spark Standalone Cluster Manager

– The Standalone cluster manager is a simple cluster manager that comes included with Spark.

Apache Mesos

– Apache Mesos is a general cluster manager that can also run Hadoop MapReduce and service applications.

Hadoop YARN

– Hadoop YARN is the resource manager in Hadoop 2.
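
As a brief illustration (not from the original tutorial), the cluster manager an application connects to is selected through the master URL passed to Spark. The Mesos host below is a placeholder; the standalone URL reuses the master from the cluster setup section above.

import org.apache.spark.sql.SparkSession;

public class ChooseClusterManager {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Cluster manager selection example")
        // Spark Standalone cluster manager:
        .master("spark://192.168.0.102:7077")
        // Apache Mesos (use instead of the line above):
        //   .master("mesos://mesos-master-host:5050")
        // Hadoop YARN (the ResourceManager address is read from the Hadoop configuration):
        //   .master("yarn")
        .getOrCreate();

    System.out.println("Connected with master : " + spark.sparkContext().master());
    spark.stop();
  }
}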

In this Apache Spark Tutorial, we have learnt about the cluster managers available in Spark and how a spark application could be launched using these cluster managers.

How to install latest Apache Spark on Ubuntu 16

In this Apache Spark Tutorial, we shall learn to install the latest Apache Spark on Ubuntu 16.

Install dependencies first

Install Java

Open a terminal and run the following command to install Java :

sparkuser@tutorialkart:~$ sudo apt-get install default-jdk

Install latest Apache Spark on Ubuntu 16

Download Spark

Download latest Apache Spark release from http://spark.apache.org/downloads.html


Download Latest Apache Spark

Unzip and move spark to /usr/lib/

Open a terminal.

Unzip the downloaded .tgz file and move the folder to /usr/lib/ using the following commands :

sparkuser@tutorialkart:~$ tar xzvf spark-2.2.0-bin-hadoop2.7.tgz
sparkuser@tutorialkart:~$ mv spark-2.2.0-bin-hadoop2.7/ spark
sparkuser@tutorialkart:~$ sudo mv spark/ /usr/lib/

Add Path

Open ~/.bashrc with any editor and add the paths to Java and Spark. We shall use the nano editor here :

sparkuser@tutorialkart:~$ sudo nano ~/.bashrc

Add the following lines at the end of the file :

export JAVA_HOME=/usr/lib/jvm/default-java
export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$SPARK_HOME/bin

The latest Apache Spark is now successfully installed on your Ubuntu 16 system.

Verify installation

To verify the installation, close the already opened Terminal and open a new Terminal, so that the changes to ~/.bashrc take effect. Run the following command :

sparkuser@tutorialkart:~$ spark-shell

Also verify the versions of Spark, Java and Scala displayed during the start of spark-shell.

Type :quit to exit from the Scala shell (spark-shell).

How to load data from JSON file and execute SQL query in Spark SQL

Load data from JSON data source and execute Spark SQL query

The Apache Spark Dataset and DataFrame APIs provide Spark SQL with an abstraction over data sources. A Dataset provides the benefits of RDDs along with the optimization benefits of Spark SQL's execution engine.

A Dataset loads the JSON data source as a distributed collection of data. A DataFrame is a Dataset with the data arranged into named columns. The architecture connecting the JSON data source, Dataset, DataFrame and Spark SQL is shown below :


JSON -> Dataset -> DataFrame -> Spark SQL -> SQL Query

Load data from JSON file and execute SQL query

Following is a step-by-step process to load data from a JSON file and execute a SQL query on the loaded data :

  1. Create a Spark Session

    Provide application name and set master to local with two threads.

  2. Read JSON data source

    SparkSession.read().json(String path) accepts either a single text file or a directory of text files, and loads the data into a Dataset.

  3. Create a temporary view using the DataFrame

  4. Run SQL query

    The temporary view can be considered as a table, with the attributes under the schema root as its columns.

    Table : people
    Columns : name, salary

  5. Stop spark session

A complete Java program to load data from a JSON file and execute a SQL query is given below.
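
The following is a minimal sketch of such a program; the input path data/people.json is an assumed location, and the table name (people) and columns (name, salary) follow the schema listed above.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LoadJsonRunSqlQuery {
  public static void main(String[] args) {
    // 1. Create a Spark Session : application name, and local master with two threads
    SparkSession spark = SparkSession.builder()
        .appName("Load JSON and run SQL query")
        .master("local[2]")
        .getOrCreate();

    // 2. Read the JSON data source into a Dataset<Row> (a DataFrame);
    //    the path may also point to a directory of JSON files
    Dataset<Row> people = spark.read().json("data/people.json");
    people.printSchema();

    // 3. Create a temporary view using the DataFrame
    people.createOrReplaceTempView("people");

    // 4. Run a SQL query : table "people", columns "name" and "salary"
    Dataset<Row> results = spark.sql("SELECT name, salary FROM people");
    results.show();

    // 5. Stop the Spark session
    spark.stop();
  }
}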

 

Conclusion :

In this Apache Spark Tutorial, we have learnt to load a JSON file into a Dataset and access the data using SQL queries through Spark SQL.

 

 

Apache Spark SQL Library – Features, Architecture, Examples

Apache Spark Tutorial

What is Spark SQL ?

Spark SQL is one of the four libraries of Apache Spark. It provides Spark with the ability to access structured and semi-structured data and to optimize operations on that data.

Features

When Spark adopted SQL as a library, it brought a number of capabilities with it. Here are the features that Spark provides through its SQL library.

  1. Relational Processing

    With the addition of SQL, Spark added relational processing ability to its existing functional programming.

  2. Structured/Semi-structured data analysis

  3. Supporting existing Data Formats

    New data formats keep evolving, and the industry keeps embracing them, resulting in piles of data stored in these formats. In the Big Data ecosystem, it is important for a new tool or library to provide compatibility with, or connections to, the existing popular data formats. Spark provides support for data sources like Parquet, JSON, Apache Hive, Cassandra, etc.

  4. Data Transformations

    Spark's RDD API provides best-in-class performance for transformations, and Spark SQL exploits this by converting SQL queries into RDD transformations.

  5. Performance

    Spark has a performance edge over Hadoop. Spark SQL delivers much better performance than Hadoop, especially for repeated iterations over datasets, because of its in-memory processing.

  6. Standard JDBC/ODBC Connectivity

    Spark SQL provides an interface to connect through standard JDBC/ODBC connections and perform queries (table operations) on the structured data.

  7. User Defined Functions

    Spark lets you define your own column-based functions (UDFs) to extend the built-in Spark functions, as in the sketch below.
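
As a rough sketch (not from the original tutorial), the program below registers a hypothetical UDF named toUpper and uses it in a SQL query; the input path data/people.json and the name column are assumptions carried over from the JSON example earlier in this tutorial.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class SparkSqlUdfSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Spark SQL UDF example")
        .master("local[2]")
        .getOrCreate();

    // Register a column-based function named "toUpper" (a hypothetical example UDF)
    spark.udf().register("toUpper",
        (UDF1<String, String>) s -> s == null ? null : s.toUpperCase(),
        DataTypes.StringType);

    // Use the registered UDF in a SQL query
    Dataset<Row> people = spark.read().json("data/people.json");
    people.createOrReplaceTempView("people");
    spark.sql("SELECT toUpper(name) AS name_upper FROM people").show();

    spark.stop();
  }
}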

Get Hands on with Examples

  1. Querying using Spark SQL
  2. Spark SQL with JSON
  3. Hive Tables with Spark SQL

Wind Up

In this Apache Spark Tutorial, we have learnt about Spark SQL and its features and capabilities.

How to install latest Apache Spark on Mac OS

Install Latest Apache Spark on Mac OS

Following is a detailed step-by-step process to install the latest Apache Spark on Mac OS. We shall first install the dependencies : Java and Scala. To install these, we take the help of Homebrew and xcode-select.


Install Spark on Mac OS

  • Step 1 : Install Homebrew

    Open Terminal.
    Run the following command in Terminal :

    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

    Enter the password if asked and continue.

  • Step 2 : Install xcode-select

    To install Java, Scala and Apache Spark through command line interface in Terminal, we shall install xcode-select. Enter and run the following command in Terminal :

    xcode-select --install
  • Step 3 : Install Java

    To install Java through command line, enter and run the following command in the Terminal :

    brew cask install java
  • Step 4 : Install Scala

    To install Scala through command line, enter and run the following command in Terminal :

    brew install scala
  • Step 5 : Install Spark

    To install Apache Spark through command line, enter and run the following command in the Terminal :

    brew install apache-spark
  • Step 6 : Verify the installation

    To verify if the installation is successful, run spark-shell using the following command in Terminal :

    spark-shell

We have successfully installed Apache Spark on Mac OS.

The installation directory would be /usr/local/Cellar/apache-spark/.

Conclusion :

In this Apache Spark Tutorial, we have learnt to install the latest Apache Spark on Mac OS.

Topic modelling using Latent Dirichlet Allocation in Apache Spark MLlib

What is Topic Modelling ?

Topic modelling is the natural language processing task of identifying the probable topic that is represented by the text in a document.

We come across articles or documents containing text that usually belongs to a topic. For example, consider news articles, research papers or internet pages. Each of these describes or explains a topic. In fact, one usually starts writing text around a topic.

The very example is right here: in this tutorial, we are discussing Topic Modelling, so our topic is “Topic Modelling”, and you might come across the following words more frequently than others :

  • document
  • natural language processing
  • task
  • topic
  • model
  • probability

As another example, if a document belongs to the topic “forest”, it might contain frequent words like trees, animals, types of forest, forest life cycle, ecosystem, etc.

To capture this kind of information in a mathematical model, Apache Spark MLlib provides topic modelling using Latent Dirichlet Allocation (LDA).

Topic modelling using Latent Dirichlet Allocation in Apache Spark MLlib

Now we shall learn, in a step-by-step process, how to generate the topic model and use it for prediction.

  • Step 1 : Start Spark Context

    Configure the ecosystem to run on local and Start Spark Context.

  • Step 2 : Load Data into Spark RDD

    Load and parse the sample data from data/mllib/sample_lda_data.txt (we are using the sample data provided in the Apache Spark MLlib examples on GitHub). Each line in the file represents a document, so index each document with a unique id.

  • Step 3 : Run LDA Topic Modeller

    Set the number of topics, and run the LDA Topic Modeller against the data corpus.

  • Step 4 : Output Topics Distribution over vocabulary

    Once the model is generated, we may print the topics’ distribution over the vocabulary.

  • Step 5 : Model Persistence

    Save the generated model, so that it can be used to predict the topic of further documents.

  • Step 6 : Stop Spark Context

Example program :
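
The following is a sketch that closely follows the Java LDA example from the Spark MLlib documentation; the number of topics (3) and the save path myLDAModel are assumed values.

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class TopicModellingLDAExample {
  public static void main(String[] args) {
    // Step 1 : Start Spark Context, configured to run on local
    SparkConf conf = new SparkConf().setAppName("LDA Topic Modelling").setMaster("local[2]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Step 2 : Load and parse the data; each line is a document (a vector of word counts)
    JavaRDD<String> data = jsc.textFile("data/mllib/sample_lda_data.txt");
    JavaRDD<Vector> parsedData = data.map(s -> {
      String[] sarray = s.trim().split(" ");
      double[] values = new double[sarray.length];
      for (int i = 0; i < sarray.length; i++) {
        values[i] = Double.parseDouble(sarray[i]);
      }
      return Vectors.dense(values);
    });
    // Index each document with a unique id
    JavaPairRDD<Long, Vector> corpus =
        JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(Tuple2::swap));
    corpus.cache();

    // Step 3 : Set the number of topics and run the LDA Topic Modeller against the corpus
    int numTopics = 3;
    DistributedLDAModel ldaModel = (DistributedLDAModel) new LDA().setK(numTopics).run(corpus);

    // Step 4 : Output the topics' distribution over the vocabulary
    Matrix topics = ldaModel.topicsMatrix();
    System.out.println("Learned topics (as distributions over a vocabulary of "
        + ldaModel.vocabSize() + " words):");
    for (int topic = 0; topic < numTopics; topic++) {
      System.out.print("Topic " + topic + " :");
      for (int word = 0; word < ldaModel.vocabSize(); word++) {
        System.out.print(" " + topics.apply(word, topic));
      }
      System.out.println();
    }

    // Step 5 : Model persistence - save the model to predict topics of further documents
    ldaModel.save(jsc.sc(), "myLDAModel");

    // Step 6 : Stop Spark Context
    jsc.stop();
  }
}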

 

 

 

Spark MLlib Tutorial – Scalable Machine Learning Library

Apache Spark MLlib Tutorial – Learn about Spark’s Scalable Machine Learning Library

MLlib is one of the four libraries of Apache Spark. It is a scalable machine learning library.

Programming

MLlib applications can be developed using Java (through Spark's APIs).

With the latest Spark releases, MLlib is interoperable with Python's NumPy library and R libraries.

Data Source

Using MLlib, one can access HDFS (Hadoop Distributed File System) and HBase, in addition to local files. This enables MLlib to be easily plugged into Hadoop workflows.

Performance

Spark's framework excels at iterative computation, which enables the iterative parts of MLlib algorithms to run fast. MLlib also contains high-quality algorithms for classification, regression, recommendation, clustering, topic modelling, etc.


Apache Spark MLlib Tutorial

Following are some examples of MLlib algorithms, with a step-by-step understanding of ML Pipeline construction and model building :

  1. Classification using Logistic Regression
  2. Classification using Naive Bayes
  3. Generalized Regression
  4. Survival Regression
  5. Decision Trees
  6. Random Forests
  7. Gradient Boosted Trees
  8. Recommendation using Alternating Least Squares (ALS)
  9. Clustering using KMeans
  10. Clustering using Gaussian Mixtures
  11. Topic Modelling using Latent Dirichlet Allocation
  12. Frequent Itemsets
  13. Association Rules
  14. Sequential Pattern Mining

MLlib Utilities

MLlib provides the following workflow utilities :

  1. Feature Transformation
  2. ML Pipeline construction
  3. Model Evaluation
  4. Hyper-parameter tuning
  5. Saving and loading of models and pipelines
  6. Distributed Linear Algebra
  7. Statistics

 

Conclusion :

In this Apache Spark Tutorial – Spark MLlib Tutorial, we have learnt about different machine learning algorithms available in Spark MLlib and different utilities MLlib provides.