Spark Shell is an interactive shell through which we can access Spark’s API. Spark provides the shell in two programming languages : Scala and Python.

In this tutorial, we shall learn the usage of Scala Spark Shell with a basic word count example.

Prerequisites

It is assumed that you already installed Apache Spark on your local machine. If not, please refer Install Spark on Ubuntu or Install Spark on MacOS.

Hands on Scala Spark Shell

Start Spark interactive Scala Shell

To start Scala Spark shell open a Terminal and run the following command.

$ spark-shell

For the word-count example, we shall start with option --master local[4] meaning the spark context of this spark shell acts as a master on local node with 4 threads.

$ spark-shell --master local[4]

If you accidentally started spark shell without options, kill the shell instance.

~$spark-shell --master "local[4]"
Using Sparks default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/11/12 13:07:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/12 13:07:31 WARN Utils: Your hostname, tutorialkart resolves to a loopback address: 127.0.0.1; using 192.168.0.104 instead (on interface wlp7s0)
17/11/12 13:07:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/11/12 13:07:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.104:4040
Spark context available as 'sc' (master = local[4], app id = local-1510472252847).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

From the above Shell startup, following points could be made

Spark context Web UI

Spark context Web UI is available at http://192.168.0.104:4040 . Open a browser and hit the url.

Spark context available as sc, meaning you may access the spark context in the shell as variable named ‘sc’.

Spark session available as spark, meaning you may access the spark session in the shell as variable named ‘spark’.

Word-Count Example with Spark (Scala) Shell

Following are the three commands that we shall use for Word Count Example in Spark Shell :

/** map */
var map = sc.textFile("/path/to/text/file").flatMap(line => line.split(" ")).map(word => (word,1));

/** reduce */
var counts = map.reduceByKey(_ + _);

/** save the output to file */
counts.saveAsTextFile("/path/to/output/")

Map

In this step, using Spark context variable, sc, we read a text file.

sc.textFile("/path/to/text/file")

then we split each line using space " " as separator.

flatMap(line => line.split(" "))

and we map each word to a tuple (word, 1), 1 being the number of occurrences of word.

map(word => (word,1))

We use the tuple (word,1) as (key, value) in reduce stage.

Reduce

We reduce all the words based on Key

var counts = map.reduceByKey(_ + _);

Save counts to local file

The counts could be saved to local file.

counts.saveAsTextFile("/path/to/output/")

When you run all the commands in a Terminal, Spark Shell looks like:

scala> var map = sc.textFile("/home/arjun/data.txt").flatMap(line => line.split(" ")).map(word => (word,1));
map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:24

scala> var counts = map.reduceByKey(_ + _);
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:26

scala> counts.saveAsTextFile("/home/arjun/output/");

scala>

You can verify the output of word count.

$ ls
part-00000  part-00001  _SUCCESS

Sample of the contents of output file, part-00000, is shown below :

/home/arjun/output$cat part-00000
(branches,1)
(sent,1)
(mining,1)
(tasks,4)

We have successfully counted unique words in a file with Word Count example run on Scala Spark Shell.

You may use Spark Context Web UI to check the details of the Job (Word Count) that we have just run.

Scala Spark Shell - Web UI

Navigate through other tabs to get an idea of Spark Web UI and the details about the Word Count Job.

Spark Shell Suggestions

Suggestions

Spark Shell can provide suggestions. Type part of the command and click on ‘Tab’ key for suggestions.

scala> counts.sa
sample        sampleByKeyExact      saveAsHadoopFile            saveAsNewAPIHadoopFile   saveAsSequenceFile   
sampleByKey   saveAsHadoopDataset   saveAsNewAPIHadoopDataset   saveAsObjectFile         saveAsTextFile

Kill the Spark Shell Instance

To kill the spark shell instance, hit Control+Z on the current shell and kill the spark instance using process id, pid, and with the help of kill command.

Find pid :

~$ ps -aef|grep spark
arjun     8895  8113  0 13:01 pts/16   00:00:00 bash /usr/lib/spark/bin/spark-shell
arjun     8906  8895 91 13:01 pts/16   00:01:13 /usr/lib/jvm/default-java/jre/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell
arjun     9106  8113  0 13:03 pts/16   00:00:00 grep --color=auto spark

In this case, 8906 is the pid.

Kill the instance using pid :

~$ kill -9 8906

Conclusion

In this Apache Spark TutorialScala Spark Shell, we have learnt the usage of Spark Shell using Scala programming language with the help of Word Count Example.