Spark Shell is an interactive shell through which we can access Spark’s API. Spark provides the shell in two programming languages : Scala and Python.
In this tutorial, we shall learn the usage of Scala Spark Shell with a basic word count example.
Hands on Scala Spark Shell
Start Spark interactive Scala Shell
To start Scala Spark shell open a Terminal and run the following command.
For the word-count example, we shall start with option
--master local meaning the spark context of this spark shell acts as a master on local node with 4 threads.
$ spark-shell --master local
If you accidentally started spark shell without options, kill the shell instance.
~$spark-shell --master "local" Using Sparks default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/11/12 13:07:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/11/12 13:07:31 WARN Utils: Your hostname, tutorialkart resolves to a loopback address: 127.0.0.1; using 192.168.0.104 instead (on interface wlp7s0) 17/11/12 13:07:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 17/11/12 13:07:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Spark context Web UI available at http://192.168.0.104:4040 Spark context available as 'sc' (master = local, app id = local-1510472252847). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.2.0 /_/ Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151) Type in expressions to have them evaluated. Type :help for more information. scala>
From the above Shell startup, following points could be made
Spark context Web UI is available at
http://192.168.0.104:4040 . Open a browser and hit the url.
Spark context available as
sc, meaning you may access the spark context in the shell as variable named ‘sc’.
Spark session available as
spark, meaning you may access the spark session in the shell as variable named ‘spark’.
Word-Count Example with Spark (Scala) Shell
Following are the three commands that we shall use for Word Count Example in Spark Shell :
/** map */ var map = sc.textFile("/path/to/text/file").flatMap(line => line.split(" ")).map(word => (word,1)); /** reduce */ var counts = map.reduceByKey(_ + _); /** save the output to file */ counts.saveAsTextFile("/path/to/output/")
In this step, using Spark context variable, sc, we read a text file.
then we split each line using space
" " as separator.
flatMap(line => line.split(" "))
and we map each word to a tuple (word, 1), 1 being the number of occurrences of word.
map(word => (word,1))
We use the tuple (word,1) as (key, value) in reduce stage.
We reduce all the words based on Key
var counts = map.reduceByKey(_ + _);
Save counts to local file
The counts could be saved to local file.
When you run all the commands in a Terminal, Spark Shell looks like:
scala> var map = sc.textFile("/home/arjun/data.txt").flatMap(line => line.split(" ")).map(word => (word,1)); map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD at map at <console>:24 scala> var counts = map.reduceByKey(_ + _); counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:26 scala> counts.saveAsTextFile("/home/arjun/output/"); scala>
You can verify the output of word count.
$ ls part-00000 part-00001 _SUCCESS
Sample of the contents of output file,
part-00000, is shown below :
/home/arjun/output$cat part-00000 (branches,1) (sent,1) (mining,1) (tasks,4)
We have successfully counted unique words in a file with Word Count example run on Scala Spark Shell.
You may use Spark Context Web UI to check the details of the Job (Word Count) that we have just run.
Navigate through other tabs to get an idea of Spark Web UI and the details about the Word Count Job.
Spark Shell Suggestions
Spark Shell can provide suggestions. Type part of the command and click on ‘Tab’ key for suggestions.
scala> counts.sa sample sampleByKeyExact saveAsHadoopFile saveAsNewAPIHadoopFile saveAsSequenceFile sampleByKey saveAsHadoopDataset saveAsNewAPIHadoopDataset saveAsObjectFile saveAsTextFile
To kill the spark shell instance, hit Control+Z on the current shell and kill the spark instance using process id, pid, and with the help of kill command.
Find pid :
~$ ps -aef|grep spark arjun 8895 8113 0 13:01 pts/16 00:00:00 bash /usr/lib/spark/bin/spark-shell arjun 8906 8895 91 13:01 pts/16 00:01:13 /usr/lib/jvm/default-java/jre/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell arjun 9106 8113 0 13:03 pts/16 00:00:00 grep --color=auto spark
In this case, 8906 is the pid.
Kill the instance using pid :
~$ kill -9 8906
In this Apache Spark Tutorial – Scala Spark Shell, we have learnt the usage of Spark Shell using Scala programming language with the help of Word Count Example.