Spark Shell is an interactive shell through which we can access Spark's API. Spark provides the shell in two programming languages: Scala and Python. In this Apache Spark tutorial, we shall learn the usage of the Scala Spark Shell with a basic word count example.
Scala Spark Shell
Start Spark interactive Scala Shell
The Scala shell can be started by opening a Terminal window and running the following command.
For the word-count example, we shall start with the option --master local, meaning the Spark context of this Spark shell acts as a master on the local node, using a single worker thread (see the variants sketched below for running with more threads).
$ spark-shell --master local
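The local master URL also accepts a thread count if you want more parallelism. A minimal sketch of the common variants (the thread count 4 below is an illustrative choice, not a requirement):

$ spark-shell --master local        # one worker thread
$ spark-shell --master local[4]     # 4 worker threads
$ spark-shell --master local[*]     # one thread per CPU core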
If you accidentally started the Spark shell without options, kill that shell instance and start it again with the --master option:
~$ spark-shell --master local
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/11/12 13:07:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/12 13:07:31 WARN Utils: Your hostname, tutorialkart resolves to a loopback address: 127.0.0.1; using 192.168.0.104 instead (on interface wlp7s0)
17/11/12 13:07:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/11/12 13:07:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.104:4040
Spark context available as 'sc' (master = local, app id = local-1510472252847).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
From the above shell startup, the following points can be noted:
The Spark context Web UI is available at http://192.168.0.104:4040. You may open a browser and hit the URL.
Spark context is available as 'sc', meaning you may access the Spark context in the shell as a variable named 'sc'.
Spark session is available as 'spark', meaning you may access the Spark session in the shell as a variable named 'spark'.
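As a quick sanity check, you can try both variables right away. A minimal sketch (the res numbering and the version string assume the fresh 2.2.0 shell shown above):

scala> sc.version
res0: String = 2.2.0

scala> spark.range(5).count()
res1: Long = 5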
Word-Count Example with Spark (Scala) Shell
Following are the three commands that can be run in the Spark Scala shell, one by one:
/** map */
var map = sc.textFile("/path/to/text/file").flatMap(line => line.split(" ")).map(word => (word,1));
/** reduce */
var counts = map.reduceByKey(_ + _);
/** save the output to file */
counts.saveAsTextFile("/path/to/output/")
In this step, using the Spark context variable, sc, we read a text file,
then we split each line using the space " " as separator:
flatMap(line => line.split(" "))
and we map each word to a tuple (word, 1), 1 being the number of occurrences of the word:
map(word => (word,1))
We use the tuple (word, 1) as (key, value) in the reduce stage.
We reduce all the words based on the key, summing up the counts per word:
var counts = map.reduceByKey(_ + _);
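To see what each stage produces before running on a real file, the following minimal sketch can be pasted into the shell; it builds a tiny in-memory RDD with sc.parallelize instead of sc.textFile, and the two sample lines are purely illustrative:

// build a tiny RDD from two sample lines
val lines = sc.parallelize(Seq("to be or", "not to be"))

// after flatMap: Array(to, be, or, not, to, be)
lines.flatMap(line => line.split(" ")).collect()

// after map + reduceByKey: (to,2), (be,2), (or,1), (not,1), in some partition-dependent order
lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()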
Save counts to local file
The counts can be saved to a local file, as sketched below.
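A minimal sketch of the save step, assuming a writable output path of your choice (Spark creates the directory itself, and the directory must not already exist):

counts.saveAsTextFile("/path/to/output/")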
All the commands run in the Terminal are shown below:
scala> var map = sc.textFile("/home/arjun/data.txt").flatMap(line => line.split(" ")).map(word => (word,1));
map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD at map at <console>:24
scala> var counts = map.reduceByKey(_ + _);
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:26
Verify the Output
Listing the output directory should show the part files along with a _SUCCESS marker:
part-00000  part-00001  _SUCCESS
Each line of a part file, such as part-00000, contains one (word, count) tuple, written in the form (word,N).
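To peek at the actual counts, you can cat a part file (a sketch, assuming the output path used in the save step above):

~$ cat /path/to/output/part-00000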
We have successfully counted the unique words in a file with the help of the Scala Spark Shell.
You may use the Spark context Web UI to check the details of the Job (Word Count) we have just run.
Navigate through the other tabs to get an idea of the Spark Web UI and the details of the Word Count Job.
Spark Shell can provide suggestions: type part of a command and press the Tab key for completions.
sample sampleByKeyExact saveAsHadoopFile saveAsNewAPIHadoopFile saveAsSequenceFile
sampleByKey saveAsHadoopDataset saveAsNewAPIHadoopDataset saveAsObjectFile saveAsTextFile
To kill the Spark shell instance, hit Ctrl+Z in the current shell to suspend it, then find the pid of the Spark instance and terminate it with the kill command. (Alternatively, type :quit at the scala> prompt to exit the shell cleanly.)
Find pid :
~$ ps -aef|grep spark
arjun 8895 8113 0 13:01 pts/16 00:00:00 bash /usr/lib/spark/bin/spark-shell
arjun 8906 8895 91 13:01 pts/16 00:01:13 /usr/lib/jvm/default-java/jre/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell
arjun 9106 8113 0 13:03 pts/16 00:00:00 grep --color=auto spark
In this case, 8906 is the pid.
Kill the instance using pid :
~$ kill -9 8906
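Alternatively, a one-liner sketch that matches the JVM by the class name visible in the ps output above (assuming pkill is available on your system):

~$ pkill -f org.apache.spark.deploy.SparkSubmit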
In this Scala Spark Shell tutorial, we have learnt the usage of the Spark Shell with the Scala programming language, with the help of a Word Count example.