Spark – Print contents of RDD
RDD (Resilient Distributed Dataset) is a fault-tolerant collection of elements that can be operated on in parallel.
To print the contents of an RDD, we can use either the collect action or the foreach action.
RDD.collect() returns all the elements of the dataset as an array at the driver program, and by looping over this array we can print the elements of the RDD.
RDD.foreach(f) runs a function f on each element of the dataset.
In this tutorial, we will go through examples using the collect and foreach actions, in both Java and Python.
RDD.collect() – Print RDD – Java Example
In the following example, we will write a Java program that loads an RDD from a text file and prints its contents to the console using RDD.collect().
PrintRDD.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PrintRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // read text file to RDD
        JavaRDD<String> lines = sc.textFile("data/rdd/input/file1.txt");
        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println("* " + line);
        }
    }
}
file1.txt
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
Output
18/02/10 16:31:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/10 16:31:33 INFO DAGScheduler: ResultStage 0 (collect at PrintRDD.java:18) finished in 0.513 s
18/02/10 16:31:33 INFO DAGScheduler: Job 0 finished: collect at PrintRDD.java:18, took 0.726936 s
* Welcome to TutorialKart
* Learn Apache Spark
* Learn to work with RDD
18/02/10 16:31:33 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 16:31:33 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 16:31:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
RDD.collect() – Print RDD – Python Example
In the following example, we will write a Python program that loads an RDD from a text file and prints its contents to the console using RDD.collect().
print-rdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Print Contents of RDD - Python")
    sc = SparkContext(conf=conf)
    # read input text file to RDD
    rdd = sc.textFile("data/rdd/input/file1.txt")
    # collect the RDD to a list
    list_elements = rdd.collect()
    # print the list
    for element in list_elements:
        print(element)
Run this Python program from the terminal/command prompt as shown below.
$ spark-submit print-rdd.py
Output
18/02/10 16:37:05 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/readToRDD/print-rdd.py:15) finished in 0.378 s
18/02/10 16:37:05 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/readToRDD/print-rdd.py:15, took 0.546189 s
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
18/02/10 16:37:05 INFO SparkContext: Invoking stop() from shutdown hook
RDD.foreach() – Print RDD – Java Example
In the following example, we will write a Java program that loads an RDD from a text file and prints its contents to the console using RDD.foreach().
PrintRDD.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class PrintRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // read text file to RDD
        JavaRDD<String> lines = sc.textFile("data/rdd/input/file1.txt");
        // print each element of the RDD
        lines.foreach(new VoidFunction<String>() {
            public void call(String line) {
                System.out.println("* " + line);
            }
        });
    }
}
RDD.foreach() – Print RDD – Python Example
In the following example, we will write a Python program that loads an RDD from a text file and prints its contents to the console using RDD.foreach().
print-rdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Print Contents of RDD - Python")
    sc = SparkContext(conf=conf)
    # read input text file to RDD
    rdd = sc.textFile("data/rdd/input/file1.txt")

    def f(x):
        print(x)

    # apply f(x) for each element of rdd
    rdd.foreach(f)
Conclusion
In this Spark Tutorial – Print Contents of RDD, we have learned to print the elements of an RDD using the collect and foreach actions, with the help of Java and Python examples.