Spark – Print contents of RDD

RDD (Resilient Distributed Dataset) is a fault-tolerant collection of elements that can be operated on in parallel.

To print the contents of an RDD, we can use either the collect action or the foreach action.

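RDD.collect() returns all the elements of the dataset to the driver program as a local collection (a List in Java, a list in Python). Looping over that collection prints every element of the RDD. Because every element is copied to the driver, this approach suits RDDs small enough to fit in the driver's memory.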
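
When an RDD is too large to collect in full, or only a few elements are of interest, take(n) returns at most n elements to the driver instead. The following is a minimal sketch in Python; the helper name print_sample and its default of 10 elements are assumptions made for illustration and are not part of the Spark API.

```python
def print_sample(rdd, n=10):
    """Print at most n elements of an RDD on the driver.

    rdd is expected to be a pyspark RDD; only its take() method is used.
    print_sample is a hypothetical helper, not a Spark function.
    """
    # take(n) returns a list holding at most n elements,
    # so the full dataset is never copied to the driver
    sample = rdd.take(n)
    for element in sample:
        print(element)
    # report how many elements were actually printed
    return len(sample)
```

With an RDD created by sc.textFile(...), as in the Python examples in this tutorial, calling print_sample(rdd, 2) would print at most two lines of the file.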

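RDD.foreach(f) runs a function f on each element of the dataset. The function runs on the executors, so on a cluster anything it prints goes to the executors' stdout rather than to the driver's console. In local mode, which the examples in this tutorial use, the output appears in the same console.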
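
Any callable that takes one element can serve as f, and foreach discards whatever it returns. As a minimal sketch in Python, the hypothetical helper make_printer below builds such a function with a configurable prefix; both the helper and its prefix parameter are assumptions for illustration and are not part of PySpark.

```python
def make_printer(prefix="* "):
    """Build a one-argument function suitable for rdd.foreach().

    make_printer is a hypothetical helper, not a PySpark function.
    """
    def print_element(element):
        # format the element and print it wherever this function runs
        text = prefix + str(element)
        print(text)
        # returned only for convenience; foreach ignores return values
        return text
    return print_element
```

Assuming rdd is a pyspark RDD, rdd.foreach(make_printer("* ")) would print each element with a leading "* ".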

In this tutorial, we will go through examples of the collect and foreach actions in Java and Python.

RDD.collect() – Print RDD – Java Example

In the following example, we write a Java program that loads an RDD from a text file and prints its contents to the console using RDD.collect().

PrintRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PrintRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// read the text file into an RDD
		JavaRDD<String> lines = sc.textFile("data/rdd/input/file1.txt");
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println("* "+line);
		}
	}
}

file1.txt

Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

Output

18/02/10 16:31:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/02/10 16:31:33 INFO DAGScheduler: ResultStage 0 (collect at PrintRDD.java:18) finished in 0.513 s
18/02/10 16:31:33 INFO DAGScheduler: Job 0 finished: collect at PrintRDD.java:18, took 0.726936 s
* Welcome to TutorialKart
* Learn Apache Spark
* Learn to work with RDD
18/02/10 16:31:33 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 16:31:33 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 16:31:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

RDD.collect() – Print RDD – Python Example

In the following example, we write a Python program that loads an RDD from a text file and prints its contents to the console using RDD.collect().

print-rdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Print Contents of RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  rdd = sc.textFile("data/rdd/input/file1.txt")

  # collect the RDD to a list
  list_elements = rdd.collect()

  # print the list
  for element in list_elements:
    print(element)

Run this Python program from terminal/command-prompt as shown below.

$ spark-submit print-rdd.py

Output

18/02/10 16:37:05 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/readToRDD/print-rdd.py:15) finished in 0.378 s
18/02/10 16:37:05 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/readToRDD/print-rdd.py:15, took 0.546189 s
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
18/02/10 16:37:05 INFO SparkContext: Invoking stop() from shutdown hook

RDD.foreach() – Print RDD – Java Example

In the following example, we write a Java program that loads an RDD from a text file and prints its contents to the console using RDD.foreach().

PrintRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class PrintRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// read the text file into an RDD
		JavaRDD<String> lines = sc.textFile("data/rdd/input/file1.txt");
		
		// print each element; call() is invoked once per line
		lines.foreach(new VoidFunction<String>() {
			@Override
			public void call(String line) {
				System.out.println("* " + line);
			}
		});
	}
}

RDD.foreach() – Print RDD – Python Example

In the following example, we write a Python program that loads an RDD from a text file and prints its contents to the console using RDD.foreach().

print-rdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Print Contents of RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  rdd = sc.textFile("data/rdd/input/file1.txt")

  def f(x): print(x)

  # apply f(x) for each element of rdd
  rdd.foreach(f)
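
On a cluster, foreach(f) runs f on the executors, so the printed lines land in the executor logs and not in the driver's console. One way to print from the driver without collecting the whole RDD at once is toLocalIterator(), which brings the elements back one partition at a time. The sketch below is a minimal illustration; the helper name print_on_driver and its prefix parameter are assumptions, not part of PySpark.

```python
def print_on_driver(rdd, prefix=""):
    """Print every element of an RDD from the driver.

    rdd is expected to be a pyspark RDD; only toLocalIterator() is used.
    print_on_driver is a hypothetical helper, not a PySpark function.
    """
    count = 0
    # toLocalIterator() yields the elements partition by partition, so the
    # driver needs memory for the largest partition, not for the whole RDD
    for element in rdd.toLocalIterator():
        print(prefix + str(element))
        count += 1
    # report how many elements were printed
    return count
```

With the rdd from the example above, print_on_driver(rdd, "* ") would print each line of the file on the driver, prefixed with "* ".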

Conclusion

In this Spark tutorial, we learned how to print the elements of an RDD using the collect and foreach actions, with examples in Java and Python.