Read multiple text files to a single RDD

To read multiple text files into a single RDD in Spark, use the SparkContext.textFile() method.
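
In its general form, multiple comma-separated paths are passed as a single string. A minimal sketch, assuming an already created SparkContext sc and hypothetical paths:

lines = sc.textFile("path/to/first.txt,path/to/second.txt")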

In this tutorial, we shall look into examples addressing different scenarios of reading multiple text files into a single RDD.

Read Multiple Text Files to a Single RDD

In this example, we have three text files to read. We take the file paths of these three files as comma-separated values in a single string literal. Then, using the textFile() method, we read the content of all three text files into a single RDD.

First, we shall write this example in Java.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide text file paths to be read to RDD, separated by comma
		String files = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt";
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile(files);
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println(line);
		}
	}
}

Note : Take care when providing the input file paths. The paths should be separated only by commas, with no spaces in between.
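
If you build this string programmatically, a safe way to avoid stray spaces is to join a list of paths with a comma. A minimal Python sketch, assuming an existing SparkContext sc:

# join the paths with commas; no spaces around the separators
paths = ["data/rdd/input/file1.txt", "data/rdd/input/file2.txt", "data/rdd/input/file3.txt"]
lines = sc.textFile(",".join(paths))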

file1.txt

This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

file2.txt

This is File 2
Learn to read multiple text files to a single RDD

file3.txt

This is File 3
Learn to read multiple text files to a single RDD

Output

18/02/10 12:13:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/02/10 12:13:26 INFO DAGScheduler: ResultStage 0 (collect at FileToRddExample.java:21) finished in 0.613 s
18/02/10 12:13:26 INFO DAGScheduler: Job 0 finished: collect at FileToRddExample.java:21, took 0.888843 s
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
This is File 2
Learn to read multiple text files to a single RDD
This is File 3
Learn to read multiple text files to a single RDD
18/02/10 12:13:26 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 12:13:26 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 12:13:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

Now, we shall use Python, and read multiple text files into a single RDD using the textFile() method.

readToRdd.py

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read the given text files into a single RDD
  lines = sc.textFile("data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Run this Spark Application using spark-submit by executing the following command.

$ spark-submit readToRdd.py
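
If you prefer not to hard-code the paths, a small variation of this script can accept them as command-line arguments and join them with commas. The following is a sketch for illustration; readToRddArgs.py is a hypothetical name, not part of the original tutorial.

readToRddArgs.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # join all file paths passed on the command line into one comma-separated string
  lines = sc.textFile(",".join(sys.argv[1:]))

  # print every line read
  for line in lines.collect():
    print(line)

You would then run it as, for example:

$ spark-submit readToRddArgs.py data/rdd/input/file1.txt data/rdd/input/file2.txt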

Read all text files in a directory to a single RDD

Now, we shall write a Spark Application that reads all the text files in a given directory path into a single RDD.

Following is a Spark Application, written in Java, that reads the content of all the text files in a directory into an RDD.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide path to directory containing text files
		String files = "data/rdd/input";
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile(files);
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println(line);
		}
	}
}

In the above example, we have provided the directory path via the variable files.

All the text files inside the given directory path, data/rdd/input, shall be read into the lines RDD.
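
Note that textFile() does not tell you which file a given line came from. If you also need the file names, SparkContext.wholeTextFiles() reads each file as a single (filename, content) pair instead of line by line; it is best suited to directories of many small files. A minimal Python sketch, assuming an existing SparkContext sc:

# read each file in the directory as one (filename, content) record
pairs = sc.wholeTextFiles("data/rdd/input")

for filename, content in pairs.collect():
  print(filename)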

Now, we shall write a Spark Application to do the same job of reading data from all the text files in a directory into an RDD, but using the Python programming language.

readToRdd.py

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read all text files in the directory into a single RDD
  lines = sc.textFile("data/rdd/input")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Run the above Python Spark Application by executing the following command in a console.

$ spark-submit readToRdd.py

Read all text files in multiple directories to a single RDD

This is the next level to our previous scenarios. We have seen how to read multiple text files, or all the text files in a directory, into an RDD. Now, we shall read all the text files from not just one, but multiple directories, into a single RDD.

First, we shall write a Java application to read all text files in multiple directories.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide paths to directories containing text files, separated by comma
		String directories = "data/rdd/input,data/rdd/anotherFolder";
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile(directories);
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println(line);
		}
	}
}

All the text files in both directories, provided in the variable directories, shall be read into the RDD. Similarly, you may provide more than two directories.
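
An alternative, not used in this tutorial, is to read each directory into its own RDD and combine them with union(). A minimal Python sketch, assuming an existing SparkContext sc:

# read the directories separately, then merge into a single RDD
rdd1 = sc.textFile("data/rdd/input")
rdd2 = sc.textFile("data/rdd/anotherFolder")
lines = rdd1.union(rdd2)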

Let us write the same program in Python.

readToRdd.py

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read all text files in both directories into a single RDD
  lines = sc.textFile("data/rdd/input,data/rdd/anotherFolder")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

You may submit this Python application to Spark by running the following command.

$ spark-submit readToRdd.py

Read all text files matching a pattern to a single RDD

This scenario uses glob patterns (shell-style wildcards, not full regular expressions) to match file names. All the files that match the given pattern will be read into a single RDD.

Let us write a Java application to read only those files that match a given pattern.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide file path patterns, separated by comma
		String files = "data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*";
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile(files);
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println(line);
		}
	}
}

  • file[0-3].txt would match: file0.txt, file1.txt, file2.txt, file3.txt. Whichever of these files are present will be read into the RDD.
  • file* would match any file whose name starts with the string file, for example: file-hello.txt, file2.txt, filething.txt, etc.
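
Hadoop-style globbing, which Spark uses for these paths, also supports ? to match exactly one character. For example, the following sketch (assuming an existing SparkContext sc) would match file1.txt but not file10.txt:

# ? matches exactly one character in the file name
lines = sc.textFile("data/rdd/input/file?.txt")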

Following is a Python Application that reads into an RDD only those files whose names match a specific pattern.

readToRdd.py

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read all text files matching the patterns into a single RDD
  lines = sc.textFile("data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Conclusion

In this Spark Tutorial – Read multiple text files to a single RDD – we have covered different scenarios of reading multiple text files into a single RDD: providing comma-separated file paths, reading all the files in a directory, reading files from multiple directories, and reading files that match a glob pattern.