Read multiple text files to a single RDD
To read multiple text files into a single RDD in Spark, use the SparkContext.textFile() method.
In this tutorial, we shall look into examples addressing different scenarios of reading multiple text files into a single RDD.
- Read multiple text files to a single RDD [Java Example] [Python Example]
- Read all text files in a directory to a single RDD [Java Example] [Python Example]
- Read all text files in multiple directories to a single RDD [Java Example] [Python Example]
- Read all text files matching a pattern to a single RDD [Java Example] [Python Example]
Read Multiple Text Files to a Single RDD
In this example, we have three text files to read. We take the file paths of these three files as comma-separated values in a single string literal. Then, using the textFile() method, we read the content of all three text files into a single RDD.
First, we shall write this in Java.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide text file paths to be read to RDD, separated by comma
        String files = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
Note: Take care when providing the input file paths. The paths must be separated by commas only, with no spaces before or after the commas.
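If the list of file paths is built at runtime, joining them with str.join is a simple way to guarantee there are no stray spaces around the commas. A minimal sketch in Python, reusing the file names from this example:

```python
# Build a comma-separated path string with no spaces,
# suitable for passing to SparkContext.textFile()
paths = [
    "data/rdd/input/file1.txt",
    "data/rdd/input/file2.txt",
    "data/rdd/input/file3.txt",
]
files = ",".join(paths)
print(files)
# data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt
```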
file1.txt
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
file2.txt
This is File 2
Learn to read multiple text files to a single RDD
file3.txt
This is File 3
Learn to read multiple text files to a single RDD
Output
18/02/10 12:13:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/10 12:13:26 INFO DAGScheduler: ResultStage 0 (collect at FileToRddExample.java:21) finished in 0.613 s
18/02/10 12:13:26 INFO DAGScheduler: Job 0 finished: collect at FileToRddExample.java:21, took 0.888843 s
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
This is File 2
Learn to read multiple text files to a single RDD
This is File 3
Learn to read multiple text files to a single RDD
18/02/10 12:13:26 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 12:13:26 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 12:13:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
Now, we shall use Python and read multiple text files into a single RDD using the textFile() method.
readToRdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files to RDD
    lines = sc.textFile("data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
Run this Spark Application using spark-submit by executing the following command.
$ spark-submit readToRdd.py
Note: Take care when providing the input file paths. The paths must be separated by commas only, with no spaces before or after the commas.
Read all text files in a directory to a single RDD
Now, we shall write a Spark application that reads all the text files in a given directory into a single RDD.
Following is a Spark application, written in Java, that reads the content of all text files in a directory into an RDD.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide path to directory containing text files
        String files = "data/rdd/input";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
In the above example, we have provided the directory path via the variable files. All the text files inside the given directory path, data/rdd/input, shall be read into the lines RDD.
Now, we shall write a Spark application that does the same job of reading data from all text files in a directory into an RDD, but in Python.
readToRdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read all input text files present in the directory to RDD
    lines = sc.textFile("data/rdd/input")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
Run the above Python Spark application by executing the following command in a console.
$ spark-submit readToRdd.py
Read all text files in multiple directories to a single RDD
This builds on the previous scenarios. We have seen how to read multiple text files, or all text files in a directory, into an RDD. Now, we shall learn how to read all the text files in not one, but multiple directories.
First, we shall write a Java application to read all text files in multiple directories.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide paths to directories containing text files, separated by comma
        String directories = "data/rdd/input,data/rdd/anotherFolder";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(directories);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
All the text files in both directories, provided via the variable directories, shall be read into the RDD. Similarly, you may provide more than two directories.
Let us write the same program in Python.
readToRdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files present in both directories to RDD
    lines = sc.textFile("data/rdd/input,data/rdd/anotherFolder")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
You may submit this Python application to Spark by running the following command.
$ spark-submit readToRdd.py
Read all text files matching a pattern to a single RDD
This scenario uses a glob pattern (not a full regular expression) to match file names. All files that match the given pattern will be read into the RDD.
Let us write a Java application that reads only the files matching a given pattern.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide file paths with glob patterns, separated by comma
        String files = "data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*";

        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
- file[0-3].txt would match: file0.txt, file1.txt, file2.txt, file3.txt. Any of these files that are present would be read into the RDD.
- file* would match files whose names start with the string file, for example: file-hello.txt, file2.txt, filething.txt, etc.
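The matching itself is done by Hadoop's glob support, but as a rough local approximation, Python's fnmatch module implements very similar [..] and * semantics. The following sketch is illustrative only (it checks names locally, not through Spark):

```python
from fnmatch import fnmatch

# [0-3] matches a single character in the range 0..3
print(fnmatch("file2.txt", "file[0-3].txt"))  # True
print(fnmatch("file5.txt", "file[0-3].txt"))  # False

# * matches any (possibly empty) sequence of characters
print(fnmatch("file-hello.txt", "file*"))     # True
print(fnmatch("myfile.txt", "file*"))         # False
```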
Following is a Python application that reads into an RDD the files whose names match a specific pattern.
readToRdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)

    # read input text files matching the given patterns to RDD
    lines = sc.textFile("data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*")

    # collect the RDD to a list
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)
Conclusion
In this Spark tutorial on reading multiple text files into a single RDD, we have covered different scenarios: explicit comma-separated file paths, all files in a directory, all files in multiple directories, and files matching a pattern.