Spark – Create RDD

An RDD can be created in Apache Spark in several ways, including:

  1. Create an RDD from a List<T> using parallelize().
  2. Create an RDD from a text file.
  3. Create an RDD from a JSON file.

This tutorial walks through an example of each approach.

Example – Create RDD from List<T>

In this example, we will take a List of strings, and then create a Spark RDD from this list.

RDDfromList.java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDfromList {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Create RDD from List")
				.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// distribute the list as an RDD with a single partition
		List<String> data = Arrays.asList("Learn","Apache","Spark","with","Tutorial Kart"); 
		JavaRDD<String> items = sc.parallelize(data,1);

		// apply a function for each element of RDD
		items.foreach(item -> {
			System.out.println("* "+item);
		});

		// stop the spark context
		sc.close();
	}
}
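
The second argument to sc.parallelize(data, 1) is the number of partitions (slices) the list is divided into. The plain-Java sketch below does not use Spark; it only illustrates one way a list could be cut into contiguous, roughly equal slices. The class and method names (SliceSketch, slice) are hypothetical, and the index arithmetic is an assumption for illustration rather than a statement of how Spark is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SliceSketch {

    // Split a list into numSlices contiguous, roughly equal sublists.
    // Assumption: even range splitting; shown for illustration only.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be at least 1");
        }
        List<List<T>> slices = new ArrayList<>();
        long length = data.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((i * length) / numSlices);
            int end = (int) (((i + 1) * length) / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
        // prints [[Learn, Apache, Spark, with, Tutorial Kart]]
        System.out.println(slice(data, 1));
        // prints [[Learn, Apache], [Spark, with, Tutorial Kart]]
        System.out.println(slice(data, 2));
    }
}
```

With one slice, as in the example above, all five strings end up in a single partition; a larger value spreads them across more partitions that can be processed in parallel.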

Example – Create RDD from Text file

In this example, the data is in a text file, and we create an RDD from it. Each line of the file becomes one element of the RDD.

ReadTextToRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
				.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide path to input text file
		String path = "data/rdd/input/sample.txt";
		
		// read text file to RDD
		JavaRDD<String> lines = sc.textFile(path);
		
		// collect the RDD to the driver for printing
		for(String line:lines.collect()){
			System.out.println(line);
		}

		// stop the spark context
		sc.close();
	}
}
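
The program above expects an input file to exist at the given path. As a minimal sketch, the plain-Java helper below (no Spark involved) writes a small text file and reads it back line by line, which mirrors the one-element-per-line shape of the RDD that textFile() produces. The class name SampleTextFile, the temporary directory, and the three sample lines are placeholders for illustration, not the tutorial's actual data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class SampleTextFile {

    // Write the given lines to dir/sample.txt, one per line, and return the file path.
    static Path write(Path dir, List<String> lines) throws IOException {
        Files.createDirectories(dir);
        Path file = dir.resolve("sample.txt");
        return Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder location and content; point Spark at whatever path you actually use
        Path dir = Files.createTempDirectory("rdd-input");
        Path file = write(dir, Arrays.asList("first line", "second line", "third line"));

        // Each line read here corresponds to one element of the RDD
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

Note that collect() brings every element back to the driver, so printing this way is best kept to small files like this one.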

Example – Create RDD from JSON file

In this example, we will create an RDD from a JSON file. The file is first read into a Dataset<Row>, which is then converted to a JavaRDD<Row>.

JSONtoRDD.java

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JSONtoRDD {

	public static void main(String[] args) {
		// create a spark session
		SparkSession spark = SparkSession
				.builder()
				.appName("Spark Example - Read JSON to RDD")
				.master("local[2]")
				.getOrCreate();

		// read JSON file to RDD
		String jsonPath = "data/employees.json";
		JavaRDD<Row> items = spark.read().json(jsonPath).toJavaRDD();

		// print each row of the RDD
		items.foreach(item -> {
			System.out.println(item);
		});

		// stop the spark session
		spark.stop();
	}
}
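
By default, spark.read().json() expects one complete JSON object per line (the JSON Lines layout) rather than a single pretty-printed document. The plain-Java sketch below is a rough pre-flight check for that layout. It is a heuristic only and does not parse the JSON; the class name JsonLinesCheck and the sample records are hypothetical and are not the contents of the tutorial's employees.json.

```java
import java.util.Arrays;
import java.util.List;

class JsonLinesCheck {

    // Heuristic only: true when there is at least one non-blank line and
    // every non-blank line starts with '{' and ends with '}'.
    // This does not validate the JSON itself.
    static boolean looksLikeJsonLines(List<String> lines) {
        boolean sawRecord = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (!line.startsWith("{") || !line.endsWith("}")) {
                return false;
            }
            sawRecord = true;
        }
        return sawRecord;
    }

    public static void main(String[] args) {
        // Hypothetical records; the fields are placeholders
        List<String> oneObjectPerLine = Arrays.asList(
                "{\"name\":\"Alice\",\"salary\":3000}",
                "{\"name\":\"Bob\",\"salary\":4000}");
        List<String> prettyPrinted = Arrays.asList(
                "{",
                "  \"name\": \"Alice\",",
                "  \"salary\": 3000",
                "}");
        System.out.println(looksLikeJsonLines(oneObjectPerLine)); // prints true
        System.out.println(looksLikeJsonLines(prettyPrinted));    // prints false
    }
}
```

If the input is a single multi-line JSON document instead, the reader's multiLine option, as in spark.read().option("multiLine", true).json(jsonPath), is the usual way to load it.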

Conclusion

In this Spark Tutorial, we have learnt to create a Spark RDD from a List, from a text file, and from a JSON file on the file system, with the help of example programs.