Spark Create RDD
There are several ways to create an RDD in Apache Spark. Some of them are:
- Create an RDD from a List<T> using Spark parallelize.
- Create an RDD from a text file.
- Create an RDD from a JSON file.
In this tutorial, we will go through examples covering each of the above-mentioned processes.
Example: Create RDD from List<T>
In this example, we will take a List of strings, and then create a Spark RDD from this list.
RDDfromList.java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDfromList {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Spark RDD foreach Example")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // read list to RDD
        List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
        JavaRDD<String> items = sc.parallelize(data, 1);

        // apply a function for each element of RDD
        items.foreach(item -> {
            System.out.println("* " + item);
        });

        // stop the spark context
        sc.close();
    }
}
Example: Create RDD from Text File
In this example, we have the data in a text file, and we will create an RDD from this text file.
ReadTextToRDD.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide path to input text file
        String path = "data/rdd/input/sample.txt";

        // read text file to RDD
        JavaRDD<String> lines = sc.textFile(path);

        // collect RDD to the driver for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }

        // stop the spark context
        sc.close();
    }
}
Example: Create RDD from JSON File
In this example, we will create an RDD from a JSON file.
JSONtoRDD.java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JSONtoRDD {
    public static void main(String[] args) {
        // configure spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Spark Example - Read JSON to RDD")
                .master("local[2]")
                .getOrCreate();

        // provide path to input JSON file
        String jsonPath = "data/employees.json";

        // read JSON file to a Dataset, then convert it to a JavaRDD of Rows
        JavaRDD<Row> items = spark.read().json(jsonPath).toJavaRDD();

        // print each Row of the RDD
        items.foreach(item -> {
            System.out.println(item);
        });

        // stop the spark session
        spark.stop();
    }
}
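Note that, by default, spark.read().json() expects the input in JSON Lines format, i.e. one complete JSON object per line. The actual contents of data/employees.json are not shown in this tutorial; the records below are only an assumed illustration of what such a file might look like:

```json
{"name":"Michael","salary":3000}
{"name":"Andy","salary":4500}
{"name":"Justin","salary":3500}
```

If your file instead contains a single pretty-printed JSON document spanning multiple lines, enable the multiLine option: spark.read().option("multiLine", true).json(jsonPath).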
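To compile and run these examples, the Spark libraries must be on the classpath. A minimal sketch of the Maven dependencies, assuming Spark 3.x built for Scala 2.12 (the version numbers are assumptions; match them to your installation). The first two examples need only spark-core, while the JSON example also needs spark-sql for SparkSession:

```xml
<!-- assumed versions; adjust to your Spark installation -->
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.5.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.5.1</version>
    </dependency>
</dependencies>
```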
Conclusion
In this Spark tutorial, we have learned to create a Spark RDD from a List, from a text file, and from a JSON file, with the help of example programs.