Spark – Create RDD
There are several ways to create an RDD in Apache Spark:
- Create an RDD from a List<T> using parallelize().
- Create an RDD from a text file.
- Create an RDD from a JSON file.
In this tutorial, we will go through examples covering each of the processes mentioned above.
Example – Create RDD from List<T>
In this example, we will take a List of strings and create a Spark RDD from it.
RDDfromList.java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDfromList {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Spark Create RDD from List")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // read list to RDD
        List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
        JavaRDD<String> items = sc.parallelize(data, 1);

        // apply a function for each element of RDD
        items.foreach(item -> {
            System.out.println("* " + item);
        });
    }
}
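The second argument to parallelize() controls how many partitions the RDD is split into. The sketch below is a minimal variation on the example above; the class name, list contents, and partition count of 3 are arbitrary illustrative choices, and it assumes a Spark runtime is on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDPartitions {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("RDD Partitions Example")
                .setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // an arbitrary sample list (illustrative data)
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

        // request 3 partitions instead of the default
        JavaRDD<Integer> numbers = sc.parallelize(data, 3);

        // getNumPartitions() reports how the RDD was split
        System.out.println("partitions: " + numbers.getNumPartitions());

        sc.close();
    }
}
```

Picking more partitions increases parallelism, since Spark schedules one task per partition of the RDD.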
Example – Create RDD from Text file
In this example, we have data in a text file and will create an RDD from it.
ReadTextToRDD.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
                .setMaster("local[2]").set("spark.executor.memory", "2g");

        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // provide path to input text file
        String path = "data/rdd/input/sample.txt";

        // read text file to RDD
        JavaRDD<String> lines = sc.textFile(path);

        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
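Once a text file is loaded as an RDD of lines, transformations can be chained on it before an action triggers the computation. The sketch below splits each line into words and counts them; the class name is illustrative, it reuses the same input path as the example above, and it assumes a Spark runtime is on the classpath.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CountWordsInText {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Count Words in Text")
                .setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // read text file to RDD of lines (same input as the example above)
        JavaRDD<String> lines = sc.textFile("data/rdd/input/sample.txt");

        // split each line on whitespace into an RDD of words
        JavaRDD<String> words = lines.flatMap(
                line -> Arrays.asList(line.split("\\s+")).iterator());

        // count() is an action, so it triggers the actual computation
        System.out.println("word count: " + words.count());

        sc.close();
    }
}
```

Note that textFile() and flatMap() are lazy; nothing is read from disk until the count() action runs.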
Example – Create RDD from JSON file
In this example, we will create an RDD from a JSON file.
JSONtoRDD.java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JSONtoRDD {
    public static void main(String[] args) {
        // configure spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Spark Example - Read JSON to RDD")
                .master("local[2]")
                .getOrCreate();

        // provide path to input JSON file
        String jsonPath = "data/employees.json";

        // read JSON file to a DataFrame, then convert it to an RDD of Rows
        JavaRDD<Row> items = spark.read().json(jsonPath).toJavaRDD();

        // print each Row of the RDD
        items.foreach(item -> {
            System.out.println(item);
        });
    }
}
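Each element of the resulting RDD is a Row, whose fields can be looked up by column name with getAs(). The sketch below pulls a single field out of each Row; the "name" field is an assumed column of the JSON records (the actual schema depends on the contents of the file), and a Spark runtime is assumed on the classpath.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JSONFieldsFromRDD {
    public static void main(String[] args) {
        // configure spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Read JSON Fields from RDD")
                .master("local[2]")
                .getOrCreate();

        // read JSON file to an RDD of Rows (same input as the example above)
        JavaRDD<Row> rows = spark.read().json("data/employees.json").toJavaRDD();

        // Row.getAs(columnName) looks a field up by name;
        // "name" is an assumed field of the JSON records
        JavaRDD<String> names = rows.map(row -> (String) row.getAs("name"));

        // print each extracted field value
        names.foreach(name -> System.out.println(name));

        spark.stop();
    }
}
```

The schema itself is inferred by spark.read().json() from the input data and can be inspected with printSchema() on the DataFrame before converting it to an RDD.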
Conclusion
In this Spark tutorial, we have learnt to create a Spark RDD from a List, and by reading text and JSON files from the file system, with the help of example programs.