Spark RDD with custom class objects

To create a Spark RDD whose elements are objects of a custom class, make the class implement the Serializable interface, build an immutable list of its objects, and then parallelize that list with SparkContext. The parallelize call returns an RDD that has the custom class objects as its elements.
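The class has to be Serializable because Spark moves RDD elements around in serialized form, and with default settings it uses Java serialization to do so. The following is a minimal sketch, using only the JDK and no Spark, of the round trip that a Serializable class has to survive. The SerializationCheck class and its roundTrip helper are hypothetical names introduced for this illustration; they are not part of Spark or of the example program in this tutorial.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper class for illustration; not part of Spark.
class SerializationCheck {

	// Same shape as the Person class used in this tutorial.
	static class Person implements Serializable {
		private static final long serialVersionUID = 1L;
		final String name;
		final int age;

		Person(String name, int age) {
			this.name = name;
			this.age = age;
		}
	}

	// Writes the object to a byte array with Java serialization and reads it back.
	static <T extends Serializable> T roundTrip(T value) {
		try {
			ByteArrayOutputStream bytes = new ByteArrayOutputStream();
			try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
				out.writeObject(value);
			}
			try (ObjectInputStream in = new ObjectInputStream(
					new ByteArrayInputStream(bytes.toByteArray()))) {
				@SuppressWarnings("unchecked")
				T copy = (T) in.readObject();
				return copy;
			}
		} catch (IOException | ClassNotFoundException e) {
			// A class that does not implement Serializable would end up here.
			throw new IllegalStateException("Object could not be serialized", e);
		}
	}

	public static void main(String[] args) {
		Person copy = roundTrip(new Person("Arjun", 25));
		System.out.println(copy.name + " " + copy.age); // prints "Arjun 25"
	}
}
```

If Person did not implement Serializable, this sketch would not even compile, because roundTrip only accepts Serializable values. Spark has no such compile-time check, so the equivalent mistake there would only surface as a serialization error at run time.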

Java Example

The following example demonstrates creating an RDD from a list of custom class objects.

CustomObjectsRDD.java

import java.io.Serializable;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.google.common.collect.ImmutableList;

public class CustomObjectsRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// prepare list of objects
		List<Person> personList = ImmutableList.of(
			    new Person("Arjun", 25),
			    new Person("Akhil", 2));
		
		// parallelize the list using SparkContext
		JavaRDD<Person> perJavaRDD = sc.parallelize(personList);
		
		// collect the RDD elements to the driver and print each person's name
		for(Person person : perJavaRDD.collect()){
			System.out.println(person.name);
		}
		
		sc.close();
	}
}

class Person implements Serializable{
	private static final long serialVersionUID = -2685444218382696366L;
	String name;
	int age;
	public Person(String name, int age){
		this.name = name;
		this.age = age;
	}
}

Output

18/02/10 21:59:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/02/10 21:59:10 INFO DAGScheduler: ResultStage 0 (collect at CustomObjectsRDD.java:29) finished in 0.223 s
18/02/10 21:59:10 INFO DAGScheduler: Job 0 finished: collect at CustomObjectsRDD.java:29, took 0.661038 s
Arjun
Akhil
18/02/10 21:59:10 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 21:59:10 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 21:59:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
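
The example builds its list with ImmutableList from Guava. If Guava is not on the classpath, a read-only list can also be built with the JDK alone. The following is a minimal sketch of that alternative; the JdkImmutableList class and its people method are hypothetical names used only for this illustration.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class for illustration only.
class JdkImmutableList {

	// Same shape as the Person class used in this tutorial.
	static class Person implements Serializable {
		private static final long serialVersionUID = 1L;
		final String name;
		final int age;

		Person(String name, int age) {
			this.name = name;
			this.age = age;
		}
	}

	// Builds a read-only list of Person objects without Guava.
	static List<Person> people() {
		return Collections.unmodifiableList(Arrays.asList(
				new Person("Arjun", 25),
				new Person("Akhil", 2)));
	}

	public static void main(String[] args) {
		List<Person> personList = people();
		try {
			// Any attempt to modify the list is rejected.
			personList.add(new Person("Extra", 1));
		} catch (UnsupportedOperationException e) {
			System.out.println("list is read-only");
		}
		System.out.println(personList.size()); // prints 2
	}
}
```

A list built this way is a plain java.util.List, so it could be passed to sc.parallelize(personList) in the same way as the Guava list in the example.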

Conclusion

In this Spark tutorial, Spark RDD with custom class objects, we have learned to initialize an RDD from an immutable list of custom objects using SparkContext.parallelize(), with the help of an example.