Apache Spark RDD (Resilient Distributed Dataset)

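In Apache Spark, an RDD (Resilient Distributed Dataset) is an immutable, fault-tolerant collection of elements that is partitioned across the nodes of a cluster, so that it can be processed in parallel and kept in memory.

An RDD can contain objects of any type, including user-defined classes.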

Spark RDD Operations

RDDs support two types of operations:

  1. Transformations: Create a new RDD from an existing RDD.
  2. Actions: Run a computation or aggregation on the RDD and return a value to the driver program.

Transformations are lazy

In Spark, transformations are lazy: they are not executed at the point where they are defined. For each RDD, Spark only records the transformations that were applied to it. When an action is encountered, Spark runs the transformations needed to produce that action's result, followed by the action itself.
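The idea of lazy transformations can be sketched in plain Python. This is a toy model, not Spark's API: the class LazyDataset and the helper double are hypothetical names made up for illustration, and the data is an ordinary list rather than a distributed collection. It only shows the pattern of recording transformations and running them when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, source, steps=()):
        self._source = source  # input data, a plain list in this sketch
        self._steps = steps    # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, nothing is computed here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
doubled = numbers.map(double)               # recorded, not run
large = doubled.filter(lambda x: x > 4)     # recorded, not run
calls_before_action = len(calls)            # still 0: nothing has executed
result = large.collect()                    # action triggers the work
```

Defining doubled and large does not call double at all. The function runs only when collect() is called, which mirrors how an action triggers the recorded transformations.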

RDDs are fault tolerant

In Spark, data is stored in RDDs. Transformations applied to an RDD produce new RDDs, while actions return values to the driver program. Unlike disk-based big data frameworks, Spark does not need to write the intermediate data of each step to disk storage. Instead, it keeps a record of the transformations an RDD has undergone, known as its lineage. If a node goes down, Spark knows what has to be done with the input data and can recompute the lost part of the RDD, which is what makes RDDs fault tolerant. Because of this, Spark avoids reads and writes to persistent data storage that are costly in time and resources.
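Recovery from recorded transformations can be illustrated with another plain-Python toy model. Again, this is not Spark's API or its actual recovery mechanism: LineageDataset, compute_partition and lose_partition are hypothetical names, the "partitions" are plain lists, and the sketch assumes the original input data is still available after the simulated failure.

```python
class LineageDataset:
    """Toy model of lineage-based recovery (hypothetical, not part of Spark)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # input data, split into partitions
        self._lineage = lineage        # recorded transformations, in order
        self._memory = {}              # partition index -> computed rows

    def map(self, fn):
        # Transformation: record the function, compute nothing yet.
        return LineageDataset(self._partitions, self._lineage + (fn,))

    def compute_partition(self, index):
        # Rebuild a partition from the input data if it is not in memory.
        if index not in self._memory:
            rows = self._partitions[index]
            for fn in self._lineage:
                rows = [fn(row) for row in rows]
            self._memory[index] = rows
        return self._memory[index]

    def lose_partition(self, index):
        # Simulate a node failure: the in-memory result disappears.
        self._memory.pop(index, None)

    def collect(self):
        # Action: gather every partition, recomputing any that are missing.
        out = []
        for index in range(len(self._partitions)):
            out.extend(self.compute_partition(index))
        return out


dataset = (
    LineageDataset([[1, 2], [3, 4]])
    .map(lambda x: x + 1)
    .map(lambda x: x * 10)
)
before_failure = dataset.collect()
dataset.lose_partition(1)          # pretend the node holding partition 1 died
after_failure = dataset.collect()  # partition 1 is rebuilt from the lineage
```

Only the lost partition is recomputed, and the intermediate results were never written to disk. The recorded list of functions is enough to rebuild them from the input.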


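In this Spark tutorial, we learned what an RDD is, the two types of RDD operations, why transformations are lazy, and how RDDs achieve fault tolerance.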