Spark RDD distinct() to get unique elements

Spark RDD distinct() is used to remove duplicate elements from an RDD and return another RDD that contains only unique values. If the same element appears many times in the input RDD, it appears only once in the result RDD.

In this tutorial, we will learn how to get distinct elements from an Apache Spark RDD using Java, Scala, and Python examples. We will also look at how distinct() behaves with partitions, how to collect unique values to a list, and when to use DataFrame distinct() instead of RDD distinct().

Syntax of Spark RDD distinct()

To get distinct elements of an RDD, call distinct() on the RDD. The method returns a new RDD with duplicate elements removed.

val distinctRdd = rdd.distinct()

In PySpark, the syntax is similar.

distinct_rdd = rdd.distinct()

You can also pass the number of partitions for the resulting RDD. This is useful when the distinct operation processes a large dataset and you want to control the number of reduce-side partitions.

distinct_rdd = rdd.distinct(numPartitions=4)

distinct() is a transformation. It is evaluated only when an action such as collect(), count(), foreach(), or saveAsTextFile() is called.
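
The lazy behaviour described above can be pictured with a small plain-Python toy. It runs without Spark and is only an illustration: the class name `LazyDistinct` and its `evaluations` counter are hypothetical and are not part of any Spark API.

```python
class LazyDistinct:
    """Toy stand-in for a lazy transformation (not a Spark class)."""

    def __init__(self, source):
        self.source = source
        self.evaluations = 0  # how many times the work has actually run

    def collect(self):
        # The "action": deduplication happens only when this is called.
        self.evaluations += 1
        # Sorted here only so the toy prints a stable result.
        return sorted(set(self.source))


words = ["Learn", "Apache", "Spark", "Learn", "Spark"]
pending = LazyDistinct(words)   # comparable to calling distinct(): nothing runs yet
print(pending.evaluations)      # 0
print(pending.collect())        # ['Apache', 'Learn', 'Spark']
print(pending.evaluations)      # 1
```

Creating the object does no work; the counter changes only after `collect()` is called, which mirrors how a transformation waits for an action.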

How RDD distinct() removes duplicate values in Spark

For an RDD of simple values such as strings, numbers, or tuples, Spark compares the complete element. If two elements are equal, only one of them is kept in the distinct RDD.

  • Input RDD: ["Learn", "Apache", "Spark", "Learn", "Spark"]
  • Distinct RDD: ["Learn", "Apache", "Spark"]

The order of the output is not guaranteed. Since distinct() may involve a shuffle across partitions, the printed result can appear in a different order from the input list.
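
To see why the output order can differ from the input, the following plain-Python sketch models the idea behind a shuffle-based distinct: elements are routed to buckets by a hash of their value, so equal elements meet in the same bucket, where duplicates are dropped. It runs without Spark and is only a conceptual model. The function name `model_distinct`, the use of `zlib.crc32` as the hash, and the bucket count are assumptions made for illustration, not Spark's actual implementation.

```python
import zlib


def model_distinct(partitions, num_buckets=2):
    """Conceptual model of a shuffle-based distinct (not Spark code)."""
    buckets = [[] for _ in range(num_buckets)]
    for partition in partitions:
        for element in partition:
            # Route by a hash of the value so equal elements share a bucket.
            index = zlib.crc32(element.encode("utf-8")) % num_buckets
            # Keep only the first copy seen in that bucket.
            if element not in buckets[index]:
                buckets[index].append(element)
    return buckets


# The input list from above, split across two pretend partitions.
input_partitions = [["Learn", "Apache", "Spark"], ["Learn", "Spark"]]
result_buckets = model_distinct(input_partitions)

# Reading the buckets one after another gives the unique values,
# but not necessarily in the original input order.
unique_values = [value for bucket in result_buckets for value in bucket]
print(unique_values)
```

Because the final order depends on which bucket each value lands in, it reflects the hash routing rather than the input sequence.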

Spark RDD distinct() Java example

In this example, we create an RDD from a list of strings and get its distinct elements using the RDD.distinct() method.

DistinctRDD.java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistinctRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Spark RDD Distinct")
				.setMaster("local[2]")
				.set("spark.executor.memory", "2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// read list to RDD
		List<String> data = Arrays.asList("Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions");
		JavaRDD<String> words = sc.parallelize(data, 1);

		// get distinct elements of RDD
		JavaRDD<String> rddDistinct = words.distinct();

		// print
		rddDistinct.foreach(item -> {
			System.out.println(item);
		});
		
		sc.close();
	}
}

Output

Functions
Spark
Learn
Apache
RDD

The output order can vary when you run the program. If you need a predictable display order for learning or testing, collect the result and sort it before printing. Do not use collect() on very large RDDs because it brings all data to the driver.

Spark RDD distinct() Scala example

In the following example, we find the distinct elements of an RDD using RDD.distinct() in Scala.

RDDdistinct.scala

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object RDDdistinct {
	def main(args: Array[String]): Unit = {

		/* configure spark application */
		val conf = new SparkConf().setAppName("Spark RDD Distinct Example").setMaster("local[1]")

		/* spark context */
		val sc = new SparkContext(conf)

		/* create an RDD from a sequence of strings */
		val rdd = sc.parallelize(Seq("Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions"))

		/* get distinct elements of the RDD */
		val rddDist = rdd.distinct()

		/* print */
		rddDist.collect().foreach(println)

		/* or save the output; Spark writes a directory named out.txt and fails if it already exists */
		rddDist.saveAsTextFile("out.txt")

		sc.stop()
	}
}

In the Scala example, rdd.distinct() creates a new RDD. The original rdd is not modified because Spark RDDs are immutable.

PySpark RDD distinct() example for unique values

The following PySpark example creates an RDD from a Python list and uses distinct() to remove repeated values.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("PySpark RDD Distinct Example").setMaster("local[2]")
sc = SparkContext(conf=conf)

data = ["Learn", "Apache", "Spark", "Learn", "Spark", "RDD", "Functions"]
words = sc.parallelize(data)

unique_words = words.distinct()

for word in unique_words.collect():
    print(word)

sc.stop()

A possible output is shown below. The order may be different on your system.

Apache
Functions
Learn
RDD
Spark

Get unique values from a PySpark RDD as a list

To get unique values as a Python list, call collect() after distinct(). This is suitable only when the distinct result is small enough to fit in driver memory.

unique_words_list = words.distinct().collect()
print(unique_words_list)
# Possible output; the order may vary:
# ['Apache', 'Functions', 'Learn', 'RDD', 'Spark']

If you need the list in sorted order, sort after collecting a small result.

unique_words_list = sorted(words.distinct().collect())
print(unique_words_list)

Get distinct keys from a pair RDD in PySpark

For a pair RDD, each element is usually a key-value tuple. To get distinct keys, map each tuple to its key and then call distinct().

pairs = sc.parallelize([
    ("A", 10),
    ("B", 20),
    ("A", 30),
    ("C", 40),
    ("B", 50)
])

unique_keys = pairs.keys().distinct().collect()
print(sorted(unique_keys))
# Output: ['A', 'B', 'C']

If you call distinct() directly on the pair RDD, Spark compares the complete tuple. For example, ("A", 10) and ("A", 30) are different elements because their values are different.
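
The difference between distinct tuples and distinct keys can be checked without Spark, because plain Python sets apply the same whole-value equality to tuples. The sketch below is an illustration only and uses ordinary Python collections, not RDDs.

```python
pairs = [("A", 10), ("B", 20), ("A", 30), ("C", 40), ("B", 50)]

# Deduplicating whole tuples: every pair here is different, so all five remain.
unique_pairs = set(pairs)

# Deduplicating keys only: project each pair to its key first.
unique_keys = {key for key, _ in pairs}

print(len(unique_pairs))    # 5
print(sorted(unique_keys))  # ['A', 'B', 'C']
```

Projecting to the key before deduplicating is what reduces five pairs to three keys.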

Spark RDD distinct() with numPartitions

The distinct() operation can use a shuffle to group equal values across partitions. For larger RDDs, you may specify the number of partitions for the resulting RDD.

val distinctRdd = rdd.distinct(4)

In PySpark:

distinct_rdd = rdd.distinct(numPartitions=4)

Choosing the partition count depends on the amount of data, cluster resources, and downstream operations. Too few partitions can create large tasks, while too many partitions can add scheduling overhead.
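
As a rough way to think about this trade-off, the plain-Python sketch below counts how many distinct values a simple hash router would send to each partition for different partition counts. It is a model under stated assumptions, not Spark code: the helper `partition_sizes`, the `zlib.crc32` hash, and the sample `user-N` values are all hypothetical.

```python
import zlib


def partition_sizes(values, num_partitions):
    """Count the distinct values routed to each partition by a simple hash."""
    sizes = [0] * num_partitions
    for value in set(values):
        index = zlib.crc32(str(value).encode("utf-8")) % num_partitions
        sizes[index] += 1
    return sizes


# 5000 elements containing 1000 distinct values.
values = [f"user-{i % 1000}" for i in range(5000)]

for num_partitions in (1, 4, 16):
    sizes = partition_sizes(values, num_partitions)
    # Fewer partitions means each one holds more of the distinct values.
    print(num_partitions, max(sizes))
```

With one partition a single task holds every distinct value, while higher counts spread the same values over more, smaller partitions.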

RDD distinct() versus DataFrame distinct() in Spark

Use RDD distinct() when you are already working with low-level RDD elements. Use DataFrame distinct() when your data is structured in rows and columns.

Requirement | Recommended API | Example
Remove duplicate RDD elements | RDD distinct() | rdd.distinct()
Remove duplicate DataFrame rows | DataFrame distinct() | df.distinct()
Get unique values from one DataFrame column | select() with distinct() | df.select("name").distinct()
Drop duplicates based on selected columns | DataFrame dropDuplicates() | df.dropDuplicates(["id"])

For example, to get unique values from a PySpark DataFrame column, use the DataFrame API instead of converting the data to an RDD.

unique_names = df.select("name").distinct()
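
The difference between deduplicating whole rows and deduplicating on selected columns can be sketched in plain Python. This is a model of the semantics only, not the DataFrame API: the helpers `model_distinct_rows` and `model_drop_duplicates` and the sample rows are hypothetical, and the model keeps the first row it meets for each key, whereas Spark does not promise which row survives.

```python
rows = [
    {"id": 1, "name": "Ann"},
    {"id": 1, "name": "Ann"},   # exact duplicate row
    {"id": 1, "name": "Anna"},  # same id, different name
    {"id": 2, "name": "Bob"},
]


def model_distinct_rows(rows):
    """Keep one copy of each fully identical row."""
    seen, result = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


def model_drop_duplicates(rows, subset):
    """Keep one row per combination of the columns listed in subset."""
    seen, result = set(), []
    for row in rows:
        key = tuple(row[column] for column in subset)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


print(len(model_distinct_rows(rows)))            # 3
print(len(model_drop_duplicates(rows, ["id"])))  # 2
```

Whole-row deduplication removes only the exact duplicate, while deduplicating on `id` also collapses the two rows that share an id but differ in name.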

Common mistakes when using Spark RDD distinct()

  • Expecting sorted output: distinct() removes duplicates, but it does not sort the result.
  • Calling collect() on large results: Use collect() only when the result can fit safely in driver memory.
  • Using RDD distinct() for DataFrame columns: For structured data, prefer df.select("column").distinct().
  • Applying distinct() to pair RDDs without mapping keys: Use keys().distinct() when you need only unique keys.
  • Assuming the original RDD changes: RDD transformations return new RDDs; they do not modify the existing RDD.

FAQs on Spark RDD distinct()

What does distinct() do in Spark RDD?

distinct() removes duplicate elements from an RDD and returns a new RDD that contains only unique elements. The original RDD is not changed.

How do I get unique values in PySpark RDD?

Use rdd.distinct() to create an RDD with unique values. If the result is small, use rdd.distinct().collect() to bring the unique values to a Python list.

How do I get distinct keys from a pair RDD?

Use pairRdd.keys().distinct(). Calling distinct() directly on a pair RDD compares the complete key-value tuple, not only the key.

Does Spark RDD distinct() keep the original order?

No. distinct() does not guarantee the original order of elements. If you need sorted output for a small result, collect the result and sort it, or use suitable Spark sorting operations for distributed data.

Should I use RDD distinct() or DataFrame distinct()?

Use RDD distinct() for low-level RDD data. Use DataFrame distinct() or dropDuplicates() when your data is in columns and rows.

Spark RDD distinct() key takeaways

In this tutorial on Spark RDD distinct(), we learned how to get unique elements from a Spark RDD using Java, Scala, and PySpark. The main point is simple: distinct() returns a new RDD with duplicate elements removed, and it does not guarantee the order of the result. For structured data, use the DataFrame API, and for large distributed data, avoid collecting the full result to the driver unless it is small enough.