Spark – RDD.filter()

Spark RDD Filter : the RDD.filter() method returns a new RDD containing only those elements that pass a filter condition (a function) given as an argument to the method. In this tutorial, we learn to filter an RDD of Integers and an RDD of Tuples, with example programs.

Steps to apply filter to Spark RDD

To apply a filter to a Spark RDD:

  1. Create a filter function to be applied on the RDD.
  2. Call RDD<T>.filter() with the filter function passed as an argument. The filter() method returns a new RDD<T> containing only the elements for which the function returns true, as shown in the sketch below.
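A minimal sketch of these two steps, assuming a JavaSparkContext named sc is already available and using a hypothetical RDD of integers:

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

// steps 1 and 2 combined: pass the filter function (here an inline lambda)
// to filter(), which returns a new RDD with only the matching elements
JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);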

Spark – RDD.filter() – Java Example

In this example, we will take an RDD of integers and filter it using the RDD.filter() method, keeping only the multiples of 3.

FilterRDD.java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class FilterRDD {

    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Spark RDD Filter")
                .setMaster("local[2]").set("spark.executor.memory", "2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // read list to RDD
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> rdd = sc.parallelize(data, 1);

        // filter : keep the numbers (rdd elements) exactly divisible by 3
        Function<Integer, Boolean> filter = k -> (k % 3 == 0);

        // apply filter on rdd with the filter function passed as argument
        JavaRDD<Integer> rddf = rdd.filter(filter);

        // print the filtered rdd
        rddf.foreach(element -> {
            System.out.println(element);
        });

        sc.close();
    }
}

Output

3
6
9
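Note that declaring a named Function variable is optional. Since JavaRDD.filter() accepts any Function<T, Boolean>, the same predicate can be passed inline as a lambda:

JavaRDD<Integer> rddf = rdd.filter(k -> k % 3 == 0);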

Spark – RDD.filter() – Filter RDD with Tuples

In this example, we will take an RDD with Tuples as elements. We will filter this RDD using the filter() method, based on the condition that the length of the string, which is the second element of the tuple, is equal to 5.

FilterRDD.java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

public class FilterRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Spark RDD filter")
				.setMaster("local[2]")
				.set("spark.executor.memory", "2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// read list to RDD
		List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
		JavaRDD<String> words = sc.parallelize(data, 1);

		// map each word to -> (word, word length)
		JavaPairRDD<String, Integer> wordsRDD = words.mapToPair(s -> new Tuple2<>(s, s.length()));

		// filter : where the second element in tuple is equal to 5. (i.e., word length == 5)
		Function<Tuple2<String, Integer>, Boolean> filterFunction = w -> (w._2 == 5);

		// apply the filter on wordsRDD
		JavaPairRDD<String, Integer> rddf = wordsRDD.filter(filterFunction);

		// print filtered rdd
		rddf.foreach(item -> {
			System.out.println(item);
		});
		
		sc.close();
	}
}

Output

(Learn,5)
(Spark,5)
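A side note on printing: foreach() executes on the executors, so on a real cluster the println output goes to the executor logs rather than the driver console; it appears here only because the examples run in local mode. To print the results on the driver, collect the (small) filtered RDD first:

rddf.collect().forEach(System.out::println);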

Conclusion

In this Spark Tutorial – Spark RDD.filter(), we have learned how to filter the elements of a Spark RDD, with example programs.