Append or Concatenate Datasets

Spark provides the union() method in the Dataset class to concatenate, or append, one Dataset to another.

To append or concatenate two Datasets, call union() on the first Dataset and pass the second Dataset as the argument.

Note: union() can only be performed on Datasets with the same number of columns. Columns are matched by position, not by name.

Syntax – Dataset.union()

The syntax of the Dataset.union() method is

public Dataset<T> union(Dataset<T> other)

The method returns a new Dataset containing the rows of this Dataset together with the rows of the specified Dataset. Duplicate rows are not removed.

Example – Concatenate two Datasets

In the following example, we have two Datasets with employee information read from two different data files. We use the union() method to concatenate these two Datasets.

ConcatenateDatasets.java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConcatenateDatasets {

	public static void main(String[] args) {
		// configure spark
		SparkSession spark = SparkSession
				.builder()
				.appName("Spark Example - Append/Concatenate two Datasets")
				.master("local[2]")
				.getOrCreate();

		Dataset<Row> ds1 = spark.read().json("data/employees.json");
		Dataset<Row> ds2 = spark.read().json("data/employees2.json");
		
		// print dataset
		System.out.println("Dataset 1\n==============");
		ds1.show();
		System.out.println("Dataset 2\n==============");
		ds2.show();
		
		// concatenate datasets
		Dataset<Row> ds3 = ds1.union(ds2);
		
		System.out.println("Dataset 3 = Dataset 1 + Dataset 2\n==============================");
		ds3.show();
		
		spark.stop();
	}
}


Output

Dataset 1
==============
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|   Raju|  3000|
+-------+------+

Dataset 2
==============
+------+------+
|  name|salary|
+------+------+
|Chandy|  4500|
|  Joey|  3500|
|   Mon|  4000|
|Rachel|  4000|
+------+------+

Dataset 3 = Dataset 1 + Dataset 2
==============================
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|   Raju|  3000|
| Chandy|  4500|
|   Joey|  3500|
|    Mon|  4000|
| Rachel|  4000|
+-------+------+

General Pitfalls while concatenating Datasets

If the number of columns in the two Datasets does not match, the union() method throws an AnalysisException, as shown below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 2 columns and the second table has 3 columns;;
'Union
:- Relation[name#8,salary#9L] json
+- Relation[name#21,nn#22L,salary#23L] json

In the above case, there are two columns in the first Dataset, while the second Dataset has three columns.
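One way to catch this before calling union() is to compare the column names of the two Datasets first. The following is only a minimal sketch in plain Java: the class UnionPreCheck and the method findProblem() are hypothetical names introduced for illustration and are not part of Spark. It assumes the column names are passed in as String arrays, which is what Dataset.columns() returns. Because union() matches columns by position, the sketch also flags Datasets whose column names appear in a different order.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper for illustration; not part of the Spark API.
class UnionPreCheck {

    /**
     * Compares two column-name arrays (for example the values returned by
     * Dataset.columns()) and describes the first problem found, if any.
     */
    static Optional<String> findProblem(String[] left, String[] right) {
        // union() throws an AnalysisException when the column counts differ.
        if (left.length != right.length) {
            return Optional.of("column count differs: "
                    + left.length + " vs " + right.length);
        }
        // union() matches columns by position, so a different order
        // can put values under the wrong column name.
        if (!Arrays.equals(left, right)) {
            return Optional.of("column names or order differ: "
                    + Arrays.toString(left) + " vs " + Arrays.toString(right));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Column names taken from the AnalysisException shown above.
        String[] first = {"name", "salary"};
        String[] second = {"name", "nn", "salary"};

        System.out.println(findProblem(first, second).orElse("ok"));
        System.out.println(findProblem(first, new String[] {"salary", "name"}).orElse("ok"));
        System.out.println(findProblem(first, first).orElse("ok"));
    }
}
```

In a Spark program, the check could be called as UnionPreCheck.findProblem(ds1.columns(), ds2.columns()) before ds1.union(ds2). When the two Datasets have the same columns in a different order, Spark 2.3 and later also provide Dataset.unionByName(), which matches columns by name instead of position.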

Conclusion

In this Apache Spark Tutorial – Concatenate two Datasets, we have learnt to use the Dataset.union() method to append one Dataset to another with the same number of columns.