Spark – Add new column to Dataset
A new column can be added to an existing Dataset using the Dataset.withColumn() method. withColumn() accepts two arguments: the name of the column to add, and a Column expression that supplies its values. It returns a new Dataset<Row>; the original Dataset is not modified. If a column with the given name already exists, it is replaced.
Syntax – withColumn()
The signature of the withColumn() method is:
public Dataset<Row> withColumn(String colName, Column col)
Step by step process to add New Column to Dataset
To add a new column to a Dataset in Apache Spark:
- Use the withColumn() method of the Dataset.
- Provide a string as the first argument to withColumn(); this is the name of the new column.
- Use the org.apache.spark.sql.functions class to generate the new Column, which is passed as the second argument. The Spark functions[https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html] class provides static methods for building Column expressions, including many mathematical functions (statistical, trigonometric, etc.).
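The steps above also apply when the new column is derived from an existing one rather than being a constant. The following is a minimal sketch, assuming a small in-memory Dataset with the same name and salary columns used in this tutorial. The class name DatasetAddDerivedColumn, the helper methods sampleEmployees and addRaise, the column name raised_salary, and the raise amount of 500 are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DatasetAddDerivedColumn {

    // Builds a two-row in-memory Dataset so the sketch does not depend on a file.
    // The rows are illustrative sample data.
    static Dataset<Row> sampleEmployees(SparkSession spark) {
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("salary", DataTypes.LongType, false)));
        List<Row> rows = Arrays.asList(
                RowFactory.create("Michael", 3000L),
                RowFactory.create("Andy", 4500L));
        return spark.createDataFrame(rows, schema);
    }

    // Adds a hypothetical "raised_salary" column computed from the existing
    // "salary" column. functions.col() references the existing column and
    // Column.plus() builds the arithmetic expression.
    static Dataset<Row> addRaise(Dataset<Row> ds) {
        return ds.withColumn("raised_salary", functions.col("salary").plus(500));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Spark Example - Add a derived Column to Dataset")
                .master("local[2]")
                .getOrCreate();

        Dataset<Row> ds = sampleEmployees(spark);
        Dataset<Row> newDs = addRaise(ds);

        // ds still has two columns; newDs has the additional derived column
        newDs.show();

        spark.stop();
    }
}
```

Keeping the withColumn() call in its own method is a design choice for this sketch: it makes it easy to see that the input Dataset is left unchanged and that only the returned Dataset carries the extra column.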
Example – Spark – Add new column to Spark Dataset
In the following example, we shall add a new column with name “new_col” with a constant value. We shall use functions.lit(Object literal) to create a new Column.
DatasetAddColumn.java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

public class DatasetAddColumn {

    public static void main(String[] args) {
        // configure spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Spark Example - Add a new Column to Dataset")
                .master("local[2]")
                .getOrCreate();

        String jsonPath = "data/employees.json";
        Dataset<Row> ds = spark.read().json(jsonPath);

        // dataset before adding new column
        ds.show();

        // add column to ds; lit(1) creates a Column holding the constant 1
        Dataset<Row> newDs = ds.withColumn("new_col", functions.lit(1));

        // print dataset after adding new column
        newDs.show();

        spark.stop();
    }
}
Output
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|   Raju|  3000|
| Chandy|  4500|
|   Joey|  3500|
|    Mon|  4000|
| Rachel|  4000|
+-------+------+

+-------+------+-------+
|   name|salary|new_col|
+-------+------+-------+
|Michael|  3000|      1|
|   Andy|  4500|      1|
| Justin|  3500|      1|
|  Berta|  4000|      1|
|   Raju|  3000|      1|
| Chandy|  4500|      1|
|   Joey|  3500|      1|
|    Mon|  4000|      1|
| Rachel|  4000|      1|
+-------+------+-------+
Conclusion
In this Spark Tutorial – Add new Column to existing Dataset, we have learnt to use the Dataset.withColumn() method and the functions class to add a new column to a Dataset.