How to load data from JSON file and execute SQL query in Spark SQL

Load data from JSON data source and execute Spark SQL query

Apache Spark Dataset and DataFrame APIs provides an abstraction to the Spark SQL from data sources. Dataset provides the goodies of RDDs along with the optimization benefits of Spark SQL’s execution engine.

Dataset loads JSON data source as a distributed collection of data. DataFrame is Dataset with data arranged into named columns. The architecture containing JSON data source, Dataset, Dataframe and Spark SQL is shown below :

Load data from JSON file and execute SQL query in Apache Spark - Apache Spark Tutorial - www.tutorialkart.com

JSON -> Dataset -> DataFrame -> Spark SQL -> SQL Query

Load data from JSON file and execute SQL query

Following is a step-by-step process to load data from JSON file and execute SQL query on the loaded data from JSON file :

  1. Create a Spark Session

    Provide application name and set master to local with two threads.

  2. Read JSON data source

    SparkSession.read().json(String path) can accept either a single text file or a directory storing text files, and load the data to Dataset.

  3. Create a temporary view using the DataFrame

  4. Run SQL query

    Temporary view could be considered as a table and attributes under schema root as columns

    Table : people
    Columns : name, salary

  5. Stop spark session

Complete java program to load data from JSON file and execute SQL query in given below:

 

Conclusion :

In this Apache Spark Tutorial, we have learnt to load a json file into Dataset and access the data using SQL queries through Spark SQL.