How to load data from JSON file and execute SQL query in Spark SQL

Load data from JSON file and execute SQL query in Spark SQL by reading the JSON file into a DataFrame, registering the DataFrame as a temporary view, and running a SQL statement with spark.sql().

Apache Spark Dataset and DataFrame APIs provides an abstraction to the Spark SQL from data sources. Dataset provides the goodies of RDDs along with the optimization benefits of Spark SQL’s execution engine.

This tutorial explains the Java Spark SQL flow for JSON data: create a SparkSession, read JSON records, inspect the schema, create a temporary view, execute a SQL query, and display the result.

Dataset loads JSON data source as a distributed collection of data. DataFrame is Dataset with data arranged into named columns. The architecture containing JSON data source, Dataset, Dataframe and Spark SQL is shown below :

Load data from JSON file and execute SQL query in Apache Spark - Apache Spark Tutorial - www.tutorialkart.com — JSON -> Dataset -> DataFrame -> Spark SQL -> SQL Query

How Spark SQL Reads JSON Files Before Running SQL Queries

Spark SQL can read JSON data and represent it as a DataFrame. Each field in the JSON data becomes a column in the DataFrame schema. After that, the DataFrame can be queried using the DataFrame API or SQL syntax.

By default, Spark expects JSON data in newline-delimited JSON format, where each line is a complete JSON object. This format is common in logs, event data, and exported records.

</>

Copy

{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}

If your input file contains one JSON document spread across multiple lines, use the multiLine option while reading the file. For large analytics datasets, newline-delimited JSON is usually easier for Spark to split and process in parallel.

Load Data from JSON File and Execute SQL Query in Spark SQL

Following is a step-by-step process to load data from JSON file and execute SQL query on the loaded data from JSON file.

1. Create a Spark Session for the JSON SQL Example

Provide application name and set master to local with two threads.

</>

Copy

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL data source JSON example")
        .master("local[2]")
        .getOrCreate();

SparkSession is the entry point used for reading the JSON file, creating the temporary view, and running the SQL query.

2. Read JSON Data Source into a Dataset<Row>

SparkSession.read().json(String path) can accept either a single text file or a directory storing text files, and load the data to Dataset.

</>

Copy

Dataset<Row> people = spark.read().json("path-to-json-files");

When a directory path is given, Spark reads the JSON files inside that directory. In the example below, the path data/sql/json/ contains a file named people.json.

3. Create a Temporary View for SQL Queries on JSON Data

</>

Copy

people.createOrReplaceTempView("people");

The temporary view name people is used like a table name in Spark SQL. It exists for the lifetime of the current Spark session unless it is replaced or dropped.

4. Run a Spark SQL Query on the JSON Temporary View

Temporary view could be considered as a table and attributes under schema root as columns

root
|-- name: string (nullable = true)
|-- salary: long (nullable = true)

Table : people Columns : name, salary

</>

Copy

Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE salary>3500 ");
namesDF.show();

The query selects only the name column from records where the salary value is greater than 3500.

</>

Copy

SELECT name
FROM people
WHERE salary > 3500;

5. Stop the Spark Session After the JSON Query Runs

After the query result is displayed, stop the Spark session to release the resources used by the local Spark application.

</>

Copy

spark.stop();

Complete Java Program to Query JSON File in Spark SQL

Complete java program to load data from JSON file and execute SQL query in given below:

data/sql/json/people.json

{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}
{"name":"Raju", "salary":3000}
{"name":"Chandy", "salary":4500}
{"name":"Joey", "salary":3500}
{"name":"Mon", "salary":4000}
{"name":"Rachel", "salary":4000}

Each line in this file is a separate JSON object. Spark reads these records and infers the columns name and salary.

JSONsqlExample.java

</>

Copy

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
 
public class JSONsqlExample {
 
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL data source JSON example")
                .master("local[2]")
                .getOrCreate();
 
        // A JSON dataset is pointed to by path.
        // The path can be either a single text file or a directory storing text files
        Dataset<Row> people = spark.read().json("data/sql/json/");
 
        // The inferred schema can be visualized using the printSchema() method
        System.out.println("Schema\n=======================");
        people.printSchema();
 
        // Creates a temporary view using the DataFrame
        people.createOrReplaceTempView("people");
 
        // SQL statements can be run by using the sql methods provided by spark
        Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE salary>3500 ");
        System.out.println("\n\nSQL Result\n=======================");
        namesDF.show();
 
        // stop spark session
        spark.stop();
    }
}

Output

Schema
=======================
root
|-- name: string (nullable = true)
|-- salary: long (nullable = true)
 
SQL Result
=======================
+------+
| name|
+------+
| Andy|
| Berta|
|Chandy|
| Mon|
|Rachel|
+------+

Using an Explicit Schema When Loading JSON in Spark SQL

In the example above, Spark infers the JSON schema from the input data. Schema inference is convenient for learning and exploration, but production jobs often use an explicit schema so that field types stay predictable.

</>

Copy

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
        .add("name", DataTypes.StringType, true)
        .add("salary", DataTypes.LongType, true);

Dataset<Row> people = spark.read()
        .schema(schema)
        .json("data/sql/json/");

An explicit schema is helpful when a column may be missing from some records, when numeric values must be treated consistently, or when reading a large JSON dataset where repeated schema inference is unnecessary.

Reading Multi-line JSON Files in Spark SQL

The sample file in this tutorial uses newline-delimited JSON. If a file contains one JSON object or array spread across multiple lines, enable the multiLine option while reading it.

</>

Copy

Dataset<Row> people = spark.read()
        .option("multiLine", true)
        .json("data/sql/json/people-multiline.json");

Use this option only when the file layout needs it. For many Spark workloads, one JSON record per line is the preferred input format because it is easier to process in parallel.

Querying Nested JSON Fields with Spark SQL

JSON data may contain nested objects. Spark SQL can represent nested objects as struct columns. After reading the data, use printSchema() to confirm the nested field names before writing the SQL query.

</>

Copy

{"name":"Andy", "salary":4500, "address":{"city":"Hyderabad", "country":"India"}}

</>

Copy

SELECT name, address.city AS city
FROM people
WHERE salary > 3500;

For deeply nested JSON, it is usually better to inspect the schema first and then flatten only the fields required for the query or downstream table.

Common Mistakes When Running Spark SQL on JSON Files

Using regular JSON instead of newline-delimited JSON: If the JSON document spans multiple lines, set multiLine to true.
Querying a wrong column name: Use printSchema() to check the exact field names inferred by Spark.
Forgetting to create the temporary view: spark.sql() can query people only after createOrReplaceTempView("people") is called.
Assuming schema inference is always best: For stable pipelines, define the schema explicitly to avoid unexpected type changes.
Using SQL syntax before escaping special field names: If a JSON field name contains spaces or special characters, use backticks in Spark SQL.

Spark SQL JSON File FAQ

How do I load data from JSON file and execute SQL query in Spark SQL?

Read the JSON file using spark.read().json(path), create a temporary view using createOrReplaceTempView(), and run the SQL query using spark.sql(). The result is returned as a Dataset<Row>.

Can Spark SQL read a directory of JSON files?

Yes. The JSON path can point to a single JSON file or to a directory that contains JSON files. Spark reads the matching files and creates one DataFrame from the input data.

Why does Spark SQL infer the salary column as long?

In this example, all salary values are whole numbers, so Spark infers the column as a numeric long type. If the input contains decimal values, missing values, or mixed data types, the inferred type may differ.

When should I provide a schema for Spark JSON data?

Provide a schema when the JSON structure is known and the job is part of a repeatable pipeline. This helps keep column types stable and avoids scanning the data only to infer the schema.

Can Spark SQL query nested JSON fields?

Yes. Nested JSON objects are represented as struct columns. You can query nested fields using dot notation, such as address.city, after checking the schema with printSchema().

Spark SQL JSON Tutorial QA Checklist

Confirm that the JSON sample is newline-delimited, with one complete JSON object per line.
Verify that the temporary view name in the explanation matches the SQL query table name people.
Check that the SQL query condition salary>3500 matches the displayed output rows.
Make sure Java examples use SparkSession, Dataset<Row>, and spark.sql() consistently.
Keep output-only blocks marked as output and syntax-only examples marked with the proper PrismJS language class.

Official reference: Apache Spark SQL JSON Files documentation.

Conclusion

In this Apache Spark Tutorial, we have learnt to load a json file into Dataset and access the data using SQL queries through Spark SQL.

The key steps are simple: read the JSON source into a DataFrame, inspect the inferred schema, register a temporary view, and query the view with Spark SQL. For production jobs, consider using an explicit schema and the correct JSON input mode for your file structure.

TutorialKart.com

How to load data from JSON file and execute SQL query in Spark SQL

How Spark SQL Reads JSON Files Before Running SQL Queries

Load Data from JSON File and Execute SQL Query in Spark SQL

1. Create a Spark Session for the JSON SQL Example

2. Read JSON Data Source into a Dataset<Row>

3. Create a Temporary View for SQL Queries on JSON Data

4. Run a Spark SQL Query on the JSON Temporary View

5. Stop the Spark Session After the JSON Query Runs

Complete Java Program to Query JSON File in Spark SQL

Using an Explicit Schema When Loading JSON in Spark SQL

Reading Multi-line JSON Files in Spark SQL

Querying Nested JSON Fields with Spark SQL

Common Mistakes When Running Spark SQL on JSON Files

Spark SQL JSON File FAQ

How do I load data from JSON file and execute SQL query in Spark SQL?

Can Spark SQL read a directory of JSON files?

Why does Spark SQL infer the salary column as long?

When should I provide a schema for Spark JSON data?

Can Spark SQL query nested JSON fields?

Spark SQL JSON Tutorial QA Checklist

Conclusion

Popular Courses

SAP

CRM

SAP Resources

Apache

GUI

Programming

Databases

Mobile

Linux

Web & Server

Testing

Learning