Apache Spark Tutorial

What is Spark SQL?

Spark SQL is one of the four libraries of Apache Spark, alongside Spark Streaming, MLlib, and GraphX. It gives Spark the ability to access structured and semi-structured data and to optimize operations on that data.

Features of Apache Spark SQL

With SQL available as a library, Spark gained a number of capabilities. Here are the features that Spark provides through its SQL library.

1. Relational Processing

With the addition of SQL, Spark gained relational processing on top of its existing functional programming API.

2. Structured/Semi-structured Data Analysis

Spark SQL supports the analysis of both structured and semi-structured data.

3. Supporting existing Data Formats

New data formats keep emerging, and the industry keeps adopting them, which leaves large volumes of data stored in those formats. In the Big Data ecosystem, a new tool or library therefore needs to be compatible with, or provide connectors to, the popular existing formats. Spark supports data formats and sources such as Parquet, JSON, Apache Hive, and Cassandra.

4. Data Transformations

Spark's RDD API provides strong performance for transformations. Spark SQL builds on this: the result of a SQL query can be converted to an RDD and transformed further.

5. Performance

Spark has a performance edge over Hadoop MapReduce. Because it processes data in memory, Spark SQL performs considerably better on workloads that iterate over the same dataset many times.

6. Standard JDBC/ODBC Connectivity

Spark SQL provides an interface for connecting through standard JDBC/ODBC connections and running queries (table operations) on structured data.

7. User Defined Functions

Spark lets you define your own column-based functions, known as user-defined functions (UDFs), to extend the built-in functions used in transformations.

Get Hands-on with Examples

  1. Querying using Spark SQL
  2. Spark SQL with JSON
  3. Hive Tables with Spark SQL


In this Apache Spark Tutorial, we have learnt about Spark SQL and its features and capabilities.