What is Spark SQL?
Spark SQL is one of the four libraries of Apache Spark. It gives Spark the ability to access structured and semi-structured data and to optimize operations on that data.
Here are the features that Spark provides through its SQL library.
With the addition of SQL, Spark gained relational processing on top of its existing functional programming model.
Structured/Semi-structured data analysis
Supporting existing Data Formats
New data formats keep emerging, and as the industry adopts them, large volumes of data accumulate in each. In the Big Data ecosystem, a new tool or library therefore needs to be compatible with, or able to connect to, the formats that are already popular. Spark SQL supports data formats and sources such as Parquet, JSON, Apache Hive and Cassandra.
Spark’s RDD API is designed for efficient transformations, and Spark SQL builds on it: SQL queries are executed as RDD operations, and query results can be converted to RDDs for further transformation.
Performance is one of Spark's main advantages over Hadoop MapReduce. Because Spark SQL can process data in memory, the gain is most noticeable for workloads that iterate over the same dataset many times.
Standard JDBC/ODBC Connectivity
Spark SQL provides an interface for connecting through standard JDBC/ODBC connections and running queries (table operations) on the structured data.
User Defined Functions
Spark lets you define your own column-based functions and use them in transformations, extending the set of built-in Spark functions.
Get Hands-on with Examples
- Querying using Spark SQL
- Spark SQL with JSON
- Hive Tables with Spark SQL
In this Apache Spark Tutorial, we have learnt about Spark SQL and its features, capabilities, architecture and libraries.