- Catalyst (sql/catalyst) – An implementation-agnostic framework for manipulating trees of relational operators and expressions.
- Execution (sql/core) – A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- Hive Support (sql/hive) – Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
- HiveServer and CLI support (sql/hive-thriftserver) – Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API.
At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features such as Scala’s pattern matching and quasiquotes in a novel way to build an extensible query optimizer. Catalyst supports both rule-based and cost-based optimization. It contains a general library for representing trees and applying rules to manipulate them.
It offers several public extension points, including external data sources and user-defined types.
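The tree-plus-rules design above can be sketched in a few lines. The following is an illustrative toy, not Spark's API: expression nodes are immutable tree values, and a rule is just a function applied to every node bottom-up, analogous to Catalyst's `transformUp` with a constant-folding rule. All names here are invented for the sketch.

```python
from dataclasses import dataclass

# Toy Catalyst-style expression tree. Each node type is an immutable
# value so rewritten trees can be compared structurally.

@dataclass(frozen=True)
class Literal:
    value: int

@dataclass(frozen=True)
class Attribute:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def transform_up(node, rule):
    """Apply `rule` to every node, children first (post-order)."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule),
                   transform_up(node.right, rule))
    return rule(node)

def constant_fold(node):
    """Rule: collapse Add(Literal, Literal) into a single Literal."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return node

# x + (1 + 2) is rewritten to x + 3; the unresolved attribute is left alone.
expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
folded = transform_up(expr, constant_fold)
print(folded)
```

Because rules only need to match the node shapes they care about and return everything else unchanged, new rules compose without touching the tree library itself, which is the property the extension points rely on.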
Catalyst’s general tree transformation framework is used in four phases:
- analyzing a logical plan to resolve references,
- logical plan optimization,
- physical planning, and
- code generation to compile parts of the query to Java bytecode.
Only in the physical planning phase does Catalyst generate multiple plans and compare them based on cost; all other phases are purely rule-based. Each phase uses different types of tree nodes; Catalyst includes libraries of nodes for expressions, data types, and logical and physical operators.
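Cost-based selection among candidate physical plans can be sketched as follows. This is a hand-wavy illustration, not Spark's planner: the plan names, the cost formulas, and the broadcast threshold are all made up, but the shape is the same, emit every applicable physical strategy, then keep the cheapest by estimated cost.

```python
from dataclasses import dataclass

@dataclass
class PhysicalPlan:
    name: str
    est_cost: float

def plan_join(left_rows, right_rows, broadcast_threshold=10_000):
    """Return the cheapest candidate physical plan for a toy join."""
    candidates = [
        # Sort-merge join: shuffle and sort both sides (toy cost model).
        PhysicalPlan("SortMergeJoin",
                     est_cost=left_rows + right_rows
                              + 0.1 * left_rows + 0.1 * right_rows),
    ]
    if right_rows < broadcast_threshold:
        # Broadcasting a small right side avoids shuffling the large one.
        candidates.append(PhysicalPlan("BroadcastHashJoin",
                                       est_cost=left_rows + 2 * right_rows))
    return min(candidates, key=lambda p: p.est_cost)

print(plan_join(1_000_000, 500).name)        # small side -> broadcast wins
print(plan_join(1_000_000, 2_000_000).name)  # both large -> sort-merge only
```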
Spark SQL takes a logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine. It can push operations from the logical plan into data sources that support predicate or projection pushdown.
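Predicate pushdown is itself just another tree rewrite: a `Filter` sitting on top of a pushdown-capable scan is collapsed into the scan, so rows are dropped at the source rather than after being read. The sketch below is illustrative only; the operator names and the `pushed_filter` hook are invented for this toy.

```python
class Scan:
    """Toy data source that can optionally evaluate a filter itself."""
    def __init__(self, rows, pushed_filter=None):
        self.rows = rows
        self.pushed_filter = pushed_filter

    def execute(self):
        rows = self.rows
        if self.pushed_filter is not None:
            rows = [r for r in rows if self.pushed_filter(r)]  # filtered at the source
        return rows

class Filter:
    """Toy filter operator applied after its child has produced rows."""
    def __init__(self, predicate, child):
        self.predicate, self.child = predicate, child

    def execute(self):
        return [r for r in self.child.execute() if self.predicate(r)]

def push_down(plan):
    """Rule: Filter(Scan) -> Scan with the predicate pushed into it."""
    if (isinstance(plan, Filter) and isinstance(plan.child, Scan)
            and plan.child.pushed_filter is None):
        return Scan(plan.child.rows, pushed_filter=plan.predicate)
    return plan

rows = [{"age": 20}, {"age": 35}]
plan = Filter(lambda r: r["age"] > 30, Scan(rows))
optimized = push_down(plan)
print(optimized.execute())
```

Both plans return the same rows; the pushed-down version simply never materializes the rows the source can already exclude, which is where the savings come from with real columnar sources such as Parquet.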
Interfaces to Spark SQL:
Spark SQL runs as a library on top of Spark. It exposes SQL interfaces, which can be accessed through JDBC/ODBC or through a command-line console, as well as a DataFrame API integrated into Spark’s supported programming languages. We start by covering the DataFrame API, which lets users intermix procedural and relational code. Advanced functions can also be exposed in SQL through user-defined functions (UDFs), allowing them to be invoked, for example, by business intelligence tools.
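The UDF idea, procedural code made callable from the relational layer, can be sketched with a toy registry. This is not Spark's UDF API; the registry, `eval_select`, and the expression encoding are invented to show the mechanism: an ordinary function is registered under a name, and the "engine" looks it up when evaluating an expression, so any client that can name the function (e.g. a BI tool issuing SQL) can invoke it.

```python
# Toy UDF registry: maps a SQL-visible name to a plain Python function.
udfs = {}

def register_udf(name, fn):
    """Expose procedural code under a name the relational layer can call."""
    udfs[name] = fn

def eval_select(rows, column_exprs):
    """Evaluate SELECT-style expressions of the form (udf_name, column)."""
    return [
        {f"{fn_name}({col})": udfs[fn_name](row[col])
         for fn_name, col in column_exprs}
        for row in rows
    ]

register_udf("upper", str.upper)  # any callable works, not just built-ins

rows = [{"name": "ada"}, {"name": "grace"}]
print(eval_select(rows, [("upper", "name")]))
```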
Goals for Spark SQL
- Support relational processing both within Spark programs (on native RDDs) and on external data sources using a programmer-friendly API.
- Provide high performance using established DBMS techniques.
- Easily support new data sources, including semi-structured data and external databases amenable to query federation.
- Enable extension with advanced analytics algorithms such as graph processing and machine learning.