1. Introduction to Spark SQL

Spark SQL is a Spark module for working with structured data. It has the following characteristics:

  • Seamlessly mixes SQL queries with Spark programs, letting you query structured data with either SQL or the DataFrame API (a short sketch follows this list);
  • Supports multiple development languages;
  • Supports hundreds of external data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC;
  • Supports HiveQL syntax as well as Hive SerDes and UDFs, allowing access to existing Hive warehouses;
  • Supports standard JDBC and ODBC connections;
  • Provides a query optimizer, columnar storage, and code generation;
  • Scales out while ensuring fault tolerance.
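
For example, here is a minimal Scala sketch of mixing SQL with the DataFrame API; the local master setting and the people.json file (assumed to have name and age fields) are placeholders for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SparkSqlIntro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-intro")
      .master("local[*]")                      // local mode, for illustration only
      .getOrCreate()

    // Read structured data and register it as a temporary view for SQL queries
    val df = spark.read.json("people.json")    // hypothetical sample file
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age >= 18").show()

    // The same query expressed with the DataFrame API
    df.select(col("name"), col("age")).where(col("age") >= 18).show()

    spark.stop()
  }
}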

2. DataFrame & DataSet

2.1 DataFrame

To support the processing of structured data, Spark SQL provides a new data structure, the DataFrame. A DataFrame is a dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python. Because Spark SQL supports multiple development languages, each language exposes the abstraction as follows (a short example follows the table):

Language | Main abstraction
-------- | --------------------------------------------------
Scala    | Dataset[T] & DataFrame (an alias for Dataset[Row])
Java     | Dataset[T]
Python   | DataFrame
R        | DataFrame
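
As a quick illustration (reusing the spark session and the hypothetical people.json file from the sketch above), a DataFrame carries named columns and a schema, much like a database table:

val people = spark.read.json("people.json")
people.printSchema()                    // prints the inferred schema, e.g. age: long, name: string
people.groupBy("age").count().show()    // table-like, column-oriented operations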

2.2 DataFrame Comparison with RDDs

The main difference between a DataFrame and an RDD is that a DataFrame is designed for structured data, while an RDD can hold arbitrary, unstructured data; their internal representations differ accordingly.

A DataFrame has an explicit schema, so column names and column types are known. This allows Spark to read less data and to optimize the execution plan, which improves query efficiency.

How do you choose between a DataFrame and an RDD?

  • If you want functional programming rather than the DataFrame API, use RDDs;
  • If your data is unstructured (such as media streams or text streams), use RDDs;
  • If your data is structured (such as rows in an RDBMS) or semi-structured (such as logs), prefer DataFrames for performance reasons (see the sketch after this list).
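
A rough sketch of that choice; the log file name and its tab-separated level/user/message layout are made up for illustration:

import spark.implicits._
// Free-form text: keep it as an RDD and use functional transformations
val rawLines = spark.sparkContext.textFile("app.log")
val errorCount = rawLines.filter(_.contains("ERROR")).count()
// Semi-structured logs: parse them into named columns so Spark can optimize the queries
val logs = rawLines.map(_.split("\t")).filter(_.length == 3)
  .map(a => (a(0), a(1), a(2)))
  .toDF("level", "user", "message")
logs.groupBy("level").count().show()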

2.3 DataSet

A Dataset is also a distributed collection of data. Introduced in Spark 1.6, it combines the advantages of RDDs and DataFrames, supports lambda functions, and is available only in Scala and Java. Since Spark 2.0, the DataFrame and Dataset APIs have been unified into the Structured APIs, so developers can use a single set of standard APIs to operate on both.
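
A small sketch of the unified API in Scala; the Person case class and people.json file are hypothetical placeholders, and spark.implicits._ supplies the required encoders:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._
case class Person(name: String, age: Long)
val df: DataFrame = spark.read.json("people.json")   // a DataFrame is just Dataset[Row]
val ds: Dataset[Person] = df.as[Person]
// The same Structured API (select, filter, groupBy, ...) works on both:
df.filter($"age" > 21).groupBy("age").count().show()
ds.filter(_.age > 21).groupBy("age").count().show()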

The DataFrame is marked as an Untyped API, while the Dataset is marked as a Typed API.

2.4 Static typing and runtime type safety

In terms of static typing and runtime type safety, SQL, DataFrame, and Dataset behave differently:

In practice, a syntax error in a Spark SQL query is not discovered until runtime, whereas with a DataFrame or Dataset it is caught at compile time (saving development time and cost). The main difference between a DataFrame and a Dataset is:

With a DataFrame, calling a function that is not part of the API is reported by the compiler, but using a field name that does not exist goes unnoticed until the code runs. The Dataset API is expressed in terms of lambda functions and JVM typed objects, so any type mismatch is discovered at compile time.

All of this can be summarized as a type-safety spectrum over syntax and analysis errors: SQL catches both only at runtime, a DataFrame catches syntax errors at compile time but analysis errors at runtime, and a Dataset catches both at compile time. Along this spectrum the Dataset is the most restrictive, but also the most productive for developers.

The above may not be intuitive, so here is an example of how such code behaves when it is compiled (for instance in an IDE such as IDEA):
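
A minimal sketch of the difference, reusing the hypothetical Person case class and people.json file from the sketch in section 2.3 (the misspelled field nmae is deliberate; the failing lines are commented out, with their behavior noted alongside):

import spark.implicits._
val df = spark.read.json("people.json")     // DataFrame (untyped)
val ds = df.as[Person]                      // Dataset[Person] (typed)
// df.select("nmae")       // compiles, but throws an AnalysisException at run time (no such column)
// ds.map(_.nmae)          // does not compile at all: value nmae is not a member of Person
ds.map(_.age + 1).show()   // compiles; the result is statically known to be a Dataset[Long]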

A question that may arise here is why a DataFrame is still marked as Untyped even though it has a well-defined schema. The next section explains this.

2.5 Untyped & Typed

A DataFrame API is marked as Untyped, while a Dataset API is marked as Typed. A DataFrame is Untyped only at the language/API level: it does have an explicit schema, that is, its column names and column types are determined, but this information is maintained entirely by Spark, which only checks at run time whether the actual types match the specified ones. This is why, since Spark 2.0, a DataFrame has been officially defined as Dataset[Row], where Row is a trait defined in Spark whose subclasses encapsulate the column field information.
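
A short sketch of what Untyped means in practice, again with the hypothetical people.json file:

import org.apache.spark.sql.Row
val df = spark.read.json("people.json")
val firstRow: Row = df.first()             // a DataFrame yields generic Row objects
val name = firstRow.getAs[String]("name")  // field access by name; the cast is only checked at run time
val age = firstRow.getAs[Long]("age")      // a wrong type here fails at run time, not at compile time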

A Dataset is Typed, i.e. strongly typed. As shown in the following code, the element type of a Dataset is explicitly specified by a case class (Scala) or a Java Bean (Java). In this case, each row of data represents a Person, and this information is guaranteed by the JVM, so field-name errors and type errors are detected at compile time (for example by the IDE).

import org.apache.spark.sql.Dataset
import spark.implicits._              // provides the encoder required by .as[Person]
case class Person(name: String, age: Long)
val dataSet: Dataset[Person] = spark.read.json("people.json").as[Person]

3. DataFrame & DataSet & RDDs

Here’s a quick summary of all three:

  • RDDs are suitable for unstructured data processing, while DataFrames & Datasets are better suited to structured and semi-structured data;
  • DataFrames & Datasets are accessed through the unified Structured APIs, while RDDs are better suited to functional programming scenarios;
  • Compared with a DataFrame, a Dataset is strongly typed and has stricter static type checking;
  • Datasets, DataFrames, and SQL all rely on the RDD API under the hood and expose structured access interfaces on top of it (see the conversion sketch after this list).
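
Because all three ultimately run on RDDs, it is straightforward to convert between them; a brief sketch, reusing the hypothetical Person case class and spark session from the earlier sketches:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._
val ds: Dataset[Person] = spark.read.json("people.json").as[Person]
val df: DataFrame = ds.toDF()          // typed Dataset -> DataFrame (Dataset[Row])
val rdd: RDD[Person] = ds.rdd          // drop down to the underlying RDD
val ds2: Dataset[Person] = rdd.toDS()  // and back up to a typed Dataset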

4. How Spark SQL Works

DataFrames, Datasets, and Spark SQL all share the same execution flow:

  1. Write DataFrame/Dataset/SQL code;
  2. If the code is valid, i.e. it has no compilation errors, Spark converts it into a logical plan;
  3. Spark transforms the logical plan into a physical plan, optimizing the query along the way;
  4. Spark then executes the physical plan (as RDD operations) on the cluster (the explain output sketched below shows these stages).
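
These stages can be inspected with explain; passing true prints the parsed, analyzed, and optimized logical plans as well as the physical plan (output headers shown as comments, details omitted, people.json again hypothetical):

import spark.implicits._
val query = spark.read.json("people.json").filter($"age" > 21).groupBy("age").count()
query.explain(true)
// == Parsed Logical Plan ==
// == Analyzed Logical Plan ==
// == Optimized Logical Plan ==
// == Physical Plan ==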

4.1 Logical Plan

The first stage of execution transforms the user code into a logical plan. The code is first converted into an unresolved logical plan; it is unresolved because, even if the code is syntactically correct, the tables or columns it references may not exist. Spark then uses an Analyzer to resolve it against the catalog (which stores information about all tables and DataFrames). If resolution fails, execution is rejected; if it succeeds, the result is passed to the Catalyst Optimizer, a set of rules that optimize the logical plan, for example by predicate pushdown, and the output is the optimized logical plan.
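
For example, with a columnar source such as Parquet, a filter can be pushed down into the scan itself; this shows up in the physical plan (the file name is hypothetical and the output is abbreviated):

import spark.implicits._
val people = spark.read.parquet("people.parquet")
people.filter($"age" > 21).explain()
// The FileScan node typically reports: PushedFilters: [IsNotNull(age), GreaterThan(age,21)]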

4.2 Physical Plan

After the optimized logical plan is obtained, Spark begins the physical planning process. It generates several candidate physical execution strategies, compares them using a cost model, and selects the optimal physical plan to execute on the cluster. The output of physical planning is a series of RDDs and transformations.

4.3 Execution

After a physical plan is selected, Spark runs the corresponding RDD code, performs further optimizations at run time, generates native Java bytecode, and finally returns the results to the user.


For more articles in the big data series, see the GitHub open-source project Getting Started with Big Data.