Preface
The teachers at LeByte Education taught me the working principles and workflow of the Spark SQL architecture, and I would like to share them with you. Spark SQL is compatible with Hive because the Spark SQL architecture is similar to Hive’s underlying structure: Spark SQL reuses the metadata warehouse (Metastore), HiveQL, user-defined functions (UDFs), and serialization/deserialization tools (SerDes) provided by Hive. Take a closer look at the underlying architecture of Spark SQL in Figure 1.
Figure 1 Spark SQL architecture
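To make this Hive compatibility concrete, here is a minimal Scala sketch (the application, database, and table names are placeholders) that builds a SparkSession with Hive support enabled, so that the Hive Metastore, HiveQL, Hive UDFs, and Hive SerDes become available:

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession that reuses an existing Hive deployment:
// enableHiveSupport() wires in the Hive Metastore, HiveQL features,
// Hive UDFs, and Hive SerDes mentioned above.
val spark = SparkSession.builder()
  .appName("SparkSqlOnHive")      // placeholder application name
  .enableHiveSupport()
  .getOrCreate()

// Run a HiveQL query against a table registered in the Hive Metastore
// (the database and table names here are placeholders).
spark.sql("SELECT id, name FROM demo_db.demo_table LIMIT 10").show()
```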
As shown in Figure 1, compared with the Hive architecture, Spark SQL replaces the underlying MapReduce execution engine with Spark and adds the Catalyst optimizer. Spark SQL owes much of its fast computation to the Catalyst optimizer: from the moment HiveQL is parsed into an abstract syntax tree, execution plan generation and optimization are handled by Catalyst.
The Catalyst optimizer is a new, extensible query optimizer built on Scala’s functional programming constructs. Spark SQL’s developers designed it as an extensible architecture so that new optimization techniques and features can easily be added in future releases, especially to address problems that arise with big data in production (for example, semi-structured data and advanced data analysis). Because Spark is an open-source project, external developers can also extend the Catalyst optimizer’s capabilities to meet the needs of their own projects. Figure 2 shows how Spark SQL works.
Figure 2 Spark SQL working principle
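Before walking through the individual components, here is a minimal sketch of the extensibility mentioned above (the rule name is hypothetical and the rule deliberately does nothing): external code can register an extra optimization rule with Catalyst through `spark.experimental.extraOptimizations`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

val spark = SparkSession.builder().appName("CustomRuleDemo").getOrCreate()

// A do-nothing Catalyst rule; a real rule would pattern-match on the
// logical plan and rewrite the parts it recognizes.
object MyNoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register the rule so Catalyst applies it during the optimization phase.
spark.experimental.extraOptimizations ++= Seq(MyNoOpRule)
```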
To properly support SQL, Spark needs to handle parsing (Parser), optimization (Optimizer), and execution (Execution). The Catalyst optimizer performs plan generation and optimization with the help of its five internal components, described below.
**Parser component:** This component parses a Spark SQL string into an abstract syntax tree (AST) according to certain semantic rules (using the third-party class library ANTLR).
**Analyzer component:** This component walks the AST, binds data types and functions to each node of the AST, and then resolves the fields of the data table against the metadata information in the Catalog.
**Optimizer component:** This component is the heart of Catalyst. It applies two kinds of optimization strategies: rule-based optimization (RBO) and cost-based optimization (CBO).
**SparkPlanner component:** The optimized logical execution plan (OptimizedLogicalPlan) is still logical and cannot be executed by the Spark system directly, so SparkPlanner converts it into one or more executable physical plans.
**CostModel component:** This component selects the best physical execution plan, primarily based on past performance statistics. (The plans produced at each of these stages can be inspected directly, as shown in the sketch after this list.)
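A minimal sketch of that inspection, using a hypothetical in-memory table: `explain(true)` prints the parsed, analyzed, and optimized logical plans as well as the final physical plan, roughly matching the components described above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ExplainDemo").getOrCreate()
import spark.implicits._

// A small in-memory table to query (hypothetical sample data).
Seq((1, "spark"), (2, "sql")).toDF("id", "name").createOrReplaceTempView("demo")

// extended = true prints: Parsed Logical Plan, Analyzed Logical Plan,
// Optimized Logical Plan, and Physical Plan.
spark.sql("SELECT name FROM demo WHERE id = 1").explain(true)
```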
With the functions of these components in mind, the following steps describe the Spark SQL workflow.
1. Before the SQL statement is parsed, a SparkSession is created, and metadata such as table names, field names, and field types is stored in the Catalog.
2. When the sql() method of SparkSession is called, SparkSqlParser is used to parse the SQL statement; ANTLR performs the lexical and syntactic analysis during this parsing.
3. Next, the Analyzer binds the logical plan. In this stage, the Analyzer uses Analyzer Rules together with the Catalog to resolve the unbound logical plan and generate a bound (resolved) logical plan.
4. The Optimizer then optimizes the resolved logical plan based on predefined rules (RBO) and generates an optimized logical plan.
5. SparkPlanner then transforms the optimized logical plan into multiple executable physical plans.
6. The CBO optimization strategy then uses the Cost Model to calculate the cost of each physical plan and selects the plan with the lowest cost as the final physical plan.
7. Finally, the physical plan is executed through QueryExecution: SparkPlan’s execute() method is called, returning an RDD (see the sketch after this list).
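Tying these steps together, the sketch below (the sample data and query are placeholders) walks the same pipeline programmatically through `Dataset.queryExecution`, whose fields roughly correspond to the stages above, and finally materializes the physical plan as an RDD.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("WorkflowDemo").getOrCreate()
import spark.implicits._

// Hypothetical sample data registered as a temporary view.
Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("t")

val df = spark.sql("SELECT name FROM t WHERE id > 1")
val qe = df.queryExecution

println(qe.logical)       // step 2: logical plan produced by the parser
println(qe.analyzed)      // step 3: logical plan bound against the Catalog
println(qe.optimizedPlan) // step 4: logical plan after RBO rules
println(qe.sparkPlan)     // step 5: physical plan chosen by SparkPlanner
println(qe.executedPlan)  // steps 6-7: final physical plan ready to run

// Executing the physical plan returns an RDD (of Spark's internal row format).
val rdd = qe.toRdd
println(rdd.count())
```

In practice, actions such as show() or collect() drive this execution for you; accessing queryExecution directly is mainly useful for inspection and debugging.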
This article is reprinted from LeByte.