The use of Catalyst in Spark SQL
Spark SQL uses Catalyst's general tree-transformation framework in four phases:
- Analysis: bind the syntax tree to metadata (the Catalog) to obtain a Resolved Logical Plan (also called the Analyzed Logical Plan)
- Logical optimization: apply RBO (rule-based optimization) and CBO (cost-based optimization) rules to obtain a new, optimized logical plan
- Physical planning: convert the Logical Plan into one or more Physical Plans, then use a Cost Model to choose the best Physical Plan
- Code generation: compile parts of the query to Java bytecode
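A quick way to see these phases in practice is to inspect a query's QueryExecution. The sketch below is illustrative only: it assumes a local SparkSession and a hypothetical temporary view named sales; the fields logical, analyzed, optimizedPlan, and sparkPlan are part of Spark's QueryExecution API.

```scala
import org.apache.spark.sql.SparkSession

object CatalystPhasesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("catalyst-phases").getOrCreate()
    import spark.implicits._

    // A tiny table to resolve against; "sales" is a hypothetical example table.
    Seq((1, "a"), (2, "b")).toDF("col", "name").createOrReplaceTempView("sales")

    val qe = spark.sql("SELECT col FROM sales WHERE col > 1").queryExecution
    println(qe.logical)        // parsed (unresolved) logical plan
    println(qe.analyzed)       // after analysis: attributes bound via the catalog
    println(qe.optimizedPlan)  // after rule-based logical optimization
    println(qe.sparkPlan)      // selected physical plan

    spark.stop()
  }
}
```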
Analysis
Spark SQL starts with a relation to be computed, either from the abstract syntax tree (AST) returned by the SQL parser or from a DataFrame object constructed with the API. In both cases, the relation may contain unresolved attribute references or relations. For example, in the SQL query SELECT col FROM sales, the type of col, or even whether it is a valid column name, is unknown until we look up the table sales. An attribute is unresolved if we do not know its type or have not matched it to an input table (or alias). Spark SQL resolves these attributes using Catalyst rules and a Catalog object that tracks the tables in all data sources. The analyzer starts from an unresolved logical plan with unbound attributes and data types, and applies rules that perform the following steps (a simplified sketch follows the list):
- Looking up relations by name in the Catalog.
- Mapping named attributes, such as col, to the input provided by the operator's children.
- Determining which attributes refer to the same value, and giving them a unique ID (which later allows expressions such as col = col to be optimized).
- Propagating and coercing types through expressions: for example, we cannot know the type of 1 + col until col is resolved and its subexpressions are, if necessary, cast to a compatible type.
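The following is a minimal, self-contained sketch of the idea behind attribute resolution; it is not Spark's actual Analyzer. It assumes a toy expression tree and a catalog represented as a plain Map from column name to data type.

```scala
// Toy model of attribute resolution (not Spark's real classes).
sealed trait Expression
case class UnresolvedAttribute(name: String) extends Expression
case class AttributeReference(name: String, dataType: String, id: Long) extends Expression
case class Add(left: Expression, right: Expression) extends Expression
case class Literal(value: Any, dataType: String) extends Expression

object ToyAnalyzer {
  // Hypothetical catalog: column name -> data type for a table such as "sales".
  val catalog: Map[String, String] = Map("col" -> "int")
  private var nextId = 0L

  // Walk the tree and replace unresolved attributes with typed, uniquely-numbered references.
  def resolve(expr: Expression): Expression = expr match {
    case UnresolvedAttribute(name) =>
      val dataType = catalog.getOrElse(name, sys.error(s"cannot resolve column $name"))
      nextId += 1
      AttributeReference(name, dataType, nextId)
    case Add(l, r) => Add(resolve(l), resolve(r))
    case other     => other
  }
}

// Example: 1 + col becomes Add(Literal(1, "int"), AttributeReference("col", "int", 1))
// ToyAnalyzer.resolve(Add(Literal(1, "int"), UnresolvedAttribute("col")))
```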
The relevant code is in Analyzer.scala.
Logical optimization
The logical optimization phase applies standard rule-based optimizations to the logical plan. These rules include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, and others.
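To make the rule-based flavor of this phase concrete, here is a toy constant-folding pass over a simplified expression tree like the one sketched above; again, this is an illustration, not Spark's Optimizer.

```scala
// Toy constant folding: collapse Add(Literal, Literal) into a single Literal.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Col(name: String) extends Expr
case class AddExpr(left: Expr, right: Expr) extends Expr

object ToyConstantFolding {
  def fold(e: Expr): Expr = e match {
    case AddExpr(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)   // both sides constant: evaluate at plan time
        case (fl, fr)         => AddExpr(fl, fr)
      }
    case other => other
  }
}

// fold(AddExpr(AddExpr(Lit(1), Lit(2)), Col("x")))  ==  AddExpr(Lit(3), Col("x"))
```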
The relevant code is in Optimizer.scala.
Physical planning
Spark SQL generates one or more physical plans using the physical operators that match the Spark execution engine, and then applies the cost model to select one of them.
The cost-based optimizer is used only to select join algorithms: for relations that are known to be small, Spark SQL uses a broadcast join, which relies on Spark's peer-to-peer broadcast facility.
The physical planner also performs rule-based physical optimizations, such as pipelining projections or filters into a single Spark map operation. In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown.
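As a rough illustration of the join-selection choice: in a standard Spark session, joining against a DataFrame that is smaller than spark.sql.autoBroadcastJoinThreshold, or explicitly wrapped with the broadcast hint, typically yields a BroadcastHashJoin in the physical plan. The table contents below are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("broadcast-join").getOrCreate()
    import spark.implicits._

    val facts = Seq((1, 100), (2, 200), (3, 300)).toDF("id", "amount") // "large" side
    val dims  = Seq((1, "a"), (2, "b")).toDF("id", "label")            // small side

    // Explicitly hint that the small relation should be broadcast.
    val joined = facts.join(broadcast(dims), "id")

    // The physical plan should show a BroadcastHashJoin rather than a shuffle-based join.
    joined.explain()

    spark.stop()
  }
}
```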
The relevant code is in SparkStrategies.scala.
Code generation
The final phase involves generating Java bytecode to run on each machine. Because Spark SQL often operates on in-memory datasets, where processing is CPU-bound, we wanted to support code generation to speed up execution. The code generation engine acts like a compiler, and Catalyst relies on a special feature of the Scala language, quasiquotes, to simplify code generation. Quasiquotes allow the programmatic construction of an AST in Scala, which can then be fed to the Scala compiler at runtime to generate bytecode. We use Catalyst to transform a tree representing an SQL expression into an AST for Scala code that evaluates that expression, and then compile and run the generated code.
Consider the Add, Attribute, and Literal tree nodes introduced earlier, which can be combined to write expressions such as (x + y) + 1. Without code generation, such an expression would have to be interpreted for each row of data by walking a tree of Add, Attribute, and Literal nodes, introducing many branches and virtual function calls that slow down execution. With code generation, we can write a function that translates a specific expression tree into a Scala AST, as follows:
```scala
def compile(node: Node): AST = node match {
  // constant
  case Literal(value) => q"$value"
  // attribute reference
  case Attribute(name) => q"row.get($name)"
  // "+" operator
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}
```
Strings beginning with q are quasiquotes: although they look like strings, they are parsed by the Scala compiler at compile time and represent the AST of the code inside them.
Quasiquotes can splice variables or other ASTs into themselves, indicated with the $ sign. For example, Literal(1) would become the Scala AST for 1, while Attribute("x") becomes row.get("x"). Finally, a tree like Add(Literal(1), Attribute("x")) becomes the AST for the Scala expression 1 + row.get("x").
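The following minimal sketch shows the quasiquote mechanism itself, independent of Catalyst: an AST is built with the q interpolator, spliced into a larger expression, and then compiled and evaluated at runtime with a reflection ToolBox (this requires the scala-reflect and scala-compiler artifacts on the classpath; the row-as-Seq representation is an assumption made for the example).

```scala
import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

object QuasiquoteDemo {
  def main(args: Array[String]): Unit = {
    val toolbox = currentMirror.mkToolBox()

    // Splice a sub-AST into a larger expression, mirroring the compile() idea above.
    val one: Tree  = q"1"
    val expr: Tree = q"$one + row(0).asInstanceOf[Int]"

    // Wrap the expression in a function, compile it at runtime, and evaluate it.
    val fnTree = q"(row: Seq[Any]) => $expr"
    val fn = toolbox.eval(fnTree).asInstanceOf[Seq[Any] => Int]

    println(fn(Seq(41)))  // prints 42
  }
}
```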
Quasiquotes are type-checked at compile time to ensure that only appropriate ASTs or literals are substituted, which makes them significantly more usable than string concatenation, and they produce the Scala AST directly instead of running the Scala parser at runtime. Moreover, they are highly composable, because the code generation rule for each node does not need to know how the trees for its children were built. Finally, if Catalyst misses any expression-level optimizations, the Scala compiler can optimize the generated code further. The relevant code is in CodeGenerator.scala.