preface

The previous articles introduced ShardingSphere-JDBC. The sharding process consists of several sub-processes: parsing, routing, rewriting, execution, and merging. Each sub-process has a corresponding engine to handle it; this article focuses on the parsing engine sub-process.

Sharding process

Before introducing the parsing engine, let's take a brief look at each sub-process. First, the user operates on logical tables, which are eventually replaced by physical tables, so the SQL must be parsed, that is, understood. Then, according to the sharding routing algorithm, each table is routed to the appropriate database. Next, the real SQL must be generated so that it can be executed; multiple SQL statements may be generated, and each one must be executed. Finally, the results of the multiple executions are merged and a single result set is returned. The whole process is roughly as follows (diagram from the official website):

It consists of SQL parsing => executor optimization => SQL routing => SQL rewriting => SQL execution => result merging. Each sub-process has a dedicated engine:

  • SQL parsing: divided into lexical parsing and syntax parsing. The SQL is first broken down into indivisible words by the lexical parser; the syntax parser then understands the SQL and extracts the parsing context. The parsing context includes tables, selection items, sorting items, grouping items, aggregation functions, paging information, query conditions, and markers for placeholders that may need to be modified;
  • Executor optimization: merges and optimizes sharding conditions, such as OR conditions;
  • SQL routing: matches the sharding strategy configured by the user against the parsing context and generates routing paths. Sharding routing and broadcast routing are currently supported;
  • SQL rewriting: rewrites SQL into statements that can be executed correctly against the real databases. SQL rewriting is divided into correctness rewriting and optimization rewriting;
  • SQL execution: executes asynchronously through a multi-threaded executor;
  • Result merging: merges the multiple execution result sets for output through a unified JDBC interface. Result merging includes streaming merge, in-memory merge, and append merge, implemented with the decorator pattern.
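The six sub-processes above can be sketched as an ordered pipeline. The class and stage names below are illustrative only, not ShardingSphere's actual API:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the sharding pipeline; names are hypothetical,
// not ShardingSphere's real classes.
public class ShardingPipeline {
    // Each stage transforms the logical SQL one step closer to real execution.
    public static List<String> stages() {
        return Arrays.asList(
            "parse",     // logical SQL -> parsing context (AST)
            "optimize",  // merge/optimize sharding conditions
            "route",     // parsing context + sharding strategy -> routing paths
            "rewrite",   // logical SQL -> one or more real SQLs
            "execute",   // run real SQLs via the multi-threaded executor
            "merge");    // merge the result sets into one
    }
}
```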

This article focuses on SQL parsing, but before doing so we need to have a general understanding of ANTLR’s core components;

About ANTLR

ANTLR (Another Tool for Language Recognition) is a powerful parser generator that can be used to read, process, execute, or translate structured text or binary files. It is widely used to build languages, tools, and frameworks. From a grammar, ANTLR can generate a parser that builds and traverses parse trees.

ANTLR official address: www.antlr.org

ANTLR consists of two parts:

  • A tool that translates a user-defined grammar into a parser/lexer in Java, corresponding to antlr-complete.jar;
  • The runtime library required by the generated parser, corresponding to antlr-runtime.jar;

ANTLR grammar

ANTLR grammar files end with .g4 by default. A grammar definition file generally has the following common structure:

/** Optional javadoc style comment */
grammar Name;
options {...}
import ... ;
tokens {...}
channels {...} // lexer only
@actionName {...}
rule1 // parser and lexer rules, possibly intermingled
...
ruleN
  • Grammar: The name of the grammar must be the same as the file name; the prefixes lexer and parser can be included, as follows:

    lexer grammar MySqlLexer;
    parser grammar MySqlParser;
  • Options: Many options can be specified at the grammar and rule-element level. At the grammar level, they include superClass, language, tokenVocab, TokenLabelType, contextSuperClass, etc.:

    options { tokenVocab=MySqlLexer; }
  • Import: Splits a syntax into logical, reusable chunks, sort of like superclasses;

  • Tokens: defines token types for the grammar that have no associated lexical rules;

    // explicitly define keyword token types to avoid implicit definition warnings
    tokens { BEGIN, END, IF, THEN, WHILE }

    @lexer::members {
        // keywords map used in lexer to assign token types
        Map<String,Integer> keywords = new HashMap<String,Integer>() {{
            put("begin", KeywordsParser.BEGIN);
            put("end", KeywordsParser.END);
            ...
        }};
    }
  • Channels: Only the lexer grammar can contain custom channels, such as:

    channels {
      WHITESPACE_CHANNEL,
      COMMENTS_CHANNEL
    }

    The above channels can be used like enumerations in a lexer rule:

    WS : [ \r\t\n]+ -> channel(WHITESPACE_CHANNEL) ;
  • ActionName: currently, only two named actions (for the Java target) are used outside the grammar rules: header and members. The former injects code into the generated recognizer class file before the recognizer class definition, while the latter injects code into the recognizer class definition as fields and methods.

  • Rules can be divided into Lexer Rules and Parser Rules. The rule format is as follows:

    ruleName
        : alternative1
        | ...
        | alternativeN
        ;

    Lexer Rules: Names begin with a capital letter;

    Parser Rules: The name starts with a lowercase letter.
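Putting the naming convention together, a minimal illustrative grammar (not taken from the MySQL grammar files) pairs a lowercase parser rule with uppercase lexer rules:

```
// Illustrative grammar: parser rule names start lowercase, lexer rule names uppercase.
grammar Tiny;

stat   : SELECT ID ;          // parser rule
SELECT : 'select' ;           // lexer rule
ID     : [a-zA-Z_]+ ;         // lexer rule
WS     : [ \t\r\n]+ -> skip ; // lexer rule, discards whitespace
```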

For more information, please refer to github.com/antlr/antlr…

Using ANTLR

Configure the environment

First, you need to download the antlr-complete.jar file from the official website; the version used here is 4.7.2. Then you need to configure the CLASSPATH:

.;%JAVA_HOME%\lib;%JAVA_HOME%\lib\tools.jar;E:\antlr\antlr-4.7.2-complete.jar

Check for success:

E:\antlr>java org.antlr.v4.Tool
ANTLR Parser Generator Version 4.7.2
 -o ___              specify output directory where all output is generated
 -lib ___            specify location of grammars, tokens files
 -atn                generate rule augmented transition network diagrams
......

Grammar file

We need to define our own grammar file based on the syntax provided by ANTLR, such as Hello.g4:

// Define a grammar called Hello
grammar Hello;
r  : 'hello' ID ;         // match keyword hello followed by an identifier
ID : [a-z]+ ;             // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

Working with syntax files

Run ANTLR with the following command:

E:\antlr>java -jar antlr-4.7.2-complete.jar Hello.g4

The following files are generated in the current directory:

HelloParser.java
HelloLexer.java
HelloListener.java
HelloBaseListener.java
HelloLexer.tokens
Hello.tokens
HelloLexer.interp
Hello.interp

test

First you need to compile the Java class generated above:

E:\antlr>javac Hello*.java

Run the following command to display the tree:

E:\antlr>java org.antlr.v4.gui.TestRig Hello  r -gui
hello zhaohui
^Z

Note: use Ctrl+D on Unix, Ctrl+Z on Windows;

plug-in

In addition to the method above, you can also use a plug-in directly in the IDE; the plug-ins for the various IDEs are listed on the official website:

Plug-in address: www.antlr.org/tools.html

Working with syntax files

Right-click the Hello.g4 file and select "Configure ANTLR…", as follows:

Some of the more important configurations are: the output location of the generated files, the package name for the generated classes, and the parse tree traversal mode;

The parse tree traversal can be configured in two modes: listener mode and visitor mode.

test

In IDEA, open Hello.g4, right-click a rule, and select "Test Rule"; the ANTLR preview view is shown as follows:

Code implementation

With the above tests working, you can use code to retrieve the parse tree and traverse it; consider the following simple example:

public class HelloDemo {
    public static void main(String[] args) {
        CharStream input = CharStreams.fromString("hello zhaohui");
        HelloLexer lexer = new HelloLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        HelloParser parser = new HelloParser(tokens);
        ParseTree tree = parser.r();
        System.out.println(tree.toStringTree(parser));
    }
}

The following output is displayed:

(r hello zhaohui)

Parsing engine

The parsing process is divided into lexical parsing and syntax parsing. The lexical parser disassembles SQL into indivisible atomic symbols called tokens, and classifies them into keywords, expressions, literals, and operators based on the dictionaries of the different database dialects. The syntax parser then converts the lexical parser's output into an abstract syntax tree.
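To make tokenization concrete, here is a toy lexer; it is not ShardingSphere's (the real dialect lexers are generated by ANTLR), just a sketch of splitting a simple statement into classified tokens:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy lexer illustrating lexical parsing into classified tokens.
public class ToyLexer {
    private static final Set<String> KEYWORDS =
        new HashSet<>(Arrays.asList("SELECT", "FROM", "WHERE"));

    public static List<String> tokenize(String sql) {
        List<String> tokens = new ArrayList<>();
        // Split on whitespace, then peel off '=' as its own token.
        for (String word : sql.trim().split("\\s+")) {
            for (String part : word.split("(?<==)|(?==)")) {
                if (!part.isEmpty()) {
                    tokens.add(classify(part) + ":" + part);
                }
            }
        }
        return tokens;
    }

    private static String classify(String token) {
        if (KEYWORDS.contains(token.toUpperCase())) return "KEYWORD";
        if (token.equals("=")) return "OPERATOR";
        if (token.matches("\\d+")) return "LITERAL";
        return "IDENTIFIER";
    }
}
```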

Starting with version 3.0.x, ANTLR is used as the parser generator, with its own dialect grammar for each supported database and its own parser generated per database. As shown above, ANTLR can automatically generate the required parsers, provided we have the Lexer and Parser grammar files.

Grammar file

ANTLR provides grammar files for all kinds of databases on GitHub, at the following path:

File path: github.com/antlr/gramm…

MySQL, for example, has two files:

MySqlLexer.g4
MySqlParser.g4

In shardingsphere-sql-parser-mysql, you can find the auto-generated classes:

We can also run a simple test in IDEA by entering a common query SQL:

SELECT * FROM ORDER WHERE USER_ID=111;

The resulting tree structure is as follows:

Parsing engine

ShardingSphere-JDBC provides the parsing engine class SQLParserEngine, whose core method is as follows:

private SQLStatement parse0(final String sql, final boolean useCache) {
    if (useCache) {
        Optional<SQLStatement> cachedSQLStatement = cache.getSQLStatement(sql);
        if (cachedSQLStatement.isPresent()) {
            return cachedSQLStatement.get();
        }
    }
    ParseTree parseTree = new SQLParserExecutor(databaseTypeName, sql).execute().getRootNode();
    SQLStatement result = (SQLStatement) ParseTreeVisitorFactory.newInstance(databaseTypeName, VisitorRule.valueOf(parseTree.getClass())).visit(parseTree);
    if (useCache) {
        cache.put(sql, result);
    }
    return result;
}

The two parameters are the logical SQL and whether to use the cache; the return value is an SQLStatement. The two key steps are converting the logical SQL into a ParseTree, and visiting the ParseTree to obtain the SQLStatement.
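The cache-first behaviour of parse0 can be sketched in isolation. This stand-in uses a plain ConcurrentHashMap and string results instead of the real cache and SQLStatement types; only the control flow mirrors the method above:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified parse-cache sketch: parse at most once per distinct SQL when useCache is true.
public class ParseCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    public final AtomicInteger parseCount = new AtomicInteger();

    // Stand-in for the real parse: just records that parsing happened.
    private String doParse(String sql) {
        parseCount.incrementAndGet();
        return "Statement(" + sql + ")";
    }

    public String parse(String sql, boolean useCache) {
        if (useCache) {
            Optional<String> cached = Optional.ofNullable(cache.get(sql));
            if (cached.isPresent()) {
                return cached.get();
            }
        }
        String result = doParse(sql);
        if (useCache) {
            cache.put(sql, result);
        }
        return result;
    }
}
```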

Convert ParseTree

To convert SQL into a ParseTree, we first need to get a Parser; and to get a Parser, we need a Lexer:

private static SQLParser createSQLParser(final String sql, final SQLParserConfiguration configuration) {
    Lexer lexer = (Lexer) configuration.getLexerClass().getConstructor(CharStream.class).newInstance(CharStreams.fromString(sql));
    return configuration.getParserClass().getConstructor(TokenStream.class).newInstance(new CommonTokenStream(lexer));
}

Different database types get different Lexers and SQLParsers; ShardingSphere-JDBC provides support for a variety of databases:

  • Lexer: MySQLLexer, OracleLexer, PostgreSQLLexer, SQL92Lexer, SQLServerLexer;

  • SQLParser: MySqlParser, OracleParser, PostgreSQLParser, SQL92Parser, SQLServerParser;
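The reflective construction in createSQLParser can be sketched with dummy stand-in classes. FakeLexer and FakeParser below are hypothetical; only the reflection wiring mirrors the real code:

```java
import java.lang.reflect.Constructor;

// Dummy stand-ins for the ANTLR-generated classes, showing the reflective wiring.
public class ReflectiveFactoryDemo {
    public static class FakeLexer {
        public final String input;
        public FakeLexer(String input) { this.input = input; }
    }
    public static class FakeParser {
        public final FakeLexer lexer;
        public FakeParser(FakeLexer lexer) { this.lexer = lexer; }
    }

    // Mirrors createSQLParser: look up the single-argument constructor and invoke it.
    public static FakeParser create(String sql) {
        try {
            Constructor<FakeLexer> lexerCtor = FakeLexer.class.getConstructor(String.class);
            FakeLexer lexer = lexerCtor.newInstance(sql);
            Constructor<FakeParser> parserCtor = FakeParser.class.getConstructor(FakeLexer.class);
            return parserCtor.newInstance(lexer);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```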

Take MySQLParser as an example:

public final class MySQLParser extends MySQLStatementParser implements SQLParser {
    
    public MySQLParser(final TokenStream input) {
        super(input);
    }
    
    @Override
    public ASTNode parse() {
        return new ParseASTNode(execute());
    }
}

The execute method called inside parse comes from MySQLStatementParser, which is generated automatically by ANTLR.

Get SQLStatement

Once we have the ParseTree, we need to traverse the tree to obtain the SQLStatement. ShardingSphere-JDBC traverses with a visitor by default: the visitor walks the abstract syntax tree to build the domain model (SQLStatement), through which the context required by sharding is extracted and the positions that may need rewriting are marked. Likewise, each database provides its own visitor; the currently supported databases include:

Visitor: MySQLVisitor, OracleVisitor, PostgreSQLVisitor, SQL92Visitor, SQLServerVisitor;
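The visitor mechanism itself can be sketched outside ANTLR. The node and visitor classes below are a toy model, not ShardingSphere's visitor hierarchy; they only show how a visitor walks a tree and collects the tables a statement references:

```java
import java.util.ArrayList;
import java.util.List;

// Toy parse-tree visitor: walk nodes and collect table names into a "statement" model.
public class VisitorSketch {
    public interface Node { void accept(Visitor v); }
    public interface Visitor { void visitTable(TableNode n); }

    public static class TableNode implements Node {
        public final String name;
        public TableNode(String name) { this.name = name; }
        public void accept(Visitor v) { v.visitTable(this); }
    }

    public static class SelectNode implements Node {
        public final List<Node> children = new ArrayList<>();
        public void accept(Visitor v) {
            for (Node child : children) { child.accept(v); } // visit the subtree
        }
    }

    // Visitor that extracts the tables referenced by the statement.
    public static class TableCollector implements Visitor {
        public final List<String> tables = new ArrayList<>();
        public void visitTable(TableNode n) { tables.add(n.name); }
    }

    public static List<String> tablesOf(SelectNode select) {
        TableCollector collector = new TableCollector();
        select.accept(collector);
        return collector.tables;
    }
}
```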

SQLStatement

The visitor generates the SQLStatement, which has the following subtypes:

  • DALStatement: database administration statements, including show databases, show tables, etc.;
  • DMLStatement: Data Manipulation Language, covering insert, delete, update, and query operations;
  • DCLStatement: Data Control Language, used to grant and revoke database access privileges;
  • DDLStatement: Data Definition Language, used to create, modify, and delete tables;
  • RLStatement: Replication, including primary-secondary replication;
  • TCLStatement: Transaction Control Language, used to control transactions;
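As a rough illustration of these categories, a toy classifier keyed on a statement's leading keyword might look like this; in the real engine, dispatch happens through the grammar rules, not string matching:

```java
import java.util.Locale;

// Toy classifier mapping a statement's leading keyword to the categories above.
public class StatementClassifier {
    public static String classify(String sql) {
        String first = sql.trim().split("\\s+")[0].toUpperCase(Locale.ROOT);
        switch (first) {
            case "SELECT": case "INSERT": case "UPDATE": case "DELETE":
                return "DML";
            case "CREATE": case "ALTER": case "DROP":
                return "DDL";
            case "GRANT": case "REVOKE":
                return "DCL";
            case "BEGIN": case "COMMIT": case "ROLLBACK":
                return "TCL";
            case "SHOW":
                return "DAL";
            default:
                return "UNKNOWN";
        }
    }
}
```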

If you open the grammar file of the corresponding database, you can find the corresponding rules; for example, in MySqlParser:

sqlStatement
    : ddlStatement | dmlStatement | transactionStatement
    | replicationStatement | preparedStatement
    | administrationStatement | utilityStatement
    ;

Each of the above types provides its own visitor:

DALVisitor, DCLVisitor, DDLVisitor, DMLVisitor, RLVisitor, TCLVisitor

DMLStatement

For example, the most common query SQL generates a DMLStatement. Its common subclasses are:

DMLStatement: CallStatement, DeleteStatement, DoStatement, InsertStatement, ReplaceStatement, SelectStatement, and UpdateStatement.

The corresponding syntax files also have corresponding relationships:

dmlStatement
    : selectStatement | insertStatement | updateStatement
    | deleteStatement | replaceStatement | callStatement
    | loadDataStatement | loadXmlStatement | doStatement
    | handlerStatement
    ;

Each of the above operation types must be overridden in the corresponding visitor. In the case of MySQL, the corresponding DMLVisitor is MySQLDMLVisitor, which overrides the select-related methods:

@Override
public ASTNode visitSelect(final SelectContext ctx) {
    // TODO :Unsupported for withClause.
    SelectStatement result = (SelectStatement) visit(ctx.unionClause());
    result.setParameterCount(getCurrentParameterIndex());
    return result;
}

@SuppressWarnings("unchecked")
@Override
public ASTNode visitSelectClause(final SelectClauseContext ctx) {
    SelectStatement result = new SelectStatement();
    result.setProjections((ProjectionsSegment) visit(ctx.projections()));
    if (null != ctx.selectSpecification()) {
        result.getProjections().setDistinctRow(isDistinct(ctx));
    }
    if (null != ctx.fromClause()) {
        CollectionValue<TableReferenceSegment> tableReferences = (CollectionValue<TableReferenceSegment>) visit(ctx.fromClause());
        for (TableReferenceSegment each : tableReferences.getValue()) {
            result.getTableReferences().add(each);
        }
    }
    if (null != ctx.whereClause()) {
        result.setWhere((WhereSegment) visit(ctx.whereClause()));
    }
    if (null != ctx.groupByClause()) {
        result.setGroupBy((GroupBySegment) visit(ctx.groupByClause()));
    }
    if (null != ctx.orderByClause()) {
        result.setOrderBy((OrderBySegment) visit(ctx.orderByClause()));
    }
    if (null != ctx.limitClause()) {
        result.setLimit((LimitSegment) visit(ctx.limitClause()));
    }
    if (null != ctx.lockClause()) {
        result.setLock((LockSegment) visit(ctx.lockClause()));
    }
    return result;
}
SelectStatement

The SelectStatement corresponding to the SQL query is as follows:

public final class SelectStatement extends DMLStatement {
    
    private ProjectionsSegment projections;
    private final Collection<TableReferenceSegment> tableReferences = new LinkedList<>();
    private WhereSegment where;
    private GroupBySegment groupBy;
    private OrderBySegment orderBy;
    private LimitSegment limit;
    private SelectStatement parentStatement;
    private LockSegment lock;
}

Each clause of the query SQL is parsed into a corresponding Segment of the SelectStatement; the other statement types, whose code is not posted here, likewise generate their own Segments. Finally, the SQLStatement is wrapped as an SQLStatementContext for the downstream routing engine to use;

Similarly, syntax files have corresponding relationships:

selectStatement
    : querySpecification lockClause?                                #simpleSelect
    | queryExpression lockClause?                                   #parenthesisSelect
    | querySpecificationNointo unionStatement+
        (
          UNION unionType=(ALL | DISTINCT)?
          (querySpecification | queryExpression)
        )?
        orderByClause? limitClause? lockClause?                     #unionSelect
    | queryExpressionNointo unionParenthesis+
        (
          UNION unionType=(ALL | DISTINCT)?
          queryExpression
        )?
        orderByClause? limitClause? lockClause?                     #unionParenthesisSelect
    ;

conclusion

This article focused on the parsing sub-process of the overall sharding flow. The core of parsing is ANTLR: once you understand ANTLR's grammar and its traversal modes, the parsing engine is basically not difficult. ANTLR's official documentation is fairly comprehensive and worth reading for those interested. The following articles will continue with the routing mechanism of sharding.

reference

Shardingsphere.apache.org/document/cu…

Thanks for your attention

You can follow the WeChat official account "Roll back the code" to read new articles first; the articles are continually updated, focusing on Java source code, architecture, algorithms, and interviews.