The background,

Since it was first written into the government work report in 2014, big data has been developing for seven years. The types of big data also extend from transaction data to interactive and sensor data. The data scale also reached petabytes.

The scale of big data is so large that the acquisition, storage, management and analysis of data are beyond the capabilities of traditional database software tools. In this context, a variety of big data-related tools have emerged to meet the requirements of various business scenarios. From Hadoop Hive, Spark, Presto, Kylin, Druid to non-Hadoop ClickHouse, Elasticsearch…

These big data processing tools have different features and application scenarios, but they provide similar interfaces or operation languages. That is, each component supports THE SQL language. Only based on different application scenarios and features, the implementation of their own SQL dialects. This requires related open source projects to implement SQL parsing themselves. Against this backdrop, ANTLR, the syntax parser generator that was born in 1989, has seen its golden age.

Second, the introduction of

ANTLR is an open source generator for syntax parsers that has been around for more than 30 years. Is an open source project that has stood the test of time. From source code to machine executable, a program basically needs three stages: writing, compiling, and executing.

During compilation, lexical and grammatical analysis is required. ANTLR focuses on the problem of morphing and parsing the source code to produce a tree-like parser. ANTLR supports parsing of almost all major programming languages. As can be seen from ANTLR/GrammarS-V4, ANTLR supports Java,C, Python, SQL and dozens of other programming languages. There is usually no need to extend the programming language, so in most cases these language compilation support is more for study purposes, or for checking syntax and formatting code in various development tools (NetBeans, Intellij).

For SQL, ANTLR is more widely and deeply applied. This is because Hive, Presto, and SparkSQL need to implement customized SQL execution, such as implementing distributed query engines and unique features in various big data scenarios.

Three, based on ANTLR4 to achieve four operations

At the moment we are mainly using ANTLR4. In The book The Definitive ANTLR4 Reference, we describe various interesting application scenarios based on ANTLR4. For example: implement a calculator that supports four operations; JSON and other formatted text parsing and extraction;

Convert JSON to XML; Extract the interface from the Java source code. This section introduces the simple application of Antlr4, and paves the way for the implementation of SQL parsing based on Antlr4. In fact, support for number-crunchingis also a basic requirement for every programming language.

3.1 Self-coding implementation

So what do we do when we don’t have ANTLR4 and we want to implement four operations? One idea is a stack-based implementation. For example, without considering exception handling, the simple four operations code can be implemented as follows:

package org.example.calc;
 
import java.util.*;
 
public class CalcByHand {
    // Define operators and prioritize them. */ has a higher priority
    public static Set<String> opSet1 = new HashSet<>();
    public static Set<String> opSet2 = new HashSet<>();
    static{
        opSet1.add("+");
        opSet1.add("-");
        opSet2.add("*");
        opSet2.add("/");
    }
    public static void main(String[] args) {
        String exp="1 + 3 * 4";
        // Split the expression into tokens
        String[] tokens = exp.split("((? < = [\ \ + | \ \ | | \ \ \ \ * /]) | (? = [\ \ + | \ \ | | \ \ \ \ * /]))");
 
        Stack<String> opStack = new Stack<>();
        Stack<String> numStack = new Stack<>();
        int proi=1;
        // Put them on different stacks based on type
        for(String token: tokens){
            token = token.trim();
 
            if(opSet1.contains(token)){
                opStack.push(token);
                proi=1;
            }else if(opSet2.contains(token)){
                proi=2;
                opStack.push(token);
            }else{
                numStack.push(token);
                // If the operator preceding the operand is a higher-priority operator, the result is pushed
                if(proi==2){ calcExp(opStack,numStack); }}}while(! opStack.isEmpty()){ calcExp(opStack,numStack); } String finalVal = numStack.pop(); System.out.println(finalVal); }private static void calcExp(Stack<String> opStack, Stack<String> numStack) {
        double right=Double.valueOf(numStack.pop());
        double left = Double.valueOf(numStack.pop());
        String op = opStack.pop();
        String val;
        switch (op){
            case "+":
                 val =String.valueOf(left+right);
                break;
            case "-":
                 val =String.valueOf(left-right);
                break;
            case "*":
                val =String.valueOf(left*right);
                break;
            case "/":
                val =String.valueOf(left/right);
                break;
            default:
                throw new UnsupportedOperationException("unsupported"); } numStack.push(val); }}Copy the code

Small amount of code, using the data structure-stack feature, need to control operator precedence, there is no support for parenthesis expressions, and there is no support for expression assignment. Let’s look at the implementation using ANTLR4.

3.2 Implementation based on ANTLR4

The basic process of programming with ANTLR4 is fixed and usually consists of the following three steps:

  • Write the semantic rules of custom syntax according to the rules of ANTLR4 based on requirements, and save them in a file with the suffix G4.

  • Use ANTLR4 tool to process G4 file, generate lexical analyzer, syntax analyzer code, dictionary file.

  • Write code that extends the Visitor class or implements the Listener interface to develop your own business logic code.

Based on the above process, let’s take a look at the details using an existing case.

Step 1: An ANTLr4-based rule defines a syntax file with a g4 suffix for the filename. For example, the syntax rules file that implements the calculator is named Labeledexpr.g4. It reads as follows:

grammar LabeledExpr; // rename to distinguish from Expr.g4
 
prog:   stat+ ;
 
stat:   expr NEWLINE                # printExpr
    |   ID '=' expr NEWLINE         # assign
    |   NEWLINE                     # blank
    ;
 
expr:   expr op=(The '*'|'/') expr      # MulDiv
    |   expr op=('+'|The '-') expr      # AddSub
    |   INT                         # int
    |   ID                          # id
    |   '(' expr ') '                # parens
    ;
 
MUL :   The '*' ; // assigns token name to '*' used above in grammar
DIV :   '/' ;
ADD :   '+' ;
SUB :   The '-' ;
ID  :   [a-zA-Z]+ ;      // match identifiers
INT :   [0-9]+ ;         // match integers
NEWLINE:'\r'? '\n' ;     // return newlines to parser (is end-statement signal)
WS  :   [ \t]+ -> skip ; // toss out whitespace
Copy the code

(Note: This file example is from The Definitive ANTLR4 Reference)

Briefly read the Labeledexpr.g4 file. ANTLR4 rules are defined based on regular expression definitions. Rules are understood from the top down, with each semicolon-terminated statement representing a rule. For example, the first line: Grammar LabeledExpr; It means that our syntax name is LabeledExpr, and this name needs to be consistent with the file name. Java coding has a similar rule: the class name is the same as the class file.

The rule prog indicates that prog is one or more stats.

Rule stat matches three subrules: blank line, expression expr, and assignment expression ID ‘=’ expr.

The expression EXPr ADAPTS five subrules: multiplication and division, addition and subtraction, integer, ID, and parenthesis expressions. Obviously, this is a recursive definition.

Rule **ID: [a-za-z]+** indicates that ID is limited to upper and lower case English strings; INT: [0-9]+; The rule for INT is one or more numbers between 0 and 9, but this definition isn’t really strict. More strictly, it should be limited in length.

Based on the understanding of regular expressions, THE G4 grammar rules of ANTLR4 are relatively easy to understand.

When defining an ANTLR4 rule, note that it is possible for a string to support more than one rule, such as the following two:

ID: [a-zA-Z]+;

FROM: ‘from’;

Obviously, the string “from” satisfies both of the above rules, and the way ANTLR4 handles it is in the order it is defined. In this case, ID is defined before FROM, so the string FROM matches the rule ID first.

In fact, ANTLR4 has done 50% of the work for us after the g4 file is written and defined, and the rest of the development work is based on the interface or abstract class implementation. There are two implementations that handle the generated syntax tree, a Visitor pattern and a Listener(Listener pattern).

3.2.1 Using the Visitor mode

Step 2: Parse the G4 file using the ANTLR4 tool and generate the code. The ANTLR tool parses the G4 file and automatically generates the base code for us. The process diagram is as follows:

The command line is as follows:

antlr4 -package org.example.calc -no-listener -visitor .\LabeledExpr.g4
Copy the code

After the command is executed, the following file is generated:

$tree.. ├ ─ ─ LabeledExpr. G4 ├ ─ ─ LabeledExpr. Tokens ├ ─ ─ LabeledExprBaseVisitor. Java ├ ─ ─ LabeledExprLexer. Java ├ ─ ─ LabeledExprLexer. Tokens ├─ LabeledExprParser. Java ├─ LabeledExprVisitorCopy the code

First develop the entry class calc.java. The Calc class is the entry point to the entire program, and the core code for calling ANTLR4’s Lexer and Parser classes is as follows:

ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
 
EvalVisitor eval = new EvalVisitor();
eval.visit(tree);
Copy the code

Next, define a class that inherits from LabeledExprBaseVisitor and overwrite it as follows:

As you can see from the figure, the generated code corresponds to the rule definition. For example, visitAddSub corresponds to the AddSub rule, and visitId corresponds to the ID rule. And so on… Add and subtract the code is as follows:

/** expr op=('+'|'-') expr */
@Override
public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
    int left = visit(ctx.expr(0));  // get value of left subexpression
    int right = visit(ctx.expr(1)); // get value of right subexpression
    if ( ctx.op.getType() == LabeledExprParser.ADD ) return left + right;
    return left - right; // must be SUB
}
Copy the code

Pretty straightforward. When the code is written, it’s Calc. Run the main function of Calc, enter the corresponding operation expression on the interactive command line, and press Ctrl+D to see the operation result. For example 1 + 3 * 4 = 13.

3.2.2 Using the Listener Mode

Similarly, we can use the Listener mode to implement the four operations. The command line is as follows:

antlr4 -package org.example.calc -listener .\LabeledExpr.g4
Copy the code

The execution of this command also produces the framework code for us. Based on the framework code, we can develop the entry class and the interface implementation class. First develop the entry class calc.java. The Calc class is the entry point to the entire program, and the code for calling ANTLR4’s Lexer and Parser classes is as follows:

ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
 
ParseTreeWalker walker = new ParseTreeWalker();
walker.walk(new EvalListener(), tree);
Copy the code

You can see that the call logic for generating ParseTree is exactly the same. The Listener code is a little more complex and also uses the stack data structure, but it only needs a stack of operands and does not need to control the priority itself. Take AddSub as an example:

@Override
public void exitAddSub(LabeledExprParser.AddSubContext ctx) {
    Double left = numStack.pop();
    Double right= numStack.pop();
    Double result;
    if (ctx.op.getType() == LabeledExprParser.ADD) {
        result = left + right;
    } else {
        result = left - right;
    }
    numStack.push(result);
}
Copy the code

Just take the operands out of the stack and compute them.

3.2.3 summary

The difference between The Listener and Visitor patterns is clearly explained in The Book The Definitive ANTLR 4 Reference:

The Listener mode:

The Visitor pattern:

  • The Listener mode is traversed through the Walker object itself, regardless of the relationships in the syntax tree. Vistor needs to control the access of child nodes. If a child node is missed, the whole child node cannot be accessed.

  • The Listener mode method has no return value, while the Vistor mode can set any return value.

  • The Listener mode has a clear access stack, while the Vistor mode is a method call stack. If the implementation fails, StackOverFlow may occur.

With this simple example, we drive Antlr4 to implement a simple calculator. Learned the application process of ANTLR4. You learned how the G4 syntax file is defined, the Visitor pattern, and the Listener pattern. Through ANTLR4, we generated the ParseTree, and accessed the ParseTree based on Visitor mode and Listener mode, and implemented four operations.

From the above examples, we can find that without ANTLR4, we can write our own algorithm to achieve the same function. However, ANTLR does not care about the expression string parsing process, but only focuses on the specific business implementation, which is very worry and trouble saving.

More importantly, ANTLR4 provides more imaginative abstract logic than self-implementation, rising to the height of methodology, because it is not limited to solving a problem, but solving a class of problems. It can be said that ANTLR is as different from the common area formula and calculus in mathematics as it is from hard-coded problem solving.

Four, reference Presto source code development SQL parser

The previous introduction of the use of ANTLR4 to achieve four operations, its purpose is to understand the application of ANTLR4. The following figure shows our real purpose: to study how ANTLR4 implements SQL statement parsing in Presto.

Supporting a full SQL syntax is a huge project. There is a complete SQLbase.g4 file in presto that defines all the SQL syntax supported by Presto, including DDL syntax and DML syntax. The document system is relatively large, not suitable for learning to explore a specific point in detail.

In order to explore the process of SQL parsing and understand the logic behind SQL execution, I chose to do my own coding experiment on the basis of simply reading related documents. To do this, define a small goal: to implement an SQL parser. Use this parser to implement the SELECT Field from table syntax, to query the specified field from the local CSV data source.

4.1 Crop the SelectBase. G4 file

Start by defining the SelectBase. G4 file based on the same process used to implement the four algorithms. With the Presto source code as a reference, we don’t need to develop SelectBase. G4 ourselves, just need to crop the G4 file based on Presto. The content after cropping is as follows:

grammar SqlBase;
 
tokens {
    DELIMITER
}
 
singleStatement
    : statement EOF
    ;
 
statement
    : query                                                            #statementDefault
    ;
 
query
    :  queryNoWith
    ;
 
queryNoWith:
      queryTerm
    ;
 
queryTerm
    : queryPrimary                                                             #queryTermDefault
    ;
 
queryPrimary
    : querySpecification                   #queryPrimaryDefault
    ;
 
querySpecification
    : SELECT  selectItem (', ' selectItem)*
      (FROM relation (', ' relation)*)?
    ;
 
selectItem
    : expression  #selectSingle
    ;
 
relation
    :  sampledRelation                             #relationDefault
    ;
 
expression
    : booleanExpression
    ;
 
booleanExpression
    : valueExpression             #predicated
    ;
 
valueExpression
    : primaryExpression                                                                 #valueExpressionDefault
    ;
 
primaryExpression
    : identifier                                                                          #columnReference
    ;
 
sampledRelation
    : aliasedRelation
    ;
 
aliasedRelation
    : relationPrimary
    ;
 
relationPrimary
    : qualifiedName                                                   #tableName
    ;
 
qualifiedName
    : identifier ('. ' identifier)*
    ;
 
identifier
    : IDENTIFIER             #unquotedIdentifier
    ;
 
SELECT: 'SELECT';
FROM: 'FROM';
 
fragment DIGIT
    : [0-9]; fragment LETTER : [A-Z] ; IDENTIFIER : (LETTER |'_') (LETTER | DIGIT | '_' | The '@' | ':')*
    ;
 
WS
    : [ \r\n\t]+ -> channel(HIDDEN)
    ;
 
// Catch-all for anything we can't recognize.
// We use this to be able to ignore and recover all the text
// when splitting statements with DelimiterLexer
UNRECOGNIZED
    : .
    ;
Copy the code

Compared to the 700-plus lines of rules in the Presto source code, we trimmed them to 1/10 of their size. SELECT selectItem (‘,’ selectItem)* (FROM relation (‘,’ relation)*)

By understanding the G4 file, we can also understand the composition of our query more clearly. For example, the most common query data source is a table. But in SQL syntax, our query data table is abstracted as Relation.

This relation may come from a specific data table, or a subquery, or a JOIN, or a sample of data, or the unnest of an expression. In the field of big data, such an extension will greatly facilitate data processing.

For example, using unnest syntax to parse complex types of data, SQL looks like this:

Although SQL is more complex, understanding the G4 file gives you a clear understanding of its structural divisions. Going back to the selectBase.g4 file, we also process the g4 file using the Antlr4 command to generate code:

antlr4 -package org.example.antlr -no-listener -visitor .\SqlBase.g4
Copy the code

This generates the basic framework code. The next step is to handle the business logic yourself.

4.2 Traversing syntax tree encapsulates SQL structure information

The next step is to define the node types of the syntax tree based on the SQL syntax, as shown in the figure below.

With this class diagram, you can clearly see the basic elements of the SQL syntax.

You then implement your own parsing class, AstBuilder, based on the Visitor pattern (again, trimmed from the Presto source code to simplify matters). Take the example of handling the querySpecification rule code:

@Override
public Node visitQuerySpecification(SqlBaseParser.QuerySpecificationContext context)
{
    Optional<Relation> from = Optional.empty();
    List<SelectItem> selectItems = visit(context.selectItem(), SelectItem.class);
 
    List<Relation> relations = visit(context.relation(), Relation.class);
    if(! relations.isEmpty()) {// synthesize implicit join nodes
        Iterator<Relation> iterator = relations.iterator();
        Relation relation = iterator.next();
 
        from = Optional.of(relation);
    }
 
    return new QuerySpecification(
            getLocation(context),
            new Select(getLocation(context.SELECT()), false, selectItems),
            from);
}
Copy the code

Through the code, we have resolved the query data source and the specific field, encapsulated in the QuerySpecification object.

4.3 Using the Statement object to query data

From the previous example of implementing the four algorithms, we know that ANTLR parses user input into a ParseTree. Business developers to implement the relevant interface ParseTree. Presto parses input SQL statements, generates a ParseTree, and traverses the ParseTree to generate the Statement object. The core code is as follows:

SqlParser sqlParser = new SqlParser();
Statement statement = sqlParser.createStatement(sql);
Copy the code

Now that we have a Statement object how do we use it? Combining the previous class diagram, we can see that:

  • Statement of type Query has a QueryBody attribute.

  • QueryBody of type QuerySpecification has select and FROM properties.

From this structure, we can clearly obtain the elements necessary to implement the SELECT query:

  • Get the target Table from the FROM attribute. The convention is that the table name is the same as the CSV file name.

  • The target field SelectItem is obtained from the SELECT attribute. The first line of CSV is the title line.

The entire business process is clear. After parsing the SQL statement to generate the Statement object, follow these steps:

  • S1: obtain the data table and fields to be queried.

  • S2: goes to the data file by the name of the data table and reads the data of the data file.

  • S3: Formats the output field name to the command line.

  • S4: Formats output fields to the command line.

To simplify the logic, the code handles only the mainline and does no exception handling.

/** * Obtain the name of the table and field to be queried */
QuerySpecification specification = (QuerySpecification) query.getQueryBody();
Table table= (Table) specification.getFrom().get();
List<SelectItem> selectItems = specification.getSelect().getSelectItems();
List<String> fieldNames = Lists.newArrayList();
for(SelectItem item:selectItems){
    SingleColumn column = (SingleColumn) item;
    fieldNames.add(((Identifier)column.getExpression()).getValue());
}
 
/** * Determine the data source file for the query based on the table name */
String fileLoc = String.format("./data/%s.csv",table.getName());
 
/** * Reads the specified fields */ from the CSV file
Reader in = new FileReader(fileLoc);
Iterable<CSVRecord> records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in);
List<Row> rowList = Lists.newArrayList();
for(CSVRecord record:records){
    Row row = new Row();
    for(String field:fieldNames){
        row.addColumn(record.get(field));
    }
    rowList.add(row);
}
 
/** * format output to the console */
int width=30;
String format = fieldNames.stream().map(s-> "% -"+width+"s").collect(Collectors.joining("|"));
System.out.println( "|"+String.format(format, fieldNames.toArray())+"|");
 
int flagCnt = width*fieldNames.size()+fieldNames.size();
String rowDelimiter = String.join("", Collections.nCopies(flagCnt, "-"));
System.out.println(rowDelimiter);
for(Row row:rowList){
    System.out.println( "|"+String.format(format, row.getColumnList().toArray())+"|");
}
Copy the code

The code is only for demonstration function, not considering the exception logic, such as the query field does not exist, CSV file definition field name does not meet the requirements and other issues.

4.4 Realization of effect display

In our project data directory, store the following CSV file:

The sample data of the cities.csv file is as follows:

"LatD"."LatM"."LatS"."NS"."LonD"."LonM"."LonS"."EW"."City"."State"
   41.5.59."N".80.39.0."W"."Youngstown", OH
   42.52.48."N".97.23.23."W"."Yankton", SD
   46.35.59."N".120.30.36."W"."Yakima", WA
   42.16.12."N".71.48.0."W"."Worcester", MA
Copy the code

Run the code to query the data. Use SQL statements to specify fields to query from the CSV file. The result similar to SQL query is as follows:

SQL Example 1: select City, City from cities

SQL Example 2: select name, age from Employee

This section describes how to crop the G4 rule file based on the Presto source code, and then query the data from the CSV file using SQL statements based on the Antlr4 implementation. Based on the cutting of Presto source code experiments, for the study of SQL engine implementation, understanding Presto source code can play a certain role.

Five, the summary

Based on four algorithms and two cases of using SQL to query CSV data, this paper describes the application ideas and process of ANTLR4 in project development. The related code can be seen on Github. Understanding the usage of ANTLR4 can help you understand the definition rules and execution process of SQL, and help you write efficient SQL statements in business development. It is also useful to understand how to compile, define your own DSL, and abstract your business logic. On paper come zhongjue shallow, must know this to practice. It is also fun to explore the source code implementation in the way described in this article.

The resources

1. The Definitive ANTLR4 Reference

2. Presto official documents

3. ANTLR 4 Brief Tutorial

4, Calc class source code

5, EvalVisitor class source code

6, Presto source code

Author: Vivo Internet Development team -Shuai Guangying