In my previous article, I explained how to use Antlr to generate lexers and parsers. Today we will round out the grammar rules and see how to get the most grammar functionality with the least effort.

Improving the expression grammar

As mentioned in the previous article, Antlr handles left recursion automatically, so we can freely write expression rules in left-recursive form and save ourselves some work. Even so, we would still have to write a separate rule for each operation: one for logical operations, then one for addition, then one for multiplication, and so on, just to encode precedence. That is a bit of a hassle.
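To see concretely what that layered approach looks like, here is a minimal hand-written sketch (my own illustration, not part of playscript): the additive rule calls the multiplicative rule, so '*' binds tighter than '+'. Antlr's ordered alternatives, which we see next, achieve the same effect in a single rule.

```java
// A minimal recursive-descent sketch of precedence-by-layering:
// additive -> multiplicative ('+' multiplicative)*
// multiplicative -> primary ('*' primary)*
// Input is assumed to be pre-tokenized; error handling is omitted.
import java.util.*;

public class Precedence {
    private final List<String> tokens;
    private int pos = 0;

    Precedence(List<String> tokens) { this.tokens = tokens; }

    // additive -> multiplicative ('+' multiplicative)*
    int additive() {
        int value = multiplicative();
        while (pos < tokens.size() && tokens.get(pos).equals("+")) {
            pos++;                       // consume '+'
            value += multiplicative();   // '+' only ever adds whole products
        }
        return value;
    }

    // multiplicative -> primary ('*' primary)*
    int multiplicative() {
        int value = primary();
        while (pos < tokens.size() && tokens.get(pos).equals("*")) {
            pos++;                // consume '*'
            value *= primary();
        }
        return value;
    }

    // primary -> IntLiteral
    int primary() { return Integer.parseInt(tokens.get(pos++)); }

    public static void main(String[] args) {
        Precedence p = new Precedence(Arrays.asList("2", "+", "3", "*", "4"));
        System.out.println(p.additive()); // 14, because '*' binds tighter than '+'
    }
}
```

Because additive can only combine results of multiplicative, the grammar structure itself guarantees that 2 + 3 * 4 parses as 2 + (3 * 4).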

In fact, Antlr can help us go further. We can cover all operations in a single grammar rule and support precedence and associativity in the simplest possible way. In the playscript.g4 grammar file, all the expression rules fit in one short piece of code:

expression
    : primary
    | expression bop='.'
      ( IDENTIFIER
      | functionCall
      | THIS
      )
    | expression '[' expression ']'
    | functionCall
    | expression postfix=('++' | '--')
    | prefix=('+'|'-'|'++'|'--') expression
    | prefix=('~'|'!') expression
    | expression bop=('*'|'/'|'%') expression
    | expression bop=('+'|'-') expression
    | expression ('<' '<' | '>' '>' '>' | '>' '>') expression
    | expression bop=('<=' | '>=' | '>' | '<') expression
    | expression bop=INSTANCEOF typeType
    | expression bop=('==' | '!=') expression
    | expression bop='&' expression
    | expression bop='^' expression
    | expression bop='|' expression
    | expression bop='&&' expression
    | expression bop='||' expression
    | expression bop='?' expression ':' expression
    | <assoc=right>
      expression bop=('=' | '+=' | '-=' | '*=' | '/=' | '&=' | '|=' | '^=' | '>>=' | '>>>=' | '<<=' | '%=') expression
    ;

This rule covers almost all the expressions we need, including dot expressions, increment and decrement expressions, array expressions, bitwise expressions, and so on, some of which we barely touched on before.

So how does it support precedence? It turns out that precedence is determined by the order of the alternatives on the right-hand side. In a standard context-free grammar, the order of productions carries no meaning, but a concrete algorithm tries the productions in a definite order.

That order cannot be arbitrary, because the same grammar may produce different ASTs when derived in different orders. From the standpoint of pure grammar theory this is unacceptable, but in practice it works fine, because grammars are usually paired with algorithms. That is what terms like LL grammar and LR grammar mean: the grammar works under an LL algorithm or an LR algorithm. For example, the earlier addition grammar, the one with the recursive term on the right, causes associativity errors under a recursive-descent algorithm, but under an LR algorithm the problem disappears and the generated AST is completely correct.

additiveExpression
    :   IntLiteral
    |   IntLiteral Plus additiveExpression
    ;
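The associativity problem in that right-recursive grammar can be demonstrated with a small hand-written parser (my own sketch, not playscript code): under recursive descent the right-recursive rule groups "10-5-2" as 10-(5-2) = 7, while converting the recursion into a loop restores the correct left-associative (10-5)-2 = 3.

```java
// Demonstrates why "additive : IntLiteral | IntLiteral Plus additive"
// mis-associates under recursive descent, and how a loop fixes it.
// Token handling is deliberately simplistic; input is assumed well-formed.
import java.util.*;

public class Assoc {
    // Split a string like "10-5-2" into number and operator tokens.
    static List<String> lex(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (Character.isDigit(s.charAt(i))) {
                int j = i;
                while (j < s.length() && Character.isDigit(s.charAt(j))) j++;
                tokens.add(s.substring(i, j));
                i = j;
            } else {
                tokens.add(String.valueOf(s.charAt(i)));
                i++;
            }
        }
        return tokens;
    }

    // additive -> IntLiteral | IntLiteral ('+'|'-') additive
    // The recursion on the right makes operators group to the right.
    static int rightRecursive(List<String> t, int[] pos) {
        int left = Integer.parseInt(t.get(pos[0]++));
        if (pos[0] < t.size()) {
            String op = t.get(pos[0]++);
            int right = rightRecursive(t, pos); // consumes the whole rest first
            return op.equals("+") ? left + right : left - right;
        }
        return left;
    }

    // Same grammar with the recursion turned into a loop: left-associative.
    static int withLoop(List<String> t) {
        int pos = 0;
        int value = Integer.parseInt(t.get(pos++));
        while (pos < t.size()) {
            String op = t.get(pos++);
            int operand = Integer.parseInt(t.get(pos++));
            value = op.equals("+") ? value + operand : value - operand;
        }
        return value;
    }

    public static void main(String[] args) {
        List<String> tokens = lex("10-5-2");
        System.out.println(rightRecursive(tokens, new int[]{0})); // 7 (wrong grouping)
        System.out.println(withLoop(tokens));                     // 3 (correct)
    }
}
```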

Antlr's notation gives extra meaning to the order of productions: it indicates precedence, and the order is used by the algorithm. In that sense we could call these grammars "Antlr grammars," because they are matched to Antlr's algorithms. Of course, that is just a name I made up to keep things clear, so you don't get confused.

Let's look at how Antlr implements associativity in this grammar. In the grammar file, Antlr marks the assignment expression with an <assoc=right> attribute annotation, indicating that assignment is right-associative. Operators without the annotation are left-associative by default; the rest is handled by Antlr's implementation.

Let's keep guessing at Antlr's internal mechanism. We have already analyzed algorithms that ensure correct associativity, such as converting recursion into a loop and then establishing the correct parent-child relationships when building the AST. Does Antlr follow that line of thinking, or does it take another route? You can check the code Antlr generates.

As you think this through, you can see why learning the principles is useful: when you face a tool like Antlr, you can reason about its implementation mechanism.

With this simplified grammar, the AST is also simplified: there are no longer separate addition nodes, multiplication nodes, and so on; they are all unified into expression nodes. You may be asking, "If all expression nodes look the same, how do I tell them apart in the interpreter? How do I know which node is an addition and which is a multiplication?"

We can check whether the current node carries the Token of an operator. For example, if the Token is "||", the node is a logical OR operation. In the grammar, the aliases bop=, postfix=, and prefix= turn certain Tokens into attributes of the expression node. By querying the values of these attributes, we can quickly determine which operation a node performs.

At this point we have finished all the grammar work for expressions, so we can use expressions freely in the scripting language and focus on completing the grammar for the various statements.

Improving the statement grammar

Let’s look at the rules for statements in playscript.g4:

statement
    : blockLabel=block
    | IF parExpression statement (ELSE statement)?
    | FOR '(' forControl ')' statement
    | WHILE parExpression statement
    | DO statement WHILE parExpression ';'
    | SWITCH parExpression '{' switchBlockStatementGroup* switchLabel* '}'
    | RETURN expression? ';'
    | BREAK IDENTIFIER? ';'
    | SEMI
    | statementExpression=expression ';'
    ;

Just like the expression rule, a single statement rule covers a variety of common statements: if statements, for loops, while loops, switch statements, return statements, and so on. An expression followed by a semicolon is also a statement, called an expression statement.

The syntax of these statements is much simpler than the syntax of expressions in terms of parsing difficulty, and the problems of left recursion, precedence, and associativity do not arise here.

If statement

In languages such as C and Java, if statements are usually written like this:

if (condition) do something; else do another thing;

More often, though, if and else are followed by a block of statements beginning and ending with curly braces, as in:

if (condition){
    do something;
} else {
    do another thing;
}

The syntax rules are as follows:

statement
    : ...
    | IF parExpression statement (ELSE statement)?
    ...
    ;

parExpression
    : '(' expression ')'
    ;

Here we used the IF and ELSE keywords and reused the already defined statement and expression rules. As you can see, once the statement and expression rules are designed, other grammar rules can reuse them.

But if statements have their own pitfalls, such as ambiguity. So, next, we analyze the phenomenon of ambiguous grammar by using the if statement.

Resolving grammar ambiguity

When if statements are nested, the else clause can dangle, as in the following code:

if (a > b) if (c > d) do something; else do another thing;

In the code above, I intentionally left out the indentation. Can you tell which if the else pairs with?

If grammar rules are not written carefully, ambiguity can arise: the same sentence can be derived from the grammar in two different ways, producing two different ASTs. Such a grammar is called an ambiguous grammar, for example:

stmt -> if expr stmt
      | if expr stmt else stmt
      | other

Under this grammar, deriving the sample with the first production versus the second production yields different ASTs. In one AST the else pairs with the second if; in the other, the else pairs with the first if.

Most high-level languages parse this sample code into the first AST, pairing the else with the nearest if, which is what this indented version expresses:

if (a > b)
    if (c > d)
        do something;
    else
        do another thing;

Here is an unambiguous version of the grammar:

stmt -> fullyMatchedStmt | partlyMatchedStmt
fullyMatchedStmt -> if expr fullyMatchedStmt else fullyMatchedStmt
                   | other
partlyMatchedStmt -> if expr stmt
                   | if expr fullyMatchedStmt else partlyMatchedStmt

Following the syntax rules above, there is only one way to derive and only one AST can be generated:

When parsing the first if statement, only the partlyMatchedStmt rule can apply; when parsing the second if statement, the fullyMatchedStmt rule applies.
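You can observe the nearest-if pairing directly in a host language. The snippet below (my own illustration; the helper names are invented) shows Java's own dangling-else resolution: the else binds to the inner if, so when the outer condition holds and the inner one fails, the else branch runs.

```java
// Java pairs a dangling else with the nearest unmatched if.
public class DanglingElse {
    static String check(int a, int b, int c, int d) {
        if (a > b)
            if (c > d)
                return "inner-then";
            else                    // pairs with the inner if, not the outer one
                return "inner-else";
        return "outer-skipped";     // reached when the outer condition is false
    }

    public static void main(String[] args) {
        System.out.println(check(5, 1, 1, 5)); // inner-else: outer true, inner false
        System.out.println(check(1, 5, 1, 5)); // outer-skipped: outer false
    }
}
```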

At this point we know that an ambiguous grammar can be fixed by rewriting its rules. How to rewrite them is not as mechanical as eliminating left recursion, but we can borrow from mature experience.

Going back to the grammar we defined for Antlr, it does not look that elaborate, so how does it avoid ambiguity problems? The answer is that Antlr parses with an LL algorithm.

The LL algorithm is depth-first: after parsing the first if statement, it descends a level to create the nested if node, and the else clause is consumed at that inner level, pairing it with the nearest if. If Antlr did not use an LL algorithm, ambiguity could appear. This reaffirms the point we made earlier: grammars often work together with parsing algorithms.

For statement

A for statement would look like this:

for (int i = 0; i < 10; i++){
  println(i);
}

The syntax rules are as follows:

statement
    : ...
    | FOR '(' forControl ')' statement
    ...
    ;

forControl
    : forInit? ';' expression? ';' forUpdate=expressionList?
    ;

forInit
    : variableDeclarators
    | expressionList
    ;

expressionList
    : expression (',' expression)*
    ;

As you can see from the rules above, a for statement is ultimately built out of statements, expressions, and variable declarations.

Now that we are familiar with the grammar of the for statement, I want to mention blocks, which are used in both if and for statements. Here is the grammar for a block, for your reference:

block
    : '{' blockStatements '}'
    ;

blockStatements
    : blockStatement*
    ;

blockStatement
    : variableDeclarators ';'   // variable declaration
    | statement
    | functionDeclaration       // function declaration
    | classDeclaration          // class declaration
    ;

Now we have a pretty good syntax system, and we’ve done almost all of playScript’s syntax design work, except for the function and class syntax that we’ll cover later. Next, we’ll upgrade the script interpreter to support more syntax and refine the code structure by using the Visitor pattern.

Upgrading the script interpreter with the Visitor pattern

In our hand-written script interpreter, we used a single evaluate() method to traverse the tree from top to bottom. As it handles more and more syntax, that one method becomes harder and harder to maintain. The Visitor design pattern instead gives each kind of AST node its own method, making the code clearer and easier to maintain.
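To make the pattern concrete before looking at the Antlr-generated classes, here is a minimal hand-rolled Visitor (my own sketch with invented node names, independent of Antlr): each node type gets its own visit method, so evaluation logic lives in the visitor rather than being tangled into one giant method.

```java
// Minimal Visitor pattern sketch: nodes accept a visitor, and the
// visitor has one method per node type.
interface Node { <T> T accept(Visitor<T> v); }

class IntLit implements Node {
    final int value;
    IntLit(int value) { this.value = value; }
    public <T> T accept(Visitor<T> v) { return v.visitIntLit(this); }
}

class Add implements Node {
    final Node left, right;
    Add(Node left, Node right) { this.left = left; this.right = right; }
    public <T> T accept(Visitor<T> v) { return v.visitAdd(this); }
}

interface Visitor<T> {
    T visitIntLit(IntLit n);
    T visitAdd(Add n);
}

// One concrete visitor: evaluate the tree to an Integer.
class EvalVisitor implements Visitor<Integer> {
    public Integer visitIntLit(IntLit n) { return n.value; }
    public Integer visitAdd(Add n) {
        // Recursively visit children, then combine their results.
        return n.left.accept(this) + n.right.accept(this);
    }
}

public class VisitorDemo {
    public static void main(String[] args) {
        // Represents 2 + (3 + 4)
        Node ast = new Add(new IntLit(2), new Add(new IntLit(3), new IntLit(4)));
        System.out.println(ast.accept(new EvalVisitor())); // 9
    }
}
```

Adding a new operation (say, pretty-printing) only requires a new Visitor implementation; the node classes stay untouched. That is exactly the structure the Antlr-generated base visitor gives us below.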

Antlr helps us generate a framework for Visitor processing patterns, which we type on the command line:

antlr -visitor PlayScript.g4

The -visitor argument tells Antlr to generate the following two interfaces and classes:

public interface PlayScriptVisitor<T> extends ParseTreeVisitor<T> {...}

public class PlayScriptBaseVisitor<T> extends AbstractParseTreeVisitor<T> implements PlayScriptVisitor<T> {...}

In PlayScriptBaseVisitor, you can see a number of visitXXX() methods, one for each AST node, for example:

@Override
public T visitPrimitiveType(PlayScriptParser.PrimitiveTypeContext ctx) {...}

The generic parameter <T> is the type of data returned when each node is visited. In our hand-written version we only dealt with integers, so the return value was Integer. Now we are building a more capable version, and AST nodes may return various types of data, such as:

  • a floating-point operation returns a floating-point number;
  • a character operation returns character data;
  • it may even be a type the programmer designed, such as an instance of a class.

So let's simply have the Visitor always return Object, which works in any situation. Our Visitor then looks like this (the generic parameter is Object):

public class MyVisitor extends PlayScriptBaseVisitor<Object>{
  ...
}

Then, in the visitExpression() method, we can write code to evaluate various expressions, such as addition and subtraction:

public Object visitExpression(ExpressionContext ctx) {
    Object rtn = null;
    // A binary operation: an operator plus two child expressions
    if (ctx.bop != null && ctx.expression().size() >= 2) {
        Object left = visitExpression(ctx.expression(0));
        Object right = visitExpression(ctx.expression(1));
        ...
        Type type = cr.node2Type.get(ctx);
        switch (ctx.bop.getType()) {
        case PlayScriptParser.ADD:
            rtn = add(left, right, type);
            break;
        case PlayScriptParser.SUB:
            rtn = minus(left, right, type);
            break;
        ...
        }
    }
    ...
}

ExpressionContext is the AST node for an expression. It is called a Context because from that node you can reach all of its context information: the parent node, the child nodes, and so on. Each child node is named after the corresponding name in the grammar. For example, the addition and subtraction rule is:

expression bop=('+'|'-') expression 

So we can access child nodes using these methods of ExpressionContext:

ctx.expression();   // returns a list with two members: the expressions on either side of the operator
ctx.expression(0);  // the expression to the left of the operator, another ExpressionContext object
ctx.expression(1);  // the expression to the right of the operator
ctx.bop;            // a Token object whose type is PlayScriptParser.ADD or PlayScriptParser.SUB
ctx.ADD();          // accesses the ADD terminal; non-null when the operation is an addition
ctx.MINUS();        // accesses the MINUS terminal

We can also recursively evaluate the lower nodes during the calculation, as visitExpression(ctx.expression(0)) does in the code above. By the same token, to run the entire script we only need to visit the root node.

In this way, we can implement a visit method for each AST node and upgrade the entire interpreter. Besides expression evaluation, we can also write the evaluation logic for today's if and for statements. Take the for statement as an example:

// Execute the forInit part; it runs only once
if (forControl.forInit() != null) {
    rtn = visitForInit(forControl.forInit());
}

while (true) {
    Boolean condition = true; // If there is no condition part, loop forever
    if (forControl.expression() != null) {
        condition = (Boolean) visitExpression(forControl.expression());
    }
    if (condition) {
        // Execute the loop body
        rtn = visitStatement(ctx.statement(0));
        // Then execute forUpdate, usually a statement like "i++".
        // This execution order must not be wrong.
        if (forControl.forUpdate != null) {
            visitExpressionList(forControl.forUpdate);
        }
    } else {
        break;
    }
}

You need to pay attention to the execution rules for each part of the for statement, such as:

  • the forInit section executes only once;
  • the condition expression executes once per iteration, to decide whether the loop continues;
  • then the body of the for statement executes;
  • finally the forUpdate section executes, usually something like "i++".
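The execution order above is exactly what the visitor loop implements. As a sanity check (my own sketch), a for statement can be desugared into a while loop that makes the order explicit: init once, then repeatedly test the condition, run the body, and run the update.

```java
// A for loop and its explicit while-loop desugaring produce the
// same result, confirming the init/condition/body/update order.
public class ForDesugar {
    public static int sumWithFor(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static int sumDesugared(int n) {
        int sum = 0;
        int i = 0;                         // forInit: runs once
        while (true) {
            boolean condition = i < n;     // condition: checked every iteration
            if (!condition) break;
            sum += i;                      // body
            i++;                           // forUpdate: after the body
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumWithFor(10));   // 45
        System.out.println(sumDesugared(10)); // 45
    }
}
```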

Questions and answers

Nowadays all of these features are implemented in existing languages. How did the earliest languages implement their own analysis?

In the compiler world, there is something called bootstrapping: a language's compiler is written in that language itself. This is a sign of a language's maturity. Early versions of a compiler are written in another language; later versions are compiled in the language's own. Famous languages achieve bootstrapping. For example, the Go compiler is written in Go (earlier versions were written in C), and this was a milestone in Go's development. The earliest compiler for any language must be written in some other language; after a certain point, the language can lift itself up by its own bootstraps.

After the AST is parsed, how does the computer know that the "+" in the AST means addition? Is there an intermediate layer?

"+" performing addition is dictated by the semantics of the language. You could equally make "+" perform string concatenation; that too is a semantic decision. The real differences between computer languages lie in their semantics.

After lexical and syntactic analysis, all you have is a data structure. What you can do with that structure depends on the semantics you attach to it. You can attach "actions" to the AST, for example making "+" add up both sides while traversing the tree. That is what attribute computation does: we treat the value as an attribute and use rules to compute it.
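This can be shown with a tiny illustration (my own sketch): the same "+" node can mean integer addition or string concatenation, depending on which semantic rule we choose when computing the value attribute.

```java
// The syntax ('+' applied to two children) is fixed; the semantics
// (which rule computes the value attribute) is what we choose.
public class PlusSemantics {
    // Semantic rule 1: treat the children as integers and add them.
    static Object addAsInt(Object l, Object r) {
        return (Integer) l + (Integer) r;
    }

    // Semantic rule 2: treat the children as strings and concatenate.
    static Object addAsString(Object l, Object r) {
        return String.valueOf(l) + String.valueOf(r);
    }

    public static void main(String[] args) {
        System.out.println(addAsInt(1, 2));    // 3
        System.out.println(addAsString(1, 2)); // 12
    }
}
```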

On what basis does Antlr resolve this ambiguity?

When an algorithm is implemented, alternatives are matched in a definite order. So even an ambiguous grammar can be parsed deterministically under a particular algorithm.

A strictly unambiguous grammar is more demanding: it must be algorithm-independent, so whether you derive leftmost or rightmost, you get the same result.

The key is to distinguish between “grammar” and “algorithm”. Grammar is ambiguous, but using a specific algorithm is not necessarily ambiguous.

This course is based on the principles of Geek Time compilation course.