In the last couple of chapters we covered lexical analysis and grammatical analysis, but the lexical and grammar rules in those examples were highly simplified. They are easy to understand and to implement as a simple prototype, yet not enough for practical use. In practice, a good compiler has to do a lot of work on both the lexical and the syntactic side, as the figure below illustrates.

If a compiler has to do that much work, writing it all by hand would be rather inefficient, so we can turn to existing tools. There are many compiler front-end tools, such as Lex (and its GNU counterpart Flex), Yacc (and its GNU counterpart Bison), JavaCC, and so on. There are two main reasons for choosing Antlr:

  • The first reason is that Antlr supports a wider range of target languages, including Java, C#, JavaScript, Python, Go, C++, and Swift. Whichever of these you use, Antlr can generate the lexing and parsing code for you.
  • The second reason is that Antlr’s syntax is simpler, and it handles common difficulties such as left recursion inside the tool itself, which is a great help to productivity.

Getting to know Antlr

Antlr is an open source tool that supports the generation of lexers and parsers from rule files and is itself implemented in Java.

You can download the Antlr grammar rule library to your local PC for easy viewing.

You can download the Antlr tool and configure it according to the instructions. You will also need to configure the Java environment on your machine (you can find the latest version of the JDK on the Oracle website).

On a Mac, for example:

$ cd /usr/local/lib
$ sudo curl -O https://www.antlr.org/download/antlr-4.9.2-complete.jar
$ vim ~/.bash_profile      # then add the following lines
export CLASSPATH=".:/usr/local/lib/antlr-4.9.2-complete.jar:$CLASSPATH"
alias antlr4='java -jar /usr/local/lib/antlr-4.9.2-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'

$ source ~/.bash_profile
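If everything is configured correctly, typing antlr4 on its own is a quick sanity check of the classpath and the alias (assuming, of course, that the jar path above matches where you actually put the file):

$ antlr4      # prints the Antlr version banner followed by the list of options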

Generate a lexical analyzer with Antlr

Antlr generates the lexer and parser by processing a rule file. Rule files end in .g4. Lexical rules and grammar rules can be placed in the same file, but for the sake of clarity, let’s split them into two files and write the lexical rules in one file first.

As a simple exercise, create a Hello.g4 file to hold the lexical rules, and write in some of the lexical rules we have used before.

lexer grammar Hello;  // 'lexer grammar' means this is a lexical rule file named Hello, the same as the file name

// Keywords
If :                'if';
Int :               'int';

// Literals
IntLiteral:         [0-9]+;
StringLiteral:      '"' .*? '"' ;          // String literal

// Operators
AssignmentOP:       '=' ;
RelationalOP:       '>' | '>=' | '<' | '<=' ;
Star:               '*';
Plus:               '+';
Sharp:              '#';
SemiColon:          ';';
Dot:                '.';
Comm:               ',';
LeftBracket :       '[';
RightBracket:       ']';
LeftBrace:          '{';
RightBrace:         '}';
LeftParen:          '(';
RightParen:         ')';

// Identifiers
Id :                [a-zA-Z_] ([a-zA-Z_] | [0-9])*;

// Whitespace characters are discarded
Whitespace:         [ \t]+ -> skip;
Newline:            ('\r' '\n'? | '\n') -> skip;

Each lexical rule begins with a capital letter, which is Antlr’s convention for lexical rules. Grammar rules start with a lowercase letter. Each of these rules is written using regular expressions that we already know.

Next, we compile the lexical rules by typing this command in the terminal:

antlr4 Hello.g4

This command tells Antlr to compile the rule file and generate a Hello.java file plus a couple of auxiliary files; you can open them and have a look at their contents. Next, compile Hello.java with the following command:

javac *.java

The result is a Hello.class file, which is the lexer we generated. Next, let’s write a script file for the generated lexer to analyze:

int age = 45;
if (age >= 17+8+20){
  printf("Hello old man!");
}

Save the above script as a hello.play file and type the following command on the terminal:

grun Hello tokens -tokens hello.play

The grun command actually runs the lexer we just generated, the Hello class, and prints the result of tokenizing hello.play:
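For the hello.play file above, the printed token stream looks roughly like this (a sketch based on the token format Antlr uses; only the first few lines are shown):

[@0,0:2='int',<Int>,1:0]
[@1,4:6='age',<Id>,1:4]
[@2,8:8='=',<AssignmentOP>,1:8]
[@3,10:11='45',<IntLiteral>,1:10]
[@4,12:12=';',<SemiColon>,1:12]
...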

As you can see from the results, our lexical analyzer recognizes each Token and records its location, text value, and category in the code. These are all attributes of the Token.

Take the second line, [@1,4:6='age',<Id>,1:4], as an example: @1 is the serial number of the Token, indicating that this is Token number 1; 4:6 is the start and end position of the Token in the character stream; age is its text; Id is its Token category; and the final 1:4 means the Token sits at line 1, column 4 of the source code.
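By the way, grun is only a convenience for testing. You can just as well drive the generated lexer from a few lines of your own Java, as in the minimal sketch below (the class name LexHello is my own choice; it assumes the Antlr runtime jar set up earlier is on the classpath):

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class LexHello {
    public static void main(String[] args) throws Exception {
        // Read the script into an Antlr character stream
        CharStream input = CharStreams.fromFileName("hello.play");
        // 'Hello' is the lexer class generated from Hello.g4
        Hello lexer = new Hello(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();                       // tokenize the whole input
        for (Token t : tokens.getTokens()) {
            // Print each token's text and its symbolic category (Id, IntLiteral, ...)
            System.out.println(t.getText() + " -> "
                    + lexer.getVocabulary().getSymbolicName(t.getType()));
        }
    }
}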

Let’s look at the rules for string literals we wrote earlier:

StringLiteral:      '"' .*? '"' ;   // String literal

Our version is quite simplified: the double quotes may enclose any characters at all. That doesn’t work well in practice, because it provides no escaping. For invisible characters such as newlines we need escapes like \n; if the string itself contains a double quote, it has to be escaped as \"; Unicode characters need escapes such as \uXXXX; and finally the escape character itself must be escapable, as \\.

The following is the complete rule for string literals in the Java language. As you can see, it is very thorough and takes all sorts of escapes into account:

STRING_LITERAL:     '"' (~["\\\r\n] | EscapeSequence)* '"';

fragment EscapeSequence
    : '\\' [btnfr"'\\]
    | '\\' ([0-3]? [0-7])? [0-7]
    | '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit
    ;

fragment HexDigit
    : [0-9a-fA-F]
    ;

In this rule file, a fragment is a rule fragment that exists only to make the other rule definitions clearer; it does not generate tokens by itself. Only the STRING_LITERAL rule does.

Of course, in addition to string literals, the rules for numeric literals and identifiers can also be defined more tightly.
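For example, the IntLiteral rule above accepts any digit sequence, including ones with leading zeros. A slightly stricter version might look like the sketch below (an illustration in the spirit of Java’s numeric literals, not the official Java rule):

IntLiteral
    :   '0'                          // zero stands alone
    |   [1-9] ('_'* [0-9])*          // no leading zeros; '_' allowed as a separator between digits
    ;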

The problem of classifying tokens in lexical rules

In the rules file we practiced earlier, we categorized >=, >, and < as relational operators and counted them as the same class of tokens, while +, *, and so on were separate classes of tokens. So, what can be grouped and what needs to be listed separately?

It really depends on the grammar. Tokens that can appear in the same position in the same grammar rule can be grouped into one class; they differ only at the semantic level, not at the grammatical level. For example, although addition and subtraction are different operations, they can appear in the same grammar rule and have exactly the same properties there, including precedence and associativity; the same goes for multiplication and division. So you can group plus with minus and times with divide, or treat each of the four operators as its own Token class. What you cannot do is put plus and times in the same class, because they have different precedence in arithmetic and therefore appear in different grammar rules.
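In grammar terms, the point looks like the sketch below (the token name AdditiveOP is just illustrative): grouping '+' and '-' into one token class keeps the grammar rule short, while the accepted language stays the same.

// One token class for both additive operators...
AdditiveOP :  '+' | '-' ;

// ...so the grammar rule needs only one alternative for them; '+' and '-' sit in
// the same position, with the same precedence and associativity
additiveExpression
    :   multiplicativeExpression
    |   additiveExpression AdditiveOP multiplicativeExpression
    ;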

Let’s revisit a problem we ran into in “Compilation Principles in Practice 1: How to implement a lexical analyzer with JS”: the lexical conflict where the rules for identifiers and keywords overlap. How does Antlr solve this problem? Quite simply, it introduces the concept of priority: in an Antlr rules file, the earlier a rule is declared, the higher its priority. So we put the keyword rules before the Id rule, and when the lexer runs it first checks whether the text is a keyword, and only then whether it is an Id, that is, an identifier.

This is the same as what we did when constructing finite automata for lexical analysis by hand: first determine whether the text is a keyword, and only if it is not, classify it as an identifier. Antlr solves the problem simply through declaration order, which saves a lot of work.
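Concretely, with the rules from Hello.g4 (keyword rules declared before Id), Antlr’s lexer picks the longest possible match, and when two rules match text of the same length it picks the one declared first, which is exactly the behavior we want:

If :    'if';
Id :    [a-zA-Z_] ([a-zA-Z_] | [0-9])*;

// Input "if"   -> both rules match two characters; 'If' is declared first, so the token is If
// Input "iffy" -> Id matches four characters, longer than If's two, so the token is Id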

Generate a parser with Antlr

Now that you know how to build a lexical analyzer with Antlr, and how to borrow from mature rule files to make your own lexical rules more complete and professional, let’s try using Antlr to generate a parser instead of writing one by hand!

This time the file name is PlayScript.g4. PlayScript is the name we give our scripting language, and the file begins like this:

grammar PlayScript;
import CommonLexer;   // Import the lexical definition

/* The following content is added to the header of the generated Java source files,
   such as the package name, import statements, etc. */
@header {
package antlrtest;
}

Then we add the grammar definition written earlier. Antlr has a built-in mechanism that handles left recursion automatically, so we can freely write the grammar rules as follows:

expression
    :   assignmentExpression
    |   expression ',' assignmentExpression
    ;
 
assignmentExpression
    :   additiveExpression
    |   Identifier assignmentOperator additiveExpression
    ;
 
assignmentOperator
    :   '='
    |   '*='
    |   '/='
    |   '%='
    |   '+='
    |   '-='
    ;
 
additiveExpression
    :   multiplicativeExpression
    |   additiveExpression '+' multiplicativeExpression
    |   additiveExpression '-' multiplicativeExpression
    ;
 
multiplicativeExpression
    :   primaryExpression
    |   multiplicativeExpression '*' primaryExpression
    |   multiplicativeExpression '/' primaryExpression
    |   multiplicativeExpression '%' primaryExpression
    ;

You may ask, “Why bother learning to eliminate left recursion if Antlr lets us ignore it?” Because when you run into the problem without such a tool at hand, you still have to handle it manually. Also, some tools are not as smart, and require rule files they can accept, such as grammar rules without left recursion. As always: knowing the fundamentals helps you stand tall.
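For the record, handling it yourself would look like the sketch below for the additive rule: rewritten with Antlr’s EBNF repetition, the rule no longer refers to itself on the left but accepts the same language:

additiveExpression
    :   multiplicativeExpression ( ('+' | '-') multiplicativeExpression )*
    ;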

We continue to run the following command to generate the parser:

antlr4 PlayScript.g4
javac antrl_grammer/*.java

Then test the generated parser:

grun antrl_grammer.PlayScript expression -gui

This tests the expression method of PlayScript, which parses expressions, and displays the result in a graphical window.

Let’s enter the following on the console:

age + 10 * 2  + 10
^D

^D means pressing D while holding the Ctrl key, which is equivalent to entering an EOF character, that is, the end-of-file symbol, at the terminal (on Windows, use ^Z). Of course, you can also put these statements in a file in advance and pass the file name as a command argument. The parser then parses the input and pops up a window displaying the AST:
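If you replace -gui with -tree, grun prints the same parse tree as LISP-style text instead of opening a window; for this input it comes out roughly as follows (indented here for readability; the exact shape of primaryExpression depends on parts of the grammar not shown above):

(expression
  (assignmentExpression
    (additiveExpression
      (additiveExpression
        (additiveExpression (multiplicativeExpression (primaryExpression age)))
        +
        (multiplicativeExpression
          (multiplicativeExpression (primaryExpression 10))
          *
          (primaryExpression 2)))
      +
      (multiplicativeExpression (primaryExpression 10)))))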

As you can see, the AST is exactly right, and so are the precedence and associativity. The parser generated by Antlr is therefore quite usable. From now on, you can concentrate on writing grammar rules and focus on the design and application of the language itself.
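If you later want to call the parser from your own Java code rather than through grun, the pattern is only a few lines. Here is a minimal sketch (the class name ParseDemo is made up; it assumes the classes generated from PlayScript.g4, PlayScriptLexer and PlayScriptParser, are compiled and on the classpath):

package antlrtest;   // same package as declared in the @header section above

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class ParseDemo {
    public static void main(String[] args) {
        CharStream input = CharStreams.fromString("age + 10 * 2  + 10");
        PlayScriptLexer lexer = new PlayScriptLexer(input);    // generated from PlayScript.g4
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        PlayScriptParser parser = new PlayScriptParser(tokens);
        ParseTree tree = parser.expression();                   // start parsing at the expression rule
        System.out.println(tree.toStringTree(parser));          // same LISP-style text as grun -tree
    }
}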

Source address: antrl_grammer

This article is based on Geek Time’s course on compilation principles.