Preface
With the rapid development of modern browsers and the front-end ecosystem, especially MVVM frameworks, compilers are used more and more widely in everyday front-end work, including but not limited to:
- The V8 engine and TSC (the TypeScript compiler)
- Webpack (whose underlying parser is Acorn) and related tools such as Babel
- The template compilers of Angular and Vue
- IDEs such as VSCode, and code-editing components (such as Feishu code blocks)
- The ANTLR ecosystem
Front-end developers do not need a deep understanding of these compilers or of the underlying principles of compilation. Still, it is of great benefit to have a basic conceptual grasp of a field we deal with every day, and to be able to use the relevant tools skillfully in production.
Therefore, this series aims to introduce compilation principles and techniques from a front-end perspective, covering the basic principles and practical applications of ecosystem tools such as Acorn and ANTLR. It should be pointed out that I am just a front-end developer with an amateur interest in compilation, so this series can only offer a first look at the field, and it may well contain inaccurate statements, for which I ask your forgiveness. If this field interests you further, I recommend buying professional textbooks for systematic study.
This post is the first in the series: it covers the basic concepts of compilation principles and gives you a taste of the field by walking through a hand-written expression calculator.
Overview
Let’s take a macro look at compilation first. Typically, running the tsc command or the gcc command triggers a general process like the one shown below: the compiler analyzes and processes the input source code (string data) according to the language rules it has agreed on, and finally outputs a result.
The “agreed language rules” mentioned here are called a context-free grammar (CFG) in textbooks. A simple way to understand a CFG is as the grammar rules of a programming language. For example, TSC follows the TypeScript grammar, Babel follows the grammar of a specific ES version (such as ES6), and GCC follows the C/C++ grammar.
After the “compilation process”, different compilers produce different “compilation results”. For example, TSC produces JS code, Babel produces JS code for a given target (such as ES5), and GCC produces binary executables for a given platform (such as x86).
The whole “compilation process” is generally divided into two large stages, the “compiler front end” and the “compiler back end”. These two stages are also the two main theoretical research directions within compilation principles.
As the figure above shows, the compiler front end is mainly concerned with how the compiler reads and understands the input. Typically, the front end produces an intermediate representation for the back end to consume, such as a tree-shaped data structure known as an abstract syntax tree (AST). The compiler back end then takes the front end’s parsing result, optimizes and transforms it further, and generates the target output.
The figure above sketches only the simplest concepts of the compiler back end. In a real compiler, the back end is the more complex and more central module, involving core theory from many areas of programming, including but not limited to stack management, memory management, garbage collection, code optimization, and instruction pipeline optimization. Since this share is limited to an introduction applicable to front-end work, this complex area of the back end is not covered.
Next, let’s take a relatively in-depth look at the compiler front end.
Context-free grammar
As mentioned earlier, the compiler recognizes and understands its input according to a context-free grammar (CFG). In theory, a CFG is used to formally define the syntax of a language, or to systematically describe the constructs of a programming language (such as expressions and statements). Let’s assume a very simple JS-like language that can only declare integer constants, plus arrow functions that take no arguments and directly return the sum of two constants:
const a = 10
const b = 20
const c = () => a + b
The grammar of this language can be expressed as follows:
program :: statement+
statement :: declare | func
declare :: KEYWORD_CONST VARIABLE OPERATOR_EQUAL INTEGER
func :: KEYWORD_CONST VARIABLE OPERATOR_EQUAL OPERATOR_LEFT_BRACKET OPERATOR_RIGHT_BRACKET OPERATOR_EQUAL OPERATOR_RIGHT_ARROW expression
expression :: VARIABLE OPERATOR_PLUS VARIABLE
KEYWORD_CONST :: "const"
OPERATOR_EQUAL :: "="
OPERATOR_LEFT_BRACKET :: "("
OPERATOR_RIGHT_BRACKET :: ")"
OPERATOR_RIGHT_ARROW :: ">"
OPERATOR_PLUS :: "+"
INTEGER :: \d+
VARIABLE :: [a-zA-Z_$][\w$]*
As you can see, the notation for the whole grammar borrows many regular-expression concepts. The specification is top-down: it first constrains a program to consist of one or more statements, and each statement in turn can be a declare or a func. A declaration is composed, from left to right, of the const keyword, a variable name, the = symbol, and an integer; the integer itself is constrained directly by a regular expression. The function rule is similar.
As you can see, the grammar above falls into two large parts. The top half defines statements and the expressions they are recursively constructed from, often referred to as grammar (parser) rules. The bottom half defines the basic words that can be arranged to form statements, often called lexer rules.
You can try to write code that satisfies this grammar: declarations of integer constants, and arrow functions that take no arguments and directly return the sum of two constants. Conversely, if you write code that does not meet this description, can you confirm that no rule in the grammar matches it? Working through such attempts gives a deeper feel for how a context-free grammar formalizes the syntax of a language; a pair of examples is given below.
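For example, this program can be derived from the grammar:

const x = 42
const f = () => x + x

while these lines cannot, because the grammar has no rule for floating-point literals, function parameters, or subtraction inside a function body:

const y = 1.5
const g = (a) => a - x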
In practice, lexical rules are often not listed separately, but are written directly into grammatical rules. For example, the above grammar can be simplified as:
program :: statement+
statement :: declare | func
declare :: "const" variable "=" integer
func :: "const" "(" ")" "=" ">" expression
expression :: variable + variable
variable :: \w[\w\d$_]*
integer :: \d+
At build time, such a compiler automatically recognizes identical string constants (such as "const") and merges them into the same lexical rule. Next, if you want to see what a full-scale grammar looks like, take a brief look at the grammar of the JavaScript language: ECMAScript 2020 Language Specification (rbuckton.github.io), ECMAScript® 2022 Language Specification (tc39.es).
Lexical analysis and token streams
With a context-free grammar serving as the compiler’s syntax rules, let’s look at how the compiler reads and understands the input source code within the constraints of that grammar. In the compiler front-end diagram above, we saw that the compiler first converts the source code into a token stream. Generally speaking, a token is a data structure with a type and a value, and a token stream can simply be an array of tokens. For example, the code in the minimal language given above can be resolved into the following token stream according to the lexical rules of its grammar:
[
{ type: "KEYWORD_CONST", value: "const" }, { type: "VARIABLE", value: "a" },
{ type: "OPERATOR_EQUAL", value: "=" }, { type: "INTEGER", value: "10" }
...
]
How to generate a token stream from source code under a set of lexical rules is a relatively complex topic. In general, there are two directions: one is for the compiler to hard-code the lexical rules into its own implementation logic; the other is to use a tool that takes grammar notation like the above as input and automatically generates the corresponding lexing code. The first method is described in detail in a later section of this article (and sketched just below), and the second in a later share of this series.
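To make the first direction concrete, here is a minimal hand-coded tokenizer sketch for the mini-language above. The names (Token, tokenize) are illustrative, and a real lexer would use longest-match and keyword/identifier disambiguation rather than this first-match simplification:

type Token = { type: string; value: string };

// Lexical rules from the grammar, tried in order (first match wins)
const rules: Array<[string, RegExp]> = [
  ['KEYWORD_CONST', /^const/],
  ['INTEGER', /^\d+/],
  ['VARIABLE', /^[a-zA-Z_$][\w$]*/],
  ['OPERATOR_EQUAL', /^=/],
  ['OPERATOR_LEFT_BRACKET', /^\(/],
  ['OPERATOR_RIGHT_BRACKET', /^\)/],
  ['OPERATOR_RIGHT_ARROW', /^>/],
  ['OPERATOR_PLUS', /^\+/],
];

function tokenize(src: string): Token[] {
  const tokens: Token[] = [];
  let rest = src.trim();
  while (rest.length > 0) {
    let matched = false;
    for (const [type, re] of rules) {
      const m = re.exec(rest);
      if (m) {
        tokens.push({ type, value: m[0] });
        rest = rest.slice(m[0].length).trim();
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`No lexical rule matches: "${rest.slice(0, 10)}"`);
  }
  return tokens;
}

console.log(tokenize('const a = 10'));
// [ { type: 'KEYWORD_CONST', value: 'const' }, { type: 'VARIABLE', value: 'a' },
//   { type: 'OPERATOR_EQUAL', value: '=' }, { type: 'INTEGER', value: '10' } ]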
Syntax analysis and abstract syntax tree
After syntax analysis, the token stream produced by lexical analysis is usually turned into an abstract syntax tree, such as the following:
{
type: 'program',
statements: [{
type: 'declare',
variable: 'a',
value: 10,
}, {
type: 'declare',
variable: 'b',
value: 20
}, {
type: 'func',
name: 'c',
expression: {
operator: '+',
left: {
type: 'variable',
value: 'a'
},
right: {
type: 'variable',
value: 'b'
}
}
}]
}
With the AST above in hand, further processing can be done through depth-first or breadth-first traversal, such as executing the code, compressing it, or converting it to ES5. Let’s take executing code as an example:
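Below is a minimal tree-walking interpreter sketch for the AST shape shown above. The run helper and its eager evaluation of function bodies are illustrative simplifications, not how a real JS engine executes code:

// Loose typing to keep the sketch short
type AstNode = { type?: string; [key: string]: any };

function run(program: AstNode): Record<string, number> {
  const env: Record<string, number> = {}; // declared constants

  // Evaluate an expression node against the environment
  function evalExpr(expr: AstNode): number {
    if (expr.type === 'variable') return env[expr.value];
    const left = evalExpr(expr.left);
    const right = evalExpr(expr.right);
    if (expr.operator === '+') return left + right;
    throw new Error(`Unsupported operator: ${expr.operator}`);
  }

  for (const stmt of program.statements) {
    if (stmt.type === 'declare') {
      env[stmt.variable] = stmt.value; // const a = 10
    } else if (stmt.type === 'func') {
      // Simplification: evaluate the function body eagerly
      // instead of binding a callable closure
      env[stmt.name] = evalExpr(stmt.expression);
    }
  }
  return env;
}

// For the AST above this yields { a: 10, b: 20, c: 30 }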
In front-end practice, the JavaScript language has not only a unified ECMA grammar specification at the grammar level, but also an agreed specification for abstract syntax trees: GitHub - estree/estree: The ESTree Spec, known in the community as ESTree. With this agreed AST specification, producer tools across the front-end community can uniformly emit data structures in this format without caring about downstream consumers, and consumer tools can uniformly process this format without caring about upstream producers.
Take webpack, which we are most familiar with. Underneath it is the Acorn tool: Acorn converts JS source code into the standard ESTree described above, and webpack consumes that ESTree downstream, for example traversing it to extract and analyze require/import dependencies, transforming the code, and emitting the output; a sketch of this flow is given below. In the other direction, besides generating JS code (like css-loader, which essentially converts CSS code into JS code), a more advanced use of a custom webpack loader is to output an ESTree directly to webpack as its upstream, so that the webpack core does not need to run Acorn for a second parse, improving performance.
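As a small taste of that flow, here is a sketch assuming the real acorn and acorn-walk npm packages are installed; the dependency collection below is a toy version of the idea, not webpack’s actual implementation:

import { parse } from 'acorn';
import { simple } from 'acorn-walk';

const source = `
  const fs = require('fs');
  import path from 'path';
`;

// Acorn produces a standard ESTree
const ast = parse(source, { ecmaVersion: 2020, sourceType: 'module' });

// Walk the tree and collect require/import dependencies
const deps: string[] = [];
simple(ast, {
  CallExpression(node: any) {
    if (node.callee.type === 'Identifier' && node.callee.name === 'require') {
      deps.push(node.arguments[0].value); // require('fs')
    }
  },
  ImportDeclaration(node: any) {
    deps.push(node.source.value); // import path from 'path'
  },
});

console.log(deps); // [ 'fs', 'path' ]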
We can view ESTree data structures by pasting JS code into the online tool AST Explorer. This tool is very useful: when writing ESTree-processing tools, you can use it at any time to inspect the ESTree structure of specific code, and then write targeted ESTree traversal code. By extension, there is also an online tool for viewing TypeScript ASTs: TypeScript AST Viewer (ts-ast-viewer.com).
After introducing the AST, the natural question is: how does the compiler convert a token stream into an AST, subject to the grammar rules?
As with turning source code into token streams, AST generation has two main directions: one is to hard-code the constraints of the grammar rules into the compiler’s own logic; the other is to use automatic generation tools to convert the grammar rules directly into parsing code. The former is the common choice for compilers of specific languages: the parsing code is written by hand, and errors and exceptions in the input source code can be reported and handled in greater detail. The aforementioned Acorn, as well as TSC, Babel, and the familiar template compilers of Vue and Angular, primarily take this approach. The latter is more commonly used outside specific programming languages, for example for a simple but frequently changing syntax customized in some business scenario, or simply for complex processing rules over string text.
The first method is briefly described below. The second approach will be covered in a future installment of this series.
Recursive descent technique
Recursive descent is one of the most commonly used techniques in industry for hand-written compiler front ends. Due to space constraints, the underlying theory is not introduced in depth here; interested readers are again encouraged to pick up a professional textbook. Below is just a simple example to help you understand the technique.
As we know, almost any language needs to support expressions that add, subtract, multiply, and divide constants. Here we will try to implement an expression calculator for integer addition, subtraction, multiplication, and division (with parentheses).
First try to define the rules for this expression using a context-free grammar:
expr :: term "+" expr | term "-" expr | term
term :: factor "*" term | factor "/" term | factor
factor :: "(" expr ")" | int | "-" int
int :: \d+
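Notice how the expr/term/factor layering encodes operator precedence. For example, the input 1 + 2 * 3 can only be derived with the multiplication nested inside a term, so * binds tighter than + without any explicit precedence table (factor/int steps abbreviated):

expr
→ expr "+" term
→ term "+" term
→ 1 "+" term
→ 1 "+" term "*" factor
→ 1 "+" 2 "*" 3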
Intuitively, this grammar’s definition is recursive. For example, expr can be made up of term, term can be made up of factor, and factor can in turn contain expr, forming a recursive loop. Beyond expression evaluation, recursion is everywhere in real programming languages; in JS, for example, a function body can itself contain functions, nested layer upon layer.
This kind of recursive definition is rigorous and concise, and can accurately describe a language’s grammar with very few expressions. The notation is technically known as BNF (Backus-Naur Form). The recursive descent technique parses the source code top-down, recursing in a way that mirrors the BNF rules.
In practice, writing recursive-descent code that follows the BNF rules can run into “ambiguity” and “left recursion” problems. Ambiguity is easy to understand: when parsing reaches a certain position, the next character satisfies two possible rules at once, and the compiler cannot automatically decide which way to go. In that case the compiler needs to look ahead one more character, using the next two characters in total to determine which rule to apply. But sometimes even that is not enough, and it must look further still. By analogy, a grammar that needs to look ahead k characters to disambiguate is called an LL(k) grammar. Grammar rules can be adjusted to minimize k, or to remove as much ambiguity as possible; for some grammars, however, not all ambiguity can be eliminated (a fully disambiguated grammar cannot express the complete language rules). The general goal is to avoid ambiguity by looking ahead just once, that is, to achieve LL(1); the sketch below illustrates the idea.
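For example, in the mini-language from the previous sections, the declare and func rules share the prefix "const" variable "=", and the very next token settles which rule applies. A token-level sketch of that decision (peek is a hypothetical helper that looks at the next token without consuming it):

type Token = { type: string; value: string };

// Called after "const x =" has been consumed. One token of lookahead
// is enough to choose the rule, so this part of the grammar is LL(1).
function chooseStatementRule(peek: () => Token): 'declare' | 'func' {
  const next = peek();
  if (next.type === 'INTEGER') return 'declare'; // declare :: ... integer
  if (next.type === 'OPERATOR_LEFT_BRACKET') return 'func'; // func :: ... "(" ")" ...
  throw new Error(`Unexpected token: ${next.value}`);
}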
The concept of “left recursion” is more complicated and will not be discussed in depth here. The simple understanding is that some forms of recursion, translated naively into code, lead to an infinite loop, so they need to be eliminated. For BNF grammars there are formal, mechanical means of eliminating left recursion, which are also not introduced here; instead, the grammar rules that can directly guide coding are given after elimination:
expr :: term expr_
expr_ :: "+" term | "-" term | eps
term :: factor term_
term_ :: "*" factor term_ | "/" factor term_ | eps
factor :: "(" expr ")" | int | "-" int
int :: \d+
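This is an instance of the standard schema for eliminating immediate left recursion:

A  :: A α | β

becomes

A  :: β A_
A_ :: α A_ | eps

With A = expr, α = "+" term (or "-" term), and β = term, the schema yields exactly the expr/expr_ rules above; the same applies to term/term_.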
Guided by the BNF rules above, we can easily write parsing code in the recursive descent style:
/** Is the character a digit? */
function isDigit(chr: string) {
  return /\d/.test(chr);
}

class Calc {
  /** The expression source string */
  private readonly exprStr: string;
  /** Index of the most recently consumed character */
  private curIdx: number = -1;

  /** Entry point: compute the value of an expression string */
  static calculate(exprStr: string) {
    return new Calc(exprStr).expr();
  }

  private constructor(exprStr: string) {
    this.exprStr = exprStr;
  }

  /** Look at the next character without consuming it, skipping spaces */
  private peekChr() {
    while (this.exprStr[this.curIdx + 1] === ' ') {
      this.curIdx++;
    }
    return this.exprStr[this.curIdx + 1];
  }

  /** Consume and return the next character */
  private eatChr() {
    this.curIdx++;
    return this.exprStr[this.curIdx];
  }

  private expr() {
    let value = this.term(); // expr :: term expr_
    while (true) {
      // LL(1): one character of lookahead picks the rule
      if (this.peekChr() === '+') { // expr_ :: "+" term expr_
        this.eatChr();
        value += this.term();
      } else if (this.peekChr() === '-') { // expr_ :: "-" term expr_
        this.eatChr();
        value -= this.term();
      } else { // expr_ :: eps
        break;
      }
    }
    return value;
  }

  private term() {
    let value = this.factor(); // term :: factor term_
    while (true) {
      if (this.peekChr() === '*') { // term_ :: "*" factor term_
        this.eatChr();
        value *= this.factor();
      } else if (this.peekChr() === '/') { // term_ :: "/" factor term_
        this.eatChr();
        value /= this.factor();
      } else { // term_ :: eps
        break;
      }
    }
    return value;
  }

  private factor() {
    let sign = 1;
    if (this.peekChr() === '-') { // factor :: "-" int
      this.eatChr();
      sign = -1;
    }
    let value: number;
    if (isDigit(this.peekChr())) { // factor :: int
      value = sign * this.int();
    } else if (this.peekChr() === '(') { // factor :: "(" expr ")"
      this.eatChr(); // eat "("
      value = sign * this.expr(); // recursively parse the parenthesized expression
      this.eatChr(); // eat ")"
    } else {
      throw new Error(`Unexpected character: ${this.peekChr()}`);
    }
    return value;
  }

  private int() {
    let value = 0;
    while (isDigit(this.peekChr())) {
      const chr = this.eatChr();
      value = value * 10 + Number(chr);
    }
    return value;
  }
}

console.log(Calc.calculate("34 + 4 * (3 - 6)")); // 22
Summary & what's next
This share has introduced the basic concepts of compilation principles at an intuitive level. In practice, the most useful takeaway for the front end is easy to grasp: understand the basic concepts of the AST and be able to use the relevant tools in production. If you are interested, later posts in this series will introduce front-end compilation techniques and ecosystem tools from a more practical perspective.