Background
We know that programming languages are mainly divided into "compiled languages" and "interpreted languages". With a compiled language, the compiler converts the source code into machine language before the code runs, so nothing needs to be re-translated at run time; the compiled result is used directly. An interpreted language also has to be converted into machine language, but the conversion happens at run time.
JavaScript is often classified as an "interpreted language", leading many to believe that front-end code does not need to be compiled. However, JavaScript engines compile in much the same way as traditional compiled languages; the difference is that they do not compile ahead of time. And with the booming development of modern browsers and the front-end ecosystem, compilers are used more and more widely on the front end. In daily work this includes, but is not limited to:
- The V8 engine and the TypeScript compiler (TSC)
- Webpack's parser (acorn), Babel, SWC, and other compilation tools
- The template compilers of Angular, Vue, JSX, etc.
As a front-end developer, you do not need a thorough knowledge of these compilers or of the underlying compilation theory, but a basic understanding of how compilation works will help you down the road. This article walks you through some basic concepts of compilation theory and, taking Babel as an example, explains the basic flow of front-end compilation.
Overview
Let's review the basics of compilation. At a macro level, compilation is essentially a conversion technique: from one programming language into another, from a high-level language into a low-level language, or from one high-level language into another high-level language. The distinction between "high-level" and "low-level" languages mainly refers to the following:
- High-level language: offers many features for describing logic, such as branches, loops, functions, and object orientation. It is close to the way people think, letting developers express all kinds of logic quickly. Examples: C++, JavaScript.
- Low-level language: tied to the hardware and to execution details. It operates on registers and memory, copying between specific memory locations and registers, so developers need to be familiar with how the computer works and with the concrete execution details. Examples: assembly language, machine language.
Whatever the specific compilation process, it basically follows the pattern below:
The "compilation rules" above refer to the syntax rules of the various programming languages. Different compilers produce different compilation results: for example, C/C++ is compiled into binary machine code, which is then handed to the operating system; when we run TSC, TypeScript code is compiled into JavaScript; and when we run Babel, ES6+ code is compiled into JavaScript for the specified target (e.g. ES5).
Generally speaking, the whole compilation process is divided into two stages, the compiler front end and the compiler back end, which can be roughly broken down into the following phases:
As the figure above shows, the compiler front end mainly helps the computer read the source code and understand its structure, meaning, and purpose, parsing what starts as a meaningless stream of characters into components with specific meanings. Typically, the front end produces an intermediate representation for the back end to consume, such as the common abstract syntax tree (AST), while the back end optimizes and transforms the front end's output and generates the final target code.
Context-free grammar
As mentioned earlier, the compiler compiles according to "agreed-upon compilation rules", which here means a "context-free grammar (CFG)". A CFG is used to formalize the syntax of a language, or to systematically describe the constructs (such as expressions and statements) of a programming language.
In fact, almost all programming languages are defined by context-free grammars. Compared with regular expressions, CFGs are strictly more powerful: they can express complex syntax. The grammar of the C language, for example, cannot be expressed with regular expressions, but it can be expressed with a set of CFG rules.
To understand context-free grammars, you need to understand the following concepts:
- Terminal: a basic, lexical symbol; it is fixed, irreplaceable, and cannot be produced by grammar rules.
- Non-terminal: a syntactic variable that can be replaced.
- Production rules: a grammar consists of a set of terminals, a set of non-terminals, and production rules. Production rules define how symbols are rewritten into one another. The left side of a rule is the rule head, a non-terminal, i.e. a symbol that can be substituted; the right side is the production body, the concrete content.
For example, suppose a, b, c, d are terminals (lowercase) and S, A are non-terminals (uppercase). Then S -> cAd and A -> a | ab are production rules, and starting from S they can produce strings such as "cad" and "cabd":
S -> cAd
A -> a | ab
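To make the substitution process concrete, here is a tiny sketch that enumerates the sentences derivable from this two-rule grammar. The `rules` object and the `derive` helper are illustrative assumptions for this article, not part of any real tool:

```javascript
// Production rules: non-terminal -> list of alternative bodies.
const rules = { S: ["cAd"], A: ["a", "ab"] };

// Repeatedly substitute the first non-terminal (uppercase letter)
// until only terminals remain, collecting every derivable sentence.
function derive(symbolString) {
  const i = [...symbolString].findIndex(ch => ch >= "A" && ch <= "Z");
  if (i === -1) return [symbolString]; // all terminals: a finished sentence
  return rules[symbolString[i]].flatMap(alt =>
    derive(symbolString.slice(0, i) + alt + symbolString.slice(i + 1))
  );
}
```

Calling `derive("S")` yields exactly the two sentences this grammar can produce: "cad" and "cabd".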
Context-free grammars are abstract and not the focus of this article; if you are interested, you can dig deeper. There is also a good answer worth consulting on how to understand "context-free grammar".
Let's simulate defining a language's syntax with a CFG. Assume a very simple language that, like JS, can only declare integer constants, and can only declare arrow functions that take no arguments and directly return the addition of constants:
const a = 10
const b = 20
const c = () => a + b
The grammar of this language is as follows:
program :: statement+
statement :: declare | func
declare :: CONST VARIABLE ASSIGN INTEGER
func :: CONST LPAREN RPAREN ARROW expression
expression :: VARIABLE + VARIABLE
CONST :: "const"
ASSIGN :: "="
LPAREN :: "("
RPAREN :: ")"
ARROW :: "=>"
INTEGER :: \d+
VARIABLE :: \w[\w\d$_]*
As you can see, the grammar notation borrows many concepts from regular expressions. It is a top-down specification:
- A program is made up of one or more statements.
- A statement, in turn, can be a declaration statement or a function statement.
- A declaration statement consists of, from left to right, the keyword const, a variable name, the symbol =, and an integer.
- The integer is constrained directly by a regular expression. Function statements are defined similarly.
As you can see, the grammar above falls into two parts. The upper part recursively defines statements and the expressions they are built from, and is often called the "grammar rules"; the lower part defines the basic vocabulary from which statements are assembled, often called the "lexer rules".
In practice, lexer rules are often written inline in the grammar rules rather than listed separately. For example, the grammar above can be simplified to:
program :: statement+
statement :: declare | func
declare :: "const" variable "=" integer
func :: "const" variable "=" "(" ")" "=>" expression
expression :: variable + variable
variable :: \w[\w\d$_]*
integer :: \d+
The notation above is called BNF (Backus-Naur Form), a meta-language used to describe context-free grammars. A rule takes the form <symbol> ::= expression, where the symbol in angle brackets is a non-terminal and the expression is a sequence of symbols, or several alternative sequences separated by the vertical bar "|"; each sequence as a whole is a possible substitution for the symbol on the left. A symbol that never appears on a left-hand side is a terminal.
With BNF, we can externalize, formalize, and even implement a language to solve domain-specific problems. As another example, here is a BNF description of the four arithmetic operations:
result ::= number ("+" | "-") exp | number   // non-terminal
exp    ::= number ("*" | "/") exp | number   // non-terminal
number ::= [0-9]+                            // terminal
Alternatively, we can take a peek at the BNF for ECMAScript and JSON:
- JSON Schema's BNF: Syntax – JSON Schema
- ECMAScript BNF: Function & class BNF
Compiler workflow
Let’s take a closer look at the various stages of compilation.
Lexical analysis
Just as the first step in learning a language is learning its words, the compiler's first step in recognizing source code is word segmentation, i.e. recognizing each word and symbol. At this stage, the lexical analyzer splits the source code into a stream of tokens:
- First, the source string is scanned from left to right and split on whitespace characters (spaces, newlines, tabs, etc.) into untyped tokens.
- Second, following the lexer rules, a "finite state machine" is used to pattern-match the strings split out in the first step and identify the type of each token (see V8's token.h).
In general, a token is a data structure with a type and a value, and a token stream can simply be understood as an array of tokens. Take the following line of code as an example:
const name = 'xujianglong';
It tokenizes into:
[
  { type: "CONST", value: "const" },
  { type: "IDENTIFIER", value: "name" },
  { type: "ASSIGN", value: "=" },
  { type: "STRING", value: "xujianglong" },
  { type: "SEMICOLON", value: ";" },
]
So what exactly is a finite state machine, and how does it convert a source string into tokens?
First, consider that the lexer rules describe the smallest word forms. For the line of code above, splitting on whitespace yields the tokens ['const', 'name', '=', 'xujianglong', ';'], but how do we identify the type of each token? The simplest, crudest approach is to write a chain of if/else statements or a regular expression, but that is neither elegant nor easy to maintain, which is why many programming languages use a state machine instead.
A finite-state machine (FSM), also known as a finite-state automaton (FSA) or simply a state machine, is a mathematical model of computation representing a finite number of states together with behaviors such as transitions and actions between those states.
As shown in the figure, the machine starts in state S1. In S1, an input of 1 keeps it in S1, while an input of 0 moves it to S2; in S2, an input of 1 keeps it in S2, while an input of 0 returns it to S1. It is a cyclical process.
That sounds a bit abstract, so let's map it onto tokenizing code: treat the processing of each word as a state, read the input (the source code) character by character, change the current state according to each character read, and emit a token each time one is completed. Let's take the simple arithmetic expression 10 + 20 - 30 as an example.
First, we define three states: NUMBER for numeric values, ADD for the plus sign, and SUB for the minus sign.
- When we read "1", this input moves the machine into the NUMBER state. We then continue to the next character, "0", because "1" and "0" belong to the same integer.
- When "+" is read, it is clearly an operator and cannot be joined onto the previous "10", so the state changes: we push the current token, the "10" accumulated so far, into the tokens array and switch the machine to the ADD state.
- Continuing in this way, the resulting tokens array looks like this:
[
{ type: "NUMBER", value: "10" },
{ type: "ADD", value: "+" },
{ type: "NUMBER", value: "20" },
{ type: "SUB", value: "-" },
{ type: "NUMBER", value: "30" },
]
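The state-machine steps above can be sketched as a minimal lexer. This is an illustrative sketch, not V8's or Babel's actual implementation; the `tokenize` function name and token shape are assumptions:

```javascript
// Minimal state-machine lexer for expressions like "10 + 20 - 30".
function tokenize(source) {
  const tokens = [];
  let state = null;  // current state: "NUMBER" or null
  let buffer = "";   // characters collected for the current token

  // Emit the number accumulated so far, if any, and reset the state.
  const flush = () => {
    if (state === "NUMBER" && buffer) {
      tokens.push({ type: "NUMBER", value: buffer });
      buffer = "";
      state = null;
    }
  };

  for (const ch of source) {
    if (/\d/.test(ch)) {
      state = "NUMBER"; // enter (or stay in) the NUMBER state
      buffer += ch;
    } else if (ch === "+") {
      flush();          // the operator terminates the current number
      tokens.push({ type: "ADD", value: "+" });
    } else if (ch === "-") {
      flush();
      tokens.push({ type: "SUB", value: "-" });
    } else if (/\s/.test(ch)) {
      flush();          // whitespace also terminates a token
    } else {
      throw new Error(`Unexpected character: ${ch}`);
    }
  }
  flush();              // emit the trailing number, if any
  return tokens;
}
```

Calling `tokenize("10 + 20 - 30")` produces exactly the token array shown above.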
Syntax analysis
At this stage, the parser converts the token array produced by lexical analysis into an abstract syntax tree (AST). For example, the AST for the earlier variable-definition line can be viewed in AST Explorer, an online tool:
{
"type": "Program",
"start": 0,
"end": 28,
"body": [
{
"type": "VariableDeclaration",
"start": 1,
"end": 28,
"declarations": [
{
"type": "VariableDeclarator",
"start": 7,
"end": 27,
"id": {
"type": "Identifier",
"start": 7,
"end": 11,
"name": "name"
},
"init": {
"type": "Literal",
"start": 14,
"end": 27,
"value": "xujianglong",
"raw": "'xujianglong'"
}
}
],
"kind": "const"
}
],
"sourceType": "module"
}
For JavaScript, the AST also follows a community convention: github - estree/estree: The ESTree Spec, known as ESTree. With it, tools across the front-end community can produce one uniform data format without caring about who consumes it downstream, and downstream tools can consume that format without caring about what produced it upstream, decoupling the two. Take webpack as an example: its underlying tool is acorn. Acorn converts JS source code into standard ESTree, which webpack then consumes as the downstream, for instance by traversing it to extract and analyze require/import dependencies, transforming the code, and emitting the output.
Generating an AST must follow the syntax rules (expressed as a context-free grammar). The VariableDeclaration used in the code above, for example, can be expressed as:
VariableDeclaration :: Kind Identifier Init? ;
Kind :: Const | Let | Var ;
Init :: '=' Expression | Identifier | Literal ;
Expression :: BinaryExpression | ConditionalExpression | ... ;
Literal :: StringLiteral | ... ;
Now that we have the syntax rules, we need to consider how the compiler converts the token stream into an AST subject to those rules. There are two main approaches to generating an AST:
- One is to hard-code the constraints of the grammar rules into the compiler's own logic. This is the common approach for compilers of a specific language: the parsing code is written by hand, which allows errors and exceptions in the input source code to be reported and handled in much greater detail. The aforementioned acorn, as well as TSC, Babel, and the familiar Vue and Angular template compilers, mostly work this way.
- The second is to use a generator tool that translates the grammar rules directly into parsing code. This is more common outside of full programming languages, for example for simple but frequently changing syntax customized in some business scenario, or for complex processing rules over plain string text.
Here we take the most basic technique of the first approach, the "recursive descent" algorithm (one of the most common techniques in the industry for hand-written compiler front ends, well worth deeper study if you are interested), and briefly trace how the example code becomes an AST:
Try to match VariableDeclaration
  -> match Const, then match Identifier
  -> try to match Init: recurse down, match '=', then try to match Expression
     -> the match fails, so backtrack and try Literal, which succeeds
  -> backtrack up: VariableDeclaration has matched successfully
  -> construct the node of the corresponding type and insert it into the AST
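As a rough illustration of recursive descent, here is a minimal hand-written parser for the toy declare rule from earlier (declare :: "const" variable "=" integer). The token and node shapes are assumptions for this sketch, not Babel's real data structures:

```javascript
// Minimal recursive-descent parser for: declare :: "const" variable "=" integer
function parseDeclare(tokens) {
  let pos = 0;

  // Consume the next token if it has the expected type, else report an error.
  const expect = (type) => {
    const token = tokens[pos];
    if (!token || token.type !== type) {
      throw new SyntaxError(`Expected ${type}, got ${token ? token.type : "EOF"}`);
    }
    pos++;
    return token;
  };

  // Match the rule left to right, one terminal at a time.
  expect("CONST");
  const id = expect("VARIABLE");
  expect("ASSIGN");
  const init = expect("INTEGER");

  // Build the AST node for the matched declaration.
  return {
    type: "VariableDeclaration",
    kind: "const",
    id: { type: "Identifier", name: id.value },
    init: { type: "Literal", value: Number(init.value) },
  };
}
```

Feeding it the token stream for `const a = 10` yields a VariableDeclaration node; an out-of-place token raises a SyntaxError, which is where a hand-written parser can produce detailed diagnostics.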
Semantic analysis
Not all compilers perform semantic analysis; Babel, for example, does not. But most other languages' compilers (including TypeScript's) do have a semantic analysis step, especially for statically typed languages, where type checking is one part of semantic analysis.
In the semantic analysis phase, the compiler traverses the AST one or more times to check the program's semantic rules. This includes declaration checking and type checking, for example for the earlier assignment statement:
- Whether the variable names in the statement have been declared
- Whether a const variable is reassigned
- Whether the types of the two operands of an addition match
- Whether the number and types of a call's arguments match the parameters the function declares
The semantic checking process resembles how a person reads and understands source code. Generally, while traversing the AST, whenever a variable or function declaration is encountered, its information (the variable's name and type, or the function's name, return type, and parameter count and types) is saved into a symbol table. When a variable or function is later used, its name is looked up in the symbol table to check that it has been declared, that its type is used correctly, and so on.
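The symbol-table walk described above can be sketched roughly as follows. The node shapes and the flat statement list are simplifying assumptions; a real checker would traverse nested scopes in the AST:

```javascript
// Minimal declaration checker: record declarations in a symbol table,
// then flag undeclared uses and reassignments of const bindings.
function checkDeclarations(statements) {
  const symbolTable = new Map(); // name -> { kind }
  const errors = [];

  for (const stmt of statements) {
    if (stmt.type === "VariableDeclaration") {
      // Declaration: save the name and its kind in the symbol table.
      symbolTable.set(stmt.id.name, { kind: stmt.kind });
    } else if (stmt.type === "Identifier") {
      // Use: the name must already be in the symbol table.
      if (!symbolTable.has(stmt.name)) {
        errors.push(`"${stmt.name}" is used but never declared`);
      }
    } else if (stmt.type === "Assignment") {
      // Reassignment: const bindings must not be written again.
      const entry = symbolTable.get(stmt.id.name);
      if (entry && entry.kind === "const") {
        errors.push(`cannot reassign const "${stmt.id.name}"`);
      }
    }
  }
  return errors;
}
```

For a const declaration of `a`, a later assignment to `a`, and a use of an undeclared `b`, the checker reports two errors.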
Semantic analysis can also optimize the syntax tree. For example, expressions containing only constants can be evaluated ahead of time:
a = 1 + 2 * 9;
// is optimized to:
a = 19;
Once semantic analysis is complete, the structural parsing of the source code is done: all compile-time errors have been eliminated, and every variable and function name in use has been bound to its declaration location (address). At this point the compiler can be said to truly understand the source code, and code generation and optimization can begin.
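The constant-folding optimization mentioned above can be sketched as a small recursive function over ESTree-like nodes. The `foldConstants` helper is an illustrative assumption, not part of any particular compiler:

```javascript
// Recursively evaluate binary expressions whose operands are all literals.
function foldConstants(node) {
  if (node.type !== "BinaryExpression") return node;

  // Fold the operands first, so nested constant expressions collapse.
  const left = foldConstants(node.left);
  const right = foldConstants(node.right);

  if (left.type === "Literal" && right.type === "Literal") {
    const ops = {
      "+": (a, b) => a + b,
      "-": (a, b) => a - b,
      "*": (a, b) => a * b,
      "/": (a, b) => a / b,
    };
    return { type: "Literal", value: ops[node.operator](left.value, right.value) };
  }
  // At least one operand is not a constant: keep the expression.
  return { ...node, left, right };
}
```

Folding the tree for `1 + 2 * 9` first collapses the inner `2 * 9` to 18, then the outer addition to a single literal 19, matching the optimization shown above.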
Intermediate code generation and optimization
Most compilers do not generate target code directly; they first generate some kind of intermediate code, and then generate target code from it. There are several reasons for generating intermediate code:
- To make the compiler easier to build: translating a high-level language into intermediate code, and that intermediate code into target code, is easier than translating the high-level language directly into target code.
- To increase the compiler's modularity, portability, and extensibility: intermediate code is generally independent of both the source language and the target machine architecture, providing a medium for building widely adaptable compilers.
- To enable code optimization: in general, machine code generated directly is larger and more repetitive than hand-written assembly. Computer scientists have done a great deal of research on fixed-format intermediate code and proposed many widely used, highly efficient optimization algorithms, so optimizing the intermediate code works much better than optimizing the target code directly.
In its early days, the V8 JavaScript engine had no intermediate code generation and produced native executable code directly from the AST, but skipping the conversion to bytecode reduced the opportunities for code optimization. To improve performance, V8 later moved to an architecture that introduced bytecode: the AST is compiled to bytecode, which the JIT then turns into native code.
V8's compilation process is not covered in detail here; it deserves a separate write-up:
- V8 engine details (III) — Evolution of V8 from bytecode
- Understand V8 bytecode “translation”
Generating object code
With intermediate code in place, generating target code is relatively easy, since intermediate code is designed precisely so that target code can be generated from it easily. Compilers for different high-level languages generate different target code. For example:
- For C, C++, and Go, the target code is assembly language (which may differ per target), which an assembler finally turns into machine code;
- Java, compiled by javac, goes through only lexical analysis, syntax analysis, semantic analysis, and intermediate code generation, producing bytecode; at run time the interpreter translates the bytecode into machine code instruction by instruction;
- JavaScript goes through the entire compilation process at run time, also producing bytecode, which the interpreter then executes;
- As for webpack and Babel, these front-end tools ultimately compile into the corresponding JS code.
Babel compilation process
As mentioned above, a compiler in the general sense converts a high-level language into a low-level language. Some of the tools on the front end, however, such as TS to JS or JS to JS, convert a high-level language into another high-level language; these are often called source-to-source compilers, or transpilers for short. In this sense, Babel is a JavaScript compiler.
Babel is primarily used to convert code written with ECMAScript 2015+ syntax into backward-compatible JavaScript so it can run in current and older browsers and other environments, and it can also polyfill APIs missing from the target environment. For example:
// Babel input: ES2015 arrow function
[1, 2, 3].map(n => n + 1);

// Babel output: ES5-compatible function expression
[1, 2, 3].map(function(n) {
  return n + 1;
});
Babel’s workflow can be broken down into the following steps:
The following details the phases of the Babel workflow.
Parse
Parse converts the original code string into an AST. Broadly speaking, parse covers the lexical analysis and syntax analysis phases described earlier in the compilation process. The parse stage also accepts a number of Babel plugins that let Babel parse additional syntax, such as JSX.
parse(sourceCode) => AST
The parse phase is handled mainly by the @babel/parser package, formerly known as Babylon. It is based on acorn and extends it with a great deal of syntax: it supports ESNext (currently up to ES2020), while non-standard syntaxes such as JSX, Flow, and TypeScript require syntax plugins.
We can call the parser manually to convert code and get an AST:
const parser = require("@babel/parser");

const code = `function square(n) {
  return n * n;
};`;

const result = parser.parse(code);
console.log(result);
The final output AST is as follows:
{
type: "FunctionDeclaration",
id: {
type: "Identifier",
name: "square"
},
params: [{
type: "Identifier",
name: "n"
}],
body: {
type: "BlockStatement",
body: [{
type: "ReturnStatement",
argument: {
type: "BinaryExpression",
operator: "*",
left: {
type: "Identifier",
name: "n"
},
right: {
type: "Identifier",
name: "n"
}
}
}]
}
}
As you can see, every level of the AST has a similar structure; each unit is called a Node. An AST may consist of a single node or of hundreds or thousands of them, which together describe the program syntax available for static analysis. Each node has a string `type` field indicating the node type, and Babel defines types covering all of JavaScript's syntax, for example:
- Declarations: statements such as `FunctionDeclaration`, `VariableDeclaration`, `ClassDeclaration`;
- Identifiers: `Identifier`, a variable or function parameter;
- Literals: literal types such as `StringLiteral`, `NumericLiteral`, `BooleanLiteral`;
- Statements: statements such as `WhileStatement`, `ReturnStatement`;
- Expressions: `FunctionExpression`, `BinaryExpression`, `AssignmentExpression`, and so on.
All of these nodes nest together to form the AST tree, for example the tree structure formed by a variable assignment statement:
Transform
In the transform stage, a depth-first traversal is performed on the AST produced by parse, and matched nodes are added, removed, modified, or inspected to reshape the tree. In Babel, the AST is modified according to the configured plugins or presets to produce a new AST; most of the Babel plugins we write target this phase.
transform(AST, BabelPlugins) => newAST
Babel uses the @babel/traverse package to walk the AST, find the nodes that need modification, and then transform them, a process similar to manipulating a DOM tree. When we say we "enter" a node, we really mean we are visiting it; the term comes from the visitor pattern. The visitor is a cross-language pattern for AST traversal: simply put, it is an object that defines methods for handling specific node types in a tree structure.
A visitor is an object keyed by node type, where each value is either a single function or an object with enter and exit functions. It defines what to do when a node of a given type is matched during traversal, either on the way in ("enter") or on the way out ("exit"):
traverse(ast, {
  /* VisitNodeObject: called for every node */
  enter(path, state) {},
  exit(path, state) {},
  /* [Type in t.Node["type"]]: called when entering an Identifier node */
  Identifier(path, state) {},
  StringLiteral: {
    // called when entering a StringLiteral (string literal) node
    enter(path, state) {},
    // called when exiting the node
    exit(path, state) {},
  },
  // called when entering a FunctionDeclaration or VariableDeclaration node
  'FunctionDeclaration|VariableDeclaration'(path, state) {},
})
The procedure for accessing a node is as follows:
Let’s look at a simple example:
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;

const code = `function square(n) {
  return n * n;
}`;

const ast = parser.parse(code);
traverse(ast, {
  enter(path) {
    if (path.isIdentifier({ name: "n" })) {
      path.node.name = "x";
    }
  },
  FunctionDeclaration: {
    enter() {
      console.log('enter function declaration');
    },
    exit() {
      console.log('exit function declaration');
    }
  }
});
In the example above, the identifier "n" is matched and renamed to "x"; note that traverse mutates the AST in place rather than returning a new one. Here path is the path object for the current position in the traversal and retains context information. It has many properties and methods for customizing behavior once a given node is reached, for example:
- `path.node` points to the current AST node, and `path.parent` points to the parent AST node;
- `path.getSibling`, `path.getNextSibling`, `path.getPrevSibling` get sibling nodes;
- `path.isXxx` checks whether the current node is of type Xxx;
- `path.insertBefore`, `path.insertAfter` insert nodes;
- `path.replaceWith`, `path.replaceWithMultiple`, `path.replaceWithSourceString` replace nodes;
- `path.skip` skips traversal of the current node's children, and `path.stop` ends the traversal.
With @babel/traverse we can do many custom things in the transform phase that affect the final output, such as deleting console.log statements or inserting expressions at specific places. As another example, here is how to prepend location information to console statements, turning console.log(111) into something like console.log('[18,0]', 111):
const types = require("@babel/types");

traverse(ast, {
  // match every function call in the source
  CallExpression(path, state) {
    const callee = path.node.callee;
    // only rewrite console.log / console.info / console.error calls
    if (
      types.isMemberExpression(callee) &&
      callee.object.name === 'console' &&
      ['log', 'info', 'error'].includes(callee.property.name)
    ) {
      const { line, column } = path.node.loc.start;
      const locationNode = types.stringLiteral(`[${line},${column}]`);
      // prepend the location string as the first argument
      path.node.arguments.unshift(locationNode);
    }
  },
})
Generate
This stage is the reverse operation: generating the code we need from the new AST. In essence, the generate stage traverses the abstract syntax tree and recursively emits the corresponding code string according to each node's type and attributes. In Babel this is done through the @babel/generator API.
generate(newAST) => newSourceCode
Here is the previous transform example again, extended with code generation:
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;
const generate = require('@babel/generator').default;

const code = `function square(n) {
  return n * n;
}`;

const ast = parser.parse(code);
traverse(ast, {
  enter(path) {
    if (path.isIdentifier({ name: "n" })) {
      path.node.name = "x";
    }
  }
});

const output = generate(ast, {}, code);
console.log(output);
// {
//   code: 'function square(x) { return x * x; }',
//   map: null,
//   rawMappings: undefined
// }
As the code above shows, the variable "n" in the function has been replaced by "x".
Conclusion
Compilation theory involves a great many concepts and a great deal of knowledge: GCC, a complete compiler toolchain, runs to some six million lines of source code, which gives a sense of the field's depth and breadth. But its value is enormous. It is fair to say that advances in compilation theory enabled the flourishing of high-level languages, and with them the productivity of the software industry. Babel is an important tool in front-end engineering; once you understand Babel's compilation process, you can relate it to the compilation principles behind other front-end technologies such as the V8 engine, TSC, and JSX templates.
Reference documentation
- Babel’s official website
- JavaScript: V8 compilation process
- babel-handbook
❤️ Thank you
That's all for this share; I hope it helps.
Welcome to follow the ELab team's official account for more quality articles.
We are the front-end team of ByteDance, responsible for the front-end development of all ByteDance education products.
We focus on sharing professional knowledge and case studies around product quality, development efficiency, creativity, and cutting-edge technology, contributing experience to the industry, including but not limited to performance monitoring, component libraries, multi-platform technology, Serverless, visual page building, audio and video, artificial intelligence, and product design and marketing.