Why do you need to learn about the AST? Because the AST is everywhere, even if you have never noticed it. To be more specific:
- The browser: the JS engine first parses JS source code and generates an AST, then interprets and executes it, or compiles and optimizes it for execution.
- Tooling: webpack, babel, eslint, prettier, and so on.
There are too many examples to list, but the principle behind all of these tools depends on the AST.
These tools all involve a process:
Disassemble the code -> generate the AST -> traverse the AST and modify it -> regenerate the code.
- The code is just a string. To make changes to it, the string has to be split up at a fine granularity: the code is scanned character by character and broken into its smallest meaningful pieces (tokens).
- When the scan is complete we have a token list. The tokens are then combined according to the grammar rules, forming a tree representation. That tree is the AST.
- Once we have the AST, we can traverse the tree and apply transformations (add, delete, modify) to its nodes.
- When the transformation is finished, new code is printed out based on the modified AST.

So the AST is actually an intermediate representation of the code.
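Put together, the whole pipeline can be sketched in a few lines (a conceptual sketch only; the function names are placeholders for the pieces we will build step by step later in this article):

```js
// Conceptual sketch of the pipeline described above.
// tokenize, parse, transform and generate stand in for the
// functions implemented one by one in the rest of this article.
function compile(code) {
  const tokens = tokenize(code)  // code string -> token list
  const ast = parse(tokens)      // token list  -> AST
  const newAst = transform(ast)  // modify the AST
  return generate(newAst)        // AST -> new code string
}
```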
What is an AST?
After the above introduction, we already have a general idea of what an AST is. AST stands for Abstract Syntax Tree: an abstract representation of the syntactic structure of source code. It represents the syntax of a programming language as a tree, with each node in the tree corresponding to a construct in the source code. Generating an AST takes two steps:
- Lexical analysis (lexical analyzer)
A lexical analyzer, also called a scanner, literally scans our code: it iterates over every character and uses predefined rules to group the characters into tokens. For example, the code

```js
var answer = 6 * 7;
```

generates the following token list:
```json
[
  { "type": "Keyword", "value": "var" },
  { "type": "Identifier", "value": "answer" },
  { "type": "Punctuator", "value": "=" },
  { "type": "Numeric", "value": "6" },
  { "type": "Punctuator", "value": "*" },
  { "type": "Numeric", "value": "7" },
  { "type": "Punctuator", "value": ";" }
]
```
- Syntax analysis (syntax analyzer)

Syntax analysis iterates over the token list and combines the tokens according to the grammar rules into a tree structure: the AST. This is why the AST represents the syntactic structure of the source code, and why each node in the tree represents a construct in the source code.
The AST generated by the above example:
```json
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "answer" },
          "init": {
            "type": "BinaryExpression",
            "operator": "*",
            "left": { "type": "Literal", "value": 6, "raw": "6" },
            "right": { "type": "Literal", "value": 7, "raw": "7" }
          }
        }
      ],
      "kind": "var"
    }
  ],
  "sourceType": "script"
}
```
Implementing a simple JS compiler

With that in mind, let's look at how to implement a simple JS compiler. Don't be afraid: we will go step by step, the goal is only a simple compiler, and the most important thing is to understand the principle. The main job of this compiler is to convert ES6 syntax into ES5 syntax. Sound familiar? That is exactly what we use Babel for. For simplicity, our compiler will only convert
```js
let name = "张三";
```

into

```js
var name = "张三";
```
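This is the kind of transformation Babel performs. For comparison, here is a rough sketch of doing the same thing with Babel itself (assuming @babel/core and @babel/preset-env are installed; this is only for comparison and is not part of our compiler):

```js
// Comparison sketch (assumes @babel/core and @babel/preset-env are installed).
const babel = require("@babel/core")

const { code } = babel.transformSync('let name = "张三";', {
  presets: ["@babel/preset-env"]
})
console.log(code)
// expected output along the lines of: var name = "张三";
// (possibly preceded by a "use strict" directive, depending on configuration)
```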
Compilers typically work in three steps:

- Parsing: parse the code and generate the AST.
- Transformation: add, delete, and modify nodes based on the AST.
- Code Generation: regenerate new code from the transformed AST.

So now our goal is clear: implement these three steps.

Parsing
The parsing phase consists of the two stages we discussed above: lexical analysis and syntax analysis. The online tool esprima is recommended for checking the tokens and the AST that a given piece of code produces.
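If you prefer to check things locally, esprima can also be used as an npm package (a quick sketch, assuming esprima has been installed):

```js
// Quick check with the esprima package (assumes `npm install esprima`).
const esprima = require("esprima")

const code = 'let name = "张三";'
console.log(esprima.tokenize(code))    // the token list
console.log(esprima.parseScript(code)) // the AST
```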
Lexical analysis
In the lexical analysis stage, we need to generate the token list:
```json
[
  { "type": "Keyword", "value": "let" },
  { "type": "Identifier", "value": "name" },
  { "type": "Punctuator", "value": "=" },
  { "type": "String", "value": "\"张三\"" },
  { "type": "Punctuator", "value": ";" }
]
```
The idea behind this step is to iterate through the code string character by character and categorize the characters into tokens.
Start by defining some type variables that will be used later:
```js
// constants.js
const TokenTypes = {
  Keyword: "Keyword",
  Identifier: "Identifier",
  Punctuator: "Punctuator",
  String: "String"
}

const AST_Types = {
  Literal: "Literal",
  Identifier: "Identifier",
  AssignmentExpression: "AssignmentExpression",
  VariableDeclarator: "VariableDeclarator",
  VariableDeclaration: "VariableDeclaration",
  Program: "Program"
}

module.exports = {
  TokenTypes,
  AST_Types
}
```
The implementation idea is as follows:
```js
// tokenizer.js
const constants = require("./constants")
const { TokenTypes } = constants

const KEYWORD = /let/      // matches the keyword
const PUNCTUATOR = /[=;]/  // matches "=" and ";"
const WHITESPACE = /\s/    // matches whitespace
const LETTERS = /[a-z]/i   // matches letters

function tokenizer(input) {
  const tokens = []  // the token list; stores the tokens and is eventually returned
  let current = 0    // marks how far we have traversed the string

  // Loop over the code string until the entire string has been consumed
  while (current < input.length) {
    let char = input[current] // the character currently being processed

    // *************** handle keywords and variable names ***************
    if (LETTERS.test(char)) {
      let value = ''
      // Consume all consecutive letters and store them in value
      // (the char check guards against running past the end of the input)
      while (char && LETTERS.test(char)) {
        value += char
        char = input[++current]
      }
      if (KEYWORD.test(value)) { // is the collected word a keyword?
        // push the keyword
        tokens.push({
          type: TokenTypes.Keyword,
          value: value
        })
      } else { // otherwise it is a variable name
        // push the identifier
        tokens.push({
          type: TokenTypes.Identifier,
          value: value
        })
      }
      // move on to the next iteration
      continue
    }

    // *************** handle the punctuators "=" and ";" ***************
    if (PUNCTUATOR.test(char)) {
      const punctuator = char // hold the matched symbol
      current++
      // push the punctuator
      tokens.push({
        type: TokenTypes.Punctuator,
        value: punctuator
      })
      // move on to the next iteration
      continue
    }

    // *************** skip whitespace ***************
    if (WHITESPACE.test(char)) {
      current++
      continue
    }

    // *************** handle strings ***************
    if (char === '"') {
      let value = ''
      char = input[++current] // skip the opening quote
      // collect characters until the next quote is encountered
      while (char !== '"') {
        value += char
        char = input[++current]
      }
      char = input[++current] // skip the closing quote
      tokens.push({ type: TokenTypes.String, value: '"' + value + '"' })
      continue
    }

    // *************** throw an error for anything the rules above do not match ***************
    throw new TypeError('Unknown character: ' + char)
  }
  return tokens
}

module.exports = tokenizer
```
The tokenizer function generates the tokens. Because this is a simple implementation, it only recognizes the characters we actually need; if you want to parse more complex code, you can enrich the matching logic above.
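A quick check of what it produces (a small usage sketch, run against the snippet from the beginning of this section):

```js
const tokenizer = require("./tokenizer")

const tokens = tokenizer('let name = "张三";')
console.log(tokens)
// [
//   { type: 'Keyword', value: 'let' },
//   { type: 'Identifier', value: 'name' },
//   { type: 'Punctuator', value: '=' },
//   { type: 'String', value: '"张三"' },
//   { type: 'Punctuator', value: ';' }
// ]
```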
Syntax parsing
In the syntax analysis phase we need to use the tokens to generate the AST:
```json
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "name" },
          "init": { "type": "Literal", "value": "张三", "raw": "\"张三\"" }
        }
      ],
      "kind": "let"
    }
  ],
  "sourceType": "script"
}
```
The implementation idea is as follows:
```js
// parser.js
const constants = require("./constants")
const { TokenTypes, AST_Types } = constants

// Accepts the token list as an argument
function parser(tokens) {
  // Records how far through the token list we currently are
  let current = 0

  // The walk function handles one token (or one group of tokens) at a time.
  // Different types of nodes are handled in different ways.
  function walk() {
    // Start parsing from the current token
    const token = tokens[current]

    // *************** is it a string? ***************
    if (token.type === TokenTypes.String) {
      // If so, increment current
      current++;
      // and return a new AST node
      return {
        type: AST_Types.Literal,
        value: JSON.parse(token.value),
        raw: token.value
      }
    }

    // *************** is it a variable name? ***************
    if (token.type === TokenTypes.Identifier) {
      // If so, increment current
      current++;
      // and return a new AST node
      return {
        type: AST_Types.Identifier,
        name: token.value,
      };
    }

    // *************** is it a punctuator? ***************
    if (token.type === TokenTypes.Punctuator) {
      // If so, increment current
      current++;
      // Is it "="?
      if (/\=/.test(token.value)) {
        return {
          type: AST_Types.AssignmentExpression,
          operator: token.value
        }
      } else { // ignore ";": it is not included in the AST
        return
      }
    }

    // *************** is it a keyword? ***************
    if (token.type === TokenTypes.Keyword) {
      const value = token.value
      current++ // increment now: the variable name immediately follows the declaration, so the next walk returns the variable name
      const variable = walk() // get the declared variable
      current++ // the next token should be "=", so just skip it; it is not counted in the AST
      const rightVar = walk() // get the assigned value
      current++ // the next token should be ";"; skip it as well, it is not counted in the AST

      // Define a declarator
      const declaration = {
        type: AST_Types.VariableDeclarator,
        id: variable,   // the declared variable
        init: rightVar  // the value assigned to it
      }

      // Define the node to return
      return {
        type: AST_Types.VariableDeclaration,
        declarations: [declaration],
        kind: value,
      };
    }

    // *************** throw an error when an unknown token type is encountered ***************
    throw new TypeError(token.type);
  }

  // Create the AST and define the root node as a node of type 'Program'
  const ast = {
    type: AST_Types.Program,
    body: [],
    sourceType: "script"
  };

  // Run the walk function and push the resulting nodes into ast.body
  while (current < tokens.length) {
    ast.body.push(walk());
  }

  return ast;
}

module.exports = parser
```
The core of the parser is the walk function, which calls itself recursively to handle the declaration statement. At the same time, some tokens such as "=" and ";" are simply skipped; you will not find them in the AST generated by esprima either.
So far we have implemented a very simple AST generator. A production-ready one needs far more checks and special cases, but the general idea is the same; the focus here is on understanding the principles.
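Chaining it with the tokenizer gives us the AST shown above (a small usage sketch):

```js
const tokenizer = require("./tokenizer")
const parser = require("./parser")

const ast = parser(tokenizer('let name = "张三";'))
console.log(JSON.stringify(ast, null, 2)) // prints the AST shown at the start of this section
```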
Transformation
After the AST is generated, we can add, delete, modify, and inspect nodes on it.
Remember, our goal is to convert

```js
let name = "张三";
```

into

```js
var name = "张三";
```
So we need to replace let with var, which means we need to convert the AST to the following form:
```json
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": { "type": "Identifier", "name": "name" },
          "init": { "type": "Literal", "value": "张三", "raw": "\"张三\"" }
        }
      ],
      "kind": "var"
    }
  ],
  "sourceType": "script"
}
```
Traverser

First we need a traverser function that can traverse the AST. A few key points about this function:
- The AST is a tree structure, so the traverser uses depth-first traversal.
- traverser takes two parameters: ast and visitor.
- traverser is responsible for walking the AST; visitor contains methods that accept nodes of different types along with their parent node, for example: `const visitor = { VariableDeclaration(node, parent) {} };`
- These methods add, delete, and modify the matched nodes.
The implementation is as follows:
```js
// traverser.js
const constants = require("./constants")
const { AST_Types } = constants

function traverser(ast, visitor) {
  // Calls traverseNode for each node in an array of child nodes
  function traverseArray(array, parent) {
    array.forEach(function(child) {
      traverseNode(child, parent);
    });
  }

  // Processes a single AST node, applying the conversion functions defined on the visitor
  function traverseNode(node, parent) {
    // First check whether the visitor has a handler for this node type
    const method = visitor[node.type]
    // If so, call the handler
    if (method) {
      method(node, parent)
    }

    // Each node type is then handled separately
    switch (node.type) {
      case AST_Types.Program: // the top-level Program: its body is an array, so call traverseArray
        traverseArray(node.body, node)
        break
      // These types need no further traversal, so just exit
      case AST_Types.VariableDeclaration:
      case AST_Types.VariableDeclarator:
      case AST_Types.AssignmentExpression:
      case AST_Types.Identifier:
      case AST_Types.Literal:
        break
      // If the current node cannot be recognized, throw an error
      default:
        throw new TypeError(node.type)
    }
  }

  // The root node has no parent, so null is passed here
  traverseNode(ast, null)
}

module.exports = traverser
```
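Before wiring it into the transformer, you can sanity-check the traverser with a visitor that only logs what it sees (a small sketch, reusing the ast produced by the parser above):

```js
const traverser = require("./traverser")

// A visitor that only logs; it does not modify anything
traverser(ast, {
  VariableDeclaration: function(node, parent) {
    console.log("visiting a", node.kind, "declaration inside", parent.type)
    // -> visiting a let declaration inside Program
  }
})
```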
Transformer
The transformer calls the traverser to generate a new AST. We need to define the visitor parameter passed to the traverser so that it can add, delete, or modify AST nodes:
```js
// transformer.js
const traverser = require("./traverser")
const constants = require("./constants")
const { AST_Types } = constants

// transformer accepts the AST as a parameter
function transformer(ast) {
  // newAst is used to store the new AST
  const newAst = {
    type: AST_Types.Program,
    body: [],
    sourceType: "script"
  };

  // For simplicity, newAst.body is mounted directly on the _context property of the old AST.
  // That way, while processing a node, _context can be reached through the node's parent,
  // and the newly built node can simply be pushed into it.
  ast._context = newAst.body

  // Pass the AST and the visitor into the traverser
  traverser(ast, {
    // Convert let to var
    VariableDeclaration: function(node, parent) {
      const variableDeclaration = {
        type: AST_Types.VariableDeclaration,
        declarations: node.declarations,
        kind: "var"
      };
      // Push the new VariableDeclaration into the parent's context
      parent._context.push(variableDeclaration)
    }
  });

  // Finally, return the new AST
  return newAst
}

module.exports = transformer
```
Code Generation
Finally, it’s time to regenerate the code based on the new AST.
```js
// codeGenerator.js
const constants = require("./constants")
const { AST_Types } = constants

function codeGenerator(node) {
  // Handle the different node types
  switch (node.type) {
    // For a Program node, run codeGenerator on each node in its body and join them with newlines
    case AST_Types.Program:
      return node.body.map(codeGenerator)
        .join('\n')

    case AST_Types.VariableDeclaration: // handle the variable declaration
      return (
        node.kind + ' ' + node.declarations.map(codeGenerator)
      )

    case AST_Types.VariableDeclarator: // handle the declared variable and its value
      return (
        codeGenerator(node.id) + ' = ' +
        codeGenerator(node.init)
      )

    case AST_Types.Identifier: // return the variable name directly
      return node.name

    case AST_Types.Literal: // return the string value wrapped in quotes, followed by ";"
      return '"' + node.value + '"' + ";"

    // If the node cannot be recognized, throw an error
    default:
      throw new TypeError(node.type)
  }
}

module.exports = codeGenerator
```
This step is relatively simple: it recursively traverses the new AST and stitches the corresponding pieces of code together.
Finally, let’s look at the compiler we wrote in action:
```js
const transformer = require("./transformer")
const tokenizer = require("./tokenizer")
const parser = require("./parser")
const codeGenerator = require("./codeGenerator")

const tokens = tokenizer('let name = "张三";')
const ast = parser(tokens)
const newAst = transformer(ast)
const newCode = codeGenerator(newAst)
console.log(newCode) // var name = "张三";
```
The complete code is available here; if you are interested, feel free to refer to it.