An overview of the
This paper is divided into three parts
Part 1: What is AST and why do we use it
Part two: Implement a small compiler and understand the role the AST plays
Part 3: Introduce how AST is used in Babel
The second part is the most important, and I recommend downloading sample code to get a feel for the process
Babel is well known among front-end developers for its role in converting ES6, ES7, or draft syntax into ES5 syntax,
This way, there are no compatibility issues in the browser, and developers can try out the latest syntax,
Do you have any questions about how Babel achieves this capability?
.
Think about 5 s
. done ….
Babel implements this transformation based on an AST (Abstract Syntax tree) technology, which is referred to as AST for short
It is not only possible to convert ES6 to ES5 based on AST, but the following are current applications
- Syntax validation (ESLint)
- Code formatting
- CSS pretreatment
- Programming language conversion (e.g. Lisp to JavaScript)
- React Native, Taro, Uni-app, etc.
It’s all part of front-end engineering, but it’s not the whole story.
You can do anything with it (AST) for a particular type of code file
What is AST?
Let’s look at the definition first
It is a hierarchical program representation that presents source code structure according to the grammar of a Buta collection of facts cannot be called science any more than a pile of facts. Buta collection of facts cannot be called science any more than a pile of facts. Buta collection of facts cannot be called science any more than a pile of facts. Each AST node corresponds to a source nodeCopy the code
Now, you’re probably as confused as I am, and this description isn’t enough to form a concrete understanding of it in our minds
Let’s start with a small example, if we have a piece of code like this
function calc(a, b) { return a + b; }
Copy the code
Since we’re just doing addition, we want to change the name of the function to sum, what can we do?
I’m sure some of you thought of replacing it with the string replace method.
Ok, let’s create a transform method to implement this capability
const code = ` function calc(a, b) { return a + b; } `;
function transform (originalCode) {
if (typeoforiginalCode ! = ='string') {
return originalCode;
}
return originalCode.replace(/calc/.'add');
}
console.log(transform(code));
/ / output
// function add(a, b) {
// return a + b;
// }
Copy the code
It seems pretty simple, so let’s change the example again, and now it looks like this
function plus (args = []) {
return args.reduce((total, item) = > total + item, 0);
}
function subtract (args = []) {
return args.reduce((total, item) = > total - item, 0);
}
function multiply (args) {
return args.reduce((total, item) = > total * item, 0);
}
function divide (args) {
return args.reduce((total, item) = > total / item, 0);
}
function calc(type, ... args) {
switch (type) {
case 'plus':
return plus(args);
case 'subtract':
return subtract(args)
case 'multiply':
return multiply(args)
case 'divide':
return divide(args)
}
}
Copy the code
If we want to add a default output to console.log, or a default return to the switch statement, it might be a bit of a hassle
The actual scene of programming language is more complex, and using AST to deal with it is the mainstream way at present. AST is a virtual node tree describing code
For example, the React virtual DOM is used to describe real DOM nodes in HTML.
This abstract tree structure makes it easy for us to process it to get the desired result.
That little example at the beginning, how does AST represent that
Function calc(a, b) {return a + b; }Copy the code
// AST represents {"type": "Program", "start": 0, "end": 38, "sourceType": "module", "interpreter": null, "body": [{"type": "FunctionDeclaration", "start": 0, "end": 38, "id": { "type": "Identifier", "start": 9, "end": 13, "loc": { "start": { "line": 1, "column": 9 }, "end": { "line": 1, "column": 13 }, "identifierName": "calc" }, "name": "calc" }, "generator": false, "async": false, ... }Copy the code
As far as Program and Identifier are concerned, you don’t need to worry about them for a moment, and we will explain them later. Now, as long as you know that AST describes the complete code structure, you can understand it after reading the code implementation of the compiler in Part 2
The parser here is @babel/ Parser, and there are a lot of parsers out there, and Babel is one of the most widely used tools, and we’ll talk about the others later
astexplorer.net/
Is a visual AST presentation site where you can paste code to see how an AST is made
Github.com/fkling/aste…
You can check out the various parser and Transformer tool libraries currently in mainstream use here
Compiler composition and implementation
Compiler composition
Having seen the AST above, let’s take a closer look at the role AST plays in implementing a compiler with over 200 lines of code
The workflow of most compilers can be roughly divided into three steps:
Parsing => 3. Parsing => 3. Parsing => 3. Code GenerationCopy the code
Parsing
Parsing is the process of converting source code into an AST, which is divided into two steps, lexical analysis and syntax analysis
Lexical analysis
Lexical analysis requires traversing the source code from beginning to end and breaking it down into the smallest semantically expressive units of token for later generation of AST.
What are semantically minimal units and tokens, as illustrated by the initial example
function calc(a, b) {
return a + b;
}
Copy the code
Calc is a minimal word, and if you break it down into c, A, L, and C, it’s just an ordinary letter, and we don’t know what they’re related to each other,
Similarly, function and return are the smallest semantic words that cannot be split. What is token
Token is the data in {type: ‘XXX ‘, value:’ XXX ‘} format, which facilitates the generation of AST
[{type: 'string'.value: 'function' },
{ type: 'name'.value: 'calc' },
{ type: 'paren'.value: '(' },
{ type: 'string'.value: 'a' },
{ type: 'string'.value: 'b' },
{ type: 'paren'.value: ') ' },
{ type: 'paren'.value: '{' },
{ type: 'string'.value: 'return' },
{ type: 'string'.value: 'a' },
{ type: 'operator'.value: '+' },
{ type: 'string'.value: 'b' },
{ type: 'paren'.value: '} '},]Copy the code
Syntax analysis
By analyzing the tokens generated by lexical analysis, generate AST. The Program type is the top-level node
{
"type": "Program",
"start": 0,
"end": 38,
"sourceType": "module",
"interpreter": null,
"body": [
{
"type": "FunctionDeclaration",
"start": 0,
"end": 38,
"id": {
"type": "Identifier",
"start": 9,
"end": 13,
"loc": {
"start": {
"line": 1,
"column": 9
},
"end": {
"line": 1,
"column": 13
},
"identifierName": "calc"
},
"name": "calc"
},
"generator": false,
"async": false,
...
}
Copy the code
2. Transformation
The next phase is the transformation phase, which basically takes the AST as a parameter, walks through the AST, makes the changes we want, and generates a new AST
3. Code Generation
Finally, there is the code generation phase, where we iterate through the new AST of the transformation phase to finally get the code structure we want to output
Function calc(a, b) {return a + b; Function add(a, b) {return a + b; }Copy the code
Compiler implementation
This is the highlight of this article. We will implement a compiler from scratch and look at the implementation details of each stage
Let’s implement a cross-language converter that compiles Lisp function calls to JavaScript function calls
Lisp | JavaScript | |
---|---|---|
2 + 2 | (add 2 2) | add(2, 2) |
2 + (5-3) | (add 2 (subtract 5 3)) | add(2, subtract(5, 3)) |
It doesn’t matter if you don’t know Lisp, we’re going to focus on the conversion process
This correspondence is important, and everything we’re going to do is based on this correspondence in the table
As mentioned earlier, the implementation of a compiler has three stages:
Parsing => 3. Parsing => 3. Parsing => 3. Code GenerationCopy the code
We define the corresponding function according to the stage
1, parsing,
Tokenizer: Lexical analysis
Parser: Parses
2, conversion,
Traverser: Provides a visitor interface to traverse the AST tree and modify to generate a new tree
Transformer: Invoke the traverser
3. Code generation
CodeGenerator: : Generates code
1, parsing,
Tokenizer: Lexical analysis
** (subtract 4 2)) ** The token generated from the above code would look like this: ** [* {type: 'paren', value: '(' }, * { type: 'name', value: 'add' }, * { type: 'number', value: '2' }, * { type: 'paren', value: '(' }, * { type: 'name', value: 'subtract' }, * { type: 'number', value: '4' }, * { type: 'number', value: '2' }, * { type: 'paren', value: ')' }, * { type: 'paren', value: ')' }, * ] * */
// We accept the source code and set two variables current and tokens
function tokenizer(input) {
The current variable is used to indicate the current position of the traversal, like a virtual cursor
let current = 0;
// Tokens array is used to collect tokens
let tokens = [];
// We create a while loop to traverse the source code
// Because the token length is different, the current may be increased several times in a loop
while (current < input.length) {
// Cache the current field in input
let char = input[current];
// The first case we need to detect is the opening of parentheses, which is later used by the function call CallExpression. but
// Now we just need to worry about the characters.
//
// We check if there is an open parenthesis
if (char === '(') {
// If the character is open parenthesis, we push a new token with type paren and value (
tokens.push({
type: 'paren'.value: '('});// move the virtual cursor forward
current++;
// Then continue to the next loop
continue;
}
// Check the closing parentheses, as before, add 1 token, add current, continue the next loop
if (char === ') ') {
tokens.push({
type: 'paren'.value: ') '}); current++;continue;
}
// We now need to check for Spaces. This is interesting because Spaces exist to separate characters and are not really important for token storage, so we'll just skip it
let WHITESPACE = /\s/;
if (WHITESPACE.test(char)) {
current++;
continue;
}
// The token type is a number, which is different from the previous one because the number may be consecutive
//
// (add 123 456)
/ / ^ ^ ^ ^ ^ ^
// Only two tokens are allowed
//
// So when we get to the first number, we do the following
let NUMBERS = / [0-9];
if (NUMBERS.test(char)) {
// We create a value variable to store characters
let value = ' ';
// We loop back until the character is not a number
while (NUMBERS.test(char)) {
value += char;
char = input[++current];
}
// Push 1 token to tokens
tokens.push({ type: 'number', value });
// Continue to the next loop
continue;
}
// Our language also supports strings, a paragraph of text wrapped around"
//
// (concat "foo" "bar")
// ^^^ ^^^ string tokens
//
// let's start checking quotes
if (char === '"') {
// Use value to collect string tokens
let value = ' ';
// We skip the double brackets themselves and start with the last 1 bit
char = input[++current];
// Iterate backwards until another one is found."
while(char ! = ='"') {
value += char;
char = input[++current];
}
// Skip closed double quotes
char = input[++current];
// We add string token to tokens
tokens.push({ type: 'string', value });
continue;
}
// The last token is the name token, also a sequence of letters, which in Lisp is the name of a function
//
// (add 2 4)
// ^^^
// Name token
//
let LETTERS = /[a-z]/i;
if (LETTERS.test(char)) {
let value = ' ';
// We iterate over the loop letters and add them to the value variable
while (LETTERS.test(char)) {
value += char;
char = input[++current];
}
// We add the name token and continue the loop
tokens.push({ type: 'name', value });
continue;
}
// Finally we check the syntax, if we do not match the result, throw an error, stop execution
throw new TypeError('I dont know what this character is: ' + char);
}
// At the end of tokenizer, we return to tokens
return tokens;
}
Copy the code
Parser: Parses
/** * We try to parse the tokens array into AST ** [{type: 'paren', value: '('},... => { type: 'Program', body: [...] } * /
// We define the parser function to receive tokens from Tokenizer
function parser(tokens) {
// Create the current variable to represent the current position of the virtual cursor
let current = 0;
// We create the root node of the AST. Program is the top node of the AST
let ast = {
type: 'Program'.body: [],};// We will encounter nested functions, so this time we will use recursion instead of the while loop
function walk() {
// Retrieve the token for the virtual cursor position
let token = tokens[current];
// We check if the token is of type number
if (token.type === 'number') {
// 如果是 number,current +1
current++;
// We return an AST node called NumberLiteral
return {
type: 'NumberLiteral'.value: token.value,
};
}
// If it is a string token, we create a StringLiteral node like number
if (token.type === 'string') {
current++;
return {
type: 'StringLiteral'.value: token.value,
};
}
// Next we look at CallExpressions, which start with an open parenthesis
if (
token.type === 'paren' &&
token.value === '('
) {
// current +1, skip the opening parenthesis, we don't care about it
token = tokens[++current];
// We create a base node with type CallExpression and set name to the token value
// The function name is followed by the opening parentheses
let node = {
type: 'CallExpression'.name: token.value,
params: [],};// current +1, we skip the function name and look directly at the expression
token = tokens[++current];
// To collect the Params of CallExpression, we query each token backwards until we encounter closing parentheses
//
// This is when recursion is needed. Instead of trying to analyze parameters directly that may have an infinite number of nested nodes, we use recursion.
//
To illustrate this concept, take our lisp code as an example. You can observe that the arguments to Add are a number and a nesting
// The CallExpression, which in turn has its own parameters.
//
// (add 2 (subtract 4 2))
//
// You should notice that our tokens array has multiple closing brackets
//
/ / /
// { type: 'paren', value: '(' },
// { type: 'name', value: 'add' },
// { type: 'number', value: '2' },
// { type: 'paren', value: '(' },
// { type: 'name', value: 'subtract' },
// { type: 'number', value: '4' },
// { type: 'number', value: '2' },
// {type: 'paren', value: ')'}, <<< close parentheses
// {type: 'paren', value: ')'}, <<< close parentheses
/ /]
//
// We rely on the walk function to add current
// So we create a while loop that continues through the token until we encounter a closing parenthesis
while( (token.type ! = ='paren') ||
(token.type === 'paren'&& token.value ! = =') ')) {// We call walk until we return a node and we push it to Node.params
node.params.push(walk());
token = tokens[current];
}
// Finally, we put current+1, skipping the closing parentheses
current++;
/ / returns the node
return node;
}
// If the token does not match, a type error is thrown
throw new TypeError(token.type);
}
// Then we start the walk function and push nodes to ast.body
//
// The reason we do it in one loop is because CallExpression (function call expressions) can be multiple and unrelated
//
// (add 2 2)
// (subtract 4 2)
//
while (current < tokens.length) {
ast.body.push(walk());
}
// At the end of parser, we return the AST
return ast;
}
Copy the code
2, conversion,
Traverser: Provides a visitor interface to traverse the AST tree and modify to generate a new tree
/** * Now we have the AST, and we can use the visitor pattern to access different nodes. * When we encounter a matching type, we call different methods on the visitor. * Visitor pattern is a design pattern that separates data operations from data operations. Traverse (ast, {* Program: {* enter(node, parent) {* //... * }, * exit(node, parent) { * // ... * }, * }, * * CallExpression: { * enter(node, parent) { * // ... * }, * exit(node, parent) { * // ... * }, * }, * * NumberLiteral: { * enter(node, parent) { * // ... * }, * exit(node, parent) { * // ... *}, *}, *}); * /
// We define the traverser function to receive the AST and a visitor, where we will define two functions
function traverser(ast, visitor) {
// The traverseArray function will allow us to traverse the array, and we need to define a traverseNode function next
function traverseArray(array, parent) {
array.forEach(child= > {
traverseNode(child, parent);
});
}
// traverseNode takes node nodes and their parent nodes
// We can pass both as arguments to our visitor method, i.e
function traverseNode(node, parent) {
// Initially, see if the node type has a corresponding visitor method
let methods = visitor[node.type];
// If there is an Enter method, we call it, passing two arguments
if (methods && methods.enter) {
methods.enter(node, parent);
}
// Next, we need to make a difference according to the current node type
switch (node.type) {
// To start, the top-level structure is Program, which has a property named body,
// We use traverseArray recursively to process all child nodes
case 'Program':
traverseArray(node.body, node);
break;
// Next, we do the same with CallExpression, cycling through its params, the expression body below
case 'CallExpression':
traverseArray(node.params, node);
break;
NumberLiteral and StringLiteral, which have no children below them, so no extra processing is required
case 'NumberLiteral':
case 'StringLiteral':
break;
// If the above match is not triggered, a type error is thrown as before
default:
throw new TypeError(node.type);
}
If there is an exit method on the visitor, we execute it
if(methods && methods.exit) { methods.exit(node, parent); }}// Finally we start the traverseNode function with AST, which is at the top level and has no parent node
traverseNode(ast, null);
}
Copy the code
Transformer: Invoke the traverser
/** * Next we have transformer, we iterate through the AST node tree, processing through a method on the visitor, Generate new AST tree * * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - * source AST | AST * after conversion ---------------------------------------------------------------------------- * { | { * type: 'Program', | type: 'Program', * body: [{ | body: [{ * type: 'CallExpression', | type: 'ExpressionStatement', * name: 'add', | expression: { * params: [{ | type: 'CallExpression', * type: 'NumberLiteral', | callee: { * value: '2' | type: 'Identifier', * }, { | name: 'add' * type: 'CallExpression', | }, * name: 'subtract', | arguments: [{ * params: [{ | type: 'NumberLiteral', * type: 'NumberLiteral', | value: '2' * value: '4' | }, { * }, { | type: 'CallExpression', * type: 'NumberLiteral', | callee: { * value: '2' | type: 'Identifier', * }] | name: 'subtract' * }] | }, * }] | arguments: [{ * } | type: 'NumberLiteral', * | value: '4' * ---------------------------------- | }, { * | type: 'NumberLiteral', * | value: | '2' * * * | |}}}] * * | |}]} * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - * / / / Function transformer(AST) {// We create newAst, like the AST above, Let newAst = {type: 'Program', body: [],}; // There are usually better abstractions to do this. If you want to do this, you can use a _content method on the old AST to store a reference to the new AST. Here we try to handle ast._context = newast.body as simply as possible; Traverser (ast, {// The first traverser method is used to process NumberLiteral NumberLiteral: {// We will call Enter (node, parent) when we access this node type. {// The new node is also called NumberLiteral, and we will push it into the new AST tree. 'NumberLiteral', value: node.value,});},}, // { enter(node, parent) { parent._context.push({ type: 'StringLiteral', value: Node.value,});},}, // Now it's time for CallExpression: {enter(node, parent) {// We create the new language CallExpression expression structure let expression = {type: 'CallExpression', callee: {type: 'Identifier', name: node.name, }, arguments: [],}; // Now we define the new _context on the original CallExpression node, Mount the node node to which you want to convert expression. Arguments node._context = expression.arguments; CallExpression if (parent. Type! == 'CallExpression') {// we use ExpressionStatement node to wrap CallExpression node, // why do we do this, because in JavaScript, Expression = {type: 'ExpressionStatement', expression: Expression,};} // Finally we push CallExpression into the _context of the parent, which can be another function call parent._context.push(expression);},}}); Return newAst;}Copy the code
3. Code generation
CodeGenerator: : Generates code
/** * now let's look at the last step: */ function codeGenerator(node) {switch (node.type) {// If the Program root node. Case 'Program': return node.body.map(codeGenerator).join('\n'); // ExpressionStatement (ExpressionStatement) Case 'ExpressionStatement': return (codeGenerator(node.expression) + '; '/ / < <... Because we want it to be syntactically correct)); // For CallExpression, we will print its callee, concatenation (, map iterates through its arguments, and finally add) case 'CallExpression': return ( codeGenerator(node.callee) + '(' + node.arguments.map(codeGenerator) .join(', ') + ')' ); Case 'Identifier': return node.name; Case 'NumberLiteral': return node.value; // For NumberLiteral, we wrap "" around node value and return case 'StringLiteral': return '" + node.value + '"'; Default: throw new TypeError(Node.type); // If no processing rule is matched, throw a type error. }}Copy the code
4. Encapsulate the methods in stages 1, 2 and 3 into compiler
/** * Finally, we create a compiler function * here, we string them together with paths like this, Tokens => Parser => AST * 3. Ast => Transformer => newAst * 4. NewAst => generator => output */
function compiler(input) {
let tokens = tokenizer(input);
let ast = parser(tokens);
let newAst = transformer(ast);
let output = codeGenerator(newAst);
return output;
}
Copy the code
Recall our previous chart, if you have a mental model in mind
Lisp | JavaScript | |
---|---|---|
2 + 2 | (add 2 2) | add(2, 2) |
2 + (5-3) | (add 2 (subtract 5 3)) | add(2, subtract(5, 3)) |
Complete sample code for the compiler:Click here to get
How does Babel escape through AST?
Recall the three stages of the compiler:
Parsing => 3. Parsing => 3. Parsing => 3. Code GenerationCopy the code
Babel has toolkits for each stage
1. The analytic (Parsing) = > @ Babel/parser (https://babeljs.io/docs/en/babel-parser) 2. Translation (Transformation) = > @ Babel/traverse by 3 (https://babeljs.io/docs/en/babel-traverse). Code Generation (Code Generation) = > @ Babel/generator (https://babeljs.io/docs/en/babel-generator)Copy the code
The structure of AST is relatively complex, and Babel has a package of @babel/types to facilitate the operation of AST nodes
Let’s continue with the second example, adding console to the output of the function name to make it easier to identify the printed information
function plus (args = []) {
return args.reduce((total, item) = > total + item, 0);
}
function subtract (args = []) {
return args.reduce((total, item) = > total - item, 0);
}
function multiply (args) {
return args.reduce((total, item) = > total * item, 0);
}
function divide (args) {
console.log(args);
return args.reduce((total, item) = > total / item, 0);
}
function calc(type, ... args) {
console.log(type);
console.log(args);
switch (type) {
case 'plus':
return plus(args);
case 'subtract':
return subtract(args)
case 'multiply':
return multiply(args)
case 'divide':
return divide(args)
}
}
Copy the code
Attach the complete code
const t = require('@babel/types');
const babelParser = require('@babel/parser');
const generate = require('@babel/generator').default;
const traverse = require('@babel/traverse').default;
const code = ` function plus (args = []) { return args.reduce((total, item) => total + item, 0); } function subtract (args = []) { return args.reduce((total, item) => total - item, 0); } function multiply (args) { return args.reduce((total, item) => total * item, 0); } function divide (args) { console.log(args); return args.reduce((total, item) => total / item, 0); } function calc(type, ... args) { console.log(type); console.log(args); switch (type) { case 'plus': return plus(args); case 'subtract': return subtract(args) case 'multiply': return multiply(args) case 'divide': return divide(args) } } `;
/ / 1. Parsing
const ast = babelParser.parse(code);
/ / 2. The conversion
/ / the node type and data structure, can see estree specification: https://github.com/estree/estree
traverse(ast, {
// This visitor is called when a node is entered
enter(path) {
console.log('Log: path.node.type ->', path.node.type);
},
// This visitor is called when the node exits
exit(path) {
// console.log('exit')
},
// You can write the node type directly, and if the type is scanned, that visitor is called
CallExpression(path) {
const { callee } = path.node;
// Check whether console.log has a parent node
const isConsoleLog = t.isMemberExpression(callee) && callee.object.name === 'console' && callee.property.name === 'log';
if (isConsoleLog) {
// If the parent node is a declared function, take its function name and add it to console
const funcPath = path.findParent(p= > p.isFunctionDeclaration());
if (funcPath) {
constfuncName = funcPath.node.id.name; path.node.arguments.unshift(t.stringLiteral(funcName)); }}}})// 3. Code generation
const output = generate(ast)
console.log('Input \n', code)
console.log('-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --')
console.log('Output \n', output.code)
Copy the code
The calc function console.log has added the function name as expected
function calc(type, ... args) {
console.log("calc", type);
console.log("calc", args); . }Copy the code
This is the end of AST introduction, do you have some harvest
Finally, I strongly recommend taking some time to think about the code in Part 2
References:
- Github.com/estree/estr…
- Github.com/jamiebuilds…
- babeljs.io/docs/en/
- Juejin. Cn/post / 684490…