This article is based on the source code of the github.com/jamiebuilds… repository, for learning purposes.
I have been doing some research on build tooling recently, and while browsing the Babel website I came across this quote:
For an awesome tutorial on compilers, check out the-super-tiny-compiler, which also explains how Babel itself works on a high level.
To learn from it, I went through the source code of that repository. Here are the notes I took; I hope they help you understand how Babel works after reading them.
Preface
the-super-tiny-compiler is a tiny compiler with fewer than 200 lines of code, but it is enough to learn the basic principles of compilation, and Babel is developed on the same principles. The repository's own example compiles a set of Lisp-style function calls into C-style function calls, for example:
Arithmetic | LISP style | C style
---|---|---
2 + 2 | (add 2 2) | add(2, 2)
4 - 2 | (subtract 4 2) | subtract(4, 2)
2 + (4 - 2) | (add 2 (subtract 4 2)) | add(2, subtract(4, 2))
This is roughly what our compiler needs to do this time. It is not a syntactically complete language, but it is enough to illustrate the main ideas of a modern compiler.
Most compilers break the compilation process into three main stages: parse, transform, and code generate:
- Parse transforms the source code into a more abstract representation of the code
- Transform manipulates that abstract representation to do whatever the compiler needs
- Code generate produces new code from the transformed representation
Parse
Parse consists of two main steps: lexical analysis and syntax analysis.
- Lexical analysis splits the source code into tokens. Tokens are a group of objects, each describing an isolated piece of syntax, such as a number, a name, punctuation, an operator, and so on
- Syntax analysis reorganizes the tokens produced by lexical analysis into a structure called an Abstract Syntax Tree (AST). An AST is a nested tree structure that is easy to work with and carries complete information about the code.
The tokens for (add 2 (subtract 4 2)) look like this:
[
  { type: 'paren', value: '(' },
  { type: 'name', value: 'add' },
  { type: 'number', value: '2' },
  { type: 'paren', value: '(' },
  { type: 'name', value: 'subtract' },
  { type: 'number', value: '4' },
  { type: 'number', value: '2' },
  { type: 'paren', value: ')' },
  { type: 'paren', value: ')' }
]
The AST structure produced by syntax analysis is as follows:
{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2'
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4'
      }, {
        type: 'NumberLiteral',
        value: '2'
      }]
    }]
  }]
}
Transform
Transform takes the abstract syntax tree produced by Parse and makes changes to it. The Transform step can either modify the AST within the same language or translate it into a completely new language. Below we walk through how Transform works on the AST from the previous step. You may have noticed that the elements of the AST all look very similar: these elements are the nodes of the AST, and each node's type describes one isolated part of the code (a variable, a declaration, an expression, and so on). For example, the NumberLiteral node mentioned above:
{
  type: 'NumberLiteral',
  value: '2'
}
Or a more complex child node type:
{
  type: 'CallExpression',
  name: 'subtract',
  params: [...nested nodes here...],
}
In transform we can add, delete, and modify the nodes of the abstract syntax tree, or we can create a brand-new tree based on the existing one. Since our goal here is to convert the input code into a new language (Lisp -> C), we will build a completely new AST for the target language. To do that we first need a way to walk the existing AST:
Traversal (traverse)
To handle all nodes, we can traverse them with depth-first search:
{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2'
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4'
      }, {
        type: 'NumberLiteral',
        value: '2'
      }]
    }]
  }]
}
The traversal process looks like this:
- Program – start at the top level of the AST
- CallExpression (add) – the first child of Program
- NumberLiteral (2) – the first child of CallExpression (add)
- CallExpression (subtract) – the second child of CallExpression (add)
- NumberLiteral (4) – the first child of CallExpression (subtract)
- NumberLiteral (2) – the second child of CallExpression (subtract)
If we operated directly on the existing AST instead of creating a new one, we would likely need to introduce all kinds of abstractions here, but for our purposes simply visiting every node in the tree is enough. The word "visiting" refers to a pattern of performing some operation on the elements of an object structure.
Visitor (visitor)
Here we can create a visitor object with methods that handle the different node types. For example:
var visitor = {
  NumberLiteral() {},
  CallExpression() {},
};
So, as we traverse the AST, whenever we reach a node whose type matches one of these methods, we call that method on the visitor to process the node.
Code generate
The final stage of the compiler is code generation. This stage sometimes overlaps with transformation, but the main part of code generation is outputting code based on the AST. Code generators work in several different ways: some compilers reuse tokens generated earlier, others build a separate representation of the code so they can print it linearly, but here we will keep using the AST we just created. Our generator needs to know how to print every node type in the AST, and it calls itself recursively until all of the code has been printed into one long string.
Code implementation
Those are all the parts of a compiler. Not every compiler looks exactly like this: compilers serve different purposes, so they may need different steps. Now let's start coding:
Tokenizer
Following the analysis above, the first step of the parse stage is lexical analysis, performed by the tokenizer. This function takes a string of code and splits it into an array of tokens, e.g. (add 2 (subtract 4 2)) => [{ type: 'paren', value: '(' }, ...]
So you can write a function like this:
const tokenizer = (input) => {
  const tokens = [];
  let current = 0;
  while (current < input.length) {
    let char = input[current];
    if (char === '(') {
      tokens.push({
        type: 'paren',
        value: '(',
      });
      current++;
      continue;
    }
    if (char === ')') {
      tokens.push({
        type: 'paren',
        value: ')',
      });
      current++;
      continue;
    }
    // Whitespace produces no token, just skip over it
    if (/\s/.test(char)) {
      current++;
      continue;
    }
    // Handle numbers
    if (/[0-9]/.test(char)) {
      let value = '';
      // Keep reading until a non-numeric character is encountered
      while (/[0-9]/.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({
        type: 'number',
        value,
      });
      continue;
    }
    // Handle names (sequences of letters)
    if (/[a-z]/i.test(char)) {
      let value = '';
      while (/[a-z]/i.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({
        type: 'name',
        value,
      });
      continue;
    }
    // If no token matched, throw an error
    throw new Error(`Unknown token: ${char}`);
  }
  return tokens;
};
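As a quick sanity check (this snippet is my own, not part of the repository), we can run the tokenizer on the example input and inspect a few of the resulting tokens:
// Sketch: run the tokenizer on the example input
const tokens = tokenizer('(add 2 (subtract 4 2))');
console.log(tokens[0]);      // { type: 'paren', value: '(' }
console.log(tokens[1]);      // { type: 'name', value: 'add' }
console.log(tokens.length);  // 9 tokens in total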
Parser
The parser takes the array of tokens produced by lexical analysis and converts it into an AST structure. For example: [{ type: 'paren', value: '(' }, ...] => { type: 'Program', body: [...] }
const parser = (tokens) => {
  let current = 0;
  const walk = () => {
    let token = tokens[current];
    // If it is a number token, return a new AST node
    if (token.type === 'number') {
      current++;
      return {
        type: 'NumberLiteral',
        value: token.value,
      };
    }
    // Next check for a CallExpression, which starts with an open parenthesis
    if (
      token.type === 'paren' &&
      token.value === '('
    ) {
      // Skip the open parenthesis
      token = tokens[++current];
      // Create a root node for the CallExpression and set its name to the current token.value,
      // because the token right after the open parenthesis must be the function name
      const node = {
        type: 'CallExpression',
        name: token.value,
        params: [],
      };
      // Skip the function name as well
      token = tokens[++current];
      // Loop over the following tokens until a close parenthesis is encountered;
      // these tokens become the `params` of the `CallExpression`.
      /**
       * For the Lisp-style code (add 2 (subtract 4 2)) the tokens contain nested parentheses:
       * [
       *   { type: 'paren',  value: '('        },
       *   { type: 'name',   value: 'add'      },
       *   { type: 'number', value: '2'        },
       *   { type: 'paren',  value: '('        },
       *   { type: 'name',   value: 'subtract' },
       *   { type: 'number', value: '4'        },
       *   { type: 'number', value: '2'        },
       *   { type: 'paren',  value: ')'        },  <<< close parenthesis
       *   { type: 'paren',  value: ')'        },  <<< close parenthesis
       * ]
       * When we encounter a nested CallExpression, we call walk() recursively to handle it.
       */
      while (
        token.type !== 'paren' ||
        (token.type === 'paren' && token.value !== ')')
      ) {
        node.params.push(walk());
        token = tokens[current];
      }
      // Skip the close parenthesis
      current++;
      return node;
    }
    throw new Error(`Unknown token type: ${token.type}`);
  };
  const ast = {
    type: 'Program',
    body: [],
  };
  // Use the recursive walk function to push nodes into ast.body
  while (current < tokens.length) {
    ast.body.push(walk());
  }
  return ast;
};
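Again a quick sketch of my own (assuming the tokenizer and parser above) to check the parser output:
// Sketch: parse the token array produced by the tokenizer
const ast = parser(tokenizer('(add 2 (subtract 4 2))'));
console.log(ast.type);                  // 'Program'
console.log(ast.body[0].name);          // 'add'
console.log(ast.body[0].params.length); // 2 (a NumberLiteral and a nested CallExpression)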
Traverser (traverser)
After the parser has produced the AST, we next need a traverser that can walk its nodes together with a visitor. Whenever a node of a certain type is encountered, the corresponding handler on the visitor is called:
traverser(ast, {
  Program(node, parent) {
    // ...
  },
  CallExpression(node, parent) {
    // ...
  },
  NumberLiteral(node, parent) {
    // ...
  },
});
So our code could be written like this:
const traverser = (ast, visitor) => {
  // Call traverseNode for every element in an array
  const traverseArray = (array, parent) => {
    array.forEach((child) => {
      traverseNode(child, parent);
    });
  };
  // This function takes a node and its parent as arguments
  // and passes the node to the matching handler on the visitor
  const traverseNode = (node, parent) => {
    const method = visitor[node.type];
    if (method) {
      method(node, parent);
    }
    // Handle each node type separately
    switch (node.type) {
      case 'Program':
        traverseArray(node.body, node);
        break;
      case 'CallExpression':
        traverseArray(node.params, node);
        break;
      // NumberLiteral has no children, so just break
      case 'NumberLiteral':
        break;
      default:
        throw new Error(`Unknown node type: ${node.type}`);
    }
  };
  traverseNode(ast, null);
};
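Here is a minimal sketch of my own showing the traverser and a visitor working together, counting the NumberLiteral nodes in the example AST:
// Sketch: count NumberLiteral nodes with a visitor
const ast = parser(tokenizer('(add 2 (subtract 4 2))'));
let numberCount = 0;
traverser(ast, {
  NumberLiteral(node, parent) {
    numberCount++;
  },
});
console.log(numberCount); // 3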
Converter (transformer)
The transformer works together with the traverser above: it takes the previously built AST, passes it to the traverser along with a visitor, and produces a brand-new AST. The original AST, for (add 2 (subtract 4 2)), looks like this:
{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2'
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4'
      }, {
        type: 'NumberLiteral',
        value: '2'
      }]
    }]
  }]
}
After the transformation, the AST corresponds to add(2, subtract(4, 2)):
{
  type: 'Program',
  body: [{
    type: 'ExpressionStatement',
    expression: {
      type: 'CallExpression',
      callee: {
        type: 'Identifier',
        name: 'add',
      },
      arguments: [{
        type: 'NumberLiteral',
        value: '2'
      }, {
        type: 'CallExpression',
        callee: {
          type: 'Identifier',
          name: 'subtract',
        },
        arguments: [{
          type: 'NumberLiteral',
          value: '4'
        }, {
          type: 'NumberLiteral',
          value: '2'
        }]
      }]
    }
  }]
}
We can then write the corresponding converter code as follows:
const transformer = (ast) => {
  const newAst = {
    type: 'Program',
    body: [],
  };
  // A little hack: _context is a reference from a node of the old AST to the
  // array on the new AST that its transformed children should be pushed into
  ast._context = newAst.body;
  traverser(ast, {
    // Handle NumberLiteral nodes
    NumberLiteral: (node, parent) => {
      parent._context.push({
        type: 'NumberLiteral',
        value: node.value,
      });
    },
    // Handle CallExpression nodes
    CallExpression: (node, parent) => {
      // Create a new CallExpression node
      // with a nested Identifier node inside it
      let expression = {
        type: 'CallExpression',
        callee: {
          type: 'Identifier',
          name: node.name,
        },
        arguments: [],
      };
      node._context = expression.arguments;
      // Check whether the parent is a CallExpression
      if (parent.type !== 'CallExpression') {
        // If not, wrap the CallExpression node in an ExpressionStatement,
        // because top-level CallExpressions in JavaScript are really statements
        expression = {
          type: 'ExpressionStatement',
          expression,
        };
      }
      // Finally push the (possibly wrapped) CallExpression into the parent's context
      parent._context.push(expression);
    },
  });
  return newAst;
};
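A small sketch of my own (assuming the functions above) to confirm the shape of the new AST:
// Sketch: run the original AST through the transformer and inspect the result
const newAst = transformer(parser(tokenizer('(add 2 (subtract 4 2))')));
console.log(newAst.body[0].type);                        // 'ExpressionStatement'
console.log(newAst.body[0].expression.callee.name);      // 'add'
console.log(newAst.body[0].expression.arguments.length); // 2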
Code Generator
The code generator is also a recursive function that eventually prints each node in the AST into a large string:
const codeGenerator = (node) => {
  switch (node.type) {
    // For a Program node, map over each node in `body`,
    // run them through codeGenerator recursively, and join them with newlines
    case 'Program':
      return node.body.map(codeGenerator).join('\n');
    // For an ExpressionStatement, call codeGenerator on the expression and append a semicolon
    case 'ExpressionStatement':
      return `${codeGenerator(node.expression)};`;
    // For a CallExpression, generate the callee, add an open parenthesis,
    // generate the arguments joined by commas, and add a close parenthesis
    case 'CallExpression':
      return `${codeGenerator(node.callee)}(${node.arguments.map(codeGenerator).join(', ')})`;
    // For an Identifier, return its name
    case 'Identifier':
      return node.name;
    // For a NumberLiteral, return its value directly
    case 'NumberLiteral':
      return node.value;
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
};
Compiler
At this point, we can create a compiler function that completes the job by chaining the functions above:
- input => tokenizer => tokens
- tokens => parser => ast
- ast => transformer => newAst
- newAst => generator => output
The code takes just a few simple steps:
const compiler = (input) => {
  const tokens = tokenizer(input);
  const ast = parser(tokens);
  const newAst = transformer(ast);
  return codeGenerator(newAst);
};
We can feed in the earlier test examples to make sure the results are correct, as shown below.
Conclusion
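A small sketch of such a check (expected results taken from the table at the beginning of the article; note the trailing semicolon added by ExpressionStatement):
// Sketch: end-to-end check of the compiler
console.log(compiler('(add 2 2)'));              // add(2, 2);
console.log(compiler('(subtract 4 2)'));         // subtract(4, 2);
console.log(compiler('(add 2 (subtract 4 2))')); // add(2, subtract(4, 2));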
At this point our tiny compiler is done. Babel's compilation process is based on the same principle, since Babel itself is a source-to-source compiler, and its pipeline is likewise divided into three steps: parse -> transform -> code generate. For details, please refer to babeljs.io/docs/en/#ba… Hopefully this article helps readers understand how such compilation tools work.