Introduction
Two days ago, I happened to find a project on GitHub: the-super-tiny-compiler, which, according to its own introduction, is probably the simplest compiler ever. As it happens, I am taking the course "Principles of Compilers" this semester, so my interest was piqued immediately. After a quick look, it turns out to be a compiler that converts a Lisp-style expression into a C-style expression, covering lexical analysis, syntax analysis, AST traversal and transformation, and the final code generation. Let me walk you through a simple implementation.
Lexical analysis
Lexical analysis is also called scanning or tokenization, and it is the first step every compiler has to perform. In simple terms, the source code to be transformed is broken apart into small units called tokens. Here is an example of breaking up a JavaScript statement:
let name = "touryung";
After splitting, we get something like [let, name, =, "touryung", ;], which is much easier to work with in the next step. Since we are implementing the simplest possible compiler, our tokenizer only needs to recognize parentheses, whitespace, numbers, strings, and names (variables).
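To make this concrete, such a token list could be represented as an array of small objects, something like the sketch below. The token types keyword, operator, and punctuation are hypothetical and only for illustration; the compiler we build below handles only the Lisp-flavoured subset mentioned above.
// A purely illustrative token array for `let name = "touryung";`
// (the types "keyword", "operator" and "punctuation" are made up for this example)
const exampleTokens = [
  { type: "keyword", value: "let" },
  { type: "name", value: "name" },
  { type: "operator", value: "=" },
  { type: "string", value: "touryung" },
  { type: "punctuation", value: ";" },
];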
The overall framework
To implement a lexical analyzer (tokenizer), we first need to build a framework. The overall idea of lexical analysis is to iterate over the input string, identify the different tokens, and store them in a token array.
The framework is as follows, with the meaning of each part noted in the comments:
const WHITESPACE = /\s/; // whitespace
const NUMBERS = /[0-9]/; // digits
const LETTERS = /[a-z]/i; // letters (names/variables)
function tokenizer(input) {
let current = 0; // The index of the character currently being examined
let tokens = []; // The token array
while (current < input.length) { // Traverse the input
let char = input[current]; // The character currently being examined
// Different token recognition operations go here
throw new TypeError(`I don't know what this character is: ${char}`);
}
return tokens;
}
The next step is to identify different tokens.
Identify the parentheses
Recognizing parentheses is simple: when the character we are currently looking at is a left or right parenthesis, we push an object describing the current token into the token array.
// Recognize the left parenthesis
if (char === "(") {
tokens.push({ type: "paren", value: "(" }); // Push an object describing the current token
current++;
continue;
}
// Recognize the close parenthesis
if (char === ")") {
tokens.push({ type: "paren", value: ")" });
current++;
continue;
}
Copy the code
Identify whitespace
This case matters because whitespace is essentially irrelevant to the syntax of the programming language, which is why JavaScript code can be minified and still work. So when we recognize whitespace, we do not need to put it into the token array for the next step.
In fact, tokens like whitespace, comments, and line breaks that do not affect the program syntax are not sent to the next step in lexical analysis.
So, when we recognize a space, we just keep iterating:
// space is not processed
if (WHITESPACE.test(char)) {
current++;
continue;
}
Identify numbers/variables/strings
Why did I group these three kinds of tokens together? Because the matching logic for all three is similar. Let's start with numbers. Since a token may consist of more than one character, an inner loop has to keep going until the entire token has been matched.
// number, loop to get the number
if (NUMBERS.test(char)) {
let value = ""; // Accumulates the matched digits
while (NUMBERS.test(char)) { // Keep appending while the character is still a digit
value += char;
char = input[++current];
}
tokens.push({ type: "number", value }); // Push an object describing the current token
continue;
}
// Variable, similar to number
if (LETTERS.test(char)) {
let value = "";
while (LETTERS.test(char)) {
value += char;
char = input[++current];
}
tokens.push({ type: "name", value });
continue;
}
// String: the quotes before and after need to be skipped
if (char === '"') {
let value = "";
char = input[++current]; // Skip the preceding quotes
while (char !== '"') { // Stop when we reach the closing quote
value += char;
char = input[++current];
}
char = input[++current]; // Skip the quotes
tokens.push({ type: "string", value });
continue;
}
Note that recognizing a string is similar to the two cases above, with two differences:
- The quotation marks are skipped; only the value between them is collected.
- The inner loop's end condition is reaching the closing quotation mark.
You might ask: if the quotes are skipped, how do we later know that the value was a string? That is exactly what the token object pushed into the array is for: its type attribute records what kind of token it is.
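For instance, running the tokenizer we just wrote on a bare quoted string shows how the string survives without its quotes:
// The quotes themselves are not stored; the type field records that it was a string
tokenizer('"touryung"'); // [ { type: 'string', value: 'touryung' } ]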
A small summary
At this point, the lexical analysis is done, and it is actually fairly easy to understand. So can we directly see what the token array produced by lexical analysis looks like? Sure, just write a small test, for example:
let source = "(add 2 (subtract 4 2))"; // the source code
let tokens = tokenizer(source);
console.log(tokens);
This is a Lisp statement that evaluates 2 + (4 - 2). Using it as input, we get the following token array:
[
{ type: 'paren', value: '(' },
{ type: 'name', value: 'add' },
{ type: 'number', value: '2' },
{ type: 'paren', value: '(' },
{ type: 'name', value: 'subtract' },
{ type: 'number', value: '4' },
{ type: 'number', value: '2' },
{ type: 'paren', value: ')' },
{ type: 'paren', value: ')' }
]
The source code we started with has been taken apart exactly as we wanted.
Syntax analysis
The next step is syntax analysis. The function of syntax analysis is to convert the token array output in the previous step into the corresponding AST (Abstract syntax tree) according to the specific programming language syntax. Since the tree structure is involved, this step naturally involves recursive operations.
The overall framework
As before, the parsing step also starts with an overall framework. The idea is to traverse the token array and recursively build the AST tree. The framework is as follows:
function parser(tokens) {
let current = 0;
function walk() {
let token = tokens[current];
// Convert different tokens to AST nodes
throw new TypeError(token.type);
}
let ast = { // The outermost structure of the AST tree; this part is fixed
type: "Program",
body: [],
};
while (current < tokens.length) { // Iterate over the token array to build the tree structure
ast.body.push(walk());
}
return ast;
}
Build number and string nodes
The construction of these two types of nodes is relatively simple, just return the object describing the node:
// Build a number node
if (token.type === "number") {
current++;
return {
type: "NumberLiteral",
value: token.value,
};
}
// Build a string node
if (token.type === "string") {
current++;
return {
type: "StringLiteral",
value: token.value,
};
}
Build the function call node
Those of you who know Lisp know that parentheses are the essence of Lisp; for example, a function call takes the form (add 1 2). So we use the parentheses to recognize function calls. The code is as follows:
if (token.type === "paren" && token.value === "(") { // Open parenthesis
token = tokens[++current]; // Skip the open parenthesis
let node = { // The function call node
type: "CallExpression",
name: token.value,
params: [],
};
token = tokens[++current]; // Skip the function name
// Collect parameter nodes recursively as long as they are not close parentheses
while (!(token.type === "paren" && token.value === ")")) {
node.params.push(walk()); // Add parameters
token = tokens[current];
}
current++; // Skip the closing parenthesis
return node;
}
The important thing to note here is that an argument can also be the result of a function call, so you need to recursively call the walk function when parsing the argument.
It is also worth noting that we have used this code structure many times:
if (value === "(") {
  // ...
  while (value !== ")") {
    // ...
  }
}
This structure is obviously suitable for traversing an interval, so we need to use it when parsing paired elements such as strings and parentheses.
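As an aside, here is a small hypothetical helper (not part of the compiler) showing the same interval pattern in isolation; collectBetween and its parameters are my own names, used only for illustration:
// Collect the characters between a pair of delimiters, using the same
// "skip opener, loop until closer, skip closer" structure as above.
function collectBetween(input, start, open, close) {
  let current = start;
  let value = "";
  if (input[current] === open) {
    current++; // skip the opening delimiter
    while (input[current] !== close) {
      value += input[current++];
    }
    current++; // skip the closing delimiter
  }
  return { value, next: current };
}

collectBetween('"touryung"', 0, '"', '"'); // { value: 'touryung', next: 10 }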
A small summary
After a few simple steps, the token array is converted into an AST tree structure, which still feels a little magical. At this point, the output for our example program is as follows:
{
  "type": "Program",
  "body": [
    {
      "type": "CallExpression",
      "name": "add",
      "params": [
        {
          "type": "NumberLiteral",
          "value": "2"
        },
        {
          "type": "CallExpression",
          "name": "subtract",
          "params": [
            {
              "type": "NumberLiteral",
              "value": "4"
            },
            {
              "type": "NumberLiteral",
              "value": "2"
            }
          ]
        }
      ]
    }
  ]
}
Traverse and transform the AST tree
At this point, we have an AST tree, and the compiler’s ability to convert source code into object code can actually be seen as converting the source AST tree into the target AST tree. To achieve this transformation, we need to traverse the tree and then operate on the corresponding nodes.
Traverse the tree
As we can see from the above, the body attribute and the function call parameters in the AST tree are array types, so we first define a method to traverse an array. It simply visits each element of the array in turn.
// Access the array
function traverseArray(array, parent) {
array.forEach((child) => traverseNode(child, parent));
}
When traversing a specific node, we call the enter method for that node's type to perform the access (AST transformation) operation; different node types have different enter methods. (In the complete implementation, traverseArray and traverseNode are wrapped in a traverser(ast, visitor) function, which is what the transformer below calls.)
function traverseNode(node, parent) {
let method = visitor[node.type]; // Get the visitor method for the current node type
if (method && method.enter) { // Execute the corresponding Enter method
method.enter(node, parent);
}
switch (node.type) { // Perform different traversal operations for different types of nodes
case "Program":
traverseArray(node.body, node);
break;
case "CallExpression":
traverseArray(node.params, node);
break;
case "NumberLiteral":
case "StringLiteral":
break;
default:
throw new TypeError(node.type);
}
}
Again, you might ask: why does the parent need to be passed as the second parameter when calling enter? This is related to the logic of the transformation step, which will be explained shortly.
Convert the AST tree
The overall framework
Again, we can start by building the general framework and fill in the access (transformation) methods for each node type later. The transformation idea here is the important part: how do we traverse the old AST tree and add the transformed nodes to a new AST tree?
The implementation idea here is generally divided into the following steps:
- Add a _context property to the relevant nodes of the old AST tree that points to the corresponding array node of the new AST tree.
- While traversing the children of an old AST array node, push each converted child into that _context property.
- Thanks to the reference nature of JavaScript objects, this puts the transformed nodes straight into the new AST tree (see the small sketch below).
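To make the last point concrete, here is a tiny standalone sketch of the reference trick; it is only an illustration and not part of the compiler code:
// oldParent._context and newParent.body are the same array object,
// so pushing through one is visible through the other.
let newParent = { type: "Program", body: [] };
let oldParent = { type: "Program" };

oldParent._context = newParent.body; // share the array by reference
oldParent._context.push({ type: "NumberLiteral", value: "2" });

console.log(newParent.body); // [ { type: 'NumberLiteral', value: '2' } ]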
I believe this also answers the earlier question of why the parent needs to be passed as the second argument when the enter method is called.
function transformer(ast) {
let newAst = { // The outermost structure of the new AST tree
type: "Program",
body: [],
};
// _context is used to push into the new AST while traversing the old tree's child nodes
ast._context = newAst.body;
let visitor = {
// Different types of node access (transformation) methods
};
traverser(ast, visitor); // Start traversing the old AST tree
return newAst;
}
Conversion of numbers and strings
{
NumberLiteral: {
enter(node, parent) {
parent._context.push({ // Push into the new AST tree
type: "NumberLiteral",
value: node.value,
});
},
},
StringLiteral: {
enter(node, parent) {
parent._context.push({
type: "StringLiteral",
value: node.value,
});
},
},
}
Transformation of the function call node
The function call node is special because its arguments can be treated as children, so you need to point the _context property of the current node to the corresponding parameter array of the new AST tree.
If the current function call is not nested inside another function call, we also wrap it in an ExpressionStatement to indicate that the node is a complete statement. For example, the inner call in (add 2 (subtract 4 2)) cannot be considered a complete statement, because it exists as an argument to another function call.
{
CallExpression: {
enter(node, parent) {
let expression = { // Function call node of the new AST tree
type: "CallExpression",
callee: {
type: "Identifier",
name: node.name,
},
arguments: [],
};
node._context = expression.arguments; // Point _context at the new node's arguments array
// If the current call is not nested inside another function call,
// wrap it in an ExpressionStatement
if (parent.type !== "CallExpression") {
expression = {
type: "ExpressionStatement",
expression: expression,
};
}
parent._context.push(expression);
},
},
}
A small summary
So far, we have completed the traversal and transformation of the AST tree. That was a fair amount of work, but it is also the most interesting part of the compilation. If everything went well, we now get the following new AST tree:
{
  "type": "Program",
  "body": [
    {
      "type": "ExpressionStatement",
      "expression": {
        "type": "CallExpression",
        "callee": {
          "type": "Identifier",
          "name": "add"
        },
        "arguments": [
          {
            "type": "NumberLiteral",
            "value": "2"
          },
          {
            "type": "CallExpression",
            "callee": {
              "type": "Identifier",
              "name": "subtract"
            },
            "arguments": [
              {
                "type": "NumberLiteral",
                "value": "4"
              },
              {
                "type": "NumberLiteral",
                "value": "2"
              }
            ]
          }
        ]
      }
    }
  ]
}
This is the structure of the AST tree corresponding to the C code, and comparing it to the AST tree of Lisp before, you can see many differences.
Code generation
Finally, the most exciting moment: generating the target code! This step is relatively easy: we recursively traverse the AST tree generated in the previous step and emit the final code:
function codeGenerator(node) {
switch (node.type) {
case "Program":
return node.body.map(codeGenerator).join("\n");
case "ExpressionStatement":
return `${codeGenerator(node.expression)};`;
case "CallExpression": // Generate a function call
return `${codeGenerator(node.callee)}(${node.arguments
.map(codeGenerator)
.join(", ")})`;
case "Identifier": // Generate the variable name
return node.name;
case "NumberLiteral":
return node.value; // Generate a number
case "StringLiteral":
return `"${node.value}"`; // Generate string (don't forget quotes)
default:
throw new TypeError(node.type);
}
}
Eventually, we have made the journey from the Lisp example code (add 2 (subtract 4 2)) to the C-style code add(2, subtract(4, 2));.
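For completeness, here is a minimal sketch of how the four phases could be wired together. The wrapper name compiler is my own; the article above only defines the four phase functions:
// A hypothetical wrapper chaining the four phases (assumes the tokenizer,
// parser, transformer and codeGenerator functions from the sections above).
function compiler(input) {
  const tokens = tokenizer(input); // 1. lexical analysis
  const ast = parser(tokens); // 2. syntax analysis
  const newAst = transformer(ast); // 3. AST traversal and transformation
  return codeGenerator(newAst); // 4. code generation
}

console.log(compiler("(add 2 (subtract 4 2))")); // add(2, subtract(4, 2));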
Final summary
This article has taken you from zero to the most basic functions of a compiler, covering lexical analysis, syntax analysis, AST traversal and transformation, and code generation. The principles of compilation sound esoteric (and they are), but the basic parts, lexical analysis and syntax analysis, ultimately come down to handling strings.
I work on the front end, which might seem far removed from compiler principles. But once you look into it, the work Babel does to turn ES6+ code into ES5 is done by a compiler, as is the recently popular esbuild. Anything that involves converting code involves compilation; even Vue has a compiler built in for template compilation.
Having said all this, my original intention remains the hope that, in your day-to-day learning, you will wade into new fields of knowledge and expand your skills, so as to raise your technical vision and your ceiling.
(The end)