First, let me explain where this idea came from. I have recently been learning Swift and SwiftUI in my spare time, and I often use a piece of syntax called the trailing closure, which I find very interesting. At the same time, since I had read Jamie Builds' the-super-tiny-compiler a long time ago, I wondered whether I could implement a similar, fun, and simple compiler myself. Hence the js-trailing-closure-toy-compiler project, and today's article.
For those of you who are not familiar with Swift, let me explain what a trailing closure is. In short, if the last argument to a function is also a function, we can pass that last function using trailing-closure syntax. See the following code example:
// #1: simple call
// Swift way
a() { /* trailing closure body */ }
// JavaScript way
a(() => {})

// #2: function call with arguments
// Swift way
a(1, 2, 3) { /* trailing closure body */ }
// JavaScript way
a(1, 2, 3, () => {})

// #3: trailing closure with parameters
// Swift way
a(1, 2, 3) { arg1, arg2 in /* trailing closure body */ }
// JavaScript way
a(1, 2, 3, (arg1, arg2) => {})
If you have any questions about Swift closures, take a look at the official documentation on Closures; it is also very clear.
I remember reading the source code of the the-super-tiny-compiler project a long time ago, but only briefly, and I thought I had mastered its ideas and principles. Yet when I tried to realize this idea of mine, I found that I had not actually mastered the methods and techniques practiced in that project. So I decided to properly understand its source code, implement a sample with the same functionality as the original project, and only then start implementing my own small compiler.
A friendly reminder: the article below is fairly long, so you may want to bookmark it and read it carefully later.
The implementation process of the compiler
As we can see from the-super-tiny-compiler, a typical compiler completes its work in four main steps, as follows:
- Tokenizer: converts our code string into meaningful units, i.e., tokens, such as `if`, `"hello"`, `123`, `let`, `const`, and so on.
- Parser: converts the tokens obtained in the previous step into an AST (Abstract Syntax Tree) for the current language. Why do this? Because once we have the tree, we know the order and hierarchy of the statements in the code: the execution order, the context, and so on.
- Transformer: transforms the AST obtained in the previous step into the AST of the target language. Why? Because program statements with the same function are likely to have different syntax in different languages, so their abstract syntax trees differ as well. We therefore need a transformation step to prepare for generating code in the target language.
- CodeGenerator: this step is relatively simple. Once we know the syntax of the target language, we can easily and quickly generate the desired target-language code from the new abstract syntax tree produced in the previous step.
The steps above are the general flow of a compiler, but just knowing the flow is not enough; you need to do it yourself. If you are interested, you can click here to try the JavaScript Trailing Closure Toy Compiler first. If you are interested in the implementation process, keep reading; I am sure you will learn a lot, and you may even want to implement an interesting compiler of your own.
Tokenizer: converting the code string to tokens

The first thing to understand is why we convert the string into individual tokens. Without this conversion, we have no way of knowing what the program represents, because tokens are a necessary precondition for understanding a program.
Take `console.log("Hello world!")` as an example. We know what it does at a glance, but how do we actually read it? First we recognize the `console` object, then the `.` property accessor, then the `log` method; then the `(` that opens the call, the string `"Hello world!"` as an argument, and finally the `)` that closes it.
So the purpose of converting a string to a token is to give us an idea of what this program is trying to represent. Because based on the value of each token, and where the token is located, we can know exactly what this token represents and what it does.
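To make this concrete, here is a hand-written sketch of what a token array for `console.log("Hello world!")` might look like. The token type names here are illustrative (in particular `Dot`, which our toy compiler below does not need):

```javascript
// Hand-written sketch: tokens for `console.log("Hello world!")`.
// The `Dot` type is an assumption of this sketch, not part of the project.
const tokens = [
  { type: 'Identifier', value: 'console' },
  { type: 'Dot', value: '.' },
  { type: 'Identifier', value: 'log' },
  { type: 'ParenLeft', value: '(' },
  { type: 'StringLiteral', value: 'Hello world!' },
  { type: 'ParenRight', value: ')' }
];

console.log(tokens.map(t => t.type).join(' '));
// Identifier Dot Identifier ParenLeft StringLiteral ParenRight
```

Each token carries a type and a value; the position of a token in the array is what lets later stages work out what it means.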
The first step of our compiler is to decide which tokens we need, based on the code example above. We can see that there are several types of tokens we need:
- Number: for instance, `1`, `66`, and so on.
- String: for instance, `"hello"`, and so on.
- Identifier: for instance, `a`; in the context of our compiler, it usually means the name of a function or a variable.
- Parentheses: `(` and `)`, used here to represent a function call.
- Curly braces: `{` and `}`, used here to represent the body of a function.
- Comma: `,`, used to separate arguments.
- Whitespace: used to separate tokens; whitespace itself produces no token.
Since our compiler only focuses on the trailing-closure functionality we want to implement, we only need to care about the token types above for now.
This step is actually relatively simple: we read tokens in a loop according to our requirements. The code is as follows:
// Tokenizer
const tokenizer = (input) => {
  // Character classes
  const numReg = /[0-9]/;
  const idReg = /[a-z]/i;
  const spaceReg = /\s/;
  const tokens = [];
  const len = input.length;
  if (len > 0) {
    let cur = 0;
    while (cur < len) {
      let curChar = input[cur];
      // Numbers
      if (numReg.test(curChar)) {
        let num = '';
        while (numReg.test(curChar) && curChar) {
          num += curChar;
          curChar = input[++cur];
        }
        tokens.push({ type: 'NumericLiteral', value: num });
        continue;
      }
      // Identifiers (and the `in` keyword)
      if (idReg.test(curChar)) {
        let idVal = '';
        while (idReg.test(curChar) && curChar) {
          idVal += curChar;
          curChar = input[++cur];
        }
        if (idVal === 'in') {
          tokens.push({ type: 'InKeyword', value: idVal });
        } else {
          tokens.push({ type: 'Identifier', value: idVal });
        }
        continue;
      }
      // Strings
      if (curChar === '"') {
        let strVal = '';
        curChar = input[++cur];
        while (curChar !== '"') {
          strVal += curChar;
          curChar = input[++cur];
        }
        tokens.push({ type: 'StringLiteral', value: strVal });
        // We need to skip the closing double quotation mark
        cur++;
        continue;
      }
      // Left parenthesis
      if (curChar === '(') {
        tokens.push({ type: 'ParenLeft', value: '(' });
        cur++;
        continue;
      }
      // Right parenthesis
      if (curChar === ')') {
        tokens.push({ type: 'ParenRight', value: ')' });
        cur++;
        continue;
      }
      // Left brace
      if (curChar === '{') {
        tokens.push({ type: 'BraceLeft', value: '{' });
        cur++;
        continue;
      }
      // Right brace
      if (curChar === '}') {
        tokens.push({ type: 'BraceRight', value: '}' });
        cur++;
        continue;
      }
      // Comma
      if (curChar === ',') {
        tokens.push({ type: 'Comma', value: ',' });
        cur++;
        continue;
      }
      // Whitespace is skipped
      if (spaceReg.test(curChar)) {
        cur++;
        continue;
      }
      throw new Error(`${curChar} is not a good character`);
    }
  }
  console.log(tokens, tokens.length);
  return tokens;
};
The code above is not very complicated, but there are a few points to be aware of; if you are not careful, it is easy to get it wrong or end up in an infinite loop. Here are the places I think can go wrong:
- The outer layer uses a `while` loop, and each iteration starts from the character at the current index. We don't use a `for` loop because the index `cur` is advanced by the logic inside the loop body, which makes `while` more convenient.
- When reading strings, numbers, and identifiers, an inner loop is needed to read continuously until the next character no longer matches the current type. Since these tokens can span several characters, we must check whether each next character still matches the current type; if it doesn't, the read of the current token ends, and we continue with the next iteration of the outer loop.
- For strings, the `"` at the beginning and the end is skipped and does not count toward the string's value. Whitespace characters are simply skipped.
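The inner-loop pattern described above can be sketched on its own. `readWhile` here is a hypothetical helper for illustration, not part of the project's code:

```javascript
// Sketch: consume characters from `start` while they match `reg`,
// returning the collected value and the index of the next character.
const readWhile = (input, start, reg) => {
  let cur = start;
  let value = '';
  while (cur < input.length && reg.test(input[cur])) {
    value += input[cur];
    cur++;
  }
  return { value, next: cur };
};

console.log(readWhile('123abc', 0, /[0-9]/)); // { value: '123', next: 3 }
```

Returning the next index is what keeps the outer loop from re-reading characters that an inner loop has already consumed.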
The process is not technically difficult; it just requires a little patience. After the implementation is complete, we can test it:
tokenizer(`a(1){}`)
You can see the output as follows:
(6) [{…}, {…}, {…}, {…}, {…}, {…}]
0: {type: "Identifier", value: "a"}
1: {type: "ParenLeft", value: "("}
2: {type: "NumericLiteral", value: "1"}
3: {type: "ParenRight", value: ")"}
4: {type: "BraceLeft", value: "{"}
5: {type: "BraceRight", value: "}"}
As you can see, the output is exactly what we want; at this point we are 25% of the way there. The next step is to convert the resulting token array into an AST.
Parser: converting the token array to an AST (Abstract Syntax Tree)
The next step is to convert the token array into an AST (Abstract Syntax Tree). In the previous step we converted the code string into meaningful tokens; once we have these tokens, we can deduce the whole abstract syntax tree based on the meaning of each one.
For example, when we encounter a `{`, we know that all tokens up to the matching `}` represent the body of a function (leaving other cases aside for now).
Take the token array we produced above for `a(1){}` as an example. The program statement it represents is:
a(1) {
// block
};
The corresponding abstract syntax tree would look like this:
{
"type": "Program",
"body": [
{
"type": "CallExpression",
"value": "a",
"params": [
{
"type": "NumericLiteral",
"value": "1",
"parentType": "ARGUMENTS_PARENT_TYPE"
}
],
"hasTrailingBlock": true,
"trailingBlockParams": [],
"trailingBody": []
}
]
}
Let's take a quick look at the abstract syntax tree above. The outermost type is `Program`, and the content of `body` represents the content of our code. In this case, the `body` array has only one element, of type `CallExpression`, which represents a function call.
This `CallExpression` has the name `a`, and the first argument of the call has type `NumericLiteral` and value `1`. The parent node type of this argument is `ARGUMENTS_PARENT_TYPE`, which is explained further below. The `hasTrailingBlock` value of the `CallExpression` is `true`, indicating that this function call has a trailing closure. `trailingBlockParams` indicates that the trailing closure has no parameters, and `trailingBody` indicates that the content of the trailing closure is empty.
The above is just a brief explanation; the detailed code is as follows:
// Constants used to mark a node's parent type
const ARGUMENTS_PARENT_TYPE = 'ARGUMENTS_PARENT_TYPE';
const BLOCK_PARENT_TYPE = 'BLOCK_PARENT_TYPE';

// Parser
const parser = (tokens) => {
  const ast = { type: 'Program', body: [] };
  let cur = 0;
  const walk = () => {
    let token = tokens[cur];
    if (token.type === 'NumericLiteral') {
      cur++;
      return { type: 'NumericLiteral', value: token.value };
    }
    if (token.type === 'StringLiteral') {
      cur++;
      return { type: 'StringLiteral', value: token.value };
    }
    // Commas are consumed and return nothing
    if (token.type === 'Comma') {
      cur++;
      return;
    }
    if (token.type === 'Identifier') {
      const callExp = {
        type: 'CallExpression',
        value: token.value,
        params: [],
        hasTrailingBlock: false,
        trailingBlockParams: [],
        trailingBody: []
      };
      const specifyParentNodeType = () => {
        // Remove the empty slots produced by skipped commas
        callExp.params = callExp.params.filter(p => p);
        callExp.trailingBlockParams = callExp.trailingBlockParams.filter(p => p);
        callExp.trailingBody = callExp.trailingBody.filter(p => p);
        callExp.params.forEach((node) => { node.parentType = ARGUMENTS_PARENT_TYPE; });
        callExp.trailingBlockParams.forEach((node) => { node.parentType = ARGUMENTS_PARENT_TYPE; });
        callExp.trailingBody.forEach((node) => { node.parentType = BLOCK_PARENT_TYPE; });
      };
      const handleBraceBlock = () => {
        callExp.hasTrailingBlock = true;
        // Skip the left brace
        token = tokens[++cur];
        const params = [];
        const blockBody = [];
        let isParamsCollected = false;
        while (token.type !== 'BraceRight') {
          if (token.type === 'InKeyword') {
            callExp.trailingBlockParams = params;
            isParamsCollected = true;
            token = tokens[++cur];
          } else {
            if (!isParamsCollected) {
              params.push(walk());
              token = tokens[cur];
            } else {
              // Handle the content inside the curly braces
              blockBody.push(walk());
              token = tokens[cur];
            }
          }
        }
        // If isParamsCollected is false, no `in` keyword was found,
        // so everything collected belongs to the closure body
        if (!isParamsCollected) {
          callExp.trailingBody = params;
        } else {
          callExp.trailingBody = blockBody;
        }
        // Skip the right curly brace
        cur++;
      };
      // Decide whether this identifier is a function call
      const next = tokens[cur + 1];
      if (next && (next.type === 'ParenLeft' || next.type === 'BraceLeft')) {
        token = tokens[++cur];
        if (token.type === 'ParenLeft') {
          // Skip the left parenthesis
          token = tokens[++cur];
          while (token.type !== 'ParenRight') {
            callExp.params.push(walk());
            token = tokens[cur];
          }
          // Skip the right parenthesis
          cur++;
          token = tokens[cur];
          // Handle the trailing closure, e.g. `func() {}`
          if (token && token.type === 'BraceLeft') {
            handleBraceBlock();
          }
        } else {
          handleBraceBlock();
        }
        specifyParentNodeType();
        return callExp;
      } else {
        cur++;
        return { type: 'Identifier', value: token.value };
      }
    }
    throw new Error(`this ${token} is not a good token`);
  };
  while (cur < tokens.length) {
    ast.body.push(walk());
  }
  console.log(ast);
  return ast;
};
I have added comments at the key points to make it easier to understand. Here is a brief walkthrough of the code above.
First we need to iterate through tokens. We define the outermost structure of the abstract syntax tree as:
const ast = {
type: 'Program',
body: []
};
This is defined so that subsequent node objects can be added to our abstract syntax tree according to certain rules.
We then define a walk function to walk through the tokens array. For numbers, strings, and commas, walk simply consumes the token and returns (commas return nothing). When the token is an identifier, there are several cases to consider.
For an identifier, there are two cases in our situation:

- One case is a single identifier, where the token following it is neither `(` nor `{`.
- The other case represents a function call. For a function call, we need to consider the following cases:
  - The function has only a trailing closure and no other arguments, e.g. `a{}`;
  - The function call has no trailing closure and may or may not take arguments, e.g. `a()` or `a(1)`;
  - The function call contains a trailing closure, e.g. `a{}`, `a(){}`, `a(1){}`, and so on. In the trailing-closure case we also need to consider whether the closure itself has parameters, e.g. `a(1){b, c in }`.
Next, let's briefly explain how a token that is an identifier is handled. When we see an identifier, we first define an object called callExp of type CallExpression, which represents the syntax-tree node of our function call. This object has the following properties:
- `type`: the type of the node.
- `value`: the name of the node; in this case, the function name.
- `params`: the arguments of the function call.
- `hasTrailingBlock`: whether the current function call contains a trailing closure.
- `trailingBlockParams`: the parameters of the trailing closure.
- `trailingBody`: the content inside the trailing closure.
We then determine what kind of token follows the identifier. If this is a function call, the token after the current one must be `(` or `{`. If not, we simply return the identifier.
If it is a function call, we need to do two things: collect the arguments of the call, and determine whether the call contains a trailing closure. Collecting the arguments is relatively simple: if the token after the identifier is `(`, we start collecting arguments until we encounter a token of type `ParenRight`. Also note that since an argument could itself be a function call, we call the walk function again when collecting arguments, letting it process each argument recursively.
The next step is to determine whether the function call contains a trailing closure. There are two cases to consider: one where the call has arguments, with the trailing closure after the argument list; and one where the call takes no arguments and consists solely of a trailing closure. We need to handle both.
Since there are two places where we need to check for a trailing closure, we can pull this logic out into the handleBraceBlock function. Now let's explain how trailing closures are handled.
If we determine that the next token is `{`, we need to handle a trailing closure. We first set the callExp's hasTrailingBlock property to true; then we need to determine whether the trailing closure has parameters, and handle the content inside the closure.
How do we collect the parameters of the trailing closure? We check whether the `in` keyword appears inside it. If it does, the closure has parameters: everything collected before `in` is the parameter list. If it does not, the closure has no parameters, and we only need to deal with the content inside it.
Because we cannot know up front whether the `in` keyword appears inside the trailing closure, what we collect may turn out to be either the closure's content or its parameters. So when we reach the closing `}` of the closure, if no `in` keyword was encountered along the way, then everything we collected is the closure's content.
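This "collect first, decide when we see `in` or `}`" idea can be sketched in isolation. `splitTrailingBlock` is a hypothetical helper for illustration only, working on plain strings instead of token objects:

```javascript
// Sketch: buffer items until `in` (buffer = params) or `}` (buffer = body).
const splitTrailingBlock = (tokens) => {
  const collected = [];
  let params = [];
  let sawIn = false;
  for (const t of tokens) {
    if (t === '}') break;           // end of the trailing closure
    if (t === 'in') {
      params = collected.slice();   // everything so far was parameters
      collected.length = 0;         // start collecting the body
      sawIn = true;
    } else {
      collected.push(t);
    }
  }
  // Without `in`, everything collected belongs to the closure body
  return sawIn ? { params, body: collected } : { params: [], body: collected };
};

console.log(splitTrailingBlock(['b', 'c', 'in', 'log', '}']));
// { params: [ 'b', 'c' ], body: [ 'log' ] }
```

The real parser does the same thing, except that each buffered item comes from a recursive walk call rather than a plain string.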
Whether we are collecting the closure's parameters or its content, we use the walk function to recurse, since parameters and content may not be primitive values. (To simplify the implementation, we use the walk function for the closure's parameters as well.)
Before returning the callExp object, we do a little extra processing with the specifyParentNodeType helper function. It does two things: first, it filters out the empty slots produced by skipped comma tokens; second, it assigns a parent node type to each node in the callExp object's params, trailingBlockParams, and trailingBody properties. For params and trailingBlockParams, the parent node type is ARGUMENTS_PARENT_TYPE; for trailingBody, it is BLOCK_PARENT_TYPE. This makes the next stage easier, and we will come back to it in the following steps.
Transformer: Transforms the old AST into the target language’s AST
The next step is to convert the raw AST we have obtained into the target language's AST. Why do we do this? Because the same logic is expressed differently in different host languages, so we need to convert the original AST into the AST of our target language.
So how do we do that? The original AST is a tree structure, so we need to traverse it. The traversal must be depth-first, because for a nested structure, the outer content can only be determined after the inner content has been determined.
There is a design pattern well suited to traversing trees: the Visitor pattern. We define a visitor object that maps node types to handler functions. When we reach a node during the traversal, we look up the handler for the current node's type on the visitor object and use it to process the node.
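Before looking at the real traverser, here is a minimal, self-contained sketch of the visitor pattern. The `children` property is an assumption of this sketch; our real nodes use `body`, `params`, and so on:

```javascript
// Sketch: depth-first traversal with enter/exit handlers keyed by node type.
const traverse = (node, visitor, parent = null) => {
  const handlers = visitor[node.type] || {};
  if (handlers.enter) handlers.enter(node, parent);
  (node.children || []).forEach(child => traverse(child, visitor, node));
  if (handlers.exit) handlers.exit(node, parent);
};

const tree = { type: 'Root', children: [{ type: 'Leaf' }] };
const visited = [];
traverse(tree, {
  Root: { enter: () => visited.push('enter Root'), exit: () => visited.push('exit Root') },
  Leaf: { enter: () => visited.push('enter Leaf') }
});
console.log(visited.join(' -> ')); // enter Root -> enter Leaf -> exit Root
```

Note that a node's exit handler only fires after all its children have been visited; that ordering is exactly what "inner content first" means for our transformation.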
Let's first look at how to traverse the original tree structure, where each node is either an object of a specific type or an array, so we handle these two cases separately. The traversal code is as follows:
const traverser = (ast, visitor) => {
  const traverseNode = (node, parent) => {
    const method = visitor[node.type];
    if (method && method.enter) {
      method.enter(node, parent);
    }
    const t = node.type;
    switch (t) {
      case 'Program':
        traverseArr(node.body, node);
        break;
      case 'CallExpression':
        // If the call has a trailing closure, turn it into an
        // ArrowFunctionExpression appended to the call's params
        if (node.hasTrailingBlock) {
          node.params.push({
            type: 'ArrowFunctionExpression',
            parentType: ARGUMENTS_PARENT_TYPE,
            params: node.trailingBlockParams,
            body: node.trailingBody
          });
          traverseArr(node.params, node);
        } else {
          traverseArr(node.params, node);
        }
        break;
      case 'ArrowFunctionExpression':
        traverseArr(node.params, node);
        traverseArr(node.body, node);
        break;
      case 'Identifier':
      case 'NumericLiteral':
      case 'StringLiteral':
        break;
      default:
        throw new Error(`this type ${t} is not a good type`);
    }
    if (method && method.exit) {
      method.exit(node, parent);
    }
  };
  const traverseArr = (arr, parent) => {
    arr.forEach((node) => {
      traverseNode(node, parent);
    });
  };
  traverseNode(ast, null);
};
Let me briefly explain the traverser function. It internally defines two functions, traverseNode and traverseArr. If the current node is an array, we process each node in the array separately.
The main processing logic lives in traverseNode. Let's see what this function does. First, the handler for the node is looked up on the visitor object based on the node's type. If the node's type is a primitive type, no further traversal is needed. If the node's type is ArrowFunctionExpression, its params and body properties are traversed in turn. If the node's type is CallExpression, the current node is a function-call node, and we need to check whether the call contains a trailing closure. If it does, the original function call needs an extra argument: an arrow function. The following code performs this check:
// ...
if (node.hasTrailingBlock) {
node.params.push({
type: 'ArrowFunctionExpression',
parentType: ARGUMENTS_PARENT_TYPE,
params: node.trailingBlockParams,
body: node.trailingBody
});
traverseArr(node.params, node);
} else {
traverseArr(node.params, node);
}
// ...
It then traverses the params property of the CallExpression node. When the call contains a trailing closure, we add an object of type ArrowFunctionExpression to the node's params, with parentType set to ARGUMENTS_PARENT_TYPE. This way we know the parent node type of the object, which we can use when converting the syntax tree.
The next step is to define the method of handling the different node types on the visitor object. The specific code is as follows:
const transformer = (ast) => {
  const newAst = { type: 'Program', body: [] };
  ast._container = newAst.body;
  const getNodeContainer = (node, parent) => {
    const parentType = node.parentType;
    if (parentType) {
      if (parentType === BLOCK_PARENT_TYPE) {
        return parent._bodyContainer;
      }
      if (parentType === ARGUMENTS_PARENT_TYPE) {
        return parent._argumentsContainer;
      }
    } else {
      return parent._container;
    }
  };
  traverser(ast, {
    NumericLiteral: {
      enter: (node, parent) => {
        getNodeContainer(node, parent).push({ type: 'NumericLiteral', value: node.value });
      }
    },
    StringLiteral: {
      enter: (node, parent) => {
        getNodeContainer(node, parent).push({ type: 'StringLiteral', value: node.value });
      }
    },
    Identifier: {
      enter: (node, parent) => {
        getNodeContainer(node, parent).push({ type: 'Identifier', name: node.value });
      }
    },
    CallExpression: {
      enter: (node, parent) => {
        const callExp = {
          type: 'CallExpression',
          callee: { type: 'Identifier', name: node.value },
          arguments: [],
          blockBody: []
        };
        // Point the old node's containers at the new node's arrays
        node._argumentsContainer = callExp.arguments;
        node._bodyContainer = callExp.blockBody;
        getNodeContainer(node, parent).push(callExp);
      }
    },
    ArrowFunctionExpression: {
      enter: (node, parent) => {
        const arrowFunc = {
          type: 'ArrowFunctionExpression',
          arguments: [],
          blockBody: []
        };
        // Point the old node's containers at the new node's arrays
        node._argumentsContainer = arrowFunc.arguments;
        node._bodyContainer = arrowFunc.blockBody;
        getNodeContainer(node, parent).push(arrowFunc);
      }
    }
  });
  console.log(newAst);
  return newAst;
};
We first define the outermost structure of the new AST, and then set ast._container = newAst.body. This associates the old AST with the outermost layer of the new one: since we traverse the old AST, the _container attribute lets us point into the new AST, so when we push an element into _container, we are actually adding the corresponding node to the new AST. This makes things quite a bit easier.
Then there is the getNodeContainer function, which retrieves the container on the parent of the current node. If the current node's parentType property is not empty, the current node belongs either to the arguments of a function call or to the content of a trailing closure, which we can tell from the value of parentType. If parentType is empty, we simply return the parent node's _container attribute.
The next step is to define how the visitor object handles the different node types. For the basic types, we push the corresponding node directly. The CallExpression and ArrowFunctionExpression types require some extra processing.
For an ArrowFunctionExpression node, we declare an arrowFunc object and point the old node's _argumentsContainer property at the arrowFunc object's arguments property, and its _bodyContainer property at the arrowFunc object's blockBody property. We then get the container of the current node's parent and push arrowFunc into it. Nodes of type CallExpression are handled similarly, except that the object gets an additional callee attribute, which holds the function name of the call.
This is where the conversion of the old AST to the new AST is complete.
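Tracing the visitor logic above by hand, the new AST for the input `a(1){}` should look roughly like this. This is a hedged sketch derived manually, not output captured from an actual run:

```javascript
// The trailing closure has become a regular last argument of the call.
const expectedNewAst = {
  type: 'Program',
  body: [{
    type: 'CallExpression',
    callee: { type: 'Identifier', name: 'a' },
    arguments: [
      { type: 'NumericLiteral', value: '1' },
      { type: 'ArrowFunctionExpression', arguments: [], blockBody: [] }
    ],
    blockBody: []
  }]
};

console.log(expectedNewAst.body[0].arguments.length); // 2
```

Comparing this with the old AST shown earlier, the Swift-specific `hasTrailingBlock`, `trailingBlockParams`, and `trailingBody` properties are gone; everything the trailing closure carried now lives in an ordinary arrow-function argument.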
CodeGenerator: traversing the new AST to generate code
This step is relatively simple: for each node type, we simply concatenate the corresponding piece of code. The detailed code is as follows:
const codeGenerator = (node) => {
  const type = node.type;
  switch (type) {
    case 'Program':
      return node.body.map(codeGenerator).join(';\n');
    case 'Identifier':
      return node.name;
    case 'NumericLiteral':
      return node.value;
    case 'StringLiteral':
      return `"${node.value}"`;
    case 'CallExpression':
      return `${codeGenerator(node.callee)}(${node.arguments.map(codeGenerator).join(', ')})`;
    case 'ArrowFunctionExpression':
      return `(${node.arguments.map(codeGenerator).join(', ')}) => {${node.blockBody.map(codeGenerator).join('; ')}}`;
    default:
      throw new Error(`this type ${type} is not a good type`);
  }
};
One thing to note is the handling of CallExpression and ArrowFunctionExpression nodes. For a CallExpression, you emit the function name followed by the call's arguments; for an ArrowFunctionExpression, you need to handle the arrow function's arguments and the content of its body. Compared with the three steps above, this step is relatively simple.
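To see the generator in action, here is a self-contained run over a hand-built new AST for `a(1){}`. The generator is copied from above, and the AST is the hand-derived sketch, so treat the result as illustrative:

```javascript
// The code generator from the article, unchanged in behavior.
const codeGenerator = (node) => {
  switch (node.type) {
    case 'Program':
      return node.body.map(codeGenerator).join(';\n');
    case 'Identifier':
      return node.name;
    case 'NumericLiteral':
      return node.value;
    case 'StringLiteral':
      return `"${node.value}"`;
    case 'CallExpression':
      return `${codeGenerator(node.callee)}(${node.arguments.map(codeGenerator).join(', ')})`;
    case 'ArrowFunctionExpression':
      return `(${node.arguments.map(codeGenerator).join(', ')}) => {${node.blockBody.map(codeGenerator).join('; ')}}`;
    default:
      throw new Error(`this type ${node.type} is not a good type`);
  }
};

// Hand-built new AST corresponding to the input `a(1){}`.
const newAst = {
  type: 'Program',
  body: [{
    type: 'CallExpression',
    callee: { type: 'Identifier', name: 'a' },
    arguments: [
      { type: 'NumericLiteral', value: '1' },
      { type: 'ArrowFunctionExpression', arguments: [], blockBody: [] }
    ],
    blockBody: []
  }]
};

console.log(codeGenerator(newAst)); // a(1, () => {})
```

The Swift-style trailing closure `a(1){}` has become the JavaScript call `a(1, () => {})`, which is exactly the transformation we set out to build.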
The next step is to put these four steps together, and the simple compiler is done. The specific code is as follows:
// Assemble the four steps
const compiler = (input) => {
  const tokens = tokenizer(input);
  const ast = parser(tokens);
  const newAst = transformer(ast);
  return codeGenerator(newAst);
};

module.exports = {
  tokenizer,
  parser,
  transformer,
  codeGenerator,
  compiler
};
Simple summary
If you had the patience to read this far, you will find that writing a simple compiler is not that complicated. We need to be clear about what each of the four stages does, pay attention to a few special cases, and bring a little patience.
Of course, this version of the implementation only does what we set out to do; a real compiler has to consider far more. The code above is not very polished in places: I implemented things the first way I thought of, without much regard for details or maintainability. If you have any good ideas, or find any mistakes, please send Issues or Pull Requests to this small project so it can become better. You are also welcome to leave a comment below the article; perhaps we can spark some new ideas together.
Some of you might ask: what is the point of learning this stuff? In fact, there are many uses. First of all, modern front-end builds are basically inseparable from Babel's support for new JavaScript features, and Babel is essentially a compiler: it converts the new features of our language into syntax that current browsers can support, so that we can comfortably use new syntax while easing some of the burden of front-end development.
On the other hand, once you know these principles, you can not only read the source code of some of Babel's syntax transformers with ease, but also implement a simple syntax transformer or some interesting plugins yourself. This will give your front-end skills a big boost.
Time flies; it has been two months since my last article 😂. This is also my first article of the new year, and I hope to keep producing high-quality articles going forward. Of course, the earlier design-pattern adventure series will continue to be updated, and you are welcome to keep an eye on it.
That's it for today's article. If you have any comments or suggestions, please leave them below, or raise them here. You are also welcome to follow my public account, "Guan Shan is not difficult to cross"; if you found this article good or helpful, please like and share it ~
Reference:
- the-super-tiny-compiler