preface

This article is mainly translated from comments in the the-super-tiny-Compiler, which implements a minimalist compiler with a compiler core. I hope to provide some help to those who want to have a preliminary understanding of the compilation process.

The profile

  1. Learn how to write a super simple, lightweight compiler with less than 200 lines of code without comments.
  2. The compiler compiles lisp-like syntactic function calls into C-like function calls
  3. If you are not familiar with either syntax, here is a brief introduction
  4. Let’s say I have two functionsaddsubtractThey are the rest in their respective languages:
content Class lisp Class C
2 + 2 (add 2 2) add(2, 2)
4-2 (subtract 4 2) Subtract (4, 2)
2 plus 4 minus 2. (add 2 (subtract 4 2)) Add (2, subtract (4, 2))
  1. The entire syntax for this compilation is shown above. While it does not cover the full Lisp or C syntax, it is sufficient to show the major components of a modern compiler

Compiler composition

Most compilers can be roughly divided into three stages: Parsing, translation Transformation, and Code Generation

  1. parsingTake the original code and turn it into a more abstract code representation
  2. translationAn abstract code representation prepares the compiler for what it wants to do
  3. Code generationTranslate the translated abstract representation into new code to compile

Parsing the Parsing


The parsing process is usually divided into two parts: lexical analysis and grammatical analysis

  1. Lexical analysis takes the raw code and breaks it up into tokens that describe the syntax. They can be numbers, text, punctuation marks, operators, and so on
  2. Analysis captures phrases of tokens and reformats them into a representation that describes each part of the grammar and how it relates to one another. This is called an intermediate representation or abstract syntax tree. Abstract syntax trees (AST for short) are deeply nested objects that represent code in a way that is both easy to use and tells us a lot.

Sample grammar

Add 2 (subtract 4 2) Tokens

    { type: 'paren',  value: '('        },
     { type: 'name',   value: 'add'      },
     { type: 'number', value: '2'        },
     { type: 'paren',  value: '('        },
     { type: 'name',   value: 'subtract' },
     { type: 'number', value: '4'        },
     { type: 'number', value: '2'        },
     { type: 'paren',  value: ')'        },
     { type: 'paren',  value: ')'        },
   ]
Copy the code

The abstract syntax tree is shown below

    {
      type: 'Program',
      body: [{
        type: 'CallExpression',
        name: 'add',
        params: [{
          type: 'NumberLiteral',
          value: '2'}, {type: 'CallExpression',
          name: 'subtract',
          params: [{
            type: 'NumberLiteral',
            value: '4'}, {type: 'NumberLiteral',
            value: '2',}]}]}]}Copy the code

translation

The next stage after obtaining the abstract syntax tree is the translation transformation. Again, this simply extracts the AST from the last step and makes changes to it. It can manipulate the AST in the same language, or it can translate the AST into an entirely new language.

Let’s look at how to transform an AST.

You may notice that there are elements in our AST that look very similar. These objects have type properties. Each node is called an AST node. These nodes define attributes that describe a separate part of the tree.

We have a number node called NumberLiteral

{
    type: 'NumberLiteral',
    value: '2',}Copy the code

Or a call expression node

  {
     type: 'CallExpression',
     name: 'subtract',
     params: [...nested nodes here...],
  }
Copy the code

When transforming an AST, we can manipulate nodes by adding/removing/replacing properties, adding new nodes, deleting nodes, or creating a new AST based on it without using the existing AST.

Since we are targeting a new language, we will focus on creating a new AST specific to the target language.

traverse

To navigate through all of these nodes, we need to be able to walk through them. The following will traverse each node of the AST in depth-first fashion.

{
     type: 'Program',
     body: [{
       type: 'CallExpression',
       name: 'add',
       params: [{
         type: 'NumberLiteral',
         value: '2'
       }, {
         type: 'CallExpression',
         name: 'subtract',
         params: [{
           type: 'NumberLiteral',
           value: '4'
         }, {
           type: 'NumberLiteral',
           value: '2'}}}}]]]Copy the code

Therefore, for the AST above, we will:

  1. Program – Starts at the top level of the AST
  2. CallExpression (add) – Goes to the first element of the program
  3. NumberLiteral (2) – Moves to the first element of the CallExpression parameter
  4. CallExpression (Subtract) – Moves to the second element of the CallExpression parameter
  5. NumberLiteral (4) – Moves to the first element of the CallExpression parameter
  6. NumberLiteral (2) – Moves to the second element of the CallExpression parameter

If we operate on this AST directly, rather than creating a separate AST, we might introduce various abstractions here. But just visiting each node in the tree is enough to do what we’re trying to do.

I use the word “access” because this pattern exists to represent operations on object structural elements.

Visitors

The basic idea here is that we will create a “visitor” object whose methods will accept different node types.

var visitor = {
     NumberLiteral() {},
     CallExpression() {}};Copy the code

However, it is also possible to invoke the corresponding operation on exit. Imagine the old tree structure as a list:

  • Program
    • CallExpression
    • NumberLiteral
    • CallExpression
      • NumberLiteral
      • NumberLiteral

As we go down, as we go through all the branches. We “exit” it. So, down the tree, we “enter [Enter]” each node, and then “exit [exit]”.

    -> Program (enter)
      -> CallExpression (enter)
        -> Number Literal (enter)
        <- Number Literal (exit)
        -> Call Expression (enter)
           -> Number Literal (enter)
           <- Number Literal (exit)
           -> Number Literal (enter)
           <- Number Literal (exit)
        <- CallExpression (exit)
      <- CallExpression (exit)
    <- Program (exit)  
Copy the code

To support entry and exit operations, we adjust the Vistitor definition as follows

    var visitor = {
      NumberLiteral: {
        enter(node, parent) {},
        exit(node, parent) {},
      }
    };
Copy the code

Code generation


  • The last step in the compiler is code generation. Sometimes the compiler performs operations that overlap with transformations, but in most cases code generation simply means string-codification of the AST abstract syntax tree.

  • The code generators work in a few different ways. Some compilers will reuse the tokens they acquired earlier, others will create separate representations so that they can print nodes linearly. But as far as I know, most will use the AST we just created, and that’s what we’ll focus on later.

  • These are the core parts of a compiler. Note that not every compiler looks exactly like the one described here. There are many different kinds of compilers for different purposes, which may require steps as detailed below and many more.

code

You should now have a general idea of what the main compiler looks like.

With that explained, you are now ready to write your own compiler, and the code is ready to go.

The first step is to obtain the token

We’re going to take our code string and parse it into a token array

(add 2 (subtract 4 2)) => [{ type: ‘paren’, value: ‘(‘ }, …]

functionTokenizer (input) {// current, a cursor used to mark the character position of the currently read codeletcurrent = 0; // Tokens array variables to store parsed token phraseslettokens = []; // Open onewhileLoop, setting current to the increment within the loopwhile(current < input.length){// Get the character corresponding to the current cursorletchar = input[current]; // Check if the current character is a parenthesisif(char=== "("){// Add a 'paren' bracket of words to tokens phrase tokens. Push ({paren 'paren' bracket of wordstype: 'paren',
            value: '('}); // Then the cursor moves back one bit. // Enter the next loopcontinue; } // Check for close parenthesis, if so, add a close parenthesis phrase, add cursor, continue the next loopif (char === ') ') {
            tokens.push({
                type: 'paren',
                value:') '
            })
            current++;
            continue; } // Check whether the current character is a space, if it is a space, directly skip, cursor move // (add 123 456) // ^^^ ^^^ numberlet WHITESPACE = /\s/;
        if (WHITESPACE.test(char)) {
            current++;
            continue; } // The next type to be tested is number. Instead of the number type consisting of multiple numeric characters, we need // to get the entire sequence of digits as a token of the number typelet NUMBERS = /[0-9]/;
        if(NUMBERS. Test (char)){// Create a new value string to set the number stringlet value=' ';
            while(NUMBERS.test(char)){
                value += char;
                char = input[++current];
            }
            tokens.push({type:'number',value});
            continue; } // Double quoted strings // (concat) are also supported in the compiler to be implemented"foo" "bar") // ^^^^ ^^^^ payment stringif (char === '"') {
            let value = ' ';
            char = input[++current];
            while(char ! ='"') { value += char; Char = input[++current]} // Cursor skip closing quotation marks char = input[++current]; tokens.push({type:'string',
                value
            })
            continue; } // The last type of token is of type 'name'. It's a string of letters. This type is used as a function name for the compiler's // Lisp syntax stylelet LETTERS = /[a-z]/i;
        if(LETTERS.test(char)) {
            let value = ' ';
            while (LETTERS.test(char)){
                value += char;
                char = input[++current];
            }
            tokens.push({type: 'name', value});

            continue; } // Throw a type exception if any of the above types do not match.'I dont know what this character is: ' + char);
    }

    return tokens;
}
Copy the code

Parse Abstract syntax tree

functionParser (tokens) {// Create current variables as cursorsletcurrent = 0; // This method will use recursion insteadwhileLoop, first defining a walk methodfunction walk() {// Get the current tokenlettoken = tokens[current]; // Starting with tokens of type number, place tokens of different types in different places in the codeif (token.type === 'number') {// if the current type is number, the cursor forward current++; // Returns an AST node of type numberreturn {
                type: 'NumberLiteral', value: token.value}} // character token returns an AST node of character typeif (token.type === 'string') {
            current++;

            return {
                type: 'StringLiteral'Value: token.value}} // Check whether the expression is called. Check if it is a parenthesis type and is an open parenthesis tokenif (
            token.type === 'paren' && 
            token.value === '('Tokens = tokens[++current]; tokens = tokens[++current];let node = {
                type:'CallExpression', name: token.value, params: []} // Move cursor forward to skip tokens of name type token = tokens[++current]; // Now start iterating through the parameters of CallExpression until it encounters the close parenthesis // here is the recursion at first, we solve the nested node problem by recursion. // To explain this, let's use our Lisp code. You can see that the parameters of // add are a number and a nested CallExpression that contains its own parameters. / / / / / {type: 'paren',  value: '('}, / / {type: 'name',   value: 'add'}, / / {type: 'number', value: '2'}, / / {type: 'paren',  value: '('}, / / {type: 'name',   value: 'subtract'}, / / {type: 'number', value: '4'}, / / {type: 'number', value: '2'}, / / {type: 'paren',  value: ') '        }, <<< Closing parenthesis
            //     { type: 'paren',  value: ') '}, <<< Closing parenthesis //] // We recursively call the walk method to go through the embedded 'CallExpression'. // Here we create a While loop to go through until the left parenthesis is reachedwhile( (token.type ! = ='paren')||
                (token.type === 'paren'&& token.value ! = =') ')){   
                node.params.push(walk());
                token = tokens[current];
            }
            current++;
            returnnode; } // Throw new TypeError(token.type) if the type is not detected above; }let ast = {
        type:'Program',
        body:[]
    }
    while(current < tokens.length){
        ast.body.push(walk());
    }
    return ast;
}
Copy the code

Transform the abstract syntax tree

/ * * * * = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = * ⌒ (❀ > ◞ ౪ ◟ < ❀) ⌒ * THE TRAVERSER!!! * = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = * now through the parser with a AST abstract syntax tree, now through vistor access * * traverse by each node (AST, {* Program: { * enter(node, parent) { * // ... * *}exit(node, parent) { * // ... * }, * }, * * CallExpression: { * enter(node, parent) { * // ... * *}exit(node, parent) { * // ... * }, * }, * * NumberLiteral: { * enter(node, parent) { * // ... * *}exit(node, parent) { * // ... *}, *}, *}); * /function traverser(ast, visitor) {
    function traverseArray(array, parent) {
        array.forEach(child => {
            traverseNode(child, parent);
        });
    }
    function traverseNode(node, parent) {
        let methods = visitor[node.type];

        if(methods && methods.enter){
            methods.enter(node,parent);
        }
        switch(node.type){
            case 'Program':
                traverseArray(node.body,node);
                break;
            case 'CallExpression':
                traverseArray(node.params, node);
                break;
            case 'NumberLiteral':
            case 'StringLiteral':
                break;
            default:
                throw new TypeError(node.type);                
        }
        if(methods && methods.exit) { methods.exit(node,parent); } } traverseNode(ast, null); } / * * * = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = * ⁽ (◍ ˃ ̵ ͈ ̑ ᴗ ˂ ̵ ͈ ̑) ⁽ * THE TRANSFORMER! !!!!! * = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = * / / * * * next Ast transformations. Will have been built the Ast tree through the visitor into a new Ast abstract syntax tree * * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - * Original AST | Transformed AST * ---------------------------------------------------------------------------- * { | { *type: 'Program'. |type: 'Program',
 *     body: [{                       |     body: [{
 *       type: 'CallExpression'. |type: 'ExpressionStatement',
 *       name: 'add',                 |       expression: {
 *       params: [{                   |         type: 'CallExpression',
 *         type: 'NumberLiteral',     |         callee: {
 *         value: '2'                 |           type: 'Identifier',
 *       }, {                         |           name: 'add'
 *         type: 'CallExpression',    |         },
 *         name: 'subtract',          |         arguments: [{
 *         params: [{                 |           type: 'NumberLiteral',
 *           type: 'NumberLiteral',   |           value: '2'
 *           value: '4'|}, {*}, {|type: 'CallExpression',
 *           type: 'NumberLiteral',   |           callee: {
 *           value: '2'               |             type: 'Identifier',
 *         }]                         |             name: 'subtract'
 *       }]                           |           },
 *     }]                             |           arguments: [{
 *   }                                |             type: 'NumberLiteral',
 *                                    |             value: '4'* -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - |}, {* |type: 'NumberLiteral',
 *                                    |             value: '2'* | }] * (sorry the other one is longer.) | } * | } * | }] * | } * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - * / / / this method receives the class lisp abstract syntax tree, into a class c ast treefunctionTransformer (AST) {// Create a new AST nodelet newAst = {
        type: 'Program', body: []} // use the body of the newAst tree as the _context attribute of the old ast tree._context = newast.body; Traverser (ast,{NumberLiteral:{enter(node, parent) {parent._context.push({type: 'NumberLiteral',
                    value:node.value
                })
            }
        },
        StringLiteral:{
            enter(node, parent){
                parent._context.push({
                    type:'StringLiteral',
                    value: node.value
                })
            }
        },
        CallExpression:{
            enter(node,parent){
                let expression = {
                    type: 'CallExpression',
                    callee: {
                      type: 'Identifier', name: node.name, }, arguments: [], }; // Next, we will define a context on the original CallExpression node // that references the parameters of expression in order to set the parameters. node._context = expression.arguments; // Check whether the parent is CallExpresssion, if not execute the following codeif(parent.type ! = ='CallExpression') {// Wrap 'CallExpression' with the 'expression Statement' node // The reason for doing this transformation is that the calling expression is ultimately a statement expression = {type: 'ExpressionStatement', expression: expression }; } parent._context.push(expression); }}})return newAst;
}
Copy the code

Code generation

/** * Here begins the final step: code generation */function codeGenerator(node) {
    switch (node.type) {
        case 'Program':
            return node.body.map(codeGenerator)
                .join('\n')
        case 'ExpressionStatement':
            return (
                codeGenerator(node.expression)
                + '; '
            );
        case 'CallExpression':
            return (
                codeGenerator(node.callee) + 
                '(' +
                 node.arguments.map(codeGenerator)
                 .join(', ') +
                ') '
            );
        case 'Identifier':
            return node.name;
        case 'NumberLiteral':
            return node.value;
        case 'StringLiteral':
            return '"' + node.value + '"'; default: throw new TypeError(node.type); }}Copy the code

Component compiler

/** * finally create 'compiler' compiler function, Tokens => Parser => AST * 3. Ast => Transformer => newAst * 4. newAst => generator => output */function compiler(input) {
    let tokens = tokenizer(input);
    let ast = parser(tokens);
    let newAst = transformer(ast);
    let output = codeGenerator(newAst);
    return output;
}
Copy the code

conclusion

If you’re comfortable using other languages like Java, Go, Python, etc., try rewriting it. Of course, the content shared above only covers the main steps of the compiler, which is equivalent to a compiler’s Hello World, but the implementation of the code gives a more intuitive feeling. If you need to implement some compile-related features later on, this will help.