The original

In order to understand why Babel can achieve the effect of translation, and AST generation. We need to implement a Compiler manually. This article is a typescript implementation of tiny-Compiler.

When we enter a code block like this

	const input = '(add 2 (subtract 4 2))'
Copy the code

It should be converted to

	const output = 'add(2, subtract(4, 2)); '
Copy the code

Implementation steps

  1. Make tokens. In this case, tokens serve only as individual identifiers and do not have semantics.
  2. Analyze the tokens into the AST structure, where the AST is semantic but does not constitute a context.
  3. Add context relationships to ast.
    • Walk through the old syntax tree. Treat each type separately.

Lexical analysis

Lexical analysis is not complicated, requiring traversal of each byte.

Here we use the concept of a pointer. We move the pointer to get the next byte, assign type to the byte and push it into the tokens array.

To review the difference between a++ and ++a:

  • b = a++A is assigned to B and then incremented by 1.
  • b = ++aA increases by 1 and then assigns a value to B.

Using continue in the for loop goes straight to the next loop, skipping the loop.

const WIHITE_SPACE_REG = /\s/;
const NUMBER_REG = / [0-9];
const LETTERS = /[a-z]/i;

export default function tokenizer(input: string) {
  let current = 0;
  let token: tokensType = [];

  while (current < input.length) {
    let char = input[current];

    if (char === "(") {
      token.push({
        type: "paren".value: "("}); current++;continue;
    }

    if (char === ")") {
      token.push({
        type: "paren".value: ")"}); current++;continue;
    }

    if (WIHITE_SPACE_REG.test(char)) {
      current++;
      continue;
    }

    if (NUMBER_REG.test(char)) {
      let value = "";
      while (NUMBER_REG.test(char)) {
        value = value + char;
        char = input[++current];
      }
      token.push({
        type: "number",
        value,
      });
      continue;
    }

    if (char === '"' || char === "'") {
      let value = "";
      char = input[++current];
      while(char ! = ='"'&& char ! = ="'") {
        value = value + char;
        char = input[++current];
      }
      char = input[++current];
      token.push({
        type: "string",
        value,
      });
      continue;
    }

    if (LETTERS.test(char)) {
      let value = "";
      while (LETTERS.test(char)) {
        value = value + char;
        char = input[++current];
      }
      token.push({
        type: "name",
        value,
      });
      continue;
    }

    throw new TypeError("I dont know what this character is: " + char);
  }

  return token;
}

Copy the code

We get an array that looks like this:

[{type: "paren".value: "(" },
  { type: "string".value: "add" },
  { type: "number".value: "2" },
  { type: "paren".value: "(" },
  { type: "name".value: "subtract" },
  { type: "number".value: "4" },
  { type: "number".value: "2" },
  { type: "paren".value: ")" },
  { type: "paren".value: ")"},]Copy the code

We’re going to write a type to normalize this array

export type tokensType = Array<tokenType> // tokenType[]

export type tokenType = {
  type: "paren" | "number" | "string" | "name";
  value: string;
}
Copy the code

Into the AST

This step we parse, which is the process of converting tokens into a syntax tree.

If you’re familiar with AST trees, you’ll know that types in syntax trees are represented by NumberLiteral, StringLiteral. And each of these different types will have different attributes.

There are several attributes we need to deal with here:

  • NumberLiteral: indicates the literal value.
  • StringLiteral: indicates a literal.
  • CallExpression: method. You need a name and parameters.
  • Program: This is the root node we specify, with a body of array type. Inside the body are our many nodes.

Here we need to use the same method as in the previous chapter, single-pointer traversal

import type { tokensType } from ".. /types";

export default function parser(tokens: tokensType) {
  // Define a pointer
  let current = 0;
  // Define the function. When we see params we need to recursively evaluate them.
  function walk() {
    let token = tokens[current];

    if (token.type === "number") {
      current++;

      return {
        type: "NumberLiteral".value: token.value,
      } as Ast.NumberLiteral;
    }

    if (token.type === "string") {
      current++;

      return {
        type: "StringLiteral".value: token.value,
      } as Ast.StringLiteral;
    }
	
     // When it comes to left parentheses, it encounters CallExpression. The next node is its name.
     // Until the next close parenthesis is encountered, there is params inside.
    if (token.type == "paren" && token.value === "(") {
      token = tokens[++current];

      let node: Ast.CallExpression = {
        type: "CallExpression".name: token.value,
        params: [],}; token = tokens[++current];while( token.type ! = ="paren" ||
        (token.type === "paren"&& token.value ! = =")")
      ) {
        node.params.push(walk());
        token = tokens[current];
      }
      current++;
      return node;
    }

    throw new TypeError(token.type);
  }

  let ast: Ast.Program = {
    type: "Program".body: [],};while (current < tokens.length) {
    ast.body.push(walk());
  }

  return ast;
}


Copy the code

There are several AST types that appear here that we need to define. To avoid confusion with the new syntax tree type, we use the ts namespace

namespace Ast {
  export type CallExpression = {
    type: "CallExpression".name: string.params: Array<Types>,
  }
  
  export type NumberLiteral = {
    type: "NumberLiteral".value: string
  }
  
  export type StringLiteral = {
    type: "StringLiteral".value: string
  }
  
  export type Program = {
    type: "Program".body: Array<Types>,
  }  
}
Copy the code

At this point we convert the tiled token structure into an AST tree structure.

By the way, I was wondering, why is ast a tree? Later it became clear that the grammatical structure of it was hard not to be a tree.

{
      type: 'Program'.body: [{type: 'StringLiteral'.value: 'string'}, {type: 'CallExpression'.name: 'add'.params: [{type: 'NumberLiteral'.value: '2'
          }, 
          {
          type: 'CallExpression'.name: 'subtract'.params: [{type: 'NumberLiteral'.value: '4'
            }, 
            {
              type: 'NumberLiteral'.value: '2'}]}]}Copy the code

Write a method to enhance the syntax tree

Our current syntax tree is not complete, we need to write a method to enter each layer and return the syntax tree after the operation.

import type { Visitor } from ".. /types";

/** * traverser the traverser method is used to traverse the old AST and execute the enter method */ passed by visitor according to the corresponding type
export default function traverser(ast: Ast.Program, visitor: Visitor) {
  // Execute the traversNode method of the array array
  function traverseArray(array: Array<Ast.Types>, parent: Ast.ParentTypes) {
    array.forEach((child) = > {
      traverseNode(child, parent);
    });
  }

  function traverseNode(node: Ast.TypesWithProgram, parent: Ast.ParentTypes) {
    // method Specifies the method that gets the type of visitor
    let methods = visitor[node.type];

    // Execute methods
    if (methods && methods.enter) {
      methods.enter(node as any, parent);
    }
    switch (node.type) {
      case "Program":
        traverseArray(node.body, node);
        break;
      case "CallExpression":
        traverseArray(node.params, node);
        break;
      case "NumberLiteral":
      case "StringLiteral":
        break;
    }

    // if(methods && methods.exit) {
    // methods.exit(node as any, parent)
    // }
  }

  traverseNode(ast, null);

  return ast;
}

Copy the code

There are three nested methods:

  • traverser: entry, incoming isProgramRoot node, andvisitorMethods. The method should contain the corresponding AST key object, and the property should have oneenterMethods. Method inputs have an AST node of type key and a parent node that records the upper object of the node. (This is abstract, we will see the type declaration later.)
  • traverseNode: Method of operating a node. The main task is to callenterIf it isProgramType orCallExpressionTypes need to be iterated down.
  • TraverseArray: Traverses the lower layer.

The thing to watch out for here is the type declaration.

First the traverser passes the Program node, of the type we declare. There is also a visitor object:

// Visitor attributes are optional, use Partial wraps.
export typeVisitor= Partial<{ Program? : BindExitAndEnter<Ast.Program>, CallExpression? : BindExitAndEnter<Ast.CallExpression>, NumberLiteral? : BindExitAndEnter<Ast.NumberLiteral>, StringLiteral? : BindExitAndEnter<Ast.StringLiteral>, }>// parent may be null. Program does not have a parent node.
export type ParentTypes = Program | CallExpression | null

// Type method. Pass in a type and return an object type. This includes the Enter and exit methods. But actually in this case we're not going to use exit.
interfaceBindExitAndEnter<T>{ enter? :(node: T, parent: Ast.ParentTypes) = > voidexit? :(node: T, parent: Ast.ParentTypes) = > void 
} 
Copy the code

Call methods to enhance the syntax tree

import traverser from "./traverser";

export default function transformer(ast: Ast.Program) {
  // as the root node of the new AST
  let newAst: Ast.Program = {
    type: "Program".body: [],};// Add a context attribute to the old AST tree that contains its subset.
  letinstance = { ... ast,_context: newAst.body
  };

  traverser(instance, {
    NumberLiteral: {
      enter(node, parent: Ast.ParentTypes) {
        // Pass yourself in the parent contextparent? ._context? .push({type: "NumberLiteral".value: node.value, }); }},StringLiteral: {
      enter(node, parent){ parent? ._context? .push({type: "StringLiteral".value: node.value, }); }},CallExpression: {
      enter(node, parent) {
        // Add callee record name/type

        let expression: any = {
          type: "CallExpression".callee: {
            type: "Identifier".name: node.name,
          },
          arguments: [],}; node._context = expression.arguments;if(parent? .type ! = ="CallExpression") {
          // Add type to create scope

          expression = {
            type: "ExpressionStatement", expression, }; } parent? ._context? .push(expression); ,}}});return {
    ...newAst,
    body: instance._context, // Finally, everything recorded in the parent context exists in instance._context
  };
}

Copy the code

Here we add a _context as a pointer to the child node, thread the needle, record all, and finally assign to the body of the new AST.

At this point we get a more robust syntax tree:

{
      type: 'Program'.body: [{type: 'StringLiteral'.value: '1'
        },
        {
          type: 'ExpressionStatement'.expression: {
            type: 'CallExpression'.callee: {
              type: 'Identifier'.name: 'add'
            },
            arguments: [{type: 'NumberLiteral'.value: '2'
              }, 
              {
              type: 'CallExpression'.callee: {
                type: 'Identifier'.name: 'subtract'
              },
              arguments: [{type: 'NumberLiteral'.value: '4'
                }, 
                {
                  type: 'NumberLiteral'.value: '2'}]}]}}Copy the code

We even have scope! A good start. But the AST type has also changed. So we need to define new types:

namespace newAST {
  export type Program = {
    type: "Program".body: withoutProgram[]
  }

  export type ExpressionStatement = {
    type: "ExpressionStatement".expression: CallExpression
  }

  export type CallExpression = {
    type: "CallExpression".callee: Identifier,
    arguments: withoutProgram[]
  }

  export type NumberLiteral = {
    type: "NumberLiteral".value: string
  }

  export type StringLiteral = {
    type: "StringLiteral".value: string
  }

  type Identifier = {
    type: "Identifier".name: string
  }

  export type withoutProgram = ExpressionStatement | CallExpression | NumberLiteral | StringLiteral | Identifier

  export type all = withoutProgram | Program
}
Copy the code

Convert the new syntax tree to code

With the new syntax tree, we can finally turn code into human-readable code!

import transformer from "./transformer";
import tokenizer from "./tokenizer";
import parser from "./parser";
import codeGenerator from "./codeGenerator";

export default function compiler(input: string) {
  let tokens = tokenizer(input);
  let ast = parser(tokens);
  let newAST = transformer(ast);
  let output = codeGenerator(newAST as newAST.all);

  return output;
}
Copy the code

Now when we type in

"'abcd'(add 2 (subtract 4 2))"
Copy the code

Will get

'" abcd "\ nadd bring them any closer (2, subtract (4, 2)); ' // \n is a newline
Copy the code

test

Step by step testing is essential for such tools. But because the test is too long, I will not put it here

You can check out the warehouse

reference

The super – tiny – a compiler) : github.com/jamiebuilds…

This article code repository: github.com/linbuxiao/t…

Thank you!