Why do YOU need to learn about AST? Because AST is so important, you may not know it, but it’s everywhere. To be more specific:

  1. The browserjsThe engine tojsThe first thing is parsingjsgenerateASTInterpretation execution followed, compilation optimized execution.
  2. webpack
  3. babel
  4. eslint
  5. prettier

.

There are too many examples to go over, but the principles behind these tools depend on the AST.

These tools all involve a process:

Disassemble the code -> generate the AST-> traverse the AST and modify it -> regenerate the code.

  • The code is just a string, and making changes to the code requires fine-grained splitting of the string, that is, iterating through the code character by character, breaking each character into its corresponding pieces (token);
  • Generated after traversal is completetokenList, according totokenwithtokenThat is, the grammar rules, which, when combined, form a tree representation. This tree isAST.
  • getASTYou can then iterate through the tree and do some transformations (add, delete, change) on the nodes of the tree.
  • After finishing the operation, according to the modifiedASTPrint out the code.

So the AST is actually an intermediate.

What is theAST

After the above introduction, we already have a general understanding of AST. AST abstract syntax tree. It is an abstract representation of the syntactic structure of source code. It represents the syntactic structure of a programming language as a tree, with each node in the tree representing a structure in the source code. There are two steps to generate an AST:

Lexical analysis (lexical analyzer)

A lexical analyzer, also called a scanner, literally scans our code, iterating over every character, using predefined rules to convert each character into a token.

var answer = 6 * 7;
Copy the code

To generate the token:

[{"type": "Keyword"."value": "var"
    },
    {
        "type": "Identifier"."value": "answer"
    },
    {
        "type": "Punctuator"."value": "="
    },
    {
        "type": "Numeric"."value": "6"
    },
    {
        "type": "Punctuator"."value": "*"
    },
    {
        "type": "Numeric"."value": "Seven"
    },
    {
        "type": "Punctuator"."value": ";"}]Copy the code
  • Grammatical analysis (Syntax analyzer)

Syntax analysis is to iterate the token list and associate the tokens according to the syntax rules to form a tree structure, which is the AST. So the AST represents the syntactic structure of the source code, and each node in the tree represents a structure in the source code.

The AST generated by the above example:

{
  "type": "Program"."body": [{"type": "VariableDeclaration"."declarations": [{"type": "VariableDeclarator"."id": {
            "type": "Identifier"."name": "answer"
          },
          "init": {
            "type": "BinaryExpression"."operator": "*"."left": {
              "type": "Literal"."value": 6."raw": "6"
            },
            "right": {
              "type": "Literal"."value": 7."raw": "Seven"}}}]."kind": "var"}]."sourceType": "script"
}
Copy the code

Implement a simpleJSThe compiler

With that in mind, let’s take a look at how to implement a simple JS compiler. Don’t be afraid, step by step, the goal is to achieve a simple JS compiler, the most important is to understand the principle. The main function of this compiler is to convert ES6 syntax into ES5 syntax. Sound familiar? Exactly what we used to do with Babel. For simplicity, our compiler will mainly:

let name = "Zhang";
Copy the code

Converted to

var name = "Zhang";
Copy the code

Compilers typically take three steps:

  1. Parsing(Parsing) : Responsible for parsing and generating codeAST
  2. Transformation(Transformation) : Add, delete and modify the code according to the AST.
  3. Code Generation(Code generation) : based on the transformationASTRegenerate the new code.

So now our goal is clear, to achieve these three steps. now

Parsing(analysis)

Parsing phase We have discussed the two phases of lexical parsing and syntax parsing. Here we recommend esprima, an online parsing tool, to check the generated tokens and AST.

Lexical analysis

In the lexical analysis stage, we need to generate the token list:

[{"type": "Keyword"."value": "let"
    },
    {
        "type": "Identifier"."value": "name"
    },
    {
        "type": "Punctuator"."value": "="
    },
    {
        "type": "String"."value": "\" zhang SAN \ ""
    },
    {
        "type": "Punctuator"."value": ";"}]Copy the code

The idea behind this step is to iterate through the code string and then categorize the string (generating tokens).

Start by defining some type variables that will be used later:

// constants.js
const TokenTypes = {
  Keyword: "Keyword".Identifier: "Identifier".Punctuator: "Punctuator".String: "String"
}

const AST_Types = {
  Literal: "Literal".Identifier: "Identifier".AssignmentExpression: "AssignmentExpression".VariableDeclarator: "VariableDeclarator".VariableDeclaration: "VariableDeclaration".Program: "Program"
}

module.exports = {
  TokenTypes,
  AST_Types
}
Copy the code

The implementation idea is as follows:

// tokenizer.js
const tokens = require("./constants")
const KEYWORD = /let/ // Match the keyword
const PUNCTUATOR = / [\ =;] / // match "=", ";"
const WHITESPACE = /\s/ // Matches Spaces
const LETTERS = /[a-z]/i // Match the character
const {TokenTypes } = tokens

function tokenizer(input) {
  const tokens = [] // The token list stores the tokens and eventually returns them
  let current = 0 // Where does the tag traverse to the string

  // Loop through the code string with a while loop until the entire string is iterated
  while (current < input.length) {
    let char = input[current] // hold the current iterated character

    // *************** handles the keyword and variable name ***************
    if (LETTERS.test(char)) {
      let value = ' '
      // Run a loop through all the letters and store them in value.
      while (LETTERS.test(char)) {
        value += char
        char = input[++current]
      }
      
      if(KEYWORD.test(value)) { // Determine if the current string is a keyword
        // enter the keyword
        tokens.push({
           type: TokenTypes.Keyword,
           value: value
        })
      } else { // Otherwise, the variable name
         // Enter the variable name
         tokens.push({
           type: TokenTypes.Identifier,
           value: value
         })
       }
       // Enter the next loop
       continue
    }

    // *************** check for symbols, "=", ";" * * * * * * * * * * * * * * *
    if (PUNCTUATOR.test(char)) {
      const punctuators = char // Create a variable to hold the matching symbol
      current++
      // enter the symbol
      tokens.push({
        type: TokenTypes.Punctuator,
        value: punctuators
      })
      // Enter the next loop
      continue
    }

    // *************** handle Spaces and skip ***************
    if (WHITESPACE.test(char)) {
      current++
      continue
    }
    
    // *************** processes the string ***************
    if (char === '"') {
      let value = ' '
      char = input[++current] // Omit the opening quotation marks
      // until the next quote is encountered
      while(char ! = ='"') {
        value += char
        char = input[++current]
      }
      char = input[++current] // Ignore closing quotes
      tokens.push({ type: TokenTypes.String, value: '"'+value+'"' })
      continue
    }
    / / * * * * * * * * * * * * * * * if you do not satisfy the current matching rules throw an error * * * * * * * * * * * * * * *
    throw new TypeError('Unknown' + char)
  }
  return tokens
}

module.exports = tokenizer

Copy the code

A tokenizer function is used to generate tokens. Because it is a simple implementation, it only determines the characters that need to be processed. If you want to implement some more complex parsing, you can enrich the above matching parsing logic.

Syntax parsing

In the syntax phase we need to use tokens to generate the AST.

{
  "type": "Program"."body": [{"type": "VariableDeclaration"."declarations": [{"type": "VariableDeclarator"."id": {
            "type": "Identifier"."name": "name"
          },
          "init": {
            "type": "Literal"."value": "Zhang"."raw": "\" zhang SAN \ ""}}]."kind": "let"}]."sourceType": "script"
}
Copy the code

The implementation idea is as follows:

const tokens = require("./constants")
const {TokenTypes, AST_Types } = tokens

// Accept tokens as an argument
function parser(tokens) {
  // Record where tokens are traversed currently
  let current = 0

  // Define the walk function by iterating through the token node
  // For different types of nodes, the corresponding processing method is also different
  function walk() {
    // Start parsing from the current token
    const token = tokens[current]

    // *************** check whether the string is ***************
    if (token.type === TokenTypes.String) {
      // If current is autoincrement.
      current++;
      // Then return a new AST node
      return {
        type: AST_Types.Literal,
        value: JSON.parse(token.value),
        row: token.value
      }
    }
   
    // *************** check if the variable name is ***************
    if (token.type === TokenTypes.Identifier) {
      // If so, current increases.
      current++;
      // Then return a new AST node
      return {
        type: AST_Types.Identifier,
        name: token.value,
      };
    }

    // *************** check if it is the operator keyword ***************
    if (token.type === TokenTypes.Punctuator) {
      // If so, current increases.
      current++;
      // Check if it is =
      if(/ \ = /.test(token.value)){
        return {
          type: AST_Types.AssignmentExpression,
          operator: token.value
        }
      }else{ // ignore; Number is not included in the AST
        return}}// *************** check if the keyword is ***************
    if ( token.type === TokenTypes.Keyword) {
      var value = token.value
      current++; // Now ++, because the name of the variable immediately following the declaration, the next step is to return the variable name
      const variable = walk() // Get the defined variable
      current++ // the next one should be the = sign, so we'll just skip current++ and not count it in the AST
      const rightVar = walk()
      current++ // Next should be; I'm just going to skip current++ here, not count in the AST
      
      // Define a declaration
      const declaration = {
        type: AST_Types.VariableDeclarator,
        id: variable, // Define a variable
        init: rightVar // The value assigned
      }
      // Define the node to return
      return {
        type: AST_Types.VariableDeclaration,
        declarations: [declaration],
        kind: value,
      };
    }
    // *************** throws an error when encountering a node of unknown type. * * * * * * * * * * * * * * *
    throw new TypeError(token.type);
  }
  // Create the AST and define the root node to be a node of type 'Program'.
  const ast = {
    type: AST_Types.Program,
    body: [].sourceType: "script"
  };

  // start the walk function and place the node in ast.body.
  while (current < tokens.length) {
    ast.body.push(walk());
  }
  return ast;
}

module.exports = parser
Copy the code

The core point is the walk function, which makes recursive calls to the assignment statement. At the same time, we ignore some token processing, such as “=”, “;” . We also don’t see “=”, “;” in the AST generated by Esprima. .

So far we have implemented a very simple AST generation tool. But it takes a lot of judgment and special treatment to make it work in real development, but that’s the general idea. We focus on understanding principles and ideas.

Transformation

After the AST is generated, we can add, delete, modify and check the AST.

Remember our goal is to get

let name = "Zhang";
Copy the code

Converted to

var name = "Zhang";
Copy the code

So we need to replace let with var, which means we need to convert the AST to the following form:

{
  "type": "Program"."body": [{"type": "VariableDeclaration"."declarations": [{"type": "VariableDeclarator"."id": {
            "type": "Identifier"."name": "name"
          },
          "init": {
            "type": "Literal"."value": "Zhang"."raw": "\" zhang SAN \ ""}}]."kind": "var"}]."sourceType": "script"
}
Copy the code

traverser(Traverser)

First we need a traverser function that can traverse the AST. This function has several key points:

  • astIs a tree structure, using depth first traversal
  • traverserTake two parametersast.visitor
  • traverserResponsible for the traverseAST.visitorContains methods that accept different types of nodes and their parents, for example:
    const visitor = {
       VariableDeclaration(node, parent){}};Copy the code
  • These methods add, delete and modify matched nodes

The implementation is as follows:

// traverser.js
const constants = require("./constants")
const { AST_Types } = constants
function traverser(ast, visitor) {
  // traverseNode is called from each node in the tree
  function traverseArray(array, parent) {
    array.forEach(function(child) {
      traverseNode(child, parent);
    });
  }

  // Process the ast node functions, using the visitor defined conversion function for the conversion
  function traverseNode(node, parent) {
    // First look at the visitor to see if there is a handler for the type.
    const method = visitor[node.type]
    // If so, call the handler method
    if (method) {
      method(node, parent)
    }

    // Each node of a different type is treated separately below.
    switch (node.type) {
      case AST_Types.Program: // The top level Program starts, the body is an array, so call traverseArray
        traverseArray(node.body, node) 
        break
      // If no conversion is required, exit directly
      case AST_Types.VariableDeclaration:
      case AST_Types.VariableDeclarator:
      case AST_Types.AssignmentExpression:
      case AST_Types.Identifier:
      case AST_Types.Literal:
        break
      // Also, if the current node cannot be recognized, an error is thrown.
      default:
        throw new TypeError(node.type)
    }
  }
  The root node has no parent, so null is passed here
  traverseNode(ast, null)}module.exports = traverser
Copy the code

Transformer

Transformer, which calls the traverser to generate a new AST. We need to define the traverser visitor parameter to enable the traverser visitor to add, remove, or change the AST node:

// transformer.js
const traverser = require("./traverser")
const constants = require("./constants")
const { AST_Types } = constants

// Transformer accepts the AST as a parameter
function transformer(ast) {
  // newAst is used to store newAst
  const newAst = {
    type: AST_Types.Program,
    body: [].sourceType: "script"
  };
  // For simplicity, newast. body is directly mounted to the _context property of the AST.
  // After processing a node in this way, _context can be obtained through the parent of the current node, and then push
  // Save the current modified node
  ast._context = newAst.body 
  // Pass AST and visitor into traverser
  traverser(ast, {
    // Convert let to var
    VariableDeclaration: function(node, parent) {
      const variableDeclaration = {
        type: AST_Types.VariableDeclaration,
        declarations: node.declarations,
        kind: "var"
      };
      // Put the new VariableDeclaration into the context.
      parent._context.push(variableDeclaration)
    }
  });
  // Finally return the new AST
  return newAst
}


module.exports = transformer
Copy the code

Code Generation

Finally, it’s time to regenerate the code based on the new AST.

const constants = require("./constants")
const { AST_Types } = constants

function codeGenerator(node) {
  // Handle different types of nodes
  switch (node.type) {
    // If it is a Program node, loop through each node in its body property and add a newline symbol
    case AST_Types.Program:
      return node.body.map(codeGenerator)
        .join('\n')
    
    case AST_Types.VariableDeclaration: // Handle variable declarations
      return (
        node.kind + ' ' + node.declarations.map(codeGenerator)
      )
    case AST_Types.VariableDeclarator: // Process declared variables and values
      return (
        codeGenerator(node.id) + '=' + 
        codeGenerator(node.init)
      )
    case AST_Types.Identifier: // The variable name is returned directly
      return node.name
    
    case AST_Types.Literal: // String with "",; return
      return '"'+node.value+'"'+";"

    // If we cannot identify the node, an error is thrown.
    default:
      throw new TypeError(node.type)
  }
}

module.exports = codeGenerator
Copy the code

This step is relatively simple, generating a new AST by recursively iterating through it and stitching the corresponding pieces together.

Finally, let’s look at the compiler we wrote in action:

const transformer = require("./transformer")
const tokenizer = require("./tokenizer")
const parser = require("./parser")
const codeGenerator = require("./codeGenerator")

const tokens = tokenizer('let name = "张三";')
const ast = parser(tokens)
const newAst = transformer(ast)
const newCode = codeGenerator(newAst)


console.log(newCode) // var name = "";
Copy the code

The complete code is here, if you are interested, you can refer to it.