An overview of the

This paper is divided into three parts

Part 1: What is AST and why do we use it

Part two: Implement a small compiler and understand the role the AST plays

Part 3: Introduce how AST is used in Babel

The second part is the most important, and I recommend downloading sample code to get a feel for the process


Babel is well known among front-end developers for its role in converting ES6, ES7, or draft syntax into ES5 syntax,

This way, there are no compatibility issues in the browser, and developers can try out the latest syntax,

Do you have any questions about how Babel achieves this capability?

.

Think about 5 s

. done ….

Babel implements this transformation based on an AST (Abstract Syntax tree) technology, which is referred to as AST for short

It is not only possible to convert ES6 to ES5 based on AST, but the following are current applications

  • Syntax validation (ESLint)
  • Code formatting
  • CSS pretreatment
  • Programming language conversion (e.g. Lisp to JavaScript)
  • React Native, Taro, Uni-app, etc.

It’s all part of front-end engineering, but it’s not the whole story.

You can do anything with it (AST) for a particular type of code file

What is AST?

Let’s look at the definition first

It is a hierarchical program representation that presents source code structure according to the grammar of a Buta collection of facts cannot be called science any more than a pile of facts. Buta collection of facts cannot be called science any more than a pile of facts. Buta collection of facts cannot be called science any more than a pile of facts. Each AST node corresponds to a source nodeCopy the code

Now, you’re probably as confused as I am, and this description isn’t enough to form a concrete understanding of it in our minds

Let’s start with a small example, if we have a piece of code like this

function calc(a, b) { return a + b; }
Copy the code

Since we’re just doing addition, we want to change the name of the function to sum, what can we do?

I’m sure some of you thought of replacing it with the string replace method.

Ok, let’s create a transform method to implement this capability

const code = ` function calc(a, b) { return a + b; } `;


function transform (originalCode) {
  if (typeoforiginalCode ! = ='string') {
    return originalCode;
  }
  return originalCode.replace(/calc/.'add');
}

console.log(transform(code));
/ / output
// function add(a, b) {
// return a + b;
// }
Copy the code

It seems pretty simple, so let’s change the example again, and now it looks like this

function plus (args = []) {
  return args.reduce((total, item) = > total + item, 0);
}
function subtract (args = []) {
  return args.reduce((total, item) = > total - item, 0);
}
function multiply (args) {
  return args.reduce((total, item) = > total * item, 0);
}
function divide (args) {
  return args.reduce((total, item) = > total / item, 0);
}
function calc(type, ... args) {
  switch (type) {
    case 'plus':
      return plus(args);
    case 'subtract':
      return subtract(args)
    case 'multiply':
      return multiply(args)
    case 'divide':
      return divide(args)
  }
}
Copy the code

If we want to add a default output to console.log, or a default return to the switch statement, it might be a bit of a hassle

The actual scene of programming language is more complex, and using AST to deal with it is the mainstream way at present. AST is a virtual node tree describing code

For example, the React virtual DOM is used to describe real DOM nodes in HTML.

This abstract tree structure makes it easy for us to process it to get the desired result.

That little example at the beginning, how does AST represent that

Function calc(a, b) {return a + b; }Copy the code
// AST represents {"type": "Program", "start": 0, "end": 38, "sourceType": "module", "interpreter": null, "body": [{"type": "FunctionDeclaration", "start": 0, "end": 38, "id": { "type": "Identifier", "start": 9, "end": 13, "loc": { "start": { "line": 1, "column": 9 }, "end": { "line": 1, "column": 13 }, "identifierName": "calc" }, "name": "calc" }, "generator": false, "async": false, ... }Copy the code

As far as Program and Identifier are concerned, you don’t need to worry about them for a moment, and we will explain them later. Now, as long as you know that AST describes the complete code structure, you can understand it after reading the code implementation of the compiler in Part 2

The parser here is @babel/ Parser, and there are a lot of parsers out there, and Babel is one of the most widely used tools, and we’ll talk about the others later

astexplorer.net/

Is a visual AST presentation site where you can paste code to see how an AST is made

Github.com/fkling/aste…

You can check out the various parser and Transformer tool libraries currently in mainstream use here

Compiler composition and implementation

Compiler composition

Having seen the AST above, let’s take a closer look at the role AST plays in implementing a compiler with over 200 lines of code

The workflow of most compilers can be roughly divided into three steps:

Parsing => 3. Parsing => 3. Parsing => 3. Code GenerationCopy the code

Parsing

Parsing is the process of converting source code into an AST, which is divided into two steps, lexical analysis and syntax analysis

Lexical analysis

Lexical analysis requires traversing the source code from beginning to end and breaking it down into the smallest semantically expressive units of token for later generation of AST.

What are semantically minimal units and tokens, as illustrated by the initial example

function calc(a, b) {
	return a + b;
}
Copy the code

Calc is a minimal word, and if you break it down into c, A, L, and C, it’s just an ordinary letter, and we don’t know what they’re related to each other,

Similarly, function and return are the smallest semantic words that cannot be split. What is token

Token is the data in {type: ‘XXX ‘, value:’ XXX ‘} format, which facilitates the generation of AST

[{type: 'string'.value: 'function' },
  { type: 'name'.value: 'calc'     },
  { type: 'paren'.value: '('        },
  { type: 'string'.value: 'a'        },
  { type: 'string'.value: 'b'        },
  { type: 'paren'.value: ') '        },
  { type: 'paren'.value: '{'        },
  { type: 'string'.value: 'return'   },
  { type: 'string'.value: 'a'        },
  { type: 'operator'.value: '+'        },
  { type: 'string'.value: 'b'        },
  { type: 'paren'.value: '} '},]Copy the code

Syntax analysis

By analyzing the tokens generated by lexical analysis, generate AST. The Program type is the top-level node

{
    "type": "Program",
    "start": 0,
    "end": 38,
    "sourceType": "module",
    "interpreter": null,
    "body": [
      {
        "type": "FunctionDeclaration",
        "start": 0,
        "end": 38,
        "id": {
          "type": "Identifier",
          "start": 9,
          "end": 13,
          "loc": {
            "start": {
              "line": 1,
              "column": 9
            },
            "end": {
              "line": 1,
              "column": 13
            },
            "identifierName": "calc"
          },
          "name": "calc"
        },
        "generator": false,
        "async": false,
  ...
}
Copy the code

2. Transformation

The next phase is the transformation phase, which basically takes the AST as a parameter, walks through the AST, makes the changes we want, and generates a new AST

3. Code Generation

Finally, there is the code generation phase, where we iterate through the new AST of the transformation phase to finally get the code structure we want to output

Function calc(a, b) {return a + b; Function add(a, b) {return a + b; }Copy the code

Compiler implementation

This is the highlight of this article. We will implement a compiler from scratch and look at the implementation details of each stage

Let’s implement a cross-language converter that compiles Lisp function calls to JavaScript function calls

Lisp JavaScript
2 + 2 (add 2 2) add(2, 2)
2 + (5-3) (add 2 (subtract 5 3)) add(2, subtract(5, 3))

It doesn’t matter if you don’t know Lisp, we’re going to focus on the conversion process

This correspondence is important, and everything we’re going to do is based on this correspondence in the table

As mentioned earlier, the implementation of a compiler has three stages:

Parsing => 3. Parsing => 3. Parsing => 3. Code GenerationCopy the code

We define the corresponding function according to the stage

1, parsing,

Tokenizer: Lexical analysis

Parser: Parses

2, conversion,

Traverser: Provides a visitor interface to traverse the AST tree and modify to generate a new tree

Transformer: Invoke the traverser

3. Code generation

CodeGenerator: : Generates code

1, parsing,

Tokenizer: Lexical analysis

** (subtract 4 2)) ** The token generated from the above code would look like this: ** [* {type: 'paren', value: '(' }, * { type: 'name', value: 'add' }, * { type: 'number', value: '2' }, * { type: 'paren', value: '(' }, * { type: 'name', value: 'subtract' }, * { type: 'number', value: '4' }, * { type: 'number', value: '2' }, * { type: 'paren', value: ')' }, * { type: 'paren', value: ')' }, * ] * */


// We accept the source code and set two variables current and tokens
function tokenizer(input) {

  The current variable is used to indicate the current position of the traversal, like a virtual cursor
  let current = 0;

  // Tokens array is used to collect tokens
  let tokens = [];

  // We create a while loop to traverse the source code
  // Because the token length is different, the current may be increased several times in a loop
  while (current < input.length) {

    // Cache the current field in input
    let char = input[current];

    // The first case we need to detect is the opening of parentheses, which is later used by the function call CallExpression. but
    // Now we just need to worry about the characters.
    //
    // We check if there is an open parenthesis
    if (char === '(') {

      // If the character is open parenthesis, we push a new token with type paren and value (
      tokens.push({
        type: 'paren'.value: '('});// move the virtual cursor forward
      current++;

      // Then continue to the next loop
      continue;
    }

    // Check the closing parentheses, as before, add 1 token, add current, continue the next loop
    if (char === ') ') {
      tokens.push({
        type: 'paren'.value: ') '}); current++;continue;
    }

    // We now need to check for Spaces. This is interesting because Spaces exist to separate characters and are not really important for token storage, so we'll just skip it
    let WHITESPACE = /\s/;
    if (WHITESPACE.test(char)) {
      current++;
      continue;
    }

    // The token type is a number, which is different from the previous one because the number may be consecutive
    //
    // (add 123 456)
    / / ^ ^ ^ ^ ^ ^
    // Only two tokens are allowed
    //
    // So when we get to the first number, we do the following
    let NUMBERS = / [0-9];
    if (NUMBERS.test(char)) {

      // We create a value variable to store characters
      let value = ' ';

      // We loop back until the character is not a number
      while (NUMBERS.test(char)) {
        value += char;
        char = input[++current];
      }

      // Push 1 token to tokens
      tokens.push({ type: 'number', value });

      // Continue to the next loop
      continue;
    }

    // Our language also supports strings, a paragraph of text wrapped around"
    //
    // (concat "foo" "bar")
    // ^^^ ^^^ string tokens
    //
    // let's start checking quotes
    if (char === '"') {
      // Use value to collect string tokens
      let value = ' ';

      // We skip the double brackets themselves and start with the last 1 bit
      char = input[++current];

      // Iterate backwards until another one is found."
      while(char ! = ='"') {
        value += char;
        char = input[++current];
      }

      // Skip closed double quotes
      char = input[++current];

      // We add string token to tokens
      tokens.push({ type: 'string', value });

      continue;
    }

    // The last token is the name token, also a sequence of letters, which in Lisp is the name of a function
    //
    // (add 2 4)
    //    ^^^
    // Name token
    //
    let LETTERS = /[a-z]/i;
    if (LETTERS.test(char)) {
      let value = ' ';

      // We iterate over the loop letters and add them to the value variable
      while (LETTERS.test(char)) {
        value += char;
        char = input[++current];
      }

      // We add the name token and continue the loop
      tokens.push({ type: 'name', value });

      continue;
    }

    // Finally we check the syntax, if we do not match the result, throw an error, stop execution
    throw new TypeError('I dont know what this character is: ' + char);
  }

  // At the end of tokenizer, we return to tokens
  return tokens;
}
Copy the code

Parser: Parses

/** * We try to parse the tokens array into AST ** [{type: 'paren', value: '('},...  => { type: 'Program', body: [...] } * /

// We define the parser function to receive tokens from Tokenizer
function parser(tokens) {

  // Create the current variable to represent the current position of the virtual cursor
  let current = 0;

  // We create the root node of the AST. Program is the top node of the AST
  let ast = {
    type: 'Program'.body: [],};// We will encounter nested functions, so this time we will use recursion instead of the while loop
  function walk() {

    // Retrieve the token for the virtual cursor position
    let token = tokens[current];

    // We check if the token is of type number
    if (token.type === 'number') {

      // 如果是 number,current +1
      current++;

      // We return an AST node called NumberLiteral
      return {
        type: 'NumberLiteral'.value: token.value,
      };
    }

    // If it is a string token, we create a StringLiteral node like number
    if (token.type === 'string') {
      current++;

      return {
        type: 'StringLiteral'.value: token.value,
      };
    }

    // Next we look at CallExpressions, which start with an open parenthesis
    if (
      token.type === 'paren' &&
      token.value === '('
    ) {

      // current +1, skip the opening parenthesis, we don't care about it
      token = tokens[++current];

      // We create a base node with type CallExpression and set name to the token value
      // The function name is followed by the opening parentheses
      let node = {
        type: 'CallExpression'.name: token.value,
        params: [],};// current +1, we skip the function name and look directly at the expression
      token = tokens[++current];

      // To collect the Params of CallExpression, we query each token backwards until we encounter closing parentheses
      //
      // This is when recursion is needed. Instead of trying to analyze parameters directly that may have an infinite number of nested nodes, we use recursion.
      //
      To illustrate this concept, take our lisp code as an example. You can observe that the arguments to Add are a number and a nesting
      // The CallExpression, which in turn has its own parameters.
      //
      // (add 2 (subtract 4 2))
      //
      // You should notice that our tokens array has multiple closing brackets
      //
      / / /
      // { type: 'paren', value: '(' },
      // { type: 'name', value: 'add' },
      // { type: 'number', value: '2' },
      // { type: 'paren', value: '(' },
      // { type: 'name', value: 'subtract' },
      // { type: 'number', value: '4' },
      // { type: 'number', value: '2' },
      // {type: 'paren', value: ')'}, <<< close parentheses
      // {type: 'paren', value: ')'}, <<< close parentheses
      / /]
      //
      // We rely on the walk function to add current

      // So we create a while loop that continues through the token until we encounter a closing parenthesis
      while( (token.type ! = ='paren') ||
        (token.type === 'paren'&& token.value ! = =') ')) {// We call walk until we return a node and we push it to Node.params
        node.params.push(walk());
        token = tokens[current];
      }

      // Finally, we put current+1, skipping the closing parentheses
      current++;

      / / returns the node
      return node;
    }

    // If the token does not match, a type error is thrown
    throw new TypeError(token.type);
  }

  // Then we start the walk function and push nodes to ast.body
  //
  // The reason we do it in one loop is because CallExpression (function call expressions) can be multiple and unrelated
  //
  // (add 2 2)
  // (subtract 4 2)
  //
  while (current < tokens.length) {
    ast.body.push(walk());
  }

  // At the end of parser, we return the AST
  return ast;
}
Copy the code

2, conversion,

Traverser: Provides a visitor interface to traverse the AST tree and modify to generate a new tree

/** * Now we have the AST, and we can use the visitor pattern to access different nodes. * When we encounter a matching type, we call different methods on the visitor. * Visitor pattern is a design pattern that separates data operations from data operations. Traverse (ast, {* Program: {* enter(node, parent) {* //... * }, * exit(node, parent) { * // ... * }, * }, * * CallExpression: { * enter(node, parent) { * // ... * }, * exit(node, parent) { * // ... * }, * }, * * NumberLiteral: { * enter(node, parent) { * // ... * }, * exit(node, parent) { * // ... *}, *}, *}); * /


// We define the traverser function to receive the AST and a visitor, where we will define two functions
function traverser(ast, visitor) {

  // The traverseArray function will allow us to traverse the array, and we need to define a traverseNode function next
  function traverseArray(array, parent) {
    array.forEach(child= > {
      traverseNode(child, parent);
    });
  }

  // traverseNode takes node nodes and their parent nodes
  // We can pass both as arguments to our visitor method, i.e
  function traverseNode(node, parent) {

    // Initially, see if the node type has a corresponding visitor method
    let methods = visitor[node.type];

    // If there is an Enter method, we call it, passing two arguments
    if (methods && methods.enter) {
      methods.enter(node, parent);
    }

    // Next, we need to make a difference according to the current node type
    switch (node.type) {

      // To start, the top-level structure is Program, which has a property named body,
      // We use traverseArray recursively to process all child nodes
      case 'Program':
        traverseArray(node.body, node);
        break;

      // Next, we do the same with CallExpression, cycling through its params, the expression body below
      case 'CallExpression':
        traverseArray(node.params, node);
        break;

      NumberLiteral and StringLiteral, which have no children below them, so no extra processing is required
      case 'NumberLiteral':
      case 'StringLiteral':
        break;

      // If the above match is not triggered, a type error is thrown as before
      default:
        throw new TypeError(node.type);
    }

    If there is an exit method on the visitor, we execute it
    if(methods && methods.exit) { methods.exit(node, parent); }}// Finally we start the traverseNode function with AST, which is at the top level and has no parent node
  traverseNode(ast, null);
}
Copy the code

Transformer: Invoke the traverser

/** * Next we have transformer, we iterate through the AST node tree, processing through a method on the visitor, Generate new AST tree * * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - * source AST | AST * after conversion ---------------------------------------------------------------------------- * { | { * type: 'Program', | type: 'Program', * body: [{ | body: [{ * type: 'CallExpression', | type: 'ExpressionStatement', * name: 'add', | expression: { * params: [{ | type: 'CallExpression', * type: 'NumberLiteral', | callee: { * value: '2' | type: 'Identifier', * }, { | name: 'add' * type: 'CallExpression', | }, * name: 'subtract', | arguments: [{ * params: [{ | type: 'NumberLiteral', * type: 'NumberLiteral', | value: '2' * value: '4' | }, { * }, { | type: 'CallExpression', * type: 'NumberLiteral', | callee: { * value: '2' | type: 'Identifier', * }] | name: 'subtract' * }] | }, * }] | arguments: [{ * } | type: 'NumberLiteral', * | value: '4' * ---------------------------------- | }, { * | type: 'NumberLiteral', * | value: | '2' * * * | |}}}] * * | |}]} * -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - * / / / Function transformer(AST) {// We create newAst, like the AST above, Let newAst = {type: 'Program', body: [],}; // There are usually better abstractions to do this. If you want to do this, you can use a _content method on the old AST to store a reference to the new AST. Here we try to handle ast._context = newast.body as simply as possible; Traverser (ast, {// The first traverser method is used to process NumberLiteral NumberLiteral: {// We will call Enter (node, parent) when we access this node type. {// The new node is also called NumberLiteral, and we will push it into the new AST tree. 'NumberLiteral', value: node.value,});},}, // { enter(node, parent) { parent._context.push({ type: 'StringLiteral', value: Node.value,});},}, // Now it's time for CallExpression: {enter(node, parent) {// We create the new language CallExpression expression structure let expression = {type: 'CallExpression', callee: {type: 'Identifier', name: node.name, }, arguments: [],}; // Now we define the new _context on the original CallExpression node, Mount the node node to which you want to convert expression. Arguments node._context = expression.arguments; CallExpression if (parent. Type! == 'CallExpression') {// we use ExpressionStatement node to wrap CallExpression node, // why do we do this, because in JavaScript, Expression = {type: 'ExpressionStatement', expression: Expression,};} // Finally we push CallExpression into the _context of the parent, which can be another function call parent._context.push(expression);},}}); Return newAst;}Copy the code

3. Code generation

CodeGenerator: : Generates code

/** * now let's look at the last step: */ function codeGenerator(node) {switch (node.type) {// If the Program root node. Case 'Program': return node.body.map(codeGenerator).join('\n'); // ExpressionStatement (ExpressionStatement) Case 'ExpressionStatement': return (codeGenerator(node.expression) + '; '/ / < <... Because we want it to be syntactically correct)); // For CallExpression, we will print its callee, concatenation (, map iterates through its arguments, and finally add) case 'CallExpression': return ( codeGenerator(node.callee) + '(' + node.arguments.map(codeGenerator) .join(', ') + ')' ); Case 'Identifier': return node.name; Case 'NumberLiteral': return node.value; // For NumberLiteral, we wrap "" around node value and return case 'StringLiteral': return '" + node.value + '"'; Default: throw new TypeError(Node.type); // If no processing rule is matched, throw a type error. }}Copy the code

4. Encapsulate the methods in stages 1, 2 and 3 into compiler

/** * Finally, we create a compiler function * here, we string them together with paths like this, Tokens => Parser => AST * 3. Ast => Transformer => newAst * 4. NewAst =>  generator => output */


function compiler(input) {
  let tokens = tokenizer(input);
  let ast    = parser(tokens);
  let newAst = transformer(ast);
  let output = codeGenerator(newAst);

  return output;
}
Copy the code

Recall our previous chart, if you have a mental model in mind

Lisp JavaScript
2 + 2 (add 2 2) add(2, 2)
2 + (5-3) (add 2 (subtract 5 3)) add(2, subtract(5, 3))

Complete sample code for the compiler:Click here to get

How does Babel escape through AST?

Recall the three stages of the compiler:

Parsing => 3. Parsing => 3. Parsing => 3. Code GenerationCopy the code

Babel has toolkits for each stage

1. The analytic (Parsing) = > @ Babel/parser (https://babeljs.io/docs/en/babel-parser) 2. Translation (Transformation) = > @ Babel/traverse by 3 (https://babeljs.io/docs/en/babel-traverse). Code Generation (Code Generation) = > @ Babel/generator (https://babeljs.io/docs/en/babel-generator)Copy the code

The structure of AST is relatively complex, and Babel has a package of @babel/types to facilitate the operation of AST nodes

Let’s continue with the second example, adding console to the output of the function name to make it easier to identify the printed information

function plus (args = []) {
  return args.reduce((total, item) = > total + item, 0);
}

function subtract (args = []) {
  return args.reduce((total, item) = > total - item, 0);
}

function multiply (args) {
  return args.reduce((total, item) = > total * item, 0);
}

function divide (args) {
  console.log(args);
  return args.reduce((total, item) = > total / item, 0);
}

function calc(type, ... args) {
  console.log(type);
  console.log(args);
  switch (type) {
    case 'plus':
      return plus(args);
    case 'subtract':
      return subtract(args)
    case 'multiply':
      return multiply(args)
    case 'divide':
      return divide(args)
  }
}
Copy the code

Attach the complete code

const t = require('@babel/types');
const babelParser = require('@babel/parser');
const generate = require('@babel/generator').default;
const traverse = require('@babel/traverse').default;

const code = ` function plus (args = []) { return args.reduce((total, item) => total + item, 0); } function subtract (args = []) { return args.reduce((total, item) => total - item, 0); } function multiply (args) { return args.reduce((total, item) => total * item, 0); } function divide (args) { console.log(args); return args.reduce((total, item) => total / item, 0); } function calc(type, ... args) { console.log(type); console.log(args); switch (type) { case 'plus': return plus(args); case 'subtract': return subtract(args) case 'multiply': return multiply(args) case 'divide': return divide(args) } } `;

/ / 1. Parsing
const ast = babelParser.parse(code);

/ / 2. The conversion
/ / the node type and data structure, can see estree specification: https://github.com/estree/estree
traverse(ast, {
  // This visitor is called when a node is entered
  enter(path) {
    console.log('Log: path.node.type ->', path.node.type);
  },
  // This visitor is called when the node exits
  exit(path) {
    // console.log('exit')
  },
  // You can write the node type directly, and if the type is scanned, that visitor is called
  CallExpression(path) {
    const { callee } = path.node;

    // Check whether console.log has a parent node
    const isConsoleLog = t.isMemberExpression(callee) && callee.object.name === 'console' && callee.property.name === 'log';

    if (isConsoleLog) {
      // If the parent node is a declared function, take its function name and add it to console
      const funcPath = path.findParent(p= > p.isFunctionDeclaration());
      if (funcPath) {
        constfuncName = funcPath.node.id.name; path.node.arguments.unshift(t.stringLiteral(funcName)); }}}})// 3. Code generation
const output = generate(ast)  
console.log('Input \n', code)
console.log('-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --')
console.log('Output \n', output.code)
Copy the code

The calc function console.log has added the function name as expected

function calc(type, ... args) {
  console.log("calc", type);
  console.log("calc", args); . }Copy the code

This is the end of AST introduction, do you have some harvest

Finally, I strongly recommend taking some time to think about the code in Part 2


References:

  1. Github.com/estree/estr…
  2. Github.com/jamiebuilds…
  3. babeljs.io/docs/en/
  4. Juejin. Cn/post / 684490…