When I analyzed the Babel compilation process, I mentioned that Babel converts JS code into an AST (abstract syntax tree). This behavior is universal: compilers for every programming language parse source code into an AST; the AST is not specific to Babel, or even to JS.

Why do this? The original JS file is incomprehensible to the computer, and it is difficult for a machine to modify JS code directly as text. After converting it into an AST, however, things change: since an AST is essentially a group of objects representing the program structure, we can modify the code indirectly by modifying those objects. Chrome's V8 does the same, and goes one step further than Babel by compiling the AST into bytecode.
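
To make this concrete, here is a minimal sketch of the parse-modify-regenerate cycle using @babel/parser and @babel/generator (the modification is hand-rolled on plain objects for illustration; real plugins use @babel/traverse):

const parser = require("@babel/parser");
const generate = require("@babel/generator").default;

// 1. Parse the source into an AST
const ast = parser.parse("const a = 1;");

// 2. Modify the AST: the nodes are plain objects we can edit directly
const declaration = ast.program.body[0]; // the VariableDeclaration node
declaration.declarations[0].id.name = "answer";
declaration.declarations[0].init.value = 42;

// 3. Generate new source code from the modified AST
console.log(generate(ast).code); // => const answer = 42;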

JavaScript AST specification

An AST is essentially a set of objects that represent the structure of a program, so who dictates what those objects should look like? The original JavaScript AST format came from SpiderMonkey, the first-generation JavaScript engine designed by Brendan Eich, the father of JavaScript. It was later maintained by Mozilla and eventually open-sourced as the Parser API. Over time, this format became a common language for tools that manipulate JavaScript source code.

At the same time, JavaScript itself kept evolving. Several well-known community members created the ESTree project to keep the AST rules up to date with the development of the JavaScript language, and ESTree is now the community standard. The AST used in Babel is based on ESTree, although Babel has made some modifications of its own.

To give an example of what an AST looks like, the following simple variable declaration produces the AST below (with some attributes removed for simplicity).

const a = 1;

{
  "type": "Program",
  "comments": [],
  "sourceType": "module",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": {
            "type": "Identifier",
            "name": "a"
          },
          "init": {
            "type": "Literal",
            "value": 1
          }
        }
      ],
      "kind": "const"
    }
  ]
}

In this JSON tree, each level has a similar structure; these structures are called nodes. An AST can consist of a single node or of hundreds or thousands of nodes.

interface Node {
  type: string;
}

Every node has a type field that indicates the type of the node (for example, "FunctionDeclaration", "Identifier", or "BinaryExpression"). Each type of node defines additional attributes that further describe it. Babel also generates extra attributes for each node, such as loc, start, and end, which describe the node's location in the original code.
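
For example, the ESTree spec describes an Identifier node roughly like this (it extends both Expression and Pattern):

interface Identifier <: Expression, Pattern {
  type: "Identifier";
  name: string;
}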

Let’s now start analyzing how an AST is generated.

Lexical analysis

In general, the first step of compilation is lexical analysis; @babel/parser performs both lexical analysis and syntax analysis. Just as understanding a passage correctly requires knowing each word, a program must first divide the code correctly into individual words, each of which is called a token. For example, for the following code, the correct segmentation is ['a', '>=', '1'].

a >= 1;

Lexical analysis can be simply understood as word segmentation. For a natural language, word segmentation is a very complex problem, but for a programming language, the grammar rules let us simplify the process and distinguish different tokens. Let's look at the tokenization code in Babel; each case is one tokenization rule. Because the full code is long, only some of the rules are listed below.

packages/babel-parser/src/tokenizer/index.js

getTokenFromCode(code: number): void {
  switch (code) {
    // '.' dot
    case charCodes.dot:
      this.readToken_dot();
      return;

    // '(' left parenthesis
    case charCodes.leftParenthesis:
      ++this.state.pos;
      this.finishToken(tt.parenL);
      return;

    // digits 1-9
    case charCodes.digit1:
    case charCodes.digit2:
    case charCodes.digit3:
    case charCodes.digit4:
    case charCodes.digit5:
    case charCodes.digit6:
    case charCodes.digit7:
    case charCodes.digit8:
    case charCodes.digit9:
      this.readNumber(false);
      return;

    // '"' or "'" double or single quote
    case charCodes.quotationMark:
    case charCodes.apostrophe:
      this.readString(code);
      return;

    // none of the above matched
    default:
      if (isIdentifierStart(code)) {
        this.readWord();
        return;
      }
  }
}

As you can see, the essence of lexical analysis is encoding rules as a program. For example, to recognize the number 101, once the scanner meets a numeric character it keeps treating the input as a number until it encounters a non-numeric character. Likewise, to recognize a string like "ABC", once the scanner meets a double or single quote it keeps treating the input as a string until it reaches the matching closing quote.
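
A minimal sketch of the string rule just described (hand-rolled for illustration; Babel's actual readString also handles escapes and errors):

// Scan a string literal starting at the opening quote.
function readString(input, pos) {
  const quote = input[pos]; // '"' or "'"
  let end = pos + 1;
  // Keep consuming characters until the matching closing quote.
  while (end < input.length && input[end] !== quote) end++;
  return { token: input.slice(pos, end + 1), nextPos: end + 1 };
}

readString('say "ABC" now', 4); // => { token: '"ABC"', nextPos: 9 }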

For the code const a = 1;, since the first character is a 'c', which is neither a digit nor a symbol, the scanner enters the flow for deciding whether this is an Identifier. The isIdentifierStart method determines whether the first character of an identifier is valid. The code below encodes the rule that an identifier can start with $, _, or a letter.

export function isIdentifierStart(code: number): boolean {
  // below 'A': only '$' is valid
  if (code < charCodes.uppercaseA) return code === charCodes.dollarSign;
  // A-Z is valid
  if (code <= charCodes.uppercaseZ) return true;
  // between 'Z' and 'a': only '_' is valid
  if (code < charCodes.lowercaseA) return code === charCodes.underscore;
  // a-z is valid
  if (code <= charCodes.lowercaseZ) return true;
  if (code <= 0xffff) {
    return (
      // non-ASCII characters
      code >= 0xaa && nonASCIIidentifierStart.test(String.fromCharCode(code))
    );
  }
  // code points above 0xffff may also be valid
  return isInAstralSet(code, astralIdentifierStartCodes);
}

The readWord method then reads characters forward from the current starting point, checking whether each character satisfies the identifier constraint (isIdentifierChar): 'o', 'n', 's', 't' obviously do, up to the space, at which point const is taken as a complete identifier.
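
A sketch of this forward scan (again simplified; Babel's real readWord also handles Unicode escapes):

// Read an identifier starting at pos, consuming characters while they stay valid.
function readWord(input, pos) {
  let end = pos;
  while (end < input.length && isIdentifierChar(input.charCodeAt(end))) end++;
  return input.slice(pos, end);
}

readWord("const a = 1;", 0); // => "const" (the scan stops at the space)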

export function isIdentifierChar(code: number): boolean {
  // below '0': only '$' is valid
  if (code < charCodes.digit0) return code === charCodes.dollarSign;
  // 0-9 is valid
  if (code < charCodes.colon) return true;
  // ... the rest is essentially the same as isIdentifierStart
}

Next, the parser determines whether const is a keyword. Babel defines 35 keywords, including the common if, else, switch, and so on. The token type is looked up from the keyword table, and each type corresponds to different parsing rules. Beyond that, lexical analysis also records each token's location and some contextual information.

const type = keywordTypes.get(word) || tt.name;
export const types: { [name: string]: TokenType } = {
  ...
  // Keywords
  // Don't forget to update packages/babel-helper-validator-identifier/src/keyword.js
  // when new keywords are added
  _break: createKeyword("break"),
  _case: createKeyword("case", { beforeExpr }),
  _catch: createKeyword("catch"),
  _continue: createKeyword("continue"),
  _debugger: createKeyword("debugger"),
  _default: createKeyword("default", { beforeExpr }),
  _do: createKeyword("do", { isLoop, beforeExpr }),
  _else: createKeyword("else", { beforeExpr }),
  _finally: createKeyword("finally"),
  _for: createKeyword("for", { isLoop }),
  _function: createKeyword("function", { startsExpr }),
 ...
};
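
The lookup itself is simple: keywords live in a Map from word to token type, and anything not found is a plain name. A simplified sketch of the idea (not Babel's actual code):

// word -> token type; a miss means the word is an ordinary identifier
const keywordTypes = new Map([
  ["const", createKeyword("const")],
  ["if", createKeyword("if")],
  // ... the other keywords
]);

function tokenTypeForWord(word) {
  return keywordTypes.get(word) || tt.name;
}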

The tokenization rules we implemented through switch/case actually form a "finite state machine" (FSM). A finite state machine has a finite number of states, and each state can transition to zero or more other states. For example, if we need to distinguish two kinds of tokens, such as the number 11 and the identifier abc, we can build the following FSM.

It has three states: state 0 is the initial state, state 1 is an identifier (a word like abc), and state 2 is a number. Scanning starts in the initial state 0; the initial state is an unstable state, represented on the diagram as a single circle. By checking whether the first character is a letter or a digit, the machine moves to state 1 or state 2. If subsequent characters in state 2 are still digits, it remains in state 2.

FSMs are a natural fit for lexical analysis; combined with regular expressions, we can implement tokenization even more quickly. Looking back at the getTokenFromCode method, each case represents a state transition; each state leads to a new state, eventually ending in a stable state, at which point a token has been produced.
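
To make the state transitions concrete, here is a toy FSM tokenizer for just those two token kinds, digits and letters, mirroring the state diagram above (an illustrative sketch, not Babel code):

// states: 0 = initial, 1 = inside an identifier, 2 = inside a number
function tokenize(input) {
  const tokens = [];
  let state = 0;
  let start = 0;
  for (let i = 0; i <= input.length; i++) {
    const ch = input[i] || " "; // pad the end so the last token is flushed
    const isLetter = /[a-z]/i.test(ch);
    const isDigit = /[0-9]/.test(ch);
    if (state === 0) {
      // unstable initial state: decide which stable state to enter
      if (isLetter) { state = 1; start = i; }
      else if (isDigit) { state = 2; start = i; }
    } else if ((state === 1 && !isLetter) || (state === 2 && !isDigit)) {
      // leaving a stable state: a complete token has been recognized
      tokens.push(input.slice(start, i));
      state = 0;
      if (isLetter) { state = 1; start = i; }
      else if (isDigit) { state = 2; start = i; }
    }
  }
  return tokens;
}

tokenize("abc 11"); // => ["abc", "11"]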

Syntax analysis

Syntax analysis is where grammar rules come in. For example, to support expression statements we need to define the rules of addition, subtraction, multiplication, division, and comparison, while also taking operator precedence and associativity into account. Syntax analysis has a complete body of theory and practical references behind it, and open-source tools such as ANTLR can help us implement these rules quickly. The JS syntax rules can be found in the ECMA specification.

The result of syntax analysis is the AST. The process is fairly complicated and involves a series of compiler-theory concepts; here I will try to look at syntax analysis from a plain programming perspective.

Take const a = 1;. This is a variable declaration statement: it starts with var, let, or const, followed by an identifier, then an optional initialization part consisting of an equals sign and an expression, and it ends with a semicolon.

Parsing proceeds as follows: lexical analysis produces the const token and determines that it is a keyword; based on the token type, the parser determines that this is a variable declaration statement and executes the parseVarStatement method.

packages/babel-parser/src/parser/statement.js

parseVarStatement(
    node: N.VariableDeclaration,
    kind: "var" | "let" | "const",
  ): N.VariableDeclaration {
    // Get the next token, which is the identifier a
    this.next();
    // Build the declarations: parse the identifier a and its initial value
    this.parseVar(node, false, kind);
    // Expect a semicolon ';' as the end of the statement
    this.semicolon();
    // Finish the node with type VariableDeclaration
    return this.finishNode(node, "VariableDeclaration");
}

Below is the AST node representing a variable declaration (with some attributes removed for simplicity). type is an attribute that every node has; the other attributes depend on the node's type. In a VariableDeclaration node, declarations holds the details of the variables; it is an array because we can declare multiple variables at once. id is the name of the variable, and init is the value the variable is initialized with. Our goal is to produce exactly this kind of AST.

{
  "type": "VariableDeclaration",
  "declarations": [
    {
      "type": "VariableDeclarator",
      "id": {
        "type": "Identifier",
        "name": "a"
      },
      "init": {
        "type": "Literal",
        "value": 1
      }
    }
  ],
  "kind": "const"
}

parseVarStatement analyzes the subsequent tokens to construct the declarations node. this.next() obtains the next token, the identifier a, and the parseVar method parses the concrete variable declarations.

packages/babel-parser/src/parser/statement.js (some code omitted for simplicity)

parseVar(
    node: N.VariableDeclaration,
    isFor: boolean,
    kind: "var" | "let" | "const",
  ): N.VariableDeclaration {
    // Initialize node.declarations as an empty array
    const declarations = (node.declarations = []);

    // Record the kind: var, let, or const
    node.kind = kind;
    for (;;) {
      // Construct a new child node
      const decl = this.startNode();

      // Construct the node's id; 'a' will be treated as an Identifier.
      this.parseVarId(decl, kind);

      // Check whether there is an equals sign '='
      if (this.eat(tt.eq)) {

        // There is one, meaning the variable declares an initial value,
        // which must be parsed into the node's init property
        decl.init = isFor
          ? this.parseMaybeAssignDisallowIn()
          : this.parseMaybeAssignAllowIn();
      }

      // decl is a node of type VariableDeclarator, with an id attribute node
      // and possibly an init attribute node
      declarations.push(this.finishNode(decl, "VariableDeclarator"));

      // Consume the comma between multiple declarators; stop if there is none.
      if (!this.eat(tt.comma)) break;
    }
    return node;
  }

Briefly translating the code above: parseVar builds the declarations array, where each element is a node of type VariableDeclarator. It parses the first token as a variable name of type Identifier; if an equals sign follows, the variable declares an initial value, which is parsed into the node's init property. The for loop lets the code parse multiple variable declarations in one statement.
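
For example, parsing const a = 1, b = 2; with these rules yields a single VariableDeclaration whose declarations array contains two VariableDeclarator nodes (simplified):

{
  "type": "VariableDeclaration",
  "kind": "const",
  "declarations": [
    { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "a" }, "init": { "type": "Literal", "value": 1 } },
    { "type": "VariableDeclarator", "id": { "type": "Identifier", "name": "b" }, "init": { "type": "Literal", "value": 2 } }
  ]
}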

The initial value of a variable can be not only a literal but also an expression. Let's modify the previous code as follows.

const a = 1 + 2 * 3;

Because expressions have precedence, operators of the same precedence are left-associative, and parentheses can change precedence, constructing an appropriate model for expressions is not easy.

Below is the AST node generated for this expression. Its type is BinaryExpression, which denotes a binary operation. A BinaryExpression has three attributes: left, operator, and right. left and right are nested structures that may themselves contain a BinaryExpression. This nested design encodes operator precedence: the inner layer is always evaluated first.

"init": {
  "type": "BinaryExpression"."left": {
    "type": "Literal"."value": 1,},"operator": "+"."right": {
    "type": "BinaryExpression"."left": {
      "type": "Literal"."value": 2,},"operator": "*"."right": {
      "type": "Literal"."value": 3,}}}Copy the code

From this we can see that in an expression node, outer nodes represent low-priority computation and inner nodes represent high-priority computation. This is also the most basic idea behind parsing expressions: define the basic operation rules, make recursive calls according to the precedence relationships, and during execution the program derives the result from these rules and the recursion.

MDN has a complete table of operator precedence, and that is what governs how expressions are written. Following the precedence rules, we try to match high-priority expressions first and low-priority expressions later. To achieve this, the code is designed so that the parser first calls the function for the lowest-priority rule, which generates the outer node; that function in turn calls the functions for higher-priority rules, layer by layer, until a match is found.
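
A toy recursive-descent parser for just + and * makes this call direction concrete: the lower-priority function calls the higher-priority one (an illustrative sketch, not Babel's implementation):

// tokens: a pre-split array such as ["1", "+", "2", "*", "3"]
function parseExpression(tokens) {
  let pos = 0;

  // lowest priority: additive expressions delegate to multiplicative ones
  function parseAdditive() {
    let left = parseMultiplicative();
    while (tokens[pos] === "+") {
      pos++;
      left = { type: "BinaryExpression", left, operator: "+", right: parseMultiplicative() };
    }
    return left;
  }

  // higher priority: multiplicative nodes end up deeper in the tree
  function parseMultiplicative() {
    let left = parsePrimary();
    while (tokens[pos] === "*") {
      pos++;
      left = { type: "BinaryExpression", left, operator: "*", right: parsePrimary() };
    }
    return left;
  }

  function parsePrimary() {
    return { type: "Literal", value: Number(tokens[pos++]) };
  }

  return parseAdditive();
}

parseExpression(["1", "+", "2", "*", "3"]);
// => the "*" node is nested inside the "+" node's right side, i.e. 1 + (2 * 3)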

As the name parseMaybeAssign suggests, when Babel parses an expression it first assumes it might be an assignment expression, because assignment has low precedence and must therefore be an outer node. Inside the method, it first tries to match a conditional expression, then checks whether an assignment operator follows; if so, it builds an AST node with type AssignmentExpression around it.

parseMaybeAssign(
    refExpressionErrors?: ?ExpressionErrors,
    afterLeftParse?: Function,
    refNeedsArrowPos?: ?Pos,
  ): N.Expression {
    // (the yield checks are omitted here)
    // yield has lower precedence than assignment and can be detected directly,
    // because its syntax is obvious: it sits at the head of the expression

    // Try to match a conditional expression
    let left = this.parseMaybeConditional(
      refExpressionErrors,
      refNeedsArrowPos,
    );

    // It is an assignment expression, such as a += 1
    if (this.state.type.isAssign) {
      const node = this.startNodeAt(startPos, startLoc);
      const operator = this.state.value;
      node.operator = operator;

      // left is a node
      node.left = left;

      // Parse the value to the right of the assignment operator, i.e. 1
      this.next();
      node.right = this.parseMaybeAssign();

      // The type AssignmentExpression marks an assignment expression
      return this.finishNode(node, "AssignmentExpression");
    }

    return left;
  }

Conditional operators have higher precedence than assignment, which is why parseMaybeAssign calls parseMaybeConditional. Logical operations in turn have higher precedence than the conditional operator, so inside parseMaybeConditional, parseExprOps is called first.

parseMaybeConditional(
  refExpressionErrors: ExpressionErrors,
  refNeedsArrowPos?: ?Pos,
): N.Expression {
  const startPos = this.state.start;
  const startLoc = this.state.startLoc;
  const potentialArrowAt = this.state.potentialArrowAt;

  // Try to match logical operators
  const expr = this.parseExprOps(refExpressionErrors);

  // Exit expression parsing early for arrow functions
  if (this.shouldExitDescending(expr, potentialArrowAt)) {
    return expr;
  }

  // Check for the ternary conditional operator (? :), a variant of if/else
  return this.parseConditional(expr, startPos, startLoc, refNeedsArrowPos);
}

parseConditional checks for the ternary conditional operator (? :). If it is present, the preceding expression is the condition and is placed on the test attribute; then parseMaybeAssign is called recursively to parse the two expressions that follow.

Simply put, it expresses the rule: test ? consequent : alternate.

parseConditional(
  expr: N.Expression,
  startPos: number,
  startLoc: Position,
  // FIXME: Disabling this for now since can't seem to get it to play nicely
  // eslint-disable-next-line no-unused-vars
  refNeedsArrowPos?: ?Pos,
): N.Expression {
  if (this.eat(tt.question)) {
    const node = this.startNodeAt(startPos, startLoc);
    node.test = expr;
    node.consequent = this.parseMaybeAssignAllowIn();
    this.expect(tt.colon);
    node.alternate = this.parseMaybeAssign();
    return this.finishNode(node, "ConditionalExpression");
  }
  return expr;
}

The idea behind parseExprOps is the same, so I won't analyze it in detail.

From the code's perspective, syntax analysis is a series of nested rule matches: lower-priority rules nest higher-priority rules (only in this way can we guarantee that higher-priority operations are evaluated first). Notice how "clean" the code for these rules is; the resulting AST structure is not affected by the invocation context. If you know compiler theory, you will recognize this as a code implementation of a context-free grammar.
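
For reference, the ECMAScript grammar expresses exactly this nesting. Simplified, the relevant productions look like this:

AssignmentExpression :
    ConditionalExpression
    LeftHandSideExpression = AssignmentExpression

ConditionalExpression :
    ShortCircuitExpression
    ShortCircuitExpression ? AssignmentExpression : AssignmentExpression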

Conclusion

  1. @babel/parser performs lexical analysis and syntax analysis on the original code to generate an AST. An AST is essentially a set of objects that represent the structure of a program, and we can modify the code indirectly by modifying these objects. Modern front-end technologies such as Sass, Less, and ES6+ rely on this ability to achieve their syntactic breakthroughs.

  2. Lexical analysis can be abstracted as a "finite state machine": in a given state, once certain conditions are met, the machine transitions to a new state. At the code level, this is a series of switch/case statements. After lexical analysis, the code is precisely segmented, and each segmented word is called a token.

  3. The complexity of syntax analysis lies in the nesting and matching of rules and in handling associativity. Rules with lower priority nest rules with higher priority, which guarantees that higher-priority rules are matched first. Syntax analysis involves many details; it is worth understanding basics such as context-free grammars and LL parsing before analyzing it in depth.

The core of Babel compilation is manipulating AST objects. Next up: how does a Babel plugin manipulate the AST?

If you think you’ve learned something, please give it a thumbs up!