From 0 to 1, build the abstract code tree

1. Introduction

By referring to the static analysis process of JavaScript code with Babel toolchain, the author implements a set of simple abstract syntax tree construction process. Because there are few Chinese documents related to abstract syntax tree, this article is recorded in this paper.

2. Keywords: abstract syntax tree

Abstract Syntax Tree (AST) is an Abstract representation of the Syntax structure of source code. It represents the syntactic structure of a programming language in the form of a tree, with each node in the tree representing a structure in the source code. The grammar is “abstract” because it does not represent every detail that occurs in real grammar. For example, nested parentheses are implicit in the structure of the tree and are not presented as nodes. That’s all you can get from a search engine. Here’s a look at the less documented AST generation process and a specific way to use an AST.

3.AST generation process

3.1 Lexical analysis: Generate token object array.

Lexical analysis reads the code in the form of a string, parses the code according to the rules, and generates an array of tokens (token stream) composed of individual tokens. At the same time, it removes whitespace, comments, and so on. The common steps to disassemble a token object in code are as follows:

1. Determine the token type, such as number, string, keyword, and variable 2. Determine the token matching method: the tokenizer function. When reading the code, the function iterates according to the increasing subscript of the characters in the code string. The tokenizer function is recursively executed. According to the token type, the function matches the characters in the corresponding regular expression, strong equal, and other ways. 3. Generate a token object: Attributes of a token object often include the token type and code content. According to actual needs, the token object can also carry auxiliary information such as its coordinates in the editor.Copy the code

A custom syntax code “if A 5 then B 8” is converted to A tokens stream.

let tokens = [
  {
    token: "if".type: "Identifier"}, {token: "A".type: "identifier"}, {token: "5".type: "number"}, {token: "then".type: "identifier"}, {token: "B".type: "identifier"}, {token: "8".type: "number",},];Copy the code

3.2 Syntax Analysis: Generate an AST

Syntax analysis will parse each token object according to a certain form, forming a tree structure, each layer of the tree structure is called node, nodes work together on the static analysis of the program code, and verify the syntax, throw syntax error information.

3.2.1 Node Structure

The semantics themselves represent nodes of a value that are literal nodes and act as “leaves” in the tree structure. Since each resolved token carries a type attribute, we can easily match the corresponding literal node with the type attribute, such as: A token whose type attribute is “number” represents a value of type number semantically. As a leaf, it has no other child nodes to contain in the tree structure, so it can generate a literal node and set the type attribute to “NumberLiteral”. The literal of the Number type. The same goes for strings, Boolean values, regular expressions, and so on. For details on node names, see the AST object document.

The generation function for the Number literal node is as follows.

if (curToken.type === 'number') {
    // A number or string, indicating that this is a constant expression
    const literal = {
      type: 'NumberLiteral'.value: eval(curToken.value),
    };
    // Point current to the next token object
    nextToken(); 
    return literal;
 }
Copy the code

The AST uses “branch” nodes to combine multiple nodes into larger grammatical branches. They combine single or multiple leaf nodes or branch nodes into complex program syntax. If statement nodes, block domain nodes, function expression nodes, etc., can all be considered branches of the syntax tree. Here is an example. The if statement node usually has three properties: test, possession, and alternate. 1. The test attribute represents a conditional expression node. 2. Possession indicates the execution statement if the condition is true; possession is usually a block domain node. 3. The alternate property represents the statement node that follows the else, usually a block domain node or null, or a new if statement node, in which case the syntactic structure looks like: if(a){… }else if(b){ … }

The generation function of the if statement tree node is as follows:

 if (curToken.type === 'identifier' && curToken.value === 'if') {
   // Parse the if statement
   const statement = {
     type: 'IfStatement'};// if must be followed by (
   nextToken();
   if(curToken.type ! = ='parens'|| curToken.value ! = ='(') {
     throw new Error('Expected ( after if');
 
   // The following expression is the condition of if
   statement.test = nextExpre
   // Must be ()
   nextToken();
   if(curToken.type ! = ='parens'|| curToken.value ! = =') ') {
     throw new Error('Expected ) after if test expression');
 
   // The next statement is executed when if is true
   statement.consequent = nextStat
   // If the next symbol is else, there is logic where if fails
   if (curToken === 'identifier' && curToken.value === 'else') {
     statement.alternative = nextStatement();
   } else {
     statement.alternative = null;
   }
   return statement;
 }
Copy the code
3.2.2 Simple Example

After we deconstruct and define the internal structure of more branch nodes as described in section 3.2.1, we can construct the AST according to the structural limitations of these nodes. According to the internal structure of the if statement, the tree structure is defined for the corresponding node type of each token. According to the structural restrictions of these nodes, the recursive AST constructor is designed to realize the recursive execution process of the function. Executing such a function gradually generates the branch structure until the complete tree structure is completed. An example of AST generation for a simple javaScript code is as follows:

1. The javaScript code:

                            if ( 1 > 0 ) { alert( 'aa' ) }
Copy the code

2. Generate tokens stream

[{type: "identifier".value: "if"},
  {type: "parens".value: "("},
  {type: "number".value: "1"},
  {type: "operator".value: ">"},
  {type: "number".value: "0"},
  {type: "parens".value: ")"},
  {type: "brace".value: "{"},
  {type: "identifier".value: "alert"},
  {type: "parens".value: "("},
  {type: "string".value: "'aa'"},
  {type: "parens".value: ")"},
  {type: "brace".value: "}"},]Copy the code

3. Generate AST

{
  "type": "Program"."body": [{"type": "IfStatement"."test": {
        "type": "BinaryExpression"."left": {
          "type": "Literal"."value": 1
        },
        "right": {
          "type": "Literal"."value": 0}},"consequent": {
        "type": "BlockStatement"."body": [{"type": "ExpressionStatement"."expression": {
              "type": "CallExpression"."caller": {
                "type": "Identifier"."value": "alert"
              },
              "arguments": [{"type": "Literal"."value": "aa"}]}}]},"alternative": null}}]Copy the code

4. An application of AST: Babel toolchain

The ROLE of the AST in the Babel toolchain is to convert ECMAScript 2015+ code into backward compatible JavaScript syntax, which goes through the process of generating the AST and then converting the AST to generate object code.

1. Design a visitor. The visitor is an object, each key of the visitor is the type name of each node of the AST, and the value of the object is a function that needs to describe the transformation process of the AST. 2. Design functions to traverse all nodes of a tree, transform nodes, and generate a new AST after transformation. Traversal to branch nodes generally recursively traversal the child nodes, leaf nodes directly perform its conversion function. 3. Generate object code. According to the new AST and the syntax specification of the target language, the code generation method corresponding to each node is designed, layers of nodes are traversed, and new syntax is given to the original code.Copy the code

5. Conclusion

In this paper, the generation process of AST and an application of Babel to AST are introduced. In the generation process section, two steps of lexical parsing and syntax parsing in AST generation process are described. In the AST application, the Babel toolchain to Javascript language syntax conversion process is introduced. In addition to the way Babel is used, there are many things you can do with aN AST, such as code execution, syntax alerting, custom packaging solutions, and so on. The AST often provides important capabilities when faced with such requirements, and I hope this article will be helpful at that point.