Author: Chen Xiaoqiang

A basis,

Why do we need to know about abstract syntax trees

Daily work, we’ll meet js code parsing scenes, such as the analysis of the require in the code which packages, what are some key API calls, mostly use regular expressions to process, but a complex scene, or code depends on the context, the regular is difficult to handle, then going to use the abstract syntax tree. Common uglify, ESLint, Babel, Webpack, and so on are all handled based on abstract syntax trees and are so powerful that it’s worth getting to know them.

What is an abstract syntax tree

Abstract Syntax Tree is the Abstract Syntax Tree. AST for short, see the following figure.

  1. The code in the figure is first parsed into a tree data structure
  2. Then the nodes in the tree are transformed, and the leaf nodes are switched in the figure
  3. Regenerate the tree structure into code by generate

The tree data structure in the figure is the AST, and you can see how the code can be changed by manipulating nodes after being converted to the AST.

How do I get an abstract syntax tree

The process of obtaining an abstract syntax tree is as follows: Code => lexical => AST Lexical: Convert string code into a stream of tokens. Parsing: Converts a token stream into an AST. This phase uses the information in the tokens to transform them into an AST representation structure that makes subsequent operations easier. As shown below, the code is a simple function declaration. In the lexical analysis stage, the code is input as a string to obtain keywords. Function, square, (,), {,} and so on are identified as keywords in the figure (recall the compilation principle for a moment, characters are pushed one by one, if they meet certain rules, they are pushed out of the stack). In the stage of grammar analysis, keywords are combined to form nodes. For example, three key phrases, n* N, are combined to form binary expressions. Keyword RETURN and binary expressions are combined to form return statements. Finally, it is combined into a function declaration statement.

Second, the specification

How to get an AST has been described briefly. What data structure should an AST end up in

The industry already has a mature set of specifications for what the analysis is based on and why the structure of the above figure appears.

Origin of specification

Before v8, the earliest JS engine was SpiderMonkey. The first version was designed by JS author Brendan Eich and later handed over to the Mozilla organization for maintenance. When a JS engine executes a JS file, it first converts the JS code into an abstract syntax tree (AST). One day, a Mozilla engineer exposed the Parser_API for converting code to AST in FireFox, known as Parser_API[1], which was later incorporated into github project Estree [2] and has since become the industry standard.

Normative interpretation

Parser_API mentioned above is the original specification, Chinese version :Parser_API[3], but it is not very friendly to read. It is recommended to read the edited Git project Estree directly and open the project address, as shown below



es5.md
es2015.md
es2019.md

If you open the basic es5.md, you can see all the syntax basics. Here we read the categories with you. When reading specifications, you can use https://astexplorer.net/ to assist reading and output AST in real time.

  • Node objects
  • Programs
  • Identifier
  • Literal
  • Functions
  • Statements
  • Declarations
  • Expressions
  • Patterns

Node objects

interface Node {
    type: string;
    loc: SourceLocation | null;
}
Copy the code

Define the basic node types in the AST. All other specific nodes need to implement the above interfaces. That is, each node must contain two fields, type and LOC

The type field represents different node types, and we’ll talk more about each type and what syntax they correspond to in JavaScript below. You can see from this field which interface the node implements. The LOC field represents the source location information, null if there is no relevant information, otherwise it is an object containing the start and end positions. The interface is as follows

interface SourceLocation {
    source: string | null;
    start: Position;
    end: Position;
}
Copy the code

Each Position object contains row (starting at 1) and column (starting at 0) information, with the following interface

interface Position {
    line: number; / / > = 1
    column: number; / / > = 0
}
Copy the code

Programs

interface Program <: Node {
    type: "Program";
    body: [ Directive | Statement ];
}
Copy the code

A complete program code tree, usually as the root node

Identifier

interface Identifier <: Expression, Pattern {
    type: "Identifier";
    name: string;
}
Copy the code

Identifiers are names that we define when writing code, such as variable names, function names, and attribute names.

Literal

interface Literal <: Expression {
    type: "Literal";
    value: string | boolean | null | number | RegExp;
}
Copy the code

Literals such as “hello”, true, null, 100, /\d/, etc. Note that literals themselves are also expressions.

Functions

interface Function <: Node {
    id: Identifier | null;
    params: [ Pattern ];
    body: FunctionBody;
}
Copy the code

A function declaration or expression, where id is the function name, params is an array of identifiers, and body is the function body, which is also a statement block.

Statements

interface Statement <: Node { }
Copy the code

Statements. There are many subclasses, block statements, if/switch statements, return statements, for/while statements, with statements, and so on

Declarations

interface Declaration <: Statement { }
Copy the code

Declaration, the main subclasses are variable declaration, function declaration.

Expressions

interface Expression <: Node { }
Copy the code

Function (var arr = []), object (var obj = {}), assignment (a = 1), etc

Patterns

interface Pattern <: Node { }
Copy the code

Patterns, which are primarily meaningful in ES6 deconstruction assignments (let {name} = user, where {name} is ObjectPattern), can be understood in ES5 as something similar to identifiers.

Three, the present situation

From the above specification interpretation, we know the structure of the AST to be generated in the end. For javascript parsing, there are many mature parsers in the industry, which can convert JS code into an AST that complies with the specification

  • Esprima, the classic, comes earlier
  • Acorn, fork from Esprima, more streamlined code. Webpack uses Acorn for module resolution
  • UglifyJS2, mainly used for code compression
  • Babylon, Babel parser, fork from Acorn, the latest version is Babylon7, corresponding to NPM package @babel/ Parser
  • Espree, the default parser for ESLint, which follows the same set of specifications, can also be replaced with Babel’s parser
  • Flow, shift, etc

The basics of AST are covered and will be continued from a practical perspective in the next part

References [1] Parser_API [2] Estree [3] Parser_API(英 文)


If you think this article is valuable to you, please like it and follow our official website and WecTeam: