Talk

Recent work required me to parse JavaScript code. Regular expressions handle most cases, but once the result depends on the surrounding code context, regexes or simple character scanning are not enough. A language parser is needed to obtain the full AST (Abstract Syntax Tree).

Then I found several JavaScript parsers written in JavaScript:

  • Esprima
  • Acorn
  • UglifyJS 2
  • Shift

Judging by their commit histories, all of them are well maintained and keep up with new ES features. I looked into each of them briefly; here are some notes.

Esprima is the classic parser, and Acorn appeared a few years after it. According to Acorn's authors, they reinvented the wheel mostly for fun; its speed is comparable to Esprima's, but with less implementation code. The key point is that the ASTs produced by the two parsers (just the AST, not the tokens) both comply with The Estree Spec, which documents the JavaScript AST output by Mozilla's SpiderMonkey engine (see also SpiderMonkey in MDN). In other words, their results are largely compatible.

The now famous Webpack also uses Acorn to parse code.

UglifyJS, the well-known JavaScript minifier, ships its own parser, which can also output an AST. It is geared toward minification, though, so as a general-purpose parser it feels less focused.

I don't know much about Shift, except that it defines its own AST specification.

Esprima has a performance test on its website, and I ran it on Chrome with the following results:

Acorn performs well and follows the Estree spec (standards matter; I feel that following a common specification is an important foundation for code reuse), so I chose Acorn for parsing.

Google's Traceur is more of an ES6-to-ES5 compiler, so it does not fit the kind of parser we are looking for.

Let’s get down to the business of using Acorn to parse JavaScript.

API

The parser API is very simple:

const ast = acorn.parse(code, options)

Acorn has a rich set of options, including several hooks that accept callback functions. Here are the more important ones:

  • ecmaVersion: as the name suggests, this sets the ECMAScript version of the JavaScript to parse. The default was ES7 at the time of writing (newer Acorn releases expect this option to be set explicitly).
  • sourceType

    This option takes one of two values, module or script. The default is script.

    The difference lies mainly in strict mode and import/export. ES6 modules are always in strict mode, so you do not need to add use strict. Scripts, which are what we usually run in browsers, do not support import/export syntax.

    So with script, using import/export raises an error and strict mode can be declared explicitly; with module, no strict mode declaration is needed and import/export syntax can be used.
  • locations

    The default is false. When set to true, each AST node carries an additional loc object giving its start and end line and column numbers.
  • onComment

    Takes a callback function that is invoked whenever a comment is parsed, which lets you retrieve the contents of the comment. The argument list is: [block, text, start, end].

    block indicates whether it is a block comment, text is the comment content, and start and end are the positions where the comment starts and ends.
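
As a rough illustration of the onComment argument list, the sketch below builds a small collector callback. The helper name makeCommentCollector and the sample offsets are made up for this example, and the callback is invoked by hand here; in real use it would be passed as the onComment option of acorn.parse.

```javascript
// Hypothetical helper: builds a callback with the (block, text, start, end)
// argument list described above and stores whatever it receives.
function makeCommentCollector() {
  const comments = []
  const onComment = (block, text, start, end) => {
    comments.push({
      kind: block ? 'block' : 'line', // block comment or line comment
      text: text.trim(),
      start,
      end,
    })
  }
  return { comments, onComment }
}

// Simulated invocation with made-up offsets; a parser would normally call this.
const collector = makeCommentCollector()
collector.onComment(false, ' test', 30, 37)
collector.onComment(true, ' license header ', 0, 20)
console.log(collector.comments)
```

Keeping the collected array next to the callback makes it easy to read the comments back once parsing has finished.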

Espree, the Acorn-based parser used by ESLint, needs to support Esprima's attachComment option. When that option is set to true, Esprima attaches comment information (trailingComments and leadingComments) to the nodes of the parse result. Espree uses Acorn's onComment option to stay compatible with this Esprima feature.

Parsers usually also have an interface to get lexical analysis results:

const tokens = [...acorn.tokenizer(code, options)]

The second parameter of the tokenizer method also accepts the locations option.

Note that the token data structure Acorn produces differs from Esprima's.

The AST and tokens that Acorn produces are discussed in detail next.

Token

I searched for a long time, but could not find a detailed description of the token data structure, so I had to take a look myself.

The code I used to test parsing is:

import "hello.js"

var a = 2;

// test
function name() { console.log(arguments); }

The parsed token array is an array of objects like this:

Token {
    type:
     TokenType {
       label: 'import',
       keyword: 'import',
       beforeExpr: false,
       startsExpr: false,
       isLoop: false,
       isAssign: false,
       prefix: false,
       postfix: false,
       binop: null,
       updateContext: null },
    value: 'import',
    start: 5,
    end: 11 },

In the object under type, label is the kind of the current token, and keyword is the keyword, such as import or function in this example.

value is the value of the current token. start and end are its start and end positions, respectively.

Usually we need to focus on label/keyword/value. For more details, please refer to the source code: tokentype.js.
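
To make the label/keyword/value fields concrete, here is a minimal sketch that filters a token list down to its keywords. The sampleTokens array is hand-written to mirror only the fields shown above (everything beyond the first token, including the offsets, is illustrative rather than real parser output), and keywordTokens is a hypothetical helper name.

```javascript
// Hand-written token-like objects, mirroring only the fields shown above.
// Everything except the first entry is illustrative, not real parser output.
const sampleTokens = [
  { type: { label: 'import', keyword: 'import' }, value: 'import', start: 5, end: 11 },
  { type: { label: 'string', keyword: undefined }, value: 'hello.js', start: 12, end: 22 },
  { type: { label: 'var', keyword: 'var' }, value: 'var', start: 24, end: 27 },
  { type: { label: 'name', keyword: undefined }, value: 'a', start: 28, end: 29 },
]

// Keep only keyword tokens and report the keyword and its position.
function keywordTokens(tokens) {
  return tokens
    .filter((token) => token.type.keyword)
    .map((token) => ({ keyword: token.type.keyword, start: token.start, end: token.end }))
}

console.log(keywordTokens(sampleTokens))
```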

The Estree Spec

This section is the main part, because what I actually need is the parsed AST. The primary source is The Estree Spec; I am merely relaying what I read there.

The benefit of having a standard document is that much of this can be looked up, and there is even a tool that converts an Estree-compliant AST back into ECMAScript code: Escodegen.

Back to the topic: let's look at the ES5 part first. You can try out the parse results of various snippets on the Esprima: Parser page.

AST nodes that conform to this specification are represented as Node objects with the following interface:

interface Node {
    type: string;
    loc: SourceLocation | null;
}

The type field identifies the node type; each type, and the JavaScript syntax it corresponds to, is covered below. The loc field holds the source location of the node: null if no such information is available, otherwise an object containing the start and end positions. Its interface is as follows:

interface SourceLocation {
    source: string | null;
    start: Position;
    end: Position;
}

The Position object contains line and column information; lines start at 1 and columns at 0:

interface Position {
    line: number; // >= 1
    column: number; // >= 0
}
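
The 1-based line and 0-based column convention can be sketched with a small conversion from a character offset to a Position-shaped object. The function name offsetToPosition is made up for this illustration and is not part of any parser's API; it only treats "\n" as a line break.

```javascript
// Hypothetical helper: convert a character offset into a Position-like
// object with a 1-based line and a 0-based column.
function offsetToPosition(source, offset) {
  let line = 1
  let lineStart = 0 // offset of the first character of the current line
  for (let i = 0; i < offset; i++) {
    if (source[i] === '\n') {
      line++
      lineStart = i + 1
    }
  }
  return { line, column: offset - lineStart }
}

const sampleSource = 'var a = 2;\nfoo()'
console.log(offsetToPosition(sampleSource, 4)) // { line: 1, column: 4 }
console.log(offsetToPosition(sampleSource, 11)) // { line: 2, column: 0 }
```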

With the basics covered, let's go through the node types one by one and review some JavaScript syntax along the way. Each is described only briefly rather than in depth (there is a lot of material), and anyone who knows JavaScript should find them easy to follow.

In effect, this amounts to a walk through the basics of JavaScript syntax.

Identifier

An identifier (I believe that is the right term) is a name we define when writing JS: variable names, function names, and property names all count as identifiers. The corresponding interface looks like this:

interface Identifier <: Expression, Pattern {
    type: "Identifier";
    name: string;
}

An identifier may be an expression or a destructuring pattern (the destructuring syntax in ES6). We will see Expression and Pattern later.

Literal

A literal here does not mean [] or {}, but a literal that semantically represents a single value, such as 1, "hello", true, and regular expressions such as /\d?/ (there is an extended node type for regular expressions). Here is the definition from the document:

interface Literal <: Expression {
    type: "Literal";
    value: string | boolean | null | number | RegExp;
}

value holds the value of the literal. As the type shows, a literal can be a string, a boolean, a number, null, or a regular expression.

RegExpLiteral

This node is for regular expression literals. To make the regular expression easier to work with, it adds a regex field that contains the pattern itself as well as its flags.

interface RegExpLiteral <: Literal {
  regex: {
    pattern: string;
    flags: string;
  };
}
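
As a small sketch of how the regex field can be used, the snippet below recognizes a regular expression literal node and rebuilds a RegExp from its pattern and flags. The node is hand-built to match the interface above, and isRegExpLiteral and toRegExp are hypothetical helper names.

```javascript
// Hand-built node shaped like the RegExpLiteral interface above (for /\d?/g).
const regexNode = {
  type: 'Literal',
  value: /\d?/g,
  regex: { pattern: '\\d?', flags: 'g' },
}

// A regular expression literal is a Literal node that carries a regex field.
function isRegExpLiteral(node) {
  return node.type === 'Literal' && typeof node.regex === 'object' && node.regex !== null
}

// Rebuild a RegExp from the pattern and flags strings.
function toRegExp(node) {
  return new RegExp(node.regex.pattern, node.regex.flags)
}

console.log(isRegExpLiteral(regexNode), toRegExp(regexNode).source)
```

Working from pattern and flags rather than value is handy when the AST has been serialized, since a RegExp object does not survive a round trip through JSON.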

Programs

This is normally the root node, representing the syntax tree of a complete program.

interface Program <: Node {
    type: "Program";
    body: [ Statement ];
}

The body property is an array containing multiple Statement nodes.

Functions

Function declaration or function expression node.

interface Function <: Node {
    id: Identifier | null;
    params: [ Pattern ];
    body: BlockStatement;
}

The id is the function name, and the params property is an array representing the parameters of the function. Body is a block statement.

Note that you will never find a node with type: "Function" when testing. You will find type: "FunctionDeclaration" and type: "FunctionExpression" instead, because a function appears either as a declaration or as an expression, and both of those node types are composed from several interfaces. FunctionDeclaration and FunctionExpression are covered later.

This shows how well the document is organized: the function name, parameters, and body belong to the function itself, while the declaration and expression forms each add their own requirements.

Statement

A statement node has nothing special about it; it is simply a category of node. There are many kinds of statements, which are described below.

interface Statement <: Node { }

ExpressionStatement

An expression statement node, for example a = a + 1 or a++. It has an expression property that refers to an expression node object (expressions are covered later).

interface ExpressionStatement <: Statement {
    type: "ExpressionStatement";
    expression: Expression;
}

BlockStatement

A block statement node, for example: if (…) { // the contents of the block }. A block can contain multiple other statements, so it has a body property, which is an array of the statements in the block.

interface BlockStatement <: Statement {
    type: "BlockStatement";
    body: [ Statement ];
}

EmptyStatement

An empty statement node, which executes no useful code, such as a lone semicolon ;

interface EmptyStatement <: Statement {
    type: "EmptyStatement";
}

DebuggerStatement

The debugger statement, exactly what the name says. Nothing else to it.

interface DebuggerStatement <: Statement {
    type: "DebuggerStatement";
}

WithStatement

The with statement node has two specific properties. object is the object (which can be an expression) that with operates on, and body is the statement executed inside the with, usually a block statement.

interface WithStatement <: Statement {
    type: "WithStatement";
    object: Expression;
    body: Statement;
}

Next are the control flow statements:

ReturnStatement

The return statement node. The argument property is an expression that represents what is returned.

interface ReturnStatement <: Statement {
    type: "ReturnStatement";
    argument: Expression | null;
}

LabeledStatement

The label statement, for example:

loop: for (let i = 0; i < len; i++) {
    // ...
    for (let j = 0; j < min; j++) {
        // ...
        break loop;
    }
}

Inside nested loops we can use break loop to specify which loop to break out of. The labeled statement is the loop: … part.

A label statement node has two attributes, a label attribute indicating the name of the label and a body attribute pointing to the corresponding statement, usually a loop or switch statement.

interface LabeledStatement <: Statement {
    type: "LabeledStatement";
    label: Identifier;
    body: Statement;
}

BreakStatement

The break statement node has a label attribute indicating the desired label name, or null when no label is needed (which is usually not required).

interface BreakStatement <: Statement {
    type: "BreakStatement";
    label: Identifier | null;
}

ContinueStatement

A continue statement node, similar to a break.

interface ContinueStatement <: Statement {
    type: "ContinueStatement";
    label: Identifier | null;
}

Here are the conditional statements:

IfStatement

The if statement node, which has three typical properties. The test property is the expression inside the parentheses of if (…).

The consequent property is the statement executed when the condition is true, usually a block statement.

The alternate property represents the else branch. It is usually a block statement, but it can also be another if statement node, as in if (a) { //… } else if (b) { // … }. alternate can of course be null.

interface IfStatement <: Statement {
    type: "IfStatement";
    test: Expression;
    consequent: Statement;
    alternate: Statement | null;
}
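
Because else if is simply an IfStatement stored in alternate, a chain of branches can be flattened by following alternate repeatedly. The sketch below does this on a hand-built AST; flattenIfChain and emptyBlock are made-up names for this illustration.

```javascript
// Hand-built AST for: if (a) {} else if (b) {} else {}
const emptyBlock = () => ({ type: 'BlockStatement', body: [] })
const ifNode = {
  type: 'IfStatement',
  test: { type: 'Identifier', name: 'a' },
  consequent: emptyBlock(),
  alternate: {
    type: 'IfStatement',
    test: { type: 'Identifier', name: 'b' },
    consequent: emptyBlock(),
    alternate: emptyBlock(),
  },
}

// Follow the alternate chain and collect each branch in source order.
function flattenIfChain(node) {
  const branches = []
  let current = node
  while (current && current.type === 'IfStatement') {
    branches.push({ test: current.test, body: current.consequent })
    current = current.alternate
  }
  // Whatever is left (if anything) is the final else branch.
  if (current) branches.push({ test: null, body: current })
  return branches
}

console.log(flattenIfChain(ifNode).map((b) => (b.test ? b.test.name : 'else')))
```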

SwitchStatement

The switch statement node has two properties. The discriminant property is the expression immediately following switch, usually a variable. The cases property is an array of case nodes, one for each case clause.

interface SwitchStatement <: Statement {
    type: "SwitchStatement";
    discriminant: Expression;
    cases: [ SwitchCase ];
}

SwitchCase

A case node of a switch statement. The test property is the expression the case is matched against, and consequent holds the statements executed for that case.

When the test property is null, it represents the default case node.

interface SwitchCase <: Node {
    type: "SwitchCase";
    test: Expression | null;
    consequent: [ Statement ];
}
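
The "test is null means default" rule can be checked with a one-line predicate. The node below is hand-built for illustration, and hasDefaultCase is a hypothetical helper name.

```javascript
// Hand-built node for: switch (x) { case 1: break; default: break; }
const switchNode = {
  type: 'SwitchStatement',
  discriminant: { type: 'Identifier', name: 'x' },
  cases: [
    {
      type: 'SwitchCase',
      test: { type: 'Literal', value: 1 },
      consequent: [{ type: 'BreakStatement', label: null }],
    },
    {
      type: 'SwitchCase',
      test: null, // null test marks the default case
      consequent: [{ type: 'BreakStatement', label: null }],
    },
  ],
}

// The default case is the one whose test is null.
function hasDefaultCase(node) {
  return node.cases.some((switchCase) => switchCase.test === null)
}

console.log(hasDefaultCase(switchNode))
```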

Here are the exception related statements:

ThrowStatement

The argument property is used to indicate the expression immediately following the throw.

interface ThrowStatement <: Statement {
    type: "ThrowStatement";
    argument: Expression;
}

TryStatement

The try statement node. Its block property is the statement executed by try, a block statement.

The handler property refers to the catch node, and finalizer refers to the finally block. If handler is null, finalizer must be a block statement node.

interface TryStatement <: Statement {
    type: "TryStatement";
    block: BlockStatement;
    handler: CatchClause | null;
    finalizer: BlockStatement | null;
}

CatchClause

The catch node. param is the parameter after catch, and body is the statement executed by the catch clause, a block statement.

interface CatchClause <: Node {
    type: "CatchClause";
    param: Pattern;
    body: BlockStatement;
}

Here are the loop statements:

WhileStatement

The while statement node, where test represents the expression in parentheses, and body represents the statement to loop through.

interface WhileStatement <: Statement {
    type: "WhileStatement";
    test: Expression;
    body: Statement;
}

DoWhileStatement

The do/while statement node, similar to the while statement.

interface DoWhileStatement <: Statement {
    type: "DoWhileStatement";
    body: Statement;
    test: Expression;
}

ForStatement

The for loop node. The init, test, and update properties are the three expressions inside the parentheses of the for statement: the initialization, the loop condition, and the update executed on each iteration (init can be a variable declaration or an expression). All three can be null, as in for (;;) {}. The body property is the statement to loop over.

interface ForStatement <: Statement {
    type: "ForStatement";
    init: VariableDeclaration | Expression | null;
    test: Expression | null;
    update: Expression | null;
    body: Statement;
}

ForInStatement

The for/in statement node. The left and right properties are the parts on either side of the in keyword (the left side can be a variable declaration or a pattern). body is again the statement to loop over.

interface ForInStatement <: Statement {
    type: "ForInStatement";
    left: VariableDeclaration | Pattern;
    right: Expression;
    body: Statement;
}

Declarations

Declaration statement nodes, which are also statements, are just refinements of a type. The various declaration statement types are described below.

interface Declaration <: Statement { }

FunctionDeclaration

A function declaration. Unlike the Function interface above, its id cannot be null.

interface FunctionDeclaration <: Function, Declaration {
    type: "FunctionDeclaration";
    id: Identifier;
}

VariableDeclaration

A variable declaration. The kind property indicates what kind of declaration it is; in ES5 it is always "var", and it exists because ES6 introduced const and let. The declarations property holds the individual declarators, since several can be declared at once, as in let a = 1, b = 2;.

interface VariableDeclaration <: Declaration {
    type: "VariableDeclaration";
    declarations: [ VariableDeclarator ];
    kind: "var";
}

VariableDeclarator

Description of a variable declaration, where id represents the variable name node and init represents an expression for the initial value, which can be null.

interface VariableDeclarator <: Node {
    type: "VariableDeclarator";
    id: Pattern;
    init: Expression | null;
}
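
To show how VariableDeclaration and VariableDeclarator fit together, here is a minimal sketch that lists the names introduced by a declaration. The node is hand-built, declaredNames is a made-up helper, and only plain Identifier ids are handled (ES6 destructuring patterns are skipped).

```javascript
// Hand-built node for: var a = 2, b;
const declaration = {
  type: 'VariableDeclaration',
  kind: 'var',
  declarations: [
    {
      type: 'VariableDeclarator',
      id: { type: 'Identifier', name: 'a' },
      init: { type: 'Literal', value: 2 },
    },
    {
      type: 'VariableDeclarator',
      id: { type: 'Identifier', name: 'b' },
      init: null, // no initial value
    },
  ],
}

// List each declared name and whether it has an initializer.
// Only plain Identifier ids are handled; destructuring patterns are skipped.
function declaredNames(node) {
  return node.declarations
    .filter((declarator) => declarator.id.type === 'Identifier')
    .map((declarator) => ({ name: declarator.id.name, initialized: declarator.init !== null }))
}

console.log(declaredNames(declaration))
```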

Expressions

Expression node.

interface Expression <: Node { }

ThisExpression

Represents this.

interface ThisExpression <: Expression {
    type: "ThisExpression";
}

ArrayExpression

Array expression node. The elements property is an array representing the elements of the array literal, each of which is an expression node (or null for a hole).

interface ArrayExpression <: Expression {
    type: "ArrayExpression";
    elements: [ Expression | null ];
}

ObjectExpression

Object expression node. The properties property is an array representing each key-value pair of the object; each element is a property node.

interface ObjectExpression <: Expression {
    type: "ObjectExpression";
    properties: [ Property ];
}

Property

Property node in an object expression. Key represents a key, value represents a value, and since ES5 syntax has get/set, there is a kind attribute that indicates a normal initialization, or get/set.

interface Property <: Node {
    type: "Property";
    key: Literal | Identifier;
    value: Expression;
    kind: "init" | "get" | "set";
}

FunctionExpression

Function expression node.

interface FunctionExpression <: Function, Expression {
    type: "FunctionExpression";
}

Here is the unary operator related expression section:

UnaryExpression

Unary expression node (++ and -- are update operators and do not belong to this category). operator is the operator, and prefix indicates whether it is a prefix operator. argument is the expression being operated on.

interface UnaryExpression <: Expression {
    type: "UnaryExpression";
    operator: UnaryOperator;
    prefix: boolean;
    argument: Expression;
}

UnaryOperator

Unary operator, enumeration type, all values as follows:

enum UnaryOperator {
    "-" | "+" | "!" | "~" | "typeof" | "void" | "delete"
}

UpdateExpression

The update expression node, that is, ++ and --. It is similar to the unary expression node, except that operator is an update operator.

interface UpdateExpression <: Expression {
    type: "UpdateExpression";
    operator: UpdateOperator;
    argument: Expression;
    prefix: boolean;
}

UpdateOperator

The update operator, with a value of ++ or --. Together with the prefix property of the update expression node, it indicates whether the operator comes before or after the operand.

enum UpdateOperator {
    "++" | "--"
}

Here is the part of the expression associated with binary operators:

BinaryExpression

Binary operation expression node, left and right represent two expressions left and right of the operator, and operator represents a binary operator.

interface BinaryExpression <: Expression {
    type: "BinaryExpression";
    operator: BinaryOperator;
    left: Expression;
    right: Expression;
}

BinaryOperator

Binary operator, all values are as follows:

enum BinaryOperator {
    "==" | "!=" | "===" | "!=="
         | "<" | "<=" | ">" | ">="
         | "<<" | ">>" | ">>>"
         | "+" | "-" | "*" | "/" | "%"
         | "|" | "^" | "&" | "in"
         | "instanceof"
}
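
As an illustration of how the operator, left, and right properties work together, the sketch below evaluates a BinaryExpression made of Literal nodes. It is a toy evaluator under stated assumptions: only a handful of operators are covered, the AST is hand-built, and binaryOps and evaluate are made-up names.

```javascript
// A few binary operators mapped to their JavaScript behaviour.
// This sketch deliberately covers only part of the BinaryOperator enum.
const binaryOps = new Map([
  ['+', (l, r) => l + r],
  ['-', (l, r) => l - r],
  ['*', (l, r) => l * r],
  ['/', (l, r) => l / r],
  ['%', (l, r) => l % r],
  ['===', (l, r) => l === r],
  ['<', (l, r) => l < r],
])

// Recursively evaluate Literal and BinaryExpression nodes.
function evaluate(node) {
  if (node.type === 'Literal') return node.value
  if (node.type === 'BinaryExpression' && binaryOps.has(node.operator)) {
    const apply = binaryOps.get(node.operator)
    return apply(evaluate(node.left), evaluate(node.right))
  }
  throw new Error('Unsupported node: ' + node.type)
}

// Hand-built AST for: 1 + 2 * 3
const sum = {
  type: 'BinaryExpression',
  operator: '+',
  left: { type: 'Literal', value: 1 },
  right: {
    type: 'BinaryExpression',
    operator: '*',
    left: { type: 'Literal', value: 2 },
    right: { type: 'Literal', value: 3 },
  },
}

console.log(evaluate(sum)) // 7
```

Note that operator precedence never appears in the evaluator: it is already encoded in the shape of the tree, with the multiplication nested under the right side of the addition.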

AssignmentExpression

Assignment expression node, the operator property represents an assignment operator, left and right are expressions around the assignment operator.

interface AssignmentExpression <: Expression {
    type: "AssignmentExpression";
    operator: AssignmentOperator;
    left: Pattern | Expression;
    right: Expression;
}

AssignmentOperator

Assignment operator, all values are as follows (only a few of them are in common use):

enum AssignmentOperator {
    "=" | "+=" | "-=" | "*=" | "/=" | "%="
        | "<<=" | ">>=" | ">>>="
        | "|=" | "^=" | "&="
}

LogicalExpression

A logical expression node. It is similar to the assignment and binary expression types, except that operator is a logical operator.

interface LogicalExpression <: Expression {
    type: "LogicalExpression";
    operator: LogicalOperator;
    left: Expression;
    right: Expression;
}

LogicalOperator

Logical operator, with two values: or (||) and and (&&).

enum LogicalOperator {
    "||" | "&&"
}

MemberExpression

A member expression node represents a reference to a member of an object. object is the expression node for the object being referenced, and property is the property name. computed indicates how the member is accessed: if false, dot notation (a.b) is used and property should be an Identifier node; if true, bracket notation (a[b]) is used and property is an Expression node whose resulting value is the property name.

interface MemberExpression <: Expression, Pattern {
    type: "MemberExpression";
    object: Expression;
    property: Expression;
    computed: boolean;
}
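
The effect of computed is easiest to see by printing a member expression back as text. The sketch below handles only a few node types, uses a hand-built AST, and memberToString is a hypothetical helper name.

```javascript
// Render a (possibly nested) MemberExpression as source-like text, using
// dot notation when computed is false and brackets when it is true.
function memberToString(node) {
  if (node.type === 'Identifier') return node.name
  if (node.type === 'ThisExpression') return 'this'
  if (node.type === 'Literal') return JSON.stringify(node.value)
  if (node.type === 'MemberExpression') {
    const object = memberToString(node.object)
    const property = memberToString(node.property)
    return node.computed ? object + '[' + property + ']' : object + '.' + property
  }
  throw new Error('Unsupported node: ' + node.type)
}

// Hand-built AST for: a.b["c"]
const member = {
  type: 'MemberExpression',
  computed: true,
  object: {
    type: 'MemberExpression',
    computed: false,
    object: { type: 'Identifier', name: 'a' },
    property: { type: 'Identifier', name: 'b' },
  },
  property: { type: 'Literal', value: 'c' },
}

console.log(memberToString(member)) // a.b["c"]
```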

Here are some other expressions:

ConditionalExpression

A conditional expression, commonly known as the ternary operator: test ? consequent : alternate. Its properties mirror those of the if statement.

interface ConditionalExpression <: Expression {
    type: "ConditionalExpression";
    test: Expression;
    alternate: Expression;
    consequent: Expression;
}

CallExpression

A function call expression, representing code such as func(1, 2). callee is an expression node representing the function being called, and arguments is an array of expression nodes representing the argument list.

interface CallExpression <: Expression {
    type: "CallExpression";
    callee: Expression;
    arguments: [ Expression ];
}

NewExpression

New expression.

interface NewExpression <: CallExpression {
    type: "NewExpression";
}

SequenceExpression

This is the expression built with the comma operator (I am not sure what it is usually called). The expressions property is an array of the expressions that make up the whole expression, separated by commas.

interface SequenceExpression <: Expression {
    type: "SequenceExpression";
    expressions: [ Expression ];
}

Patterns

Patterns are mainly meaningful for ES6 destructuring assignment; in ES5 they can be understood as something similar to identifiers.

interface Pattern <: Node { }

That was a lot of material, and writing it up amounted to reviewing JavaScript syntax once more. The spec also has sections for ES2015, ES2016, and ES2017, which cover quite a lot as well. Once you understand the above and think about those documents from the point of view of syntax, the rest is easy to follow, so I won't list it here. If you need it, see The Estree Spec.

Plugins

Back to our protagonist: Acorn provides an extension mechanism for writing plugins, described in Acorn Plugins.

Plugins let us extend the parser to handle additional syntax, such as JSX. If you are interested, have a look at the acorn-jsx plugin.

Acorn plugins are meant for extending the parser, but writing one requires a good understanding of Acorn's internals, because a plugin redefines some of the parser's existing methods. I won't go into details here; I'll write about it separately if I ever need to build a plugin.

Examples

Now let's look at how to use the parser. Suppose we need to find out which modules a CommonJS module depends on. We can use Acorn to find the calls to require and then extract the arguments passed in to obtain the dependencies.

Here is the sample code:

const acorn = require('acorn')

// Visit every node in the tree, calling callback on each one
function walkNode(node, callback) {
  callback(node)

  // Any object with a type field is treated as a node
  Object.keys(node).forEach((key) => {
    const item = node[key]
    if (Array.isArray(item)) {
      item.forEach((sub) => {
        // Array entries can be null (e.g. holes in an array literal)
        sub && sub.type && walkNode(sub, callback)
      })
    }

    item && item.type && walkNode(item, callback)
  })
}

function parseDependencies(str) {
  // ranges: true adds a [start, end] range array to every node
  const ast = acorn.parse(str, { ranges: true })
  const resource = [] // Dependency list

  // Start from the root node
  walkNode(ast, (node) => {
    const callee = node.callee
    const args = node.arguments

    // We treat require as a function call named require with exactly one
    // argument, which must be a literal
    if (
      node.type === 'CallExpression' &&
      callee.type === 'Identifier' &&
      callee.name === 'require' &&
      args.length === 1 &&
      args[0].type === 'Literal'
    ) {
      // Record information about the dependency
      resource.push({
        string: str.substring(node.range[0], node.range[1]),
        path: args[0].value,
        start: node.range[0],
        end: node.range[1]
      })
    }
  })

  return resource
}

This is only a simple case, but it shows how the parser can be used. Webpack does much more on top of this, handling forms such as var r = require; r('a') or require.async('a').

ASTs are something front-end developers benefit from all the time (module bundling, code minification, code obfuscation), so it is worth knowing a bit about them.

Questions and discussion are welcome.