Crawler blocked by JS? A detailed summary of AST syntax from an anti-scraping expert. Worth bookmarking!

First of all, why do we need to understand the AST?

JavaScript is widely regarded as the hardest part of getting past anti-crawler measures. Reverse-engineering protected JS requires a working knowledge of AST syntax. Many commonly used front-end plug-ins and tools, such as JavaScript transpilers, code minifiers, CSS preprocessors, ESLint, and Prettier, are built on top of an AST.

JavaScript compilation and execution process

The first step in executing JS is to read the character stream of the JS file. Lexical analysis then turns the characters into tokens, syntax analysis turns the tokens into an AST (Abstract Syntax Tree), and finally machine code is generated and executed.

Lexical analysis

Lexical analysis is also called scanning. In simple terms, the scanner calls a next() method to read characters one by one and compares them against the keywords and symbols defined by JavaScript to produce the corresponding tokens. A token is the smallest indivisible unit: the three characters var, for example, can only be treated as a whole and cannot be decomposed any further semantically, so they form one token. In lexical analysis, each keyword, identifier, operator, and punctuation mark is a token. Comments and whitespace characters (newlines, spaces, tabs, and so on) in the source program are filtered out. In the end, the entire code is split into a list of tokens (a one-dimensional array).

n * n;

[
  { type: { ... }, value: "n",  loc: { ... } },
  { type: { ... }, value: "*",  loc: { ... } },
  { type: { ... }, value: "n",  loc: { ... } },
  ...
]
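
The scanning step described above can be sketched in a few lines. This is a simplified illustration, not how a real engine is implemented: the tokenize function, the token labels ("name", "num", "punct"), and the token shape are assumptions made for this example, and only identifiers, integers, and a few single-character symbols are recognized.

```javascript
// Minimal tokenizer sketch: splits source text into a flat list of tokens.
// The token shape ({ type, value, start, end }) is hypothetical.
function tokenize(source) {
  const tokens = [];
  // One alternative per token class; whitespace is matched so it can be skipped.
  const pattern = /\s+|([A-Za-z_$][\w$]*)|(\d+)|([*+\-\/=;(){}])/y;
  let pos = 0;
  while (pos < source.length) {
    pattern.lastIndex = pos;
    const match = pattern.exec(source);
    if (match === null) {
      throw new SyntaxError(`Unexpected character "${source[pos]}" at ${pos}`);
    }
    const end = pattern.lastIndex;
    if (match[1] !== undefined) tokens.push({ type: "name", value: match[1], start: pos, end });
    else if (match[2] !== undefined) tokens.push({ type: "num", value: match[2], start: pos, end });
    else if (match[3] !== undefined) tokens.push({ type: "punct", value: match[3], start: pos, end });
    // Otherwise the match was whitespace, which is dropped.
    pos = end;
  }
  return tokens;
}

const tokens = tokenize("n * n;");
console.log(tokens.map((t) => t.value)); // [ 'n', '*', 'n', ';' ]
```

Whitespace never reaches the token list, which matches the filtering described above.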

Each type has a set of attributes that describe the token:

{
  type: {
    label: 'name',
    keyword: undefined,
    beforeExpr: false,
    startsExpr: true,
    rightAssociative: false,
    isLoop: false,
    isAssign: false,
    prefix: false,
    postfix: false,
    binop: null,
    updateContext: null
  },
  ...
}

Syntax analysis

Syntax analysis transforms the tokens produced by lexical analysis into an abstract syntax tree, a structure that carries grammatical meaning. It also validates the syntax and throws a syntax error if the code is invalid.
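
As a rough sketch of this step, the parser below handles only a tiny grammar (names joined by *) and turns a token list into nodes. The function name parseProduct and the simplified token shape ({ type, value }) are assumptions for illustration; real parsers cover the full language.

```javascript
// Parser sketch for the tiny grammar:  product := name ("*" name)*
// The token shape ({ type, value }) is a simplification assumed here.
function parseProduct(tokens) {
  let pos = 0;

  function parseName() {
    const token = tokens[pos];
    if (token === undefined || token.type !== "name") {
      throw new SyntaxError(`Expected an identifier at token ${pos}`);
    }
    pos += 1;
    return { type: "Identifier", name: token.value };
  }

  let left = parseName();
  while (pos < tokens.length && tokens[pos].value === "*") {
    pos += 1; // consume "*"
    const right = parseName();
    // Left-associative: a * b * c becomes (a * b) * c.
    left = { type: "BinaryExpression", operator: "*", left, right };
  }
  if (pos < tokens.length) {
    throw new SyntaxError(`Unexpected token "${tokens[pos].value}"`);
  }
  return left;
}

const ast = parseProduct([
  { type: "name", value: "n" },
  { type: "punct", value: "*" },
  { type: "name", value: "n" },
]);
console.log(JSON.stringify(ast, null, 2));
```

Passing an incomplete token list such as n * makes parseProduct throw a SyntaxError, which is the validation role mentioned above.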

What is an AST (Abstract Syntax Tree)

An Abstract Syntax Tree (AST), or syntax tree for short, is an abstract representation of the syntactic structure of source code. It represents that structure as a tree, with each node in the tree corresponding to a construct in the source code.

function square(n) {
  return n * n;
}

This code forms a tree:

FunctionDeclaration:
   id:
     Identifier:
       name: square
   params [1]
     Identifier
       name: n
   body:
     BlockStatement
       body [1]
         ReturnStatement
           argument
             BinaryExpression
               operator: *
               left
                 Identifier
                   name: n
               right
                 Identifier
                   name: n

Or a JavaScript Object like this:

{
  type: "FunctionDeclaration",
  id: {
    type: "Identifier",
    name: "square"
  },
  params: [{
    type: "Identifier",
    name: "n"
  }],
  body: {
    type: "BlockStatement",
    body: [{
      type: "ReturnStatement",
      argument: {
        type: "BinaryExpression",
        operator: "*",
        left: {
          type: "Identifier",
          name: "n"
        },
        right: {
          type: "Identifier",
          name: "n"
        }
      }
    }]
  }
}

You’ll notice that each layer of the AST has the same structure:

{
  type: "FunctionDeclaration",
  id: {...},
  params: [...],
  body: {...}
}

{
  type: "Identifier",
  name: ...
}

{
  type: "BinaryExpression",
  operator: ...,
  left: {...},
  right: {...}
}

Each of these layers is also called a Node. An AST can consist of a single node or hundreds or thousands of nodes. Together, they describe program syntax for static analysis.

Each node has the following Interface:

interface Node {
  type: string;
}

The type field is a string representing the type of the node (for example, “FunctionDeclaration”, “Identifier”, or “BinaryExpression”). Each type of node defines additional attributes that further describe the node type.
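
Because every node carries a type field, a generic traversal needs no knowledge of individual node types. The following is a minimal sketch; the walk helper is a name chosen for this example and not part of any library.

```javascript
// Generic depth-first walk: any object with a string `type` is treated as a node.
function walk(node, visit) {
  if (node === null || typeof node !== "object") return;
  if (Array.isArray(node)) {
    node.forEach((child) => walk(child, visit));
    return;
  }
  if (typeof node.type === "string") visit(node);
  for (const key of Object.keys(node)) {
    walk(node[key], visit);
  }
}

// The AST of `function square(n) { return n * n; }` shown earlier.
const squareAst = {
  type: "FunctionDeclaration",
  id: { type: "Identifier", name: "square" },
  params: [{ type: "Identifier", name: "n" }],
  body: {
    type: "BlockStatement",
    body: [{
      type: "ReturnStatement",
      argument: {
        type: "BinaryExpression",
        operator: "*",
        left: { type: "Identifier", name: "n" },
        right: { type: "Identifier", name: "n" },
      },
    }],
  },
};

const seen = [];
walk(squareAst, (node) => seen.push(node.type));
console.log(seen.length); // 8
```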

AST Node Introduction

Identifier

Identifiers are the names we define when writing JS, such as variable names, function names, and property names. The corresponding interface looks like this:

interface Identifier <: Expression, Pattern {
    type: "Identifier";
    name: string;
}

An identifier may be an expression or a destructuring pattern (the destructuring syntax in ES6). Expression and Pattern come up again later.

Literal

A literal here does not mean [] or {}, but a literal that semantically represents a single value, such as 1, "hello", true, or a regular expression such as /\d?/ (regular expressions are represented by an extended node type). Here is the definition from the specification:

interface Literal <: Expression {
    type: "Literal";
    value: string | boolean | null | number | RegExp;
}

RegExpLiteral

This node represents regular expression literals. To make the regular expression easier to work with, it adds a regex field that holds the pattern itself and its flags.

interface RegExpLiteral <: Literal {
  regex: {
    pattern: string;
    flags: string;
  };
}
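
Since the regex field stores the pattern and flags as plain strings, a RegExp object can be rebuilt from them. A small sketch, assuming a hand-written node (the toRegExp helper is hypothetical):

```javascript
// Rebuild a RegExp object from the `regex` field of a RegExpLiteral node.
function toRegExp(node) {
  if (node.type !== "Literal" || node.regex === undefined) {
    throw new TypeError("Expected a RegExpLiteral node");
  }
  return new RegExp(node.regex.pattern, node.regex.flags);
}

// Hand-written node for the source text /\d+/g.
// `value` is set to null here for illustration; only `regex` is used.
const regexNode = {
  type: "Literal",
  value: null,
  regex: { pattern: "\\d+", flags: "g" },
};

const re = toRegExp(regexNode);
console.log("a1b22".match(re)); // [ '1', '22' ]
```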

Programs

This is usually the root node, representing the code tree of a complete program.

interface Program <: Node {
    type: "Program";
    body: [ Statement ];
}

The body property is an array containing multiple Statement nodes.

Functions

Function declaration or function expression node.

interface Function <: Node {
    id: Identifier | null;
    params: [ Pattern ];
    body: BlockStatement;
}

The id is the function name, and params is an array representing the function's parameters. body is a block statement. Note that you will never find a node with type: "Function" when testing, only type: "FunctionDeclaration" and type: "FunctionExpression". A function appears either as a declaration or as an expression, and each of those node types is a combination of Function with another type. FunctionDeclaration and FunctionExpression are covered later. The specification is laid out this way on purpose: the function name, parameters, and body belong to the function itself, while declarations and expressions add their own requirements on top.

Statement

A statement node has no properties of its own; it is a base type that other nodes extend. There are many kinds of statements, described below.

interface Statement <: Node { }

ExpressionStatement

An expression statement node, such as a = a + 1 or a++. Its expression property refers to an expression node (expressions are covered later).

interface ExpressionStatement <: Statement {
    type: "ExpressionStatement";
    expression: Expression;
}

BlockStatement

A block statement node, for example: if (...) { /* the contents of a block */ }. A block can contain multiple statements, so it has a body property, an array of the statements in the block.

interface BlockStatement <: Statement {
    type: "BlockStatement";
    body: [ Statement ];
}

EmptyStatement

An empty statement node, which executes no code at all, such as a lone semicolon (;).

interface EmptyStatement <: Statement {
    type: "EmptyStatement";
}

DebuggerStatement

The debugger statement. It has no properties beyond type.

interface DebuggerStatement <: Statement {
    type: "DebuggerStatement";
}
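
Anti-debugging code often relies on debugger statements, so stripping them is a common first transformation. The sketch below is one possible approach on plain ESTree-shaped objects; removeDebugger is a name chosen for this example, and it only handles debugger statements that sit in a list of statements (for instance, not the single-statement body of a while loop).

```javascript
// Recursively drop DebuggerStatement nodes from every array in the tree.
// Mutates the tree in place and returns it.
function removeDebugger(node) {
  if (node === null || typeof node !== "object") return node;
  for (const key of Object.keys(node)) {
    const child = node[key];
    if (Array.isArray(child)) {
      node[key] = child.filter((item) => !(item && item.type === "DebuggerStatement"));
      node[key].forEach(removeDebugger);
    } else {
      removeDebugger(child);
    }
  }
  return node;
}

// Hand-written AST for: debugger; x; { debugger; }
const program = {
  type: "Program",
  body: [
    { type: "DebuggerStatement" },
    { type: "ExpressionStatement", expression: { type: "Identifier", name: "x" } },
    { type: "BlockStatement", body: [{ type: "DebuggerStatement" }] },
  ],
};

removeDebugger(program);
console.log(program.body.map((s) => s.type)); // [ 'ExpressionStatement', 'BlockStatement' ]
```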

WithStatement

The with statement node has two properties: object is the object (which can be any expression) that with operates on, and body is the statement executed inside the with, usually a block statement.

interface WithStatement <: Statement {
    type: "WithStatement";
    object: Expression;
    body: Statement;
}

Here are the control flow statements:

ReturnStatement

The return statement node. The argument property is an expression that represents what is returned.

interface ReturnStatement <: Statement {
    type: "ReturnStatement";
    argument: Expression | null;
}

LabeledStatement

The label statement, for example:

loop: for (let i = 0; i < len; i++) {
  // ...
  for (let j = 0; j < min; j++) {
    // ...
    break loop;
  }
}

A break loop statement inside nested loops specifies which loop to break out of. The labeled statement here is the whole loop: ... construct. A labeled statement node has two properties: label holds the name of the label, and body points to the labeled statement, usually a loop or switch statement.

interface LabeledStatement <: Statement {
    type: "LabeledStatement";
    label: Identifier;
    body: Statement;
}

BreakStatement

The break statement node has a label property holding the name of the target label, or null when no label is given (the usual case).

interface BreakStatement <: Statement {
    type: "BreakStatement";
    label: Identifier | null;
}

ContinueStatement

A continue statement node, similar to a break.

interface ContinueStatement <: Statement {
    type: "ContinueStatement";
    label: Identifier | null;
}

Here are the conditional statements:

IfStatement

An if statement node has three properties. test is the expression in the parentheses of if (...). consequent is the statement executed when the condition is true, usually a block statement. alternate represents the else branch; it is usually a block statement, but it can also be another if statement node, as in if (a) { // ... } else if (b) { // ... }. alternate is null when there is no else branch.

interface IfStatement <: Statement {
    type: "IfStatement";
    test: Expression;
    consequent: Statement;
    alternate: Statement | null;
}
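
Obfuscated code sometimes contains if statements whose test is a constant. As an illustration of how the three properties are used, the sketch below replaces such a statement with the only branch that can run. simplifyIf is a hypothetical helper, and it only recognizes boolean Literal tests.

```javascript
// If the test is a boolean Literal, only one branch can ever run,
// so the whole IfStatement can be replaced by that branch.
function simplifyIf(node) {
  if (node.type !== "IfStatement") return node;
  const { test } = node;
  if (test.type !== "Literal" || typeof test.value !== "boolean") return node;
  if (test.value) return node.consequent;
  // A false test without an else branch runs nothing: use an EmptyStatement.
  return node.alternate !== null ? node.alternate : { type: "EmptyStatement" };
}

// Hand-written AST for: if (false) {} else fallback;
const ifNode = {
  type: "IfStatement",
  test: { type: "Literal", value: false },
  consequent: { type: "BlockStatement", body: [] },
  alternate: {
    type: "ExpressionStatement",
    expression: { type: "Identifier", name: "fallback" },
  },
};

console.log(simplifyIf(ifNode).type); // ExpressionStatement
```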

SwitchStatement

A switch statement node has two properties. discriminant is the expression immediately following the switch keyword, usually a variable. cases is an array of SwitchCase nodes, one per case clause.

interface SwitchStatement <: Statement {
    type: "SwitchStatement";
    discriminant: Expression;
    cases: [ SwitchCase ];
}

SwitchCase

A case clause of a switch. test is the expression the case is compared against, and consequent holds the statements executed for that case. When test is null, the node represents the default clause.

interface SwitchCase <: Node {
    type: "SwitchCase";
    test: Expression | null;
    consequent: [ Statement ];
}

Here are the exception related statements:

ThrowStatement

The throw statement node. The argument property is the expression immediately following throw.

interface ThrowStatement <: Statement {
    type: "ThrowStatement";
    argument: Expression;
}

TryStatement

A try statement node. block is the body of the try, a block statement. handler refers to the catch clause, and finalizer refers to the finally block. If handler is null, finalizer must be a block statement node.

interface TryStatement <: Statement {
    type: "TryStatement";
    block: BlockStatement;
    handler: CatchClause | null;
    finalizer: BlockStatement | null;
}

CatchClause

The catch clause node. param is the parameter after catch, and body is the block statement executed when an exception is caught.

interface CatchClause <: Node {
    type: "CatchClause";
    param: Pattern;
    body: BlockStatement;
}

Here are the loop statements:

WhileStatement

The while statement node, where test represents the expression in parentheses, and body represents the statement to loop through.

interface WhileStatement <: Statement {
    type: "WhileStatement";
    test: Expression;
    body: Statement;
}

DoWhileStatement

The do/while statement node, similar to the while statement.

interface DoWhileStatement <: Statement {
    type: "DoWhileStatement";
    body: Statement;
    test: Expression;
}

ForStatement

The for loop node. init, test, and update are the three expressions in the parentheses of the for statement: the initialization, the loop condition, and the update executed after each iteration (init can be a variable declaration or an expression). All three can be null, as in for (;;) {}. body is the statement to execute on each iteration.

interface ForStatement <: Statement {
    type: "ForStatement";
    init: VariableDeclaration | Expression | null;
    test: Expression | null;
    update: Expression | null;
    body: Statement;
}

ForInStatement

The for/in statement node. left and right are the parts on either side of the in keyword (the left side can be a variable declaration or a pattern). body is the statement to execute on each iteration.

interface ForInStatement <: Statement {
    type: "ForInStatement";
    left: VariableDeclaration |  Pattern;
    right: Expression;
    body: Statement;
}

Declarations

Declaration nodes are also statements; Declaration is just a more specific type. The various declaration types are described below.

interface Declaration <: Statement { }

FunctionDeclaration

A function declaration. Unlike the generic Function node above, its id cannot be null.

interface FunctionDeclaration <: Function, Declaration {
    type: "FunctionDeclaration";
    id: Identifier;
}

VariableDeclaration

A variable declaration. The kind property indicates which keyword introduced the declaration: var, or let/const since ES6. declarations is an array with one entry per declared variable, since a single statement can declare several: let a = 1, b = 2;.

interface VariableDeclaration <: Declaration {
    type: "VariableDeclaration";
    declarations: [ VariableDeclarator ];
    kind: "var" | "let" | "const";
}

VariableDeclarator

Describes a single variable in a declaration. id is the node for the variable name, and init is the expression for the initial value, which can be null.

interface VariableDeclarator <: Node {
    type: "VariableDeclarator";
    id: Pattern;
    init: Expression | null;
}
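
To show how VariableDeclaration and VariableDeclarator fit together, here is a small sketch that lists the names a declaration introduces. declaredNames is a hypothetical helper; it only handles plain Identifier ids and skips destructuring patterns.

```javascript
// List the names bound by a VariableDeclaration.
// Only plain Identifier ids are handled; destructuring patterns are skipped.
function declaredNames(declaration) {
  return declaration.declarations
    .filter((declarator) => declarator.id.type === "Identifier")
    .map((declarator) => declarator.id.name);
}

// Hand-written AST for: let a = 1, b;
const decl = {
  type: "VariableDeclaration",
  kind: "let",
  declarations: [
    {
      type: "VariableDeclarator",
      id: { type: "Identifier", name: "a" },
      init: { type: "Literal", value: 1 },
    },
    {
      type: "VariableDeclarator",
      id: { type: "Identifier", name: "b" },
      init: null, // no initializer
    },
  ],
};

console.log(decl.kind, declaredNames(decl)); // let [ 'a', 'b' ]
```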

Expressions

Expression node.

interface Expression <: Node { }

ThisExpression

Represents the this keyword.

interface ThisExpression <: Expression {
    type: "ThisExpression";
}

ArrayExpression

An array expression node. elements is an array of the array's elements, each of which is an expression node, or null for a hole as in [1, , 2].

interface ArrayExpression <: Expression {
    type: "ArrayExpression";
    elements: [ Expression | null ];
}

ObjectExpression

An object expression node. The properties property is an array with one Property node per key-value pair of the object.

interface ObjectExpression <: Expression {
    type: "ObjectExpression";
    properties: [ Property ];
}

Property

A property node in an object expression. key is the key and value is the value. Because ES5 added getters and setters, there is also a kind property indicating whether the property is an ordinary initializer (init) or a get/set accessor.

interface Property <: Node {
    type: "Property";
    key: Literal | Identifier;
    value: Expression;
    kind: "init" | "get" | "set";
}
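
Obfuscated scripts often keep constants in an object literal, so reading one back out of the AST is a handy exercise. The following sketch converts an ObjectExpression into a plain object; objectFromAst is a name made up for this example, and it deliberately ignores accessors and non-literal values.

```javascript
// Convert an ObjectExpression into a plain object, keeping only ordinary
// ("init") properties whose value is a Literal.
function objectFromAst(node) {
  const result = {};
  for (const property of node.properties) {
    if (property.kind !== "init" || property.value.type !== "Literal") continue;
    // The key is either an Identifier (a: 1) or a Literal ("b-c": "x").
    const key =
      property.key.type === "Identifier" ? property.key.name : String(property.key.value);
    result[key] = property.value.value;
  }
  return result;
}

// Hand-written AST for: { a: 1, "b-c": "x", get d() {} }
const objectAst = {
  type: "ObjectExpression",
  properties: [
    { type: "Property", kind: "init", key: { type: "Identifier", name: "a" }, value: { type: "Literal", value: 1 } },
    { type: "Property", kind: "init", key: { type: "Literal", value: "b-c" }, value: { type: "Literal", value: "x" } },
    {
      type: "Property",
      kind: "get",
      key: { type: "Identifier", name: "d" },
      value: { type: "FunctionExpression", id: null, params: [], body: { type: "BlockStatement", body: [] } },
    },
  ],
};

console.log(objectFromAst(objectAst)); // { a: 1, 'b-c': 'x' }
```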

FunctionExpression

Function expression node.

interface FunctionExpression <: Function, Expression {
    type: "FunctionExpression";
}

Below is the expression section related to unary operators

UnaryExpression

A unary expression node (++ and -- are update operators and do not belong to this category). operator is the operator, prefix indicates whether it is a prefix operator, and argument is the expression the operator is applied to.

interface UnaryExpression <: Expression {
    type: "UnaryExpression";
    operator: UnaryOperator;
    prefix: boolean;
    argument: Expression;
}

UnaryOperator

Unary operator, enumeration type, all values as follows:

enum UnaryOperator {
    "-" | "+" | "!" | "~" | "typeof" | "void" | "delete"
}
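
Minified code commonly writes !0 and !1 instead of true and false, which makes unary expressions a natural target for constant folding. The sketch below folds only the ! and typeof operators applied to a Literal; foldUnary is a hypothetical helper, and every other operator is returned unchanged.

```javascript
// Fold a UnaryExpression whose argument is a Literal into a single Literal.
function foldUnary(node) {
  if (node.type !== "UnaryExpression" || node.argument.type !== "Literal") return node;
  const { value } = node.argument;
  switch (node.operator) {
    case "!":
      return { type: "Literal", value: !value };
    case "typeof":
      return { type: "Literal", value: typeof value };
    default:
      return node; // other operators are left untouched in this sketch
  }
}

// Hand-written AST for: !0
const notZero = {
  type: "UnaryExpression",
  operator: "!",
  prefix: true,
  argument: { type: "Literal", value: 0 },
};

console.log(foldUnary(notZero)); // { type: 'Literal', value: true }
```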

UpdateExpression

The update expression node, for ++ and --. It is similar to a unary expression, except that operator is an update operator.

interface UpdateExpression <: Expression {
    type: "UpdateExpression";
    operator: UpdateOperator;
    argument: Expression;
    prefix: boolean;
}

UpdateOperator

The update operator, either ++ or --. It is combined with the prefix property of the update expression node to distinguish the prefix form from the postfix form.

enum UpdateOperator {
    "++" | "--"
}

Here is the part of the expression associated with binary operators:

BinaryExpression

A binary expression node. left and right are the expressions on either side of the operator, and operator is a binary operator.

interface BinaryExpression <: Expression {
    type: "BinaryExpression";
    operator: BinaryOperator;
    left: Expression;
    right: Expression;
}

BinaryOperator

Binary operator, all values are as follows:

enum BinaryOperator {
    "==" | "!=" | "===" | "!=="
         | "<" | "<=" | ">" | ">="
         | "<<" | ">>" | ">>>"
         | "+" | "-" | "*" | "/" | "%"
         | "|" | "^" | "&" | "in"
         | "instanceof"
}
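
Binary expressions with literal operands can be evaluated ahead of time, a technique often used when cleaning up obfuscated arithmetic. This is a minimal sketch covering only a handful of operators; foldBinary and BINARY_OPS are names chosen for the example. Note that a negative result is stored directly in a Literal here, which is a simplification.

```javascript
// Operators this sketch knows how to evaluate; anything else is left as-is.
const BINARY_OPS = new Map([
  ["+", (a, b) => a + b],
  ["-", (a, b) => a - b],
  ["*", (a, b) => a * b],
  ["===", (a, b) => a === b],
  ["<", (a, b) => a < b],
]);

// Fold bottom-up: operands are folded first, then the node itself.
function foldBinary(node) {
  if (node.type !== "BinaryExpression") return node;
  const left = foldBinary(node.left);
  const right = foldBinary(node.right);
  const apply = BINARY_OPS.get(node.operator);
  if (apply !== undefined && left.type === "Literal" && right.type === "Literal") {
    return { type: "Literal", value: apply(left.value, right.value) };
  }
  return { ...node, left, right };
}

// Hand-written AST for: 1 + 2 * 3
const sum = {
  type: "BinaryExpression",
  operator: "+",
  left: { type: "Literal", value: 1 },
  right: {
    type: "BinaryExpression",
    operator: "*",
    left: { type: "Literal", value: 2 },
    right: { type: "Literal", value: 3 },
  },
};

console.log(foldBinary(sum)); // { type: 'Literal', value: 7 }
```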

AssignmentExpression

Assignment expression node, the operator property represents an assignment operator, left and right are expressions around the assignment operator.

interface AssignmentExpression <: Expression {
    type: "AssignmentExpression";
    operator: AssignmentOperator;
    left: Pattern | Expression;
    right: Expression;
}

AssignmentOperator

Assignment operators. All values are listed below (only a few are commonly used):

enum AssignmentOperator {
    "=" | "+=" | "-=" | "*=" | "/=" | "%="
        | "<<=" | ">>=" | ">>>="
        | "|=" | "^=" | "&="
}

LogicalExpression

A logical expression node. It has the same shape as an assignment or binary expression, except that operator is a logical operator.

interface LogicalExpression <: Expression {
    type: "LogicalExpression";
    operator: LogicalOperator;
    left: Expression;
    right: Expression;
}

LogicalOperator

The logical operator, with two values: || (or) and && (and).

enum LogicalOperator {
    "||" | "&&"
}

MemberExpression

A member expression node represents a reference to a member of an object. object is the expression node for the object being referenced, and property is the member being accessed. If computed is false, the access uses dot notation (a.b) and property must be an Identifier node. If computed is true, the access uses bracket notation (a[b]) and property is an Expression node whose evaluated result is the member name.

interface MemberExpression <: Expression, Pattern {
    type: "MemberExpression";
    object: Expression;
    property: Expression;
    computed: boolean;
}
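
Obfuscators like to turn window.document into window["document"]. The computed flag makes the reverse transformation straightforward, as this sketch suggests. toDotAccess and the IDENTIFIER pattern are assumptions for illustration; the pattern accepts ASCII identifier names only, which is narrower than what JavaScript allows.

```javascript
// Names that are safe to write after a dot (ASCII-only check, by assumption).
const IDENTIFIER = /^[A-Za-z_$][A-Za-z0-9_$]*$/;

// Rewrite a["b"] as a.b when the string is a valid identifier name.
function toDotAccess(node) {
  if (
    node.type === "MemberExpression" &&
    node.computed &&
    node.property.type === "Literal" &&
    typeof node.property.value === "string" &&
    IDENTIFIER.test(node.property.value)
  ) {
    return {
      ...node,
      computed: false,
      property: { type: "Identifier", name: node.property.value },
    };
  }
  return node;
}

// Hand-written AST for: window["document"]
const member = {
  type: "MemberExpression",
  object: { type: "Identifier", name: "window" },
  property: { type: "Literal", value: "document" },
  computed: true,
};

const dotted = toDotAccess(member);
console.log(dotted.computed, dotted.property); // false { type: 'Identifier', name: 'document' }
```

A key such as "b-c" fails the identifier check, so a["b-c"] is returned unchanged.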

Here are some other expressions:

ConditionalExpression

A conditional expression, commonly called the ternary expression: test ? consequent : alternate. The properties have the same meaning as in the if statement.

interface ConditionalExpression <: Expression {
    type: "ConditionalExpression";
    test: Expression;
    alternate: Expression;
    consequent: Expression;
}

CallExpression

A function call expression, such as func(1, 2). callee is the expression being called, and arguments is an array of expression nodes representing the argument list.

interface CallExpression <: Expression {
    type: "CallExpression";
    callee: Expression;
    arguments: [ Expression ];
}
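
When analyzing a protected script, a typical first question is where a particular function is called. The sketch below collects call expressions by callee name; findCalls is a hypothetical helper, decode is a made-up function name in the sample tree, and only callees that are plain identifiers are matched.

```javascript
// Collect every CallExpression whose callee is an Identifier with the given name.
function findCalls(node, name, found = []) {
  if (node === null || typeof node !== "object") return found;
  if (
    node.type === "CallExpression" &&
    node.callee.type === "Identifier" &&
    node.callee.name === name
  ) {
    found.push(node);
  }
  // Arrays and nested nodes are both covered by iterating over the keys.
  for (const key of Object.keys(node)) findCalls(node[key], name, found);
  return found;
}

// Hand-written AST for: eval(decode("abc"));
const callAst = {
  type: "ExpressionStatement",
  expression: {
    type: "CallExpression",
    callee: { type: "Identifier", name: "eval" },
    arguments: [
      {
        type: "CallExpression",
        callee: { type: "Identifier", name: "decode" },
        arguments: [{ type: "Literal", value: "abc" }],
      },
    ],
  },
};

const evalCalls = findCalls(callAst, "eval");
console.log(evalCalls.length, evalCalls[0].arguments.length); // 1 1
```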

NewExpression

A new expression, such as new Foo(1, 2). It has the same properties as a call expression.

interface NewExpression <: CallExpression {
    type: "NewExpression";
}

SequenceExpression

A sequence expression is an expression built with the comma operator. The expressions property is an array of the comma-separated expressions that make up the whole expression.

interface SequenceExpression <: Expression {
    type: "SequenceExpression";
    expressions: [ Expression ];
}
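
Minifiers and obfuscators often chain statements together with commas. At statement level the value of the sequence is discarded, so it can be expanded into separate statements. A minimal sketch under that assumption (splitSequences is a name chosen for this example):

```javascript
// Expand statement-level sequences: `a = 1, b = 2;` becomes `a = 1; b = 2;`.
function splitSequences(statements) {
  return statements.flatMap((statement) => {
    if (
      statement.type === "ExpressionStatement" &&
      statement.expression.type === "SequenceExpression"
    ) {
      return statement.expression.expressions.map((expression) => ({
        type: "ExpressionStatement",
        expression,
      }));
    }
    return [statement];
  });
}

// Small helper to build `name = value` for the sample below.
const assign = (name, value) => ({
  type: "AssignmentExpression",
  operator: "=",
  left: { type: "Identifier", name },
  right: { type: "Literal", value },
});

// Hand-written body for: a = 1, b = 2; ;
const body = [
  {
    type: "ExpressionStatement",
    expression: { type: "SequenceExpression", expressions: [assign("a", 1), assign("b", 2)] },
  },
  { type: "EmptyStatement" },
];

const expanded = splitSequences(body);
console.log(expanded.map((s) => s.type));
// [ 'ExpressionStatement', 'ExpressionStatement', 'EmptyStatement' ]
```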
