Hi JSer, today I’d like to talk to you about one of my favorite topics: what abstract syntax trees are and what we can do with them. We’ll look at abstract syntax trees by writing a few simple demos.

You can download the working sample code here to see some of the demos we’ll be covering today.

Noun explanation

First let’s look at the following nouns: abstract syntax tree, lexical analysis, syntax analysis.

Abstract syntax tree

Abstract Syntax Tree (AST). AST is a tree-like representation of the abstract syntactic structure of source code, with each node in the tree representing a structure in the source code.

For example, if-else code might have a node with three child nodes (judgment condition, logic for true, logic for false), while code might have two child nodes (judgment condition, logic for the body of the loop).

The transformation of code from source to abstract syntax tree goes through two important processes: lexical analysis and syntax analysis.

Lexical analysis

Lexical analysis, or tokenize for short, refers to the process of converting sequences of characters from source code into sequences of words (tokens). The tool that does this is called a lexer (or tokenizer).

For example, the code a = b + 1 will yield the five words [A, =, b, +, 1] after lexical analysis (or an EOF identifier at the end). Each word may contain the following information:

  • Original words such as ‘a’, ‘=’, ‘1’
  • The position of a word in the source code, for examplebThe location is from5to6
  • The type of word, for exampleaIs an identifier, which represents a variable name, and=and+Is the operator,1It’s a literal.

Syntax analysis

Through lexical analysis, we get chained word sequences, and then Parser converts word sequences into tree-like AST according to syntax rules. We usually use the Backus-Naur Form (BNF) to describe our grammar.

During parsing, some optimization rules can be used to make changes to the structure of the syntax tree so that the compiler/interpreter can execute the code more efficiently.

tool

Today we will rely on the parsing capability provided by Acorn.js, which parses JS code into an AST.

For those of you who don’t know what Acorn.js is, chances are you’ve already touched it and are using it, as many tools rely on it: webPack, Uglip-es, ESLint, and more.

We won’t go into the details of how it works today (we’ll do that in a future article), so just look at the API it provides through the documentation.

The source code

Next, we’ll write our code around the following syntax tree generated by Acorn.js:

var sum = 0;
var count = 1;
var result;

while (count <= 10) {
  sum = sum + count;
  count = count + 1;
}

if (count < 10) {
  result = false;
} else {
  result = true;
}
Copy the code

Show word sequences

Let’s start with a simple task.

Acorn.js provides a tokenizer, through which we can retrieve words one by one up to eOF:

const getTokenList = function getTokenList(code) {
  const tokens = acorn.tokenizer(code);
  const arr = [];
  let token = tokens.getToken();
  while(token.type.label ! = ='eof') {
    arr.push(token);
    token = tokens.getToken();
  }
  return arr;
};
Copy the code

Thus, we can display word sequences:

Each word contains the following information: Start and end identify the position of the word in the source code, label represents the word type, and value is the word value.

For example, var sum = 0; 0, position 10 to 11, type num, value 0.

Notice that in the word sequence, newlines and Spaces are removed, but; It’s reserved.

Show syntax tree

Also, Acorn.js provides parse:

const tree = acorn.parse(code);
Copy the code

So we can visualize the AST generated by Acorn.js:

As you can see, each node also contains information about node type, start location, child nodes, and so on. Nodes of different types will have different numbers and names of children. Nodes like Identifier and Literal will not contain child nodes, but will contain raw information that specifies the name or value of their Identifier.

Code highlighting

Based on word sequences or AST, we can code highlight the syntax.

The demo shows how to use the results of lexical analysis to highlight code.

What we need to do is replace the source code with a sequence of words, such as:

var sum = 0;
Copy the code

Replace with:

<span class="token-var">var</span> sum <span class="token-assign">=</span> 0;
Copy the code

The word sequence provides the starting position of the word, so we can easily do this substitution. The trick here is that if you order to replace, then you need to the cumulative records after each replacement of cheap, or you can start from the tail replacement, or, like me, starting from the word sequence of assembly, when found the adjacent word does not end, I will take out from the original content (Spaces and newlines) between them and putting it back to them.

If we use the results of parsing, we can provide more interesting features, such as allowing folding of blocks (such as those of if and while), or jumping over declarations of variables, functions, classes, and so on.

Code compression

The next task is to compress the code.

Word based sequence

First of all, we compress based on the word sequence. Since our word sequence no longer contains Spaces and newlines, we just need to connect the values of each word in the word sequence. For example, we need to fill the space between var sum:

var a=0; var b=1; var c;while(b<=10){a=a+b; b=b+1; }if(b<10){c=false; }else{c=true; }Copy the code

Here we also do a little bit of work: for variable names, we will build a table that corresponds to a, B, C in the order in which they first appear… , a0, b0, … . We do not consider nested scopes, for example:

var sum = 0; . var count = 1; var func = () => { var count = 0; . }Copy the code

May be converted to:

var a = 0; . var b2 = 1; var b3 = () => { var b2 = 0; . }Copy the code

However, the fact is that the two counts are not in the same scope, so we should use a new corresponding table:

var b3 = () => { var a = 0; . }Copy the code

We can’t provide different scopes just on the basis of word sequences, we can handle this by providing AST information (although I skipped this point in the demo too).

Based on the AST

There is plenty of room for further improvement in the above compression code. For example; Not always necessary, for example for the last line of code in a block:

while(b<=10){a=a+b; b=b+1; }Copy the code

Can be expressed as:

while(b<=10){a=a+b; b=b+1}Copy the code

Variable declarations of the same class can also be merged:

var a=0,b=1,c;
Copy the code

Here we need to take advantage of the information provided by the AST.

In./walker/index.js we provide defaultPs, which is an example of traversing a syntax tree. DefaultPs does nothing, but we can modify it to do various things:

const defaultPs = {
  Program(node, env) {
    let item;
    for (item of node.body) {
      defaultPs.process(item, env);
    }
  },
  BlockStatement(node, env) {
    let item;
    for(item of node.body) { defaultPs.process(item, env); }},... process(node, env) { const p = defaultPs[node.type];if(! p) { console.error('Unprocessed type', node.type, node);
      return;
    }
    returnp(node, env); }};Copy the code

DefaultPs walks through the syntax tree, reacting differently depending on the type of each node. Env is used to store global data. The example code in the demo is very simple, so we only cover a few node types of processing, such as BlockStatement, VariableDeclaration, Identifier, WhileStatement and so on.

Here we iterate over nodes and concatenate strings:

const joinCode = function joinCode(list) {
  returnlist.filter(x => x ! = null).join('; ');
};


const ps = {

  ...


  BlockStatement(node, env) {
    const result = [];
    let item;
    result.push(preProcessVars(node, env));
    for (item of node.body) {
      result.push(ps.process(item, env));
    }
    returnjoinCode(result); },... };Copy the code

Here we list several typical node processing:

  Identifier(node, env) {
    returnenv.nextName.getName(node.name); }, Literal(node, env) {// compressiontrueandfalse
    if (node.raw === 'true') return '! 0 ';
    if (node.raw === 'false') return '! 1 ';
    return node.raw;
  },
  WhileStatement(node, env) {
    const test = ps.process(node.test, env);
    const body = ps.process(node.body, env);
    return `while(${test}) {${body}} `; }, IfStatement(node, env) { consttest = ps.process(node.test, env);
    const consequent = ps.process(node.consequent, env);
    let str = `if(${test}) {${consequent}} `; const alternate = ps.process(node.alternate, env)if (node.alternate) str += `else{${alternate}} `;return str;
  },
  BinaryExpression(node, env) {
    return ps.process(node.left, env) + node.operator + ps.process(node.right, env);
  },
Copy the code

For example, IfStatement is to concatenate the code of the corresponding node with if, else, and so on. And its children are going to be treated recursively.

Compared with before, because we have a tree-like structure, we can preferentially traverse all variable declaration nodes under a tree and bring them together to the head. Meanwhile, we can know whether a line of code is the last line in a block and omit the following one. .

The following code is generated in the demo:

var a=0,b=1,c;while(b<=10){a=a+b; b=b+1};if(b<10){c=! 1}else{c=! 0}Copy the code

Here we add another trick: use! 0 is used instead of true! 1 instead of false. The results look good, but there is room for improvement, such as between while and if; You can omit, if you’re going to try, the distinction:

var a = {};if(...). .Copy the code

And:

while (...) { ... }if(...). .Copy the code

The difference between. You need to figure out whether the} from the previous node comes from a block or something like an object literal to determine whether the two nodes are not needed; Attached to it.

In the demo we put variable declarations in the header of the code block. As a JSer, I’m sure you’ve heard of variable promotion. Maybe you’ve heard of instruction reordering in Java, or optimization engines in SQL. These are examples of how compilers, etc., might make changes to the abstract structure of code to optimize it before executing it.

Today’s code compression tools do more static analysis to further compress code, such as tree-shaking.

const a = () => {};
const b = () => {};

export {
  a,
};
Copy the code

For example, in the case of the above module, B is neither exported nor used within the module, so the processed code will discard it completely.

Another example:

const a = () => {
  const a = 3;
  const b= 1;
  return a + b;
};
Copy the code

Will be optimized directly by some tools to code like the following:

const a = () => 4;
Copy the code

Document generation

This example is about automatically generating documents.

The sample code we will use is as follows:

class Photo extends MediaContent { /** * @owner: User */ constructor(owner) { this.owner = owner; } / * * * @return: Array | photo Array * /list() {}
  /**
   * @hashReturns: String | from seven cowshash
   * @returnNew photos: Int | id * / create (hash) {} delete(id, useSoftDelete) {} } class Video extends MediaContent { constructor(owner) { this.owner = owner; } / * * * @returnArray: an Array | video * /list() {} / * * * @ url: String | video url * @returnNew video: Int | id * / create (url) {} / * * * @ id: Int * @ useSoftDelete: Boolean * / delete (id, useSoftDelete) {}}Copy the code

When we walk through the generated AST, we’ll see two ClassDeclaration nodes, each of which will contain information about the class name, the parent class name (if any), and several MethodDefinition nodes, And the MethodDefinition node will contain the list of params.

However, this information is a little thin for generating documentation. We want to show more information, such as the type and description of the parameter, the description of the return value, and so on. It’s not easy to get this information from Javascript code, so we decided to resort to comments.

The problem we need to deal with is that the syntax tree generated by Acorn.js does not contain comment nodes. We can only collect comment nodes into a separate array through onComment, and then guess which method the comment belongs to based on the position information of comment nodes and method nodes (start and end).

For annotation content processing, you can use regular expressions or any other appropriate method.

In addition, we can provide classes and methods with the ability to view source code.

compile

Once we have an AST, we can reorganize the source code, we can convert it to compressed code, and of course we can convert it to code in other languages, such as Python:

sum = 0
count = 1
result = None
while count <= 10:
    sum = sum + count
    count = count + 1
if count < 10:
    result = False
else:
    result = True
Copy the code

The code in the demo is also relatively simple to work with. The important thing to note here is the handling of indentation. We need to preserve the correct indentation for the generated Python code.

So here we’re going to rewrite the joinCode method to fill in the Spaces.

const prependSpace = function prependSpace(level) {
  let str = ' ';
  for (let i = 0; i < level; i++) {
    str += ' ';
  }
  return str;
};

const joinCode = function joinCode(list, level = 0) {
  const space = prependSpace(level); 
  returnlist.filter(x => x ! = null).map(x => space + x).join('\n');
};
Copy the code

Let’s look at some typical nodes:

  WhileStatement(node, env) {
    const test = ps.process(node.test, env);
    const body = ps.process(node.body, env);
    return `while ${test}:\n${body}`;
  },
  IfStatement(node, env) {
    const test = ps.process(node.test, env);
    const consequent = ps.process(node.consequent, env);
    let str = `if ${test}:\n${consequent}`;
    const alternate = ps.process(node.alternate, env)
    if (node.alternate) str += `\nelse:\n${alternate}`;
    return str;
  },
  BinaryExpression(node, env) {
    return `${ps.process(node.left, env)} ${node.operator} ${ps.process(node.right, env)}`;
  },
Copy the code

The trick here is to maintain the correct level of indentation. At what point does the indentation level change? When code enters and exits a new BlockStatement. So we just need to maintain the hierarchy here.

It is also possible to enter and exit a BlockStatement from the scope of the maintenance variable name, push a new table of the variable name, or pop a scope at the top of the stack, while fetching and setting variables through multiple tables from the top. (Note that unlike const and let, var is not block-scoped.) But our demo is not designed to handle this kind of problem.

The interpreter

We can then try to execute the code by iterating through the AST to evaluate the node. WhileStatement = AssignmentExpression WhileStatement = AssignmentExpression WhileStatement = AssignmentExpression

. Identifier(node, env, opts) {if(! opts.name)return env.names[node.name];
    return node.name;
  },
  Literal(node, env) {
    return eval(node.raw);
  },
  WhileStatement(node, env) {
    while (ps.process(node.test, env)) {
      ps.process(node.body, env)
    }
  },
  IfStatement(node, env) {
    if (ps.process(node.test, env)) {
      ps.process(node.consequent, env)
    } else {
      ps.process(node.alternate, env)
    }
  },
  BinaryExpression(node, env) {
    switch (node.operator) {
      case '< =': {
        return ps.process(node.left, env) <= ps.process(node.right, env)
      }
      case '<': {
        return ps.process(node.left, env) < ps.process(node.right, env)
      }
      case '+': {
        returnps.process(node.left, env) + ps.process(node.right, env); }}},...Copy the code

In fact, any node should have a return value. For example, the return value of a BlockStatement should be the return value of its last node, while the return value of an IfStatement varies depending on the result of its test. However, due to my laziness, the demo did not return all nodes, and as mentioned, instead of dealing with nested scopes, I used a table (and no Map, haha). I believe you can do it better than ME.

Step through

Finally, now that we’ve implemented a simple interpreter, it’s natural to see if we can implement a single-step debugger?

Since we are using JS and in a browser environment, we cannot use pTrace for process tracing, interrupt and restore instructions, but since we implemented the interpreter ourselves, we can still do some interesting things:

We can promisify the evaluation process for all nodes (./walker/step.js).

Identifier and Literal are the simplest nodes, and we can use promise.resolve directly. For example, with the BinaryExpression node, we need to use promise. all to wait for the left and right subtrees to finish applying. WhileStatement and block Statements are slightly more troublesome.

WhileStatement needs to repeatedly evaluate its own node before its test is false:

  WhileStatement(node, env) {
    return ps.process(node.test, env).then((flag) => {
      if (flag) {
        return ps.process(node.body, env).then(() => {
          return ps.process(node, env)
        });
      } else {
        // done}}); },Copy the code

Blockstatements need to process all nodes in turn:

  BlockStatement(node, env, opts) {
    const index = opts.index || 0;
    let item = node.body[index];
    return ps.process(item, env).then(() => {
      if (index === node.body.length - 1) {
        // done
      } else {
        returnps.process(node, env, { index: index + 1, }); }}); },Copy the code

This made me a little dizzy the first time I wrote it, but it was fun to see how it worked.

After wrapping all the evaluation with a Promise, we can then just hijack resolve (./walker/step2.js) in the node type that requires a breakpoint:

  BinaryExpression(node, env) {
    return new Promise((r) => {
      env.currentNode = node;
      env.next = () => {
        r(Promise.all([
          ps.process(node.left, env),
          ps.process(node.right, env)
        ]).then((arr) => {
          const [left, right] = arr;
          switch (node.operator) {
            case '< =': {
              return left <= right;
            }
            case '<': {
              return left < right;
            }
            case '+': {
              returnleft + right; }}})); }; }); },Copy the code

While hijacking Resolve, we record the current node to highlight its code:

Doesn’t look bad, does it?

conclusion

Finally, thank you for reading this article and hope you found it interesting. In future articles, I’ll try to bring some interesting content, including how to use ohm.js and how to do the parsing yourself.