The introduction

For any language framework, the knowledge of the compile part is often unrecognized but important knowledge hidden inside the code.

Most front-end engineers probably have only a superficial understanding of how compilation works, “knowing how it works but not why.”

That’s why I started this column, “Playing with the Principles of Compilation,” to help more front-end developers understand the world of compilation in plain English.

For the “strange and far away” compilation principle knowledge, I will use the most popular and practical words to guide you into its world.

Don’t want to be a screw? Then come on!

Why To Do

We believe that the application of the relevant knowledge of front-end compilation principle will be more or less understood. Common packaging tools in the industry, such as Webpack, Rollup and so on, are all based on the analysis of Abstract Syntax Tree carried out by Acorn to package our code.

The well-known VueJs front-end framework also implements template analysis based on AST to implement special.vue files. The JSX syntax in React is also transformed from JSX syntax analysis to JS syntax through AST.

Well-known libraries such as ESLint, TypeScript, and PostCss in front-end closed-loop engineering also rely on abstract syntax trees.

In fact, mastering the knowledge of compilation principle is very beneficial for a front-end engineer, whether it is to go deep into the bottom of the framework or to implement an auxiliary tool is very practical and essential knowledge.

Maybe you’ve never heard AST names before, or maybe you’ve used some of the well-known esprima, Babel/Paser, Acorn, etc libraries to handle AST nodes.

It doesn’t matter which category you belong to before that, please forget them. Here, we build our own little parser from basic JavaScript.

This article, as the first of the columns, is a bit more conceptual.

Compiler workflow

At this point, I will use Esprima and a simple Demo first to realize the work flow of the entire compiler, and later we will use a fully self-implemented compiler to compile our real case to duplicate a small compiler.

The workflow of a simple compiler can be summarized in three broad aspects:

Parsing

First, the compiler accepts a piece of code, usually a string, in its initial phase.

For example, let’s take the following JSX code:

<div id="app"><p>hello</p>Jue Jin</div>
Copy the code

When the compiler gets to this string code, it enters the parsing phase, where it does two things:

Lexical analysis

When the compiler receives the above string, it first splits the incoming string into a series of lexical items called tokens, a step commonly referred to as word segmentation.

Let’s start by looking at the results of lexical analysis of the above code using the Esprima Api.

const esprima = require('esprima');

// Configure support for JSX and Tokens to print corresponding tokens using the parseScript Api
const { tokens } = esprima.parseScript('<div id="app"><p>hello</p>Jue Jin</div>', { jsx: true.tokens: true });

console.log(tokens,'tokens')
Copy the code

After lexical analysis, the above statement will be broken up step by step into this structure:

[{"type": "Punctuator"."value": "<"},
   {"type": "JSXIdentifier"."value": "div"}, 
   {"type": "JSXIdentifier"."value": "id"}, 
   {"type": "Punctuator"."value": "="},
   {"type": "String"."value": "\"app\""}, 
   {"type": "Punctuator"."value": ">"}, 
   {"type": "Punctuator"."value": "<"},
   {"type": "JSXIdentifier"."value": "p"},
   {"type": "Punctuator"."value": ">"},
   {"type": "JSXText"."value": "Hello"},
   {"type": "Punctuator"."value": "<"},
   {"type": "Punctuator"."value": "/"},
   {"type": "JSXIdentifier"."value": "p"},
   {"type": "Punctuator"."value": ">"},
   {"type": "JSXText"."value": "Jue Jin"},
   {"type": "Punctuator"."value": "<"},
   {"type": "Punctuator"."value": "/"},
   {"type": "JSXIdentifier"."value": "div"},
   {"type": "Punctuator"."value": ">"}]Copy the code

We can see that the JSX syntax passed in above is parsed into an array of tokens, with each object in the array representing a Token.

Each Token has a corresponding type attribute representing its type and a value attribute representing its value.

In this step, we divide the incoming code into tokens through lexical analysis in the parsing stage. Usually, finite state machine is the best way of lexical analysis.

I’ll elaborate on what a finite state machine is later in this article.

Syntax analysis

In the previous step, we used lexical analysis to divide the input code into an array of tokens. After that, we need to parse tokens into a real abstract syntax tree (AST) form.

An abstract syntax tree, you can think of it as a Christmas tree. Each of these tokens can be seen as a node in the Christmas tree.

The parsing formally abstracts each of the above divided tokens into a tree to describe the relationship between each Token node.

const esprima = require('esprima');

Call parseScript to get the abstract syntax tree generated by the input code
const ast = esprima.parseScript('<div id="app"><p>hello</p>Jue Jin</div>', { jsx: true });

console.log(ast, 'ast')
Copy the code

The above tokens are parsed into this data structure:

{
  "type": "Program"."body": [{
    "type": "ExpressionStatement"."expression": {
      "type": "JSXElement"."openingElement": {
        "type": "JSXOpeningElement"."name": {
          "type": "JSXIdentifier"."name": "div"
        },
        "selfClosing": false."attributes": [{
          "type": "JSXAttribute"."name": {
            "type": "JSXIdentifier"."name": "id"
          },
          "value": {
            "type": "Literal"."value": "app"."raw": "\"app\""}}},"children": [{
        "type": "JSXElement"."openingElement": {
          "type": "JSXOpeningElement"."name": {
            "type": "JSXIdentifier"."name": "p"
          },
          "selfClosing": false."attributes": []},"children": [{
          "type": "JSXText"."value": "Hello"."raw": "Hello"}]."closingElement": {
          "type": "JSXClosingElement"."name": {
            "type": "JSXIdentifier"."name": "p"}}}, {"type": "JSXText"."value": "Jue Jin"."raw": "Jue Jin"}]."closingElement": {
        "type": "JSXClosingElement"."name": {
          "type": "JSXIdentifier"."name": "div"}}}}],"sourceType": "script"
}
Copy the code

Let’s use a simple diagram to describe the tree a little bit:

The so-called grammatical analysis stage is to turn Tokens into a tree through a series of grammatical analysis. Each node in the tree will hold information corresponding to its own node.

At the same time, the tree data structure also reflects the relationship between each node.

For example, the top-level div node of type JSXElelement can be split into a structure like this:

  {
    // Node type
      "type": "JSXElement"."openingElement": {
        "type": "JSXOpeningElement"."name": {
          "type": "JSXIdentifier"."name": "div"
        },
       // Whether it is self-closed
        "selfClosing": false.// Attribute of this object
        "attributes": [{
          "type": "JSXAttribute"./ / the property name
          "name": {
            "type": "JSXIdentifier"."name": "id"
          },
          / / property values
          "value": {
            "type": "Literal"."value": "app"."raw": "\"app\""}}},"children": [...]. .// Save the nested child nodes of this node
      "closingElement": { // Close the node
        "type": "JSXClosingElement"."name": {
          "type": "JSXIdentifier"."name": "div"}}}Copy the code

Each of the lexical parsed nodes has a type attribute whose value indicates the type of the node. For example, the

converted node has type JSXElement, or the “Jue Jin” string has type JSXText to indicate that the node is a JSX string.

Depending on the type, we can use the corresponding attributes to access/modify the corresponding attributes and contents of the node.

This is the end of the compiler’s two-step work in the parsing phase, which is simply to convert our input string code into a tree data structure (AST).

So far I’ve described this process in as simple a text as possible, and later I’ll walk you through how to do it yourself.

Transformation stage

The compiler first goes through the transition phase and converts the input code into the AST. Then comes the transformation phase, which is essentially a deep traversal of the abstract syntax tree.

In the transformation phase, we iterate over the abstract syntax tree to modify the tree structure by adding, deleting, or modifying matched nodes.

For example, if I want to add an attribute whose ID is TEXT to the P node, then during the AST traversal, I can modify the corresponding node attribute. Of course, you can also directly replace the entire node. Decided by you!

Let’s do this first with Estraverse.

Estraverse is a library of tools for deep traversal of abstract syntax trees generated by Esprima. Because Estraverse does not support JSX syntax, we use Estraverse – Fb, an extension of the library, to traverse JSX transformed abstract syntax trees.

const esprima = require('esprima');
// Deep walk through the AST tool library
const esTraverseFb = require('estraverse-fb')
// Tools to generate AST nodes
const { builders } = require('ast-types')

const ast = esprima.parseScript('<div id="app"><p>hello</p>Jue Jin</div>', { jsx: true });

// Depth-first approach
esTraverseFb.traverse(ast, {
  // The enter function starts when entering each node
  enter: function (node) {
    const { type, openingElement } = node
    // Check whether the current entered node is a matching p node
    if (type === 'JSXElement' && openingElement.name.name === 'p') {
      // Generate the current attribute node that needs to be added
      const attribute = builders.jsxAttribute(
        // The first argument is name
        builders.jsxIdentifier('id'),
        // The second argument is value
        builders.literal('text'))// Add the generated attribute id='text' to the start tag of this node
      openingElement.attributes.push(attribute)
    }
  },
  // The leave function is triggered when leaving each node
  leave: function () {
    // nothing}})Copy the code

At this point, after the above transformation, we changed the original AST structure. We modified the node corresponding to the original P tag into this structure:

{
        "type": "JSXElement"."openingElement": {
          "type": "JSXOpeningElement"."name": {
            "type": "JSXIdentifier"."name": "p"
          },
          "selfClosing": false.// Here we add an attribute node to attributes
          "attributes": [{
            "name": {
              "name": "id"."loc": null."type": "JSXIdentifier"."comments": null."optional": false."typeAnnotation": null
            },
            "value": {
              "value": "text"."loc": null."type": "Literal"."comments": null."regex": null
            },
            "loc": null."type": "JSXAttribute"."comments": null}}]Copy the code

You may be curious about the so-called transformation phase of the AST traversal and how to match the corresponding nodes to add, delete, change, and check, but keep a general idea of the compiler’s transformation phase in mind.

And I’m going to walk you through that process step by step.

Code Generation

We went through Parsing to convert the input string to an abstract syntax tree AST structure.

We then go through the transformation phase to deeply traverse the nodes of the generated abstract syntax tree, and modify some of the nodes. ‘

Now that the compiler has the abstract syntax tree processed, all it needs to do is, of course, convert the abstract syntax tree of the so-called tree structure into new code.

This step is usually called Code Generation: we reverse transform through the abstract syntax tree into generated Code, where the latest Code is generated based on the modified AST.

In the generation stage, it essentially traverses the abstract syntax tree and generates the corresponding string code according to the type and attribute of each node in the abstract syntax tree.

During the code generation phase, we can use EscodeGen to convert the AST into new string code.

EscodeGen doesn’t support JSX syntax, so I won’t show you how to use it in detail.

For example, the abstract syntax tree above where we modified the code generates new code:

<div id="app"><p id="text">hello</p>Jue Jin</div>
Copy the code

The last generation phase of the compiler converts the latest abstract syntax tree into the corresponding code structure. Different compilers will implement this step in different ways, and we will mainly use recursively modified AST nodes for code generation.

To this step also completely completed the code conversion function, simply speaking, the whole process of the compiler is the three steps of parsing, transformation, generation.

At the end

A complete little compiler works in a way that I’ve described here.

We have mentioned the three complete processes of parsing, transforming, and generating as well as the things that are handled in each step of a compiler workflow, but there are many details of these seemingly simple processes that we have not covered yet.

My original intention of the first article is more to introduce the role of a brick, here we can understand the role of each stage and the results of each stage for the three stages of analysis, transformation and generation mentioned in the article.

As for how to implement the three phases of parsing, transformation and generation, including the finite state machine mentioned above, I will take you step by step to implement a small JSX compiler with a practical Demo.

If you are interested in the principle of front-end compilation, you can follow my column to play with the principle of compilation.