Recap

If you're interested, feel free to keep following my compiler-principles column,

dedicated to helping more front-end developers explore the world of compilation through the most accessible front-end language.

  • Front-end Engineer’s Guide to Compiler Principles – “Compiler Workflow”

Introduction

In the last article, we walked through a compiler's complete workflow, which uses three stages (Parsing, Transformation, and Code Generation) to process our input and finally produce our output.

Now that you’re familiar with the basics of how compilers work, we’ll implement a small JavaScript compiler from scratch.

But before we start, let's talk about finite state machines (FSMs).

Finite state machine

Concept

The concept of a finite state machine doesn't have much to do with JavaScript itself, but most stateful behavior in JavaScript can be described with one.

What is a finite state machine? Broadly speaking, a finite state machine is just an idea, a model, and we can use that idea to describe a great many scenarios.

Let’s say there’s a button element on a web page. A popover appears when the mouse hovers over the button and disappears when the mouse moves away from the button.

For the above behavior, let’s try to describe the process with a state machine.

We can abstract each of the user's mouse events as an input to a state machine, and treat the button itself as that state machine.

Each user action (input) may change the state of the button (the state machine), and the button has only two legal states: hovered (the mouse is over it) and not hovered.

When we move the mouse around outside the button (without entering it), the state machine receives the same kind of input each time, so its state does not change.

When the mouse moves inside the button, the input changes, so the state machine decides the state must change: the button switches to the highlighted state and the popover is displayed.

A bit of pseudo-code describes this process:

const buttonMachine = {
  // The current state of the state machine
  currentState: 'inVisible',
  // transform is called on every input to update the current state
  transform: function (event) {
    // change currentState based on the user action (event)
    switch (this.currentState) {
      case 'visible':
        // switch to the visible state: highlight the button, show the popover
        break;

      case 'inVisible':
        // switch to the hidden state: remove the highlight, hide the popover
        break;

      default:
        break;
    }
  }
}

Every mouse movement generates an input that is fed into the state machine's transform method, which decides whether the current state needs to change and runs the corresponding logic when it does.
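To make this concrete, here is one runnable version of the button machine. It is only a sketch: the popover is simulated with a `popoverVisible` flag instead of real DOM code, and the event names `mouseenter`/`mouseleave` are stand-ins for real mouse events.

```javascript
const buttonMachine = {
  // The current state of the state machine
  currentState: 'inVisible',
  // Stand-in for the popover's on-screen visibility
  popoverVisible: false,
  // Called on every input (event) to decide the next state
  transform: function (event) {
    if (this.currentState === 'inVisible' && event === 'mouseenter') {
      // Mouse moved onto the button: highlight it and show the popover
      this.currentState = 'visible'
      this.popoverVisible = true
    } else if (this.currentState === 'visible' && event === 'mouseleave') {
      // Mouse left the button: remove the highlight and hide the popover
      this.currentState = 'inVisible'
      this.popoverVisible = false
    }
    // Any other input leaves the state unchanged
  }
}

buttonMachine.transform('mouseenter')
console.log(buttonMachine.currentState) // 'visible'
buttonMachine.transform('mouseleave')
console.log(buttonMachine.currentState) // 'inVisible'
```

Note that moving the mouse near (but not onto) the button produces no `mouseenter` input, so the state stays put, exactly as described above.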

Role

The main reason I singled out finite state machines is that, as mentioned in the previous article, the compiler must tokenize the input string. For example, given:

<div id="app"><p>hello</p>Jue Jin</div>

the tokenization stage splits it into tokens, one by one:

[
  { "type": "Punctuator", "value": "<" },
  { "type": "JSXIdentifier", "value": "div" },
  { "type": "JSXIdentifier", "value": "id" },
  { "type": "Punctuator", "value": "=" },
  { "type": "String", "value": "\"app\"" },
  { "type": "Punctuator", "value": ">" },
  { "type": "Punctuator", "value": "<" },
  { "type": "JSXIdentifier", "value": "p" },
  { "type": "Punctuator", "value": ">" },
  { "type": "JSXText", "value": "hello" },
  { "type": "Punctuator", "value": "<" },
  { "type": "Punctuator", "value": "/" },
  { "type": "JSXIdentifier", "value": "p" },
  { "type": "Punctuator", "value": ">" },
  { "type": "JSXText", "value": "Jue Jin" },
  { "type": "Punctuator", "value": "<" },
  { "type": "Punctuator", "value": "/" },
  { "type": "JSXIdentifier", "value": "div" },
  { "type": "Punctuator", "value": ">" }
]

This step can be handled with a finite state machine. Of course there are other ways to tokenize; you could, for example, crudely split the input with regular expressions alone, but a state machine is the most elegant and flexible approach.
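To show what the crude regex route looks like, here is a minimal sketch that splits a simple arithmetic string with a single `match` call. The function name `regexTokenizer` is hypothetical; this is for contrast only, not code we build on.

```javascript
// Crude tokenization with one regex: match runs of digits or single operators
function regexTokenizer(input) {
  const matches = input.match(/[0-9]+|[+\-*/]/g) || []
  return matches.map(value =>
    /[0-9]/.test(value)
      ? { type: 'Numeric', value }
      : { type: 'Punctuator', value }
  )
}

console.log(regexTokenizer('100+200-300'))
// [ { type: 'Numeric', value: '100' },
//   { type: 'Punctuator', value: '+' },
//   ... ]
```

This works for trivial inputs, but the regex has to encode every lexical rule at once, which quickly becomes unmanageable; the state machine handles one character and one state at a time.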

We'll implement this later, once we understand the concept of a state machine. For now, let's write a tiny state machine to help digest the ideas above.

Miniature state machine

The basic concepts of the state machine are covered above; let's now use one for lexical analysis in a small demo.

Take a simple arithmetic expression, for example:

100+200-300

We need to break the statement above into the following structure using the state machine:

[
  { type: 'Numeric', value: '100' },
  { type: 'Punctuator', value: '+' },
  { type: 'Numeric', value: '200' },
  { type: 'Punctuator', value: '-' },
  { type: 'Numeric', value: '300' }
]

This step is what's called lexical analysis, though the state machine we implement here only considers two token types: Numeric and Punctuator.

Tokenizing with a state machine means reading the entire input code one character at a time, updating the current state according to each character read, and performing the corresponding logic.

The whole process can be pictured as a state-transition diagram.
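Since the diagram boils down to a handful of transitions, here is the same information as a plain transition table, a sketch in which the state names match the handler functions we are about to write:

```javascript
// Each entry answers: given the current state and the kind of
// input character, which state does the machine move to?
const transitions = {
  start:      { digit: 'numeric', punctuator: 'punctuator' },
  numeric:    { digit: 'numeric', punctuator: 'punctuator' }, // digit: keep appending
  punctuator: { digit: 'numeric', punctuator: 'punctuator' }  // always emit first
}

// Classify one character the same way our regexes will
function kindOf(char) {
  return /[0-9]/.test(char) ? 'digit' : 'punctuator'
}

// Walk "1+2" through the table, printing each transition
let state = 'start'
for (const char of '1+2') {
  const next = transitions[state][kindOf(char)]
  console.log(`${state} --'${char}'--> ${next}`)
  state = next
}
// start --'1'--> numeric
// numeric --'+'--> punctuator
// punctuator --'2'--> numeric
```

The transitions alone don't capture the side effects (appending to the current token, emitting a finished one); those live in the handler functions below.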

Let's implement the whole process in code.

First, let's create some basic variables for tokenization:

// Regex to match digits
const NumReg = /[0-9]/
// Regex to match punctuation (the arithmetic operators)
const PunctuatorReg = /[+\-*/]/

// The final output: the list of all tokens
const tokens = []

// The token the state machine is currently building
let currentToken = {}

// The tokenizer function accepts the input string (code)
function tokenizer(input) {
    // do something
}

// Print the tokenization result
console.log(tokenizer('100+200-300'))

First we define:

  • NumReg and PunctuatorReg, two regular expressions; here we only consider digits and the basic arithmetic operators.

  • tokens, which holds the final tokenization result.

  • currentToken, the token the state machine is currently building.

  • tokenizer, as its name implies, the core function that performs the lexical analysis.

Next we implement the finite state machine inside the tokenizer function.

First, for the input string "100+200-300", we iterate over each character and push the matching tokens into tokens.

We also maintain an internal state that is updated after each input character. For example:

When "1" is read, this input puts the state machine into the "Numeric" state, and we continue with the next character "0", since "1" and "0" belong to one token and cannot be separated.

So when "0" is read, the state is still "Numeric", and all we need to do is change the value of currentToken from "1" to "10"; the same applies until we reach "100".

When the analysis reaches "+", the input to the state machine is "+". Clearly "+" is a punctuation character and cannot be joined onto the previous "100", so at this point we push the currentToken ("100") into tokens and switch the state machine to "Punctuator", and so on.

Let's focus first on the implementation of the tokenizer function:

// ...

function tokenizer(input) {
    // The state machine begins in its initial state (the start function)
    let state = start
    // Iterate over the input string character by character
    input.split('').forEach(char => {
        // char is a single character
        // Call the current state function, passing in char
        state = state(char)
    })
    return tokens
}

Let's break down step by step what the tokenizer function is really doing.

Look at the forEach first: we iterate over input (the source code to tokenize), so each char can be thought of as one input to the state machine.

As for the state function, it is simply the handler returned for the previous input, so there is no need to dig into it yet; we'll see what it does shortly.

On the very first input there is no previous state in the machine, so we define a start function as the initial handler, which is this line:

function tokenizer(input) {
    // ...
    let state = start
    // ...
}

/**
 * Initial state of the state machine
 * @param {*} char The input character
 * @return {*} The next state handler
 */
function start(char) {
    if (NumReg.test(char)) {
        // The first input char is a digit: initialize the token as Numeric
        currentToken = { type: 'Numeric', value: char }
        // Return the numeric handler
        return numeric
    } else if (PunctuatorReg.test(char)) {
        // The first input char is punctuation: initialize the token as Punctuator
        currentToken = { type: 'Punctuator', value: char }
        // Return the punctuator handler
        return punctuator
    }
}

The start function handles two things:

  • First, it initializes currentToken based on the first input char. For example, with the input "100+200-300" the first char is "1", so the initialization step sets currentToken to { type: 'Numeric', value: '1' }.

  • Second, after currentToken is initialized, the state machine decides its internal state based on the current input.

For example, when we initialize currentToken to { type: 'Numeric', value: '1' }, the state machine internally holds the 'Numeric' state based on that input.

The next input is then processed by the handler returned for the previous state (in this case numeric): when "1" was fed in, the machine returned numeric (the Numeric-state handler).

There are two possibilities for the next input:

  • If the input is "0", it is still a digit. Since the previous input was also a digit, we only need to change currentToken's value to "10"; the two inputs are not treated as separate tokens.

  • In the second case, if "1" is followed by a punctuation character such as "+", then the numeric function returned by the previous step receives the "+" input, which tells us the two characters do not belong to the same lexeme and must be split.

So each step uses the current input to decide which handler to return next, and that determines whether the token is split or continued.

Essentially, each char can be regarded as an input to the state machine, while the functions returned at each step, such as numeric and punctuator, can be regarded as the states decided by those inputs.

On each input, we combine the previous state (the returned handler) with the current input (char) to decide, during lexical analysis, whether the current character continues the previous token.

Next, let's look at the details of the numeric and punctuator functions:

// Numeric-state handler
function numeric(char) {
    if (NumReg.test(char)) {
        // The current input is still a digit: keep appending to the value
        currentToken.value += char
        // Return numeric so it stays the state handler
        return numeric
    } else if (PunctuatorReg.test(char)) {
        // The input is punctuation, so the token must be split here
        // First push the finished token into tokens
        emitToken(currentToken)
        // Start a new current token
        currentToken = { type: 'Punctuator', value: char }
        // Return the punctuator handler
        return punctuator
    }
}

The logic of the numeric function is very simple. When an input reaches the numeric handler, it means the character the state machine processed last time was definitely a digit.

At that point we only need to decide, inside numeric, how to handle the new input. If char is, say, "3", the input is still numeric, so there is no split; we just append it to the current token's value.

If the input is "+", then "+" and the preceding Numeric token obviously belong to different tokens, so in this case:

  • emitToken(currentToken) runs first, saving the finished token into the final token list.

  • Then currentToken is reset to the Punctuator token for this input.

  • Finally, the punctuator function is returned as the machine's state, serving as the handler for the next input.

We'll implement the emitToken method shortly; it simply pushes currentToken into the final token array, nothing more.

Next, we implement the punctuator function, whose logic mirrors numeric: the state machine uses the current char as input, together with the state function left over from the previous input, to decide how to tokenize.

// Punctuator-state handler
function punctuator(char) {
  // Always emit, because punctuation is never joined with anything at the tokenization stage
  emitToken(currentToken)
  if (NumReg.test(char)) {
    currentToken = { type: 'Numeric', value: char }
    return numeric
  } else if (PunctuatorReg.test(char)) {
    currentToken = { type: 'Punctuator', value: char }
    return punctuator
  }

  return punctuator
}

The slight difference between punctuator and numeric is that punctuation is split immediately; we never join any input onto it.

As a result, the first thing the punctuator function does is call emitToken(currentToken), and it then returns the appropriate state function for the char received this time.

For example, when tokenization reaches the first "2":

char ("2") enters the punctuator function, because punctuator is the state handler left over from the previous step.

That is, it first emits the previous "+" into the final tokens, and then updates the machine's state function based on the input.

Some readers may wonder: are punctuation characters that form a single lexical unit ever joined together?

For example, the increment (++) and decrement (--) operators we use all the time in JavaScript come as character pairs, so do we need to join the two characters at the tokenization stage?

The answer is no. During lexical analysis we usually only need to split the incoming string into tokens according to the basic lexical rules; context-dependent syntax such as increment and decrement is handled in detail at the syntax-analysis stage.
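To illustrate, here is a sketch of how a later (syntax-analysis) pass might fold two adjacent '+' tokens into a single '++' token. The helper name mergeIncrements is hypothetical and not part of the tokenizer we are building:

```javascript
// Fold adjacent '+' '+' (or '-' '-') Punctuator tokens into one '++' / '--' token
function mergeIncrements(tokens) {
  const out = []
  for (let i = 0; i < tokens.length; i++) {
    const cur = tokens[i]
    const next = tokens[i + 1]
    if (
      cur.type === 'Punctuator' && next && next.type === 'Punctuator' &&
      (cur.value === '+' || cur.value === '-') && next.value === cur.value
    ) {
      out.push({ type: 'Punctuator', value: cur.value + next.value })
      i++ // skip the token we just consumed
    } else {
      out.push(cur)
    }
  }
  return out
}

console.log(mergeIncrements([
  { type: 'Punctuator', value: '+' },
  { type: 'Punctuator', value: '+' }
]))
// [ { type: 'Punctuator', value: '++' } ]
```

The lexer stays simple and context-free; combining characters that only make sense together is left to the stage that understands context.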

Of course, our focus here isn't lexical analysis itself; we're walking through tokenization to show you the concept and use of finite state machines.

Finally, we implement the emitToken function:

function emitToken(token) {
  // Reset currentToken
  currentToken = { type: '', value: '' }
  // Push the finished token into the final tokens array
  tokens.push(token)
}

It simply resets currentToken and pushes the finished token into the final token array.

That's our little example of lexical analysis via a state machine. Let's run the code and look at the output.

Not bad, but the "300" token is missing. This is because in tokenizer we do nothing with the token still being built after forEach finishes.

We need to push that last token into tokens once the traversal stops. Something like this:

function tokenizer(input) {
  // Initialize the state machine
  let state = start
  input.split('').forEach(char => {
    state = state(char)
  })
  // Push the token still being built when the input ends
  tokens.push(currentToken)
  return tokens
}

Now let's rerun the code.

The console prints what we expect. Of course, the lexical analysis we've implemented here is very simple and only supports digit and punctuation tokens.
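For reference, this is the token list the run should now print (the original article showed a console screenshot here):

```javascript
// tokenizer('100+200-300') should now produce:
const expectedTokens = [
  { type: 'Numeric', value: '100' },
  { type: 'Punctuator', value: '+' },
  { type: 'Numeric', value: '200' },
  { type: 'Punctuator', value: '-' },
  { type: 'Numeric', value: '300' }
]
console.log(expectedTokens)
```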

Our focus is not lexical analysis for its own sake; we want this simple example, and its code, to help you understand what a finite state machine is.

Wrapping up

I deliberately didn't pile up finite-state-machine theory; understanding the concept and how to apply it, as this article presents it, is enough for now, and we will put it to concrete use in the formal lexical-analysis stage.

A finite state machine is just a model, a way of thinking; we can apply the model to a great many scenarios.

In the next article we'll get down to business and walk you through building a small JavaScript compiler. If you're interested, keep following my compiler-principles column.

Code

The full code is here.

const NumReg = /[0-9]/
const PunctuatorReg = /[+\-*/]/

// All finished tokens
const tokens = []
// The token currently being built
let currentToken = {}

function emitToken(token) {
  // Reset currentToken
  currentToken = { type: '', value: '' }
  // Push the finished token into the final tokens array
  tokens.push(token)
}

/**
 * Initial state of the state machine
 * @param {*} char The input character
 * @return {*} The next state handler
 */
function start(char) {
  // If the input is a digit
  if (NumReg.test(char)) {
    // Initialize currentToken as a Numeric token
    currentToken = { type: 'Numeric', value: char }
    return numeric
  } else if (PunctuatorReg.test(char)) {
    // Initialize currentToken as a Punctuator token
    currentToken = { type: 'Punctuator', value: char }
    return punctuator
  }
}

// Entered while the current state is Numeric
function numeric(char) {
  if (NumReg.test(char)) {
    // Still a digit: keep appending
    currentToken.value += char
  } else if (PunctuatorReg.test(char)) {
    // Punctuation matched, so the state must change
    // First push the finished token into tokens
    emitToken(currentToken)
    currentToken = { type: 'Punctuator', value: char }
    return punctuator
  }
  // Return the current state function; it will be called again on the next iteration
  return numeric
}

// Punctuator-state handler
function punctuator(char) {
  // Always emit, because punctuation is never joined at the tokenization stage
  emitToken(currentToken)
  if (NumReg.test(char)) {
    currentToken = { type: 'Numeric', value: char }
    return numeric
  } else if (PunctuatorReg.test(char)) {
    currentToken = { type: 'Punctuator', value: char }
    return punctuator
  }

  return punctuator
}

function tokenizer(input) {
  // Initialize the state machine
  let state = start
  input.split('').forEach(char => {
    state = state(char)
  })
  // After traversal, push the last token still being built
  tokens.push(currentToken)
  return tokens
}

console.log(tokenizer('100+200-300'))
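One easy extension, left here as a sketch: skip whitespace before the machine ever sees it, so inputs like '100 + 200 - 300' also work. The names `tokenize`, `NumRe`, and `PunctRe` are hypothetical, and the handler functions are folded into a single loop so the snippet runs on its own:

```javascript
const NumRe = /[0-9]/
const PunctRe = /[+\-*/]/

function tokenize(input) {
  const tokens = []
  let current = null
  // Strip whitespace up front so the state logic never receives it
  for (const char of input.replace(/\s+/g, '')) {
    if (NumRe.test(char)) {
      if (current && current.type === 'Numeric') {
        current.value += char // continue the number
      } else {
        if (current) tokens.push(current) // emit the previous token
        current = { type: 'Numeric', value: char }
      }
    } else if (PunctRe.test(char)) {
      if (current) tokens.push(current) // punctuation always splits
      current = { type: 'Punctuator', value: char }
    }
  }
  if (current) tokens.push(current) // emit the final token
  return tokens
}

console.log(tokenize('100 + 200 - 300'))
```

The state machine's logic is unchanged; only the inputs it receives are pre-filtered. A real lexer would instead emit whitespace-aware tokens or track positions, but that is beyond this demo.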