preface

JavaScript is by far the most familiar programming language for most front-end developers, and while it is powerful, some of its language features are hard to understand: concepts such as closures and the this binding often confuse beginners. There are plenty of articles on the Internet with titles like "Cut me if you still don't understand this binding", but in my opinion most of them are superficial; you can read a lot of them and still be stumped by an interviewer's questions. So how can you gain a thorough understanding of these language features and stand out in an interview? In my opinion, the best way to truly understand something is to implement it yourself, which is what Western programmers like to call learning by doing. For example, if you want to understand React better, the best way is to implement React yourself. So, to better understand JavaScript, I implemented a simple JavaScript interpreter called Simple. The interpreter is written in TypeScript and implements a subset of JavaScript syntax, including the following features:

  • Basic data types
  • Complex data types: objects, arrays, and functions
  • Variable definitions
  • Mathematical operations
  • Logical operations
  • if conditionals
  • while and for loops
  • Functional programming
  • Closures
  • this binding

After implementing the Simple language interpreter, I wrote this series of articles, which consists of the following parts:

  • Project introduction and lexical analysis (this article)
  • Syntax analysis
  • Executing JavaScript code

While Simple’s implementation has nothing to do with the V8 engine (or any other JavaScript engine), and you won’t be able to understand their source code through this series of articles, here’s what you’ll learn by the end of this series:

  • Deepen your understanding of the JavaScript language (this, closures, etc.)
  • Basic knowledge of compiler principles
  • Know what a DSL is and how to implement an internal DSL to improve development efficiency (Simple's parsing is based on an internal DSL)

The source code for the Simple interpreter is open-sourced on GitHub at github.com/XiaocongDon… . I also built a simple online code editor for you to play with, at superseany.com/opensource/… . In this editor you can write and run JavaScript code, and see the words (tokens) and syntax tree (AST) generated from it.

This brings us to the first part of this series, project introduction and lexical analysis.

Project introduction

Compiler vs interpreter

Before we begin to understand how Simple works, let’s clarify two basic concepts of Compiler principles: Compiler vs Interpreter.

The compiler

A compiler can be understood as a language translator: it converts a source file from one form of code into another. It only converts the code; it does not actually execute the code's logic. In front-end development, the bundler we use, Webpack, is essentially a JavaScript compiler: it packages our code without executing it.

The interpreter

An interpreter, as the name implies, interprets and executes our code. Unlike a compiler, it does not convert the source code (at least it does not output intermediate files), but interprets and executes the source code logic.

Simple interpreter

Because Simple performs no intermediate code conversion on the JavaScript code you write, and simply interprets and executes the code's logic, it is a JavaScript interpreter in the literal sense.

Architecture design for Simple

The code we write is simply a string of text stored on the computer’s hard drive, and the essence of implementing a language interpreter is to teach the computer how to understand and execute that text code. So how can a computer understand what we write? Given that most programming languages are coded in English, let’s take a look at how one understands an English sentence and see if we can get some inspiration.

The process of understanding English sentences

Put a pencil on the table. I'm sure you all know what this sentence means, but have you ever thought about how you understand it? Or, going further, could you break the process of understanding this sentence down into individual steps?

I’m sure most people will go through these stages in understanding the above statement:

  • Cut the sentence into words and understand the meaning of each word: a sentence is made up of words, so to understand the meaning of a sentence we must first know the meaning of each word. "Put a pencil on the table" means:
    • Put: verb, to place
    • A: indefinite article, one
    • Pencil: noun, a pencil
    • On: preposition, on top of
    • The: definite article, this one
    • Table: noun, a table
  • After the words are cut up, divide the sentence structure according to the grammar rules of English: once we understand the meaning of each word, we divide the structure of the sentence according to English grammar. For the sentence above, we divide it as follows:
    • The sentence begins with the verb "put", followed by an indefinite article and a noun: the first half asks someone to put a pencil somewhere.
    • The second half, "on the table", is a prepositional phrase that tells us where the pencil should go: on the table.
    • Combining the two halves, the whole sentence means: put the pencil on the table.

How does a computer understand code

Now that we know how to understand an English sentence, let's think about how to get a computer to understand our code. We all know that much of computer science is about modeling the real world. For example, the well-known data structure Queue corresponds to the queues we often see in daily life, while some design patterns, such as Visitor and Listener, model real-life scenarios. In computer science, the field that studies programming languages is called compiler theory. So how do some of its basic concepts correspond to the steps we just used to understand a sentence?

As mentioned above, the first step in understanding a sentence is to cut it into words and understand the meaning of each one. This corresponds to Lexical Analysis in compiler theory. Lexical analysis, as its name implies, processes code at the word level; its main job is to divide the code string into individual tokens.

After understanding the meaning of each word, we divide the sentence structure according to English grammar rules. The corresponding concept in compiler theory is Syntax Analysis (parsing). Parsing takes the token stream produced by lexical analysis and, following the defined grammar rules, generates an abstract syntax tree (AST). The resulting abstract syntax tree is eventually executed by some runtime.

To sum up, the software architecture of a language interpreter looks something like this: source code → lexical analysis (token stream) → syntax analysis (AST) → execution by a runtime. That is also Simple's software architecture. Now let's look at the implementation of lexical analysis.

Lexical analysis

As mentioned earlier, lexical analysis cuts the code of a file into individual words (tokens). Besides alphabetic strings such as let, for, and while, symbol strings such as =, ==, && and + are also legal words. Here are the word types that are legal in the Simple interpreter:

  • Keywords: let, const, break, continue, if, else, while, function, true, false, for, undefined, null, new, return
  • Identifiers: variable names defined by developers, such as arr, server, result, etc.
  • Literals: literals include numeric literals (number) and string literals (string). The Simple interpreter only supports single-quoted strings, such as 'this is a string literal'.
  • Arithmetic and logical operators: +, -, ++, --, *, /, &&, ||, >, >=, <, <=, ==
  • Assignment operators: =, +=, -=
  • Special symbols: [ ] { } . , : ; ( )

It is important to note that not all characters in the source code are preserved during the lexical analysis phase: unnecessary information such as whitespace, line breaks, and code comments is removed here. Here is a rendering of the lexical analysis output.

For lexical analysis, there are roughly two implementations:

Regular expression

This approach is probably the one most developers would think of first. Since the Simple interpreter does not use it, the process is only described briefly here; it generally consists of the following steps:

  • Define a regular expression for each word type. For example, the regular expression for numeric literals (excluding the floating point case) is /[0-9][0-9]*/, the one for the simple assignment operator is /=/, and the one for the equality operator is /==/.
  • Match the code string against each word type's regular expression in order of lexical priority. If a word type's regular expression hits, extract the matched substring, then continue the match operation from the end position of that substring, looping until the whole string has been matched. One very important point is that word types have different lexical priorities. For example, the equality operator == takes precedence over the assignment operator =, because if the developer writes two equals signs, it must mean an equality comparison, not two assignments.
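As an illustration of these two steps, here is a minimal regex-based tokenizer sketch. The rule set and names are mine, chosen for illustration (Simple itself does not use this approach): patterns are tried in lexical priority order, and matching resumes from the end position of each hit.

```typescript
// A minimal regex-based tokenizer sketch. Patterns are anchored at the
// start of the remaining input and tried in priority order, so '=='
// is listed before '=' and wins whenever both could match.
const rules: Array<[string, RegExp]> = [
  ['NUMBER', /^[0-9][0-9]*/],
  ['EQUAL', /^==/],       // must come before ASSIGN
  ['ASSIGN', /^=/],
  ['WHITESPACE', /^\s+/], // matched but dropped below
]

function tokenize(code: string): Array<{ type: string; value: string }> {
  const tokens: Array<{ type: string; value: string }> = []
  let pos = 0
  while (pos < code.length) {
    const rest = code.slice(pos)
    // Find the first (highest-priority) rule that hits at this position
    const rule = rules.find(([, re]) => re.test(rest))
    if (!rule) throw new SyntaxError(`Unexpected character ${code[pos]}`)
    const value = rest.match(rule[1])![0]
    // Whitespace is consumed but produces no token
    if (rule[0] !== 'WHITESPACE') tokens.push({ type: rule[0], value })
    pos += value.length // continue matching from the end of the hit
  }
  return tokens
}

// tokenize('1 == 2') yields NUMBER "1", EQUAL "==", NUMBER "2"
```

Note how the relative order of EQUAL and ASSIGN encodes the lexical priority discussed above: swapping them would split `==` into two assignment signs.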

Based on finite state machines

Since all regular expressions can be converted to their corresponding finite state machines, lexical analysis can also be implemented using finite state machines. So what is a finite state machine?

A Finite State Machine (FSM) has the following characteristics:

  • It has a finite set of states
  • It can be in only one state at a time, the current state
  • After receiving input from the outside world, the finite state machine computes the next state based on the current state and the received input, and transitions to that state

The familiar traffic light is an example of a finite state machine. A traffic light can only have three colors, red, green and yellow, so its set of states is finite. Since a traffic light can only have one color at a time (imagine if it were both red and green :), its current state is unique. Finally, the traffic light will change to the next state according to the current state (color) and input (how much time has passed). For example, after 60 seconds, the red light will turn yellow instead of green.
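As a sketch, the traffic light's three FSM characteristics map to code like this. The red → yellow transition follows the text above; the yellow → green and green → red transitions are my assumption for illustration:

```typescript
// The traffic light as a finite state machine: a finite state set,
// a single current state, and a transition driven by input
// (here the input is simply "enough time has passed", i.e. a tick).
type Light = 'RED' | 'GREEN' | 'YELLOW'

// The next state is computed from the current state alone on each tick.
const next: Record<Light, Light> = {
  RED: 'YELLOW',   // per the example above: red turns yellow, not green
  YELLOW: 'GREEN', // assumed ordering for illustration
  GREEN: 'RED',    // assumed ordering for illustration
}

class TrafficLight {
  private state: Light = 'RED' // the unique current state

  tick(): Light {
    this.state = next[this.state]
    return this.state
  }
}
```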

From the above definition we know that the most important elements of a finite state machine are the following:

  • The state set
  • The current state
  • How to transition between different states

Now that we know what a finite state machine is and its three elements, let’s look at an example of using a simple finite state machine to do lexical analysis. We are going to design a finite state machine that recognizes the following types of words:

  • Identifiers
  • Number (number literal, excluding floating point)
  • String (string literal, enclosed in single quotes)
  • Plus (+)
  • The plus assignment operator (+=)

Let’s first define the three elements of a finite state machine:

  • State set: the state set should contain all states the machine can reach after receiving any input. For the word types above, the machine has the following states:
    • initial: the initial state
    • number: the state the machine is in after recognizing a number literal
    • start string literal: the machine enters this state when it receives the first single quote and stays in it until it receives the second single quote (before the string ends)
    • string literal: the state the machine is in after recognizing a complete string literal
    • identifier: the state the machine is in after recognizing an identifier
    • plus: the state the machine is in after recognizing the plus sign
    • plus assign: the state the machine is in after recognizing the plus assignment operator
  • Current state: the current state of the finite state machine can be any of the states defined above
  • How to transition between states: when the machine is in one state, it can only transition to certain particular states. For example, if the machine is in the start string literal state, it can only stay in that state or transition to string literal. When the current input cannot trigger any transition, there are two cases. If the current state is a terminable state, the machine already knows all the information needed to produce a token, so it outputs the word represented by the current state, resets to the initial state, and then reprocesses the current input. If the current state is a non-terminable state, the machine does not have enough information to output a word, and it reports an error. In this example, the terminable states are number, string literal, identifier, plus, and plus assign, while start string literal is non-terminable. Here is the state transition diagram of the state machine:

An important detail: besides the current state, the state machine also needs to store the characters that have not yet been output as a word. That is, there should be a buffer variable storing the character input received so far. For example, when + is encountered, buffer becomes +; when = is then encountered, buffer becomes +=; finally += is output as a word and buffer is reset to the empty string "".

Once the three elements of the state machine are defined, we can use it to tokenize the string a+='Simple':

  1. The state machine starts in the initial state, then receives the characters of the code one by one, performing the corresponding state transitions and word output.
  2. The machine receives the character a. From the transition diagram defined above, it transitions to the identifier state and saves the character in the buffer variable.
  3. The machine receives +. The identifier state has no transition for +, but it is a terminable state, so the machine outputs the previously recorded a as a word and resets to initial. It then reprocesses +, transitions to the plus state, and records the character; buffer becomes +.
  4. The machine receives =. From the transition diagram, it transitions to the plus assign state and records the character; buffer becomes +=.
  5. The machine receives '. The plus assign state has no transition for ', but it is also a terminable state, so the machine outputs the buffered += as a word and resets to initial. It then reprocesses ', transitioning to the start string literal state and recording the quote in the buffer.
  6. The machine receives S, i, m, p, l, and e in turn. Since none of them is a single quote, it stays in the start string literal state and appends each character to the buffer; buffer becomes 'Simple.
  7. The machine receives ' and transitions to the string literal state, appending the closing quote; it has now recognized a valid string word.
  8. Finally, when the machine determines there are no more characters to consume, it checks whether the current state is terminable. Since string literal is terminable, it outputs the current word. Conversely, if the machine runs out of input while in a non-terminable state, it throws an Unexpected EOF error.
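The walkthrough above can be sketched as a small, self-contained state machine. This is illustrative code of my own (state names and structure are simplified relative to the real Simple implementation analyzed below), recognizing exactly the five word types from this example; whitespace handling is omitted:

```typescript
// A compact state machine recognizing identifiers, numbers, single-quoted
// strings, '+', and '+='.
type St = 'initial' | 'number' | 'identifier' | 'plus' | 'plusAssign'
        | 'startString' | 'stringLiteral'

// The terminable states: being stuck in one of these means "emit a word".
const END_STATES = new Set<St>(['number', 'identifier', 'plus', 'plusAssign', 'stringLiteral'])

// The transition function: next state for (state, character), or null if none.
function step(state: St, ch: string): St | null {
  switch (state) {
    case 'initial':
      if (/[0-9]/.test(ch)) return 'number'
      if (/[a-zA-Z_]/.test(ch)) return 'identifier'
      if (ch === '+') return 'plus'
      if (ch === "'") return 'startString'
      return null
    case 'number': return /[0-9]/.test(ch) ? 'number' : null
    case 'identifier': return /[a-zA-Z0-9_]/.test(ch) ? 'identifier' : null
    case 'plus': return ch === '=' ? 'plusAssign' : null
    case 'startString': return ch === "'" ? 'stringLiteral' : 'startString'
    default: return null // plusAssign and stringLiteral accept nothing more
  }
}

function lex(code: string): Array<{ state: St; value: string }> {
  const tokens: Array<{ state: St; value: string }> = []
  let state: St = 'initial'
  let buffer = '' // characters not yet output as a word
  let i = 0
  while (i <= code.length) {
    const ch = i < code.length ? code[i] : null // null signals end of input
    const nextState = ch !== null ? step(state, ch) : null
    if (ch === null || nextState === null) {
      if (ch === null && state === 'initial') break // clean end of input
      if (!END_STATES.has(state)) {
        throw new SyntaxError('Unexpected ' + (ch ?? 'EOF')) // e.g. unterminated string
      }
      tokens.push({ state, value: buffer }) // output the buffered word
      state = 'initial'
      buffer = ''
      if (ch === null) break
      continue // reprocess the same character from the initial state
    }
    state = nextState
    buffer += ch
    i++
  }
  return tokens
}

// lex("a+='Simple'") yields: identifier "a", plusAssign "+=", stringLiteral "'Simple'"
```

The `continue` without advancing `i` is the "reset and reprocess the previous input" behavior from steps 3 and 5 above.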

This is a simple example of implementing a lexical analyzer with a finite state machine, and the lexical analysis in the Simple interpreter follows the same steps. In the Simple interpreter, I decoupled the core logic of the state machine (recording the current state and performing transitions) from its configuration (defining the state set and the transitions between states). This makes it easier to modify and extend Simple's lexical rules later, and allows a different state machine configuration to implement lexical analysis for another language.

The core logic of the state machine lives in the lib/lexer/Tokenizer.ts file, and the configuration of the state machine lives in the lib/config/Tokenizer.ts file. The following is the detailed source code analysis.

State machine configuration definition

The Simple state machine configuration is defined in lib/config/Tokenizer.ts:

```typescript
// lib/config/Tokenizer.ts

// State defines all possible states of the Simple language state machine
enum State {
  INITIAL = 'INITIAL',
  NUMBER_LITERAL = 'NUMBER_LITERAL',
  IDENTIFER = 'IDENTIFER',
  START_STRING_LITERAL = 'START_STRING_LITERAL',
  STRING_LITERAL = 'STRING_LITERAL',
  // ...
}

// State transition definitions
const config: IConfig = {
  initialState: State.INITIAL, // the initial state of the state machine
  states: { // all state configurations of the state machine
    [State.INITIAL]: {
      isEnd: false, // whether the state is terminable
      transitions: [ // all transitions out of this state
        {
          state: State.NUMBER_LITERAL,
          checker: /[0-9]/
        },
        {
          state: State.START_STRING_LITERAL,
          checker: "'"
        }
        // ...
      ]
    },
    [State.NUMBER_LITERAL]: {
      isEnd: true,
      TokenType: TOKEN_TYPE.NUMBER_LITERAL,
      transitions: [
        {
          state: State.NUMBER_LITERAL,
          checker: /[0-9\.]/
        }
      ]
    },
    [State.START_STRING_LITERAL]: {
      isEnd: false,
      transitions: [
        {
          state: State.START_STRING_LITERAL,
          checker: /[^']/
        },
        {
          state: State.STRING_LITERAL,
          checker: "'"
        }
      ]
    },
    [State.STRING_LITERAL]: {
      isEnd: true,
      TokenType: TOKEN_TYPE.STRING_LITERAL
    }
    // ...
  }
}
```

The configuration file above defines a config object that is passed as a parameter to the finite state machine class Tokenizer in lib/lexer/Tokenizer.ts. The config object has two properties: the initial state value and all state configurations of the state machine. initialState is the initial state of the machine; whenever the machine recognizes a new word, it resets to this state. states is an object whose keys are state values and whose values are the corresponding state configurations. The configuration of a state includes the following:

  • isEnd: a boolean indicating whether the state is terminable
  • TokenType: the word type corresponding to this state. Only a terminable state can have a word type; if TokenType is not specified, no word is generated even when a match succeeds.
  • transitions: stores all possible transitions out of this state; each transition has the following properties:
    • state: the target state of the transition
    • checker: the condition for the transition. It can be a string, a regular expression, or a function that returns a boolean. The machine performs the transition when the input satisfies the checker condition.
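To make the three checker forms concrete, here is a minimal sketch of evaluating each kind of checker against a character. The helper name `matches` is my own, not from the Simple source; it mirrors the branching that Tokenizer's consume method performs:

```typescript
// A checker can be a literal string, a regular expression,
// or a predicate function on a single character.
type Checker = string | RegExp | ((ch: string) => boolean)

// Evaluate a checker against one input character.
function matches(checker: Checker, ch: string): boolean {
  if (typeof checker === 'string') return ch === checker
  if (checker instanceof RegExp) return checker.test(ch)
  return checker(ch)
}

console.log(matches("'", "'"))               // string checker: true
console.log(matches(/[0-9]/, '7'))           // regex checker: true
console.log(matches((c: string) => c === '+', '+')) // function checker: true
```

Supporting all three forms keeps simple transitions declarative (strings, regexes) while leaving an escape hatch (functions) for conditions that are awkward to express as a pattern.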

State machine core logic implementation

After looking at the Simple state machine configuration, let's look at the core state machine code that consumes it, in lib/lexer/Tokenizer.ts. To implement Tokenizer, I designed two helper classes. One is LocationKeeper, which tracks the current character position (line and column numbers) in the source file. The other is TokenBuffer: all words recognized by the state machine are stored in an instance of this class, so it provides methods to read and write words. This class is introduced after the Tokenizer class.

Let’s start with the Tokenizer class’s consume(ch: string) function, the core logic for processing input characters:

```typescript
// lib/lexer/Tokenizer.ts
class Tokenizer {
  // ...
  consume(ch: string) {
    // If the input character is a space or a newline and the machine is in
    // the initial state, only the current location information is updated
    if ((ch === SPACE || ch === NEW_LINE) && this.state === this.initialState) {
      this.locationKeeper.consume(ch)
      return
    }

    // Otherwise, transition based on the current state and the input character

    // this.state stores the current state of the state machine
    const currentStateConfig: IStateConfig = this.statesConfig[this.state]
    if (!currentStateConfig) {
      // Developer-friendly error: the developer forgot to configure this state
      throw new Error(`Missing state config for ${this.state}`)
    }

    // Get all possible transitions out of the current state
    const transitions = currentStateConfig.transitions
    if (!transitions) {
      // The current state has no transitions; if it is terminable
      if (currentStateConfig.isEnd) {
        // Generate a token and store it in the tokenBuffer
        this.addToken(currentStateConfig.TokenType)
        // Reset the current state
        this.reset()
        // Consume the current input character again
        this.consume(ch)
        return
      }

      // The current state has no transitions and is not terminable
      throw new SyntaxError(`Unexpected character ${ch}`, this.locationKeeper.getCurrentLocation())
    }

    // Compare the input character against each checker to find the transition to perform
    const targetTransition = transitions.find(({ checker }) => {
      if (typeof checker === 'string') {
        return ch === checker
      }

      if (checker instanceof RegExp) {
        return checker.test(ch)
      }

      return checker(ch)
    })

    // No transition matched
    if (!targetTransition) {
      // The current state is terminable
      if (currentStateConfig.isEnd) {
        if (currentStateConfig.TokenType) {
          // Add a token to the tokenBuffer instance
          this.addToken(currentStateConfig.TokenType)
        }
        // Reset the state
        this.reset()
        // Re-consume the input character
        this.consume(ch)
        return
      }

      // No transition matched and the state is non-terminable: report an error!
      this.locationKeeper.consume(ch)
      throw new SyntaxError('Invalid or unexpected token', this.locationKeeper.getCurrentLocation())
    }

    // The following logic runs after a successful state transition

    // Update the recorded location information (line and column numbers)
    this.locationKeeper.consume(ch)

    // Record the start position of the current word
    if (this.state === this.initialState && targetTransition.state !== this.initialState) {
      this.locationKeeper.markLocation()
    }

    // Transition the current state to the target state
    this.state = targetTransition.state
    // Append the current character to the buffer
    this.buffer += ch
  }
}
```

Next, let’s look at the source code for the TokenBuffer class used to store recognized words:

```typescript
// lib/lexer/TokenBuffer.ts
import { IToken } from "./types/token"

class TokenBuffer {
  // Stores the recognized words
  private tokens: Array<IToken> = []
  // Stores the position of the word currently being read
  private cursor: number = 0

  // peek returns the current word without moving the cursor (a pre-read)
  peek() {
    return this.tokens[this.cursor]
  }

  // Unlike peek, read returns the current word and advances the cursor
  read() {
    const currentToken = this.tokens[this.cursor]
    const nextCursor = this.cursor < this.tokens.length ? ++this.cursor : this.tokens.length
    this.cursor = nextCursor
    return currentToken
  }

  // Cancel the last read and put the word "back"
  unread() {
    const lastCursor = --this.cursor
    this.cursor = lastCursor
    return this.tokens[lastCursor]
  }

  // Write a new token
  write(token: IToken) {
    this.tokens.push(token)
  }

  // Get the current cursor position
  getCursor() {
    return this.cursor
  }

  // Set the cursor position directly, mainly used for backtracking
  // during the syntax analysis phase
  setCursor(cursor: number) {
    this.cursor = cursor
  }

  // Output the current tokens in JSON format
  toJSON(): Array<IToken> {
    return this.tokens
  }

  // Determine whether all words have been consumed
  isEmpty(): boolean {
    return this.cursor === this.tokens.length
  }
}
```

Attentive readers will notice that reading a word from the TokenBuffer above only moves the cursor; words are never actually removed from the array. The advantage is that during the syntax analysis phase, when some grammar rule fails to match, the parser can move the cursor back over the words it has already read and try to match them against another grammar rule.
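For example, the cursor-based backtracking can be sketched like this. The snippet below is a minimal re-implementation of the relevant TokenBuffer methods plus a hypothetical parsing scenario of my own (the rule being tried is invented for illustration):

```typescript
// Minimal stand-in mirroring TokenBuffer's cursor behavior.
interface IToken { type: string; value: string }

class TokenBuffer {
  private tokens: IToken[] = []
  private cursor = 0
  write(token: IToken) { this.tokens.push(token) }
  peek() { return this.tokens[this.cursor] }
  read() {
    const t = this.tokens[this.cursor]
    if (this.cursor < this.tokens.length) this.cursor++
    return t
  }
  getCursor() { return this.cursor }
  setCursor(c: number) { this.cursor = c }
}

// A parser tries one grammar rule, fails, and backtracks to try another.
const buf = new TokenBuffer()
buf.write({ type: 'IDENTIFIER', value: 'a' })
buf.write({ type: 'ASSIGN', value: '=' })

const checkpoint = buf.getCursor() // remember where this rule started
buf.read() // consume IDENTIFIER
if (buf.read().type !== 'PLUS_ASSIGN') {
  buf.setCursor(checkpoint) // rule failed: put the words "back"
}
// The cursor is back at the start, so another rule can re-read the same tokens
console.log(buf.peek().value) // 'a'
```

Because the tokens stay in the array, backtracking is just an integer assignment, which keeps retrying alternative grammar rules cheap.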

The token stream

Finally, let's take a look at the token stream recognized by the finite state machine. Here is the input code:

```javascript
let a = 'HelloWorld';
```

After processing by the finite state machine, the output token stream is:

```json
[
  {
    "type": "LET",
    "value": "let",
    "range": {
      "start": { "line": 1, "column": 1 },
      "end": { "line": 1, "column": 3 }
    }
  },
  {
    "type": "IDENTIFIER",
    "value": "a",
    "range": {
      "start": { "line": 1, "column": 5 },
      "end": { "line": 1, "column": 5 }
    }
  },
  {
    "type": "ASSIGN",
    "value": "=",
    "range": {
      "start": { "line": 1, "column": 7 },
      "end": { "line": 1, "column": 7 }
    }
  },
  {
    "type": "STRING_LITERAL",
    "value": "'HelloWorld'",
    "range": {
      "start": { "line": 1, "column": 9 },
      "end": { "line": 1, "column": 20 }
    }
  },
  {
    "type": "SEMICOLON",
    "value": ";",
    "range": {
      "start": { "line": 1, "column": 21 },
      "end": { "line": 1, "column": 21 }
    }
  }
]
```

As you can see from the output above, each word (token) has the following attributes:

  • type: the type of the word, i.e. the TokenType defined on the terminable state that produced it
  • value: the exact source text of the word
  • range: the start and end positions of the word, including line and column numbers. This location information helps developers locate errors when the code reports one.
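Expressed as TypeScript types, the token shape seen in the output above looks roughly like this. This is my reconstruction for illustration; the real definitions live in the repo's lib/lexer/types/token module, per the import shown earlier:

```typescript
// A reconstruction of the token shape seen in the JSON output above.
interface IPosition {
  line: number
  column: number
}

interface IRange {
  start: IPosition
  end: IPosition
}

interface IToken {
  type: string   // e.g. 'LET', 'IDENTIFIER', 'STRING_LITERAL'
  value: string  // the exact source text of the word
  range: IRange  // where the word sits in the source, for error reporting
}

// The first token from the output above, typed:
const token: IToken = {
  type: 'LET',
  value: 'let',
  range: { start: { line: 1, column: 1 }, end: { line: 1, column: 3 } },
}
```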

summary

In this article, I introduced the background and content of the Simple project, then covered some basic compiler theory, and finally explained in detail how to implement lexical analysis with a finite state machine, walking through the corresponding Simple source code.

In the next article, I’ll cover some of the basics of parsing, introduce some of the basic concepts of domain-specific languages (DSL), and finally detail how I used a flexible DSL to implement parsing for the Simple language.

Personal technology updates

This article first appeared on my personal blog.

Welcome to follow my WeChat official account, Green Onions, to learn and grow together.