The job of lexical analysis is to identify a long string of words, which are tokens. And the job of lexical analysis is to read and recognize strings, not to read all the strings into memory and then recognize

String is a series of characters, how to break it into one Token after another? What is the basis for partition? In fact, we realize the process of lexical analyzer, is to write regular expression, draw the graph of finite automata, and then write the process of parsing code intuitively according to the graph.

The following figure shows a simple regular expression rule

Parse age >= 45

To resolveage >= 45For example, the process of lexical analysis is shown in the figure below

Let’s describe the lexical rules for identifiers, comparison operators, and numeric literals.

  • Identifiers: The first character must be a letter, followed by letters or numbers.
  • Comparison operators: > and >= (other comparison operators are ignored for now).
  • Number literals: all numbers (like floating point numbers with a decimal point, forget about that for now.

This is how we construct finite automata. Thus, the lexical analyzer recognizes age, >=, and 45 as identifiers, comparison operators, and numeric literals, respectively. However, the diagram above is just a simplified schematic diagram. A strictly finite automaton is drawn as follows:

Explain the five states shown above.

  1. Initial state: The state of the program when lexical analysis is first started.
  2. Identifier state: In the initial state, when the first character is a letter, migrate to state 2. When the following characters are letters and numbers, leave them in state 2. If not, leave state 2, write down the Token, and return to the initial state.
  3. Greater than operator (GT) : In the initial state, entered when the first character is >. It is a case of comparison operators.
  4. Greater than or equal operator (GE) : If the next character in state 3 is =, state 4 is entered, becoming >=. It is also a case of comparison operators.
  5. Numeric literals: Enter the initial state where the next character is a number. If the subsequent number is still a number, it stays in state 5.

I would like to add here that you can see that the circles in the figure above have both single and double lines. Double line means that the status is already a legitimate Token, single line means that the status is still temporary.

You can easily program these five state transitions. We will start with state 1 and enter state 2, 3, and 5 when we encounter different characters:

 if (this.isAlpha(ch)) {
    // The first character is a letter
    if (ch === "i") {
        newState = DfaState.Id_let1;
    } else {
        // Enter the Id state
        newState = DfaState.Id;
    }
    this.token.type = TokenType.Identifier;
} else if (this.isDigit(ch)) {
    // The first character is a number
    newState = DfaState.LetLiteral;
    this.token.type = TokenType.LetLiteral;
}
Copy the code

In the code above, nextState is the nextState to enter. Enumeration (enum) types define enumeration values to represent different states, making the code easier to read.

Token is a custom data structure, which has two main attributes: one is “type”, which is the type of Token, which also uses an enumeration type value; One is the “text”, which is the text value of the Token.

We then deal with state transitions after entering states 2, 3, and 5:

case DfaState.Initial:
    state = this.initToken(ch);
    break;
case DfaState.Id:
    if (this.isAlpha(ch) || this.isDigit(ch)) {
        this.appendTokenText(ch);
    } else {
        state = this.initToken(ch);
    }
    break;
case DfaState.GT:
    if (ch === '=') {
        this.token.type = TokenType.GE;
        state = DfaState.GE;
        this.appendTokenText(ch)
    } else {
        state = this.initToken(ch)
    }
    break;
case DfaState.GE:
case DfaState.Assignment:
case DfaState.Plus:
case DfaState.Minus:
case DfaState.Star:
case DfaState.Slash:
case DfaState.SemiColon:
case DfaState.LeftParen:
case DfaState.RightParen:
    state = this.initToken(ch);          // Exit the current state and save the Token
    break;
case DfaState.LetLiteral:
    if (this.isDigit(ch)) {
        this.appendTokenText(ch)
    } else {
        state = this.initToken(ch)
    }
    break;
Copy the code

Run the sample program and you will successfully parse program statements like “age >= 45.”

Int age = 40

If the lexical rules involved in this statement are written out as regular expressions, they look like this:

Int:        'int'
Id :        [a-zA-Z_] ([a-zA-Z_] | [0-9])*
Assignment : '='
Copy the code

The int keyword, like an identifier, begins with a letter followed by another letter.

In other words, the string int conforms to both the rules for the identifier and the rules for the int keyword, and the two rules overlap. So this is a conflict, which rule should we use when we scan strings?

Of course, we know that rules for the int keyword take precedence over rules for identifiers. Common identifiers are not allowed to have the same name as these keywords.

Here, let’s review: What are keywords?

Keywords are words used as grammar elements in language design, such as int and char to represent data types, while and if to represent program structures, null and NAN to represent special data values, etc.

In addition to keywords, there are other words called reserved words. Reserved words are not used in the current language design, but they will be used in the future. We can name our variables, class names, and do not use the same string as the keyword and reserved word. So how do we distinguish keywords and reserved words from identifiers in a lexical analyzer?

Taking “int age = 40” as an example, we changed the finite auto to look like the following to resolve the keyword and identifier conflicts.

The idea is actually quite simple. Before you recognize a generic identifier, you should look at whether it is a keyword or a reserved word. The specific approach is:

When the first character is I, we put it into a special state. And then, if it encounters n and t, it goes to state 4. But that doesn’t end there. If the following character has other letters and numbers, it becomes a plain identifier again. For example, we can declare A variable like intA (int and A are connected) without conflicting with the int keyword. See the complete code at the bottom.

The above two examples, although simple, but it has clear the principle of lexical, is constructed based on finite automata and migration in different state, so as to resolve the Token for as long as to extend the finite automaton, increase the inside of the state and migration routes, you can gradually achieve a complete lexical analyzer.

This article mainly refers to the geek time course using JS to implement the lexical analyzer, run the sample program code as follows

let string = "int age = 45;";
console.log(string);
const lexer = new SimpleLexer();
let tokenReader = lexer.tokenize(string);
dump(tokenReader);

string = "inta age = 45;";
console.log(string);
tokenReader = lexer.tokenize(string);
dump(tokenReader);

string = "age >= 45;";
console.log(string);
tokenReader = lexer.tokenize(string);
dump(tokenReader);
Copy the code

The running result is shown in the figure below

Full code implementation

Source code address: SimpleLexer

// Define various state machines
const DfaState = {
    Initial: Symbol("Initial"),
    If: Symbol("If"),
    Id_if1: Symbol("Id_if1"),
    Id_if2: Symbol("Id_if2"),
    Else: Symbol("Else"),
    Id_else1: Symbol("Id_else1"),
    Id_else2: Symbol("Id_else2"),
    Id_else3: Symbol("Id_else3"),
    Id_else4: Symbol("Id_else4"),
    Int: Symbol("Int"),
    Id_int1: Symbol("Id_int1"),
    Id_int2: Symbol("Id_int2"),
    Id_int3: Symbol("Id_int3"),
    Id: Symbol("Id"),
    GT: Symbol("GT"),
    GE: Symbol("GE"),
    Assignment: Symbol("Assignment"),
    Plus: Symbol("Plus"),
    Minus: Symbol("Minus"),
    Star: Symbol("Star"),
    Slash: Symbol("Slash"),
    SemiColon: Symbol("SemiColon"),
    LeftParen: Symbol("LeftParen"),
    RightParen: Symbol("RightParen"),
    IntLiteral: Symbol("IntLiteral"),};// Define the token type
export const TokenType = {
    Identifier: Symbol("Identifier"),
    IntLiteral: Symbol("IntLiteral"),
    GT: Symbol("GT"),
    Plus: Symbol("Plus"),
    Minus: Symbol("Minus"),
    Star: Symbol("Star"),
    Slash: Symbol("Slash"),
    SemiColon: Symbol("SemiColon"),
    LeftParen: Symbol("LeftParen"),
    RightParen: Symbol("RightParen"),
    Assignment: Symbol("Assignment"),
    Int: Symbol("Int"),};// Single token object
class SimpleToken {
    constructor() {
        this.type = null;
        this.text = null;
    }

    getType() {
        return this.type;
    }

    getText() {
        return this.text; }}class TokenReader {
    constructor(tokens) {
        this.tokens = tokens;
        this.pos = 0;
    }

    read() {
        if (this.pos < this.tokens.length) {
            return this.tokens[this.pos++];
        }
        return null;
    }

    peek() {
        if (this.pos < this.tokens.length) {
            return this.tokens[this.pos]
        }
        return null;
    }

    unread() {
        if (this.pos > 0) {
            this.pos--; }}getPosition() {
        return this.pos;
    }

    setPosition(position) {
        if (position >= 0 && position < this.token.length) {
            this.pos = position; }}}export function dump(tokenReader){
    let token = null;
    while((token = tokenReader.read())){
        console.log(token.type ,token.getText())
    }
}


export class SimpleLexer {
    constructor() {
        this.tokenText = null;
        this.tokens = [];
        this.token = null;
    }

    appendTokenText(ch) {
        this.tokenText = this.tokenText + ch;
    }
    // Is it a letter
    isAlpha(ch) {
        return (ch >= "a" && ch <= "z") || (ch >= "A" && ch <= "Z");
    }
    // Whether it is a number
    isDigit(ch) {
        return ch >= "0" && ch <= "9";
    }
    // Is a blank character
    isBlank(ch) {
        return ch === "" || ch === "\t" || ch === "\n";
    }
    /** * The finite state machine enters the initial state. * This initial state does not actually stay, it immediately enters another state. * Enter the initial state when parsing begins; After a Token is resolved, it enters the initial state. Write down the Token here and create a new Token. *@param ch
     * @return* /
    initToken(ch) {
        if (this.tokenText.length > 0) {
            this.token.text = this.tokenText;
            if(this.token.type){
                this.tokens.push(this.token);
            }
            this.tokenText = "";
            this.token = new SimpleToken();
        }

        let newState = DfaState.Initial;
        if (this.isAlpha(ch)) {
            // The first character is a letter
            if (ch === "i") {
                newState = DfaState.Id_int1;
            } else {
                // Enter the Id state
                newState = DfaState.Id;
            }
            this.token.type = TokenType.Identifier;
        } else if (this.isDigit(ch)) {
            // The first character is a number
            newState = DfaState.IntLiteral;
            this.token.type = TokenType.IntLiteral;
        } else if (ch === ">") {
            // The first character is >
            newState = DfaState.GT;
            this.token.type = TokenType.GT;
        } else if (ch === "+") {
            newState = DfaState.Plus;
            this.token.type = TokenType.Plus;
        } else if (ch === "-") {
            newState = DfaState.Minus;
            this.token.type = TokenType.Minus;
        } else if (ch === "*") {
            newState = DfaState.Star;
            this.token.type = TokenType.Star;
        } else if (ch === "/") {
            newState = DfaState.Slash;
            this.token.type = TokenType.Slash;
        } else if (ch === ";") {
            newState = DfaState.SemiColon;
            this.token.type = TokenType.SemiColon;
        } else if (ch === "(") {
            newState = DfaState.LeftParen;
            this.token.type = TokenType.LeftParen;
        } else if (ch === ")") {
            newState = DfaState.RightParen;
            this.token.type = TokenType.RightParen;
        } else if (ch === "=") {
            newState = DfaState.Assignment;
            this.token.type = TokenType.Assignment;
        }
        this.appendTokenText(ch);
        return newState;
    }

    getTokens(code) {
        return code.replace(/ (.). (? =[^$])/g."$1").split(",")}tokenize(code) {
        this.tokens = [];
        const reader = new TokenReader(this.getTokens(code))
        this.tokenText = "";
        this.token = new SimpleToken();
        let chI = 0;
        let ch = 0;
        let state = DfaState.Initial;
        try {
            while ((chI = reader.read())) {
                ch = chI;
                // console.log(ch);
                switch (state) {
                    case DfaState.Initial:
                        state = this.initToken(ch);
                        break;
                    case DfaState.Id:
                        if (this.isAlpha(ch) || this.isDigit(ch)) {
                            this.appendTokenText(ch);
                        } else {
                            state = this.initToken(ch);
                        }
                        break;
                    case DfaState.GT:
                        if (ch === '=') {
                            this.token.type = TokenType.GE;
                            state = DfaState.GE;
                            this.appendTokenText(ch)
                        } else {
                            state = this.initToken(ch)
                        }
                        break;
                    case DfaState.GE:
                    case DfaState.Assignment:
                    case DfaState.Plus:
                    case DfaState.Minus:
                    case DfaState.Star:
                    case DfaState.Slash:
                    case DfaState.SemiColon:
                    case DfaState.LeftParen:
                    case DfaState.RightParen:
                        state = this.initToken(ch);          // Exit the current state and save the Token
                        break;
                    case DfaState.IntLiteral:
                        if (this.isDigit(ch)) {
                            this.appendTokenText(ch)
                        } else {
                            state = this.initToken(ch)
                        }
                        break;
                    case DfaState.Id_int1:
                        if (ch === 'n') {
                            state = DfaState.Id_int2;
                            this.appendTokenText(ch)
                        } else if (this.isDigit(ch) || this.isAlpha(ch)) {
                            state = DfaState.Id;    // Switch back to Id
                            this.appendTokenText(ch);
                        } else {
                            state = this.initToken(ch)
                        }
                        break;
                    case DfaState.Id_int2:
                        if (ch === 't') {
                            state = DfaState.Id_int3;
                            this.appendTokenText(ch)
                        } else if (this.isDigit(ch) || this.isAlpha(ch)) {
                            state = DfaState.Id;    // Switch back to Id
                            this.appendTokenText(ch);
                        } else {
                            state = this.initToken(ch)
                        }
                        break;
                    case DfaState.Id_int3:
                        if (this.isBlank(ch)) {
                            this.token.type = TokenType.Int;
                            state = this.initToken(ch)
                        } else {
                            state = DfaState.Id;
                            this.appendTokenText(ch)
                        }
                        break;
                    default:
                        break; }}// Send the last token
            if (this.tokenText.length > 0) {
                this.initToken(ch)
            }

        } catch (error) {
            console.log(error)
        }

        return new  TokenReader(this.tokens)
    }
}
// const simpleLexer = new SimpleLexer();

Copy the code