A compiler is a program that translates one language into another.

Babel, for example, is a compiler that translates ES6 version JS into ES5 version JS. From this point of view, translation software that translates English into Chinese also belongs to compilers.

Ordinary programs cannot be executed directly by the CPU because the CPU can only recognize machine instructions. So to execute a program, you first translate the program written in the high-level language into assembly code (Java takes an extra step to translate the high-level language into bytecode) and then translate the assembly code into machine instructions that the CPU can recognize and execute.

Because assembly language and machine language one – to – one correspondence, and assembly language is more readable. Therefore, the teaching materials on computer principles usually use assembly language instead of machine language to explain machine instructions.

The four arithmetic expressions written in this article need to be translated into machine instructions and executed by the compiler. See an example for the specific process:

// The CPU cannot recognize it
10 + 5

// Translate into assembly language
push 10
push 5
add

// Finally translated into machine instruction, assembly code and machine instruction one-to-one correspondence
// Machine instructions are composed of 1 and 0. The following instructions are not real instructions
0011101001010101
1101010011100101
0010100111100001
Copy the code

The four arithmetic compilers, although very simple, can only compile four arithmetic expressions. But the front end of the compilation principle is almost always involved: lexical analysis, syntax analysis. There is also code generation for the back-end part of the compilation principle. Whether it is a simple or complex compiler, the compilation steps are similar, but complex compilers are more difficult to implement.

One might ask, what’s the advantage of learning how to compile?

I think a good understanding of the internals of the compilation process will make you a better senior programmer. In addition, here is a quote from zhihu netizen – to be more specific:

  1. It is easier to understand what is equivalent and what is different in a language
  2. The differences between languages can be compared more objectively
  3. And less susceptible to being fooled by advocates of a particular language
  4. Learning a new language is also more efficient
  5. Converting from language A to language B is a general requirement, and learning the principles of compilation makes it easier to handle such requirements

Ok, now let’s look at how to write a four-operation compiler.

Lexical analysis

A program is simply a series of characters stored in a text file, and lexical analysis is used to break these characters down into tokens (also known as terminators) according to certain rules, ignoring Spaces and comments.

Example:

// Program code
10 + 5 + 6

// The token obtained after lexical analysis
10
+
5
+
6
Copy the code

terminator

Terminators are the basic elements used in the language and cannot be decomposed.

Terminators in the four operations include symbols and integer constants (unary operators and floating point operations are not currently supported).

  1. symbol:+ - * / ()
  2. Integer constants: 12, 1000, 111…

Lexical analysis code implementation

function lexicalAnalysis(expression) {
    const symbol = ['('.') '.'+'.The '-'.The '*'.'/']
    const re = /\d/
    const tokens = []
    const chars = expression.trim().split(' ')
    let token = ' '
    chars.forEach(c= > {
        if (re.test(c)) {
            token += c
        } else if (c == ' ' && token) {
            tokens.push(token)
            token = ' '
        } else if (symbol.includes(c)) {
            if (token) {
                tokens.push(token)
                token = ' '
            } 

            tokens.push(c)
        }
    })

    if (token) {
        tokens.push(token)
    }

    return tokens
}

console.log(lexicalAnalysis('100 + 23 + 34 * 10/2 ')) 
/ / / "100", "+", "23", "+", "34", "*", "10", "/", "2"]
Copy the code

Syntax rules for four operations (syntax rules are hierarchical)

  1. x*Is zero or more occurrences of x
  2. x | y, indicating that x or y will appear
  3. ( )Parentheses, used for grouping words in language formation

The following rule, from left to right, indicates that the expression on the left can be subdivided further down into the expression on the right until it is no longer subdivided.

  • expression: addExpression
  • addExpression: mulExpression (op mulExpression)*
  • mulExpression: term (op term)*
  • term: ‘(‘ expression ‘)’ | integerConstant
  • op: + - * /

AddExpression corresponds to + – expressions, while mulExpression corresponds to * / expressions.

If you don’t understand the above rules, put them down and move on. See how to implement parsing in code.

Syntax analysis

The process of analyzing input text according to grammatical rules and determining its grammatical structure is called grammatical analysis.

The output of a general syntax analysis is an abstract syntax tree (AST) or a parse tree. But because the four operations are relatively simple, the solution here is to do code generation and error reporting in real time, so there is no need to keep the entire program structure in memory.

So let’s see how we can analyze a four-way expression 1 + 2 * 3.

Expression is matched first. Since there is only one possibility of expression down at present, that is, addExpression, it is decomposed into addExpression. And so on, The next order is MulExpression, term, 1 (integerConstant), + (op), mulExpression, term, 2 (integerConstant), * (op), mulExpression, term, 3 (integerCons) Tant).

As shown below:

One might wonder why an expression is so complicated. Expression is subtracted by addExpression, and addExpression is subtracted by mulExpression. MulExpr is a higher level of operation than addExpr.

1 + 2 * 3 compileExpression | compileAddExpr | | compileMultExpr | | | compileTerm | | | | _ matches IntegerConstant push 1 | | | _ | | matches' + '| | compileMultExpr | | | compileTerm | | | | _ matches integerConstant push 2 | | | matches' * '| | | compileTerm | | | | _ matches IntegerConstant push | 3 | | _ compileOp (' * ') * | | _ compileOp (' + ') + | _Copy the code

There are many algorithms that can be used to build parsing trees, but only two are covered here.

Recursive descent analysis

Recursive descent analysis, also known as top-down analysis. The token stream is recursively analyzed step by step according to the syntax rules, and if a non-terminal is encountered, the analysis continues until the terminal is reached.

LL (0) analysis

Recursive descent analysis method is a simple and efficient algorithm, and LL(0) has an additional step on this basis. When the first token is not enough to determine the element type, “look ahead” for the next character may solve this uncertainty.

The above is a brief introduction to the two algorithms, see the code implementation below.

Expression code generation

The four common expressions we use are infix expressions, but infix expressions are not easy for computers to evaluate. So in the code generation phase, infix expressions are converted into postfix expressions.

Postfix expression

Postfix expression, also known as inverse Polish, refers to the absence of parentheses, the operators are placed after two operands, and all computations are performed strictly from left to right in the order in which the operators appear (regardless of operator precedence rules).

Example:

Infix expression: 5 + 5 converts to suffix expression: 5, 5 +, and then generates code based on the suffix expression.

// 5 + 5 converts to 5 5 + regenerates code
push 5
push 5
add
Copy the code

Code implementation

The theoretical knowledge of compilation principle is like a mystery, often let people see the clouds and fog, but really start to do it, you will find that it is actually quite simple.

If the above theoretical knowledge does not understand, it does not matter, first look at the code implementation, and then combined with the theoretical knowledge to see.

Note: We need to introduce the previous lexical analysis code here.

// Assembly code generator
function AssemblyWriter() {
    this.output = ' '
}

AssemblyWriter.prototype = {
    writePush(digit) {
        this.output += `push ${digit}\r\n`
    },

    writeOP(op) {
        this.output += op + '\r\n'
    },

    // Output assembly code
    outputStr() {
        return this.output
    }
}

// Parser
function Parser(tokens, writer) {
    this.writer = writer
    this.tokens = tokens
    // tokens array index
    this.i = -1
    this.opMap1 = {
        '+': 'add'.The '-': 'sub',}this.opMap2 = {
        '/': 'div'.The '*': 'mul'
    }

    this.init()
}

Parser.prototype = {
    init() {
        this.compileExpression()
    },

    compileExpression() {
        this.compileAddExpr()
    },

    compileAddExpr() {
        this.compileMultExpr()
        while (true) {
            this.getNextToken()
            if (this.opMap1[this.token]) {
                let op = this.opMap1[this.token]
                this.compileMultExpr()
                this.writer.writeOP(op)
            } else {
                // the + - operator is not matched
                // Back the token index one bit
                this.i--
                break}}},compileMultExpr() {
        this.compileTerm()
        while (true) {
            this.getNextToken()
            if (this.opMap2[this.token]) {
                let op = this.opMap2[this.token]
                this.compileTerm()
                this.writer.writeOP(op)
            } else {
                // The corresponding operator is not matched
                // Back the token index one bit
                this.i--
                break}}},compileTerm() {
        this.getNextToken()
        if (this.token == '(') {
            this.compileExpression()
            this.getNextToken()
            if (this.token ! =') ') {
                throw 'missing close parenthesis :)'}}else if (/^\d+$/.test(this.token)) {
            this.writer.writePush(this.token)
        } else {
            throw 'Wrong token: the first' + (this.i + 1) + 'a token (' + this.token + ') '}},getNextToken() {
        this.token = this.tokens[++this.i]
    },

    getInstructions() {
        return this.writer.outputStr()
    }
}

const tokens = lexicalAnalysis('100 + 10 * 10')
const writer = new AssemblyWriter()
const parser = new Parser(tokens, writer)
const instructions = parser.getInstructions()
console.log(instructions) // Outputs the generated assembly code
/*
push 100
push 10
push 10
mul
add
*/
Copy the code

Simulation execution

Now let’s simulate the CPU executing machine instructions. Since assembly code corresponds to machine instructions one by one, we can create an emulator that executes assembly code directly. Before creating the simulator, let’s explain the operation of the relevant instructions.

The stack

In memory, the stack is characterized by the fact that only insert and delete operations can be performed at the same end, i.e. only push and POP operations.

push

The push instruction pushes an operand onto the stack.

pop

The pop instruction is used to pop an operand off the stack.

add

The add instruction performs two pop operations, pops two operands a and b, then executes a + B and pushes the result onto the stack.

sub

The sub instruction performs two pop operations, pops two operands A and B, then executes a-b and pushes the result onto the stack.

mul

The muL instruction performs two pop operations, pops two operands A and B, then executes a * B and pushes the result onto the stack.

div

The sub instruction performs two pop operations, pops two operands A and b, executes a/B, and pushes the result onto the stack.

All instructions of the four operations have been explained, do you think it is very simple?

Code implementation

Note: Lexing and parsing code need to be introduced

function CpuEmulator(instructions) {
    this.ins = instructions.split('\r\n')
    this.memory = []
    this.re = /^(push)\s\w+/
    this.execute()
}

CpuEmulator.prototype = {
    execute() {
        this.ins.forEach(i= > {
            switch (i) {
                case 'add':
                    this.add()
                    break
                case 'sub':
                    this.sub()
                    break
                case 'mul':
                    this.mul()
                    break
                case 'div':
                    this.div()
                    break                
                default:
                    if (this.re.test(i)) {
                        this.push(i.split(' ') [1])}}})},add() {
        const b = this.pop()
        const a = this.pop()
        this.memory.push(a + b)
    },

    sub() {
        const b = this.pop()
        const a = this.pop()
        this.memory.push(a - b)
    },

    mul() {
        const b = this.pop()
        const a = this.pop()
        this.memory.push(a * b)
    },

    div() {
        const b = this.pop()
        const a = this.pop()
        // Floating point arithmetic is not supported, so it is rounded here
        this.memory.push(Math.floor(a / b))
    },

    push(x) {
        this.memory.push(parseInt(x))
    },

    pop() {
        return this.memory.pop()
    },

    getResult() {
        return this.memory[0]}}const tokens = lexicalAnalysis('100 plus 10' times 10 minus 100/10 plus 8 times 4 plus 2 '.)
const writer = new AssemblyWriter()
const parser = new Parser(tokens, writer)
const instructions = parser.getInstructions()
const emulator = new CpuEmulator(instructions)
console.log(emulator.getResult()) / / 1138
Copy the code

A simple four-operation compiler has been implemented. Let’s write another test function and run it to see if it works as expected:

function assert(expression, result) {
    const tokens = lexicalAnalysis(expression)
    const writer = new AssemblyWriter()
    const parser = new Parser(tokens, writer)
    const instructions = parser.getInstructions()
    const emulator = new CpuEmulator(instructions)
    return emulator.getResult() == result
}

console.log(assert('1 + 2 + 3'.6)) // true
console.log(assert('1 + 2 * 3'.7)) // true
console.log(assert('10/2 * 3'.15)) // true
console.log(assert('10 plus 10 over 2'..10)) // true
Copy the code

All tests are correct. In addition, the complete source code is attached, and it is recommended that students who do not understand read more than two times.

Take it to the next level

For industrial compilers, this four-rule compiler is the toy of toys. But you can’t be fat all at once, so it’s best to take a step-by-step approach to learning the principles of compilation. Here is a more advanced compiler that can compile a Jack language (java-like language) with syntax like this:

class Generate {
    field String str;
    static String str1;
    constructor Generate new(String s) {
        let str = s;
        return this;
    }

    method String getString(a) {
        returnstr; }}class Main {
    function void main(a) {
        var Generate str;
        let str = Generate.new("this is a test");
        do Output.printString(str.getString());
        return; }}Copy the code

The output of the above code is: this is a test.

Want to implement such a compiler?

This compiler comes from a book called “Elements of Computer Systems,” and it starts with chapter 6, Chapter 11 covers the assembly compiler (which translates assembly language into machine language), the VM compiler (which translates bytecode-like VM language into assembly language), and the Jack language compiler (which translates high-level Jack language into VM language). Each chapter has detailed explanations and experiments, as long as you follow the step-by-step experiments, the final implementation of such a compiler.

If the compiler is written, where is the machine language executed last?

This book has you in mind. It has five chapters from chapter 1 to chapter 5. Teach you to start from the logic gate, gradually set up the arithmetic logic unit ALU, CPU, memory, and eventually build a modern computer. The program you compiled with the compiler then runs on the computer.

In addition, chapter 12 teaches you how to write the operating system’s various library functions, such as the Math library (which contains Math operations), the Keyboard library (which displays data to the screen when you press the Keyboard), memory management, and so on.

Want to see what happens when the 12-chapter experiment is done? Here I provide a few of this simulation computer running program DEMO GIF, for your reference.

In the upper right corner is the “computer” screen, and the rest is the “computer” stack area and instruction area.

I’ve already done all the experiments for this book (in 3 hours a day, in 2 months), and the answers are on my Github if you’re interested.

The resources

  • Elements of Computer Systems