Principle 1: Want to understand the principle of compilation? This article is enough

Why learn compilation principles

As programmers, whether front-end or back-end, compilation technology is closely related to our work. In practice, we often encounter scenarios that require compilation techniques. For example, front-end developers want to know how TypeScript translates from one language to another, how Babel compiles JavaScript, and so on. Learning compiler technology can improve our competitiveness in the workplace, but also help programmers to go further on the road of technology. By the end of this article you will have a basic idea of how compilation works, such as:

Know why you should learn to compile
Understand how compilation works in relation to the development language we use in our daily development
Understand the place of compilation in the language system and the structure of the compilation system
Learn about lexical analysis, grammatical analysis, semantic analysis, which we often hear about in our work
It’s more important to know how the code we write is recognized and executed by the computer.

What is compilation

The principle of compilation describes how to convert a high-level programming language into a machine language that the computer hardware can recognize so that the computer can process it

Compilation and computer programming languages

The languages we use in daily development are generally advanced syntax such as JAVA, Python, PHP, JavaScript, etc., but computers can only recognize machine codes such as 0 and 1. So how do these high-level languages translate into zeros, ones, etc., that machines can recognize? That’s all it takescompileFirst of all, let’s look at the relationship between compilation and computer program language through the following picture, which helps us intuitively understand the role of compilation.Note: Each machine corresponds to an assembly language

Programming language conversion

The conversion of a source program in one language into a target language program without changing its semantics

There are two real implementations, compilation and interpretation

Compilation: refers to the conversion from high-level language to low-level language, the whole program translation. Commonly used for example: C, C ++, Delphi,Fortran, Pascal, Ada
Interpretation: Takes a statement input from a high-level language, interprets it and controls the computer’s execution, immediately gets the result of the execution of the sentence, and then accepts the next statement. Similar to interpretation, sentence by sentence to explain. Common examples are python

The interpretation takes the source program as the input, and does not produce the target program. Advantages: intuitive and easy to understand, simple structure, easy to realize man-machine dialogue. Disadvantages: low efficiency (no target program generated, needs to be re-executed each time, slow)

The conversion process of compilation

Compile -> Run
Compile -> Assemble -> Run

The position of a compiler in a language-processing system

Now that we know how compilation relates to programming languages, let’s take a look at the compiler’s place in a language-processing system, as shown in the figure below

Build the structure of the system

So how does a machine translate a high-level language into an assembly or machine language program?

Let’s take a look at the example of manual English translation, the illustration of Harbin Institute of Technology compilation principle quoted hereIn the figureThe middle is very important and serves mainly as a bridgeFor example, the middle representation in the figure can be expressed in a variety of languages.

It can be seen from the above figure that semantic analysis first needs to divide sentence components. Then how do we divide sentence components?

Firstly, the part of speech or part of speech of each word in the sentence is analyzed through lexical analysis
Then the sentence structure can be obtained by identifying various phrases in the sentence through grammatical analysis
Then, semantic analysis is carried out to find out what components each phrase acts as in the sentence according to the sentence structure, so as to determine the relationship between each noun component and each core predicate verb
Finally, the intermediate representation is given

In fact, the compiler also goes through these steps when it works, and we becomePhases (the logical organization of a computer in which multiple phases may be combined for implementation), which can be divided into two parts:Analyze source language and generate object codeIn the compiler, they correspond to the front-end and back-end parts of the compiler respectively. The structure of the compiler is shown belowNow that we know the structure of the compiler, let’s start at the front end of the compiler and see what happens at the various stages of lexical analysis, syntax analysis, semantic analysis, and so on.

Lexical Analysis (scanning)

The first stage of compilation scans the characters of the source program line by line from left to right, identifying individual words (which are the smallest syntactic units in a high-level language that are made up of characters in meaning) and determining word types. The recognized words are converted into a unified in-machine representation that is the lexical unit abbreviated Token

Token: < type code, attribute value >

Name to explain

One-word code: For example, if a keyword is unique and predetermined, assign a type code to each keyword
Multi-word one code: For example, all identifiers are assigned the same type code as one type of word. In order to distinguish different identifiers, the second component of token, “attribute value”, is used to store the specific literal values of different identifiers
One type and one code: Different types of constants are constituted in different ways. For example, we assign a kind code to each type of constant. In order to distinguish different constants of the same type, we also use the second component of token “attribute value” to store the specific value of each constant

The figure below is an example of a token sequence after lexical analysisAn effective tool for describing lexical rules isNormal typeandFinite automaton.Normal type: is used to determine whether a word matches a program language specification.Finite automaton: Word and formal comparison using finite automata

Parsing

The definition of parsing

The parser recognizes various phrases from the token sequences output by the lexical analyzer and constructs a parse tree, which describes the grammatical structure of a sentence

The rules of grammatical analysis

Rules of grammar, also known as grammars, dictate how words form phrases, sentences, procedures, and procedures.

The syntax rules are marked as follows, meaning that A is defined as B or C

BNF:A::=B|C

< sentence >::=< subject >< predicate >< object >

< main >::=< set >< name >

Let’s look at the syntax of an assignment statement:

A::=V=E
E::=T|E+T
T::=F|T*F
F::=V|(E)|C
V: : = identifier
C: : = constant

That is to add, subtract, multiply and divide by expressions of identifiers or constants

The method of grammatical analysis

Derive and reduce

Derivation: left-most derivation, right-most derivation
Reduction: most right reduction, most left reduction, the derivation of the inverse process is reduction

Right-most derivation and left-most reduction:Most left derivation, most right reduction:

The syntax tree

Computer passThe syntax treeThe process of parsing can also be marked by an upside-down tree calledThe syntax tree. The correct number of leaf nodes in the syntax tree must be a sign of the expression, for example

Analysis tree of assignment statements:

Analysis tree of variable declaration statements:

Let’s start with the variable declaration syntax (which is a set of rules) :

<D> -> <T> <IDS>;
<T> -> int | real | char | bool
<IDS> -> id | <IDS>, id
Copy the code

Semantic analysis

There are two main tasks for semantics

Collect attribute information for the identifier

Kind: simple variable, compound variable (array, record,…) , process,…
Type: Integer, real, character, Boolean, pointer…
Storage location and length

value
scope
Parameter and return value information, parameter number, parameter type, parameter transmission mode, return value type,…

Identifier information collected during the semantic analysis phase is stored in oneThe symbol tableIn, each identifier corresponds to the symbol tableA recordEach field of the record records each attribute of the identifier. Symbol tables usually have oneString tableUsed to store identifiers and character constants used in programs,The Name is divided into two parts, one for the starting position of the identifier in the string table, and the other for the length of the identifier, the symbol table is as follows:In addition to the symbol table there areOften scale(register all kinds of regular scales);Label table(Definition and application of registration label, not commonly used);Entrance to watch(layer number of registration process, program symbol table entry, etc.), the generation of various tables is mostly in the lexical analysis stage but maintained in the subsequent stages;Intermediate code table

2. Semantic checking

Undeclared use of variables or procedures
Variable or procedure names are declared repeatedly
Operation component types do not match
Type mismatch between operator and operand
- Array subscripts are not integers
- Use array access operators for non-array variables
- Use procedure call operators for non-procedure names
- Procedure call ** argument type or number does not match **
- The return type of the function is incorrect

Intermediate code generation

Usually implemented in conjunction with semantic analysis. Analyze the meaning of all kinds of grammar categories identified by grammar analysis, and conduct preliminary translation to produce a kind of code between source code and object code inspection

Common intermediate code representation

3. A three-address Code consists of a sequence of instructions similar to assembly language, each of which has up to Three operand.
Syntax structure Trees/Syntax Trees
Reverse polish type

Three address instruction representation:

Quadruples, (op, y, z, x)
Three yuan (Triples)
An Indirect triples

An example of intermediate code generation is shown in the figure below

Code optimization

To process and transform the previously generated intermediate code in order to produce more efficient object code in the end, it is necessary to follow the principle of equivalent transformation. Optimization aspects include: extraction of common subexpressions, merging known quantities, deletion of useless statements, and circular optimization.

Object code generation

To convert optimized intermediate code into low-level language for a specific machine

Object code form:

- Absolute-instruction code: Object code that can be executed immediately

Assembly instruction code: assembly language program, need to run after assembly Chen Xu assembly
Relocatable instruction code: all target modules are first connected to determine the position of variables and constants in main memory, and then the absolute instruction code that can be run can be installed in main memory

other

Error handling

If the source program has errors, the compiler should try to find them and report them to the user. This is done by specialized error handlers. Error type:

Grammatical errors: Detected during lexical analysis and grammatical analysis
Semantic errors: generally detected during semantic analysis
Logic errors: undetectable, such as an infinite loop, generally not handled because it cannot be detected at compile time

through

Refers to the source program or the intermediate results of the source program from beginning to end a scan, and do the relevant processing, generate new intermediate results or object code. Times have nothing to do with the meaning of stages

Multiple scans:advantages: Save memory space, improve the quality of the target code, make the logical structure of the compilation clear.disadvantages: Long compilation time. It is better to have as few passes as possible if memory permits

Compiler generation

Write compilers directly in machine language
Compile compiler with assembly language, compiler core part of the commonly used assembly language
It is also common to write compilers in high-level languages
Since the compiler
Compilation tools LEX (parsing) and YACC(for automatically generating LALR parsing tables)
Porting (compilers of the same language are ported between different types of machines)

There are three aspects to building a compiler for a language on a machine:

The source language
The target language
Compiling method

The basic knowledge of compilation is these, the follow-up will continue to study and record in-depth, like trouble like ha