I. Basic introduction
The code compilation process can be roughly divided into four stages:
- Preprocessing
- Compilation
- Assembly
- Linking
Looked at in more detail, there are seven stages:
- Preprocessing
- Lexical analysis
- Syntax analysis
- Generating intermediate code
- Generating object code
- Assembly
- Linking
The main difference is that compilation expands into four parts: lexical analysis, syntax analysis, generating intermediate code, and generating object code.
The compiler is responsible for five of these steps: preprocessing, lexical analysis, syntax analysis, generating intermediate code, and generating object code. Its input is source code and its output is intermediate code. The intermediate code divides the compiler into a front end and a back end: the front end is responsible for parsing the source and producing the Abstract Syntax Tree (AST), and the back end is responsible for converting the Abstract Syntax Tree into intermediate code.
P.S.: Intermediate code is already so close to actual assembly code that it can be translated almost directly. The bulk of the remaining work is compatibility with different CPUs and filling in templates, which is done by the compiler back end or the assembler.
II. The compiler
The mainstream C/C++ compiler is GCC, an influential compiler released by GNU that has become the default on Linux and Unix systems. Xcode for iOS development currently uses the Clang + LLVM compiler; before Xcode 4 it used GCC. Because GCC's support for Objective-C was not very good, Apple switched to Clang + LLVM as the default compiler.
The main difference between the two is that GCC is directly responsible for all five steps of the compilation process, whereas Clang + LLVM splits them up: Clang acts as the compiler front end (remember what a compiler front end does?) and LLVM as the compiler back end. The reverse has also happened historically, with GCC serving as the compiler front end and LLVM as the compiler back end.
1. A brief history of Clang
Clang (pronounced /klæŋ/) is a compiler front end for the C, C++, Objective-C, and Objective-C++ programming languages. It uses LLVM (originally the Low Level Virtual Machine) as its back end. Its goal is to provide an alternative to the GNU Compiler Collection (GCC). It was developed under the auspices of Apple by Chris Lattner, and its source code is released under a BSD-like open-source license from the University of Illinois at Urbana-Champaign. The Clang project includes the Clang front end and the Clang static analyzer.
Clang's mission was defined before it was born: to kill the damn GCC. With LLVM + Clang, Apple's development landscape changed for good, and that was effectively the end of GCC there. To be fair, GCC has many advantages, such as broad platform support, popularity, and the fact that it is written in C and needs no C++ compiler to build. To Apple, those advantages may even be weaknesses; what Apple needed was speed. This is where Clang comes in. Besides being fast, it is compatible with GCC, has a small memory footprint, produces readable diagnostics, and is easy to extend and integrate with an IDE. One benchmark claims Clang compiles Objective-C code three times faster than GCC.
2. iOS compilation process
Both Objective-C and Swift compile via LLVM; Objective-C uses Clang as the compiler front end, while Swift uses its own front end. The compiler front end mainly performs syntax analysis and semantic analysis and generates intermediate code. Type checking happens during this process, and if errors or warnings are found, the offending line is marked.
The compiler back end performs machine-independent code optimizations, then machine-specific code optimizations, and finally generates machine code for the different system architectures.
C++ and Objective-C are compiled languages: before a compiled language can execute, its source must first be turned into machine code by the compiler.
3. Summary: the Clang + LLVM compilation process for a single source file
The figure above shows the workflow after pressing Cmd + B in Xcode.
- Preprocess: its main job is to replace macros, remove comments, expand header files, and generate the .i file.
- Lexical Analysis: splits the code into tokens, such as parentheses, equals signs, and strings. In computer science terms, this is the process of converting a sequence of characters into a sequence of tokens.
- Semantic Analysis: verifies that the syntax is correct and assembles the nodes into an Abstract Syntax Tree (AST). In Clang this is done by the Parser and Sema.
- Static Analysis: analyzes the source code to detect errors automatically.
- Code Generation: IR (intermediate code) generation begins. CodeGen walks the syntax tree top-down and translates it into LLVM IR. IR is the output of the front end and the input of the back end of the compilation process.
- Optimize: LLVM optimizes the IR. Xcode builds at -O1, -O3, -Os, and you can also write your own Pass (see "Writing an LLVM Pass" in the LLVM documentation). If Bitcode is enabled, Apple can perform further optimization; if a new back-end architecture appears, machine code can still be generated from this optimized Bitcode.
- Assemble: generates target-specific object files (Mach-O).
- Generate the executable file.
Through these steps, code written in various high-level languages is transformed into object code that the machine can understand and execute.
III. What do the seven stages do?
1. Preprocessing: processing macro definitions
Preprocessing handles macro directives such as #define, #include, and #if. Implementations differ: some compilers run a separate preprocessing pass before lexical analysis, replacing everything that starts with #, while others preprocess during lexical analysis, replacing a macro only when the lexer encounters a word beginning with #. Preprocessing first and then lexing may be more intuitive, but in practice GCC preprocesses during lexical analysis.
2. Lexical analysis: output tokens with states
The main implementation principle of lexical analysis is a state machine, which reads characters one by one and transitions state according to each character it reads. Unlike a human, a computer cannot directly understand the content of the source code; it can only recognize it word by word. Lexical analysis therefore splits the source into individual words and tags each with a state for the later syntax analysis. For example, when scanning a=1 or a==1 from left to right, upon reading the first = the computer cannot yet tell whether it is an assignment operator or part of an equality comparison; it must look at the character after the = to decide. If the next character is 1, the = is an assignment operator; if the next character is another =, the two together form the equality operator ==. The token is then labeled according to the result of this recognition.
For example, here is GCC's lexical-analysis state machine (from Compilation Systems Perspectives):
3. Syntax analysis: output an Abstract Syntax Tree
After lexical analysis, the compiler knows what each word is, but not what the words mean in combination. Syntax analysis, performed by the compiler front end (such as GCC or Clang), resolves this.
A simple way to implement syntax analysis is template matching: the basic grammar rules of the programming language are abstracted into templates, the source is matched against them, and the AST (Abstract Syntax Tree) is built at the same time.
For example, int a = 10; matches the template "type variable-name = constant;".
Once the syntax parses successfully, we get the AST (Abstract Syntax Tree).
Take this code as an example:
int fun(int a, int b) {
int c = 0;
c = a + b;
return c;
}
Its syntax tree is as follows:
A syntax tree turns source code from a flat string into a tree-shaped data structure that is easier for a computer to understand and process. But it is still a long way from intermediate code.
4. Generate IR (Intermediate Representation)
In fact, an abstract syntax tree could be translated directly into object code (assembly code). However, the assembly syntax of different CPUs is not consistent. For example, as the article "AT&T vs. Intel Assembly Style" notes, the positions of the source and destination operands are opposite in the Intel and AT&T syntaxes, and AT&T syntax prefixes registers and immediates while Intel syntax does not. A more efficient approach is therefore to first generate language-independent, CPU-independent intermediate code (IR), and then generate the assembly code for each CPU from it.
Generating Intermediate Representation (IR) code is an important step that is independent of language, CPU, and implementation. Intermediate code can be understood as a very abstract, very general code: an objective, neutral description of what the code is supposed to do. If C and Java are like Chinese and English, intermediate code is, in a sense, Esperanto.
Intermediate code is also the dividing line between the front and back ends of the compiler: the front end converts source code into intermediate code (IR), and the back end converts intermediate code into object code (assembly code). Generating IR from the abstract syntax tree can be roughly divided into two steps. The first step is to generate the intermediate code (IR); the second step is to optimize it.
Using GCC as an example, generating intermediate code can be divided into three steps:
- Syntax tree to high-level GIMPLE
- High-level GIMPLE to low-level GIMPLE
- Low-level GIMPLE through CFG and SSA to intermediate code
Note: for iOS client development with Xcode, LLVM is responsible for converting the abstract syntax tree to intermediate code (IR).
GIMPLE is an intermediate representation produced by the GCC compiler in which each instruction contains at most three operands and one operator, basically of the form dst = src1 @ src2. Since a GIMPLE instruction can read at most two source operands, a complex expression is expanded into a series of GIMPLE instructions, a process known as gimplification.
4.1 Syntax tree to high-level GIMPLE
This step mainly deals with registers and the stack. For example, c = a + b has no direct assembly equivalent; generally, the result of a + b must first be stored in a register, and that register is then assigned to c. Expressed in C, this step effectively produces:
int temp = a + b;
c = temp;
In addition, when a new function is called, it gets its own stack frame, and the operations that set up that frame must be made explicit in GIMPLE.
4.2 High-level GIMPLE to low-level GIMPLE
This step mainly separates variable definitions from executable statements and normalizes return statements. For example:
int a = 1;
a++;
int b = 1;
Will be processed into:
int a = 1;
int b = 1;
a++;
The advantage of this is that it is easy to calculate how much stack space a function really needs.
In addition, return statements are handled uniformly and placed at the end of functions, such as:
if (1 > 0) {
return 1;
}
else {
return 0;
}
Will be processed into:
if (1 > 0) {
goto a;
}
else {
goto b;
}
a:
return 1;
b:
return 0;
4.3 Low-level GIMPLE through CFG and SSA to intermediate code
This step mostly performs various optimizations, adds version numbers to variables (SSA form), and so on. I don't know much about it, and understanding the compilation process doesn't require going deeper here.
5. Generate object code (assembly code)
Object code is also called assembly code. Since the intermediate code is already so close to real assembly, it can be translated almost directly; the main workload is generating different assembly for different CPUs, handling CPU compatibility, and filling in templates. The final assembly output contains not only assembly instructions but also assembler directives for the file.
6. Assembly: the assembler generates binary machine code
The assembler takes the assembly code, converts it into binary machine code, and produces an object file (with a .o suffix); this machine code can be recognized and executed directly by the CPU. Since the assembly code is organized into sections, the resulting object file (machine code) is also divided into sections. This is because:
- Data is separated from code: code is read-only while data is writable, which simplifies permission management, prevents instructions from being overwritten, and improves security.
- Modern CPUs typically have separate data and instruction caches, and keeping the two apart improves cache hit rates.
- When multiple processes run the same program simultaneously, they can share its instructions, saving memory.
7. Linking
Linking relocates the references in an object file (.o) so that calls to functions defined in other object files point to the right place.
In an object file, it is impossible for every variable and function to be defined within that file. For example, if strlen is an external function being called, main.o must be linked with the object file containing the strlen implementation. A function call compiles down to a jump instruction followed by the address of the callee, but while generating main.o the address of the strlen() function is unknown, so a placeholder 0 is written; only at the final link is it replaced with the real address.
The linker knows where to patch by consulting the relocation tables: each section that may need relocation has one. During linking, the linker looks up each entry in the relocation table, finds the corresponding address in the other object files, and relocates the reference.
We also sometimes hear the term dynamic linking, which means relocation happens at run time rather than at build time. Dynamic linking saves memory, but it can also hurt load-time performance; it will not be explained in detail here. Those interested can read the book Programmer's Self-Cultivation: Linking, Loading and Libraries.
A final note
Finally, I would like to thank the author of "Basic Compilation Principles and Language Knowledge that Big Front-end Developers Need to Know" for contributing such a high-quality article clarifying the details of the entire compilation process. A great deal of the content in this article comes from the original author, and the copyright belongs to them. If there is any infringement, please let us know.
References
- Mp.weixin.qq.com/s?__biz=MzI…
- sp1.wikidot.com/gccpcode
- Segmentfault.com/a/119000002…