What is LLVM?

LLVM is a collection of compiler and toolchain technologies. One of its subprojects, LLD, is the built-in linker.

The compiler compiles each source file into a Mach-O object file; the linker then merges the multiple Mach-O files in the project into one executable.

Building a project in Xcode amounts to running a series of command-line scripts. The screenshot below shows the script Xcode runs to compile main.m.

The script finds the clang command in the bin directory, passes it parameters such as the source language and the target architecture, appends the build settings configured in Xcode, and outputs a .o file.

LLVM compiler architecture

The compiler is divided into three parts: the front end, the general optimizer, and the back end. The optimizer in the middle stays the same no matter which language or architecture is involved.

To support a new language, you only need to add a compiler front end.

To support a new architecture, you only need to add a compiler back end for that architecture.
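The benefit of this split can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the "IR" is a list of tuples, and every function name here (frontend_infix, backend_stack, and so on) is invented for illustration and has nothing to do with LLVM's real APIs. It shows two front ends and two back ends sharing one optimizer.

```python
# Toy three-phase compiler. All names are hypothetical, not LLVM APIs.

def frontend_infix(src):
    """Front end for source written as 'a + b' or 'a * b'."""
    left, op, right = src.split()
    return [("push", int(left)), ("push", int(right)),
            ("add" if op == "+" else "mul", None)]

def frontend_prefix(src):
    """A second front end for '+ a b' style source; it emits the same IR."""
    op, left, right = src.split()
    return frontend_infix(f"{left} {op} {right}")

def optimize(ir):
    """Shared optimizer: constant-folds push/push/op into a single push."""
    if len(ir) == 3 and ir[0][0] == ir[1][0] == "push":
        a, b = ir[0][1], ir[1][1]
        return [("push", a + b if ir[2][0] == "add" else a * b)]
    return ir

def backend_stack(ir):
    """Back end for a made-up stack machine."""
    return [f"PUSH {arg}" if op == "push" else op.upper() for op, arg in ir]

def backend_register(ir):
    """Back end for a made-up register machine; only handles folded IR."""
    out = []
    for op, arg in ir:
        if op != "push":
            raise NotImplementedError(op)
        out.append(f"mov r0, #{arg}")
    return out

def compile_source(src, frontend, backend):
    # The optimizer in the middle never changes.
    return backend(optimize(frontend(src)))

print(compile_source("2 + 3", frontend_infix, backend_stack))
print(compile_source("* 4 5", frontend_prefix, backend_register))
```

Adding a third language to this sketch means writing one more frontend_* function; adding a third target means writing one more backend_* function. Nothing else is touched.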

Clang is the C, C++, and Objective-C front end in this compiler architecture. On the command line it also acts as a “black box” driver that wraps the whole compiler pipeline: front-end commands, LLVM commands, toolchain commands, and so on.

LLVM performs the entire compilation process described above, which looks like this:

  • After you write your code, LLVM preprocesses it, for example by expanding macros in place.
  • After preprocessing, LLVM performs lexical and syntax analysis on the code to generate an AST. The AST (abstract syntax tree) is structurally leaner than the source code and faster to traverse, so it allows faster static checking and faster generation of IR (intermediate representation).
  • Finally, IR is generated from the AST. IR is a language that is closer to machine code. The difference is that it is platform-independent: machine code for several different platforms can be generated from the same IR. For iOS, the executable generated from IR is a Mach-O file.

Objective-C source file compilation process

Use the following command to view the compilation phases of an Objective-C source file:

clang -ccc-print-phases main.m

0: Input: locate the main.m file

1: Preprocessor: expand #include, #import, and macro definitions

2: Compiler: compile into IR (intermediate code)

3: Back end: turn the intermediate code into assembly code

4: Assembler: turn the assembly code into object code

5: Linker: link static and dynamic libraries

6: Bind architecture: produce code for a specific architecture

Preprocessing

Use the following command to view the work done in the preprocessing phase

clang -E main.m

Preprocessing mainly does the following things:

1. Remove every #define and expand the macros wherever they are used

2. Insert the contents of each #include file at the location of the directive. This is a recursive process

3. Delete comments and comment markers

4. Add line numbers and file identifiers for easy debugging
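The four steps above can be sketched as a toy preprocessor. This is only an illustration, assuming object-like macros and quoted includes; the preprocess function and the in-memory files dictionary are hypothetical, and Clang's real preprocessor handles far more (function-like macros, conditionals, #import, line markers on return from an include).

```python
import re

def preprocess(name, files, macros=None):
    """Toy preprocessor. `files` maps file names to their source text."""
    macros = {} if macros is None else macros
    out = [f'# 1 "{name}"']                      # step 4: file/line marker
    # Step 3: drop block comments, then line comments below.
    text = re.sub(r"/\*.*?\*/", " ", files[name], flags=re.S)
    for line in text.splitlines():
        line = re.sub(r"//.*", "", line).rstrip()
        if m := re.match(r'\s*#include\s+"(.+)"', line):
            # Step 2: insert the included file here, recursively.
            out += preprocess(m.group(1), files, macros)
        elif m := re.match(r"\s*#define\s+(\w+)\s+(.*)", line):
            # Step 1: remember the macro; the #define itself is removed.
            macros[m.group(1)] = m.group(2)
        elif line:
            for key, value in macros.items():    # step 1: expand macros
                line = re.sub(rf"\b{re.escape(key)}\b", value, line)
            out.append(line)
    return out

files = {
    "eight.h": "#define EIGHT 8  // from the header\n",
    "main.m": '#include "eight.h"\n/* entry point */\nint eight = EIGHT;\n',
}
result = preprocess("main.m", files)
print("\n".join(result))
```

The macro table is shared across the recursive calls, which is why a macro defined in the header is visible in main.m after the include.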

Compilation

Compilation takes the preprocessed file through lexical analysis, syntax analysis, semantic analysis, and optimization, and then generates the corresponding assembly code.

1. Lexical analysis

In this step, the code in the source file is converted into a token stream: the source is split into individual tokens such as keywords, identifiers, and punctuation. Each token records its source file and its line and column in the Loc field at the end of the line, which makes it easy to locate the problem when an error is reported.

Use the following command for lexical analysis

clang -Xclang -dump-tokens main.m

Take the following source code, which is on line 11 of main.m, as an example:

int main(int argc, char * argv[]) {

Lexical analysis turns it into the following tokens:

int 'int'	 [StartOfLine]	Loc=<main.m:11:1>
identifier 'main'	 [LeadingSpace]	Loc=<main.m:11:5>
l_paren '('		Loc=<main.m:11:9>
int 'int'		Loc=<main.m:11:10>
identifier 'argc'	 [LeadingSpace]	Loc=<main.m:11:14>
comma ','		Loc=<main.m:11:18>
char 'char'	 [LeadingSpace]	Loc=<main.m:11:20>
star '*'	 [LeadingSpace]	Loc=<main.m:11:25>
identifier 'argv'	 [LeadingSpace]	Loc=<main.m:11:27>
l_square '['		Loc=<main.m:11:31>
r_square ']'		Loc=<main.m:11:32>
r_paren ')'		Loc=<main.m:11:33>
l_brace '{'	 [LeadingSpace]	Loc=<main.m:11:35>
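To make the token stream less mysterious, here is a minimal tokenizer sketch in Python. It only knows the handful of keywords and punctuation that appear in the example line, and the tokenize function and its tables are assumptions made for this illustration, not Clang's lexer. It does reproduce the token kinds and the Loc columns shown in the dump.

```python
import re

KEYWORDS = {"int", "char"}
PUNCT = {"(": "l_paren", ")": "r_paren", "[": "l_square", "]": "r_square",
         "{": "l_brace", "}": "r_brace", ",": "comma", "*": "star"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[()\[\]{},*]")

def tokenize(line, filename, lineno):
    """Split one source line into (kind, text, location) triples."""
    tokens = []
    for m in TOKEN_RE.finditer(line):
        text = m.group()
        if text in PUNCT:
            kind = PUNCT[text]
        elif text in KEYWORDS:
            kind = text              # keywords use their own spelling as kind
        else:
            kind = "identifier"
        # Columns are 1-based, like the Loc field in the dump.
        tokens.append((kind, text, f"{filename}:{lineno}:{m.start() + 1}"))
    return tokens

tokens = tokenize("int main(int argc, char * argv[]) {", "main.m", 11)
for kind, text, loc in tokens:
    print(f"{kind} '{text}'\tLoc=<{loc}>")
```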

2. Syntax analysis

This step parses the token stream produced by lexical analysis into a syntax tree. In Clang it is carried out by two modules, Parser and Sema.

Each node in the tree also records its position in the source code.

The syntax is verified; if, for example, a ; is missing, an error is reported.

According to the syntax of the current language, semantic nodes are generated and all nodes are combined into an abstract syntax tree.

Use the following command for parsing:

clang -Xclang -ast-dump -fsyntax-only main.m

It parses into the following syntax tree

`-FunctionDecl 0x7ffe251a8ce0 <main.m:11:1, line:20:1> line:11:5 main 'int (int, char **)'
  |-ParmVarDecl 0x7ffe251a8b00 <col:10, col:14> col:14 argc 'int'
  |-ParmVarDecl 0x7ffe251a8bc0 <col:20, col:32> col:27 argv 'char **':'char **'
  `-CompoundStmt 0x7ffe251a9200 <col:35, line:20:1>
    |-ObjCAutoreleasePoolStmt 0x7ffe251a91b8 <line:13:5, line:18:5>
    | `-CompoundStmt 0x7ffe251a9188 <line:13:22, line:18:5>
    |   |-DeclStmt 0x7ffe251a8e30 <line:14:9, col:32>
    |   | `-VarDecl 0x7ffe251a8da8 <col:9, line:9:21> line:14:13 used eight 'int' cinit
    |   |   `-IntegerLiteral 0x7ffe251a8e10 <line:9:21> 'int' 8
    |   |-DeclStmt 0x7ffe251a8ee8 <line:15:9, col:20>
    |   | `-VarDecl 0x7ffe251a8e60 <col:9, col:19> col:13 used six 'int' cinit
    |   |   `-IntegerLiteral 0x7ffe251a8ec8 <col:19> 'int' 6
    |   |-DeclStmt 0x7ffe251a9010 <line:16:9, col:31>
    |   | `-VarDecl 0x7ffe251a8f18 <col:9, col:28> col:13 used rank 'int' cinit
    |   |   `-BinaryOperator 0x7ffe251a8ff0 <col:20, col:28> 'int' '+'
    |   |     |-ImplicitCastExpr 0x7ffe251a8fc0 <col:20> 'int' <LValueToRValue>
    |   |     | `-DeclRefExpr 0x7ffe251a8f80 <col:20> 'int' lvalue Var 0x7ffe251a8da8 'eight' 'int'
    |   |     `-ImplicitCastExpr 0x7ffe251a8fd8 <col:28> 'int' <LValueToRValue>
    |   |       `-DeclRefExpr 0x7ffe251a8fa0 <col:28> 'int' lvalue Var 0x7ffe251a8e60 'six' 'int'
    |   `-CallExpr 0x7ffe251a9128 <line:17:9, col:30> 'void'
    |     |-ImplicitCastExpr 0x7ffe251a9110 <col:9> 'void (*)(id, ...)' <FunctionToPointerDecay>
    |     | `-DeclRefExpr 0x7ffe251a9028 <col:9> 'void (id, ...)' Function 0x7ffe20b20e88 'NSLog' 'void (id, ...)'
    |     |-ImplicitCastExpr 0x7ffe251a9158 <col:15, col:16> 'id':'id' <BitCast>
    |     | `-ObjCStringLiteral 0x7ffe251a9068 <col:15, col:16> 'NSString *'
    |     |   `-StringLiteral 0x7ffe251a9048 <col:16> 'char [8]' lvalue "rank-%d"
    |     `-ImplicitCastExpr 0x7ffe251a9170 <col:26> 'int' <LValueToRValue>
    |       `-DeclRefExpr 0x7ffe251a9088 <col:26> 'int' lvalue Var 0x7ffe251a8f18 'rank' 'int'
    `-ReturnStmt 0x7ffe251a91f0 <line:19:5, col:12>
      `-IntegerLiteral 0x7ffe251a91d0 <col:12> 'int' 0

3. Static analysis (analyzing the code through the syntax tree to find non-syntax errors)

1. Error checking

For example, methods that are called but not defined, or variables that are defined but never used.

2. Type checking

Type checking generally falls into two categories: dynamic and static. Dynamic checks are performed at run time, and static checks are performed at compile time. In Objective-C you can write code that sends any message to any object; at run time, the object is checked to see whether it can respond to that message.
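The dynamic half of that distinction can be illustrated with a loose Python analogy. This is not how the Objective-C runtime is implemented; the send helper and the Dog class are made up for the example. The point is only that the check "can this object respond to this message?" happens when the call runs, not when the code is compiled.

```python
class Dog:
    def bark(self):
        return "woof"

def send(receiver, selector):
    """Look the method up by name at run time and call it."""
    method = getattr(receiver, selector, None)
    if not callable(method):
        # Nothing stopped us from writing this call; it only fails when run.
        raise AttributeError(
            f"unrecognized selector '{selector}' sent to {type(receiver).__name__}")
    return method()

print(send(Dog(), "bark"))
```

A static check, by contrast, would reject a call such as send(Dog(), "meow") before the program ever ran.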

4. CodeGen: IR code generation

CodeGen is responsible for traversing the syntax tree from top to bottom and translating it into LLVM IR.
LLVM IR is the output of the front end and the input of the LLVM back end. It is the bridge language between the two.
Bridging with the Objective-C runtime

1. In Objective-C, the memory structures of Class/Meta Class/Protocol/Category are generated in this step and placed in the designated Mach-O section (e.g. Class: __DATA, __objc_classrefs). The __DATA segment also holds static variables.

2. What does an Objective-C message send eventually compile into? An objc_msgSend call, and that translation happens in this step: each ObjCMessageExpr in the syntax tree is translated into the corresponding version of objc_msgSend, and a call through the super keyword is translated into objc_msgSendSuper.

3. Synthesize the getter and setter of each @property according to its strong/weak/copy/atomic modifiers, and handle @synthesize.

4. Generate block_layout data structures, capture variables (including __block and __weak ones), and generate _block_invoke functions.

5. ARC works by having the compiler insert memory management code, and that insertion is done in this step:

ARC: analyze the reference relationships of objects and insert ARC calls such as objc_storeStrong and objc_storeWeak

Translate ObjCAutoreleasePoolStmt into objc_autoreleasePoolPush/Pop

Automatically insert the call to [super dealloc]

Synthesize the .cxx_destruct method for each class that has ivars, to automatically release the class’s member variables in place of the MRC-era “self.xxx = nil”
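The message-send translation in item 2 can be sketched as a small lowering function. This is a hedged illustration, not Clang's CodeGen: the lower_message_send function is hypothetical, the output is a C-like string rather than LLVM IR, and &super_info is a placeholder name for the struct objc_super that objc_msgSendSuper expects. Writing the selector through sel_registerName is a readability choice for the sketch.

```python
def lower_message_send(receiver, parts):
    """Render an Objective-C message send as a C-like runtime call.

    `parts` is a list of (keyword, argument) pairs. A unary message such
    as [obj count] is written as [("count", None)].
    """
    if len(parts) == 1 and parts[0][1] is None:
        selector, args = parts[0][0], []
    else:
        # Keyword messages: the selector is every keyword followed by ':'.
        selector = "".join(f"{keyword}:" for keyword, _ in parts)
        args = [arg for _, arg in parts]
    if receiver == "super":
        # "&super_info" is a placeholder for a struct objc_super pointer.
        callee, target = "objc_msgSendSuper", "&super_info"
    else:
        callee, target = "objc_msgSend", receiver
    call_args = [target, 'sel_registerName("%s")' % selector] + args
    return "%s(%s)" % (callee, ", ".join(call_args))

# [person setName:name age:30]
print(lower_message_send("person", [("setName", "name"), ("age", "30")]))
# [super dealloc]
print(lower_message_send("super", [("dealloc", None)]))
```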

LLVM intermediate products and optimization

Use the following command to generate the LLVM intermediate representation (IR) and write it to main.ll:

clang -O3 -S -emit-llvm main.m -o main.ll

With the following command, the code is optimized using LLVM.

// Optimizations include global variable optimization, loop optimization, tail recursion optimization, etc.
// You can also set the optimization level (-O1, -O3, -Os) in Xcode's build settings, and write your own Pass.
clang -emit-llvm -c main.m -o main.bc

Generating assembly code

Use the following command to generate the corresponding assembly code.

clang -S -fobjc-arc main.m -o main.s

At this point the compilation stage is complete: the code you wrote has been converted into assembly code. The assembler then converts the assembly code into instructions that the machine can execute. Almost every assembly statement corresponds to one machine instruction, so the assembler can simply translate them one by one using a table that maps assembly instructions to machine instructions.

Use the following command to generate the corresponding object file.
clang -fmodules -c main.m -o main.o
Why do new projects in later versions of Xcode no longer have a PCH file?

The PCH file was originally just a way to precompile the UIKit and Foundation headers, so that they did not have to be parsed again in every source file. Over time, iOS developers abused it by stuffing all kinds of global variables and module-wide declarations into it.

Xcode now has the concept of modules, which is turned on by default in the build settings. Libraries like UIKit and Foundation are all modules. The nice thing is that with this parameter (-fmodules), #import is automatically converted to @import, and compilation is much faster than before. That is why new projects have no PCH by default.

$ clang -E -fmodules main.m // Preprocess with the -fmodules argument added

Linking

This stage links the object files generated in the previous stage with the static libraries they reference, and finally generates the executable file. The linker resolves the references between object files and libraries.

What does the linker do at compile time?

1. A Mach-O file mainly contains code and data. Code is the definitions of functions, and data is the definitions of global variables.

2. In a Mach-O file, variables and functions must be bound to their respective addresses. The job of the linker is to bind the symbols of variables and functions to those addresses.

Why bind symbols?

1. If symbols were not bound to addresses, you would have to write memory addresses directly in your code so that the machine knows which address you are operating on.

2. Readability would be poor, and every change to the code would mean maintaining the addresses again.

3. You would need to write separate code for each platform, which is equivalent to writing assembly directly.

Why merge the multiple Mach-O files in a project into one?

1. Variables and interfaces in different files depend on each other, so the linker has to bind the symbols and addresses of the multiple Mach-O files in a project.

2. Without binding, the Mach-O generated from a single file will not run: when it calls a function implemented in another file at run time, the address of that function cannot be found.

3. When linking multiple object files, the linker creates a symbol table that records all defined and undefined symbols. If the same symbol is defined more than once, an “ld: duplicate symbols” error is reported; if a symbol cannot be found in any object file, an “Undefined symbols” error is reported.
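The symbol table check in point 3 can be sketched as follows. This is a toy model, assuming each object file is described by just two lists of symbol names; the link function, the LinkError class, and the file names are invented for the example, and the error messages only imitate the spirit of what ld prints.

```python
class LinkError(Exception):
    pass

def link(objects):
    """Build a global symbol table: symbol name -> defining object file.

    `objects` maps an object file name to a dict with two lists,
    "defined" and "undefined".
    """
    table = {}
    # Pass 1: collect every definition; the same name twice is an error.
    for name, obj in objects.items():
        for symbol in obj["defined"]:
            if symbol in table:
                raise LinkError(
                    f"duplicate symbol '{symbol}' in {table[symbol]} and {name}")
            table[symbol] = name
    # Pass 2: every reference must resolve to some definition.
    for name, obj in objects.items():
        for symbol in obj["undefined"]:
            if symbol not in table:
                raise LinkError(
                    f"undefined symbol '{symbol}' referenced from {name}")
    return table

objects = {
    "main.o": {"defined": ["_main"], "undefined": ["_helper"]},
    "helper.o": {"defined": ["_helper"], "undefined": []},
}
print(link(objects))
```

Taken alone, main.o would fail the second pass because _helper is undefined, which is the situation described in point 2.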

What are the main things the linker does with your code?

1. Look for undefined variables in the code files

2. Collect all symbol definitions and reference addresses and place them in the global symbol table

3. Merge sections of the same type, calculate their length and position after merging, and bind symbols to addresses

4. Relocate the addresses of variables defined in the different files of the project
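Steps 3 and 4 can be illustrated with a small layout sketch. It assumes each object file reports the size of its sections and the offset of each symbol inside its own section; the layout function, the sizes, and the base address 0x1000 are all made up for the example, and real linkers also deal with alignment, segments, and page boundaries that are ignored here.

```python
def layout(objects, base=0x1000):
    """Merge same-type sections and compute final symbol addresses.

    Each object is a dict with "sizes" {section: bytes} and
    "symbols" {symbol: (section, offset within this object's section)}.
    """
    addresses = {}
    cursor = base
    for section in ("__text", "__data"):   # same-type sections go together
        for obj in objects:
            start = cursor                 # where this object's piece lands
            for symbol, (sec, offset) in obj["symbols"].items():
                if sec == section:
                    # Relocation: new base of the piece + old local offset.
                    addresses[symbol] = start + offset
            cursor += obj["sizes"].get(section, 0)
    return addresses

a = {"sizes": {"__text": 0x20, "__data": 0x8},
     "symbols": {"_main": ("__text", 0), "_counter": ("__data", 0)}}
b = {"sizes": {"__text": 0x10, "__data": 0x4},
     "symbols": {"_helper": ("__text", 0), "_flag": ("__data", 0)}}

addresses = layout([a, b])
for symbol, address in addresses.items():
    print(f"{symbol}: {address:#x}")
```

Both objects thought their symbols lived at offset 0; after merging, each symbol has a distinct final address.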

How does the linker keep the Mach-O small by removing unused functions?

When the linker sorts out function calls, it starts from the main function, follows every reference, and marks each function it reaches as live. Any function that is not marked as live when the traversal finishes is unused and can be removed.
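This marking process is a plain reachability search over the call graph. The sketch below assumes the call graph is already available as a dictionary; the live_functions function and the function names in the graph are hypothetical.

```python
def live_functions(calls, root="main"):
    """Return every function reachable from `root`.

    `calls` maps each function name to the functions it references.
    """
    live, worklist = set(), [root]
    while worklist:
        fn = worklist.pop()
        if fn in live:
            continue
        live.add(fn)                        # mark as live
        worklist.extend(calls.get(fn, ()))  # follow each reference
    return live

calls = {
    "main": ["log", "run"],
    "run": ["log"],
    "log": [],
    "unused": ["log"],      # references live code, but nobody calls it
    "orphan": ["unused"],
}
live = live_functions(calls)
dead = set(calls) - live
print(sorted(live), sorted(dead))
```

Note that unused is stripped even though it calls a live function: what matters is whether something live refers to it, not what it refers to.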

Summary: compilation of a source file

Code practice

#import <Foundation/Foundation.h>
int main() {
    NSLog(@"hello world!");
    return 0;
}
1. Generate the Mach-O executable
clang -fmodules main.m -o main
2. Generate the abstract syntax tree
clang -fmodules -fsyntax-only -Xclang -ast-dump main.m

3. Generate assembly code

clang -S main.m -o main.s

Loading and linking

To go from an executable file to actually running code, an app basically goes through two steps: loading and dynamic library linking.

Each program runs in its own virtual address space. Multiple processes run in the operating system at the same time, and their virtual address spaces are isolated from one another.

Loading is the process of mapping the executable file into virtual memory. Because memory is scarce, only the most frequently used parts of the program are kept in memory, and the less frequently used data stays on disk. This is what makes loading a dynamic process.

The process of loading is the process of creating a process, and the operating system does three things:

1. Create a separate virtual address space

2. Read the header of the executable and establish the mapping between the virtual address space and the executable file

3. Set the CPU’s instruction register to the entry address of the executable file and start running

Static library

A static library is linked at build time. It has to be linked into your Mach-O file, and the app must be rebuilt whenever the library is updated; it cannot be loaded or updated dynamically.

Dynamic library

A dynamic library is linked at run time. dyld is used to load it dynamically. The iOS system libraries are all dynamically linked.

Shared cache

A Mach-O file is a build-time artifact, whereas dynamic libraries are linked at run time, so a Mach-O file does not contain the definitions of symbols that live in dynamic libraries.

In the Mach-O file, symbols from dynamic libraries are undefined, but their names and the paths of the corresponding libraries are recorded.

When dlopen and dlsym import a dynamic library at run time, they first find the library using the recorded library path, and then find the address to bind by looking up the recorded symbol name.

Advantages:

Code sharing, easy maintenance, reduced executable file size
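The two-step lookup described above (find the library by path, then find the symbol by name) can be tried from Python on a POSIX system, where the ctypes module loads libraries through dlopen and resolves attributes through dlsym. This is a minimal sketch assuming a POSIX platform with a C library available; it looks up the standard C function abs purely as an example and has nothing iOS-specific in it.

```python
import ctypes
import ctypes.util

# Step 1: find the library. find_library turns a short name into something
# the dynamic loader can open (for example "libc.so.6" on Linux).
path = ctypes.util.find_library("c")
libc = ctypes.CDLL(path)          # roughly dlopen(path, ...)

# Step 2: find the symbol by name inside that library.
c_abs = libc.abs                  # roughly dlsym(handle, "abs")
c_abs.argtypes = [ctypes.c_int]
c_abs.restype = ctypes.c_int

result = c_abs(-5)
print(result)
```

Nothing about abs was known to this script until the lookup ran, which mirrors how an undefined symbol in a Mach-O file is only bound to an address at run time.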
