Compiling C or C++ source code into an executable is a two-step process: first, each source file is compiled individually into a relocatable file (with the extension .o); second, all the relocatable files are linked into an executable. On Linux, both relocatable files and executable files use ELF (Executable and Linkable Format).
For readers unfamiliar with the ELF file format, this article uses diagrams to explain how ELF files are linked, focusing on why each data structure is needed, so that readers can visualize the ELF linking process. If you are already familiar with this material, you can skip to the References section at the end of the article, which lists articles and documentation that cover the ELF file format in more depth.
The concepts covered in this article include segments, sections, symbol tables, string tables, and relocation tables. The focus is on how these concepts fit together to serve the ELF linking process, without detailing the binary format of these concepts in files.
To reduce complexity, the ELF programs in this article do not use shared objects and dynamic loading techniques.
1 Segment
From an operating system perspective, the easiest way to load a program into memory is to copy the program directly from a file to a specific location in memory, and then jump to the program entry.
Because the location of the program in memory is predetermined, the location of each function and global variable in memory is also known in advance. The program’s code can be executed without any modification.
In a real operating system, memory is managed in pages. On x86, a page is 4096 bytes (0x1000 in hexadecimal), and each page can be assigned its own access permissions, such as write and execute permissions.
For security, pages that hold code are executable but not writable, and pages that hold data are writable but not executable. If a region of memory were both writable and executable, an attacker could exploit a bug in the program to write attack code there and then execute it, compromising the system.
In ELF files, contents with the same memory access attributes are stored contiguously in the file, and each such run is called a segment. Code is stored in the text segment, and data is stored in the data segment.
In addition to its location and length in the file, a segment needs to state its location and length in memory, as well as the memory attributes it requires. This information is recorded in a program header. Segments and program headers correspond one-to-one.
Program headers are stored consecutively in the ELF file as an array called the program header table. The program header table usually follows the ELF header, but it can be located elsewhere in the file; its exact location is recorded in the ELF header.
The operating system locates the program header table using the information in the ELF header. It then loads each segment to the memory location specified by its program header, and finally jumps to the program’s entry point to start execution.
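The loading steps described above can be sketched in Python. This is a minimal sketch, assuming a 32-bit little-endian ELF file with the standard Elf32_Ehdr and Elf32_Phdr layouts; the function names (parse_program_headers, load_plan, build_demo_image) and every address in the demo image are made up for illustration, and a real loader performs far more validation.

```python
import struct
from collections import namedtuple

EHDR_FMT = "<16sHHIIIIIHHHHHH"  # Elf32_Ehdr, little-endian, 52 bytes
PHDR_FMT = "<8I"                # Elf32_Phdr, little-endian, 32 bytes
PT_LOAD = 1                     # segment type: loadable
PF_X, PF_W, PF_R = 1, 2, 4      # segment permission flags

Ehdr = namedtuple("Ehdr", "ident type machine version entry phoff shoff "
                          "flags ehsize phentsize phnum shentsize shnum shstrndx")
Phdr = namedtuple("Phdr", "type offset vaddr paddr filesz memsz flags align")


def parse_program_headers(image):
    """Read the ELF header, then follow e_phoff to the program header table."""
    ehdr = Ehdr._make(struct.unpack_from(EHDR_FMT, image, 0))
    if ehdr.ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    phdrs = [
        Phdr._make(struct.unpack_from(PHDR_FMT, image,
                                      ehdr.phoff + i * ehdr.phentsize))
        for i in range(ehdr.phnum)
    ]
    return ehdr, phdrs


def load_plan(image):
    """Return the entry point and, for each loadable segment, the tuple
    (file offset, file size, memory address, memory size, permissions)."""
    ehdr, phdrs = parse_program_headers(image)
    plan = []
    for ph in phdrs:
        if ph.type != PT_LOAD:
            continue
        perms = (("r" if ph.flags & PF_R else "-")
                 + ("w" if ph.flags & PF_W else "-")
                 + ("x" if ph.flags & PF_X else "-"))
        plan.append((ph.offset, ph.filesz, ph.vaddr, ph.memsz, perms))
    return ehdr.entry, plan


def build_demo_image():
    """Hypothetical image: an ELF header followed by two program headers."""
    ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)  # 32-bit, little-endian
    ehdr = struct.pack(EHDR_FMT, ident, 2, 3, 1, 0x08048060,
                       52, 0, 0, 52, 32, 2, 0, 0, 0)
    text = struct.pack(PHDR_FMT, PT_LOAD, 0, 0x08048000, 0x08048000,
                       0x200, 0x200, PF_R | PF_X, 0x1000)
    data = struct.pack(PHDR_FMT, PT_LOAD, 0x200, 0x08049200, 0x08049200,
                       0x40, 0x80, PF_R | PF_W, 0x1000)
    return ehdr + text + data


entry, plan = load_plan(build_demo_image())
print(hex(entry))
for segment in plan:
    print(segment)
```

The sketch only computes what would be loaded where; actually mapping the bytes and jumping to the entry point is the operating system’s job.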
Because memory is managed in 4K pages, the code and data segments each occupy a whole number of 4K pages in memory. Padding each segment in the file with zeros up to a multiple of 4K would waste space. To avoid this waste, the segments are packed tightly against each other in the ELF file and are only mapped to separate memory regions when loaded. Because the file is mapped into memory in whole 4K pages, a small amount of code ends up at the beginning of the data segment’s memory and a small amount of data at the end of the code segment’s memory.
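The page arithmetic behind this overlap can be worked through in a few lines. The following is a rough sketch, assuming 4K pages and the usual requirement that a segment’s file offset and memory address agree modulo the page size; mapped_range is a hypothetical helper, and the offsets and addresses are invented for the example.

```python
PAGE = 0x1000  # 4K page size


def mapped_range(offset, vaddr, size):
    """Return (file_start, mem_start, length) of the whole pages covering a segment."""
    if offset % PAGE != vaddr % PAGE:
        raise ValueError("file offset and memory address must agree modulo the page size")
    file_start = offset - offset % PAGE                     # round down to a page boundary
    mem_start = vaddr - vaddr % PAGE
    file_end = (offset + size + PAGE - 1) // PAGE * PAGE    # round up to a page boundary
    return file_start, mem_start, file_end - file_start


# Code occupies file bytes 0x0000-0x1233; data follows immediately at 0x1234.
code = mapped_range(0x0000, 0x08048000, 0x1234)
data = mapped_range(0x1234, 0x0804A234, 0x0100)
print([hex(v) for v in code])
print([hex(v) for v in data])
```

In this example the code mapping covers file bytes up to 0x2000, so its last page also contains the data that starts at 0x1234, while the data mapping starts at file offset 0x1000, so its page begins with the tail of the code.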
2 Section
The operating system’s primary concern is how to load a file into memory, so the information recorded for each segment is:
- The location and length of the segment in the file;
- The location and length of the segment in memory;
- The memory attributes of the segment.
The second and third points do not concern the linker. The linker cares more about what each part of an ELF file is for, and about how to combine multiple relocatable files into one file according to those purposes.
A segment can contain parts with different purposes that need to be treated differently: in the code segment, normal code and global initialization code should be treated differently; in the data segment, global variables with initial values and global variables without initial values should also be treated differently. A contiguous region that serves a single purpose is called a section. Unlike segments, each section has a name.
Sections related to code and data include:
Section name | Description | Segment in the executable file |
---|---|---|
.text | Generic code | Code segment |
.init | Initialization code, executed at the very beginning of a program run | Code segment |
.fini | Cleanup code, executed before the program exits | Code segment |
.data | Global data with initial values | Data segment |
.bss | Global data with no initial value or with an initial value of 0 | Data segment |
.rodata | Read-only global data | Code segment, because its memory attributes are closest to those of the code segment |
The structure that describes a section is the section header. Section headers are stored consecutively in the file as an array called the section header table. The section header table is usually placed at the end of an ELF file, and its location is recorded in the ELF header.
It is important to note that a section is not a substructure of a segment, but a structure with the same status as a segment. Segments are the way ELF files are viewed from the operating system’s perspective, and sections are the way ELF files are viewed from the linker’s perspective. Both are optional parts of an ELF file: relocatable files have no segments, and executables can have no sections (although most executables do).
Sections with the same name are merged into a single section during linking. Take the .text section as an example. Each relocatable file has a .text section, and the linker merges them into one large .text section stored in the output relocatable file or executable.
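Merging by name can be illustrated with a small sketch. It assumes each relocatable file is modeled as a plain dictionary from section name to bytes, which is a simplification invented for this example; merge_sections is a hypothetical helper, and real linkers also honor alignment and section flags, which are ignored here.

```python
def merge_sections(object_files):
    """Concatenate same-named sections in input order.

    object_files: list of dicts mapping section name -> bytes.
    Returns (merged, placement), where placement[(file_index, name)] is the
    offset at which that input section starts inside the merged section.
    """
    merged = {}
    placement = {}
    for index, sections in enumerate(object_files):
        for name, content in sections.items():
            out = merged.setdefault(name, bytearray())
            placement[(index, name)] = len(out)  # where this piece lands
            out += content                       # append in place
    return {name: bytes(buf) for name, buf in merged.items()}, placement


main_o = {".text": b"\x90\x90", ".data": b"\x01"}
add_o = {".text": b"\xc3", ".data": b"\x02\x03"}
merged, placement = merge_sections([main_o, add_o])
print(merged)
print(placement)
```

The placement map matters later: a symbol’s offset within its original section must be shifted by the offset at which that section was placed in the merged one.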
3 Symbol Table and String Table
When writing a program, one file can access variables and functions defined in another file. These variables and functions are collectively called symbols. Symbols that are accessible to other files are called global symbols, and symbols that are accessible only within their own file are called local symbols. Local symbols correspond to declarations marked with the static keyword in C.
If a file references a symbol from another file, that symbol is also recorded in the referencing file, where it is called an undefined symbol. When the files are finally linked into an executable, every undefined symbol must be matched by a global symbol of the same name; otherwise the link fails.
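The matching rule for undefined symbols can be sketched as follows. This is a simplified illustration, assuming each relocatable file is reduced to a tuple of its file name, the set of global symbols it defines, and the set of symbols it leaves undefined; resolve_symbols is a hypothetical helper, and refinements found in real linkers, such as weak symbols, are left out.

```python
def resolve_symbols(object_files):
    """Map every global symbol to its defining file, or fail like a linker would.

    object_files: list of (filename, defined_globals, undefined) tuples.
    """
    definitions = {}
    for filename, defined, _ in object_files:
        for name in defined:
            if name in definitions:
                raise ValueError(f"duplicate definition of {name}")
            definitions[name] = filename
    for filename, _, undefined in object_files:
        for name in undefined:
            if name not in definitions:
                raise ValueError(f"undefined reference to {name} in {filename}")
    return definitions


objects = [
    ("main.o", {"main"}, {"add"}),  # main.o calls add but does not define it
    ("add.o", {"add"}, set()),
]
print(resolve_symbols(objects))
```

Local symbols never enter the definitions map, which is why two files can each have a local symbol of the same name without conflict.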
In addition to a symbol’s name, by which other files reference it, the file also needs to record the section in which the symbol is defined and its offset within that section. This information is stored consecutively in the file as an array called the symbol table. The symbol table is a special section named .symtab.
So far, we have found two places in an ELF file that need to store strings: the names of sections and the names of symbols. Because strings vary in length, storing them directly in section headers and symbol table entries would require reserving enough space for the longest possible name, which would waste a lot of space. To save space, all strings in an ELF file are stored in special sections called string tables, conventionally named .strtab for symbol names and .shstrtab for section names.
All strings are C-style strings terminated by a 0 byte, and they are stored consecutively in the string table. Other structures refer to a string by its byte offset within the string table: the ELF header records the index, within the section header table, of the string table that holds section names, and each section header records the offset of its section name in that string table; each symbol table is associated with a string table, and each symbol records the offset of its name in the associated string table. From this information, the linker can find section and symbol names.
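A string table is simple enough to build and read by hand. The sketch below assumes ASCII names and follows the convention that offset 0 holds an empty string; build_string_table and read_string are hypothetical helper names.

```python
def build_string_table(names):
    """Pack names as 0-terminated strings and return (table, offsets)."""
    table = bytearray(b"\x00")  # offset 0 is the empty string
    offsets = {}
    for name in names:
        offsets[name] = len(table)
        table += name.encode("ascii") + b"\x00"
    return bytes(table), offsets


def read_string(table, offset):
    """Read the 0-terminated string that starts at the given byte offset."""
    end = table.index(b"\x00", offset)
    return table[offset:end].decode("ascii")


table, offsets = build_string_table(["main", "add", "counter"])
print(table)
print(offsets)
print(read_string(table, offsets["add"]))
```

A section header or symbol then needs only one fixed-size integer to name itself, however long the name is.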
4 Relocation Table
Although we can now refer to symbols defined in other files, we have yet to resolve an important issue.
Operations that access global variables are usually translated by the compiler into instructions that access memory addresses. However, a section in a relocatable file records only its location in the file; unlike a segment, it has no location in memory, so we do not know the memory address of a global variable defined in the file. In addition, if a file references a global variable defined in another file, the memory address of that variable cannot be known until the files are linked together.
In short, the memory addresses of global variables and functions are unknown until the files are finally linked into an executable, yet the compiler has to generate the instructions that access them before then.
ELF handles this as follows: the compiler still generates instructions that access these global variables and functions as usual, but fills the address field of each instruction with zeros and records the locations of these placeholders in the ELF file. Once the memory addresses of all these symbols have been determined, the linker replaces each placeholder with the correct memory address.
If the code accesses an element of a global array or a member of a global structure, the offset of the element or member is written as the placeholder where the memory address appears. Once the symbol’s memory address has been determined, the two are added to obtain the correct memory address. This offset is called the addend.
The structures that record where these placeholders occur are called relocation tables. A relocation table is also a special section, and a file can contain more than one: each section that needs relocation has its own relocation table. The name of a relocation table is the name of the section it relocates, prefixed with .rel or .rela. For example, the relocation table for .text is .rel.text or .rela.text.
There are two names because relocation tables come in two slightly different formats. A .rel relocation table leaves the addend at the location of the memory address as a placeholder, as described above; a .rela relocation table stores the addend in the relocation table entry rather than at the location being relocated.
Because relocation needs the memory addresses of symbols, each relocation table is associated with a symbol table as well as with the section being relocated. In general, each relocation table entry records four pieces of information: the offset within the section of the placeholder to be relocated, the index of the referenced symbol in the symbol table, the address calculation method to use, and (in the .rela format) the addend.
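The difference between the two formats comes down to where the addend lives. The following sketch models a relocation entry as a Python dictionary, which is an illustrative stand-in rather than the on-disk layout; make_relocation and read_addend are hypothetical helpers, and the symbol index 5 is arbitrary.

```python
import struct


def make_relocation(fmt, offset, symbol_index, rtype, addend):
    """Return (placeholder bytes, relocation entry) for one 32-bit address field."""
    entry = {"offset": offset, "symbol": symbol_index, "type": rtype}
    if fmt == "rel":
        # .rel: the addend is stored in the placeholder itself.
        return struct.pack("<i", addend), entry
    if fmt == "rela":
        # .rela: the placeholder stays zero and the entry carries the addend.
        entry["addend"] = addend
        return struct.pack("<i", 0), entry
    raise ValueError("format must be 'rel' or 'rela'")


def read_addend(fmt, section, entry):
    """Recover the addend the way a linker would for each format."""
    if fmt == "rel":
        return struct.unpack_from("<i", section, entry["offset"])[0]
    return entry["addend"]


# Loading element 3 of a global int array: the addend is 3 * 4 = 12.
# 0xA1 is the x86 opcode for "mov eax, [address]"; the address field follows it.
for fmt in ("rel", "rela"):
    placeholder, entry = make_relocation(fmt, 1, 5, "R_386_32", 12)
    section = b"\xa1" + placeholder
    print(fmt, section.hex(), entry, read_addend(fmt, section, entry))
```

Either way the linker ends up with the same addend; only the place it reads it from differs.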
There are two common address calculation methods:
Name | Calculation |
---|---|
R_386_32 | Symbol memory address + Addend |
R_386_PC32 | Symbol memory address + Addend - PC |
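PC refers to the program counter; in the relocation calculation, it is taken to be the memory address of the location being relocated. R_386_PC32 is used for PC-relative addressing, for example in relative call and jump instructions, and PC-relative addressing is also what position-independent code (PIC) builds on.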
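Both calculations can be tried out on a few bytes. This is a minimal sketch, assuming a .rel-style section where the addend is stored in the 32-bit placeholder and taking PC to be the memory address of the placeholder; apply_relocation is a hypothetical helper, and all addresses are invented for the example.

```python
import struct

R_386_32 = 1    # absolute: symbol address + addend
R_386_PC32 = 2  # relative: symbol address + addend - address of the placeholder


def apply_relocation(section, section_addr, offset, rtype, symbol_addr):
    """Patch one 32-bit placeholder in place (section must be a bytearray)."""
    addend = struct.unpack_from("<i", section, offset)[0]
    if rtype == R_386_32:
        value = symbol_addr + addend
    elif rtype == R_386_PC32:
        value = symbol_addr + addend - (section_addr + offset)
    else:
        raise ValueError("unsupported relocation type")
    struct.pack_into("<I", section, offset, value & 0xFFFFFFFF)


TEXT_ADDR = 0x08048000          # assumed memory address of .text
text = bytearray(0x20)

# Absolute access to element 3 of an int array assumed to live at 0x0804A000.
text[0x00] = 0xA1                               # mov eax, [address]
struct.pack_into("<i", text, 0x01, 12)          # addend: 3 * 4
apply_relocation(text, TEXT_ADDR, 0x01, R_386_32, 0x0804A000)

# Relative call to a function assumed to live at 0x08048040. The addend is -4
# because the processor measures the displacement from the end of the field.
text[0x10] = 0xE8                               # call rel32
struct.pack_into("<i", text, 0x11, -4)
apply_relocation(text, TEXT_ADDR, 0x11, R_386_PC32, 0x08048040)

print(hex(struct.unpack_from("<I", text, 0x01)[0]))
print(hex(struct.unpack_from("<i", text, 0x11)[0]))
```

For the call, the patched displacement equals the distance from the instruction after the call (at 0x08048015) to the target, which is what the processor expects.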
When linking into an executable, the linker first merges sections with the same name and then decides where the merged sections will reside in memory, forming segments. At this point, the linker can calculate the memory address of every symbol. The linker then walks through all relocation tables and rewrites each placeholder with the real memory address, completing the relocation. The resulting program can then be loaded into memory by the operating system and executed.
5 Summary
- The structures that record how programs are loaded into memory are segments.
- The structures that divide an ELF file into parts by function are sections.
- Global variables and functions are collectively called symbols.
- Section and symbol names are stored in the string table.
- The structures used to rewrite placeholders in programs into memory addresses are called relocation tables.
6 References
ELF Format is a 60-page document that describes the format of ELF files. It covers the principles of static linking and dynamic loading thoroughly and is a must-read for understanding ELF files. This version is slightly out of date, so some newer attributes found in current ELF files are missing from it. Still, because the manual is relatively thin, it is very readable, and reading it before moving on to more detailed material is recommended.
The Oracle ELF documentation is the most detailed and up-to-date document describing the ELF file format and is suitable for reference as a manual.
Load-time Relocation of Shared Libraries by Eli Bendersky explains how shared objects are loaded. Shared objects fall into two types depending on whether their code is position-independent. The article explains how shared objects built without PIC (position-independent code) are loaded, which is the same in principle as linking. It includes code and disassembly examples that give readers a deeper and more concrete understanding of the linking process.