During the 2016 National Day holiday I finally finished the book, so I have sorted out my notes and thoughts here.

About the title

The title comes from the Russian actor and theatre director Stanislavski's An Actor's Self-Cultivation. Stanislavski spent some thirty years writing and revising that book before agreeing to stop revising and have it published. On the one hand, borrowing the title signals that this book is not meant to introduce a new programming language or show off practical programming tricks, but to explain the mechanisms and origins behind how programs run, which can be regarded as a kind of "self-cultivation" for programmers; on the other hand, it is a tribute to Stanislavski and to his spirit of striving for excellence in his work. — from the preface of this book

Organization of the book

The book is divided into four parts, as follows.

Part I Introduction

Chapter 1 Review the Old and Learn the New

This section describes basic background information, including hardware, operating system, and threads.

Part II Static Linking

Chapter 2 Compilation and Linking

This section describes the basic concepts and procedures of compiling and linking.

Chapter 3 What's in the Object File

Describes the object file format (ELF) and how source code is stored in object files after compilation.

Chapter 4 Static Linking

Introduces the process and steps of static linking and static linking libraries.

Chapter 5 Windows PE/COFF

Describes the object file and executable file formats on the Windows platform.

Part III Loading and Dynamic Linking

Chapter 6 The Loading Process of Executable Files

Introduces the concept of process, the distribution of process address space, and the loading process of executable file mapping.

Chapter 7 Dynamic Linking

Using Linux .so shared libraries as the example, analyzes the dynamic linking process in detail.

Chapter 8 Organization of Linux Shared Libraries

Describes how shared library files are distributed and organized under Linux.

Chapter 9 Dynamic Linking under Windows

Describes the DLL dynamic linking mechanism on Windows.

Part IV Libraries and the Runtime

Chapter 10 Memory

Mainly introduces the heap and the stack, heap allocation algorithms, and the layout of the function call stack.

Chapter 11 Runtime

Mainly introduces the concept of a runtime library, the C/C++ runtimes (glibc and the MSVC CRT), how the runtime implements C++ global construction and destruction, and the fread() library function.

Chapter 12 System Calls and APIs

Mainly introduces Linux and Windows system calls and the Windows API.

Chapter 13 Implementation of a Runtime Library

Implements a Mini CRT that supports the heap, basic file operations, formatted strings, basic input/output, C++ new/delete, a C++ string class, and C++ global constructors and destructors.

Key Reading chapters

  • Chapter 1 Review the Old and Learn the New
  • Chapter 2 Compilation and Linking
  • Chapter 3 What's in the Object File
  • Chapter 4 Static Linking
  • Chapter 6 The Loading Process of Executable Files
  • Chapter 7 Dynamic Linking
  • Chapter 10 Memory

Reading notes

Review the old and learn the new

As the title of the first chapter suggests, this chapter focuses on the historical development of computer hardware and software. The main points are:

  • CPU clock frequency has hit a ceiling at around 4 GHz and has not kept rising along Moore's Law since roughly 2004, because there has been no fundamental breakthrough in CPU manufacturing.
  • In theory, increasing the number of CPUs should increase computation speed, and ideally the speedup would be proportional to the number of CPUs. In practice it is not, because our programs cannot all be decomposed into completely independent subproblems. One woman can have a baby in ten months, but ten women cannot have a baby in one month.

In the second section, the book discusses the architecture of computer system software. There is a well-known saying: "Any problem in computer science can be solved by adding an indirect intermediate layer" — "Any problem in computer science can be solved by another layer of indirection."

Take a look at a computer software architecture diagram to understand the power of this statement.

The layers need to communicate with each other, and communication requires a protocol, generally known as an interface (Interface). The layer below an interface is the interface provider, which defines the interface; the layer above it is the interface consumer, which uses the interface to implement the functionality it needs. In a layered system, interfaces are carefully designed to be as stable as possible, so that in theory any layer can be modified or replaced as long as the interfaces between layers are respected. Everything other than the hardware and the applications is called a middle layer, and each middle layer wraps and extends the layer below it.

The book sums up what an operating system does:

  • Provide abstract interfaces;
  • Manage hardware resources;

Under the heading "Don't let the CPU snooze", the book introduces several modes of program cooperation that appeared as CPUs developed:

  • Multiprogramming;
  • Time-sharing systems;
  • Multitasking systems;

As for memory, the book raises the question of how to allocate a computer's limited physical memory among multiple programs. With a naive memory allocation strategy, several problems arise:

  • Address spaces are not isolated;
  • Memory usage efficiency is low;
  • The addresses at which a program runs are not stable;

The way to solve these problems is the magic bullet mentioned earlier: add an intermediate layer, namely an indirect method of address access. The idea is to treat the address a program issues as a Virtual Address and then, through some mapping, translate that virtual address into the actual physical address. As long as we properly control the mapping from virtual addresses to physical addresses, we can ensure that the physical memory regions any two programs can access do not overlap, which achieves address space isolation.

Virtual addressing is what solves the three problems above. In its development, two approaches emerged:

  • Segmentation: map a segment of virtual space, as large as the memory the program needs, onto physical address space. This solves the first and third problems, but not the second one, memory usage efficiency.
  • Paging: the basic approach is to divide the address space into fixed-size pages; the page size is determined by the hardware, or, if the hardware supports several page sizes, chosen by the operating system.

The implementation of virtual memory relies on hardware support and differs between CPUs, but almost all hardware uses a Memory Management Unit (MMU) to perform page mapping. The flow is: CPU -> Virtual Address -> MMU -> Physical Address -> Physical Memory.
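As a rough illustration of how paging splits an address, here is a minimal sketch in C assuming 4 KB pages and a made-up single-level page table (real MMUs walk multi-level page tables that the operating system maintains):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u                      /* assume 4 KB pages */

    /* hypothetical page table: virtual page number -> physical page number */
    static uint32_t page_table[16] = { 7, 3, 12, 5 };

    static uint32_t translate(uint32_t vaddr) {
        uint32_t vpn    = vaddr / PAGE_SIZE;     /* virtual page number        */
        uint32_t offset = vaddr % PAGE_SIZE;     /* offset inside the page     */
        uint32_t ppn    = page_table[vpn];       /* mapping kept by the "MMU"  */
        return ppn * PAGE_SIZE + offset;         /* physical address           */
    }

    int main(void) {
        uint32_t va = 0x1A2C;                    /* lies in virtual page 1 */
        printf("virtual 0x%x -> physical 0x%x\n", va, translate(va));
        return 0;
    }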

When I read this, I was reminded of a blog I read: Alloc init, do you understand 50%?

The idea of a fixed allocation granularity, much like byte alignment, also appears in the Apple documentation Tips for Allocating Memory:

When allocating any small blocks of memory, remember that the granularity for blocks allocated by the malloc library is 16 bytes. Thus, the smallest block of memory you can allocate is 16 bytes and any blocks larger than that are a multiple of 16. For example, if you call malloc and ask for 4 bytes, it returns a block whose size is 16 bytes; if you request 24 bytes, it returns a block whose size is 32 bytes. Because of this granularity, you should design your data structures carefully and try to make them multiples of 16 bytes whenever possible.

This means:

When we allocate a block of memory, if the requested size is no more than 16 bytes, the allocator simply hands out 16 bytes; if the requested size is larger, it allocates a multiple of 16 bytes. For example, if you call malloc asking for 4 bytes, you get a 16-byte block; if you ask for 24 bytes, you get a 32-byte block.
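A quick way to observe this behavior is the sketch below, which assumes macOS, where malloc_size() from <malloc/malloc.h> reports the actual size of an allocated block (exact sizes can vary with the allocator version):

    #include <stdio.h>
    #include <stdlib.h>
    #include <malloc/malloc.h>   /* macOS-specific header */

    int main(void) {
        void *a = malloc(4);
        void *b = malloc(24);
        /* With the 16-byte granularity described above, these typically print 16 and 32. */
        printf("malloc(4)  -> block of %zu bytes\n", malloc_size(a));
        printf("malloc(24) -> block of %zu bytes\n", malloc_size(b));
        free(a);
        free(b);
        return 0;
    }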

Chapter 1 also covers some of the basics of threading. What is a thread?

Threads, sometimes referred to as Lightweight processes (LWP), are the smallest unit of a program’s execution flow.

The reasons for using multiple threads within a process are as follows:

  • An operation might be stuck in a long wait, and the waiting thread would fall asleep and not be able to continue. Multithreaded execution can make good use of wait time. A typical example is waiting for a network response, which can take seconds or even tens of seconds.
  • An operation (usually a calculation) takes a lot of time, and if there is only one thread, the interaction between the program and the user breaks down. Multithreading allows one thread to interact and another thread to compute.
  • The program logic itself requires concurrent operation, as in software that downloads a file in multiple parts concurrently.
  • A multi-CPU or multi-core computer is capable of executing multiple threads at the same time, so a single-threaded program cannot fully exploit the full power of the computer.
  • In contrast to multi-process applications, multi-threading is much more efficient in data sharing.

Access permission of the thread

Threads have free access to all data in the process’s memory, including other threads’ stacks, but in practice, threads also have their own private storage, including the following:

  • Stack (although not completely inaccessible to other threads, it is generally still considered private data)
  • Thread Local Storage (TLS). Thread-local storage is private space that some operating systems provide for each thread individually, usually with a limited capacity (a small sketch follows this list).
  • Registers (including the PC register), which belong to the basic data of the execution flow and are therefore private to the thread.
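A minimal sketch of thread-local storage in C, using the __thread extension supported by GCC and Clang (names are illustrative; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    /* each thread gets its own independent copy of this variable */
    static __thread int tls_counter = 0;

    static void *worker(void *arg) {
        for (int i = 0; i < 3; i++)
            tls_counter++;               /* touches only this thread's copy */
        printf("thread %ld: tls_counter = %d\n", (long)arg, tls_counter);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;                        /* both threads print 3: the counters are separate */
    }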

(Figure: data private to a thread vs. data shared between threads in a process)

Scheduling and priority of threads

Let me first explain the difference between parallelism and concurrency:

When the number of threads is less than or equal to the number of CPUs (including cores), the concurrency is genuine: different threads run on different processors at the same time. When the number of threads is greater than the number of CPUs, the concurrency is simulated.

Simulated concurrency means the operating system makes these multithreaded programs take turns executing. The behavior of constantly switching different threads onto the processor is called Thread Scheduling. Under thread scheduling, a thread usually has at least three states:

  • Running: The thread is executing.
  • Ready: The thread can execute immediately, but the CPU is already occupied.
  • Waiting: A thread that is Waiting for an event (usually I/O or synchronization) to occur and cannot execute.

A running thread has a period of Time in which it can execute, called a Time Slice. (Figure: thread state transitions)

The main methods of thread scheduling are as follows:

  • Priority Scheduling
  • Round Robin

A thread’s priority can change in one of three ways:

  • The user specifies the priority
  • The priority is raised or lowered according to how frequently the thread enters the wait state
  • Threads that have not been executed for a long time have their priority raised

When a thread's time slice runs out, it is forced into the ready state and another thread takes the processor; this is called Preemption: the newly scheduled thread preempts the current one.

Thread safety

We call the operation of a single instruction atomic, because the execution of a single instruction is not interrupted anyway.

In order to avoid unexpected consequences caused by multiple threads reading and writing the same data at the same time, we need to synchronize the access of all threads to the same data. The so-called synchronization means that when one thread is accessing data, other threads cannot access the same data. Thus, access to the data is atomized.

The most common synchronization method is to use locks. A lock is a non-mandatory mechanism: each thread first attempts to acquire (Acquire) the lock before accessing the data or resource, and releases (Release) the lock after the access. If a thread attempts to acquire a lock that is already held, it waits until the lock becomes available again.

Here are some common iOS locks:

  • @synchronized

    NSObject *obj = [[NSObject alloc] init];
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        @synchronized(obj) {
            NSLog(@"Operation 1 requiring thread synchronization begins");
            sleep(3);
            NSLog(@"Operation 1 requiring thread synchronization ends");
        }
    });
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        sleep(1);
        @synchronized(obj) {
            NSLog(@"Operation requiring thread synchronization 2");
        }
    });

    The obj passed to @synchronized(obj) is the unique identifier of the lock; mutual exclusion happens only when the identifier is the same. If the @synchronized(obj) in thread 2 is changed to @synchronized(self), thread 2 will no longer be blocked. The advantage of @synchronized is that you do not need to explicitly create a lock object in your code to get locking. However, as a precaution, @synchronized implicitly protects the code by adding an exception handler that releases the mutex when an exception is thrown, so if you do not want the extra overhead of the implicit exception-handling routine, consider using a lock object instead.

    The output of the above code is:

    Operation 1 requiring thread synchronization begins
    Operation 1 requiring thread synchronization ends
    Operation requiring thread synchronization 2
  • NSLock

    NSLock *lock = [[NSLock alloc] init];
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        //[lock lock];
        [lock lockBeforeDate:[NSDate date]];
            NSLog(@"Operation 1 requiring thread synchronization begins");
            sleep(2);
            NSLog(@"Operation 1 requiring thread synchronization ends");
        [lock unlock];
    
    });
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        sleep(1);
        if ([lock tryLock]) { // Try to acquire the lock; if it cannot, tryLock returns NO and does not block the thread
            NSLog(@"Operations available to lock");
            [lock unlock];
        }else{
            NSLog(@"Lock unavailable operation");
        }
    
        NSDate *date = [[NSDate alloc] initWithTimeIntervalSinceNow:3];
        if ([lock lockBeforeDate:date]) { // Try to acquire the lock within the next 3 s, blocking the thread; if it cannot be acquired within 3 s, lockBeforeDate: returns NO and stops blocking
            NSLog(@"No timeout, lock obtained.");
            [lock unlock];
        }else{
            NSLog(@"Timed out, no lock obtained"); }});Copy the code

    NSLock is the most basic lock object Cocoa provides and the one we use most often. Besides lock and unlock, NSLock also offers the tryLock and lockBeforeDate: methods. tryLock attempts to acquire the lock; if the lock is unavailable (already held), it does not block the thread and simply returns NO. lockBeforeDate: tries to acquire the lock before the specified date and returns NO if the lock cannot be acquired before that time. The output of the above code is:

    Operation 1 requiring thread synchronization begins
    Operation 1 requiring thread synchronization ends
    No timeout, lock obtained.
  • NSRecursiveLock (recursive lock)

    //NSLock *lock = [[NSLock alloc] init];
    NSRecursiveLock *lock = [[NSRecursiveLock alloc] init];
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    
        static void (^RecursiveMethod)(int);
    
        RecursiveMethod = ^(int value) {
    
            [lock lock];
            if (value > 0) {
    
                NSLog(@"value = %d", value);
                sleep(1);
                RecursiveMethod(value - 1);
            }
            [lock unlock];
        };
    
        RecursiveMethod(5);
    });
    

    NSRecursiveLock actually defines a recursive lock that can be requested multiple times by the same thread without causing a deadlock. This is mainly used in circular or recursive operations.

    This code is a typical deadlock case. In our thread, RecursiveMethod calls itself recursively, so every entry into the block takes the lock again; from the second time on, since the lock is already held and has not been released, the call waits for the lock to be released, which causes a deadlock and blocks the thread. The debugger outputs the following:

    value = 5
    -[NSLock lock]: deadlock (<NSLock: 0x7fd811d28810> '(null)')
    Break on _NSLockError() to debug.

    In this case, we can use NSRecursiveLock. It allows the same thread to lock more than once without causing a deadlock. A recursive lock keeps track of how many times it is locked. Each successful lock must balance calls to unlock. Only when this balance is reached can the lock finally be released for use by other threads.

    If we replace NSLock with NSRecursiveLock, the above code will execute correctly.

    value = 5
    value = 4
    value = 3
    value = 2
    value = 1
  • NSConditionLock (condition lock)

    NSMutableArray *products = [NSMutableArray array];
    
    NSInteger HAS_DATA = 1;
    NSInteger NO_DATA = 0;
    
    NSConditionLock *lock = [[NSConditionLock alloc] initWithCondition:NO_DATA]; // the condition lock; initWithCondition:NO_DATA is an assumed initialization
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        while (1) {
            [lock lockWhenCondition:NO_DATA];
            [products addObject:[[NSObject alloc] init]];
            NSLog(@"Produce a product, amount :%zi", products.count);
            [lock unlockWithCondition:HAS_DATA];
            sleep(1);
        }
    });
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        while (1) {
            NSLog(@"wait for product");
            [lock lockWhenCondition:HAS_DATA];
            [products removeObjectAtIndex:0];
            NSLog(@"custome a product");
            [lock unlockWithCondition:NO_DATA];
        }
    });

    When using multiple threads, a lock that offers only lock and unlock is sometimes not enough, because an ordinary lock only cares about locked or unlocked and does not care which "key" may open it. When dealing with shared resources, we often want the lock to open only when a certain condition is met:

    In the code above, the producer thread acquires the lock with lockWhenCondition:NO_DATA (it can only take the lock when the condition is NO_DATA), adds a product, and releases it with unlockWithCondition:HAS_DATA; the consumer thread can only acquire the lock when the condition is HAS_DATA, removes a product, and sets the condition back to NO_DATA. The integer condition acts as the "key": only a thread waiting for the matching condition can take the lock. Even so, NSConditionLock works like the other locks: lock and unlock calls must still be paired, and lock, lockWhenCondition:, unlock, and unlockWithCondition: can be combined in whatever way your needs dictate.

    The result of the above code execution is as follows:

    wait for product
    Produce a product, amount :1
    custome a product
    wait for product
    Produce a product, amount :1
    custome a product
    wait for product
    Produce a product, amount :1
    custome a product
  • NSCondition

    NSCondition *condition = [[NSCondition alloc] init];
    
    NSMutableArray *products = [NSMutableArray array];
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        while (1) {
            [condition lock];
            if ([products count] == 0) {
                NSLog(@"wait for product");
                [condition wait];
            }
            [products removeObjectAtIndex:0];
            NSLog(@"custome a product"); [condition unlock]; }}); dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{while (1) {
            [condition lock];
            [products addObject:[[NSObject alloc] init]];
            NSLog(@"Produce a product, amount :%zi",products.count); [condition signal]; [condition unlock]; sleep(1); }});Copy the code

    This is one of the most basic condition locks. You manually control when threads wait and when they are signaled.

    [condition lock]; Commonly used when multiple threads access and modify the same data source at the same time, to guarantee that the data source is accessed and modified by only one thread at a time; other threads' operations must wait outside the lock until unlock is called.

    [condition unlock]; Used together with lock.

    [condition wait]; Puts the current thread into the waiting state.

    [condition signal]; Signals a waiting thread to stop waiting and continue executing.

    The result of the above code execution is as follows:

    wait for product
    Produce a product, amount :1
    custome a product
    wait for product
    Produce a product, amount :1
    custome a product
    wait for product
    Produce a product, amount :1
    custome a product
  • pthread_mutex

    __block pthread_mutex_t theLock;
    pthread_mutex_init(&theLock, NULL);
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
            pthread_mutex_lock(&theLock);
            NSLog(@"Operation 1 requiring thread synchronization begins");
            sleep(3);
            NSLog(@"Operation 1 requiring thread synchronization ends");
            pthread_mutex_unlock(&theLock);
    
    });
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
            sleep(1);
            pthread_mutex_lock(&theLock);
            NSLog(@"Operation requiring thread synchronization 2");
            pthread_mutex_unlock(&theLock);
    
    });

    This is the mutex for multithreading defined in C (POSIX threads).

    1. pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr); Initializes the lock variable mutex; attr specifies the lock attributes, and NULL means the default attributes.
    2. pthread_mutex_lock(pthread_mutex_t *mutex); Locks.
    3. pthread_mutex_trylock(pthread_mutex_t *mutex); Locks, but unlike 2, if the lock is already held it returns EBUSY instead of blocking.
    4. pthread_mutex_unlock(pthread_mutex_t *mutex); Releases the lock.
    5. pthread_mutex_destroy(pthread_mutex_t *mutex); Destroys the lock after use.

    The code execution results are as follows:

    Operation 1 requiring thread synchronization begins
    Operation 1 requiring thread synchronization ends
    Operation requiring thread synchronization 2
  • pthread_mutex(recursive)

    __block pthread_mutex_t theLock;
    //pthread_mutex_init(&theLock, NULL);
    
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&theLock, &attr);
    pthread_mutexattr_destroy(&attr);
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    
        static void (^RecursiveMethod)(int);
    
        RecursiveMethod = ^(int value) {
    
            pthread_mutex_lock(&theLock);
            if (value > 0) {
    
                NSLog(@"value = %d", value);
                sleep(1);
                RecursiveMethod(value - 1);
            }
            pthread_mutex_unlock(&theLock);
        };
    
        RecursiveMethod(5);
    });

    This is a recursive lock that pthread_mutex uses to prevent deadlocks in recursive situations. Works like an NSRecursiveLock recursive lock.

    If pthread_mutex_init(&theLock, NULL); is used to initialize the lock, the code above deadlocks. If the recursive form of the lock is used, there is no problem.

  • OSSpinLock

    __block OSSpinLock theLock = OS_SPINLOCK_INIT;
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        OSSpinLockLock(&theLock);
        NSLog(@"Operation 1 requiring thread synchronization begins");
        sleep(3);
        NSLog(@"Operation 1 requiring thread synchronization ends");
        OSSpinLockUnlock(&theLock);
    
    });
    
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        OSSpinLockLock(&theLock);
        sleep(1);
        NSLog(@"Operation requiring thread synchronization 2");
        OSSpinLockUnlock(&theLock);
    
    });

    OSSpinLock is a spin lock, the best-performing of these locks. The principle is simple: it just busy-waits in a loop. The drawback is that it consumes a lot of CPU while waiting, so it is not suitable for longer tasks. Note, however, that YY (ibireme) explained in his blog post on OSSpinLock that OSSpinLock is no longer safe.

The simplest type of lock is the Binary Semaphore, which has only two states: occupied and unoccupied. It is suitable for resources that may only be accessed exclusively by a single thread. When a binary semaphore is unoccupied, the first thread that attempts to acquire it gets the lock and sets it to the occupied state; after that, all other threads attempting to acquire it wait until the lock is released.

A multivariate Semaphore, or simply Semaphore, is a good choice for resources that allow concurrent access by multiple threads. A semaphore with an initial value of N allows concurrent access by N threads. When a thread accesses the resource, it first acquires the semaphore, performing the following operations:

  • Decrement the semaphore value by 1
  • If the semaphore value is less than 0, the wait state is entered, otherwise execution continues.

After accessing the resource, the thread releases the semaphore, performing the following operations (a POSIX sketch follows the list):

  • Increase the semaphore value by 1
  • If the semaphore value is less than 1, wake up a waiting thread.
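Before turning to the iOS API, here is a minimal sketch of these acquire/release operations using POSIX semaphores on Linux (thread counts and names are made up; compile with -pthread):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define N 4                        /* at most 4 threads may use the resource at once */

    static sem_t sem;

    static void *use_resource(void *arg) {
        sem_wait(&sem);                /* acquire: decrement; blocks when the value would drop below 0 */
        printf("thread %ld is using the resource\n", (long)arg);
        sem_post(&sem);                /* release: increment; wakes one waiting thread if any */
        return NULL;
    }

    int main(void) {
        pthread_t tids[8];
        sem_init(&sem, 0, N);          /* unnamed semaphore with initial value N */
        for (long i = 0; i < 8; i++)
            pthread_create(&tids[i], NULL, use_resource, (void *)i);
        for (int i = 0; i < 8; i++)
            pthread_join(tids[i], NULL);
        sem_destroy(&sem);
        return 0;
    }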

In iOS, the corresponding semaphore facility is dispatch_semaphore:

    dispatch_semaphore_t signal = dispatch_semaphore_create(1);
    dispatch_time_t overTime = dispatch_time(DISPATCH_TIME_NOW, 3 * NSEC_PER_SEC);

    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        dispatch_semaphore_wait(signal, overTime);
            NSLog(@"Operation 1 requiring thread synchronization begins");
            sleep(2);
            NSLog(@"Operation 1 requiring thread synchronization ends");
        dispatch_semaphore_signal(signal);
    });

    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        sleep(1);
        dispatch_semaphore_wait(signal, overTime);
            NSLog(@"Operation requiring thread synchronization 2");
        dispatch_semaphore_signal(signal);
    });


dispatch_semaphore is GCD's way of doing synchronization. Three functions are involved: dispatch_semaphore_create, dispatch_semaphore_signal, and dispatch_semaphore_wait.

  • dispatch_semaphore_create

    dispatch_semaphore_t dispatch_semaphore_create(long value);

It takes a long value and returns a semaphore of type dispatch_semaphore_t whose initial value is value. Note that value must be greater than or equal to zero; otherwise dispatch_semaphore_create returns NULL.

  • dispatch_semaphore_signal

    long dispatch_semaphore_signal(dispatch_semaphore_t dsema)

This function increments the value of the passed-in semaphore dsema by 1.

  • dispatch_semaphore_wait

    long dispatch_semaphore_wait(dispatch_semaphore_t dsema, dispatch_time_t timeout);

This function decrements the value of the passed-in semaphore dsema by 1. If the semaphore's value is greater than 0, the thread continues executing and the value is simply decremented. If the value is 0, the function blocks the current thread for up to timeout (note that timeout has type dispatch_time_t; you cannot pass a plain integer or float). If dsema is incremented by dispatch_semaphore_signal during the wait and this thread acquires the semaphore, execution continues and the value is decremented by 1. If the semaphore is never acquired, or its value stays 0 throughout the wait, the thread automatically continues with the subsequent statements once the timeout expires.

dispatch_semaphore is a semaphore, but it can also be used as a lock when the total count is set to 1. When there is no contention it performs even better than pthread_mutex, but once waiting occurs its performance drops considerably. Its advantage over OSSpinLock is that it does not consume CPU while waiting.

In the code above, if the timeout is set to more than 2 s (longer than thread 1's work), the two operations are properly synchronized. If the timeout expires before thread 1 finishes, thread 2 stops waiting and executes its code anyway.

The result of the above code is:

Operation 1 requiring thread synchronization begins
Operation 1 requiring thread synchronization ends
Operation requiring thread synchronization 2

If the timeout is set to <2s, the result is:

Operation 1 requiring thread synchronization begins
Operation requiring thread synchronization 2
Operation 1 requiring thread synchronization ends

The results of YY’s performance test (qualitative analysis) on these locks are shown below:

Multithreading internals

Concurrent execution of threads is achieved by multiple processors or by operating-system scheduling. In practice it is a bit more complicated: most operating systems, including Windows and Linux, provide thread support in the kernel, and those kernel threads run concurrently across processors or via scheduling. What the user actually works with, however, are not kernel threads but user threads that exist in user mode. User threads do not necessarily correspond one-to-one with kernel threads in the operating system kernel; for example, in some lightweight thread libraries, three threads from the user's point of view may map to only one kernel thread.

The three threading models of user-mode and kernel-mode are as follows:

  • One-to-one model. (Figure: the one-to-one threading model) A user thread corresponds to exactly one kernel thread, so user threads gain the same advantages as kernel threads: concurrency between threads is true concurrency, and when one thread blocks for whatever reason, the execution of the other threads is not affected.

    Generally, threads created directly using the API or the system are one-to-one threads.

    One-to-one threading has two disadvantages:

    • Since many operating systems limit the number of kernel threads, one-to-one threading limits the number of threads a user can have.
    • In many operating systems, kernel thread scheduling has a high context-switch overhead, which lowers the execution efficiency of user threads.
  • Many-to-one model. (Figure: the many-to-one threading model) The many-to-one model maps multiple user threads onto a single kernel thread. Switching between threads is done by user-mode code, so thread switching in the many-to-one model is much faster than in the one-to-one model. The problem is that if one user thread blocks, all of the user threads are unable to execute, because the single kernel thread is blocked as well. Also, on multiprocessor systems, adding processors does not noticeably improve the threading performance of the many-to-one model. On the other hand, the many-to-one model benefits from efficient context switching and an almost unlimited number of threads.

  • Many-to-many model. (Figure: the many-to-many threading model) The many-to-many model combines the features of the many-to-one and one-to-one models, mapping multiple user threads onto a smaller or equal number of kernel threads. One blocking user thread does not cause all user threads to block, because other kernel threads can still be scheduled. The model also places no limit on the number of user threads, and on multiprocessor systems it gains some performance, though not as much as the one-to-one model.

Compilation and linking

For everyday application development we rarely pay attention to compilation and linking, because the usual development environment is an integrated development environment (IDE) that performs compilation and linking in a single step, commonly called a Build. Even compiling a source file from the command line with a simple "gcc hello.c" hides a very complex process.

The default configuration, compilation, and linking options provided by IDEs and compilers are sufficient for most application development. But in such a workflow we are often spoiled by these powerful integrated tools: the mechanisms of a lot of system software are hidden from us, inexplicable program errors leave us at a loss, and we have no idea how to attack the various performance bottlenecks a running program may hit. If you understand these mechanisms, you can handle such problems with ease.

#include "hello.h"

int main()
{
    printf("Hello World\n");
    return 0;
}

Under Linux, we can compile the program with the simplest GCC commands:

gcc hello.c
./a.out
Hello World

The above process can be divided into four steps:

  • Preprocessing
  • Compilation
  • Assembly
  • Linking

(Figure: decomposition of the GCC compilation process)

preprocessing

The preprocessing (precompilation) step mainly deals with the preprocessing directives beginning with "#" in source files. The main rules are as follows:

  • Remove all “#define” and expand all macro definitions.
  • Process all conditional preprocessing directives, such as "#if", "#ifdef", "#elif", "#else", "#endif".
  • Process the "#include" directive by inserting the included file at the location of the directive. Note that this process is recursive, meaning that the included file may itself include other files.
  • Delete all comments “//” and “/**/”.
  • Add line numbers and file name identifiers, such as #2 "hello.c" 2, so that the compiler can produce line-number information for debugging and show line numbers in compile-time errors or warnings.
  • Keep all #pragma compiler directives, because the compiler needs them.

The precompiled .i file contains no macro definitions, because all macros have been expanded and the included files have been inserted into the .i file. So when we cannot tell whether a macro definition or a header inclusion is correct, we can look at the precompiled file to pin down the problem.

The preprocessing step above is equivalent to the following command:

gcc -E hello.c -o hello.i
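For instance, here is a minimal sketch of what this step does (file and macro names are made up; the real .i output also contains line markers and the expanded contents of the included headers):

    /* square.c, before preprocessing */
    #define SQUARE(x) ((x) * (x))

    int area(int n)
    {
        return SQUARE(n + 1);
    }

    /* After `gcc -E square.c`, the macro definition is gone and the call site
       is expanded in place:

           int area(int n)
           {
               return ((n + 1) * (n + 1));
           }
    */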

compile

Compilation takes the preprocessed file through a series of steps (lexical analysis, syntax analysis, semantic analysis, and optimization) to produce the corresponding assembly code file. This is the core of program construction and also one of its most complex parts. The compilation step above is equivalent to the following command:

gcc -S hello.i -o hello.s

At its most intuitive, a compiler is a tool for translating high-level languages into machine languages. High-level languages allow programmers to focus more on the logic of the program and less on the limitations of the computer itself. The appearance of high-level programming language greatly improves the efficiency of program development. According to research, the development efficiency of high-level language is more than 5 times that of assembly language and machine language.

The compilation process can be divided into six steps: scanning, syntax analysis, semantic analysis, source code optimization, code generation, and object code optimization. (Figure: the compilation process)

assembly

An assembler converts assembly code into machine-executable instructions; almost every assembly statement corresponds to one machine instruction. The assembler's job is therefore relatively simple compared with the compiler's: there is no complex syntax, no semantics, and no instruction optimization; it simply translates, one by one, according to a table mapping assembly mnemonics to machine instructions.

The above assembly process can be done by calling the assembler as:

as hello.s -o hello.o

or

gcc -c hello.s -o hello.o

link

Linking is often a rather obscure process. Why doesn't the assembler just produce an executable instead of an object file? What exactly does linking involve? Why link at all?

The book briefly reviews the history of computer development. To put it simply, as the scale of software grows, each program is divided into multiple modules. The Linking process of these modules is called Linking.

The linking process mainly includes:

  • Address and Storage Allocation
  • Symbol Resolution (PS: "resolution" tends to be used with static linking, while "binding" tends to be used with dynamic linking)
  • Relocation

The most basic static linking process is shown below. (Figure: the static linking process) The source code files of each module (such as .c files) are compiled by the compiler into object files (usually with the .o or .obj extension), which the linker then combines with libraries to form the final executable file.

The relocation process is as follows:

Suppose we have a global variable called var defined in object file A, and we need to access it in object file B. Because the compiler does not know var's target address when compiling object file B, it sets the address to 0 for now and lets the linker correct it when object files A and B are linked. This address-correction process is called relocation, and each place that gets corrected is called a relocation entry.
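A minimal sketch of this situation (file names are illustrative): after compiling with `gcc -c a.c b.c`, the relocation entry left in b.o can be inspected with `objdump -r b.o`.

    /* a.c: defines the global variable */
    int var = 42;

    /* b.c: references a symbol defined elsewhere */
    extern int var;

    int read_var(void)
    {
        /* The compiler does not know var's final address here, so it emits a
           placeholder (address 0) plus a relocation entry; the linker patches
           the real address when a.o and b.o are linked together. */
        return var;
    }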

The object file

The files generated by the compiler after compiling source code are called object files.

The most popular executable file formats on the PC platform are PE (Portable Executable) on Windows and ELF (Executable Linkable Format) on Linux; both are variants of the COFF (Common Object File Format) format.

Object files are intermediate files that have been compiled from source code but not yet linked (.obj on Windows, .o on Linux).

Broadly speaking, object files and executable files have almost the same format, so on Windows we can refer to them collectively as the PE-COFF file format, and under Linux we can refer to them collectively as ELF files.

Dynamic Linking Libraries (.dll on Windows, .so on Linux) and Static Linking Libraries (.lib on Windows, .a on Linux) are also stored in the executable file format.

A static link library is a bit different. It is a bundle of many object files together into a single file, plus some indexes.

| ELF file type | Description | Examples |
| --- | --- | --- |
| Relocatable File | Contains code and data that can be linked into executable files or shared object files; static link libraries also fall into this category | .o on Linux, .obj on Windows |
| Executable File | Contains a program that can be executed directly, represented by ELF executables; these generally have no extension | /bin/bash on Linux, .exe on Windows |
| Shared Object File | Contains code and data and can be used in two ways: the linker can link it with other relocatable and shared object files to produce a new object file, or the dynamic linker can combine several shared object files with an executable and run them as part of the process image | .so on Linux, e.g. /lib/glibc-2.5.so; DLLs on Windows |
| Core Dump File | When a process terminates unexpectedly, the system can dump the contents of the process's address space, along with some other information at the time of termination, into a core dump file | core dumps on Linux |

What does the object file look like

This is the type of graph that you often see when you talk about blocks or other memory problems.

The compiled machine instructions of a program’s source Code are often placed in Code sections. Common Code sections have names such as “.code” or “.text”; Global and local static variable Data is often placed in the Data Section, which is usually named “.data”.

See the structure of a simple program compiled into an object file.

(Figure: the program and its object file)

The rest is easy to read. Uninitialized global variables and local static variables are usually placed in a ".bss" section (BSS, Block Started by Symbol). Uninitialized global variables and local static variables default to 0; they could have been placed in the .data section, but since they are all 0 there is no need to allocate space in the file and store zeros for them. The program does need storage for them at run time, though, so the executable file records the total size of all uninitialized global and local static variables as the .bss section. So the .bss section merely reserves space for uninitialized global variables and local static variables; it has no contents and takes up no space in the file.

Generally speaking, program source code is divided into two main kinds of sections after compilation: program instructions and program data. Code sections hold program instructions, while the .data and .bss sections hold program data.
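A small sketch showing where typical definitions end up (placement can vary slightly with compiler and options; `objdump -h foo.o` or `size foo.o` shows the resulting section sizes):

    /* foo.c */
    int global_init = 84;                /* initialized global           -> .data  */
    int global_uninit;                   /* uninitialized global         -> .bss   */

    void func(int i)                     /* machine code                 -> .text  */
    {
        static int static_init = 85;     /* initialized local static     -> .data  */
        static int static_uninit;        /* uninitialized local static   -> .bss   */
        int local = i;                   /* ordinary local: lives on the stack, not in a section */
        (void)local;
        (void)static_init;
        (void)static_uninit;
    }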

Why separate program instructions from data storage?

  • When the program is loaded, the data and instructions are mapped into two separate virtual memory regions. The data region is readable and writable by the process, while the instruction region is read-only, so the permissions of the two regions can be set to read-write and read-only respectively. This prevents program instructions from being modified, intentionally or accidentally.

  • Modern CPUs have extremely powerful cache hierarchies, so programs should maximize the cache hit rate. Separating the instruction region from the data region improves the program's locality. Modern CPU caches are usually designed as separate instruction and data caches, so storing a program's instructions and data separately improves the CPU cache hit rate.

  • The last and most important point: when multiple copies of a program run on the system, their instructions are identical, so only one copy of the program's instructions needs to be kept in memory. The same applies to other read-only data, not just instructions. Each process copy's data regions, of course, are different; they are private to the process.

Besides the three most commonly used sections, .text, .data, and .bss, an ELF file may also contain other sections that store other program-related information. (Figure: other sections)

ELF file structure

The basic structure chart of ELF files below is formed by removing some of ELF's more tedious structures and extracting the most important parts. (Figure: ELF structure) Besides these, the ELF object file format also includes the following:

  • The ELF Header contains basic attributes describing the whole file, such as the ELF file version, the target machine type, the program entry address, and so on.
  • The individual sections.
  • The Section Header Table describes information about all sections in the ELF file, such as each section's name, length, offset within the file, read/write permissions, and other attributes. The section structure of an ELF file is determined by the section table; compilers, linkers, and loaders all rely on the section table to locate and access each section and its attributes.
  • Relocation tables. When the linker processes an object file, it needs to relocate certain parts of it, namely the places in the code and data sections that reference absolute addresses. This relocation information is recorded in the relocation tables of the ELF file; each code section that needs relocation has a corresponding relocation table.
  • String tables. ELF files use many strings, such as section names and variable names. Because string lengths vary, it is hard to represent them with a fixed structure; the common practice is to put the strings together in a table and reference each string by its offset in the table. Typical string tables are the String Table and the Section Header String Table: the former holds ordinary strings, and the latter holds the strings used by the section table.

The linking interface: symbols

Linking is essentially the process of "gluing" several object files together, like assembling toy building blocks. For different object files to stick together, they must follow fixed rules, just as blocks need matching studs and sockets to fit.

What links object files together is references to addresses across object files, that is, references to the addresses of functions and variables. For example, if object file B uses function "foo" from object file A, then object file A defines function "foo" and object file B references it. The same concept applies to variables. Every function and variable has its own unique name, to avoid confusion between different variables and functions during linking.

In links, we refer to functions and variables collectively as symbols, and the function or variable Name is the Symbol Name.
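For example, given the two hypothetical files below, `nm a.o` lists foo as a defined text symbol (type T), while `nm b.o` lists foo as undefined (type U); resolving that undefined reference is precisely the linker's job.

    /* a.c: defines the symbol "foo" */
    void foo(void) { }

    /* b.c: references the symbol "foo" */
    void foo(void);

    void bar(void)
    {
        foo();   /* symbol reference: foo's address is unknown until link time */
    }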

Name decoration and function signatures

Until roughly the 1970s, when compilers compiled source code into object files, a symbol's name was the same as the name of the corresponding function or variable. To prevent symbol name collisions, UNIX C compilers required that the symbol names corresponding to all global variables and functions in C source files be prefixed with an underscore "_". Of course, that too is history.

For example, two functions of the same name, func(int) and func(double), have the same name but different argument lists; this is the simplest case of function overloading in C++. To support such C++ features, mechanisms known as Name Decoration or Name Mangling were invented.

Two functions may both be named func and differ only in return type, parameters, or namespace. To distinguish them, the term Function Signature was introduced: a function signature contains information about a function, including its name, its parameters, the class and namespace it belongs to, and other information. The compiler and linker apply name decoration when processing symbols, so that each function signature corresponds to a unique Decorated Name.

Strong and weak symbols | strong and weak references

We often run into a situation in programming called symbol redefinition: when multiple object files contain definitions of global symbols with the same name, linking those object files produces a duplicate-symbol error.

Such definitions can be called Strong Symbols, while some other symbols are defined as Weak Symbols. For C/C++, the compiler treats functions and initialized global variables as strong symbols by default, and uninitialized global variables as weak symbols. We can also use GCC's "__attribute__((weak))" to turn any strong symbol into a weak one.

Note: both strong and weak symbols are for definitions, not references to symbols. For example, the following program:

extern int ext;

int weak ;
int strong = 1;
__attribute__((weak)) weak2 = 2;

int main() 
{
    return 0;
}


"weak" and "weak2" are weak symbols (I have tested that when weak2 is defined as weak via the compile attribute, there is no duplicate-definition error), "strong" and "main" are strong symbols, and "ext" is neither strong nor weak, because it is a reference to an external variable. Given the notions of strong and weak symbols, the linker handles symbols that are defined multiple times according to the following rules (a small sketch follows the list):

  • Rule 1: Strong symbols are not allowed to be defined more than once (i.e., strong symbols cannot have the same name in different object files). If there are more than one strong symbol definition, the linker reports the symbol definition duplicate error;
  • Rule 2: If a symbol is strong in one object file and weak in all other files, then it is now strong.
  • Rule 3: If a symbol is weak in all object files, select the one that takes up the most space.
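A minimal sketch of rule 2 (hypothetical file names): a library provides a weak default, and the user's strong definition overrides it at link time.

    /* lib_default.c: the library provides a weak default value */
    __attribute__((weak)) int buf_size = 1024;

    /* user.c: the user's strong definition wins when both files are linked */
    int buf_size = 4096;

    /* main.c */
    #include <stdio.h>
    extern int buf_size;

    int main(void)
    {
        /* linked as: gcc main.c user.c lib_default.c
           prints 4096, because the strong symbol overrides the weak one (rule 2) */
        printf("buf_size = %d\n", buf_size);
        return 0;
    }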

Now for weak and strong references. The symbol references to external object files that we have seen so far must all be correctly resolved when the object files are finally linked into an executable; if a symbol's definition is not found, the linker reports an undefined-symbol error. This kind of reference is called a Strong Reference. There are also Weak References: when handling a weak reference, if the symbol has a definition, the linker resolves the reference; if it does not, the linker does not treat the reference as an error. The linker handles strong and weak references in much the same way, except that an undefined weak reference is not considered an error; generally, an undefined weak reference defaults to 0, or to some special value the program code can recognize.

We can use "__attribute__((weak))" to declare a reference to an external function as a weak reference, for example:

__attribute__((weak)) void foo();
int main()
{
    foo();
}

This compiles into an executable and GCC reports no link errors. But when we run it, a runtime error occurs: when main tries to call foo, foo's address is 0, so an invalid address access error occurs. An improved version:

__attribute__((weak)) void foo();
int main()
{
    if(foo) foo();
}

Weak references and weak symbols are mainly used when linking libraries. For example, a weak symbol defined in a library can be overridden by a strong symbol defined by the user, so the program uses the custom version of the library function. Or a program can declare its references to some optional extension module as weak references: when the extension module is linked in, its functions can be used normally; when the module is removed, the program still links correctly, just without the corresponding functionality. This makes the program easier to cut down and recombine.

Debugging information

The object file may also contain debugging information. Almost all modern compilers support source-level debugging, such as setting a breakpoint in a function, watching variables change, and single-stepping, provided the compiler records the relationship between source code and object code in the object file: for example, which source line an address in the object code corresponds to, the types of functions and variables, structure definitions, strings saved in the object file, and so on. Some advanced compilers and debuggers even support viewing the contents of STL containers; think of how Xcode lets you inspect containers, images, and more while debugging.

Debugging information takes up a lot of space in object files and executables, often several times more than the program code and data themselves, so when we are ready to release a program we should strip out the debugging information that is useless to users, saving a lot of space. Under Linux we can use the "strip" command to strip the debugging information from an ELF file:

strip foo

Xcode's build configuration likewise offers Debug and Release. With Release selected, when the program crashes at run time, Xcode will not show the cause and location of the crash.

Static linking

Depending on how linking is done, there are static linking and dynamic linking. When we have two object files, how do we link them together into an executable? What happens during that process? That is essentially the core of linking.

The whole link process is divided into two steps:

  • Step 1: Space and address allocation. Scan all input object files, obtain the length, attributes, and position of each of their sections, and collect all the symbol definitions and symbol references from the input object files' symbol tables into a global symbol table. In this step the linker obtains the section lengths of all input object files, merges them, computes the length and position of each merged section in the output file, and establishes the mapping.
  • Step 2: Symbol resolution and relocation. Using the information gathered in the first step, read the section data and relocation information from the input files, perform symbol resolution and relocation, adjust the addresses in the code, and so on. This second step is the heart of the linking process, especially relocation.

Space and address allocation

The linker allocates addresses and space for the object files. The actual space allocation strategy is to merge sections of the same kind, as shown below. (Figure: space allocation)

The "address and space" being allocated here have two meanings:

  • Space within the output executable file;
  • The virtual address space after the program is loaded.

For sections with actual data, such as ".text" and ".data", space is allocated both in the file and in the virtual address space, because they exist in both. For a section like ".bss", the allocation only matters in the virtual address space, because it has no content in the file. In fact, the allocation we care about here is only the allocation of virtual addresses, since that is what the address computation in the later steps of linking depends on; how space is allocated inside the executable file itself has little to do with the linking process.

Symbol resolution and relocation

Fundamentally, we link because the symbols we use in one object file are defined in other object files, so we need to link them together.

One of the most common problems when writing programs is that a symbol is undefined at link time. There are many possible causes, the most common being that a library was omitted from the link, an object file path was entered incorrectly, or a symbol's declaration does not match its definition. So from the average programmer's point of view, symbol resolution accounts for the bulk of the linking process.

In fact, relocation and symbol resolution go hand in hand: each object file may define some symbols and reference symbols defined in other object files. During relocation, each relocation entry refers to a symbol, so when the linker needs to relocate a symbol reference, it must determine the symbol's target address. It does this by looking the symbol up in the global symbol table, built from the symbol tables of all input object files, and uses the result for relocation.

Static library link

A static library can simply be viewed as a collection of object files, that is, many object files compressed and packaged into a single file. For example, libc, the C static library we use most often in Linux, is located in /usr/lib/libc.a and is part of the glibc project.

The linking of a static library is shown below. (Figure: static library linking)

Why does the static runtime library put each function in its own object file? For example, printf.o contains only the printf() function.

When linking against a static library, the linker works at the granularity of object files. If we reference the printf() function from the static library, the linker pulls in the object file in the library that contains printf(). If many functions were placed in one object file, a lot of functions that are never used would be dragged into the output; since a runtime library contains hundreds or thousands of functions, the amount would be huge. Placing each function in a separate object file minimizes wasted space, because object files (functions) that are not used are simply not linked into the final output.

The book then covers link process control. That part is more hands-on, so I will not expand on it here.

Tip:

Operating system kernels are mentioned in many places. In essence, a kernel is itself a program. For example, the Windows kernel ntoskrnl.exe is an ordinary PE file, located at \Windows\System32\ntoskrnl.exe. Many people assume the Windows kernel is huge and consists of many files; in fact, the real Windows kernel is just this one file.

Loading of executable files and processes

The loading of executable files is introduced together with the start of a process, which raises several questions.

  • What is a process’s virtual address space?
  • Why should processes have their own virtual address space?
  • What are the loading methods?
  • What is the virtual address distribution of the process? For example, how are code segments, data segments, BSS segments, heaps, and stacks distributed in the process address space respectively, and how are their positions and lengths determined?

The virtual address space of the process

What is the difference between a program and a process? A program (or executable file, in the narrow sense) is a static concept: a file containing pre-compiled instructions and data. A process is a dynamic concept: it is the running instance of a program. This is also part of why dynamic libraries are often called the "runtime".

Once a program runs, it has its own Virtual Address Space, whose size is determined by the computer's hardware, specifically by the CPU's address width. Hardware determines the theoretical upper limit of the address space, that is, the hardware's addressing range. For example, a 32-bit hardware platform gives virtual addresses from 0 to 2^32 - 1, that is 0x00000000 to 0xFFFFFFFF, the 4GB virtual space we often speak of. A 64-bit platform has 64-bit addressing, with a virtual address space of 2^64 bytes, 0x0000000000000000 to 0xFFFFFFFFFFFFFFFF, about 17,179,869,184 GB. This addressing capability is, for now, almost unlimited. But history has a way of teasing us, and one day we may find 64-bit address spaces too small, just as we now find 32-bit addresses inadequate.

In fact, from the program's point of view, we can judge the size of the virtual address space from the size of a C pointer. Generally speaking, the size of a C pointer matches the width of the virtual address space: on a 32-bit platform a pointer is 32 bits (4 bytes); on a 64-bit platform a pointer is 64 bits (8 bytes).
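
A quick way to check this on whatever platform you compile for (a trivial sketch):

#include <stdio.h>

int main(void)
{
    /* On a typical 32-bit target this prints 4; on a 64-bit target it prints 8. */
    printf("sizeof(void *) = %zu bytes\n", sizeof(void *));
    return 0;
}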

So can our program freely use the whole 4GB of virtual space on a 32-bit platform? Unfortunately, no. Programs run under the supervision of the operating system, and in order to monitor their execution (among other goals), the operating system keeps a process's virtual space under its own control: a process can only use the addresses the system has assigned to it. If it accesses an unassigned address, the operating system catches the access, treats it as an illegal operation, and forcibly terminates the process. This is why we so often run into the annoying "this program has performed an illegal operation and will be shut down" on Windows or "Segmentation fault" on Linux: the process accessed an unauthorized address.

Can a program use more than 4GB of space on a 32-bit CPU? The question should be looked at from two perspectives.

  • If "space" means virtual address space, the answer is no. Since a 32-bit CPU can only use 32-bit pointers, its maximum addressing range is 0 to 4GB.
  • If "space" refers to physical memory, the answer is yes. Since 1995, starting with the Pentium Pro, Intel CPUs have supported 36-bit physical addresses, which means they can access up to 64GB of physical memory.

At the hardware level, the original 32-bit address lines can only address up to 4GB of physical memory. After extending the physical address to 36 bits, Intel modified the page-mapping mechanism so that the new mapping can reach more physical memory. Intel calls this Physical Address Extension (PAE). Of course, this is only a stopgap for the shortage of 32-bit address space; the real solution is 64-bit processors and operating systems.

Loading methods

The instructions and data a program needs must be in memory for it to run. The simplest approach is to load everything the program needs into memory at once; this is static loading. But in many cases a program needs more memory than the machine physically has, and the fundamental fix, adding memory, is expensive: memory has always been scarce relative to disk. So people try to run as many programs as possible without adding memory, using memory as efficiently as possible. It was later observed that programs exhibit locality of reference, so we can keep the most frequently used parts of a program in memory and leave the less frequently used data on disk. This is the basic principle of dynamic loading.

What are the loading methods? Overlay and Paging are two classic dynamic loading methods. Both follow a similar idea and, in principle, exploit the locality of programs: if a program uses a module, load that module into memory; if not, leave it on disk.

  • Overlay loading was widely used before virtual memory was invented and is now almost obsolete. It leaves the task of exploiting memory to the programmer, who must manually slice the program into chunks and write auxiliary code to manage when those chunks reside in memory and when they are replaced. This helper code is called the Overlay Manager.
  • Paging (page mapping) is part of the virtual memory mechanism. Like overlay loading, it does not load all of a program's data and instructions into memory at once. Instead, memory and all the data and instructions on disk are divided into units called pages, and all subsequent loading and operations are done in units of pages.

Understand page mapping:

To demonstrate the basic mechanism of page mapping, assume our 32-bit machine has 16KB of memory and each page is 4096 bytes, so there are four physical pages, F0 to F3. Assume the program's instructions and data add up to 32KB, so the program is divided into eight pages, P0 to P7. Obviously, 16KB of memory cannot hold all 32KB at once, so the whole loading process follows the principle of dynamic loading. If the program starts executing at P0, the load manager finds that P0 is not in memory, allocates physical page F0, and loads the contents of P0 into it. After running for a while the program needs P5, so the load manager loads P5 into F1. Likewise, when the program uses P3 and P6 they are loaded into F2 and F3 respectively; their mapping is shown below. ![2016092197064Mapping and the pages load.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092197064Mapping and the pages load.png) Clearly, if the program only ever needs these four pages (P0, P3, P5, P6), it can keep running this way forever. But the problem is obvious: if the program then accesses P4, the load manager must make a decision; it has to give up one of the four physical pages currently in use to load P4. Which page to evict can be decided by various algorithms: we might choose F0 because it was the first page allocated (this is FIFO, first in first out); or, if the load manager notices that F2 is rarely accessed, we might choose F2 (this is LRU, least recently used). Suppose we evict P0; then F0 now holds P4, and the program continues running this way.

In fact, the so-called load manager in this example is the modern operating system, or more precisely, the storage manager of the operating system. Almost all major operating systems today load executables this way. The familiar loading of PE files on Windows and ELF files on Linux is done this way.
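
As a toy illustration of the FIFO policy in the example above (purely my own sketch, with 4 physical frames and the page references from the example):

#include <stdio.h>

#define FRAMES 4

/* Toy FIFO page replacement: -1 means the frame is empty. */
int main(void)
{
    int frames[FRAMES] = {-1, -1, -1, -1};
    int next_evict = 0;                      /* FIFO pointer: oldest frame          */
    int refs[] = {0, 5, 3, 6, 4};            /* pages touched, as in the example    */
    int n = sizeof(refs) / sizeof(refs[0]);

    for (int i = 0; i < n; i++) {
        int page = refs[i], hit = 0;
        for (int f = 0; f < FRAMES; f++)
            if (frames[f] == page) { hit = 1; break; }
        if (!hit) {                          /* page fault: load into the oldest frame */
            printf("page fault: P%d -> F%d\n", page, next_evict);
            frames[next_evict] = page;
            next_evict = (next_evict + 1) % FRAMES;
        }
    }
    return 0;
}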

Process establishment

From the operating system's point of view, the most important feature of a process is that it has an independent virtual address space, which sets it apart from other processes. In many cases a program is executed by creating a new process, so let's look at the most common case: creating a process, loading the corresponding executable, and executing it. With virtual memory, this initially requires only three things:

  • Create an independent virtual address space. A virtual space is defined by a set of page-mapping functions that map pages of the virtual space to the corresponding physical pages, so creating a virtual space does not actually create the space itself but the data structures needed for the mapping.
  • Read the executable's header and establish the mapping between the virtual space and the executable file. The page mapping of the previous step maps virtual space to physical memory; this step maps the virtual space to the executable file. When a page fault occurs during execution, the operating system allocates a physical page, reads the "missing" page from disk into memory, and sets up the mapping between the virtual page and the physical page so the program can continue. Clearly, when the operating system catches a page fault, it must know where in the executable the page the program needs lives; this is exactly what the mapping between the virtual space and the executable provides. This step is the most important part of loading and is what "loading" traditionally means. In Linux, a region of a process's virtual space is called a Virtual Memory Area (VMA); in Windows it is called a Virtual Section, but they are the same concept. For example, after a process is created, the operating system records a .text VMA in the process's data structures. The VMA is an important concept for understanding how programs are loaded and executed and how the operating system manages a process's virtual space.
  • Set the CPU's instruction pointer to the executable's entry address and start running. In this third and simplest step, the operating system transfers control to the process by setting the CPU's instruction register, and the process begins executing. This seemingly simple step is actually more complex at the operating-system level: it involves switching between the kernel stack and the user stack and switching the CPU's privilege level. From the process's point of view, though, it can simply be thought of as the operating system executing a jump to the executable's entry address.

Process virtual memory space distribution

In a typical process, the executable contains not only a code segment but also a data segment, BSS, and so on, so more than one segment has to be mapped into the process's virtual space. As the number of segments grows, space gets wasted: an ELF file is mapped in units of the system page size, so the space each segment occupies must be a whole number of pages, and anything left over still takes up a page. An ELF file often has dozens of sections, so the wasted memory is easy to imagine.

When we look at the problem from the perspective of the operating system loading the executable, we find that it does not really care about the contents of the individual sections; it only cares about the things relevant to loading, above all the sections' permissions (readable, writable, executable). ELF files use only a few combinations of permissions, basically three:

  • Readable and executable segments, represented by the code segment;
  • Readable and writable segments, represented by the data segment and the BSS segment;
  • Read-only segments, represented by the read-only data segment.

Then a very simple solution suggests itself: merge sections with the same permissions and map them as one unit. The merged unit is called a "Segment", and it can be mapped as a whole during loading. This significantly reduces page fragmentation and thus saves memory.

The concept of "Segment" re-divides the ELF file from the loading point of view. When object files are linked into an executable, the linker places sections with the same permission attributes together: for example, readable and executable sections (typically code) are grouped into one segment, and readable and writable sections (typically data) are grouped into another.

So when we say ".text" or ".data", do we mean a "Segment" or a "Section"?

See the figure below, which shows an ELF executable mapped into the process virtual space: ![2016092346799The ELF executable file with the process of virtual space mapping relation.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092346799The ELF executable file with the process of virtual space mapping relation.png)

These divisions in the ELF file are called "Sections". So "Segment" and "Section" divide the same ELF file from different angles; ELF calls these different views: from the "Section" perspective an ELF file is the Linking View, and from the "Segment" perspective it is the Execution View. When we talk about ELF loading, "segment" refers specifically to "Segment"; in other contexts it refers to "Section".

Stack and heap

The operating system uses VMAs not only to map the segments of the executable; it also uses them to manage the process's address space in general. A process also needs Stack and Heap space during execution, and these too exist in the process's virtual space in the form of VMAs. In many cases a process's stack and heap each have a corresponding VMA that is not mapped to any file; such a VMA is called an Anonymous Virtual Memory Area.

To summarize: the operating system manages a process's virtual space by dividing it into VMAs; roughly speaking, regions with the same permission attributes mapped from the same image file form one VMA. A process can basically be divided into the following VMAs:

  • Code VMA: read-only, executable; backed by an image file.
  • Data VMA: readable, writable, executable; backed by an image file.
  • Heap VMA: readable, writable, executable; no image file, anonymous, grows upward.
  • Stack VMA: readable, writable, not executable; no image file, anonymous, grows downward.

Let’s take a look at a common process virtual space, as shown below: [2016092374781The ELF executable file with the process of virtual space mapping relation.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092374781The ELF executable file with the process of virtual space mapping relation.png)

The maximum heap allocation

On Linux, the virtual address space available to the process itself is 3GB (Windows defaults to 2GB). How much of it can the program actually use? How much memory can a single malloc() request? We can test the maximum size malloc() can obtain with the following small program:

#include <stdio.h>
#include <stdlib.h>

unsigned maximum = 0;

int main(int argc, const char *argv[]) {
    unsigned blockSize[] = {1024 * 1024, 1024, 1}; /* 1MB, 1KB, then 1-byte steps */
    int i, count;
    for (i = 0; i < 3; i++) {
        for (count = 1; ; count++) {
            void *block = malloc(maximum + blockSize[i] * count);
            if (!block)
                break;
            maximum = maximum + blockSize[i] * count;
            free(block);
        }
    }
    printf("maximum malloc size = %f GB\n", maximum * 1.0 / 1024.0 / 1024.0 / 1024.0);
    return 0;
}

Running this program on a Linux machine gives about 2.9GB; running it on Windows gives about 1.5GB. So what factors affect the maximum size malloc() can obtain? In fact the exact value depends on the operating system version, the size of the program itself, the number and size of the dynamic/shared libraries it loads, the number and size of the program's stacks, and so on; it may even differ from run to run, because some operating systems use address space layout randomization (mainly for security, to thwart attacks), which makes the process's available heap space smaller.

Segment address alignment

Executable files are ultimately loaded and run by the operating system, usually through the virtual memory page-mapping mechanism. In this mapping, the page is the smallest unit: a mapping between physical memory and the process's address space must be a whole number of pages long. Because of this constraint on lengths and starting addresses, executables should optimize their space and address layout to save space.

To get around this waste, some UNIX systems use a clever trick: they let the boundary parts of two segments share one physical page and then map that physical page twice. To be honest, when I first read this I did not understand what it meant; look at the following figure:

![2016092638493Period of unincorporated situation for executable files.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092638493Period of unincorporated situation for executable files.png)

Assuming the tail of SEG0 and the head of SEG1 do not each fill a whole memory page, the system maps the physical page that contains the adjacent parts of SEG0 and SEG1 into the virtual address space twice, as shown below:

![2016092655277The ELF file section of mergers.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092655277The ELF file section of mergers.png)

Moreover, because of this segment address alignment, the virtual address of each segment is often not an exact multiple of the system page size.

Why dynamic linking

Static linking lets developers and different teams develop and test their own program modules relatively independently, which greatly improved the efficiency of application development and expanded the scale programs could reach. But the drawbacks of static linking gradually became apparent, such as wasted memory and disk space and the difficulty of updating modules, so people had to find a better way to organize a program's modules.

Memory and disk space

Static linking is simple and easy to understand in principle, though harder to implement in practice; in the early days, when operating systems and hardware were less developed, most systems used it. As software grew, its drawback quickly became apparent: static linking wastes enormous amounts of memory and disk space, especially on multi-process operating systems. Imagine that every program keeps its own copy of common library code such as printf(), scanf(), strlen(), and so on, plus a considerable number of other libraries and the auxiliary data structures they need.

For example, suppose Program1 and Program2 consist of the modules program1.o and program2.o respectively, and both use the module lib.o. With static linking, each of the linked executables Program1 and Program2 contains its own copy of lib.o, so when we run both at the same time, lib.o exists twice on disk and twice in memory. When a large number of programs on the system share object files like lib.o, much of that space is wasted. Among static libraries, the C static library is a typical example of wasted space, and with thousands of other libraries, the space wasted if everything were statically linked is hard to imagine.

Program development and distribution

Wasted space is one problem with static linking; the other is that static linking makes updating, deploying, and distributing applications troublesome. For example, suppose Program1 uses lib.o provided by a third-party vendor. When the vendor updates lib.o (say, to fix a bug), the vendor of Program1 has to get the latest lib.o, link it with program1.o again, and ship the whole new Program1 to users. The downside is obvious: whenever any module in the application is updated, the whole application has to be re-linked and re-released.

Dynamic link

The solution to both the wasted space and the update difficulties is to keep the program's modules as separate files rather than statically linking them into one. Simply put, the object files that make up the program are not linked until the program is about to run; that is, linking is delayed until run time. This is the basic idea of Dynamic Linking.

Take Program1 and Program2 as an example. Suppose Program1 and lib.o have already been loaded into memory. If we now want to run Program2, the system only needs to load program2.o, not another copy of lib.o, because one is already in memory; all the system has to do is link program2.o with lib.o. Sharing one copy of a module in memory saves more than memory: it also reduces the paging in and out of physical pages, and it improves CPU cache hit rates, since different processes access the instructions and data of the same shared module.

Dynamic linking also makes program updates easier. When we want to upgrade a shared module used by a program, in theory we only need to replace the old object file with the new one; there is no need to re-link the whole application. The next time the program runs, the new version of the module is automatically loaded into memory and linked, and the upgrade is complete.

Program scalability and compatibility

Another feature of dynamic linking is that modules can be loaded dynamically while the program is running. This advantage is the basis of program plug-ins.

Dynamic linking can also enhance program compatibility. A program can link at run time against the dynamic libraries provided by each operating system; these libraries act as an intermediate layer between the program and the operating system, smoothing over dependency differences between platforms.

Dynamic linking also has a lot of annoying and puzzling problems. A common problem is that when a module on which a program depends is updated, the original program cannot run because the interface between the new module and the old module is not compatible.

The basic implementation of dynamic link

The basic idea of dynamic linking is to split the program into relatively independent modules and link them together into a complete program at run time, rather than linking all modules into a single executable in advance as static linking does.

Dynamic linking involves run-time linking and the loading of multiple files, so it requires operating-system support. With dynamic linking, the layout of a process's virtual address space is more complex than with static linking, and storage management, memory sharing, and the process/thread mechanisms also change subtly. In Linux, dynamically linked ELF files are called dynamic shared objects, or shared objects for short, and usually have the extension .so.

On Linux, the dynamically linked version of glibc, the common C runtime, is stored in the /lib directory as libc.so. The whole system keeps only this one copy, and every dynamically linked C program can use it at run time. When a program is loaded, the system's dynamic linker loads all the shared libraries the program needs (the most basic one being libc.so) into the process's address space, binds all the unresolved symbols in the program to the corresponding shared libraries, and performs relocation.

The actual linking between the program and libc.so is done by the dynamic linker, which postpones linking from before load time to load time. This is certainly flexible, but doesn't re-linking at every load make programs slow? Dynamic linking does cost some performance, but the link process can be optimized; it is estimated that dynamic linking loses less than 5% performance compared with static linking, which is well worth the savings in space and the flexibility in building and upgrading programs.

Dynamic linking process

![2016092812881Dynamic linking process.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092812881Dynamic linking process.png) lib.c is compiled into the shared object lib.so, and program1.c is compiled into program1.o, which is then linked into the executable Program1. One step in the figure differs from static linking: the step where program1.o is linked into the executable. With static linking, program1.o and lib.o would be linked together to produce the executable Program1; here lib.o is not linked in, and the only input object file is program1.o (plus, of course, the C language runtime, which we ignore here). Yet lib.so also takes part in the linking of Program1. How does that work?

When program1.c is compiled into program1.o, the compiler does not yet know the address of the foobar() function. When the linker links program1.o into an executable, it must determine the nature of the foobar() reference in program1.o. If foobar() is a function defined in another static object module, the linker relocates the reference according to the rules of static linking. If foobar() is defined in a dynamic shared object, the linker marks the reference as a dynamic-link symbol and does not relocate it, leaving that for load time.

How does the linker know whether the reference to foobar() is a static symbol or a dynamic symbol? This is precisely why lib.so is needed: lib.so contains complete symbol information (dynamic linking at run time needs it too), so by also using lib.so as one of the link inputs, the linker learns while resolving symbols that foobar() is a dynamic symbol defined in lib.so, and can therefore treat the reference to foobar() specially, as a reference to a dynamic symbol.
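
A minimal sketch of this setup (the foobar/lib.so/program1.c names come from the example above; the gcc invocations in the comments are just one typical way to build it, and at run time the loader must be able to find lib.so, e.g. via LD_LIBRARY_PATH):

/* lib.c -- built as a shared object, e.g.:  gcc -fPIC -shared -o lib.so lib.c */
#include <stdio.h>

void foobar(void)
{
    printf("foobar from lib.so\n");
}

/* program1.c -- linked against the shared object, e.g.:  gcc -o Program1 program1.c ./lib.so
   At static link time the linker only records that foobar is a dynamic symbol defined
   in lib.so; the actual address is resolved when Program1 is loaded. */
void foobar(void);

int main(void)
{
    foobar();
    return 0;
}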

About the module

With static linking, the whole program ends up as a single, indivisible executable. Under dynamic linking, a program is split across several files: the main part, the executable (Program1), and the shared objects it depends on (lib.so). We often call each of these parts a module; that is, under dynamic linking both the executable and the shared objects can be regarded as modules of the program.

Address distribution when a dynamically linked program is running

The final load address of a shared object is not fixed at compile time; it is determined at load time, when the loader allocates a sufficiently large chunk of free virtual address space to the corresponding shared object.

Address independent code

We know that relocation works on symbols. Load-time relocation is one way to handle absolute address references in dynamic modules, but it has a major drawback: the instruction part cannot be shared among processes, which throws away one of the memory-saving advantages of dynamic linking. Our goal is simple: we want the instruction part of a shared module to stay unchanged regardless of the load address, so that it can be shared at load time. The basic idea is to move the parts of the instructions that would need modification into the data section, so that the instruction part stays constant while the data part gets a private copy in every process. This scheme is known as PIC (Position-Independent Code).

First we analyze the ways addresses are referenced within a module. We divide address references in a shared object into two kinds according to whether they cross module boundaries, module-internal references and module-external references, and by reference type into instruction references (calls and jumps) and data accesses. That gives the four cases shown below.

![2016092821717Four kinds of addressing mode.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092821717Four kinds of addressing mode.png)

  • The first is function calls, jumps, and so on inside the module.
  • The second is data access inside the module, such as global variables and static variables defined in the module.
  • The third is function calls, jumps, and so on to outside the module.
  • The fourth is data access to outside the module, such as global variables defined in other modules.

When compiling such code, the compiler actually cannot tell whether the variable b and the function ext() are defined inside or outside the module, because they might be defined in another object file of the same shared object. Because of this uncertainty, the compiler can only treat them as external functions and variables.

  • Intra-module calls or jumps. The first case is probably the simplest: calls within the module. Because the called function is in the same module as the caller, their relative position is fixed. On modern systems such calls and jumps can use relative addressing or register-based relative calls, so the instruction needs no relocation.

  • Intra-module data access. Next is the second case, data access within the module. An instruction obviously cannot directly contain the absolute address of the data, so the solution is relative addressing. A module is usually laid out as several pages of code followed by several pages of data, and the relative positions of these pages are fixed; that is, the distance between any instruction and the module-internal data it needs is a constant. So an instruction only needs its own address plus a fixed offset to access data inside the module.

  • Inter-module data access. Data access across modules is a bit more troublesome than data access within a module, because the target address is not known until load time; for example, the variable b above is defined in another module, and its address is determined at load time. As mentioned earlier, to make the code address-independent, the basic idea is to put the address-dependent parts into the data segment. ELF does this by building an array of pointers to these variables in the data segment, called the Global Offset Table (GOT). When code needs to reference such a global variable, it does so indirectly through the corresponding GOT entry. ![2016092880002The data access between modules.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092880002The data access between modules.png) When an instruction needs to access variable b, the program looks up the corresponding GOT entry to find the variable's real address; each entry occupies 4 bytes (on a 32-bit target). When the module is loaded, the linker looks up the address of each variable and fills in the GOT entries, ensuring every pointer points to the right place. Because the GOT itself lives in the data segment, it can be modified when the module is loaded, and each process can have its own copy, unaffected by the others.

    How does the GOT make the code address-independent? From the second case we know that a module can determine, at compile time, the offset of its internal data relative to the current instruction; the same applies to the GOT, since the GOT lives in the module's own data segment. So the code locates the GOT the same way: obtain the current instruction address (PC) and add a fixed offset. The offset of a particular variable's entry within the GOT is likewise fixed and determined by the compiler.

  • Inter-module calls and jumps. For calls and jumps across modules we can use the same approach as case four: the corresponding GOT entry stores the address of the target function, and when the module needs to call that function it jumps indirectly through the GOT entry. ![2016092872480Call and jump between modules.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092872480Call and jump between modules.png) This approach is simple but has some performance cost; in practice ELF uses a more sophisticated scheme (the PLT), introduced later under dynamic link optimization. ![2016092867796Various ways to address references.png](http://7xraw1.com1.z0.glb.clouddn.com/2016092867796Various ways to address references.png) (A short C sketch of these four reference types follows below.)
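
As a rough illustration (my own sketch; the names a, b, bar, ext, foo are placeholders), the four reference types look like this in C source. Compiled as PIC, the intra-module cases use relative addressing while the cross-module cases go through the GOT (and, for calls, the PLT):

/* pic_demo.c -- hypothetical module, e.g. built with: gcc -fPIC -shared -o pic_demo.so pic_demo.c */
extern int b;            /* defined in another module: inter-module data access (via GOT)  */
extern void ext(void);   /* defined in another module: inter-module call (via GOT/PLT)     */

static int a = 1;        /* defined in this module: intra-module data access (PC-relative) */

static void bar(void)    /* defined in this module: target of an intra-module call         */
{
    a = 2;               /* type 2: intra-module data access */
    b = 3;               /* type 4: inter-module data access */
}

void foo(void)
{
    bar();               /* type 1: intra-module call  */
    ext();               /* type 3: inter-module call  */
}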

Q&A

Q: If a shared object lib.so defines a global variable G, and processes A and B both use lib.so, will G in process B be affected when process A changes its value? A: No. When lib.so is loaded by two processes, its data segment has a separate copy in each process, so from this perspective a global variable in a shared object is essentially no different from a global variable defined inside the program: each process accesses only its own copy and does not affect the other. Now restrict the question to threads A and B of the same process: can they see each other's changes to the global G in lib.so? Two threads of the same process share the same address space, that is, the same copy of lib.so, so their changes to G are visible to each other.

So can we do the opposite? For example, require two processes to share one copy of a shared object's global variable, or require two threads to access different copies of a global variable. Both needs exist: multiple processes sharing a global variable can be used for inter-process communication, and giving each thread its own copy of a global variable avoids interference between threads. Both have solutions: sharing a global variable across processes is called a "shared data segment"; per-thread copies of global variables are called "Thread Local Storage".

Major issues affecting dynamic linking performance

  • Under dynamic linking, accesses to global and static data (and cross-module calls) go through an extra level of indirection: the GOT must be located first, and then the real address is fetched from it for an indirect access or jump, which inevitably slows the program down.
  • The linking work of dynamic linking is done at run time: every time the program starts, the dynamic linker has to find and load the required shared objects and then perform symbol lookup, relocation, and so on, which inevitably slows down program startup.

Delayed binding (PLT)

Under dynamic linking, program modules contain a large number of cross-module function references (global variable references tend to be fewer, because many cross-module globals mean tighter coupling). So before the program starts executing, dynamic linking would spend a lot of time resolving and relocating these function references; this is the second cause of slowdown mentioned above. However, during a run many functions may never be called at all, such as error-handling routines or features the user rarely touches, so binding every function up front is wasteful. ELF therefore uses Lazy Binding: a function is bound (symbol lookup, relocation, and so on) the first time it is used, and not bound at all if it is never used. This greatly speeds up program startup, especially for programs with many modules and many function references. The mechanism ELF uses is the PLT (Procedure Linkage Table), which relies on some clever instruction sequences.

This brings to mind the +load and +initialize methods of the NSObject class in iOS.

When the program starts, the runtime loads all classes; during this period the +load method is called if a class or one of its categories implements it.

The +initialize method is called before a class or its subclass receives its first message, whether sent to an instance or to the class object; if the class is never used, the method is never called.

Because of the special nature of these two methods, we can put some of the preconditions a class needs into them. If possible, prefer +initialize: +load is called at program startup and therefore affects launch time, whereas +initialize is lazy and executed only when needed.

Dynamic linker

With dynamic linking, loading the executable itself proceeds essentially as with static linking. The operating system first reads the executable's header and checks the file's validity, then reads the virtual address, file offset, and attributes of each "Segment" from the Program Header and maps them to the corresponding locations in the virtual space. These steps are basically the same as loading a statically linked program. In the statically linked case, the operating system could then hand control to the executable's entry address and start execution.

But with dynamic linking, the operating system cannot hand control to the executable immediately after loading it, because the executable depends on many shared objects: at this point, many references to external symbols in the executable are still invalid addresses, since the corresponding shared objects have not been loaded and linked. So after mapping the executable, the operating system first starts a Dynamic Linker.

Under Linux, the dynamic linker ld.so is itself a shared object, which the operating system also loads into the process's address space by mapping. After loading it, the operating system hands control to the dynamic linker's entry address. When the dynamic linker gains control, it performs its own initialization, then dynamically links the executable according to the current environment; when all linking is done, it transfers control to the executable's entry address, and the program starts to run.

Although I will not expand on the details of the dynamic linker itself, as a very distinctive and special shared object its implementation raises several questions worth thinking about:

  1. Is the dynamic linker itself dynamically or statically linked? The dynamic linker itself should be statically linked; it does not depend on other shared objects. The dynamic linker exists to help other ELF files resolve their shared-object dependencies; if it depended on shared objects itself, who would resolve its dependencies? So it must not depend on any other shared object. This can be confirmed with ldd:
ldd /lib/ld-linux.so.2
    statically linked
  2. Does the dynamic linker itself have to be PIC? Whether or not the dynamic linker is PIC does not matter in principle, but it is usually built as PIC: on the one hand, a non-PIC linker would make its code segment unshareable and waste memory; on the other hand, it would complicate ld.so's own bootstrapping, since its code segment would need relocation during startup. In practice ld-linux.so.2 is PIC.
  3. The dynamic linker can itself be loaded like a shared object, so what should its load address be? The load address recorded in ld.so, like that of an ordinary shared object, is 0x00000000, which is an invalid load address; as with any shared library, the kernel picks a suitable load address for it at load time.

The ".interp" section

So which dynamic linker does the operating system use, and who decides where it is? Are the dynamic linkers of all *NIX systems located at /lib/ld.so? In fact, the location of the dynamic linker is determined neither by system configuration nor by environment parameters, but by the ELF executable itself: dynamically linked ELF executables contain a special section called ".interp" ("interp" is short for "interpreter").

The ".interp" section contains a single string: the path of the dynamic linker the executable requires. Under Linux this is almost always "/lib/ld-linux.so.2"; other *NIX operating systems may use different paths.

The ".dynamic" section

Besides ".interp", ELF has several other sections dedicated to dynamic linking, such as ".dynamic" and ".dynsym".

The most important structure for dynamic linking is the ".dynamic" section, which holds the basic information the dynamic linker needs, such as which shared objects the file depends on, the location of the dynamic symbol table, the location of the dynamic relocation tables, and the addresses of the shared object's initialization code.

Dynamic symbol table

To accomplish dynamic linking, the most important information is the symbols the module depends on and the related file information. We know that in static linking there is a section called the Symbol Table, ".symtab", which holds all the symbol definitions and references of the object file. The symbol table for dynamic linking is actually very similar. For example, the Program1 program above depends on lib.so and references the function foobar() from it. From Program1's point of view, we say that Program1 imports foobar: foobar is an import function of Program1. From lib.so's point of view, it defines foobar() and provides it to other modules: lib.so exports foobar, and foobar is an export function of lib.so. Put in static-linking terms, this import/export relationship is simply an ordinary function definition and reference.

To represent the import/export relationships of symbols between modules under dynamic linking, ELF has a special section called the Dynamic Symbol Table, usually named ".dynsym". Unlike ".symtab", ".dynsym" keeps only the symbols relevant to dynamic linking, not module-internal symbols such as private variables. In many cases a dynamically linked module has both tables, ".dynsym" and ".symtab", where ".symtab" usually contains all symbols, including those in ".dynsym".

Like ".symtab", the dynamic symbol table needs some auxiliary tables, such as a string table for symbol names. For static linking this is the string table ".strtab"; here it is the dynamic string table ".dynstr". And because symbols must be looked up at run time, there is often also a symbol hash table (".hash") to speed up lookup.

The structure of the dynamic symbol table is almost the same as that of the static one. We can simply treat an imported function as a reference to a function in another object file, and an exported function as a function defined in this object file.

Dynamically linked relocation tables

The main reason shared objects need relocation is the presence of imported symbols. Under dynamic linking, whenever an executable or shared object depends on another shared object, that is, whenever it has imported symbols, its code or data contains references to those imported symbols. Their addresses are unknown at compile time; in static linking such unknown references are fixed at final link time, but in dynamic linking the addresses of imported symbols are only determined at run time, so the references must be corrected, that is, relocated, at run time.

Steps for dynamic linking

  1. Start the dynamic linker itself;
  2. Load the required shared objects;
  3. Relocation and initialization;

Explicit run-time linking

Systems that support dynamic linking usually also support a more flexible loading mode called Explicit Run-time Linking, sometimes called run-time loading: the program itself controls loading specified modules at run time and unloads them when they are no longer needed. From the previous sections we can see that, since the dynamic linker can load shared modules into memory and perform relocation at run time, run-time loading is easy in principle, and shared objects need no modification to be loaded this way. Shared objects used this way are often called dynamically loaded libraries; they are essentially the same as ordinary shared objects, only used differently by the developer.

This kind of run-time loading makes a program's module organization more flexible and can be used to implement plug-ins, drivers, and the like: a module is loaded when the program needs the plug-in or driver, rather than everything being loaded up front, which reduces startup time and memory use. A program can also reload a module while running, adding, removing, or updating modules without restarting, which is a great advantage for programs that must run for a long time. The most common example is a Web server, which needs to load different script interpreter modules or database connection drivers depending on its configuration: each interpreter or driver is built as an independent module and loaded when the server needs it. Likewise, for a reliable Web server that must keep running, if we need to add or upgrade a script interpreter we can simply tell the server to reload that shared module.

In Linux, a dynamically loaded library is virtually indistinguishable from an ordinary shared object as far as the file format goes. The main difference is that ordinary shared objects are loaded and linked by the dynamic linker before the program starts, a series of steps done automatically and transparently to the program, whereas run-time loading is driven by the program itself through an API provided by the dynamic linker, consisting of four functions:

  • dlopen(): opens a dynamic library, loads it into the process's address space, and completes its initialization. Its C prototype is:

    void *dlopen(const char *filename, int flag);

    The first argument is the path of the library to load. If it is an absolute path (starting with "/"), the function tries to open the file directly; if it is a relative path, dlopen() searches for the library in a certain order.

    The second parameter flag indicates how the function symbol is resolved.

    • RTLD_LAZY: use lazy binding; a function is bound the first time it is called, i.e., the PLT mechanism;
    • RTLD_NOW: complete all bindings when the module is loaded; if any undefined symbol cannot be bound, dlopen() returns an error;
    • RTLD_GLOBAL: can be combined (bitwise OR) with either of the above; it merges the loaded module's global symbols into the process's global symbol namespace, making them available to modules loaded later.

    The return value of dlopen() is a handle to the loaded module, which is used later by dlsym() and dlclose(). If loading fails, NULL is returned. If the module has already been loaded via dlopen(), the same handle is returned. In addition, if the loaded modules depend on each other, for example module A depends on module B, the programmer must load the dependency manually, loading B first and then A.

  • Finding symbols: dlsym(). The dlsym() function is the core of run-time loading: it is how we find the symbols we need. It is defined as follows:

    void *dlsym(void *handle, char *symbol);

    The definition is very succinct, with two arguments: the first is the handle returned by dlopen(); the second is the name of the symbol to look up, a C string ending in "\0". If dlsym() finds the symbol it returns its value; otherwise it returns NULL. The return value means different things for different kinds of symbols; if the symbol is a constant, it returns the constant's value. That raises a question: if the constant's value happens to be NULL or 0, how do we know whether dlsym() actually found it? This is where the dlerror() function described below comes in: if the symbol was found, dlerror() returns NULL; otherwise it returns an error message. When symbols in several shared modules have the same name, the symbol loaded first takes precedence; this is called Load Ordering. Does dlsym() also follow load ordering when looking up a symbol's address? In fact dlsym() uses two different orderings, depending on the handle:

    • Load ordering: if the lookup is done on the global symbol namespace (i.e., dlopen() was called with filename NULL), dlsym() uses load ordering, the same precedence used for global symbols.
    • Dependency ordering: if the lookup is done on a shared object opened by dlopen(), dlsym() uses dependency ordering: taking that shared object as the root, it traverses all of its dependent shared objects breadth-first until the symbol is found.
  • dlerror(): after each call to dlopen(), dlsym(), or dlclose(), we can call dlerror() to find out whether the last call succeeded. dlerror() returns a char*: NULL means the last call succeeded; otherwise it returns the error message.

  • dlclose(): the opposite of dlopen(); it unloads a module that has been loaded. The system maintains a reference counter for each loaded module: each dlopen() on a module increments it, each dlclose() decrements it, and the module is actually unloaded only when the counter drops to 0.

Programs operate on dynamic libraries through these APIs. They are implemented in /lib/libdl.so.2, and their declarations and constants are defined in the standard system header <dlfcn.h>.
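
Putting the four calls together, here is a minimal sketch. It assumes a hypothetical shared object ./libfoo.so that exports a function double foo(double); depending on the glibc version you may also need to link with -ldl:

#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    /* Load the (hypothetical) library and resolve all symbols eagerly. */
    void *handle = dlopen("./libfoo.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    dlerror();                                    /* clear any stale error state */
    double (*foo)(double) = (double (*)(double))dlsym(handle, "foo");
    const char *err = dlerror();
    if (err) {
        fprintf(stderr, "dlsym failed: %s\n", err);
        dlclose(handle);
        return 1;
    }

    printf("foo(3.14) = %f\n", foo(3.14));
    dlclose(handle);                              /* drop our reference to the module */
    return 0;
}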

The memory layout of a program

Generally speaking, the memory space used by an application contains the following "default" regions:

  • Stack: the stack maintains the context of function calls; without it, functions could not be implemented. The stack is usually allocated at the high end of user space and is typically a few megabytes in size.
  • Heap: the heap is the region used for memory dynamically allocated by the application; memory obtained with malloc or new comes from the heap. The heap usually sits below the stack (toward lower addresses), and at a given moment it may not be a single contiguous region. The heap is typically much larger than the stack, with a capacity from tens to hundreds of megabytes.
  • Executable image: the region holding the memory image of the executable file, read in or mapped by the loader at load time.
  • Reserved area: not a single region but a general term for address ranges that are protected from access. For example, in most operating systems very low addresses, such as NULL, cannot be accessed; this is part of why C uses 0 as the invalid pointer value, since normally there is no valid, accessible data at address 0.

A typical memory layout of a Linux process is shown below: ![201609295828Linux process address space layout.png](http://7xraw1.com1.z0.glb.clouddn.com/201609295828Linux process Address Space layout.png) One region in the figure has not been described yet: the "dynamic library mapping area", which is used to map the loaded shared libraries. Under Linux, if the executable depends on other shared libraries, the system allocates space for them starting at 0x40000000 and loads the shared libraries into that region.

The arrows in the figure show the growth direction of the variable-sized regions: clearly the stack grows toward lower addresses and the heap grows toward higher addresses. When the stack or heap is not large enough, each grows in the direction shown until the reserved space between them is used up.
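
A rough way to see this layout from inside a process (my own sketch; the exact addresses vary between runs, especially with address space layout randomization):

#include <stdio.h>
#include <stdlib.h>

int global_var = 1;                      /* data segment */

void show(void)
{
    int local_var = 2;                   /* stack        */
    void *heap_var = malloc(16);         /* heap         */

    printf("code  : %p\n", (void *)show);
    printf("data  : %p\n", (void *)&global_var);
    printf("heap  : %p\n", heap_var);
    printf("stack : %p\n", (void *)&local_var);

    free(heap_var);
}

int main(void)
{
    show();
    return 0;
}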

Q&A

Q: Why do I get a segmentation fault when running my program?

A: This is a typical error caused by dereferencing an illegal pointer. It occurs when a program uses a pointer to read or write a memory address it is not allowed to access. In the memory layout of Linux or Windows, some addresses can never be read or written, such as address 0; some addresses cannot be accessed until the application explicitly requests permission; and some addresses are not yet mapped to actual physical memory, so the application must first request that they be committed before it can freely read or write them. When a pointer points into one of these regions, reading or writing the memory it points to raises this error. There are two common causes:

  1. The programmer initializes a pointer to NULL and then uses it without ever assigning it a reasonable value.
  2. The programmer never initializes a pointer variable on the stack, so its value is essentially random, and then uses it directly.

So if your program hits this error, double-check how it uses pointers.
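
A contrived sketch of both causes (run it only if you want to watch it crash):

#include <stdio.h>

int main(void)
{
    int *p1 = NULL;      /* cause 1: pointer set to NULL and never given a valid address */
    int *p2;             /* cause 2: uninitialized stack pointer, its value is garbage   */

    *p1 = 42;            /* dereferencing NULL: typically a segmentation fault            */
    printf("%d\n", *p2); /* dereferencing a wild pointer: undefined behavior, often a crash */
    return 0;
}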

The stack

Stack is one of the most important concepts in modern computer programs. Almost every program uses the stack. Without the stack, there would be no functions, no local variables, and there would be no all computer languages that we can see today. Let’s look at the definition of a traditional stack:

In classical computer science, a stack is defined as a special container into which the user can push data and from which the user can pop data that was pushed earlier, under one constraint: First In Last Out (FILO), that is, the data pushed first is popped last.

In a computer system, the stack is a dynamic memory region with the above properties. Programs can push data onto or off the stack. The push operation makes the stack larger, and the pop operation makes the stack smaller.

The stack always grows downward. On i386, the top of the stack is located by a register called ESP. A push decreases the address of the stack top, and a pop increases it.

The stack plays an important role while a program runs. Most importantly, it holds the maintenance information needed for a function call, commonly called a Stack Frame or Activation Record. A stack frame generally contains the following:

  • The return address and arguments of the function.
  • Temporary variables: Includes non-static local variables of a function and other temporary variables automatically generated by the compiler.
  • Saved context: Includes registers that need to remain unchanged before and after a function call.
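
As a hedged illustration of these three parts (my own sketch, not the book's figure), here is a small function together with the ebp-based frame an i386 compiler typically builds for it; the exact layout depends on the compiler and its options.

```c
/* A minimal sketch of the ebp-based stack frame an i386 compiler typically
 * builds for foo(); actual layout is compiler- and option-dependent.
 *
 *   higher addresses
 *   | argument b          |  [ebp + 12]  <- pushed by the caller
 *   | argument a          |  [ebp + 8]
 *   | return address      |  [ebp + 4]   <- pushed by the call instruction
 *   | saved ebp           |  [ebp]       <- saved context (prologue)
 *   | local variable sum  |  [ebp - 4]   <- temporary variables
 *   lower addresses  (ESP, the stack top, sits below here)
 */
int foo(int a, int b)
{
    int sum = a + b;   /* a non-static local variable inside the frame */
    return sum;        /* the return value typically travels back in eax */
}
```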

Function calling conventions

There is no doubt that the caller and the callee of a function need a clear agreement on how the function is to be called; only if both sides follow the same agreement can the call work correctly. This agreement is called a Calling Convention. A calling convention generally specifies the following:

  • How arguments are passed and in what order. There are many ways to pass function arguments; the most common is through the stack: the caller pushes the arguments onto the stack, and the called function takes them from there. For functions with several arguments, the calling convention specifies the order in which the caller pushes them, from left to right or from right to left. Some calling conventions also allow arguments to be passed in registers to improve performance.
  • Stack maintenance. After the arguments are pushed, the function is called; afterwards the pushed arguments need to be popped so that the stack is consistent before and after the call. This clean-up can be done either by the caller or by the called function itself.
  • The name-mangling strategy. To distinguish calling conventions at link time, a calling convention decorates the name of the function itself; different calling conventions use different name-decoration strategies. (A small sketch of choosing a convention explicitly is shown after this list.)
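
As a hedged aside that goes beyond the notes above: on Windows compilers such as MSVC, the convention can be chosen per function with keywords like `__cdecl` and `__stdcall`; the decorated names in the comments are the typical MSVC forms, though details vary by compiler.

```c
/* A minimal sketch, assuming an MSVC-style Windows compiler.
 * Both functions do the same thing; only the calling convention differs.
 *
 * __cdecl  : arguments pushed right to left, the CALLER pops them;
 *            typical decorated name: _add_cdecl
 * __stdcall: arguments pushed right to left, the CALLEE pops them;
 *            typical decorated name: _add_stdcall@8 (8 = bytes of arguments)
 */
int __cdecl   add_cdecl(int a, int b)   { return a + b; }
int __stdcall add_stdcall(int a, int b) { return a + b; }
```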

Function return value passing

The examples and discussion in the book are quite long, so here I will just share my own understanding. Take the following example:

int test(int a, int b)
{
	return a + b;
}

int main()
{
	int a = 1;
	int b = 1;

	int n = test(a, b);
	return 0;
}
  • The main function first reserves some extra space on its own stack and uses part of it as a temporary object, call it temp, for receiving the return value.
  • The address of temp is passed to the test function as a hidden argument.
  • The test function copies the return data into temp and passes temp's address back out.
  • After test returns, main copies the contents of temp into n.

Of course, the actual process involves a fair amount of assembly code, which I omit here.
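A hedged note from my own understanding of typical i386 compilers (the book's discussion is more detailed): for a small return type such as int, the value is normally just passed back in the EAX register; the hidden temp-object mechanism above is what compilers typically use when the return type is too large for registers, for example a struct bigger than 8 bytes, as in this sketch:

```c
/* A minimal sketch of a case where the hidden temp-object mechanism
 * typically applies: returning a struct larger than 8 bytes.
 * The names here are illustrative, not taken from the book's listing. */
typedef struct {
    char data[128];
} big_thing;

big_thing make_big_thing(void)
{
    big_thing b;
    b.data[0] = 'A';
    return b;      /* typically copied out through a hidden pointer argument */
}

int main(void)
{
    big_thing n = make_big_thing();
    return n.data[0] == 'A' ? 0 : 1;
}
```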

The heap

A stack alone is not enough for programming, because data on the stack is released when the function returns, so it cannot be used to pass data out of the function, and there is no way to create "global" data dynamically; such data could only be defined at compile time, which is often not expressive enough. In these situations the Heap is the only choice.

The heap is a large chunk of memory, often occupying most of the process's virtual address space. In this space, a program can request a contiguous chunk of memory and use it freely; the memory remains valid until the program explicitly gives it up.

If a system call were required every time an application requested or freed heap space, the performance cost would be high. So a program requests a suitably large amount of heap space from the operating system up front and then manages it itself. The component that manages heap allocation is usually the program's runtime library.

In essence, the runtime "wholesales" a large chunk of heap space from the operating system and then "retails" it to the program. When it is all "sold out", or the program has an unusually large request, the runtime "restocks" from the operating system according to actual demand. Of course, when the runtime sells heap space to the program it must keep track of its wholesale stock; it cannot sell the same address twice and cause a conflict. The runtime therefore needs an algorithm to manage the heap space, and that algorithm is the heap allocation algorithm.
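
As a hedged illustration of the "wholesale" step on Linux (real allocators such as glibc's use both brk and mmap and are far more sophisticated), here is a sketch using the classic sbrk interface:

```c
/* A minimal sketch, not glibc's actual implementation: grow the program
 * break to "wholesale" a chunk of heap space from the kernel, which a
 * runtime would then "retail" to the program in smaller pieces. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *old_brk = sbrk(0);            /* current end of the heap segment */
    void *chunk   = sbrk(1024 * 1024);  /* ask the kernel for 1 MB more    */
    if (chunk == (void *)-1) {
        perror("sbrk");
        return 1;
    }
    printf("heap grew from %p to %p\n", old_brk, sbrk(0));
    return 0;
}
```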

Heap allocation algorithm

How to manage a large contiguous chunk of memory, allocating and freeing pieces of it on request, is what a heap allocation algorithm does. There are many such algorithms, from simple ones (like those described below) to complex ones designed for high performance or other special requirements.

  1. Free list. The free list method keeps the free blocks of the heap on a linked list. When the user requests a piece of space, the list is traversed until a block of suitable size is found and split; when the user frees space, the block is merged back into the free list. We first need a data structure that records all the free space in the heap, so that we know which piece of memory to hand out when the program requests space. There are many such structures; the simplest is the free list. In a free list, every free area in the heap starts (or ends) with a header; the header records the addresses of the previous (prev) and next (next) free blocks, so all the free blocks form a linked list, as shown below: ![2016100616640The distribution of free list.png](http://7xraw1.com1.z0.glb.clouddn.com/2016100616640The distribution of free list.png) How is space allocated in such a structure? First, a free block large enough for the requested size is found in the free list; the block is then split into two parts, one for the requested space and one for the remaining free space. The corresponding entry in the list is updated to describe the new, smaller free block; if the remaining free space has size 0, the entry is simply removed from the list. The following diagram shows the state of the heap after the user requests a block exactly the size of a free block: ![2016100612103The distribution of free list 2.png](http://7xraw1.com1.z0.glb.clouddn.com/2016100612103The distribution of free list 2.png) A free list implemented like this is simple, but when space is freed, the heap cannot tell from a pointer alone how large the allocated block is. A simple solution is: when the user requests k bytes, actually allocate k+4 bytes, and use the extra 4 bytes at the start to record the total size of the allocation, k+4. When the memory is freed, those 4 bytes are read to learn how large the block is, and it is then inserted back into the free list. Of course this is only the simplest allocation strategy, and it has many problems; for example, once the linked list or the 4-byte length record is corrupted, which out-of-bounds reads and writes can easily do, the whole heap stops working properly. (A tiny free-list sketch in C is shown after this list.)

  2. Bitmap. Bitmaps are another, also fairly cheap, alternative to free lists. The idea is to divide the entire heap into a large number of blocks of equal size. When the user requests memory, an integer number of blocks is always allocated: the first block is called the Head of the allocated region and the rest are its Body. We can use an array of integers to record block usage; since each block has only three states (head, body, free), two bits per block suffice, hence the name bitmap. Suppose the heap is 1MB and the block size is 128 bytes; then there are 1M/128 = 8K blocks, which can be recorded with 8K/(32/2) = 512 ints. This array of 512 ints is the bitmap, in which every two bits describe one block. When the user requests 300 bytes, the heap allocates 3 blocks and marks the corresponding bitmap positions as head or body. The following figure shows an example of such a heap: ![2016100612565Figure bit allocation.png](http://7xraw1.com1.z0.glb.clouddn.com/2016100612565Figure bit allocation.png) This heap has 3 allocated regions of 2, 4, and 1 blocks respectively, outlined with dotted boxes. The corresponding bitmap is: (HIGH) 11 00 00 10 10 10 11 00 00 00 00 00 00 00 10 11 (LOW), where 11 represents H (Head), 10 represents Body, and 00 represents Free. This approach has several advantages:

  • Speed: the usage information for the entire heap is stored in one compact array, so accessing it tends to hit the cache.
  • Stability: to guard against the user's out-of-bounds reads and writes corrupting the data, we can simply keep a backup copy of the bitmap; and even if some block data is corrupted, the heap as a whole can keep working.
  • Blocks require no additional information and are easy to manage.

Of course, the disadvantages are obvious:

  • Fragmentation occurs when allocating memory. For example, if 300 bytes are allocated, 3 blocks are allocated, i.e. 384 bytes, and 84 bytes are wasted.
  • If the heap is large, or if a block is set to be small (to reduce fragmentation), the bitmap will be large, potentially losing the advantage of high cache hits, and wasting space. In this case, we can use multi-level bitmaps.
  3. Object pool. The two heap management methods introduced above are the most basic ones. In some situations, the objects being allocated have a fairly fixed size; for such cases we can design a more efficient heap algorithm, called an object pool. The idea is simple: if every allocation from the heap has the same size, the whole heap can be divided in advance into a large number of small blocks of that size, and each request only needs to find one free block. The object pool can be managed with either a free list or a bitmap; it is just easier to implement, because every request is assumed to have a fixed size. Since only one unit of memory is requested at a time, requests are satisfied very quickly without having to search for a large enough space. In practice, real heap allocators often combine several algorithms. Glibc, for example, uses an object-pool-like approach for requests smaller than 64 bytes, a best-fit algorithm for requests larger than 512 bytes, a compromise between the two (depending on the situation) for requests between 64 and 512 bytes, and the mmap mechanism to request space directly from the operating system for requests larger than 128KB.
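
To make the free-list idea from point 1 concrete, here is a hedged, minimal sketch of my own (first-fit search, block splitting, a size field in each header); it omits merging of freed blocks and alignment, and is nothing like a production allocator such as glibc's.

```c
/* A minimal free-list allocator sketch over a static buffer. */
#include <stddef.h>
#include <stdio.h>

#define POOL_SIZE 4096

typedef struct block {
    size_t size;          /* payload size of this block */
    int free;             /* 1 if the block is free     */
    struct block *next;   /* next block in the list     */
} block;

static unsigned char pool[POOL_SIZE];
static block *head = NULL;

static void pool_init(void)
{
    head = (block *)pool;
    head->size = POOL_SIZE - sizeof(block);
    head->free = 1;
    head->next = NULL;
}

static void *pool_alloc(size_t n)
{
    for (block *b = head; b != NULL; b = b->next) {
        if (b->free && b->size >= n + sizeof(block)) {
            /* split: carve a new free block out of the remainder */
            block *rest = (block *)((unsigned char *)(b + 1) + n);
            rest->size = b->size - n - sizeof(block);
            rest->free = 1;
            rest->next = b->next;
            b->size = n;
            b->free = 0;
            b->next = rest;
            return b + 1;  /* payload starts right after the header */
        }
    }
    return NULL;           /* no block large enough */
}

static void pool_free(void *p)
{
    if (p != NULL)
        ((block *)p - 1)->free = 1;   /* mark free; merging omitted */
}

int main(void)
{
    pool_init();
    void *a = pool_alloc(100);
    void *b = pool_alloc(200);
    printf("a=%p b=%p\n", a, b);
    pool_free(a);
    pool_free(b);
    return 0;
}
```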

At the end

I have read this book twice. The first time I read the physical copy without taking notes and came away with only a general understanding of the whole book; the second time I read the electronic version and took notes as I went, and many things I had not understood in my earlier work finally made sense. There are still many things I do not understand, and they will have to be digested and realized slowly over the rest of my career.