Written up front

Since this is the first article on threads, we will focus on the basics of threads and some of the conceptual groundwork. Parts of it may read a bit like formal reference material, but that is what systematic learning takes, and I hope you can sit through it.

When reading this article, I hope you will temporarily set aside assumptions tied to any particular language. Different languages can handle some of these problems very differently, so the code in this article is pseudocode, meant to explain the ideas without language-specific details getting in the way.

You may find that some of the situations described here never show up in your language, perhaps because your compiler's optimization strategy has improved. As the saying goes, it is better to have no books than to believe everything in them: I hope we all experiment a lot, because practice is the only criterion for testing truth.

What is a thread

Threads and processes

A thread, also called a lightweight process (LWP), is the smallest unit of a program's execution flow. A standard thread consists of a thread ID, the current instruction pointer (PC), a register set, and a stack. A typical thread/process layout looks like this:




Image from Programmer's Self-Cultivation: Linking, Loading, and Libraries

A brief history

Multiprogramming

In the early days of computing, CPU time was very expensive. If a CPU ran only one program, it would sit idle whenever that program was, say, reading from or writing to disk (or perhaps tape), which was a waste.

So a monitor program was quickly written: when program A temporarily had no use for the CPU, the monitor started another program that was waiting for CPU time, keeping the CPU fully utilized.

This approach is called multiprogramming and, while primitive, it greatly improved CPU utilization.

Of course, this approach has a major drawback: the scheduling strategy is too crude. A program might need the CPU to finish a task (such as responding to user interaction) yet not be given the CPU until much later.

Time-sharing system

After a bit of improvement, programs ran in a cooperative mode in which each program voluntarily ceded the CPU to other programs after running for a while. This cooperative style is called a time-sharing system, and it matters especially for interactive tasks.

Earlier versions of Windows (before Windows 95 and Windows NT) and Mac OS before Mac OS X used this kind of time-sharing system to schedule programs.

Of course, it also has big problems. If a program is performing a very time-consuming computation and keeps hold of the CPU, the operating system can do nothing about it, and every other program just waits, as if the system had crashed. A simple while(1) loop could bring the entire system to a halt, which seems ridiculous today.

Multitasking systems

The multitasking system is the one we are familiar with today. The operating system takes over all hardware resources and itself runs at a hardware-protected privilege level.

All applications run as processes at a lower privilege level than the operating system, each with its own independent address space. CPU time is allocated by the operating system (OS): every process gets a chance to run according to its priority, and if a process runs longer than a certain time slice, the OS suspends it and hands the CPU to another waiting process.

This style of CPU allocation is called preemptive: the operating system can forcibly take the CPU away from a process and give it to whichever process it judges to need it most at that moment.

Some basic concepts

Concurrency

When we talk about concurrency, we usually mean the cooperative scheduling of multiple threads or processes, which is a good way to keep a time-consuming operation from freezing the whole system.

In fact, the number of CPU cores determines how many threads can truly run at the same time; a single-core CPU simply switches between threads so quickly that they appear to run simultaneously.

Three states of a thread

Threads usually have at least three states, which are:

  • Running: the thread is currently executing
  • Ready: the thread could run, but the CPU is occupied by another thread
  • Waiting: the thread is waiting for an event (usually I/O or synchronization) and cannot continue for now

The thread switches back and forth between the three states, as shown below:




Image from Programmer's Self-Cultivation: Linking, Loading, and Libraries

Round-robin and priority scheduling

Ever since multitasking operating systems appeared, many schemes and algorithms have been proposed for thread scheduling. Although mainstream approaches differ, they all bear traces of two classic algorithms: round-robin scheduling and priority scheduling.

Round-robin, as the name suggests, means each thread takes a turn executing for a short time slice, so threads run in an interleaved fashion. Priority scheduling is based on thread priorities: every thread has its own priority, and threads with higher priority are run earlier.

Starvation can occur under priority scheduling: the CPU is always grabbed by high-priority threads, so a low-priority thread may never get to run. A well-known example of this is OSSpinLock and its priority-inversion problem.

Of course, to avoid starvation, schedulers usually raise the priority of a thread once it has waited long enough.

I/O-intensive versus CPU-intensive

Threads that wait frequently are called I/O-bound (I/O-intensive) threads, while threads that rarely wait are called CPU-bound (CPU-intensive) threads. I/O-intensive threads are generally easier for the scheduler to favor with higher priority than CPU-intensive ones.

Step into multithreading

The concepts and history above are just the tip of the iceberg, but before this gets tedious, let me bring in an interesting case to follow.

See the pseudocode snippet below

int x = 0;
thread1: {
    lock();
    x++;
    unlock();
};

thread2: {
    lock();
    x++;
    unlock();
};

after thread1 & thread2: {
    print(x);
}


Analyzing the code, we would conclude that because the increment of x happens only after the lock is taken, the integrity of the increment is guaranteed. Yet with older compilers, or on older systems, this code can still print x = 1.
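For concreteness, here is one way the pseudocode above might look as a real program, sketched with POSIX threads (my own illustration, not code from the article; the helper name increment and the use of pthreads are assumptions):

#include <pthread.h>
#include <stdio.h>

static int x = 0;
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

// Both threads run the same body: lock, increment, unlock.
static void *increment(void *arg) {
    pthread_mutex_lock(&x_lock);
    x++;
    pthread_mutex_unlock(&x_lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);   // we expect 2 here
    return 0;
}

Built with something like cc -pthread demo.c, this prints x = 2 on a modern toolchain.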

Why could x ever end up as 1? To answer that, let's talk about thread safety.

Thread safety

Races and atomic operations

Let's start with a famous example in which two threads execute the code shown in the table below:

Thread 1        Thread 2
i = 1;
++i;            --i;

On many architectures, ++i is implemented in three steps:

  1. Read i into some register X
  2. Increment X (X++)
  3. Store the contents of X back into i

Since thread 1 and thread 2 run concurrently, their execution sequence might look like the following table:

No.   Instruction   Variable values after execution   Thread
1     i = 1         i = 1, X[1] = unknown             Thread 1
2     X[1] = i      i = 1, X[1] = 1                   Thread 1
3     X[2] = i      i = 1, X[2] = 1                   Thread 2
4     X[1]++        i = 1, X[1] = 2                   Thread 1
5     X[2]--        i = 1, X[2] = 0                   Thread 2
6     i = X[1]      i = 2, X[1] = 2                   Thread 1
7     i = X[2]      i = 0, X[2] = 0                   Thread 2

The final value of i may be 0, 1, or 2. The reason is that the increment compiles into more than one assembly instruction, so the scheduler can interrupt a thread partway through and run other code in between.

We call an operation that executes as a single, uninterruptible instruction an atomic operation; such an operation can never be cut short halfway. In fact, if you look at the libdispatch source code, you will find it relies on a number of atomic builtins, such as the following (a small usage sketch follows the list):

  • __sync_add_and_fetch((p), 1): atomically adds 1 and returns the new value
  • __sync_sub_and_fetch((p), 1): atomically subtracts 1 and returns the new value
  • __sync_add_and_fetch((p), (v)): atomically adds v and returns the new value
  • __sync_sub_and_fetch((p), (v)): atomically subtracts v and returns the new value
  • __sync_fetch_and_or((p), (v)): atomically ORs in v and returns the old value
  • __sync_fetch_and_and((p), (v)): atomically ANDs with v and returns the old value
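As a small illustration of how such a builtin might be used (my own sketch with GCC/Clang-style builtins and POSIX threads, not libdispatch's actual code; the counter and worker names are made up):

#include <pthread.h>
#include <stdio.h>

static long counter = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        // Atomically add 1 and get the new value back.
        __sync_add_and_fetch(&counter, 1);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  // always 200000 with the atomic builtin
    return 0;
}

Because each add is a single atomic operation, the final count is always 200000, with no lock needed for this particular job.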

Synchronization with the lock

However, as program logic grows more and more complex, atomic operations alone clearly cannot meet our needs, so we have to introduce a new concept: the lock.

Locking is a common synchronization method. Each thread tries to acquire the lock before accessing a shared resource and releases it when finished. If a thread tries to acquire a lock that is already held, it waits until the lock becomes available again.

There are many kinds of locks; here are a few without too much explanation (a small sketch of one of them follows the list):

  • Binary Semaphore
  • Semaphore
  • Mutex
  • Critical Section
  • Read-Write Lock (Shared/Exclusive)
  • Condition Variable
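To make one of these concrete: a read-write lock lets many readers hold the lock at the same time, while a writer gets exclusive access. A minimal POSIX sketch (my own illustration; the table and function names are invented):

#include <pthread.h>

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
static int table[64];

// Many readers may hold the lock at once...
int read_entry(int i) {
    pthread_rwlock_rdlock(&table_lock);
    int value = table[i];
    pthread_rwlock_unlock(&table_lock);
    return value;
}

// ...but a writer gets exclusive access.
void write_entry(int i, int value) {
    pthread_rwlock_wrlock(&table_lock);
    table[i] = value;
    pthread_rwlock_unlock(&table_lock);
}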

Do many of these look familiar? pthread_mutex, NSLock, NSCondition, dispatch_semaphore, @synchronized, and so on.

There is also the concept of reentrancy. A function is re-entered when it is executed again before an earlier call has finished, either because of external factors (such as an interrupt or another thread) or because it calls itself. A function is called reentrant if re-entering it has no adverse consequences; reentrancy is a strong guarantee of concurrency safety, and a reentrant function can be used safely in multithreaded environments.
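A classic way to see the difference (my own example, not from the article): a function that keeps its result in a static buffer is not reentrant, while a version that works only with storage supplied by the caller is. The C library pair strtok and strtok_r follows the same pattern.

#include <stdio.h>

// Not reentrant: the static buffer is shared by every caller,
// so concurrent (or nested) calls overwrite each other's result.
const char *format_id_unsafe(int id) {
    static char buf[32];
    snprintf(buf, sizeof(buf), "id-%d", id);
    return buf;
}

// Reentrant: all state lives in storage supplied by the caller,
// so any number of threads can call it at the same time.
const char *format_id_r(int id, char *buf, size_t size) {
    snprintf(buf, size, "id-%d", id);
    return buf;
}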

Over-optimization

However, even when locks are used correctly, thread safety is still not guaranteed, because the compiler may over-optimize the code. Think back to the locked x++ example at the start of this section: x++ itself is not atomic, and the compiler is also free to change how x is accessed.

To speed up access to x, the compiler may keep x in a register. But each thread has its own register context, which creates a problem. Consider the following order of events:

  1. [thread1] reads the value of x into register R[1] (R[1] = 0)
  2. [thread1] R[1]++
  3. [thread2] reads the value of x into register R[2] (R[2] = 0)
  4. [thread2] R[2]++
  5. [thread2] writes R[2] back to x (x = 1)
  6. [thread1] (much later) writes R[1] back to x (x = 1)

This brings us to the keyword volatile.

volatile means different things in different languages. In C and C++, volatile ensures that accesses to the variable are not optimized away by the compiler and requires the value to be read from memory each time it is used.

A volatile variable forces a thread to re-read its value from shared memory every time it is accessed, and when the variable changes, the new value is written back to shared memory rather than kept privately. In this way, at any moment, two different threads always see the same value of the variable.

The volatile keyword can be used to prevent over-optimization. Essentially it does two things:

  • It prevents the compiler from caching the variable in a register, for speed, without writing it back (see the sketch below).
  • It prevents the compiler from reordering instructions that operate on volatile variables.
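A tiny sketch of the first point in C (my own illustration; the flag name is arbitrary). Note that volatile only constrains the compiler here; it is not a substitute for atomics or locks when real synchronization is needed.

#include <stdbool.h>

volatile bool done = false;   // set to true by another thread when work is finished

void wait_for_done(void) {
    // Because done is volatile, the compiler must load it from memory on
    // every pass instead of caching it in a register and spinning forever
    // on a stale copy.
    while (!done) {
        /* spin (a real program would sleep or use a condition variable) */
    }
}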

Clearly, volatile solves the first problem above, but the second point is not enough on its own: even if we stop the compiler from over-optimizing and reordering, we cannot stop the CPU from dynamically reordering execution.

Now let's look at some code that appears to have no problems. The following singleton looks fine at first glance; note that double-checked locking is used to reduce the cost of taking the lock.

volatile T *pInstance = 0;
T *GetInstance() {
    if (pInstance == NULL) {
        lock();
        if (pInstance == NULL) {
            pInstance = new T;
        }
        unlock();
    }
    return pInstance;
}

At first glance this code looks fine: lock() and unlock() prevent errors caused by races between threads. But there is actually another problem, again caused by out-of-order CPU execution.

new in C++ actually consists of two steps:

  1. Allocate memory
  2. Call the constructor

So pInstance = new T; actually involves three steps:

  1. Allocate memory
  2. Call the constructor on that memory
  3. Assign the memory address to pInstance

Of these three steps, steps 2 and 3 can in practice be swapped. That means this can happen: pInstance is already non-null, but the constructor has not been called yet. If another thread then accesses this not-yet-initialized pInstance, how the program behaves depends entirely on how the class is designed.

This leads us to a new concept: the memory barrier, often simply called a barrier. A barrier instruction prevents the CPU from moving instructions that precede it to after it, and vice versa. So after which line should we add the barrier instruction?
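As a hedged sketch of where the barrier would go (my own C-flavored illustration rather than the canonical fix; create_T, the mutex, and the GCC/Clang builtin __sync_synchronize() are assumptions standing in for whatever a real codebase uses):

#include <pthread.h>
#include <stdlib.h>

// A stand-in type and constructor; in the article's example this is "new T".
typedef struct { int value; } T;

static T *create_T(void) {
    T *t = malloc(sizeof(T));
    t->value = 42;               // "construction" finishes here
    return t;
}

static T *volatile pInstance = NULL;
static pthread_mutex_t instance_lock = PTHREAD_MUTEX_INITIALIZER;

T *GetInstance(void) {
    if (pInstance == NULL) {
        pthread_mutex_lock(&instance_lock);
        if (pInstance == NULL) {
            T *tmp = create_T();     // steps 1 + 2: allocate and construct
            __sync_synchronize();    // barrier: construction completes before publication
            pInstance = tmp;         // step 3: publish the address
        }
        pthread_mutex_unlock(&instance_lock);
    }
    return pInstance;
}

Strictly speaking, on weakly ordered CPUs the reading side may also need an acquire barrier before using the object, and in modern code one would reach for atomics or an existing primitive such as dispatch_once instead of a hand-rolled barrier; the sketch only shows that the writer's barrier belongs between construction and publication.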

Let’s look at the most common code used to build singletons in Objective-C:

static T *_tInstance_ = nil;
+ (instancetype) shareInstance
{
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        _tInstance_ = [[T alloc] init];
    });
    return _tInstance_;
}

If you are familiar with the GCD source code, you will know that the dispatch_once implementation includes a memory barrier; an excerpt follows:

void dispatch_once_f(dispatch_once_t *val, void *ctxt, void (*func)(void *))
{
    volatile long *vval = val;
    if (dispatch_atomic_cmpxchg(val, 0L, 1L)) {
        func(ctxt);                  // call the function
        dispatch_atomic_barrier();   // add a memory barrier
        *val = ~0L;
    } else {
        do {                         // wait for the "lock"
            _dispatch_hardware_pause();
        } while (*vval != ~0L);
        dispatch_atomic_barrier();
    }
}

In this code, we can guarantee that the object's construction and the assignment are completed before dispatch_atomic_barrier(), and therefore before *val is finally set. Of course, I won't spell out here why this is not simply a lock; readers can work that out for themselves. The APIs provided by high-level languages often do this work for us, which makes it easy to overlook details that deserve our attention.

Written at the end

Most of what this article presents is dated, but knowledge itself never goes out of date: the underlying concepts and principles we learn help us better understand the computer world and solve all kinds of strange problems.

When learning, we should keep a rigorous attitude and combine theory with practice. Many of the optimizations mentioned above no longer show up in day-to-day development. For example, the multithreaded x++ problem described above was in fact tested without the volatile keyword. (Objective-C & Apple clang version 11.0.3 (clang-1103.0.32.59))

We can write a small demo to test this. The environment is the same as above:

{
    int i = 1;
    i++;

    int j = i;
    j++;

    int k = j;
    i++;
    k = i;
}

We throw the compiled .app file into Hopper and look at the corresponding assembly code, as shown below:




Objective-C & Apple clang version 11.0.3 (clang-1103.0.32.59)

We can see that every time a previously used variable is needed, it is re-read from memory into a register, and after each register operation the result is written back to memory; values are not left sitting in registers without being written back. (If you raise the compiler optimization level to Fast, Fastest, Smallest, and so on, you will find more interesting things; please try it yourself.) So whether we are reading books or articles, we should stay skeptical of what we see. Only by trying and exploring more can we make real progress.

If there are any mistakes in the article, please point them out.

This article was first published on Juejin (Nuggets); please credit the source when reposting. Personal email: [email protected]. Personal WeChat: Leeluanxin