Background

Go's greatest feature is its language-level support for concurrency (goroutines). A goroutine is the most basic execution unit in Go, and virtually every Go program has at least one: the main goroutine. A Go program starts from the main() function of the main package, and the runtime creates a default goroutine for main() when the program starts.
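Here is a minimal sketch of this (the spawned function and its message are purely illustrative):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// main() itself runs in the default main goroutine.
	go func() {
		// This function runs in a second, newly created goroutine.
		fmt.Println("hello from another goroutine")
	}()

	// Sleep briefly so the second goroutine gets a chance to run;
	// when the main goroutine returns, the whole program exits.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("hello from the main goroutine")
}
```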

To understand the design of Go's coroutines, let's take a look at the history behind Goroutine's design and see how it came to be what it is today.

Single process era

In early operating systems, each program was a process, and the operating system could run only one process at a time; the next process could not start until the current one finished. This period can be called the single-process era, the era of serial execution.

As shown in the figure, processes execute serially: processes A, B, and C run one after another.

Two problems in the single-process era:

  1. There is a single flow of execution: the computer can handle only one task at a time.
  2. CPU time wasted on process blocking is unavoidable (for example, if process A blocks, the CPU has no ability to switch away; it must wait until process A finishes before the next process can run).

Faced with these problems, how can we get the most out of the CPU?

Multiprocess era

Later, operating systems gained their earliest concurrency capability: multi-process concurrency. When one process blocks, the OS switches to another waiting process, using the CPU as much as possible so that no CPU time is wasted.

The multi-process era introduced the concept of the time slice. The CPU is divided among processes in time slices according to a scheduling algorithm, so processes A, B, and C execute concurrently, each running during its own slices.

This has two advantages:

  1. Even a single core can execute multiple processes concurrently, enabling much richer application scenarios.
  2. When one process blocks on I/O, the CPU is still kept busy with other processes.

However, as systems evolved, the disadvantages of scheduling the CPU at process granularity became more and more obvious.

Process switching requires:

  1. Switch the page directory to use the new address space
  2. Switch kernel stack and hardware context

Because processes own so many resources, creating, switching, and destroying them takes a long time. The CPU stays busy, but a large portion of its time is spent on process scheduling itself rather than on useful work.

How can we improve CPU utilization?

Multithreaded age

Thus a lightweight kind of process, the thread, was born. Threads require far fewer resources to run than processes.

Threads and processes can be understood as follows:

  • When a process has only one thread, the process can be considered equivalent to that thread.
  • When a process has multiple threads, the threads share resources such as virtual memory and global variables; these shared resources do not need to be touched during a context switch.
  • Threads also have their own private data, such as stacks and registers, which do need to be saved and restored during a context switch.

Threads are the smallest unit of CPU scheduling, while processes are the smallest unit of resource allocation:

  • Process: the smallest unit of resource allocation; each process has its own separate memory space during execution.
  • Thread: the smallest unit of CPU scheduling; a thread switch only requires saving and restoring the contents of a small number of registers.

Although threads are relatively lightweight, they still carry significant scheduling overhead. Each thread occupies more than 1 MB of memory for its stack, switching between threads is itself costly, and saving and restoring register contents requires the operating system to allocate and release resources.

Multi-process and multi-threaded concurrency improved systems' concurrent capability, but in today's high-concurrency Internet scenarios, creating a thread for every task is unrealistic: each thread has its own stack space, so a large number of threads consumes a large amount of memory, and the number of threads is also limited by system parameters such as threads-max.

Is there an even lighter kind of thread that can support today's Internet-scale concurrency? How can we make full use of the CPU, memory, and other resources to achieve higher concurrency?

Coroutines era

Coroutines are user-mode threads: an even lighter kind of thread, introduced to eliminate the resource overhead of thread switching in high-concurrency scenarios.

As you may know, there are kernel-mode threads and user-mode threads. A user-mode thread must be bound to a kernel-mode thread; the CPU does not know user-mode threads exist, it only knows it is running a thread, which is in fact a kernel-mode thread.

User-mode threads have another name: coroutines. To make the distinction clearer, we will use "coroutine" to mean a user-mode thread and "thread" to mean a kernel-mode thread.

How coroutines differ from threads

  • Threads/processes are scheduled by the kernel, have the concept of a CPU time slice, and are scheduled preemptively (with various scheduling algorithms).
  • Coroutines are transparent to the kernel: the system does not even know they exist, and they are scheduled entirely by the user program. Because the user program controls the scheduling itself, it is hard to force a CPU switch the way preemptive process/thread scheduling does, so coroutines usually use cooperative scheduling: other coroutines can run only after the current one voluntarily gives up control. A sketch of a voluntary yield follows below.
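To make the cooperative idea concrete, here is a small sketch using runtime.Gosched(), which voluntarily yields the processor so that other goroutines can run. (Go's own scheduler is not purely cooperative, as the preemption section later shows; this only illustrates what a voluntary yield looks like.)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	done := make(chan struct{})

	go func() {
		fmt.Println("worker: before yield")
		runtime.Gosched() // voluntarily give up the processor
		fmt.Println("worker: after yield")
		close(done)
	}()

	runtime.Gosched() // yield so the worker gets a chance to run first
	fmt.Println("main: running")
	<-done // wait for the worker to finish
}
```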

Scheduling of coroutines

1:1 scheduling

One coroutine is bound to one thread, which is the easiest model to implement. The scheduling of coroutines is effectively done by the kernel, and the disadvantage is that creating, destroying, and switching coroutines costs as much as doing so for threads: a context switch is as slow as a thread switch.

N:1 scheduling

N coroutines are bound to one thread. The advantage is that coroutines switch within the user-mode thread without trapping into the kernel, so switches are light and fast. But there are also big disadvantages: all of a process's coroutines are bound to one thread, so the program cannot use the multi-core acceleration of the hardware, and once one coroutine blocks the thread, all the other coroutines on it are blocked as well.

M:N scheduling

M coroutines are bound to N threads, a combination of the N:1 and 1:1 models. It overcomes the shortcomings of both models above, but it is the most complex to implement.

Go language coroutine scheduling

Go Runtime scheduler

The Go scheduler reduces the memory overhead of frequent thread switching by keeping the number of threads close to the number of CPUs, while running many much cheaper Goroutines on each thread, reducing the load on the operating system and hardware.

Each Goroutine is very lightweight, taking up only a few kilobytes of memory, which allows a huge number of Goroutines to exist within limited memory and thus supports more concurrency. Although a Goroutine stack is only a few kilobytes, it is growable: the runtime automatically allocates more stack space to a Goroutine if it is needed.
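A quick way to feel this lightness (the count of 100,000 here is an arbitrary choice for illustration): spawning a hundred thousand goroutines is unremarkable, while the same number of OS threads would exhaust memory on most machines.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 100000 // far more goroutines than a machine could have OS threads

	var wg sync.WaitGroup
	var mu sync.Mutex
	sum := 0

	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			mu.Lock()
			sum += v // each goroutine adds its own value
			mu.Unlock()
		}(i)
	}

	wg.Wait()
	fmt.Println("spawned", n, "goroutines, sum =", sum) // sum = 5000050000
}
```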

Goroutines are built on top of operating system threads, forming a many-to-many (M:N) two-level threading model with them.

M:N means M Goroutines run on N kernel threads: the operating system is responsible for scheduling the N kernel threads onto CPUs, and the Go runtime's Goroutine scheduler is responsible for scheduling and running the M Goroutines on those threads.

G-M model

The coroutine scheduler in Go 1.0 used the G-M model:

  • G stands for Goroutine, essentially a lightweight thread; it contains the call stack and important scheduling information, such as channel state.
  • M stands for Machine; each M is associated with one kernel OS thread and is managed by the operating system.

If an M (kernel thread) wants to execute a G or put a G back, it must access the global G queue. Since there are multiple Ms, multiple threads access the same resource and must lock it for mutual exclusion and synchronization, so the global G queue is protected by a mutex.

This scheduler has several disadvantages:

  • There is a single global mutex and centralized state. The global lock protects all Goroutine-related operations (creation, completion, rescheduling, and so on), causing lock contention problems.
  • Goroutine hand-off problem: a runnable Goroutine is often passed between Ms, which increases scheduling latency;
  • Each M has its own memory cache (m.mcache), which leads to high memory usage and poor data locality.
  • System calls frequently block and unblock worker threads, adding performance cost.

G-P-M model

The new coroutine scheduler introduced P (Processor), completing the G-P-M model. A Processor contains the resources needed to run Goroutines: if a thread wants to run Goroutines, it must first acquire a P. Each P also contains a queue of runnable Gs.

The functions of each module in the figure are as follows:

  • Global queue: stores Gs waiting to run.
  • P's local queue: like the global queue, it stores Gs waiting to run, holding at most 256 of them. When a new G is created, it is added to the local queue of the current P first; if that queue is full, half of its Gs are moved to the global queue.
  • P list: all Ps are created when the program starts and are kept in an array. There are at most GOMAXPROCS of them, which can be changed by calling runtime.GOMAXPROCS(N), where N is the desired count (see the sketch after this list).
  • M: each M represents a kernel thread, which the operating system scheduler assigns to a CPU core for execution.
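The number of Ps can be inspected and changed from code; a small sketch:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) only queries the current value without changing it;
	// by default it equals the number of logical CPU cores.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// Setting it to 1 leaves a single P: goroutines still run
	// concurrently, but no longer in parallel.
	prev := runtime.GOMAXPROCS(1)
	fmt.Println("previous GOMAXPROCS was", prev)
}
```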

Simply put, running a G requires the support of an M and a P: an M associated with a P forms a valid running environment for Gs, and each P contains a local run queue (runq) of Gs.

The three data structures can be summarized as follows:

  • G (Goroutine): runtime.g. Holds the function pointer to run, the stack, registers, and other scheduling state. Every go func statement creates one G, and the number is unlimited. A G represents a flow of user-code execution.
  • P (Processor): runtime.p. Holds the context for running Gs, including the mcache, the run queue (runq), free Gs, and so on. Defaults to the number of machine cores and can be adjusted via the GOMAXPROCS environment variable. A P represents the resources required for execution.
  • M (Machine): runtime.m. Corresponds to one kernel thread. There are usually more Ms than Ps; the maximum number of Ms can be set, with a default limit of 10000. An M represents the executor, the underlying thread.

Scheduling strategy: the core idea of the scheduler is to reuse threads, avoiding frequent thread creation and destruction as much as possible, to improve efficiency.

  1. Work Stealing

When a thread has no G left to run, it steals Gs from the P bound to another thread instead of destroying the thread outright.

  2. Hand-off mechanism

When a thread M1 blocks because the G it is running makes a system call, M1 releases its bound P and hands the P over to another idle M.

Preemption

In Go, a Goroutine can occupy the CPU for at most about 10 ms before it is preempted, preventing other Goroutines from starving because they wait too long to run.
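Since Go 1.14 the runtime can even preempt a tight loop asynchronously, so a sketch like the one below still prints from main despite a busy-looping goroutine and a single P (on much older Go versions the busy loop could monopolize the P):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P makes the effect easiest to observe

	go func() {
		for {
			// A busy loop with no function calls or channel operations.
			// Since Go 1.14 the runtime preempts it asynchronously once
			// it has held the CPU for about 10ms.
		}
	}()

	time.Sleep(50 * time.Millisecond)
	fmt.Println("main still gets scheduled despite the busy loop")
}
```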

Global G queue

The global G queue is protected by a mutex, and accessing it requires contending for the lock. In the new scheduler the global G queue's role is greatly weakened: an M takes Gs from it only when it fails to steal Gs from other Ps during work stealing.
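These queues can actually be observed at runtime: starting any Go program with the GODEBUG=schedtrace runtime option makes the scheduler print a status line periodically (the exact fields vary by Go version). A sketch:

```go
// Run with scheduler tracing enabled, for example:
//
//	GODEBUG=schedtrace=1000 go run main.go
//
// Every 1000ms the runtime prints a line roughly like:
//
//	SCHED 1009ms: gomaxprocs=8 idleprocs=8 threads=5 spinningthreads=0 idlethreads=3 runqueue=0 [0 0 0 0 0 0 0 0]
//
// where runqueue is the length of the global G queue and the bracketed
// numbers are the lengths of each P's local run queue.
package main

import "time"

func main() {
	// Keep the process alive long enough to see a few trace lines.
	time.Sleep(3 * time.Second)
}
```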

Conclusion

This article has explained Goroutine's design from a historical and macro perspective; the actual implementation is far more complex than this. Follow-up articles will dig into exactly how Go's coroutine scheduling is implemented in the source code.

What is the function of P?

Comparing the two models, we can see that:

  • The concurrency performance of the initial G-M model was not impressive: the scheduling of coroutines onto system threads was crude, leading to many performance problems, such as contention on the global resource lock and high per-M memory usage, which caused significant performance loss.
  • The G-P-M model adds P and implements a work-stealing scheduling algorithm on top of it: P maintains a Goroutine queue and chooses a suitable M to bind to, and P provides the related execution context, such as the mcache and the task queue. The number of Ps determines the maximum number of Gs that can run in parallel.

What is the point of Goroutines?

Goroutines, of course, solve some of the problems of operating system threads:

  1. Creation and switching are too heavy: creating and switching OS threads requires entering the kernel, and entering the kernel carries a relatively high performance cost, i.e., large overhead.
  2. The stack is large and fixed: by default, the kernel allocates a fairly large stack to an OS thread when it is created, which wastes memory, yet there is still a risk of stack overflow.

Goroutine's advantage is that it costs much less:

  1. A Goroutine is a user-mode thread: it is created and switched in user code without entering the operating system kernel, so the overhead is far lower than creating and switching system threads.
  2. A Goroutine starts with a default stack size of 2 KB, which is enough in most cases; a sketch follows below. When it is not, the Goroutine's stack grows automatically, and it also shrinks automatically if it has become too large and wasteful, so there is neither a risk of stack overflow nor wasted stack memory.
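A rough way to see the growable stack in action (the depth and frame size below are arbitrary choices for illustration): the recursion needs far more than the initial 2 KB, and the runtime grows the stack transparently instead of overflowing.

```go
package main

import "fmt"

// deep recurses n times; each frame holds a 128-byte buffer, so a depth
// of 100000 needs well over 12 MB of stack in total, far beyond the
// initial 2 KB a goroutine starts with.
func deep(n int) int {
	var buf [128]byte
	buf[0] = byte(n)
	if n == 0 {
		return int(buf[0])
	}
	return deep(n-1) + int(buf[0])
}

func main() {
	done := make(chan int)
	go func() {
		// This goroutine starts with a ~2 KB stack; the runtime grows
		// it automatically as the recursion deepens.
		done <- deep(100000)
	}()
	fmt.Println("result:", <-done)
}
```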

Coroutines in other languages

Coroutines are a good thing, and many languages support them, for example: coroutine libraries for C, coroutines for Java, libco and libgo for C++, Kotlin's kotlinx.coroutines, and Python's gevent.

Welcome to follow my WeChat official account [Miannong in Singapore] for more technology sharing.