1. Why is a “scheduler” needed?

1. Single-process era

Early operating systems could only run one task at a time: all tasks executed sequentially, and the next task started only after the previous one finished. This model has two obvious drawbacks:

  • Computers cannot handle more than one task at a time
  • When one process is stuck, other processes must wait for it to complete, wasting CPU time

2. Multi-threaded/multi-process era

Because the single-process model was so limited, the better multi-process/multi-thread model evolved to replace it. This model solves the single-process problem: when one process blocks, the CPU's computing resources are released to the next process, so CPU time is no longer wasted; and because the switches are fast enough, from a macro view the CPU appears to be processing multiple tasks at the same time.

However, this model introduces a new problem: switching CPU computing resources between processes itself costs CPU time. If switching happens frequently, a large share of CPU resources is consumed by process scheduling.

3. Coroutines

Multi-process and multi-threading increased system concurrency, but in today's high-concurrency Internet scenarios it is impractical to create a thread for every task, because of the memory each one consumes (a process's virtual address space can be 4GB on a 32-bit operating system, while a thread's stack takes about 4MB).

As a result, a large number of processes/threads brings new problems:

  • High memory usage
  • High CPU consumption from scheduling

Then engineers realized that threads can actually be split into "kernel-mode" threads and "user-mode" threads. A user-mode thread must be bound to a kernel-mode thread, but the CPU is unaware that the user-mode thread exists; it only knows it is running a kernel-mode thread (the Linux PCB, process control block).

So, to simplify the terminology: the kernel-mode thread is still called a thread, while the user-mode thread is called a coroutine (co-routine).

The difference between coroutines and threads is that threads are preemptively scheduled by the CPU, while coroutines are cooperatively scheduled by the user: one coroutine cedes the CPU before the next one runs.

Based on this, some people asked: since one coroutine can bind to one thread, can multiple coroutines bind to one or more threads? From this question, several binding relationships naturally follow:

1:1 relationship

This model is essentially no different from the multi-threaded/multi-process era, so its pros and cons are much the same; there is not much more to say about it.

N:1 relationship

N coroutines are bound to one thread. The advantage is that coroutine switches happen entirely in user space, without trapping into the kernel, so switching is very light and fast. But there is also a major disadvantage: all of a process's coroutines are bound to a single thread, so once one coroutine blocks, the thread blocks, and none of the process's other coroutines can run; there is no concurrency at all. Incidentally, this is the model Python currently supports.

M:N relationship

This model has the most complex implementation, but it addresses the shortcomings of the models above.

2. Design ideas of the Goroutine scheduler's GMP model

Go provides an easy-to-use concurrency model through goroutines and channels. The goroutine comes from the coroutine concept: a set of reusable functions runs on top of a set of threads, and even if one coroutine blocks, the other coroutines of that thread can be scheduled by the runtime onto another runnable thread. Crucially, the programmer never sees these low-level details, which makes programming simpler and concurrency easier.

In Go, the coroutine is called a goroutine, and it is very lightweight: a goroutine takes only a few kilobytes, which is enough for it to run. This allows a large number of goroutines within limited memory, enabling more concurrency. Although a goroutine's stack is only a few KB, it is scalable: the runtime automatically allocates more memory to the goroutine if needed.

Goroutine features:

  • Smaller memory footprint (a few KB)
  • More flexible scheduling (runtime scheduling)

1. The GMP model

  • Global Queue: Stores G waiting to run.
  • P's local queue: like the global queue, it stores G's waiting to run, but it holds a limited number, no more than 256. When a new G' is created, it is added to P's local queue first; if the queue is full, half of the G's in the local queue are moved to the global queue.
  • P list: all P's are created when the program starts and stored in an array, at most GOMAXPROCS of them (configurable).
  • M: a thread that wants to run a task must first acquire a P, then take a G from that P's local queue. When P's queue is empty, M will try to take some G's from the global queue and put them into P's local queue, or steal half from another P's local queue into its own. M runs G; after G finishes, M takes the next G from P, and so on.

The quantitative relationship between P and M

  1. The number of P's is determined either by the environment variable $GOMAXPROCS at startup or by calling the runtime function GOMAXPROCS().

  2. The default maximum number of M threads is 10000 (which can usually be ignored). You can also set the maximum number of threads with the SetMaxThreads function in runtime/debug.

When P and M are created

  1. After the number of P's, n, is determined, the runtime creates n P's when the program starts.

  2. M's are created when there are not enough of them to bind to the corresponding P's, in the following two scenarios:

    1. When a G blocks in a system call, its P unbinds from the current M and looks for a dormant M to bind to instead; if no dormant M is available, a new one is created. After finishing its task, the former M goes dormant because it no longer has a P bound to it.

    2. When a new G is created, it tries to bind an idle P and M; if no dormant M is available, a new M is also created.

To sum up, the number of P's is not directly related to the number of M's.

2. Scheduling flow of a coroutine

3. Scheduling policy of the scheduler

Work stealing mechanism

When a thread has no runnable G, it tries to steal G's from a P bound to another thread.

Hand off mechanism

When a G blocks in a system call, its thread releases the bound P and hands the P over to another idle thread. After the blocked G finishes, it tries to find an idle P to join (its previous P is preferred); if there is none, it is put into the global queue to wait.

Preemption (introduced in Go 1.14)

Unlike classic coroutines, which rely on each coroutine voluntarily releasing the CPU, Go added preemption to prevent other goroutines from being "starved": once a goroutine has occupied the CPU for more than 10ms, it can be preempted by other goroutines. The preempted G is suspended, its current stack information is saved, and it is put into the global queue to wait for the next scheduling round.

As a simple experiment, here is a very small program: the main goroutine creates a G and hands the thread over to the new G by sleeping:

package main
import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1)

	fmt.Println("The program starts ...")

	go func() {
		count := 1
		for {
			count += 1
		}
	}()

	time.Sleep(time.Second)
	fmt.Println("I got scheduled!")
}

Add Dockerfile as follows:

ARG GO_VERSION
FROM golang:${GO_VERSION}
COPY ./demo.go /app/
CMD ["go", "run", "/app/demo.go"]

Build and run the image for each of the two Go versions:

# Build and run with Go 1.13
docker build -t app13 --build-arg GO_VERSION=1.13 .
docker run -t --rm app13
# The output is:
The program starts ...

# Build and run with Go 1.14
docker build -t app14 --build-arg GO_VERSION=1.14 .
docker run -t --rm app14
# The output is:
The program starts ...
I got scheduled!


When a G is created, a new M/P binding is attempted

When a G is created, the runtime tries to bind an idle M and P (if there is no idle P, the attempt is abandoned), and the new binding then steals G's from other queues to run. For example, if GOMAXPROCS is 4 and a for loop in main launches four goroutines, the work stealing mechanism eventually spreads the G's evenly across the P's.

Spin thread

A thread whose G0 keeps trying to obtain a runnable G is in the spin state, and such a thread is called a spin thread. The spin state lasts until a runnable G is found or the thread is reclaimed.

On the block

There are two types of blocking, user-mode blocking and Syscall blocking:

  1. User-mode blocking

Examples are network I/O, blocking on a channel, and sleep (simply put, the CPU currently has nothing to do for this coroutine). For this kind of blocking, the G is parked in a waiting queue, and it is put back onto a P after the blocking ends.

  2. System call blocking

Syscall-level blocking triggers the hand off mechanism described above.

4. Special M0 and G0

M0 is the main thread, numbered 0, started with the program. Its instance lives in a global runtime variable, so M0 does not need to be allocated on the heap. M0 performs initialization and starts the first G; after that, M0 behaves like any other M.

G0 is the first goroutine created each time an M starts. G0 is used only for scheduling; it does not point to any executable user function, and each M has its own G0. G0's stack space is used during scheduling and system calls. The G0 referred to by the global variable is M0's G0.