The origin of the Golang “scheduler”
There was no scheduler in the single-process era
We know that all software runs on top of an operating system, and it is the CPU that does the real work. In the early days of operating systems, each program was a process, and the next program could not run until the previous one had finished. This was the “single-process era”.
Early single-process operating systems faced two problems:
- Execution is strictly sequential: the computer can only perform one task at a time.
- CPU time is wasted whenever a process blocks.
Could multiple processes run together to perform multiple tasks?
Later, operating systems gained their earliest concurrency capability: multiple processes ran concurrently, and when one process blocked, the CPU switched to another process waiting to execute, so that the CPU was used as much as possible rather than sitting idle.
The scheduler in the multi-process/multi-threaded era
In a multi-process/multi-threaded operating system, the blocking problem is solved: a blocked process can immediately be switched out for another, and the CPU scheduling algorithm ensures that every runnable process gets a slice of CPU time. From a macro perspective, it looks as if multiple processes are running at the same time.
However, a new problem arises. A process owns a lot of resources, so creating, switching, and destroying processes takes a long time. The CPU is kept busy, but if there are too many processes, a large share of CPU time is spent on scheduling itself.
How can I increase CPU utilization?
Clearly, what the CPU schedules and switches between are processes and threads. Threads look appealing, but multi-threaded design actually becomes more complex, with many synchronization and contention issues to consider, such as locking and race conflicts.
Coroutines improve CPU utilization
Multi-process, multi-threaded systems improved concurrency, but in today’s highly concurrent Internet scenarios, creating one thread per task is impractical because it consumes too much memory (a process’s virtual address space takes up 4 GB [on 32-bit operating systems], while a thread takes up approximately 4 MB).
A large number of processes/threads brings new problems:
- High memory usage
- High CPU consumption from scheduling
Engineers then realized that a thread can actually be split into a “kernel-mode” thread and a “user-mode” thread.
A “user-mode thread” must be bound to a “kernel-mode thread”, but the CPU is not aware that “user-mode threads” exist; it only knows that it is running a “kernel-mode thread” (Linux’s PCB, the process control block).
With this distinction, we can refine the terminology: kernel-mode threads are still called “threads”, and user-mode threads are called “coroutines”.
N:M mapping between coroutines and threads
A coroutine is different from a thread: a thread is scheduled preemptively by the operating system, while a coroutine is scheduled cooperatively by the user program.
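Jumping ahead slightly to Go as an example of user-mode scheduling: the snippet below is a minimal sketch (not a description of the runtime’s internals) that uses runtime.Gosched() to make goroutines voluntarily yield the processor. Note that since Go 1.14 goroutines can also be preempted, so the explicit yield is only a hint, and the interleaving of output is not guaranteed.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)

	go func() {
		defer wg.Done()
		for i := 0; i < 3; i++ {
			fmt.Println("worker:", i)
			runtime.Gosched() // voluntarily yield so other goroutines can run
		}
	}()

	for i := 0; i < 3; i++ {
		fmt.Println("main:", i)
		runtime.Gosched() // cooperative yield point; interleaving is not guaranteed
	}
	wg.Wait()
}
```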
Coroutines in Go: goroutines
Go uses goroutines and channels to provide an easier way to use concurrency. A goroutine comes from the concept of a coroutine: a set of reusable functions running on top of a set of threads. Even if one coroutine blocks, the other coroutines on that thread can still be scheduled by the runtime and moved to other runnable threads. Crucially, these low-level details are invisible to the programmer, which makes programming easier and provides easier concurrency.
In Go, coroutines are called goroutines, and they are very lightweight: a goroutine needs only a few kilobytes of stack, and that is enough for it to run. This makes it possible to have a huge number of goroutines in a limited amount of memory, and therefore more concurrency. Although a goroutine’s stack starts at only a few kilobytes, it can grow; if more space is needed, the runtime allocates it automatically.
Goroutine features:
- Smaller memory footprint (a few KB)
- More flexible scheduling (handled by the runtime)
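As a minimal sketch of this goroutine-plus-channel style (the squaring work and the number of workers are arbitrary choices for illustration), a few goroutines can be started with go and their results collected over a channel:

```go
package main

import "fmt"

func main() {
	results := make(chan int)

	// Start a few lightweight goroutines; each costs only a few KB of stack.
	for i := 1; i <= 3; i++ {
		go func(n int) {
			results <- n * n // send the result back over the channel
		}(i)
	}

	// Receive one result per goroutine; a receive blocks until a value arrives.
	for i := 0; i < 3; i++ {
		fmt.Println(<-results)
	}
}
```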
The goroutine scheduler: the GMP model
In Go, threads (M) are the entities that run goroutines (G), and the scheduler’s job is to assign runnable goroutines to worker threads.
- Global queue: holds Gs that are waiting to run.
- P’s local queue: like the global queue, it holds Gs waiting to run, but it holds at most 256 of them. When a new G′ is created, G′ is added to P’s local queue first; if the queue is full, half of the Gs in the local queue are moved to the global queue.
- P list: all Ps are created at program startup and kept in an array of at most GOMAXPROCS (configurable) entries; see the sketch after this list.
- M: a thread that wants to run a task must first acquire a P and then take a G from that P’s local queue. If P’s local queue is empty, M tries to take a batch of Gs from the global queue and put them in P’s local queue, or steals half of the Gs from another P’s local queue and puts them in its own P’s local queue. M runs the G; when that G finishes, M takes the next G from P, and so on.
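A small sketch of how the number of Ps can be inspected and changed from user code via runtime.GOMAXPROCS (the value 2 below is an arbitrary choice for illustration):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reads the current setting without changing it;
	// this is the number of Ps the scheduler created at startup.
	fmt.Println("P (GOMAXPROCS):", runtime.GOMAXPROCS(0))
	fmt.Println("logical CPUs  :", runtime.NumCPU())

	// Change the number of Ps for this process (2 is arbitrary here).
	runtime.GOMAXPROCS(2)
	fmt.Println("P after change:", runtime.GOMAXPROCS(0))
}
```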
The goroutine scheduling flow
The creation and execution flow of go func()
Several conclusions can be drawn from the figure above:
1. We create a goroutine with go func();
2. There are two queues that store Gs: each P has a local queue, and there is one global queue. A newly created G is placed in P’s local queue first; if that local queue is full, it goes into the global queue.
3. A G can only run on an M, and an M must hold a P; the M:P relationship is 1:1. M pops a runnable G from P’s local queue; if that queue is empty, M steals a runnable G from another M/P combination.
4. The process of M scheduling and executing Gs is a repeating loop (see the sketch after this list).
5. When M is executing a G and the G makes a syscall or some other blocking operation, M blocks. If some Gs are currently executing, the runtime detaches the thread M from P and then creates a new operating system thread (or reuses an idle thread, if one is available) to serve this P.
6. When M’s system call ends, the G it was running tries to acquire an idle P so it can continue executing, and it is put into that P’s local queue. If no P can be obtained, the thread M goes dormant and joins the idle threads, and the G is placed in the global queue.
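To see the practical effect of this M:N multiplexing, the sketch below starts ten thousand goroutines and compares the goroutine count with a rough count of OS threads (the threadcreate profile is only an approximation of the number of threads the runtime has created):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(100 * time.Millisecond) // keep each goroutine alive briefly
		}()
	}

	time.Sleep(20 * time.Millisecond) // give the goroutines time to start

	fmt.Println("goroutines:", runtime.NumGoroutine())
	// The threadcreate profile roughly tracks OS threads created by the runtime.
	fmt.Println("OS threads:", pprof.Lookup("threadcreate").Count())

	wg.Wait()
}
```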
Scenario-by-scenario analysis of the Go scheduler
Scenario 1: G1 creates G2
P1 owns G1, and M1 starts running G1 after acquiring P1. G1 creates G2 with go func(); for locality, G2 is added to P1’s local queue first.
Scenario 2: G1 is complete
After G1 finishes running (function: goexit), the goroutine running on M switches to G0, which is responsible for scheduling the switch between coroutines (function: schedule). G0 fetches G2 from P’s local queue, switches from G0 to G2, and starts running G2 (function: execute). This reuses thread M1.
Scenario 3: G2 creates too many Goroutines
Suppose the local queue for each P can hold only three Gs. The first three Gs (G3, G4, and G5) have been added to the local queue of P1. The local queue of P1 is full.
Scenario 4: G2 creates a goroutine while the local queue is full
When G2 creates G7, P1’s local queue is full, so load balancing is needed: the first half of the Gs in P1’s local queue, together with the newly created G, are transferred to the global queue.
(In the implementation it is not necessarily the new G that is moved: if the new G is to be executed right after G2, it is kept in the local queue and an older G is moved to the global queue in its place.)
These Gs are shuffled when they are transferred, so G3, G4, and G7 end up in the global queue.
Scenario 6: Waking up a dormant M
Rule: when a new G is created, the running G tries to wake up an idle P and M combination to execute it.
Suppose G2 wakes up M2. M2 binds to P2 and runs G0, but P2’s local queue has no G, so M2 is now a spinning thread (a thread that is running but has no G, constantly looking for one).
Scenario 7: Awakened M2 fetches G from the global queue
M2 tries to take a batch of Gs from the global queue (“GQ”) and put them into P2’s local queue (function: findrunnable()). The number of Gs that M2 takes from the global queue follows this formula:
n = min(len(GQ)/GOMAXPROCS + 1, len(GQ)/2)
At least one G is taken from the global queue, but not too many Gs are moved from the global queue to a P’s local queue at once, leaving some for the other Ps. This is load balancing from the global queue to the local queues.
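A small sketch of this formula in plain Go (a simplified model of the balancing rule above, not the runtime’s actual code, which applies additional caps such as the free space left in P’s local queue):

```go
package main

import "fmt"

// globalQueueBatch mirrors n = min(len(GQ)/GOMAXPROCS + 1, len(GQ)/2).
// It is a simplified model of the load-balancing rule described above.
func globalQueueBatch(lenGQ, gomaxprocs int) int {
	n := lenGQ/gomaxprocs + 1
	if half := lenGQ / 2; n > half {
		n = half
	}
	return n
}

func main() {
	fmt.Println(globalQueueBatch(100, 4)) // 26
	fmt.Println(globalQueueBatch(6, 4))   // 2
}
```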
Scenario 8: M2 steals Gs from M1
Assume G2 is still running on M1. After two rounds, M2 has fetched G7 and G4 from the global queue into P2’s local queue and finished running them. Both the global queue and P2’s local queue are now empty, as shown in the left half of the scenario-7 figure.
M2 then has to perform work stealing: it steals half of the Gs from another P and puts them into its own P’s local queue. P2 takes half of the Gs from the tail of P1’s local queue; in this case that is just one G, G8, which it puts into P2’s local queue and executes.
Scenario 9: The limit on spinning threads
G5 and G6 in P1’s local queue have already been stolen and run to completion by other Ms. At the moment, M1 and M2 are running G2 and G8 respectively, while M3 and M4 have no goroutines to run, so M3 and M4 are spinning, constantly looking for goroutines.
Why let M3 and M4 spin? Spinning is essentially running, and a running thread that is not executing any G is a waste of CPU.
Why not destroy the threads to save CPU resources? Because creating and destroying a thread also takes time, and we want a new goroutine to have an M available to run it as soon as it is created; destroying and recreating threads would add latency and reduce efficiency. Of course, too many spinning threads would also waste CPU, so the system allows at most GOMAXPROCS spinning threads (GOMAXPROCS = 4 in this example, so there are four Ps); any additional idle threads are put to sleep.
Scenario 10: G makes a blocking system call
Suppose that, in addition to M3 and M4, M5 and M6 are also idle threads (not bound to any P; note that there can be at most 4 Ps here, so the number of Ms is always >= the number of Ps, and most Ms are waiting to grab a P to run). Now G8 creates G9, and then G8 makes a blocking system call. M2 and P2 are immediately unbound, and P2 makes the following decision: if P2’s local queue has a G, the global queue has a G, or there is an idle M, P2 immediately wakes up an M and binds to it; otherwise P2 is added to the list of idle Ps and waits for some M to claim it. In this scenario, P2’s local queue has G9, so P2 can bind to another idle thread, M5.
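From user code this hand-off is invisible, but its effect can be observed. The sketch below (Unix-only; the raw read on stdin is just one convenient way to park a thread in a system call) blocks one goroutine in a syscall while other goroutines keep making progress, because the runtime detaches the blocked M from its P and lets another thread drive that P:

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	// This goroutine blocks its M in a raw system call (reading from stdin).
	go func() {
		buf := make([]byte, 1)
		_, _ = syscall.Read(0, buf) // blocks until stdin has data or is closed
	}()

	// Meanwhile, other goroutines keep running: the runtime hands the blocked
	// M's P over to another thread, so the rest of the program is not starved.
	for i := 0; i < 3; i++ {
		fmt.Println("still making progress:", i)
		time.Sleep(10 * time.Millisecond)
	}
}
```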
Scenario 11: G makes a non-blocking system call
G8 creates G9, but this time G8 makes a non-blocking system call.
M2 and P2 unbind, but M2 remembers P2, and then G8 and M2 enter the system-call state. When G8 and M2 exit the system call, they try to reacquire P2; if that fails, they try to get an idle P; if there is none, G8 is marked as runnable and added to the global queue.
Conclusion
The Go scheduler is lightweight and simple, yet powerful enough to handle goroutine scheduling, and it gives Go its native, powerful concurrency. The essence of Go scheduling is to distribute a large number of goroutines across a small number of OS threads for execution, and to use multi-core parallelism to achieve even stronger concurrency.