Preface

This section contains three articles that cover the mechanics and semantics of the Go scheduler.

The three articles are:

  • Understanding Go Scheduling, Part 1: OS Scheduler
  • Understanding Go Scheduling, Part 2: Go Scheduler
  • Understanding Go Scheduling, Part 3: Concurrency

Introduction

The first article explained scheduling at the operating-system level, which is important for understanding Go scheduling. In this article, I’ll explain how the Go scheduler works at a semantic level and focus on its high-level behaviors. The Go scheduler is a complex system and the small mechanical details matter less than a good working model of how it behaves; that model will let you make better engineering decisions.

Start with a program

When your Go program starts, it is given a logical processor (P) for every virtual core identified on the host machine. If your processor has multiple hardware threads per physical core (Hyper-Threading), each hardware thread is presented to your Go program as a virtual core. To understand this, take a look at the system report for my MacBook Pro.

Figure 2.1

You can see a single processor with four physical cores. What the report doesn’t say is how many hardware threads each physical core has. The Intel Core i7 processor has Hyper-Threading, which means there are two hardware threads per physical core. So the Go program knows that eight virtual cores are available for executing OS threads in parallel.

To verify this, take a look at the following program:

Listing 1
package main

import (
	"fmt"
	"runtime"
)

func main() {

	// NumCPU returns the number of logical
	// CPUs usable by the current process.
	fmt.Println(runtime.NumCPU())
}

When I run this program on my machine, NumCPU() returns 8, so any Go program I run on this machine will be given 8 logical processors (P).

Each P is assigned an OS thread (M). The M stands for machine. This thread is still managed by the OS, and the OS is still responsible for placing the thread on a core for execution. This means that when I run a Go program on my machine, I have eight threads available to execute my work, each individually attached to a P.
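If you want to verify how many P’s your program has been given, a minimal sketch using the standard runtime package looks like this (calling GOMAXPROCS with an argument of 0 reads the current value without changing it):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current number of logical
	// processors (P) without modifying the setting.
	fmt.Println("logical processors (P):", runtime.GOMAXPROCS(0))
}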

Every Go program is also given an initial Goroutine (G). A Goroutine is essentially a coroutine, but this is Go, so the letter “C” is replaced with a “G” and we get the word Goroutine. You can think of a Goroutine as an application-level thread, and it is similar to an OS thread in many ways. Just as OS threads are context-switched on and off a core, Goroutines are context-switched on and off an M.

The last piece of the puzzle is the run queues. There are two different run queues in the Go scheduler: the global run queue (GRQ) and the local run queue (LRQ). Each P is given an LRQ that manages the Goroutines assigned to execute within that P’s context. These Goroutines take turns being context-switched on and off the M bound to that P. The GRQ holds Goroutines that have not yet been assigned to a P. How Goroutines move from the GRQ to an LRQ is something we’ll discuss shortly.

Figure 2.2 shows all the relevant components together.

Figure 2.2

Cooperative scheduling

As we discussed in Part 1, the OS scheduler is a preemptive scheduler. That means you can’t predict what the scheduler is going to do next; the kernel makes the decisions, and those decisions are non-deterministic. Applications running on top of the OS have no control over kernel scheduling unless they use synchronization primitives such as atomic instructions and mutex calls.

The Go scheduler is part of the Go runtime, which is compiled into your application. This means the Go scheduler runs in user space, above the kernel.

The current Go scheduler is not a preemptive scheduler but a cooperating scheduler. Being a cooperating scheduler means the scheduler needs well-defined user-space events that occur at safe points in the code to make scheduling decisions.

What’s brilliant about Go’s cooperative scheduling is that it looks and feels preemptive. You can’t predict what the Go scheduler will do; scheduling decisions are made by the Go runtime, not the developer. Because scheduling is non-deterministic, it’s not much of a stretch to think of the Go scheduler as a preemptive scheduler, and doing so is a useful mental model.

Goroutine state

Just like threads, Goroutines have the same three high-level states. A Goroutine can be in one of three states: Waiting, Runnable, or Executing.

Waiting: the Goroutine is stopped and waiting for something before it can continue, such as the operating system (system calls) or synchronization calls (atomic and mutex operations). These types of latencies are a root cause of poor performance.

Runnable: the Goroutine wants time on an M to execute its assigned instructions. If many Goroutines want time on an M, each one has to wait longer to get it. Also, the amount of time any individual Goroutine gets shrinks as more Goroutines compete, so this type of scheduling latency can also cause poor performance.

Executing: the Goroutine has been placed on an M and is executing its instructions. The application’s work is getting done; this is the state everyone wants.

Context Switching

The Go scheduler needs well-defined user-space events that occur at safe points in the code to context-switch. These events and safe points happen within function calls, which makes function calls critical to the health of the Go scheduler. In Go 1.13 and earlier, if you ran a tight loop that made no function calls, you would cause latencies in scheduling and garbage collection. It is critically important that function calls happen within reasonable time frames.

Note: A proposal for non-cooperative preemption was raised to address this and was implemented in Go 1.14, which applies signal-based asynchronous preemption so that even tight loops without function calls can be preempted.
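To make this concrete, here is a minimal sketch of the problem (my own example, not from the article). On Go 1.13 and earlier it can hang forever, because the busy Goroutine never reaches a safe point; on Go 1.14 and later, asynchronous preemption lets it finish:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// A single P makes the effect easy to observe.
	runtime.GOMAXPROCS(1)

	go func() {
		for {
			// A tight loop with no function calls: before Go 1.14,
			// the scheduler had no safe point at which to preempt it.
		}
	}()

	// Yield the P so the busy Goroutine gets scheduled.
	time.Sleep(10 * time.Millisecond)

	// With asynchronous preemption (Go 1.14+), main gets the P back
	// and this line prints; on older versions it may never run.
	fmt.Println("main was scheduled again")
}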

There are four classes of events in Go programs that give the scheduler the opportunity to make scheduling decisions:

  • Use of the keyword go
  • Garbage collection
  • System calls
  • Synchronization and orchestration

Use of the keyword go

The keyword go is how you create Goroutines. Once a new Goroutine is created, the scheduler has an opportunity to make a scheduling decision.
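As a minimal sketch (the WaitGroup is just one common way to wait for the Goroutine to finish):

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)

	// The go keyword creates a new Goroutine, giving the scheduler
	// an opportunity to make a scheduling decision.
	go func() {
		defer wg.Done()
		fmt.Println("hello from a new Goroutine")
	}()

	wg.Wait()
}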

Garbage collection

The GC has its own set of Goroutines that need time on an M to run, which can create a lot of scheduling chaos. However, the scheduler is smart about what a Goroutine is doing, and it leverages that knowledge to make reasonable decisions. One example is context-switching a Goroutine that wants to touch the heap with Goroutines that don’t touch the heap while the GC is running. Many scheduling decisions are made when the GC is running.

System calls

If a Goroutine makes a system call that blocks the M, sometimes the scheduler can context-switch the Goroutine off the M and replace it with a new Goroutine. However, sometimes a new M is needed to keep executing the Goroutines queued up in the P. I’ll cover how this works in more detail in the next section.

Synchronization and orchestration

If an atomic, mutex, or channel operation causes a Goroutine to block, the scheduler can context-switch a new Goroutine onto the M. Once the blocked Goroutine can run again, it can be re-queued and eventually context-switched back onto an M.
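A minimal sketch of this: the channel receive below blocks the main Goroutine, giving the scheduler a well-defined point to context-switch the sending Goroutine onto the M:

package main

import "fmt"

func main() {
	ch := make(chan string)

	go func() {
		// Runs while main is blocked on the receive below.
		ch <- "hello"
	}()

	// A blocking channel receive is a safe point where the scheduler
	// can context-switch another Goroutine onto the M.
	msg := <-ch
	fmt.Println(msg)
}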

Asynchronous system calls

When the OS you are running on can handle system calls asynchronously, something called the network poller can be used to process those system calls more efficiently. Different operating systems implement this using kqueue (macOS), epoll (Linux), or IOCP (Windows).

Many operating systems today can handle network-based system calls asynchronously. This is where the network poller gets its name, since its primary purpose is handling network operations. By using the network poller for networking system calls, the scheduler prevents Goroutines from blocking an M while those calls are made. This lets the M keep executing other Goroutines from the P’s LRQ instead of creating a new M, which helps to reduce scheduling load on the OS.
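Here is a rough, self-contained sketch of that effect (the server, the connection count, and the use of the threadcreate profile are my own illustration, not from the article): hundreds of Goroutines block in Read at the same time, yet the process needs only a handful of OS threads, because blocked Goroutines are parked on the network poller rather than holding Ms:

package main

import (
	"fmt"
	"net"
	"runtime/pprof"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Server: every accepted connection gets a Goroutine that blocks in Read.
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				buf := make([]byte, 1)
				c.Read(buf) // parked on the network poller, not holding an M
			}(conn)
		}
	}()

	// Open many client connections that never send anything, so many
	// server-side Goroutines sit blocked in Read at once.
	conns := make([]net.Conn, 0, 200)
	for i := 0; i < 200; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		conns = append(conns, c)
	}

	time.Sleep(500 * time.Millisecond)
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
	for _, c := range conns {
		c.Close()
	}
}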

The best way to see how this works is to walk through an example.

Figure 2.3

Figure 2.4

Figure 2.5

Synchronous system calls

What happens when a Goroutine wants to make a system call that can’t be made asynchronously? In this case, the network poller can’t be used, and the Goroutine making the system call will block the M. This is unfortunate, but there’s no way to prevent it from happening. One example is file-based system calls. If you use cgo, there are other situations where calling C functions will block the M as well.

Note: Windows does have the ability to make file-based system calls asynchronously. Technically, when running on Windows, the network poller can be used.
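As a rough illustration (the file path is only an example), a plain file read goes through synchronous system calls, so the Goroutine making it holds its M for the duration:

package main

import (
	"fmt"
	"os"
)

func main() {
	// os.ReadFile issues file-based system calls that can't go through
	// the network poller; while they run, this Goroutine occupies its M,
	// and the scheduler may bring in another M to keep the P serviced.
	data, err := os.ReadFile("/etc/hosts") // example path
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println("read", len(data), "bytes")
}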

Let’s look at what happens when a synchronous system call (like file I/O) blocks the M.

Figure 2.6

Figure 2.6 shows our basic scheduling diagram again, but this time Goroutine-1 makes a synchronous system call that blocks M1.

Figure 2.7

In Figure 2.7, the scheduler identifies that Goroutine-1 has caused the M to block. At this point, the scheduler detaches M1 from the P, with the blocking Goroutine-1 still attached to it. The scheduler then brings in a new M2 to service the P, and Goroutine-2 from the LRQ is context-switched onto M2. If an M already exists from a previous swap, this transition is faster than creating a new M.

Figure 2.8

In Figure 2.8, the blocking system call made by Goroutine-1 finishes. At this point, Goroutine-1 can move back into the LRQ and be serviced by the P again. M1 is then placed on the side for future use if this scenario needs to happen again.

Work Stealing

Another aspect of the scheduler is that it is a work-stealing scheduler, which helps keep scheduling efficient in a few areas. The last thing we want is an M moving into a waiting state, because once that happens the OS will context-switch the M off the core. This means the P can’t get any work done, even if there is a runnable Goroutine, until an M is context-switched back onto a core. Work stealing also helps balance the Goroutines across all the Ps, so the work is better distributed and gets done more efficiently.

Let’s look at an example.

Figure 2.9

In Figure 2.9, we have a multithreaded Go program: two Ps, each servicing four Goroutines, and a single Goroutine in the GRQ. So what happens if one of the Ps finishes servicing all of its Goroutines quickly?

Figure 2.10

P1 has no more Goroutines to execute, but there are runnable Goroutines in both the GRQ and P2’s LRQ. In this case, P1 will steal work. The rules for work stealing are as follows:

Listing 2
runtime.schedule() {
    // only 1/61 of the time, check the global runnable queue for a G.
    // if not found, check the local queue.
    // if not found,
    //     try to steal from other Ps.
    //     if not, check the global runnable queue.
    //     if not found, poll network.
}

So based on the rules in Listing 2, P1 needs to check P2’s LRQ for Goroutines and take half of them.

Figure 2.11

In Figure 2.11, half of the Goroutines are stolen from P2, and P1 can now execute those Goroutines.

What happens if P2 finishes servicing all of its Goroutines and there is nothing left in P1’s LRQ to steal?

Figure 2.12

In Figure 2.12, P2 has finished all its work and now wants to steal some. First, it looks at P1’s LRQ and finds nothing. Next, it looks at the GRQ, where it finds Goroutine-9.

Figure 2.13

In Figure 2.13, P2 steals Goroutine-9 from the GRQ and begins executing the work. The great advantage of all this work stealing is that it keeps the Ms busy instead of idle. Internally, this work stealing can be thought of as spinning the Ms, a behavior whose benefits are well explained in this blog post.

Practical example

With the mechanics and semantics in place, let’s see how all of this comes together to let the Go scheduler get more work done over the same period of time. Imagine a multithreaded C application where the program needs to manage two OS threads that are passing messages back and forth to each other.

Figure 2.14

In Figure 2.14, there are two threads passing messages back and forth. Thread 1 gets context-switched onto Core1 and is now executing, which allows Thread 1 to send its message to Thread 2.

Note: How the message is passed is unimportant. What matters is the state of the threads as this orchestration proceeds.

Figure 2.15

In Figure 2.15, once Thread 1 finishes sending the message, it needs to wait for the response. This causes Thread 1 to be context-switched off Core1 and moved into a waiting state. Once Thread 2 is notified about the message, it moves into a runnable state. Now the OS can perform a context switch and get Thread 2 executing on Core2. Thread 2 then processes the message and sends a new message back to Thread 1.

Figure 2.16

In Figure 2.16, the context switch happens again as Thread 1 receives the message from Thread 2. Thread 2 switches from executing to waiting, and Thread 1 switches from waiting to runnable and finally back to executing, which allows it to process and send a new message back.

All of this context switching and state changing takes time, which limits how fast the work can get done. With each context switch incurring a potential latency of around 50 ns, and hardware optimistically executing 12 instructions per nanosecond, that’s roughly 600 instructions not executing during each context switch. Since these threads are also bouncing between different cores, the chance of additional latency from cache-line misses is high as well.

Let’s look at the same example, this time using Goroutines and the Go scheduler instead.

Figure 2.17

In Figure 2.17, there are two Goroutines passing messages back and forth. G1 gets context-switched onto M1, which happens to be running on Core1, which allows G1 to begin processing its work. G1 then sends its message to G2.

Figure 2.18

In Figure 2.18, once G1 finishes sending the message, it needs to wait for the response. This causes G1 to be context-switched off M1 and moved into a waiting state. Once G2 is notified about the message, it moves into a runnable state. Now the Go scheduler can context-switch G2 onto M1, which is still running on Core1. G2 then processes the message and sends a new message back to G1.

Figure 2.19

In Figure 2.19, things context-switch again as G1 receives the message sent by G2. G2 switches from executing to waiting, and G1 switches from waiting to runnable and finally back to executing, which allows G1 to process and send a new message back to G2.

On the surface, things don’t appear to be that different: the same context switches and state changes occur whether you use threads or Goroutines. But there is a major difference between threads and Goroutines that might not be obvious at first glance.

In the case of Goroutines, the same OS thread and core are used for the entire exchange. From the OS’s perspective, the OS thread never moves into a waiting state, not once. As a result, the instructions lost to context switching in the thread version are not lost in the Goroutine version.

In essence, Go turns I/O-bound and blocking work into CPU-bound work at the OS level. Since all the context switching happens at the application level, we don’t lose the roughly 600 instructions per context switch (on average) that we lost with threads. The Go scheduler also helps improve cache-line efficiency and NUMA behavior. This is why we don’t need more OS threads than we have virtual cores. In Go, more work gets done over time because the Go scheduler attempts to accomplish more work on fewer threads, which helps reduce load at the OS and hardware levels.

Conclusion

The design of the Go scheduler is remarkable when you consider the complexity of how the operating system and hardware work. Turning I/O-bound and blocking work into CPU-bound work at the OS level is where the big win comes from in leveraging more CPU capacity over time. This is why you don’t need more OS threads than you have virtual cores. With one OS thread per virtual core, you can reasonably expect all of your work (CPU-bound and I/O-bound) to get done. This is possible for networking applications and other applications that don’t need system calls that block OS threads.

As a developer, you still need to understand what your application is doing in terms of the kinds of work it handles. You can’t create an unlimited number of Goroutines and expect amazing performance. Less is always more, but by understanding the Go scheduler you can make better decisions. In the next article, I’ll explore the idea of leveraging concurrency conservatively to gain better performance, while still balancing the complexity you may need to add to the code.

Original link: www.ardanlabs.com/blog/2018/0…