This article is part of the Juejin Creators Camp 3 "More Productive Writing" track; see the Juejin Creators Camp 3 announcement for details.
Thread pool defects
Under high concurrency, creating and destroying a thread for every task is wasteful. A thread pool pre-creates a fixed number of threads; new tasks are published to a task queue instead of spawning threads, and the pool's threads continuously pull tasks from the queue and execute them. This eliminates most of the overhead of thread creation and destruction.
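As a minimal sketch of this idea (written in Go for consistency with the rest of the article — here the workers are goroutines standing in for pool threads, and `runPool` is a name invented for illustration), a fixed set of workers drains a shared task queue:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runPool starts nWorkers workers that drain a queue of nTasks
// tasks, and returns how many tasks were executed.
func runPool(nWorkers, nTasks int) int {
	tasks := make(chan func(), nTasks) // the task queue
	var done int64
	var wg sync.WaitGroup

	// Pre-start a fixed set of workers; publishing a task
	// never creates a new worker.
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range tasks {
				task()
			}
		}()
	}

	// Publish tasks to the queue.
	for i := 0; i < nTasks; i++ {
		tasks <- func() { atomic.AddInt64(&done, 1) }
	}
	close(tasks)
	wg.Wait()
	return int(done)
}

func main() {
	fmt.Println(runPool(3, 8)) // all 8 tasks run on just 3 workers
}
```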
As shown above, each task in the task queue is called G, and a G usually represents a function. Worker threads in the pool continuously take tasks out of the queue and execute them; the scheduling of the worker threads themselves is handled by the operating system.
If a G being executed by a worker thread makes a blocking system call, the operating system puts that thread into the blocked state. That leaves fewer worker threads consuming the task queue; in other words, the pool's capacity to drain the queue is weakened.
If most tasks in the queue involve system calls, most worker threads end up blocked, and tasks accumulate in the queue.
One way to mitigate this is to re-examine the number of threads in the pool. Increasing it raises consumption capacity to a certain extent, but as the count grows, too many threads compete for the CPU, and throughput plateaus or even declines. As shown below:
Goroutine scheduler
When there are too many threads, the operating system switches between them frequently, and context switching itself becomes a performance bottleneck. Go instead schedules work itself, inside threads, where a context switch is far more lightweight — achieving high concurrency with only a few threads. The unit of work scheduled inside a thread is the goroutine.
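A quick illustration of how cheap goroutines are (`spawnMany` is a name made up for this sketch): launching a hundred thousand of them is routine, since each starts with only a small growable stack, whereas the same number of OS threads would be prohibitively expensive:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawnMany launches n goroutines, waits for all of them, and
// returns how many actually ran. Each goroutine starts with a
// stack of a few kilobytes, unlike an OS thread.
func spawnMany(n int) int {
	var wg sync.WaitGroup
	var ran int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&ran, 1)
		}()
	}
	wg.Wait()
	return int(ran)
}

func main() {
	fmt.Println(spawnMany(100000)) // prints 100000
}
```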
The main concepts of Goroutine are as follows:
- G (Goroutine): a Go coroutine. Each use of the `go` keyword creates one.
- M (Machine): a worker thread, called a Machine in Go.
- P (Processor): a processor (a concept defined by Go, not the physical CPU), which holds the resources needed to run Go code and has the ability to schedule goroutines.
- M must have P to execute the code in G
- P maintains a queue of runnable G and schedules them onto M for execution.
Their relationship is shown in the figure below:
In the figure, M is the thread assigned to the operating system for scheduling. M holds a P, and P schedules G into M for execution. P also maintains a queue containing G (shown in gray).
The number of P is determined at program startup and defaults to the number of CPU cores. Since an M must hold a P to run Go code, the number of M (threads) running simultaneously is generally equal to the number of CPU cores, maximizing CPU utilization without excessive thread-switching overhead.
You can use runtime.GOMAXPROCS() to change the number of P, which can improve performance in some IO-intensive scenarios.
Goroutine scheduling policy
Queue rotation
As can be seen in the figure above, each P maintains a queue of G. Setting aside system calls and IO operations for the moment, P periodically schedules a G onto M, lets it run for a short time slice, saves its context, puts it back at the tail of the queue, and takes the next G from the queue to schedule.
In addition to the queue maintained by each P, there is a global queue. Each P periodically checks whether the global queue has G waiting to run and schedules them onto M. G in the global queue mainly come from goroutines recovered after system calls. P checks the global queue periodically precisely to prevent those G from starving.
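The rotation can be observed with `runtime.Gosched()`, which voluntarily puts the current G back on the run queue, much like the time-slice expiry described above (this is only a sketch — in a real program the runtime preempts goroutines itself, without explicit yield calls; `interleave` is a name invented here):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// interleave runs two goroutines on a single P; each records its
// name and yields, so both make progress on one thread.
func interleave(rounds int) []string {
	runtime.GOMAXPROCS(1) // a single P, hence a single run queue

	var (
		mu    sync.Mutex
		order []string
		wg    sync.WaitGroup
	)
	for _, name := range []string{"A", "B"} {
		name := name
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < rounds; i++ {
				mu.Lock()
				order = append(order, name)
				mu.Unlock()
				runtime.Gosched() // go back to the tail of the queue
			}
		}()
	}
	wg.Wait()
	return order
}

func main() {
	fmt.Println(interleave(3)) // six entries in total
}
```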
System calls
By default, the number of P equals the number of CPU cores, and each M must hold a P to execute G, so the number of M is generally slightly larger than the number of P. Similar to a thread pool, Go maintains a pool of M: an M is fetched from the pool when needed, returned when no longer needed, and new ones are created when the pool runs short.
When a G running on an M makes a system call, it looks like the following:
As shown, when G0 is about to enter a system call, M0 releases its P, and some idle M1 acquires that P and continues executing the remaining G in P's queue. While M0 is blocked in the system call, M1 takes over its work, so as long as P is not idle the CPU stays fully utilized.
M1 may come from M's cache pool, or it may be newly created. When G0's system call returns, G0 is handled differently depending on whether M0 can obtain a P:
- If there is a free P, one is taken and G0 continues.
- If no P is free, G0 is placed in the global queue, waiting to be scheduled by another P. M0 then goes to sleep in the cache pool.
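A rough way to observe that a blocked G does not pin its only P (a sketch — note that on many platforms Go parks pipe reads on the netpoller rather than blocking an M in `read(2)`, but the visible effect is the same: other goroutines keep running; `progressWhileBlocked` is a name invented here):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// progressWhileBlocked blocks one goroutine in a pipe read and
// counts how many loop iterations another goroutine completes
// in the meantime, under a single P.
func progressWhileBlocked() int {
	runtime.GOMAXPROCS(1)
	r, w, _ := os.Pipe()

	started := make(chan struct{})
	go func() {
		close(started)
		buf := make([]byte, 1)
		r.Read(buf) // blocks until the write below
	}()
	<-started

	// This goroutine keeps making progress: the blocked G does
	// not hold on to the only P (either its M hands the P back,
	// or the G is parked on the netpoller, depending on the
	// platform).
	n := 0
	for i := 0; i < 1000; i++ {
		n++
		runtime.Gosched()
	}
	w.Write([]byte{1}) // let the reader finish
	return n
}

func main() {
	fmt.Println(progressWhileBlocked()) // prints 1000
}
```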
Work stealing
The queues of G maintained by different P can become unbalanced, as shown in the following figure:
On the left of the vertical line, the P on the right has finished executing all of its G and then checks the global queue, which is empty, while the other P still has three runnable G in its queue besides the one currently running. At this point the idle P steals some G from the other P — usually half of its queue at a time.
Impact of GOMAXPROCS Settings on performance
In general, GOMAXPROCS is set to the number of CPU cores so the Go program can take full advantage of the CPU. In IO-intensive applications, however, this setting may not perform best: in theory, when a goroutine enters a system call, a new M is enabled or created to keep the CPU busy, but there is a lag between the old M blocking and the new M running. So in IO-intensive applications, a larger GOMAXPROCS can perform better.