Liu Daiming, a literary young man. (What, still a "youth"? Yes, you read that right: by the World Health Organization's (2020) definition of youth, 18 to 65 years old [from Baidu Encyclopedia].) He currently works in Qihoo 360's search technology department as a web server technical expert. This article is the author's original work; I wish you a happy read.
In the field of web programming, both OpenResty and Go offer excellent processing power, and both are commonly preferred solutions for highly concurrent web services. I use both of them all the time, and both have coroutines, so I will summarize them here as a quick note.
Openresty and its workflow
Based on Openresty version 1.18
OpenResty integrates Lua into Nginx, the well-known high-performance HTTP server.
Nginx is multi-process and single-threaded: one master process plus multiple worker processes, and it is the worker processes that handle requests.
Startup process
When the master process is created, the Lua VM is initialized by the ngx_http_lua_init_vm function. When a worker process is forked, the Lua VM is inherited into it, so each worker process has its own Lua VM.
When a worker starts, it processes requests in a loop. When a new request arrives, only the worker that has acquired ngx_accept_mutex handles it (it registers the listening fd into its own epoll), which avoids the thundering-herd problem.
The reason for Nginx's high performance is its asynchronous, non-blocking event handling, built on system calls such as select/poll/epoll/kqueue.
Coroutine scheduling
If you have this configuration:
The configuration is:

location ~ ^/api {
    content_by_lua_file test.lua;
}
For each request, for example request = /api?age=20, OpenResty creates a coroutine to handle it.
This created coroutine is a system coroutine, the request's main coroutine, and the user has no control over it. Coroutines created by the user with ngx.thread.spawn go through ngx_http_lua_coroutine_create_helper; a user-created coroutine is a child coroutine of the main coroutine. The ngx_http_lua_co_ctx_s struct holds each coroutine's information.
Coroutines are run and scheduled through the ngx_http_lua_run_thread function. The coroutine currently executing is ngx_http_lua_ctx_t->cur_co_ctx.
In each worker process, every request creates a coroutine, and coroutines are isolated from one another; users can also create their own coroutines. All of them are ultimately executed by the Lua VM of the current worker process, and only one coroutine can be running at any point in time. So how are these coroutines scheduled?
In fact, these coroutines are scheduled cooperatively, driven by events (using Nginx's event mechanism):
1. For a system-created coroutine, when its event has not fired, i.e., the corresponding I/O is not ready (in edge-triggered mode, epoll_wait returns active fds, which are read or written until EAGAIN), the currently executing coroutine gives up the CPU and lets another coroutine run;
2. For a user-created coroutine, in addition to case 1 above, if the user code explicitly yields (for example via coroutine.yield, or while waiting in ngx.thread.wait), it also gives up the CPU.
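As a sketch of user coroutines in OpenResty, here is a hypothetical body for test.lua; ngx.sleep stands in for any non-blocking I/O call that yields to the event loop:

```lua
-- Hypothetical test.lua: two "light threads" run concurrently inside
-- one request's main coroutine. ngx.sleep yields the CPU; Nginx's
-- event loop resumes the coroutine when the timer fires.
local function fetch(delay)
    ngx.sleep(delay)   -- yields; another coroutine can run meanwhile
    return delay
end

-- Spawned coroutines are children of the request's main coroutine.
local t1 = ngx.thread.spawn(fetch, 0.1)
local t2 = ngx.thread.spawn(fetch, 0.2)

-- ngx.thread.wait also yields until the child finishes.
local ok1, r1 = ngx.thread.wait(t1)
local ok2, r2 = ngx.thread.wait(t2)
ngx.say("done: ", r1, " ", r2)
```

Because both children sleep concurrently, the whole handler takes about 0.2 s rather than 0.3 s.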
Go and its workflow
Based on go version 1.15
Go is single-process, multi-threaded, and multi-coroutine.
Startup process
For this simple Go program:
package main

import "fmt"

func main() {
    fmt.Println("Hello world!")
}
We can trace the startup process with GDB.
runtime.args -> runtime.osinit -> runtime.schedinit -> runtime.newproc -> runtime.mstart
Among them:
runtime.args: initializes argc and argv, and walks the auxiliary vector (auxv) to initialize some system-level variables, such as the memory page size (physPageSize).
runtime.osinit: sets the number of CPU cores (ncpu) and the huge-page size (physHugePageSize).
runtime.schedinit: initializes the stack, the memory allocator, and the random seed; initializes m0 and puts it into allm; initializes the GC.
runtime.newproc: the function the Go runtime actually calls to create a goroutine when we write go func(). It first tries to take a free g from the current P's local free list; failing that, from the global free list (schedt.gFree); failing that, it allocates a new g on the heap with an initial stack of 2 KB. During startup it creates the g that will execute runtime.main.
runtime.mstart: starts m and schedules g (the scheduling loop).
The above explains each step of the sequence; the details are considerably more involved. It covers the initialization of m0 and all Ps. As for g0, it is actually initialized earlier, on the stack, in the entry assembly, with a startup stack of roughly 64 KB (65432 bytes). The mutual references between m0 and g0 are also established at that point, completing the relationships among m0, g0, and allp[0].
Scheduling model
Overview (a diagram of the G-M-P model appeared here in the original). In it:
G: the g struct, representing a goroutine; each g represents a task to be executed.
M: the m struct, representing a worker thread (each worker thread has a corresponding m).
P: the p struct, representing a processor (the scheduling context an m needs in order to run g's).
After the program starts, it creates as many Ps as there are CPU cores (this can be changed, but usually is not). Each P holds a ring buffer of runnable g's: its local run queue.
M-P-G scheduling happens in user mode. The relationship between M and G is many-to-many (M:N): M threads schedule N goroutines, while the kernel schedules the M threads.
Goroutine scheduling
Scheduling is a loop, with preemption (signal-based preemptive scheduling was introduced in Go 1.14).
schedule() -> execute() -> gogo() -> g.sched.pc() -> goexit() -> goexit1() -> goexit0() -> schedule()
Among them:
1. The schedule() function's job is to find a runnable g:
1. Every 61st scheduling round, take a g from the global run queue (so it is not starved);
2. Otherwise, take a g from the local run queue;
3. If neither of the above finds one, keep searching (blocking) until a runnable g is found.
This phase looks for a runnable g in, in order: the local run queue, the global run queue, netpoll, and by stealing from other Ps' run queues.
2. The execute() function sets the current m's curg to g, binds m and g to each other, and switches g's state from _Grunnable to _Grunning.
3. The gogo() function, written in assembly:
Switches to the target g (swaps the stack from g0 to g and restores the register values saved in the g.sched struct into the CPU registers);
Lets the CPU actually execute g (the entry is g.sched.pc, i.e., the saved PC register, the address of the next instruction to execute).
4. g.sched.pc(): for our program this is the main goroutine, whose entry function is runtime.main:
1. Starts a thread to run the sysmon function, which is responsible for netpoll monitoring for the whole program, GC, and preemptive scheduling (releasing the P of a g stuck in a blocking system call, and preempting g's that have run too long (>10 ms)). This thread runs independently (without a P, in a loop).
2. Runtime package initialization
3. Start the gc
Imported packages are also initialized at this stage.
5. Execute main.main (our main function)
6. After main.main returns, runtime.main makes an exit system call to terminate the process.
When the main goroutine finishes, the whole program finishes. That is why, if we start a goroutine inside main and do not synchronize with it, e.g., by making a chan and receiving data from the coroutine, we may never see the goroutine's output.
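A minimal sketch of the usual fix, receiving from a chan so main waits for the goroutine (the channel name and message are arbitrary):

```go
package main

import "fmt"

func main() {
	ch := make(chan string)

	go func() {
		ch <- "result from goroutine"
	}()

	// Without this receive, main could return first and the process would
	// exit before the goroutine had a chance to run.
	fmt.Println(<-ch) // result from goroutine
}
```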
For a non-main goroutine, after its fn (i.e., g.sched.pc) returns:
goexit: an assembly stub that calls the runtime.goexit1 function;
goexit1: uses mcall to switch to the g0 stack and runs the runtime.goexit0 function;
goexit0: puts g into the gFree list so it can be reused in the next round of scheduling.
Network IO
Go likewise builds on epoll/kqueue and similar system calls, with the lowest layer implemented in assembly; for example, the epoll-related functions:
- epollcreate
- epollctl
- epollwait
Together these make up the netpoller, which combines goroutines with I/O multiplexing. A list of runnable goroutines whose I/O is ready can be obtained through netpoll().
Some Go trivia:
1. What does m0 do? What is the difference between m0 and other m's?
1. As seen above, m0 is the first thread created.
2. m0 works the same as any other m: it is a system thread that executes tasks in the time slices the CPU allocates to it.
3. The maximum number of m's is 10000, and a g can only be executed after being bound to an m.
2. What exactly does g0 do, and how does it differ from other g's? Is g0 itself ever scheduled?
1. As seen above, g0 is the first g created, but it is not an ordinary g and is never scheduled.
2. g0's job is to provide a stack for runtime code to execute on. Typical entry points are mcall() and systemstack(), both of which switch to the g0 stack to run a function. The latter can be called from either a g or g0: if already on g0's stack it runs the function directly; otherwise it switches to the g0 stack, runs the function, then switches back and continues the interrupted code.
3. Every m has its own g0.
4. g0 differs from other g's. First, the initial stack size differs: an ordinary g starts with a 2 KB stack, while g0's initial stack is one of two sizes, roughly 64 KB or 8 KB; when a new m is created, its g0 gets an 8 KB stack in the non-cgo case.
5. The stack location also differs: an ordinary g's stack is allocated on the heap, while g0's is allocated on the system stack.
4. For a Go program, how many threads are created after startup?
Each platform differs. On Windows, some threads are created as early as the osinit phase; on Linux, only one thread exists before runtime.main runs.
1. Threads are created on demand as goroutines are created.
2. Threads are created during the runtime phase, for example the sysmon monitoring thread and the template thread started for cgo (startTemplateThread).
3. When cgo calls are executed: several concurrent cgo calls each need their own thread.
4. During goroutine scheduling, when a P has runnable work but no idle m is available to run it. A typical scenario in web development: a goroutine blocks in a system call while newly arrived goroutines still need to be processed.
At most 10000 threads can be created (the runtime's default limit).
5. Do the numbers of P, m, and g change while the program runs?
The number of Ps stays at the number of CPU cores (unless adjusted manually), and they are stored in allp.
G's and m's can grow but never shrink. There is no hard limit on g's; it depends only on memory. Free g's are put into gFree (each P's gFree is a local free list; schedt's gFree is the global one). The maximum number of m's is 10000, and they are stored in allm.
6. Do you need coroutine pools for high-performance web development?
Facing high concurrency, if the number of g's is not limited and every request gets its own g, then once a P's local run queue is full (256 per P) the overflow goes into the global queue; a huge number of g's increases GC scanning pressure, occupies a lot of memory, and causes heavy locking on the global queue.
So limiting the number of g's is necessary. We have no direct control over goroutines themselves; the Go scheduler already reuses the goroutines in gFree on its own.
So a more accurate name for a "coroutine pool" is a consumption pool (requests are the production side, and our handling of them is the consumption side). What we should actually do is:
1. Minimize heap allocations and pool/reuse objects well;
2. Avoid blocking system calls;
3. Optimize downstream calls and algorithm response times;
4. Apply rate limiting and cap the number of g's.
If after all of this the traffic is still legitimate and the pressure is still high, add machines.
Comparison
1. After OpenResty starts, each CPU core is bound to one worker process, while in Go each worker thread corresponds to a CPU core, which amounts to much the same arrangement.
2. It can be seen that the scheduling model of GO is much more complex.
OpenResty uses cooperative scheduling based on Nginx events (so "hot loops", i.e., long CPU-intensive computations, must be avoided);
Go implements its own efficient P-M-G scheduler, with signal-based preemptive scheduling since 1.14.
3. For the networking layer, both use I/O multiplexing underneath to improve web performance.
4. When acting as a high-performance web server, blocking system calls should be avoided. If a long blocking system call is involved, then for OpenResty the current coroutine holds the CPU the whole time, blocking the worker process outright and severely degrading throughput;
For Go, when the current goroutine gets stuck in a blocking system call, its P is released, but the worker thread is stuck with it; to serve the remaining goroutines, if no worker thread is idle, new threads keep being created, and a large number of threads greatly increases context switching, degrading performance.