Servers usually need to support highly concurrent access, so the design of the server's network IO worker thread/process model plays a crucial role in meeting high-concurrency demands.

This article summarizes a variety of network IO thread/process models for different scenarios, giving the advantages and disadvantages of each model along with its performance optimization methods. It should be a useful reference for server, middleware, and database developers.

(By the way, the OPPO Internet document database team urgently needs mongodb kernel and storage engine developers. You are welcome to join OPPO and take part in developing a database with ten-million-level peak TPS and trillion-level data volume. Contact: yangyazhou#oppo.com)

1. Thread model 1: single-threaded network IO multiplexing model

Description:

  1. All network IO events (accept events, read events, and write events) are registered in the epoll event set
  2. The main loop calls epoll_wait to fetch, in a single call, the epoll event information collected in kernel mode, then iterates over the events and executes the corresponding callback for each one
  3. Event registration, epoll_wait event fetching, and event callback execution are all handled by a single thread (a minimal C sketch follows)
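
For concreteness, below is a minimal C sketch of this model built directly on the Linux epoll API; the handler names are placeholders, not the actual Redis callbacks:

```c
#include <sys/epoll.h>

#define MAX_EVENTS 1024

/* Placeholder handlers -- in Redis these would be the registered
 * file-event callbacks; here they only illustrate the loop structure. */
static void handleAccept(int fd) { /* accept() the connection, register its fd */ }
static void handleRead(int fd)   { /* read data, parse protocol, handle request */ }
static void handleWrite(int fd)  { /* write the pending reply back */ }

static void eventLoop(int listen_fd) {
    int epfd = epoll_create1(0);

    /* 1. Register the accept event of the listening socket in the epoll set */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    while (1) {
        /* 2. One epoll_wait call fetches the events collected in kernel mode */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

        /* 3. The same single thread executes every callback in turn */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd)                  handleAccept(fd);
            else if (events[i].events & EPOLLIN)  handleRead(fd);
            else if (events[i].events & EPOLLOUT) handleWrite(fd);
        }
    }
}
```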

1.1 Composition of a complete request

A complete request goes through the following main steps:

Step 1: Call epoll_wait to fetch the ready network IO events in one call

Step 2: Read the data and parse the protocol

Step 3: After parsing succeeds, run the business logic and reply to the client

1.2 Defects of the network threading model

  1. All work is done by one thread, including epoll event fetching and event processing (data reads and writes). If the event callback of any single request blocks, all other requests block with it. For example, if a Redis hash key contains millions of fields, the whole Redis instance blocks when that hash key expires.

  2. In the single-threaded model, the CPU becomes the bottleneck: if QPS is too high, the load on the single core reaches 100% and latency jitter becomes severe.

1.3 Typical Cases

  1. Redis cache

  2. Twitter's cache middleware twemproxy

1.4 Main loop workflow

    while (1) {
        // epoll_wait waits for network events; it returns when an event is ready or the timeout expires
        size_t numevents = epoll_wait();

        // Execute the corresponding callback for each event
        for (j = 0; j < numevents; j++) {
            if (read event) {
                // Read-event logic: read the data, parse it, handle the request
                readData();
                parseData();
                requestDeal();
            } else if (write event) {
                // Write-event logic: write data back to the client
                writeEventDeal();
            } else {
                // Exception event handling
                errorDeal();
            }
        }
    }

Note: in the multi-thread/process models that follow, the main loop of each thread/process is the same as this while() loop.

1.5 Redis source code analysis and a compact asynchronous network IO multiplexing demo

Earlier work required secondary development and optimization of the Redis kernel, so I annotated the entire Redis codebase and extracted the Redis network module into a standalone compact demo. The demo is helpful for understanding epoll network event handling and IO multiplexing, and the code is quite short. See the following projects:

Redis source code detailed annotation analysis

Redis network module compact demo

Twitter cache middleware TwemProxy source analysis implementation

2. Thread model 2: single listener + fixed worker threads

The thread model is shown in the following figure:

Description:

  1. The listener thread is responsible for accepting all client connections

  2. Each time the listener thread accepts a new client connection it obtains a new fd, which the dispatcher hands to a worker thread (for example by hashing the fd; see the sketch after this list)

  3. Once a worker thread receives the new connection fd, all subsequent network IO reads and writes on that connection are handled by that worker thread

  4. Suppose there are 32 connections: once they are all established, each thread on average handles the reads/writes, packet parsing, and business logic processing of 4 connections
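
A minimal sketch of the dispatch mechanism, assuming the listener passes each new fd to a worker over a per-worker pipe chosen by hashing the fd (memcache notifies its worker threads of new connections in a similar pipe-based way):

```c
#include <pthread.h>
#include <unistd.h>
#include <sys/socket.h>

#define NUM_WORKERS 8

/* worker_pipes[i][0] = worker i's read end, worker_pipes[i][1] = listener's write end */
static int worker_pipes[NUM_WORKERS][2];

static void *worker_main(void *arg) {
    int idx = (int)(long)arg, conn_fd;
    /* Real code would add the pipe's read end to this worker's epoll set
     * alongside its connection fds; a blocking read keeps the sketch short. */
    while (read(worker_pipes[idx][0], &conn_fd, sizeof(conn_fd)) == sizeof(conn_fd)) {
        /* register conn_fd in this worker's epoll set; all subsequent
         * reads/writes/parsing for that connection happen in this thread */
    }
    return NULL;
}

static void start_workers(void) {
    for (long i = 0; i < NUM_WORKERS; i++) {
        pipe(worker_pipes[i]);
        pthread_t tid;
        pthread_create(&tid, NULL, worker_main, (void *)i);
    }
}

static void listener_loop(int listen_fd) {
    while (1) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0) continue;
        /* dispatcher: hash the new fd to pick a worker */
        write(worker_pipes[conn_fd % NUM_WORKERS][1], &conn_fd, sizeof(conn_fd));
    }
}
```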

2.1 Defects of the network threading model

  1. There is only one listener thread handling accept, which can easily become a bottleneck under instantaneous high concurrency

  2. Each worker thread uses IO multiplexing to handle the data reads/writes, packet parsing, and subsequent business logic of multiple connection fds, which can lead to serious queuing: if the internal processing of one connection's received and parsed packet takes too long, the queued requests of the other connections on that thread are blocked

2.2 Typical Cases

Memcache, which suits cache scenarios and proxy middleware scenarios where internal processing is fast. For a Chinese-annotated analysis of the memcache source code, see: memcache source code implementation analysis

3. Thread model 3: fixed worker threads (reuseport)

The prototype diagram of this model is as follows:

Description:

  1. Linux kernel 3.9 introduced the reuseport (SO_REUSEPORT) feature: the kernel protocol stack automatically distributes every new connection to the user-mode worker threads in a balanced way (see the sketch after this list).

  2. This model solves the single-listener bottleneck of model 2: multiple processes/threads act as listeners, and all of them can accept new client connections at the same time.
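
A minimal sketch of how each worker would create its own reuseport listening socket; error handling is omitted and the backlog value is arbitrary:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Each worker process/thread calls this independently for the SAME port;
 * the kernel (Linux >= 3.9) then load-balances new connections among all
 * the resulting listening sockets. */
int create_reuseport_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 511);

    return fd;   /* the worker then accept()s and runs its own epoll loop */
}
```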

3.1 Defects of the network process/thread model

With reuseport support, the kernel distributes new connections across multiple user-mode worker processes/threads via load balancing, and each process/thread uses IO multiplexing to handle the data reads/writes, packet parsing, and business logic of its client connection fds. Each worker process/thread serves multiple connection requests at once, so if the internal processing after one connection's packet is received and parsed takes too long, the requests of the other connections queue up and block.

This model solves the single-listener bottleneck, but it does not solve the queuing problem inside each worker process/thread.

Nginx, however, as a layer-7 forwarding proxy whose processing is purely in memory and therefore very short, is well suited to this model.

3.2 Typical Cases

  1. Nginx (nginx uses processes, but the model is the same), suitable for scenarios where the internal business logic is simple, such as nginx proxying

  2. Applying nginx's multi-process high-concurrency, low-latency, high-reliability mechanism to twemproxy cache proxying (Redis, memcache)

In addition, refer to nginx source code Chinese annotation analysis

4. Thread model 4: one connection, one thread

The thread model is shown as follows:

Description:

  1. The listener thread is responsible for accepting all client connections

  2. Every time the listener thread accepts a new client connection, it creates a new thread that is solely responsible for the data reads/writes, packet parsing, and business logic on that connection (see the sketch below).
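
A minimal pthread sketch of this model, with a placeholder conn_handler standing in for the real per-connection logic:

```c
#include <pthread.h>
#include <unistd.h>
#include <sys/socket.h>

/* Placeholder: all reads/writes, protocol parsing, and business logic for
 * one connection run here, typically using plain blocking IO. */
static void *conn_handler(void *arg) {
    int conn_fd = (int)(long)arg;
    /* ... serve requests until the client disconnects ... */
    close(conn_fd);
    return NULL;   /* the thread dies with the connection */
}

static void listener_loop(int listen_fd) {
    while (1) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0) continue;

        /* one new thread per accepted connection */
        pthread_t tid;
        pthread_create(&tid, NULL, conn_handler, (void *)(long)conn_fd);
        pthread_detach(tid);   /* reclaim resources when the thread exits */
    }
}
```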

4.1 Defects of the network threading model

  1. One connection requires one thread: 100,000 connections would need 100,000 threads. With that many threads, system load and memory consumption become excessive

  2. A thread must also be destroyed when its connection closes; frequent thread creation and destruction put further load on the system

4.2 Typical Cases

  1. MySQL's default mode and mongodb's synchronous thread model, suitable for scenarios where request processing takes a long time, such as database services

  2. The Apache web server. This model caps Apache's performance and makes nginx's advantage all the more obvious

5. Thread model 5: single listener + dynamic worker threads (single queue)

The thread model is shown in the following figure:

Description:

  1. After the listener thread accepts a new connection fd, it hands the fd to the thread pool. All subsequent reads and writes, packet parsing, and business processing on that connection are handled by threads in the pool.

  2. In this model, a request is split into several tasks (network data reads/writes, packet parsing, business logic after parsing) that are pushed onto a global queue; the pool threads pull tasks from the queue and execute them (a minimal sketch follows this list).

  3. Because one request is split into multiple tasks, a single request may be processed by several different threads.

  4. When there are too many tasks and the system is under pressure, the number of threads in the pool grows dynamically

  5. When tasks decrease and the pressure drops, the number of threads in the pool shrinks dynamically
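
The following is a minimal pthread sketch of the single global task queue; the task structure and names are illustrative, not mongodb's ASIO-based implementation. Note that one mutex guards the whole queue, which is exactly the contention problem described in 5.4 below:

```c
#include <pthread.h>
#include <stdlib.h>

/* One task = one step of a request (network read/write, packet parsing,
 * or business logic), matching point 2 of the description above. */
typedef struct task {
    void (*fn)(void *);
    void *arg;
    struct task *next;
} task_t;

static task_t *head, *tail;                 /* the single global queue */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

void enqueue_task(void (*fn)(void *), void *arg) {
    task_t *t = malloc(sizeof(*t));
    t->fn = fn; t->arg = arg; t->next = NULL;

    pthread_mutex_lock(&qlock);             /* every producer takes this lock */
    if (tail) tail->next = t; else head = t;
    tail = t;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

void *pool_worker(void *unused) {
    (void)unused;
    while (1) {
        pthread_mutex_lock(&qlock);         /* ...and every worker does too */
        while (!head)
            pthread_cond_wait(&qcond, &qlock);  /* this wait is the T3 idle time of 5.1 below */
        task_t *t = head;
        head = t->next;
        if (!head) tail = NULL;
        pthread_mutex_unlock(&qlock);

        t->fn(t->arg);                      /* T2: the actual processing */
        free(t);
    }
    return NULL;
}
```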

5.1 Worker thread runtime statistics

T1: time spent in the underlying ASIO library receiving a complete mongodb packet

T2: time spent on all processing after the packet is received (packet parsing, authentication, engine-layer processing, and sending the response to the client)

T3: time the thread spends waiting for data (for example, when there has been no traffic for a long time and the thread is blocked waiting to read)

5.2 How does a worker thread determine that it is idle?

Total thread running time = T1 + T2 + T3, where T3 is useless wait time. If T3 takes up a large share of the total, the thread is idle. After each loop iteration, the worker thread computes its effective-time ratio; if the ratio falls below a configured threshold, the thread exits and destroys itself, as sketched below.
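
Under the T1/T2/T3 accounting above, the idle check can be sketched as follows; the 0.2 threshold is an assumed example value, not mongodb's actual constant:

```c
/* Idle check run after each loop iteration: the effective-work ratio is
 * (T1 + T2) / (T1 + T2 + T3). Returns 1 if the thread should exit. */
int worker_should_exit(double t1, double t2, double t3) {
    double total = t1 + t2 + t3;
    if (total <= 0.0)
        return 0;                      /* nothing measured yet, keep the thread */
    return (t1 + t2) / total < 0.2;    /* mostly useless waiting -> destroy */
}
```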

5.3 How does the pool determine that its worker threads are too busy?

A dedicated control thread assesses the pressure on the worker threads in the pool, and on that basis decides whether to create additional worker threads to improve performance.

After each fixed interval, the control thread checks the pressure state of the threads in the pool. The implementation simply keeps real-time counters of the pool's running state: the total thread count _threadsRunning, and the number of threads currently running a task _threadsInUse. If _threadsInUse == _threadsRunning, all worker threads are currently busy with tasks, the thread pool is saturated, and the control thread begins adding threads to the pool.
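
A sketch of that control-thread loop, using the two counters named above; spawn_worker and the 100 ms check period are illustrative assumptions:

```c
#include <unistd.h>

static volatile int _threadsRunning;   /* total worker threads in the pool */
static volatile int _threadsInUse;     /* workers currently executing a task */

/* Hypothetical helper: pthread_create()s one more pool worker. */
static void spawn_worker(void) { /* ... */ }

static void control_thread_loop(void) {
    for (;;) {
        if (_threadsInUse == _threadsRunning) {
            /* every worker is busy: the pool is saturated, so grow it */
            spawn_worker();
        }
        usleep(100 * 1000);            /* check period, assumed to be 100 ms */
    }
}
```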

We previously published an article: MongoDB network transfer processing source code implementation and performance tuning – experience the ultimate design of kernel performance

5.4 Defects of the network threading model

  1. All pool threads fetch tasks from a single global queue, so lock contention on that queue becomes the system bottleneck

5.5 Typical Cases

The mongodb dynamic adaptive thread model, suitable for scenarios where request processing takes a long time, such as database services.

For a detailed source-level analysis of this model's optimizations, see: Mongodb network transmission process source code implementation and performance tuning – experience the ultimate design of kernel performance

6. Thread model 6: single listener + dynamic worker threads (multi-queue) – mongodb network thread model optimization practice

The thread model is shown as follows:

Description:

The single global queue is split into multiple queues. When a task is enqueued, it is hashed to one of the queues; when a worker fetches a task, it hashes to its corresponding queue. This spreads the lock contention across the queues (see the sketch below).
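
A minimal sketch of the queue-splitting idea (not mongodb's actual implementation; NUM_QUEUES and the modulo hash are assumptions), reusing the task list from the single-queue sketch in section 5:

```c
#include <pthread.h>

#define NUM_QUEUES 8

/* Each queue carries its own lock, condvar, and task list, so producers
 * and consumers that hash to different queues never contend. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    /* task list head/tail as in the single-queue sketch */
} task_queue_t;

static task_queue_t queues[NUM_QUEUES];

/* Enqueue side: a hash (here, the connection fd) picks the target queue. */
static inline task_queue_t *queue_for_conn(int conn_fd) {
    return &queues[conn_fd % NUM_QUEUES];
}

/* Dequeue side: each worker hashes to its own queue, so at most
 * 1/NUM_QUEUES of the threads contend on any single lock. */
static inline task_queue_t *queue_for_worker(int worker_idx) {
    return &queues[worker_idx % NUM_QUEUES];
}
```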

6.1 Typical Cases

OPPO's self-developed mongodb kernel adopts this optimized multi-queue adaptive thread model, which yields a solid performance improvement. It suits scenarios where request processing takes a long time, such as database services.

For the detailed source-level analysis of this model's optimizations, again see: Mongodb network transmission process source code implementation and performance tuning – experience the ultimate design of kernel performance

6.2 Question: why don't databases such as mysql and mongodb take advantage of the kernel's reuseport feature, i.e., multiple threads accepting requests at the same time?

Answer: virtually any service can take advantage of this feature, including database services (mongodb, mysql, etc.). But database access latency is generally at the millisecond level, dominated by internal processing. Enabling reuseport would improve latency by only a few tens of microseconds; against millisecond-level internal processing (for example, 50 µs saved out of 1 ms is a mere 5%), that gain is essentially negligible, which is why most database services do not support this feature.

Cache and proxy middleware, on the other hand, have very short internal processing times, themselves at the microsecond level, so they do need to take full advantage of this feature.