Shared by: Tao Hui    Editor: Bai Fan

Introduction: Tao Hui has worked at Huawei and Tencent on low-level data infrastructure and is the author of "Understanding Nginx in Depth: Module Development and Architecture Analysis". He is currently a co-founder and technical director at Hangzhou Wisdom Chain, focused on using Internet technology to drive the transformation and upgrading of the construction industry.

Today's post focuses on Nginx performance, and hopefully gives you a systematic way of thinking that helps you work with Nginx more effectively.

1. Optimization methodology

Today, I will focus on two questions:

  • First, how to use memory efficiently so that we can sustain a large number of concurrent connections

  • Second, how to maintain high throughput while sustaining that high concurrency

At the implementation level, optimization happens mainly in three areas: the application, the framework, and the kernel.

Hardware constraints are familiar to most of you: upgrading the network card to 10 Gbps or 40 Gbps helps the most; SSD versus mechanical disk is chosen based on the cost budget and the application scenario; and the CPU is another metric we look at.

The main point of this slide is that the event-driven model shifts the switching cost from the operating system into the process itself, so switching from one connection to another is very cheap and performs very well; the coroutines in OpenResty work the same way.

Using resources efficiently and reducing memory usage helps us increase concurrency, while reducing RTT and increasing capacity helps throughput.

Reuseport is about making better use of multiple CPU cores. As for FastSocket: I worked on networking when I was at Aliyun, and it does provide a large performance improvement, but it also has an obvious problem, namely that it bypasses the kernel itself.

2. The “Lifetime” of requests

I’m going to talk a little bit about how to look at “requests” first, and after that it will be clear how to optimize.

Before that, I must mention Nginx's module structure. Any framework like Nginx shares one trait: if you want to build an entire ecosystem, you must allow third-party code to plug in, and requests are then processed by these modules one by one in sequence.

The same goes for Nginx: these modules are strung together in a sequence, and each request is processed by them one by one. Among the core modules there are two such families, stream and HTTP.

2.1 Incoming Request

What happens when a connection is established and a request comes in? There is a queue in the operating system kernel waiting for our process to make a system call. Since there are many worker processes, which one gets to make that call is decided by a load-balancing strategy; there is a slide on that later.

When accept establishes a new connection, a connection memory pool is allocated for it. This pool is different from the other memory pools: it is allocated when the connection is created.

It is released only when the connection is closed. Next the HTTP module takes over and adds a 60-second timer: if nothing is received from the client within 60 seconds of establishing the connection, the connection is closed automatically. Then memory is allocated for the read buffer. What does that mean?

The operating system kernel has received the request, but my application cannot handle it yet because it has not been read into user-space memory, so I have to allocate memory from the connection memory pool. How much should I allocate? The default is 1K.
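The numbers above correspond to a few concrete directives. A minimal sketch, with the defaults described in the talk spelled out explicitly:

```nginx
http {
    connection_pool_size      512;  # per-connection memory pool
    client_header_buffer_size 1k;   # initial read buffer for the request line and headers
    client_header_timeout     60s;  # close the connection if nothing arrives in time
}
```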

2.2 Receiving a Request

When the request starts to arrive, that is, when the URI and headers are being received, the request memory pool is allocated. Its default size is 4K, eight times larger than the connection pool.

To see why this consumes resources: the request line is parsed with a state machine, which means it is processed as a sequence, byte by byte, branching on each character; when a line break is found, the request line is finished.

But if the request line is really long, 1K will not be enough and a bigger buffer has to be allocated, which is where the 4 x 8K comes in. The reason it is 4 x 8K rather than a single 32K is that when 1K is not enough, only one 8K buffer is allocated first; if the request line still has not been fully parsed after 8K, a second 8K is allocated.

Nothing received earlier is released; we simply keep pointers into the buffer marking where the URL or the protocol starts and how long it is.

The next step is to parse the headers. The process is exactly the same, and the buffers can run short in the same way. After that, the request is handed to the chain of modules strung together to process it.
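The parsing buffers and pool described above map onto these directives; a sketch that simply spells out the defaults mentioned in the talk:

```nginx
http {
    request_pool_size           4k;    # per-request memory pool
    large_client_header_buffers 4 8k;  # used when 1K is not enough for the line or a header
}
```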

Look at the blue area in the previous two slides: what does it mean that the request then goes through 11 phases? The yellow part, the green part, and the part on the right all belong to these 11 phases.

You don't have to memorize these 11 phases; it is quite simple as long as you grasp three keywords.

Right after the headers have been read, you have to process them, so the first phase is post-read. Next comes rewrite, and then preaccess and access.

First look at the left-hand side: when we download and build the Nginx source code, all the modules, including third-party ones, are arranged in a fixed order. A request is not simply handed to that flat list: it is first divided into the 11 phases, and within each phase the modules are again ordered one after another.

Let me break that down a little: a single phase, such as the one containing the referer module, has many modules in it, and these too are sorted.

This diagram adds two key points over the previous one. First, when a module is reached, it can decide either to pass the request on to the next module in the sequence, or to skip ahead to the next phase, but it cannot skip multiple phases.

The second is that after you generate the response for the client, you post-process it with filters, which are also ordered: you generate the thumbnail first and then compress it, so the order is strict.
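As a rough illustration of how ordinary directives fall into these phases and into the filter chain (a sketch; the limit_req zone name per_ip is illustrative, and the ordering is decided by Nginx, not by where the lines appear in the file):

```nginx
http {
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

    server {
        location /img/ {
            rewrite ^/img/(.*)$ /images/$1 break;  # rewrite phase
            limit_req zone=per_ip burst=10;        # preaccess phase
            allow 10.0.0.0/8;                      # access phase
            deny  all;                             # access phase
            root  /var/www;                        # content phase (static module)
            gzip  on;                              # filter, applied after the content is produced
        }
    }
}
```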

2.3 Reverse Proxying a Request

Reverse proxying is our key application scenario. Nginx considers a situation you may have run into: the client sits on the public Internet, where the network environment is poor and speeds are slow. If you simply relayed bytes from the client to the upstream server as they arrive, the pressure on the upstream would be very high, because upstream servers are built for efficiency and will not start processing another request until the current one is finished.

Nginx takes this scenario into account: it first receives the entire request and only then establishes a connection to the upstream server. So the first relevant default is proxy_request_buffering on, which buffers the request body, spilling it to a file when needed; the buffer size defaults to 8K. When establishing the upstream connection it sets a connect timeout of 60 seconds and adds a timeout timer, also 60 seconds.

Then the request is sent upstream and the response body is read back; the buffer used for reading from the upstream is 8K. By default, proxy buffering is on, which means the upstream response is buffered to the end before being sent downstream, and that is where the 8×8K comes from. If it is turned off, Nginx reads a little from upstream and sends a little downstream as it goes. Once you understand this, you can see it is a fairly large memory cost.
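Put together, the behavior described above corresponds to these directives; a sketch that spells out the defaults (the upstream address is illustrative):

```nginx
location / {
    proxy_pass               http://127.0.0.1:8080;  # illustrative upstream

    proxy_request_buffering  on;    # read the whole request body before connecting upstream
    client_body_buffer_size  8k;    # larger bodies spill to a temporary file
    proxy_connect_timeout    60s;

    proxy_buffering          on;    # buffer the upstream response before sending it downstream
    proxy_buffer_size        8k;    # buffer for the upstream response header
    proxy_buffers            8 8k;  # the 8 x 8K mentioned above
}
```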

2.4 Returning the Response

Returning the response actually involves quite a lot, so let me simplify it for you. Even in the official distribution the filter chain is worked through in order from the bottom up, and if a large number of third-party modules are added, the count becomes very high.

The first key points are the header filter and write filter near the top, and the copy filter below. The filters can be further divided into two categories: those that process the response and those that do not. This is also where OpenResty hooks in: its first part is the directives where code is executed, and its second part is the SDK.

3. Application layer optimization

3.1 Protocols

To optimize the application layer, we first look at whether anything can be improved at the protocol level. For example, the encoding and the headers are sent to Nginx in full with every request, which wastes a lot of traffic. We can move to HTTP/2, and there are many protocols like it that will significantly improve performance.

Of course, moving to HTTP/2 brings other issues; for example, HTTP/2 in practice has to run over TLS. TLS is a big topic in its own right, and it is a trade-off between security and performance.
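Enabling HTTP/2 in Nginx is essentially one flag on the listening socket, plus the TLS it rides on (the certificate paths below are placeholders):

```nginx
server {
    listen 443 ssl http2;
    ssl_certificate     /path/to/cert.pem;  # placeholder
    ssl_certificate_key /path/to/key.pem;   # placeholder
}
```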

3.2 Compression

We want as much of the traffic as possible to carry actual business data. With compression there is an important distinction between dynamic and static content: with zero copy (sendfile), a file can be sent from disk directly by the kernel, but once you compress, Nginx must first read the file into user space and then hand it back to the kernel for further processing.
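One common way to keep both benefits, assuming Nginx is built with the gzip_static module: pre-compress static files so the zero-copy path still applies, and compress dynamic responses on the fly:

```nginx
http {
    sendfile        on;     # zero copy for uncompressed static files
    gzip            on;     # on-the-fly compression for dynamic responses
    gzip_comp_level 5;
    gzip_types      text/css application/json application/javascript;
    gzip_static     on;     # serve pre-built .gz files, keeping the sendfile path
}
```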

The same applies to keepalive (long-lived) connections, and there is a lot going on with them. Because a TCP connection goes through slow start, its window is small at first and it may only send around 1K, while later it can send tens of K. Every newly created connection has to start over from that small window, which is very slow.

Of course, there is a caveat here: Nginx keeps connections alive by default, but when they sit idle the benefit of long connections is not realized while they still consume resources.
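A typical keepalive setup covers both sides, the client-facing listener and the upstream connections (the upstream name and the numbers are illustrative):

```nginx
http {
    keepalive_timeout  65s;     # keep client connections open between requests
    keepalive_requests 1000;    # how many requests one connection may carry

    upstream backend {
        server 127.0.0.1:8080;
        keepalive 32;           # idle connections cached to the upstream, per worker
    }

    server {
        location / {
            proxy_pass         http://backend;
            proxy_http_version 1.1;           # required for upstream keepalive
            proxy_set_header   Connection "";
        }
    }
}
```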

3.3 Improving Memory Usage

On the downstream side Nginx has the client header buffer, 1K by default; on the upstream side there are the buffers for the HTTP response header and body.

The CPU fetches data from memory through its cache in batches, and each batch, a cache line, is currently 64 bytes. That is why the related hash bucket sizes go up in multiples of it: if 32 bytes is not enough you get 64, and if that is still not enough, 128. Because the hash is laid out sequentially once, anything that has already been matched does not need to be matched again. Red-black trees are also used a lot here, but they belong to specific modules.
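The cache-line reasoning above is exactly what the hash bucket size directives are aligned to; a small sketch (values are illustrative):

```nginx
http {
    server_names_hash_bucket_size 64;   # a multiple of the 64-byte cache line
    server_names_hash_max_size    512;
    variables_hash_bucket_size    64;
}
```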

3.4 Rate Limiting

Most of the time when we do traffic control, what do we mainly limit? Mainly the speed at which Nginx sends responses to clients. This is very useful because it can be tied in with Nginx variables. It does not limit the speed of requests going upstream, but the speed of the responses.
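A sketch of response rate limiting (the path and the numbers are illustrative):

```nginx
location /download/ {
    limit_rate_after 1m;    # let the first megabyte go at full speed
    limit_rate       256k;  # then cap the response at 256 KB/s per connection
}
```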

3.5 Load Balancing between Workers

Back when I was using version 0.6, this "lock" (accept_mutex) was on by default; it uses inter-process synchronization to implement load balancing. How? It ensures that only one Worker process is accepting new connections at any given moment. There are several problems with this. The green box represents throughput: it is not high, which leads to the second problem, that request latency is also quite long and its variance is very large.

If the "lock" is turned off, you can see that throughput goes up and variance goes down, but latency increases. Why does this happen? Because one Worker may become very busy, with a very high number of connections, while the other Worker processes sit idle.

If reuseport is used, load balancing is done at the kernel level. This suits specific scenarios; in a complex application scenario you can see significant differences between having reuseport on and off.
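The three strategies compared above map onto the following settings (a sketch; accept_mutex has defaulted to off since Nginx 1.11.3):

```nginx
events {
    accept_mutex off;         # on = the "lock"; off = all workers race for new connections
}

http {
    server {
        listen 80 reuseport;  # kernel-level load balancing via SO_REUSEPORT
    }
}
```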

3.6 Timeouts

Timeouts are something I have mentioned repeatedly; they are implemented with a red-black tree of timers. The only thing to add here is that Nginx is now mature as a layer-4 reverse proxy.

Protocols like UDP can also be reverse proxied, but to quickly kick out problematic sessions, follow the principle of one request, one response.
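A sketch of a layer-4 UDP reverse proxy where timeouts and the one-request-one-response rule drop problem sessions quickly (addresses are illustrative):

```nginx
stream {
    upstream dns_backend {
        server 10.0.0.53:53;
    }

    server {
        listen 53 udp;
        proxy_pass      dns_backend;
        proxy_timeout   10s;   # drop idle sessions quickly
        proxy_responses 1;     # one response expected per request
    }
}
```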

3.7 Caching

Improving performance ultimately means working on caching. For example, when I worked on cloud disks at Aliyun, the cloud-disk cache had a concept of caching along the spatial dimension: when one piece of content is read, the content around it is cached as well, much like the prefetching and branch prediction you may know from code optimization. That spatial dimension is used less here; caching based on the time dimension is used much more.
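Time-dimension caching in Nginx usually means proxy_cache; a minimal sketch (the zone name, sizes, and upstream address are illustrative):

```nginx
http {
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=my_cache:10m
                     max_size=1g inactive=60m;

    server {
        location / {
            proxy_cache       my_cache;
            proxy_cache_valid 200 302 10m;   # how long successful responses stay fresh
            proxy_pass        http://127.0.0.1:8080;
        }
    }
}
```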

3.8 Reducing Disk I/O

There is actually a lot that can be done. To optimize reads: sendfile zero copy, RAM disks, SSDs. To reduce writes: AIO, though when the data being written is much larger than memory it degrades to blocking calls as it eats into your memory. And the thread pool, which is used only for reading files: switching those reads to separate threads keeps the main process from being blocked when things degrade to that mode, and the official blog reports a nine-fold performance improvement.
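A sketch combining these read-side optimizations (the pool name is illustrative, and aio threads requires Nginx built with --with-threads):

```nginx
thread_pool io_pool threads=32;    # illustrative pool name

http {
    sendfile  on;                  # zero copy for static files
    aio       threads=io_pool;     # offload blocking file reads to the pool
    directio  8m;                  # read very large files without the page cache
}
```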

4. System optimization

The first is raising capacity-related configuration: the limits we hit when accepting connections, the local port range used when we initiate outgoing connections, and limits on things like the network card's queues.
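On the Nginx side the capacity knobs look roughly like this (the kernel sysctls, such as the accept backlog limit and the local port range, have to be raised to match; numbers are illustrative):

```nginx
worker_rlimit_nofile 65535;      # per-worker file/connection descriptor limit

events {
    worker_connections 65535;    # must fit under the rlimit above
}

http {
    server {
        listen 80 backlog=4096;  # accept queue, capped by net.core.somaxconn
    }
}
```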

The second is CPU cache affinity. It depends on the hardware: L3 caches nowadays are roughly 20 megabytes. CPU cache affinity is a very big topic, so I will not expand on it here.

Third, CPU affinity under a NUMA architecture, which splits memory into parts, each close to a particular group of cores: accessing the near side is fast, while accessing the far side is roughly three times slower. You do not need to worry about this as long as using more cores still improves performance significantly.
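Binding workers to cores keeps their CPU caches warm; a sketch for a four-core machine (recent versions can also do this automatically):

```nginx
worker_processes    4;
worker_cpu_affinity 0001 0010 0100 1000;

# or, since 1.9.10:
# worker_processes    auto;
# worker_cpu_affinity auto;
```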

Fourth, fast fault tolerance on the network side. The most troublesome parts of a TCP connection are establishment and teardown, and there are many parameters to tune: where retransmission happens, how long to wait before retransmitting, and how many times to retransmit.

What I am showing here is slow start, and there are several concepts in it. The first is that the window grows by doubling each round trip. Because network bandwidth is limited, once you exceed it the network equipment drops packets, the congestion window is cut back, and recovery after that is slow.

One TCP protocol optimization: it would otherwise take roughly four round trips before each transmission can carry tens of kilobytes, so if conditions allow, increase the initial congestion window so the connection can send at a high rate from the very beginning.

Looking at the CPU, for example: TCP defer accept sacrifices some immediacy, but the benefit is that Nginx is not woken up, causing a context switch, when a connection has been established but no data has arrived yet.
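From the Nginx side this is the deferred flag on listen, which maps to TCP_DEFER_ACCEPT on Linux:

```nginx
server {
    listen 80 deferred;   # wake the worker only when data has actually arrived
}
```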

On memory, this refers to kernel-side memory: the operating system has tuning for memory low and high watermarks, and the memory allocated for each connection can be adjusted dynamically between pressure mode and non-pressure mode.

These kernel networking features all solve essentially one problem: turning one-at-a-time processing into batch processing. Batching raises throughput because it consumes fewer resources and causes fewer switches. The underlying logic is always the same; applied at different levels it simply produces different effects.

Port reuse, for example the kernel's TIME_WAIT reuse option, is useful because it allows ports to be reused for connections to upstream services without risk.

Improving the efficiency of multiple CPUs: much has been mentioned above, but there are two key points. One is CPU binding, which makes the CPU cache more effective. The other is multi-queue network cards, which already spread the work across cores at the hardware level.

BDP: you all know bandwidth; bandwidth and delay together determine the bandwidth-delay product, and throughput equals the window divided by the delay.
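Stated as the standard relation (nothing Nginx-specific here):

```latex
\mathrm{BDP} = \text{bandwidth} \times \mathrm{RTT},
\qquad
\text{throughput} = \frac{\text{window}}{\mathrm{RTT}}
```

so the link is only fully used once the window has grown to at least the bandwidth-delay product.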

Memory allocation speed is also one of our concerns: under heavy concurrency allocation gets worse, because as you can see there is a lot of competition for it.

As for PCRE optimization, using the latest version is enough.
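One related knob, not spelled out in the talk, is the PCRE JIT, which speeds up regular-expression matching when Nginx is built against a JIT-capable PCRE:

```nginx
pcre_jit on;   # main context; requires PCRE built with JIT support
```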

This article is based on the talk Tao Hui gave at GOPS 2018 Shanghai.