Nginx is one of the most widely used load balancers and web servers in the world. Cloudflare runs NGINX at large scale on its edge nodes and has significantly improved its performance by fixing problems encountered along the way. This article is a detailed look at some of Cloudflare's optimizations, and is worth reading for engineers and architects.

In all, 10 million websites and applications use Cloudflare to speed up their services. At peak, our 151 data centers process more than 10 million requests per second. Over the years we have made many improvements to NGINX to accommodate our growth. This blog post covers some of them.

How NGINX works

NGINX uses an event loop to solve the C10K problem. Every time a network event occurs (a new connection, a connection becoming readable or writable, and so on), NGINX wakes up, handles the event, and then goes back to whatever else needs doing (possibly handling other events). When an event arrives, the data for it is already available, which lets NGINX efficiently process many requests simultaneously without waiting.

For example, the following code reads data from a file descriptor:
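The snippet itself is not reproduced here, but a minimal sketch of the pattern, assuming a readiness notification from epoll and an illustrative handler named on_readable(), looks roughly like this:

/* a readiness notification arrived for fd: read until the kernel
 * buffer is drained, then go back to the event loop */
#include <errno.h>
#include <unistd.h>

void on_readable(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process n bytes of data */
            continue;
        }
        if (n == 0) {
            /* the peer closed the connection */
            break;
        }
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            /* nothing left to read for now: wait for the next event */
            break;
        }
        /* some other error: handle it and stop */
        break;
    }
}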

When fd is a socket, we can read whatever data has already arrived. The final call returns EWOULDBLOCK, which means we have drained the kernel buffer and should not read from the socket again until more data is available.

Disk I/O is different from network I/O

When fd is a regular file on Linux, EWOULDBLOCK and EAGAIN never occur, and read() always waits until the whole buffer has been read. This is true even if the file was opened with O_NONBLOCK. Quoting open(2):

Note that this flag has no effect for regular files and block devices

In other words, the above code can be reduced to:
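Again as an illustrative sketch (same hypothetical handler names as above), the regular-file version collapses to a single blocking call:

/* for a regular file, read() never returns EWOULDBLOCK: it simply
 * blocks until the buffer is filled (or the file ends) */
#include <unistd.h>

void on_file_readable(int fd)
{
    char buf[4096];

    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
        /* process n bytes of data */
    }
}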

This means that if data needs to be read from disk, the entire loop blocks until the file is read, and subsequent event processing is delayed.

This is acceptable for most workloads, because reading from disk is usually fast enough and far more predictable than waiting for a packet to arrive from the network. This is especially true now that everyone is using SSDs, and our cache disks are all SSDs. Modern SSDs have very low latency, typically in the tens of microseconds. On top of that, we run NGINX with multiple worker processes, so slow event processing in one process does not block requests in other processes. Most of the time, we can rely on NGINX's event handling to serve requests quickly and efficiently.

SSD performance is not always up to par

As you can probably guess, these assumptions are overly optimistic. If every disk read takes 50μs, then it takes about 2ms to read 0.19MB in 4KB blocks (and we read in larger block sizes). In practice, though, what matters is the 99th and 999th percentile of read times; in other words, the slowest read out of every 100 (or 1,000) is often far from fast. SSDs are very fast but also very complex: they are essentially machines that queue and reorder I/O while performing background tasks such as garbage collection and defragmentation. Occasionally a request is slowed down enough to matter. My colleague Ivan Babrou ran an I/O benchmark in which the slowest disk reads took as long as 1 second. Furthermore, some of our SSDs have more performance outliers than others. Going forward, we will take performance consistency into account when purchasing SSDs, but in the meantime we need a solution for the hardware we already have.

Use SO_REUSEPORT to evenly distribute the load

While a single slow read is hard to avoid, we really don't want one second of disk I/O to block the other requests being handled during that same second. Conceptually, NGINX can process many requests in parallel, but it only runs one event at a time. So I added a metric to measure this:
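The instrumentation itself is not shown in the original; the idea can be sketched roughly as follows, where histogram_observe() is a hypothetical stand-in for whatever metrics library is actually used and handle_ready_events() stands in for NGINX's real event handling:

#include <time.h>

/* hypothetical metrics hook standing in for a real histogram library */
void histogram_observe(const char *name, double seconds);

static double monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

void handle_ready_events(void)
{
    double start = monotonic_seconds();

    /* ... process every event returned by the last epoll_wait() ... */

    /* time spent handling events instead of sitting in epoll_wait(),
     * exported as the "event_loop_blocked" metric */
    histogram_observe("event_loop_blocked", monotonic_seconds() - start);
}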

event_loop_blocked turned out to account for more than 50% of our TTFB (time to first byte). In other words, half of the time spent serving a request was due to the event loop being blocked by other requests. Since event_loop_blocked measures only about half of the blocking (delayed calls to epoll_wait() are not measured), the actual ratio of blocked time is much higher. We run 15 NGINX processes per machine, so a slow I/O should block at most about 6% of requests. However, I/O events are not evenly distributed, and in the worst case 11% of requests were blocked (twice as many as expected). SO_REUSEPORT solves the uneven-distribution problem. Marek Majkowski has previously written about its downsides, but they largely do not apply in our case: our upstream connections are long-lived, so the slightly higher latency when opening a connection is negligible. This single configuration change, enabling SO_REUSEPORT, improved peak P99 by 33%.
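In NGINX, SO_REUSEPORT is turned on with the reuseport parameter of the listen directive; the HTTPS listener below is illustrative:

# give each worker its own listening socket so the kernel spreads
# incoming connections evenly across them
listen 443 ssl reuseport;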

Move read() to the thread pool

The solution to this problem is to make read() non-blocking. This is in fact a feature NGINX already implements! With the following configuration, read() and write() are performed in a thread pool and do not block the event loop:
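Assuming NGINX was built with thread-pool support (the --with-threads configure option), the directives look like this:

# offload file reads (and writes) to the thread pool instead of
# blocking the event loop
aio          threads;
aio_write    on;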

However, when we tested this, we saw only a slight improvement in P99 rather than the large improvement in response times we had hoped for. The difference was within the margin of error, and the result discouraged us enough that we stopped digging for a while. There were several reasons the optimization fell short. In the related benchmark, 200 concurrent connections requested 4MB files stored on spinning disks. Spinning disks add a lot of I/O latency, so an optimization that targets read latency has a much bigger effect there. We also care mainly about P99 (and P999) performance, and solutions that help the average do not necessarily help the outliers. Finally, in our environment typical files are much smaller: 90% of our cache hits are for files smaller than 60KB, and smaller files mean less time spent blocked (we typically read an entire file in two I/Os). Consider the disk I/O that must be performed for a cache hit:
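As a simplified sketch (the function name is illustrative; the path and the 32KB figure follow the discussion below), a cache hit for the key 0xCAFEBEEF boils down to one open() and one read():

/* sketch of the disk I/O behind a cache hit for key 0xCAFEBEEF */
#include <fcntl.h>
#include <unistd.h>

void serve_cache_hit(void)
{
    char buf[32 * 1024];

    /* look up the cached object by its key-derived path */
    int fd = open("/cache/prefix/dir/EF/BE/CAFEBEEF", O_RDONLY);
    if (fd < 0)
        return;

    /* read up to 32KB of cache metadata and response headers;
     * with "aio threads" this read runs in the thread pool */
    read(fd, buf, sizeof(buf));

    close(fd);
}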

32KB is not a fixed number; if the headers are small, we only need to read 4KB (we do not use direct I/O, so the kernel rounds reads up to 4KB). The open() looks harmless, but it is not free: at a minimum, the kernel needs to check that the file exists and that the calling process has permission to open it. To do that, it has to find the inode of /cache/prefix/dir/EF/BE/CAFEBEEF, and to do that it has to look up CAFEBEEF in /cache/prefix/dir/EF/BE/. To make a long story short, in the worst case the kernel has to perform the following lookups:

/cache
/cache/prefix
/cache/prefix/dir
/cache/prefix/dir/EF
/cache/prefix/dir/EF/BE
/cache/prefix/dir/EF/BE/CAFEBEEF

That's six lookups for open() to perform, compared with read()'s one! Fortunately, most of these lookups are served by the dentry cache and do not require trips to the SSD. But clearly, doing read() in the thread pool is only half of the picture.

Non-blocking open() in thread pools

So I modified the NGINX code to do most of the open() calls in the thread pool as well, so that they do not block the event loop.
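The actual patch lives inside NGINX's thread-pool machinery, but the underlying pattern can be sketched on its own: hand the blocking open() to a helper thread and have it notify the event loop once the file descriptor is ready. In the sketch below, open_task, open_worker, and the pipe-based notification are all illustrative, not Cloudflare's code:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct open_task {
    char path[256];   /* file to open */
    int  notify_fd;   /* write end of a pipe watched by the event loop */
};

static void *open_worker(void *arg)
{
    struct open_task *task = arg;

    /* this open() may block on dentry/inode lookups, which is fine
     * here because we are off the event-loop thread */
    int fd = open(task->path, O_RDONLY);

    /* hand the (possibly negative) fd back to the event loop */
    write(task->notify_fd, &fd, sizeof(fd));
    free(task);
    return NULL;
}

int main(void)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return 1;

    struct open_task *task = malloc(sizeof(*task));
    snprintf(task->path, sizeof(task->path), "/etc/hostname");
    task->notify_fd = pipefd[1];

    pthread_t tid;
    pthread_create(&tid, NULL, open_worker, task);

    /* in NGINX this would be an epoll event on pipefd[0];
     * here we simply block on the pipe to keep the sketch short */
    int fd;
    read(pipefd[0], &fd, sizeof(fd));
    printf("open() finished in the worker thread, fd=%d\n", fd);

    if (fd >= 0)
        close(fd);
    pthread_join(tid, NULL);
    return 0;
}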

The results: we upgraded our five busiest data centers on June 26, followed by a worldwide rollout the next day. Overall peak P99 TTFB (time to first byte) improved by a factor of 6. In fact, adding up all the time saved from processing 8 million requests per second over a full day, we saved the Internet 54 years of wait time per day. Our event loop handling is still not completely non-blocking: it still blocks the first time a file is cached (open(O_CREAT) and rename()) and when revalidation updates are written back. However, thanks to our high cache hit ratio these cases are rare, so they are not a problem for now. In the future, we are also considering moving this code out of the event loop to further improve our P99 latency.

Conclusion

NGINX is a powerful platform, but dealing with extremely high I/O load on Linux can be challenging. Upstream NGINX can offload file reads to separate threads, but at our scale we needed to go further to meet the challenge.

Original English article:

https://blog.cloudflare.com/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/?ref

This article was written by Ka-Hing Cheung and translated by Fangyuan. Please credit the source when reprinting.
