Preface
In this article I summarize the HTTP 499 problems I have been dealing with recently in day-to-day business, describe in detail why 499 appears, and walk through how each problem was located. I hope it is helpful.
HTTP 499 Status code
The 499 status code is defined in the src/http/ngx_http_special_response.c file in the Nginx source:
ngx_string(ngx_http_error_494_page), /* 494, request header too large */
ngx_string(ngx_http_error_495_page), /* 495, https certificate error */
ngx_string(ngx_http_error_496_page), /* 496, https no certificate */
ngx_string(ngx_http_error_497_page), /* 497, http to https */
ngx_string(ngx_http_error_404_page), /* 498, canceled */
ngx_null_string, /* 499, client has closed connection */
From the comment we can see that 499 means the client actively closed the connection.
On the surface, then, 499 is simply the client disconnecting on its own. In actual business development, however, when the HTTP 499 status code appears it is usually because the server takes too long to respond, the client grows “impatient”, and it drops the connection.
Below, let's walk through four 499 problems I have encountered in daily development, along with how each one was located and handled.
Four HTTP 499 issues
1. The server interface request times out
Some interfaces are simply slow when the client calls the server. Here is a simple example:
select * from test where test_id in (2, 4, 6, 8, 10, 12, 14, 16);
Suppose the test table holds 5 million rows and we run the WHERE ... IN query above, but test_id has no index. The query does a full table scan, so it is very slow; taking a few seconds is normal.
If the client has a timeout configured, it disconnects once that timeout is exceeded, producing a 499. If the client has no timeout, but the SQL takes 5 seconds while PHP-FPM times out at 3 seconds, the result is a 502 instead.
Solving this is simple: find the corresponding slow request and optimize it. It is usually slow SQL; in the example above the field has no index, so add one.
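As a sketch (using the hypothetical test table and test_id column from the example above), adding the index and verifying it might look like this:

-- add an index on the column used in the WHERE clause
ALTER TABLE test ADD INDEX idx_test_id (test_id);

-- confirm the query now uses the index instead of doing a full table scan
EXPLAIN select * from test where test_id in (2, 4, 6, 8, 10, 12, 14, 16);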
Of course, there are also queries that do not use an existing index, such as a problem I ran into before and analyzed in my article on MySQL choosing the wrong index: the index is there, yet the optimizer still does not use it.
Conclusion
In this case the interface really is slow; it is not an occasional fluke. When a request is consistently slow, the best solution is also the obvious one: optimize the interface.
2. Nginx causes disconnection
There are also 499 problems caused by Nginx itself.
As you can see from the figure above, request_time is far too small for this to be an interface timeout, so what causes the problem?
Some of the information in the figure is not shown because it involves the company's specific interface, so let me describe the phenomenon: the two requests in the figure have exactly the same request parameters and arrive very close together in time.
The solution
A Google search showed that others have run into the same situation: when two identical POST requests arrive in quick succession, Nginx considers the connection unsafe and actively disconnects the client, producing a 499.
The fix is the following Nginx configuration:
proxy_ignore_client_abort on;
This directive tells Nginx to ignore the client aborting the connection and to keep waiting for the response from the upstream server. If the upstream returns normally, the access log records 200; if the upstream times out, it records 504.
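For context, a minimal sketch of where the directive sits (the location path and upstream name here are made up for illustration):

location /api/ {
    proxy_pass http://php_backend;   # hypothetical upstream
    proxy_ignore_client_abort on;    # keep waiting for the upstream even if the client disconnects
}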
After setting this parameter to on and observing online for a while, we found that the 499s basically disappeared.
Caveats
- If there are two Nginx servers, where the first reverse-proxies to the second and the second proxies to the PHP service, this parameter only takes effect on the first Nginx server; configuring it on the second has no effect.
- Once this parameter is on, the server keeps processing a request even after the client has disconnected, which can exhaust server resources under load. Decide whether to use it online based on your own situation.
3. 499 problems that occur at a fixed time
There is also the case where 499s show up at a fixed time.
As you can see, these 499s occur at a fairly fixed time, all within the 10 minutes after 2 a.m.
The first thought is usually that some scheduled task is hogging resources around that time. In practice, though, the problem was not easy to locate; here is how I tracked it down.
Steps to locate the problem
- Crontab scripts: The first thing that comes to mind is a scheduled script around 2 a.m. hogging resources. I checked all the crontab entries on the machine that reported the errors and found only one script that runs near 2 a.m. Running it manually showed that it finishes very quickly, so it would not hold resources long enough to affect normal business.
- Machines: Next I suspected the machines themselves, so I checked their monitoring charts (we use Falcon) and saw nothing unusual at 2 a.m. (This could in fact be inferred anyway: the 499s appear on multiple machines for this service, and it is unlikely that every machine has a problem; other projects deployed on the same machines also had no issues, so a machine-level problem was unlikely.)
- Database: Since only this service had the problem, I looked at the monitoring chart of the main database it connects to (also in Falcon) and found nothing wrong. (There actually was a problem here; I just did not see it at this point.)
- Nginx: If none of the above was the problem, was something wrong with the Nginx layer above? upstream_response_time had no value at all, so it was possible that the request never even reached the backend PHP service. However, if proxy_ignore_client_abort had not been set to on, the request would not have continued to the backend and upstream_response_time would simply be logged as -, which is normal. So Nginx itself was fine.
- Abnormal requests: Could there have been a wave of abnormal requests at 2 a.m.? These are POST requests, and Nginx happens to be configured to record the POST parameters. After reassembling the request parameters logged by Nginx, I found nothing wrong with them, which shows the problem is unrelated to the request parameters.
- Back to the database:
  - By this point I had checked pretty much everything I could think of, so I went looking for what these requests have in common.
  - All of them query the same database. I took the interface with the most requests, logged the execution time of its database queries, and found that a simple interface was taking 2 seconds, with a dozen or so requests over 2 seconds each night. Our timeout is 1 second, hence the 499s.
  - Whether or not this was the cause of the 499s, it was a problem that needed fixing. Looking at the Falcon monitoring again, I found that one database machine had abnormal IO and CPU at 2 a.m., so I logged in to that machine and checked its error log.
- The error log (error.log):
  - In error.log I found the SQL statements of the interfaces that returned 499, together with the following error message: mycat Waiting for query cache lock.
  - From this log we can see the SQL was blocked waiting for the query cache lock, a global lock that I will not go into here. The query cache is no longer recommended (it has been removed entirely in MySQL 8.0), because whenever a table is updated its entries in the query cache are invalidated. For frequently updated tables it is of little use, and for small, rarely updated tables, querying the database directly is cheap enough that the cache adds little. On top of that, with the query cache on, every incoming query first has to look in the query cache, fall back to the database on a miss, and then write the result back into the query cache.
  - I checked the database and, sure enough, the query cache was enabled. I assumed this was the cause and that turning it off would fix things, but then I checked other databases and found the query cache enabled there too, which proves the query cache by itself was not the root cause. By the way, the command to check whether it is enabled is show variables like '%query_cache%'; query_cache_type ON means enabled, OFF means disabled.
  - After a bunch of messy experiments I finally focused on what was occupying the query cache. Searching the error log turned up the following SQL (sanitized): select A,B,C from test where id > 1 and id < 2000000. That query returns 2 million rows and writes them into the query cache, which is definitely going to hog it.
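As a side note, this kind of contention can also be observed live: sessions blocked on the query cache show the state Waiting for query cache lock in the processlist. A minimal sketch (not taken from the original incident):

-- list sessions currently blocked on the query cache lock, longest-waiting first
SELECT id, user, host, db, time, state, LEFT(info, 80) AS query_text
FROM information_schema.processlist
WHERE state = 'Waiting for query cache lock'
ORDER BY time DESC;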
- The slow log (slow.log):
  - The source of that SQL could not be found in error.log, and it should not be business SQL, since the business would never issue such a query. Looking up in slow.log the client IP that issued the 2-million-row query, I found it came from a machine in a Hadoop cluster running a script that syncs online data to Hive, and the database IP configured in that script was the IP of the problematic database.
  - We then changed the script's configuration to point at an offline slave library instead. After observing for a few days (the last three days in the figure above), it was clear that the 499 problem had largely gone away.
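For reference, a quick sketch of the MySQL settings that determine whether queries like this end up in the slow log (these are standard variables; the values depend on your instance):

-- is the slow query log enabled, and where does it write?
show variables like 'slow_query_log';
show variables like 'slow_query_log_file';

-- queries running longer than this many seconds are recorded
show variables like 'long_query_time';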
Conclusion
This problem was very hard to locate and took a long time; basically we tested every scenario we could think of, one by one. If MySQL had not turned out to be the problem, we would have had to keep digging down to the machine level. Hopefully this gives you some ideas to draw on.
4. Occasional 499 problems
Sometimes we see only one or two 499s, but when we replay the request it responds so quickly that the problem cannot be reproduced at all. This kind of occasional case can usually be ignored, but let me briefly explain what might be causing it.
MySQL flushes dirty pages to disk, a topic we will cover in a separate article. While MySQL is flushing dirty pages it consumes system resources, and during that window our queries can be slower than usual.
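As a rough sketch of how to see whether the buffer pool is under dirty-page pressure at a given moment (the status and variable names are standard; the thresholds depend on your configuration):

-- how many buffer pool pages are currently dirty, out of how many in total
show global status like 'Innodb_buffer_pool_pages_dirty';
show global status like 'Innodb_buffer_pool_pages_total';

-- the dirty-page percentage at which InnoDB starts flushing aggressively
show variables like 'innodb_max_dirty_pages_pct';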
On top of that, MySQL has a mechanism that can make this even slower: when flushing a dirty page, if a neighboring page is also dirty, it flushes the neighbor along with it, and this "drag the neighbor in" logic keeps spreading; for each neighboring page, if the pages next to it are also dirty, they get flushed together as well.
This "flush the neighbors" optimization made sense in the era of mechanical hard disks: it reduces seek time and saves a lot of random I/O, since a mechanical disk typically delivers only a few hundred random IOPS. Nowadays, with SSDs, IOPS is usually no longer the bottleneck, so flushing only the page itself finishes the necessary dirty-page work faster and shortens SQL response time.
Interestingly, this mechanism is disabled by default in MySQL 8.0. It is controlled by the innodb_flush_neighbors parameter: 1 means neighboring dirty pages are flushed together, and 0 means only the page itself is flushed. In MySQL 8.0 innodb_flush_neighbors defaults to 0.
Let’s take a look at our online configuration:
mysql> show variables like "%innodb_flush_neighbors%";
+------------------------+-------+
| Variable_name | Value |
+------------------------+-------+
| innodb_flush_neighbors | 1 |
+------------------------+-------+
1 row in set (0.00 sec)
Unfortunately, neighbor flushing is enabled on our online databases, so an occasional sudden 499 is entirely possible.
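Since innodb_flush_neighbors is a dynamic variable, it can be changed at runtime; a sketch of turning neighbor flushing off (assuming the storage is SSD-backed and you have confirmed this is what you want):

-- 0 = flush only the page itself; 1 = also flush dirty neighboring pages
SET GLOBAL innodb_flush_neighbors = 0;

-- verify the change
show variables like "%innodb_flush_neighbors%";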
Conclusion
Above we covered the four situations in which 499 shows up in daily business development, along with how each problem was located and solved. I hope it provides some help.