The scene
Last night I was suddenly flooded with alarms from APM (Application Performance Management; we built this system to monitor and alert on the performance and reliability of our online applications).
(Voiceover: monitoring is a very important means of discovering problems. If you don't have it yet, be sure to set it up in time.)
Then operations called to report that all four machines deployed online had OOMed (out of memory) and the service was unavailable: check the problem, quickly!
Troubleshooting the problem
First, operations restarted the machines to get the online service back up, then carefully checked the logs of the affected machines. It was indeed OOM that made the service unavailable.
My first thought was to dump the memory state at the time of the incident. However, to restore the online service as quickly as possible, operations had already restarted the machines, so the memory at the moment of the incident could no longer be dumped. So I turned to our APM monitoring chart for the JVM.
(Voiceover: if one approach doesn't work, try another angle! Again, monitoring is very important! Complete monitoring can reconstruct the scene of the incident and makes it easy to locate the problem.)
Starting around 16:00, the number of threads created in the application had climbed every minute, up to about 30,000, and after the restart (blue arrow) it kept climbing; a normal thread count is nowhere near that. The problem was found: around 16:00 a bad piece of code must have been released that kept creating threads, and the created threads never died! A review of the release record revealed only one suspicious code diff: an evictExpiredConnections configuration added during HttpClient initialization.
The problem was located: it had to be this configuration (the thread count started climbing at exactly the release time!). So we first killed the new configuration online, and after the rollback the thread count returned to normal. So what does evictExpiredConnections do that makes the thread count grow every minute? And what problem was this configuration added to solve? I went to the colleague involved to understand the context behind it.
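For context, the suspicious diff amounted to roughly the following (a sketch assuming Apache HttpClient 4.4+; the exact builder chain in our code may have differed):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Before the release: a plain client
// CloseableHttpClient client = HttpClients.createDefault();

// After the release: the same initialization plus the new configuration
CloseableHttpClient client = HttpClients.custom()
        .evictExpiredConnections() // the newly added line
        .build();
```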
Reconstructing the incident
It turned out the configuration had been added to fix occasional NoHttpResponseException errors seen online. So how does this exception arise?
Before we get to that, we need to take a look at HTTP’s keep-alive mechanism.
Take a look at the life cycle of a normal TCP connection
As you can see, each TCP connection requires a three-way handshake before data can be sent, and a four-way wave to tear it down. If every TCP connection were closed as soon as the server returned its response, this would be very costly when there are many HTTP requests. If instead the server does not close the TCP connection after returning a response but reuses it for the next HTTP request, most of the overhead of creating/closing TCP connections disappears, which brings a significant performance improvement.
As shown in the figure below, compare three HTTP requests without TCP reuse against three with it: reusing the TCP connection saves the cost of establishing/closing it twice. In theory, an application needs to establish only one TCP connection, and all other HTTP requests can reuse it, so n HTTP requests save n-1 rounds of TCP setup/teardown. That is a huge performance win.
That, in essence, is keep-alive: reuse the connection and keep it alive.
(Voiceover: keep-alive has been enabled by default only since HTTP 1.1, but most sites now use HTTP 1.1, which means most of them support connection reuse by default.)
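To make this concrete, a keep-alive negotiation on the wire looks roughly like this (illustrative hostname and values):

```
GET /index.html HTTP/1.1
Host: example.com
Connection: keep-alive

HTTP/1.1 200 OK
Connection: keep-alive
Keep-Alive: timeout=5, max=100
Content-Length: 1024

...body...
```

The Keep-Alive response header is how a server (Apache httpd, for example) advertises how long it will keep the idle connection open, which matters for the timeout discussion below.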
There is no such thing as a free lunch, though. Although keep-alive saves many unnecessary handshake/wave operations, a connection that is kept alive for a long time with no HTTP requests on it sits idle, occupies system resources, and can sometimes cost more than connection reuse saves. Therefore, we usually set a timeout for keep-alive: if the connection stays idle (no data is transferred) for the timeout period, it is released once the timeout expires, saving system overhead.
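On the client side, Apache HttpClient lets you align the connection pool's keep-alive duration with whatever the server advertises. A minimal sketch (the 30-second fallback is my illustrative value, not from our incident):

```java
import org.apache.http.HeaderElement;
import org.apache.http.HeaderElementIterator;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.message.BasicHeaderElementIterator;
import org.apache.http.protocol.HTTP;

ConnectionKeepAliveStrategy keepAliveStrategy = (response, context) -> {
    // Honor the server's "Keep-Alive: timeout=N" header if present
    HeaderElementIterator it = new BasicHeaderElementIterator(
            response.headerIterator(HTTP.CONN_KEEP_ALIVE));
    while (it.hasNext()) {
        HeaderElement he = it.nextElement();
        if (he.getValue() != null && "timeout".equalsIgnoreCase(he.getName())) {
            return Long.parseLong(he.getValue()) * 1000L; // seconds -> millis
        }
    }
    return 30_000L; // fallback keep-alive of 30s when the server is silent
};
```

It is attached with HttpClients.custom().setKeepAliveStrategy(keepAliveStrategy).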
Adding a timeout to keep-alive sounds perfect, but it introduces a new problem. Consider the following scenario:
If the server does not receive any request from the client within the timeout period, it closes the connection by sending a packet with the FIN flag, disconnecting and releasing resources. If, before this FIN packet reaches the client, the client happens to reuse the TCP connection and sends another HTTP request packet, the server is already in the middle of the four-way teardown and will not process the request; it replies with an RST packet instead. When the client receives the RST packet, it reports an exception: NoHttpResponseException.
Let's use a flow chart to make the cause of NoHttpResponseException a little clearer:
How do we solve **NoHttpResponseException**? There are two strategies:
- Retry. After catching the exception, retry once or twice; since the retry picks up a valid connection, the problem is avoided (see the sketch after this list).
- Set up a timer thread to periodically clean up the idle connections described above. Setting the period to half of the keep-alive timeout ensures idle connections are reclaimed before they expire.
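For the retry strategy, HttpClient already ships a hook. A minimal sketch (assuming Apache HttpClient 4.x; the retry count is illustrative):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClients;

CloseableHttpClient client = HttpClients.custom()
        // Retry up to 2 times; NoHttpResponseException is not in this handler's
        // non-retriable list, so such failures are retried. The second argument
        // also retries requests that were already sent, which is only safe for
        // idempotent calls.
        .setRetryHandler(new DefaultHttpRequestRetryHandler(2, true))
        .build();
```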
The evictExpiredConnections configuration is exactly the second strategy. Here is the official description:
Makes this instance of HttpClient proactively evict expired connections from the connection pool using a background thread.
Calling this method creates only one background eviction thread per HttpClient instance, so why did the application keep adding threads? Because we were creating a new HttpClient for every request! Since every HttpClient instance calls evictExpiredConnections, as many eviction threads were created as there were requests!
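The code-level fix follows directly: share a single HttpClient. A minimal sketch (HttpClientHolder and the pool sizes are hypothetical names and values):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public final class HttpClientHolder {
    // One shared client: one connection pool and one eviction thread,
    // no matter how many requests the application serves.
    public static final CloseableHttpClient CLIENT = HttpClients.custom()
            .setMaxConnTotal(200)       // illustrative pool sizing
            .setMaxConnPerRoute(50)
            .evictExpiredConnections()
            .build();

    private HttpClientHolder() {}
}
```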
One more question: why did all four machines online hang at almost the same time? Because of load balancing: the four machines have the same weight and the same hardware configuration, and received roughly the same traffic, so the background threads produced by creating an HttpClient per request peaked on all four machines at the same time, and then they all OOMed.
Solving the problem
Besides fixing the client itself, we set a threshold alarm on the thread count: once the number of threads in an application exceeds the threshold, an alarm fires immediately, so the problem can be detected and handled before the application OOMs. (Voiceover: once again, monitoring is important; nip problems in the bud!)
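A sketch of what such a check can look like inside the JVM (the threshold and the alert action are assumptions; our real check lives in APM):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountAlarm {
    private static final int THRESHOLD = 2_000; // illustrative threshold

    public static void check() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int live = mx.getThreadCount(); // current number of live threads
        if (live > THRESHOLD) {
            // In practice this would page on-call through the alerting system
            System.err.println("ALARM: live threads " + live + " > " + THRESHOLD);
        }
    }
}
```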
Conclusion
This article walked through four online machines OOMing at the same time and analyzed in detail how the root cause was located. The takeaways: before using a library in your application, first understand it fully (creating an HttpClient per request instead of a singleton, as above, is an obvious problem), and the necessary network knowledge is still required. To become a qualified programmer, the language itself is not enough; you should also read up on networking, databases, and so on, which helps enormously with troubleshooting and performance tuning. Once again, complete monitoring is very important; by alarming on a threshold ahead of time, problems can be nipped in the bud!