A record of a network request and connection timeout incident

This article analyzes the causes of an HTTP request timeout, looking at the retry mechanism and the operating system's network settings, in order to resolve a service problem.

Here are two questions: 1) Have you ever had a production accident due to network connection or request timeout? 2) Do you know the default operating system timeout for network connections?

The problem background

Recently a colleague ran into the following problem. The business scenario is simple:

Service A calls interface M of service B over HTTP. Service A has a scheduled task:

More than 1200 records are queried from the DB, each record corresponds to one request, and the requests are sent to interface M in a loop. On receiving a request, service B connects to other servers over TCP to exchange commands. Note that the requests are deliberately not sent asynchronously and concurrently: if they were, service B could quickly run out of processing threads, become unable to handle further requests, and even cause a large number of request timeouts or a service outage.

The scheduled task then ran far longer than expected. After a while, service A's requests to service B hung and finally timed out.

Read timed out:

Service A's own DB queries were working normally, but the interaction between service A and service B matters just as much. If communication between the two services fails, business processing or the whole system is affected.

So why? What are the issues involved here?

Problem solving

1. The retry mechanism accelerates the problem

Troubleshooting started on service A. The ELK logs showed exception entries, and their number was increasing rapidly. Screenshot below:

Exception log details:

The log mentions org.apache.http.impl.execchain.RetryExec, so the problem should be related to the HTTP retry mechanism.

According to the RetryExec source, when the HTTP client executes a request, it returns immediately if the request succeeds; if an IOException is thrown, it considers a retry.

How many times the for loop retries is decided by the retry handler implementation. If the handler decides not to retry, the exception is thrown to the caller.
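The control flow described above can be sketched roughly as follows. This is a simplified illustration using only the JDK, not the actual RetryExec source: the `RetryDecider` interface, the `RetryLoopSketch` class, and its methods are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical stand-in for a retry handler: decides whether another attempt is allowed. */
interface RetryDecider {
    boolean shouldRetry(IOException e, int executionCount);
}

final class RetryLoopSketch {

    /** Runs the request; on IOException asks the decider whether to loop again. */
    static <T> T execute(Callable<T> request, RetryDecider decider) throws Exception {
        for (int execCount = 1; ; execCount++) {
            try {
                return request.call();          // success: return immediately
            } catch (IOException e) {
                if (!decider.shouldRetry(e, execCount)) {
                    throw e;                    // no retry allowed: rethrow to the caller
                }
                // otherwise loop and execute the request again
            }
        }
    }

    /** Demo helper: counts how often an always-failing request is executed. */
    static int countAttempts(int maxExecutions) {
        int[] calls = {0};
        try {
            execute(() -> { calls[0]++; throw new IOException("no response"); },
                    (e, count) -> count < maxExecutions);
        } catch (Exception expected) {
            // the last IOException is rethrown once the decider refuses a retry
        }
        return calls[0];
    }

    public static void main(String[] args) {
        System.out.println("attempts: " + countAttempts(3)); // attempts: 3
    }
}
```

The point of the sketch is that every retry repeats the full request, so a slow downstream service is hit several times for each failing call.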

Here is the source code of the project's custom retry handler implementation:

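import java.io.IOException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.protocol.HttpContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class HttpRequestRetryHandlerServer implements HttpRequestRetryHandler {
   protected static final Logger LOG = LoggerFactory.getLogger(HttpRequestRetryHandlerServer.class);
   @Override
   public boolean retryRequest(IOException e, int retryCount, HttpContext httpCtx) {
      // Give up after three executions
      if (retryCount >= 3) {
         LOG.warn("Maximum tries reached, exception would be thrown to outer block");
         return false;
      }
      // Retry only when the server returned no response at all
      if (e instanceof org.apache.http.NoHttpResponseException) {
         LOG.warn("No response from server on {} call", retryCount);
         return true;
      }
      return false;
   }
}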

What is NoHttpResponseException? According to its documentation, it signals that the target server failed to respond with a valid HTTP response.

So why does service A enter the retry process?

The exception above rules out a network connection timeout. The request itself was sent normally, but for some reason the response did not arrive in time. That is what Read timed out means: reading the response data timed out.

Checking the default configuration:

So 6 seconds is the maximum time allowed for data transfer (the read timeout). If an HTTP request waits more than 6 seconds for response data, the request is aborted and a Read timed out exception is thrown. This largely explains where the exception comes from.
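To make the read timeout concrete, here is a minimal sketch using plain JDK sockets rather than the HTTP client from the article. The `ReadTimeoutDemo` class and the 200 ms value are assumptions for illustration; the server side is a local socket that completes the TCP handshake but never sends any data, standing in for a slow service B.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutDemo {

    /**
     * Connects to a local server that never replies, then reads with the
     * given timeout. Returns true if the read timed out.
     */
    static boolean readTimesOut(int soTimeoutMillis) {
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {
            client.setSoTimeout(soTimeoutMillis);   // read timeout, the role socketTimeout plays
            client.getInputStream().read();         // blocks: the server never writes
            return false;
        } catch (SocketTimeoutException e) {
            return true;                            // the read exceeded the timeout
        } catch (IOException e) {
            throw new IllegalStateException("unexpected I/O failure", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Note that the connection itself succeeds here; only the wait for data fails, which matches the distinction between a connection timeout and a read timeout drawn above.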

2. The retry mechanism accelerates the problem: solution

After analyzing the situation, we made the following changes:

1) HTTP requests in this scenario do not need retries, so turn them off:

@Bean
public CloseableHttpClient noRetryHttpClient(HttpClientBuilder clientBuilder) {
   // Number of retries is 0: do not retry
   clientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(0, false));
   return clientBuilder.build();
}

2) Service B may need more than 6 seconds to process this kind of request, so socketTimeout is raised to 15 seconds:

# http pool config
http:
  maxTotal: 500
  defaultMaxPerRoute: 100
  connectTimeout: 5000
  connectionRequestTimeout: 3000
  socketTimeout: 15000
  maxIdleTime: 1
  keepAliveTimeOut: 65

3. The machine connection timeout is to blame

So what was going on with service B? That is where the real culprit lies. Why did it take so long to process a request?

When service A sends an HTTP request, service B receives it and connects to the target server to exchange data. Service B mainly talks to that machine over TCP using SSH, so network communication is involved.

After checking the logs of service B:

An exception occurs when connecting to the server. Note that the connection attempt takes about 63 seconds. We also confirmed that the target server was not healthy: it had been down for a long time.

The connection goes through the Unix platform's network stack, and the operating system here is Linux (CentOS). So why is the timeout 63 seconds, and not a rounder number such as 5, 15, or 60?

At this point look at the network connection code:

connection.connect();

The connection timeout parameter is not specified, so the default parameter of the operating system kernel is used.

The default timeout for establishing a TCP connection on Linux is 127 seconds, which is usually too long for a client. In most business scenarios you should not rely on the default but set a reasonable connection timeout for the scenario. So where does this number come from? Why 127?

The timeout is determined by the value configured for net.ipv4.tcp_syn_retries.

net.ipv4.tcp_syn_retries sets the maximum number of times the kernel resends the SYN when the first one, sent by the connect() system call, goes unacknowledged. It thereby determines how long the caller waits.

The default on Linux is net.ipv4.tcp_syn_retries = 6. That is, if the local host initiates a connection (sends the first SYN packet of the TCP three-way handshake) and never receives a SYN+ACK from the peer, the application waits at most 127 seconds before the connection attempt times out.

The schedule with the default value looks like this:
After the 1st SYN is sent, wait 1 s (2^0); on timeout, retry.
After the 2nd send, wait 2 s (2^1); on timeout, retry.
After the 3rd send, wait 4 s (2^2); on timeout, retry.
After the 4th send, wait 8 s (2^3); on timeout, retry.
After the 5th send, wait 16 s (2^4); on timeout, retry.
After the 6th send, wait 32 s (2^5); on timeout, retry.
After the 7th send, wait 64 s (2^6); on timeout, the connection attempt fails.
In total: 1+2+4+8+16+32+64 = 127 seconds.

Next look at the TCP SYN parameter on our machine:

Our server sets tcp_syn_retries to 5, which means the default timeout is 1+2+4+8+16+32 = 63 seconds. This matches the problem exactly and explains the 63-second timeout.
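The arithmetic can be checked with a few lines of code. This is only a sketch of the doubling schedule as described in this article, assuming a 1-second initial wait; the class and method names are made up for the example and it does not read any kernel setting.

```java
final class SynBackoff {

    /**
     * Total seconds a connect() waits before failing: the initial SYN plus
     * synRetries retransmissions, with the wait doubling after each send.
     */
    static long totalTimeoutSeconds(int synRetries, long initialTimeoutSeconds) {
        long total = 0;
        long wait = initialTimeoutSeconds;
        for (int send = 0; send <= synRetries; send++) {  // initial SYN + retries
            total += wait;
            wait *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalTimeoutSeconds(6, 1)); // 127
        System.out.println(totalTimeoutSeconds(5, 1)); // 63
    }
}
```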

4. What about Windows?

(I originally did not plan to cover this part and meant to leave it for readers to look up, but I am including it for completeness.)

Since I use Windows 10 as my development machine, I wanted to know what the timeout is on Windows. I wrote a test case, and in one run the connection timed out after about 21 seconds. Where does this number come from?

From the relevant documentation:

TcpMaxConnectRetransmissions

Determines how many times TCP retransmits an unanswered request for a new connection. TCP retransmits new connection requests until they are answered, or until this value expires.

TCP/IP adjusts the frequency of retransmissions over time. The delay between the original transmission and the first retransmissions for each interface is determined by the value of TcpInitialRTT (by default, it is 3 seconds). This delay is doubled after each attempt. After the final attempt, TCP/IP waits for an interval equal to double the last delay and then abandons the connection request.

TcpInitialRTT

Determines how quickly TCP retransmits a connection request if it doesn’t receive a response to the original request for a new connection.

By default, the retransmission timer is initialized to 3 seconds, and the request (SYN) is sent twice, as specified in the value of TcpMaxConnectRetransmissions.

From this documentation, the timeout on Windows is controlled by TcpMaxConnectRetransmissions and TcpInitialRTT. TcpMaxConnectRetransmissions defaults to 2, and TcpInitialRTT usually defaults to 3 seconds.

That is, there are two retries, each waiting twice as long as the previous one, for a total of 21 seconds: 3 + 3 * 2 + (3 * 2) * 2 = 3 + 6 + 12 = 21 seconds.

Run the following command to query Windows parameters:

netsh interface tcp show global

The maximum number of SYN retransmissions is 2 on my corporate development machine but 4 on my personal machine (so the default connection timeout there is 3+6+12+24+48 = 93 seconds). It is not clear to me why the two Windows 10 systems differ.
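Both Windows figures follow from the same geometric series. As a sketch, write R_0 for the initial retransmission timer (TcpInitialRTT) and n for the number of retransmissions (TcpMaxConnectRetransmissions); these symbols are introduced here for illustration and do not appear in the documentation quoted above.

```latex
T(n) = \sum_{k=0}^{n} R_0 \cdot 2^{k} = R_0 \left( 2^{n+1} - 1 \right)
\qquad
T(2) = 3 \cdot (2^{3} - 1) = 21\ \text{s},
\quad
T(4) = 3 \cdot (2^{5} - 1) = 93\ \text{s}
```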

5. The machine connection timeout: solution

Set the connection timeout period to 5 seconds when service B connects to the server:

connection.connect(null, 5000, kexTimout);

With this setting, if the connection is not established within 5 seconds, a timeout exception is thrown and handled, so resources are released as soon as possible and the current processing thread is no longer blocked.
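The same idea can be shown with a plain JDK socket. This is a sketch rather than the code service B uses: the `ConnectWithTimeout` class and its methods are hypothetical, and the demo connects to a local listener so that it runs without any external server.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class ConnectWithTimeout {

    /**
     * Tries to open a TCP connection, giving up after connectTimeoutMillis
     * instead of waiting for the operating system's SYN retry schedule.
     */
    static boolean canConnect(String host, int port, int connectTimeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers SocketTimeoutException as well as refused or unreachable hosts.
            return false;
        }
    }

    /** Demo helper: starts a local listener and connects to it with a 5 s limit. */
    static boolean connectsToLocalListener() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            String host = server.getInetAddress().getHostAddress();
            return canConnect(host, server.getLocalPort(), 5000);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("connected: " + connectsToLocalListener());
    }
}
```

Against a host that is down, the call returns after at most the given timeout, so the calling thread is freed long before the kernel's own retry schedule would end.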

6. Results

After these adjustments and optimizations, the services were verified and released again. They have since run stably without any exceptions.

Perfect!

Conclusion

1) Although the HTTP retry mechanism of service A was not the root cause of the incident, it did accelerate the problem.

So we need to know whether a retry mechanism is actually needed. If it is not, do not configure one; otherwise it wastes resources and can even overwhelm the service provider's system.

2) Network connections here generally mean TCP and HTTP. Set reasonable timeouts (both the connection timeout and the data transfer timeout) to prevent serious problems such as service interruption or downtime.

The operating system's defaults are chosen for the general case and do not take specific business scenarios into account.

Data transfer timeout: for example, the default here is 6 seconds, which does not suit every real scenario and needs to be adjusted to the actual situation. In my case it was set to 15 seconds.

Network connection timeout: for example, the default on Windows is 21 seconds. Linux (CentOS) uses a stepped timeout mechanism with a default of 127 seconds; on my machine it is 63 seconds.

3) Learn from the operating system's timeout mechanism. On both Linux and Windows, each retry after a connection timeout waits a multiple of the previous timeout. The same approach can be applied in our own business development.
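As one way to apply this in application code, here is a minimal sketch of a retry helper whose delay doubles after every failed attempt. The `BackoffRetry` class, its method names, and the cap on the delay are assumptions for this example, not something taken from the services described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

final class BackoffRetry {

    /** Delays (ms) before each retry: initial, 2x, 4x, ... capped at maxDelayMillis. */
    static List<Long> delays(int maxRetries, long initialDelayMillis, long maxDelayMillis) {
        List<Long> result = new ArrayList<>();
        long delay = initialDelayMillis;
        for (int i = 0; i < maxRetries; i++) {
            result.add(Math.min(delay, maxDelayMillis));
            delay *= 2;
        }
        return result;
    }

    /** Runs the action once, then retries with doubling delays until it succeeds. */
    static boolean runWithBackoff(BooleanSupplier action, int maxRetries,
                                  long initialDelayMillis, long maxDelayMillis) {
        if (action.getAsBoolean()) {
            return true;
        }
        for (long delay : delays(maxRetries, initialDelayMillis, maxDelayMillis)) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;               // stop retrying when interrupted
            }
            if (action.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(delays(5, 1000, 10_000)); // [1000, 2000, 4000, 8000, 10000]
        int[] calls = {0};
        boolean ok = runWithBackoff(() -> ++calls[0] == 3, 5, 10, 100);
        System.out.println(ok + " after " + calls[0] + " calls"); // true after 3 calls
    }
}
```

Capping the delay and limiting the number of retries keeps the total wait bounded, which fits the earlier point that retries should only be used where they are really needed.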

Author: cocodroid

Reference: club.perfma.com/article/203…