1. Background

During multi-active pressure tests we kept running into a strange, periodic burst of failed requests: in a long pressure test (e.g. 10 minutes), a large number of requests would error out at regular intervals (roughly every minute), and the higher the traffic, the more frequent and longer-lasting these bursts became. It happened repeatedly throughout the entire EN live pressure-test period; we investigated several suspects more than once without result, and it bothered us for a long time, as shown in the figure below.

In solving this problem I took quite a few detours due to lack of experience: at first I suspected several different directions, ruled them out one by one, and only then found the real cause.

1.1 Summary of the phenomenon:

As the figure shows, traffic, request duration and the error count are all normal at the start of the pressure test, but after about a minute a large number of requests start failing, while the average request time remains normal.

1.2 Pressure test Background:

Take a look at our overall pressure-test deployment and traffic flow: the pressure-test platform starts a test task, the pressure-test cluster then generates the load, and the traffic is sent to the SLB and Nginx sitting in front of our application before finally reaching the application itself.

The reason we drew two identical pictures is that our pressure tests run on a multi-cloud, multi-active (active-active) architecture. We drew both so you can get a feel for our basic multi-cloud, multi-active setup (note: the real multi-cloud picture is far more detailed than this).

2. Troubleshooting process

  1. When the problem was first found, my immediate suspicion was that the back-end application cluster was under too much pressure: after being hit by a big wave of traffic, the servers simply could not keep up. So I went straight to the APM monitoring, and sure enough I found something suspicious.

    During the pressure test there is a period in which the servers handle no requests at all and appear to be idle, and that period lines up exactly with the abnormal time window shown in the pressure-test results.

  2. After that first check I subconsciously assumed something was wrong on the application servers; it felt a bit like a full GC: after the application runs for a while, memory pressure gets too high and the whole world stops. But it was also a little odd: in theory request duration should be affected during a GC pause, yet the overall latency was very stable and there were no timeouts. Setting that aside, a first look at APM showed the thread count was fairly normal, so the next step was to check the GC metrics.

    Embarrassingly, memory and CPU were both fine and there was no full GC at all.

  3. In that case there was nothing more to say about the JVM. Instead we turned to thread analysis: dump the thread stacks while the pressure test is reporting errors and see what the threads are actually doing (the relevant commands are collected in a sketch at the end of this phase).

    jstack -l pid > benchmark-2022-01-23-01.stack

    I can recommend a decent thread-stack analysis tool (PerfMa): thread.console.heapdump.cn/detail/3281…

    You can see that there are almost no running HTTP threads; at this point all the HTTP threads are idle, waiting for new requests.

    Conclusion: the servers really are idle, which proves that the traffic is simply not reaching the back-end instances.

  4. At this point the only suspects left in front of the application were the load generators (the "broilers", i.e. the JMeter cluster) and Nginx (the first and second hops).

    Let's look at the traffic charts.

Comparing the pressure-platform traffic chart with the Nginx traffic chart, you can see that Nginx-side QPS drops noticeably at exactly the moments the pressure platform reports errors, which proves the traffic really is not getting through. So the culprit is now fairly obvious: it has to be the load generators.
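Before moving on to the load generators, here is a consolidated sketch of the server-side checks from steps 2 and 3 above. It is only illustrative: the PID lookup, file names and the "http-nio" thread-name prefix are assumptions (Tomcat-style naming), not details taken from our actual deployment.

    # Find the application PID (hypothetical process name, adjust to your app).
    PID=$(pgrep -f my-app.jar | head -n 1)

    # Step 2: rule out GC pressure. Sample GC utilisation every second, 30 samples;
    # a full-GC storm would show up as fast-growing FGC/FGCT columns.
    jstat -gcutil "$PID" 1000 30

    # Step 3: dump the thread stacks while the pressure test is reporting errors.
    jstack -l "$PID" > benchmark-$(date +%F-%H%M%S).stack

    # Quick look at what the HTTP worker threads are doing. If almost all of them
    # are WAITING/TIMED_WAITING, the server is idle and traffic is not arriving.
    grep -A 1 '"http-nio' benchmark-*.stack | grep 'java.lang.Thread.State' | sort | uniq -c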

  1. However, apart from having used JMeter, I did not know much about the load generators (the JMeter cluster), and at that point there was no better idea. The advice Master Zhang repeated most often when I described the problem to him was a single phrase: "capture packets, capture packets." And it really did the trick: the capture taken on a load generator, shown below, exposed the problem.

    A large number of "TCP Port numbers reused" warnings.

  2. There are a huge number of connections in the TIME_WAIT state and port usage is very high. Clearly, when request latency is this low, such heavy port consumption must be a port-reuse problem.

    netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c   (counts the TCP connections in each state)

    The ephemeral port range runs from 32768 to 60999, a little over 28,000 ports (check it with cat /proc/sys/net/ipv4/ip_local_port_range). This is exactly why the requests keep failing during the pressure test: Wireshark reports "TCP Port numbers reused" because there are simply not enough ports (these checks are collected in a sketch after this list).

    Next, let's analyse why there were not enough TCP ports. In fact no port was being reused at all; the previous connections were just not being released in time. To understand that, we need to be clear about when a TCP connection actually releases its port, which is what the TIME_WAIT state is about.

    The diagrams below are borrowed from Master Zhang's booklet.

    From the figure it is easy to see that the TIME_WAIT state occurs on the side that actively closes the connection: after the client receives the second FIN sent by the server, it enters TIME_WAIT and only closes for good after 2 MSL. On our load generators that wait works out to about 60 seconds (the value shown by cat /proc/sys/net/ipv4/tcp_fin_timeout).

    • What is MSL?
      • MSL stands for Maximum Segment Lifetime
      • MSL is the longest time any IP packet may live in the network; after that the packet is discarded. The value can be configured on the host
      • MSL is usually 30 seconds or 2 minutes
    • Why TIME_WAIT?
      • To close the TCP connection gracefully, i.e. to make sure the passively closing side receives the ACK for the FIN it sent.
      • To handle delayed duplicate segments, mainly to prevent segments from an earlier connection on the same 4-tuple from interfering with a new connection.
  3. OK, so to fix the port shortage, a very simple approach is to widen the usable port range. But at the traffic levels of our pressure test, however many ports you add, they are all exhausted moments later, so this obviously cannot solve the problem (see the tuning sketch after this list).

    tcp_tw_reuse allows connections in the TIME_WAIT state to be reused for new outgoing connections, but there is still a certain waiting time involved, so it does not improve things all at once either. tcp_tw_reuse relies on the TCP timestamps option: two four-byte timestamp fields, the first carrying the sender's current clock timestamp and the second the most recent timestamp received from the remote host. Comparing these timestamps is what decides whether a connection can be reused and keeps segment ordering straight.

  4. Thinking about it carefully, the root cause had to be that JMeter was not reusing connections when it dispatched requests. I contacted a colleague on the test-platform team, sent him the packet capture of a simple request together with the TCP connection-state chart, and he immediately realised that JMeter was sending requests without connection reuse. After the adjustment below, everything was fine.
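For reference, here is a sketch collecting the load-generator-side diagnostics from the steps above, plus the back-of-the-envelope arithmetic that explains the periodic failures. The interface name and SLB address are placeholders, not values from our environment.

    # Placeholder address for the SLB/Nginx layer in front of the application.
    SLB_IP=203.0.113.10

    # 1. Capture the pressure-test traffic for inspection in Wireshark, where the
    #    failures show up as "TCP Port numbers reused" warnings.
    tcpdump -i eth0 host "$SLB_IP" -w benchmark.pcap

    # 2. Count connections per TCP state; during the error window almost everything
    #    is stuck in TIME_WAIT.
    netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c

    # 3. Check the ephemeral port range and the FIN/TIME_WAIT related timer.
    cat /proc/sys/net/ipv4/ip_local_port_range   # 32768 60999 -> a bit over 28k ports
    cat /proc/sys/net/ipv4/tcp_fin_timeout       # 60

    # 4. Rough estimate: with ~28k ports each held for ~60 s and a brand-new
    #    connection per request to a single destination, one generator can only open
    #    about 470 new connections per second before the port range is exhausted,
    #    which is consistent with the periodic error bursts seen in the results.
    echo $(( (60999 - 32768) / 60 ))             # ≈ 470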
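And this is the kind of kernel-side mitigation discussed in point 3: widening the ephemeral port range and enabling tcp_tw_reuse. As noted above, at our request rates it only delays the port exhaustion rather than fixing it; the sketch is included only for completeness and should be validated before applying it to real load generators.

    # Widen the ephemeral port range (buys a few thousand extra ports at best).
    sysctl -w net.ipv4.ip_local_port_range="10000 65000"

    # Let outgoing connections reuse sockets in TIME_WAIT. The reuse decision relies
    # on the TCP timestamps option described above, so timestamps must stay enabled.
    sysctl -w net.ipv4.tcp_timestamps=1
    sysctl -w net.ipv4.tcp_tw_reuse=1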

3. Problem solving:

The JMeter HTTP sampler has two client implementations, Java and HttpClient4; JMeter's default is HttpClient4.

Java: connections are reused during the pressure test when this implementation is selected (the HTTP calls are pooled, just as they would be in application code).

HttpClient4: a new connection is created for every pressure-test request. (Before JMeter 5.0 connection reuse was turned off by default; on 5.0, a new connection is still created for every request.)

If the throughput with HttpClient4 is too low, you also need to consider whether network bandwidth is the limiting factor.

The Java implementation is suited to pure pressure testing, while HttpClient4 is better for simulating real user behaviour.
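For what it's worth, the switch does not have to be made sampler by sampler in the test plan; it can also be set as a default through JMeter properties on the load generators. A minimal sketch, assuming a stock JMeter layout (the property names should be double-checked against your JMeter version):

    # Hypothetical sketch: make the Java implementation the default HTTP sampler on
    # each load generator via user.properties.
    printf '%s\n' 'jmeter.httpsampler=Java' >> "$JMETER_HOME/bin/user.properties"

    # Possible alternative on JMeter 5.0+ (also to be verified): keep HttpClient4 but
    # stop it from resetting connection state on every thread-group iteration.
    # printf '%s\n' 'httpclient.reset_state_on_thread_group_iteration=false' >> "$JMETER_HOME/bin/user.properties"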

Here is a snapshot after switching to the Java implementation. You can see the pressure-test curves are smooth with no errors, and there are no longer piles of TIME_WAIT connections on the load-generator instances.

4. To summarize

In fact the problem itself is very simple, and many experts or experienced readers may have spotted it at a glance. I still think it is worth sharing, because in a real environment the whole business system is complex: from the moment a user sends a request to the moment the server finally receives it, the traffic may pass through many intermediate components, and every one of them can become the bottleneck. If you keep a clear picture of the traffic path in your head, then even without much experience you can rule suspects out one by one and eventually find the problem.

Finally, I strongly recommend Master Zhang's Juejin booklet, Deep Understanding of the TCP Protocol: From Theory to Practice. The booklet really is excellent; having studied it beforehand is what made so many steps of this investigation feel easy.