Hello, I’m Kun Ge

We often say that interviews ask you to "build rockets", and many people are skeptical about this. I believe that after reading this troubleshooting story you will understand what those interview "rockets" are for: the skills involved here span indexes, GC, containers, network packet capture, link tracing and other fundamentals. Without these "rockets", you often have no idea where to start when you run into this kind of problem. It also answers a question many friends have asked: why you need to learn networking, MySQL and other knowledge systems beyond Java itself — they make you a better engineer.

I. Problem phenomenon

According to feedback from the product team, the Dubbo interface of the member service occasionally times out; the timeouts occur at irregular times, almost every day. The timeout error reported by the product service is shown in the figure below:

The interface that times out takes only 4-5 milliseconds on average. Checking the Dubbo configuration, the timeout for invoking the member interface is one second, and the failure policy is failfast.
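
For reference, a consumer-side Dubbo reference configured with these settings would look roughly like the sketch below (the interface and field names are made up for illustration; the article does not show the actual configuration):

import com.alibaba.dubbo.config.annotation.Reference;

public class ProductQueryService {

    // Hypothetical Dubbo interface exposed by the member application
    interface MemberService {
        Object getMember(long memberId);
    }

    // 1-second timeout and fail-fast cluster strategy, matching the settings described above
    @Reference(timeout = 1000, cluster = "failfast")
    private MemberService memberService;
}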

The member service is deployed on a total of 8 machines, all Java applications on JDK 8, all running in Docker containers.

II. Problem analysis

At first I thought it was just a simple interface timeout and started to troubleshoot. Looking at the interface logic first: it is just a single database query whose result is wrapped into the return value. The SQL goes through an index, so it should return quickly.

However, according to the elapsed-time logs printed by ElapsedFilter, a Dubbo interceptor (ElapsedFilter is one of the extension points of the Dubbo SPI mechanism; it prints the elapsed time of any interface call that takes longer than 300 milliseconds), some calls to this interface do indeed take a very long time.
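
For readers unfamiliar with it, a filter with this behavior, sketched against the Alibaba Dubbo Filter SPI (the real ElapsedFilter implementation may differ in details), looks roughly like this:

import com.alibaba.dubbo.rpc.Filter;
import com.alibaba.dubbo.rpc.Invocation;
import com.alibaba.dubbo.rpc.Invoker;
import com.alibaba.dubbo.rpc.Result;
import com.alibaba.dubbo.rpc.RpcException;

public class ElapsedLoggingFilter implements Filter {

    private static final long THRESHOLD_MS = 300; // only log calls slower than 300 ms

    @Override
    public Result invoke(Invoker<?> invoker, Invocation invocation) throws RpcException {
        long start = System.currentTimeMillis();
        try {
            return invoker.invoke(invocation);
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > THRESHOLD_MS) {
                System.out.println("slow invocation: " + invoker.getInterface().getName()
                        + "." + invocation.getMethodName() + " took " + elapsed + " ms");
            }
        }
    }
}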

Query the slow SQL platform of the database, and no slow SQL is found.

1. Database timeout?

It was suspected that obtaining the database connection is what takes time. A database query in the code consists of two steps: obtaining a database connection and executing the SQL query. The slow SQL platform only monitors SQL execution time, and since no slow SQL was found there, establishing the connection might be what takes so long.

The Druid connection pool is used in the application, and connections to the database are established on first use rather than for every SQL query. Druid is configured with initial-size = 5, max-active = 50 and min-idle = 1, so under low traffic most connections may have been released and a query may first have to establish a new connection, which takes a long time.

Therefore, min-idle was changed to 10 and a Druid connection pool usage log was added, printing the number of active connections in the pool every three seconds.
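
A minimal sketch of the adjusted pool settings and the three-second usage log, written against the Druid API (the JDBC URL and logging style are placeholders, not the application's real configuration):

import com.alibaba.druid.pool.DruidDataSource;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DruidPoolMonitor {

    public static DruidDataSource buildDataSource() {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://db-host:3306/member"); // placeholder JDBC URL
        ds.setInitialSize(5);
        ds.setMaxActive(50);
        ds.setMinIdle(10); // raised from 1 to 10 as described above
        return ds;
    }

    public static void startUsageLog(DruidDataSource ds) {
        // Print the number of active and pooled connections every three seconds
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                () -> System.out.println("druid active=" + ds.getActiveCount()
                        + ", pooling=" + ds.getPoolingCount()),
                0, 3, TimeUnit.SECONDS);
    }
}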

After the change went online, timeouts still occurred. At the moment of a timeout, the connection pool log showed that the number of connections was normal and did not fluctuate.

2. STW?

Meanwhile, the time recorded by ElapsedFilter could not be accounted for: logs printed at the entry and exit of the business method and before and after the database operation showed that the entire business operation was extremely short, finishing within milliseconds.

However, when timeouts were reported on the calling side, it was not a single interface timing out but several interfaces at the same time. In addition, several machines on the calling side reported timeouts simultaneously (verified by filtering hostname on the ELK platform), while on the provider side the timeouts all came from the same machine.

This made it clear that at some point in time something global happened on one of the machines providing the service, causing almost all of its interfaces to time out.

Continuing to observe the logs, one timed-out request was traced through all the logs from the calling side to the server side (ideally a distributed trace ID would do this, but our context ID can only trace a request within a single application and is useless across applications, so we had to improvise; here the request was located in ELK by caller IP + time + interface + parameters). The time at which the server received the request was more than a second later than the time at which the caller initiated it. More precisely, Dubbo prints a startTime (call it T1) and an endTime in the caller's timeout log, and a log line added at the entry of the server-side interface method gives the time the request actually arrived (call it T2). Comparing the two, T2 lags T1 by more than a second, sometimes two or three seconds. Network time on the intranet is almost negligible, so this delay is highly abnormal. It is easy to think of STW (Stop The World) in a Java application: garbage collection, in particular, can pause the whole application briefly so that it cannot respond to requests.

The first thing to do is to turn on the garbage collection log. The GC log prints when each collection occurred, the type of collection, and how long it took. Add the JVM startup parameters:

-XX:+PrintGCDetails -Xloggc:${APPLICATION_LOG_DIR}/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime

To make the investigation easier, the pause time of every application stop is also printed: besides garbage collection, many other operations can pause all threads, such as revoking biased locks. Since the application runs in Docker and the GC log is wiped on every release, a small reporting program was written to periodically ship the GC log to the ELK platform so it can be observed there.
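
A minimal sketch of such a reporter (the GC log path and the "ship to ELK" step are placeholders; the real program's transport is not described in this article):

import java.io.RandomAccessFile;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GcLogReporter {

    public static void main(String[] args) throws Exception {
        // Tail the GC log written by -Xloggc and periodically forward new lines
        RandomAccessFile gcLog = new RandomAccessFile("/opt/apps/logs/gc.log", "r");
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                String line;
                while ((line = gcLog.readLine()) != null) {
                    // Placeholder: the real reporter would send this line to the ELK platform
                    System.out.println("report to ELK: " + line);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 10, TimeUnit.SECONDS);
    }
}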

The following is what happens when the interface times out:

It can be seen that this garbage collection took close to two seconds and the application pause was also close to two seconds. Moreover, this collection was done by the ParNew collector, i.e. it happened in the young generation. So it is almost certain that the garbage collection pause made the application unavailable, which in turn caused the interfaces to time out.

2.1 Safepoints and FinalReference

Next, investigate why young-generation garbage collection takes so long. First, the JVM parameters of the application are:

-Xmx5g -Xms5g -Xmn1g -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -Xss256k -XX:SurvivorRatio=8 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70

Looking at the Marvin platform (our monitoring platform), heap utilization was consistently low and the GC logs did not even show a single CMS GC, so an undersized heap could be ruled out.

The ParNew collector is relatively simple, not as complex as the CMS used in the old generation. ParNew uses a mark-copy algorithm: the young generation is divided into Eden and Survivor spaces, and the Survivor space is further split into S0 and S1. New objects are first allocated in Eden and S0; when these fill up and allocation fails, live objects are copied to S1 and everything else is cleared to make room. There are three potentially time-consuming steps: finding and marking live objects, reclaiming dead objects, and copying live objects (this touches on a lot more garbage collection background that will not be expanded here).
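
As a rough sanity check on the space sizes involved, derived from the -Xmn1g and -XX:SurvivorRatio=8 settings above:

young generation = 1 GB, SurvivorRatio = 8, so Eden : S0 : S1 = 8 : 1 : 1
Eden ≈ 1024 MB × 8/10 ≈ 819 MB, S0 = S1 ≈ 1024 MB / 10 ≈ 102 MB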

Until the live objects have been found and marked, the JVM has to suspend all threads, and that does not happen instantly: it must wait for every thread to reach a safepoint. Some threads may not be able to respond to the safepoint request right away because they are in long loops or other operations. The JVM has parameters to print how long each safepoint operation takes; add the parameters:

-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+UnlockDiagnosticVMOptions -XX:-DisplayVMOutput -XX:+LogVMOutput -XX:LogFile=/opt/apps/logs/safepoint.log

In the safepoint log, the spin+block time is the time it takes threads to run to the safepoint and respond. The log shows that this time is very short and most of the time is spent in the vmop operation, i.e. the work done after the safepoint has been reached, which rules out threads taking too long to enter the safepoint.

Java references can be divided into several Reference types according to their strength. FinalReference is special: an object with a finalize method gets that method called before it is reclaimed, giving it one last chance to "struggle". Some frameworks do resource cleanup in finalize, such as closing connections, and this lengthens garbage collection. Therefore the JVM parameter -XX:+PrintReferenceGC was added to print the time spent processing each kind of Reference.
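
As an illustration (purely made up, not code from the application), a class written along these lines is exactly what adds work to reference processing:

public class LeakyConnection {

    // Overriding finalize() means every instance is tracked by a FinalReference
    // and must pass through the Finalizer thread before its memory can actually
    // be reclaimed, which adds to garbage collection time.
    @Override
    protected void finalize() throws Throwable {
        try {
            close(); // last-chance cleanup, e.g. closing an underlying connection
        } finally {
            super.finalize();
        }
    }

    private void close() {
        // release the underlying resource here
    }
}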

As you can see, the reclaim time of FinalReference is also very short.

Finally, copying time is mainly determined by the amount of surviving objects. According to the GC log, more than 90% of the young generation is reclaimed on each collection and very few objects survive, so copying can pretty much be ruled out as well. The investigation was stuck. The ParNew collector is simple, the JVM does not log much about it, and from here troubleshooting relied more on experience.

2.2 Garbage collection threads

The GC log showed two collections that reclaimed almost the same amount of memory yet differed in duration by a factor of ten. The suggestion (see the Stack Overflow post referenced at the end of this article) was therefore to check the CPU state of the system and whether it is a virtualized environment, since there may be fierce contention for CPU resources.

In fact, CPU resource contention had been suspected earlier, but Marvin monitoring showed that CPU usage did not fluctuate much during garbage collection. The number of GC threads in the JVM is determined by the number of CPU cores: with eight or fewer cores, the number of GC threads equals the core count. Our application runs in a Docker container with six cores allocated, yet the number of young-generation GC threads reached 53, which is clearly abnormal. In the Docker environment the JVM reads CPU information from the host machine (the /proc directory is not isolated inside the container, and CPU information lives there), so instead of the six cores assigned to the container it sees the host's 80 cores. As a result, far more garbage collection threads are created than the container's CPU quota can support, the young-generation GC threads contend fiercely with each other, and collection time increases.
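
This 53 matches HotSpot's default sizing rule: with more than 8 cores, the number of parallel GC threads is roughly 8 + (ncpus - 8) * 5/8, so for the host's 80 cores:

ParallelGCThreads = 8 + (80 - 8) * 5/8 = 8 + 45 = 53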

Limit the number of JVM garbage collection threads to the container's CPU core count by adding the JVM parameters:

-XX:ParallelGCThreads=6 -XX:ConcGCThreads=6

The results were immediate! Young-generation garbage collection time basically dropped below 50 milliseconds, successfully solving the problem of long garbage collection pauses.

III. Problem recurrence

I thought the timeout problem was now completely solved, but alarms about interface timeouts kept coming in. The symptom was the same: multiple interfaces timing out at the same time. Yet the GC log no longer showed any pause longer than a second, or even longer than a few hundred milliseconds.

1. Arthas and Dubbo

Going back to the code and re-analyzing, we started to break down the whole process by which the Dubbo interface receives a request, planning to trace requests with Alibaba's open-source Arthas and print the time spent at every step of the chain. On the server side a Dubbo request is received at the network layer first and then passed up to the application layer: the request is read by a Netty worker thread and then handed over to Dubbo's business thread pool (the application uses the "all" dispatch policy). Because multiple threads are involved, the trace can only be done in two segments, one for Netty receiving the request and one for Dubbo processing it. Arthas trace can run in the background and print only call chains that take longer than a given threshold (trace uses instrumentation + ASM, which causes a short application pause whose length depends on the number of classes instrumented). However, since request parameters are not printed, the timed-out request could not be pinpointed, and trace only shows the time spent in the first level of the call tree, so the trace approach was eventually abandoned.
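
For reference, a background trace of this kind is started in the Arthas console roughly like this (the class and method names are hypothetical; '#cost > 300' keeps only calls slower than 300 ms):

trace com.xxx.member.MemberServiceImpl getMember '#cost > 300'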

Inspired by the discovery about GC thread counts: since the running application does almost no disk I/O, just CPU work and memory reads, the lost time should still be caused by thread contention or locks. Looking at jstack again, the total number of threads was still high, at 900+, with a lot of ForkJoin threads and Netty worker threads. So we searched the code for places where the thread count is derived from the CPU core count (Runtime.getRuntime().availableProcessors()) and found that many frameworks still rely on this value, which cannot be fixed by setting JVM parameters.
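
A quick probe for what the JVM believes, which is ultimately the value those frameworks read (run it inside the container):

public class CpuProbe {
    public static void main(String[] args) {
        // Before the fix this printed the host's core count (80) even inside a
        // 6-core container, because the CPU info under /proc is not namespaced.
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());
    }
}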

So we consulted the container group about whether the correct number of CPU cores could be passed to applications in the container. The container group confirmed that, by upgrading the JDK version and enabling CPU sets, the number of CPUs a container may use can be restricted: starting with Java 8u131 and in Java 9, cpusets can be understood and used by JVMs to determine the number of available processors (see the Docker and JVM link at the end of this article). We are on JDK 8, and a minor version upgrade is low risk, so after testing, the application's 8 containers were upgraded to Java 8u221 and CPU sets were enabled to specify the number of CPUs each container can use. After this change went online, timeouts still occurred.
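
For illustration, restricting the CPU set at the container level uses the standard Docker flag (the core range and image name here are placeholders):

docker run --cpuset-cpus="0-5" member-app-image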

2. Network packet capture

Looking at the specific machines and times of the timeouts, the pattern had changed: it was no longer several machines timing out, but a single machine timing out rather densely during a period of time, and far more frequently than before. The earlier timeouts had been more random across machines and much less frequent. Strangely, after a release through Wukong, the original machines stopped timing out and other machines started, like a relay race. (The host a Docker application lands on may change with each release.)

Faced with such a strange phenomenon, only the host machine or the network could be suspected, so we started capturing network packets.

Because the timeouts occur on a single machine, you can log in to that machine and run tcpdump, capturing only packets whose host is the caller, and save the capture file. Then load the capture into Wireshark, where some abnormal packets can be found.
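
The capture was done with a command along these lines (the interface name, caller IP and output file are placeholders):

tcpdump -i eth0 host 10.0.0.1 -w member_timeout.pcap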

Note the TCP Out-of-Order, TCP Dup ACK and TCP Retransmission packets among them; their meanings are as follows:

TCP Out-of-Order: generally caused by network congestion; segments arrive out of order, are delayed too long, or are lost, and the data has to be reassembled because the segments may have taken different routes to your computer.

TCP Dup ACK XXX#X: duplicate acknowledgements indicating that a segment appears to be missing; XXX identifies the acknowledgement being duplicated and the number after # counts the duplicates seen.

TCP Retransmission: a retransmission of data triggered by a timeout, i.e. the sender did not receive an acknowledgement in time.
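
In Wireshark these three kinds of packets can be pulled up directly with a display filter such as:

tcp.analysis.retransmission || tcp.analysis.duplicate_ack || tcp.analysis.out_of_order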

Finally, using the contextid recorded in the ELK timeout logs, the TCP packets of the timed-out requests were filtered in Wireshark, and packet loss and retransmission were found. Several more timeout cases were captured and all of them showed packet loss, confirming that the timeouts were caused by packet loss on the network.

At this point the investigation basically comes to an end; the container group will continue to look into the occasional packet loss on the network.

Shoulders of giants

  • Stack Overflow question: aakn.cn/YbFsl
  • jstack visual analysis tool: fastthread.io
  • Docker and JVM for 2018: www.jdon.com/51214