To solve a JDK performance problem, the author carried out stack analysis, GC analysis, and Safepoint cause analysis, and finally found that the root cause was related to the JDK version in use. This write-up shares the experience with everyone working on Java development.

Author: Bi Jieshan

Source: Tencent Cloud middleware official account

The problem

Recently, I ran into the following problem at work: a customer launched a new Elasticsearch application, but after running for a while it became very slow, and queries even timed out. After a restart, the service recovered, but the problem recurred every 3 to 4 hours.

To solve this problem, a colleague had already done a quick analysis and found a large number of Merge threads. So how should I proceed? Based on my previous troubleshooting experience, Thread Dump analysis usually points in the right direction, after which we can work out why the problem keeps recurring. Following that line of thinking, the analysis should not be complicated. But then the plot took a turn.

Confusing stack logs

Due to network isolation, the Thread Dump logs could only be collected by the customer. I stressed the technique of capturing a Thread Dump every few seconds on each node and writing each snapshot to a separate file. The cluster involves three nodes, which we will refer to as 39, 158, and 211 for short. After the problem recurred, we received the first batch of Thread Dump files:
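For reference, a minimal sketch of that collection loop (my own reconstruction, not the exact commands the customer ran; it assumes jstack from the same JDK is on the PATH):

import java.io.File;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: take a jstack snapshot of the Elasticsearch process every
// few seconds and write each snapshot to its own file.
public class ThreadDumpCollector {
    public static void main(String[] args) throws Exception {
        String pid = args[0];                  // PID of the Elasticsearch process on this node
        int count = Integer.parseInt(args[1]); // how many snapshots to take
        for (int i = 0; i < count; i++) {
            File out = new File("threaddump-" + pid + "-" + i + ".txt");
            new ProcessBuilder("jstack", pid)
                    .redirectErrorStream(true)      // merge stderr into the same output
                    .redirectOutput(out)
                    .start()
                    .waitFor();
            TimeUnit.SECONDS.sleep(5);              // "every few seconds"
        }
    }
}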

From the file sizes alone, it is easy to see that node 39 is most likely the problem node: its Thread Dump logs are significantly larger. Slow or stuck queries usually show up as a large number of busy worker threads, that is, a significantly increased number of active threads. And with Elasticsearch's default query behavior, if one node has a problem, the whole query is affected.

Let's compare the overall thread picture in one Thread Dump file chosen from each of the three nodes:

Statistics are classified by thread pool:

We can see that node 39 has significantly more Lucene Merge threads, while the other two nodes have no Merge threads at all.

Further analysis of node 39's Thread Dump file turned up the following anomalies:

  1. There are 77 Lucene Merge threads; the call stack of one of them looks like this:

  2. There are 8 threads competing for the lock on an ExpiringCache:

  3. There are 8 threads doing HashMap#hash computations:

Phenomenon 1 shows 77 Merge threads, but it is not clear whether these Merge tasks were triggered at the same time or simply piled up because the system processed them too slowly. Either way, it looked like an important lead. Given that this is a newly launched application, researching the environment is just as important as researching how it is used:

  • The cluster has 3 nodes, with 500+ indices at present. Each node has about 70 active shards.

  • Indices are created per tenant, and each tenant creates 3 new indices every day. At this initial stage the write throughput is low: each index flushes Segments at a rate of somewhere between a few KB and a few MB per minute.

I began to suspect this particular usage pattern: the cluster had many actively written indices, but the volume written per minute was small, in the KB-to-MB range. This means a Flush is likely triggered periodically rather than when a preset threshold is exceeded, and this write pattern produces a large number of small files. We sampled newly generated Segment files from several indices, and indeed every file was very small.

For the second phenomenon, I carefully read the source code of java.io.UnixFileSystem:

  • In UnixFileSystem, the path of a newly created file needs to be canonicalized according to the operating system's conventions.

  • The canonicalized result is cached in an ExpiringCache object.

Multiple threads racing to call ExpiringCache#put reflects a high rate of change in the file list, which in turn points to very frequent Flush and Merge operations.
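To convince myself of where the contention comes from, I put together a tiny demo. This is only my own sketch (assuming JDK 8 behavior, where java.io.UnixFileSystem keeps canonicalization results in a synchronized ExpiringCache), not code from Elasticsearch or the JDK:

import java.io.File;
import java.io.IOException;

// Hypothetical demo: many threads canonicalizing constantly changing paths all
// funnel through UnixFileSystem#canonicalize, which reads and writes its cache
// under a lock - mirroring the contention visible in the stack traces.
public class CanonicalizeContentionDemo {
    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        Runnable task = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                try {
                    // Each distinct path misses the cache and triggers an ExpiringCache put.
                    new File(dir, "segment_" + Thread.currentThread().getId() + "_" + i)
                            .getCanonicalPath();
                } catch (IOException ignored) {
                }
            }
        };
        for (int t = 0; t < 8; t++) {
            new Thread(task, "canon-" + t).start();
        }
    }
}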

This deepened my suspicion about the usage pattern: "drizzle-style" writes with passively (periodically) triggered Flushes. If the Flush cycles coincide, many shards Flush at the same time, and the probability of multiple shards Merging at the same time goes up as well.

So I started simulating this usage pattern in my test environment, creating a similar number of shards and controlling the write frequency (a sketch of the simulation is shown below). The plan was to let the test program run for at least a day to see whether the problem could be reproduced. While the program was running, I continued to dig into the Thread Dump logs.
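The simulation itself was nothing fancy. A rough sketch of the idea, assuming the Elasticsearch low-level REST client (6.x or later); the index names, document contents, and write volume are made up for illustration:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

import java.util.concurrent.TimeUnit;

// Hypothetical "drizzle write" simulator: many small, actively written indices,
// each receiving only a few KB of documents per minute.
public class DrizzleWriteSimulator {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            String[] indices = new String[30];
            for (int i = 0; i < indices.length; i++) {
                indices[i] = "tenant-" + i + "-log";    // assumed naming scheme
            }
            while (true) {
                for (String index : indices) {
                    Request req = new Request("POST", "/" + index + "/_doc");
                    req.setJsonEntity("{\"ts\":" + System.currentTimeMillis() + ",\"msg\":\"drizzle\"}");
                    client.performRequest(req);
                }
                TimeUnit.MINUTES.sleep(1);              // keep the per-minute volume tiny
            }
        }
    }
}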

As for the third phenomenon, merely computing a hash appears to be extremely slow. Putting the three phenomena together, all of the call sites are doing CPU-bound computation; by rights, the CPU should be extremely busy.

When the problem recurred on site, the customer helped collect CPU usage and load information, and the result showed that CPU resources were idle. Before that, my colleague had also checked IO, which was likewise mostly idle. This ruled out a system-resource bottleneck. By this time we had also noticed that the node on which the problem reproduced was random each time and had nothing to do with any particular machine.

A day later, the problem had still not reproduced in my local test environment. So although the usage pattern is not elegant, it did not seem to be the cause.

Weird STW interruptions

Obtaining a Thread Dump with the jstack command requires the JVM process to enter a Safepoint, which means the whole process is suspended first. The Thread Dump we get is therefore a snapshot of each thread's state at the moment the process is suspended.

All the busy threads happen to be doing CPU-bound work, yet the CPU is not busy. This prompted a closer look at the GC logs.

GC logging was not enabled for the onsite application. Since the problem had not recurred at that point, looking at GC counts and GC times with the jstat tool would not tell us much. I asked the field staff to manually add the following parameters to jvm.options to enable GC logging (the "8:" prefix is Elasticsearch jvm.options syntax, meaning the option applies only when running on JDK 8):

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=32m

Adding PrintGCApplicationStoppedTime records every STW (Stop-The-World) pause of the JVM process in the GC log. Typically such STW pauses are caused by GC, and they may also be related to biased-lock revocation.

Right after the restart, the field staff sent over the tail of the GC log to confirm that the configuration had taken effect. Strangely, the freshly restarted process was already printing STW log entries:

A brief explanation of these STW log lines ("... Total time for which application threads were stopped ...") is in order:

The JVM sometimes needs to perform global operations, such as GC or biased-lock revocation, that require all running threads to be suspended. This relies on the JVM's Safepoint mechanism, which works like putting a red light on a busy road. Each time the JVM enters the STW (Stop-The-World) phase, it prints a line like this:

2020-09-10T13:59:43.210+0800: 73032.559: Total time for which application threads were stopped: 0.0002853 seconds, Stopping threads took: 0.0000217 seconds

This line tells us that the STW phase lasted 0.0002853 seconds, of which stopping all the threads took 0.0000217 seconds; the former includes the latter. In general, stopping the threads takes very little time; if it takes too long, that is usually related to implementation details of the running code, which I won't go into here.

Back to the problem: with so many STW log lines appearing right away, it is easy to suspect biased-lock revocation. We waited until the problem recurred and obtained complete GC logs for the three nodes, and found that both YGC and FGC were triggered very infrequently, which ruled out GC itself.

The sheer number of STW log lines made me realize how abnormal this was. Some colleagues raised a doubt: isn't each interruption very short? So I wrote a simple tool to compute the STW interruption count and total interruption time per minute (a sketch of it follows below):
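The tool itself is trivial. Roughly, it looks like the following sketch (my reconstruction, assuming GC log lines in the PrintGCApplicationStoppedTime format shown above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Aggregate STW pauses per minute: count of pauses and total stopped time.
public class StwPerMinute {
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}).*Total time for which application threads were stopped: ([0-9.]+) seconds");

    public static void main(String[] args) throws IOException {
        Map<String, double[]> perMinute = new TreeMap<>();   // minute -> {count, total seconds}
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                double[] agg = perMinute.computeIfAbsent(m.group(1), k -> new double[2]);
                agg[0] += 1;
                agg[1] += Double.parseDouble(m.group(2));
            }
        }
        perMinute.forEach((minute, agg) ->
                System.out.printf("%s  pauses=%d  total=%.3fs%n", minute, (long) agg[0], agg[1]));
    }
}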

Statistical results show that:

  • Even under normal conditions, there are about 5 seconds of interruptions per minute.

  • The interruption frequency rises sharply between 11:29 and 11:30, which is exactly when the problem starts to appear. The total interruption time can reach as much as 20 to 30 seconds per minute.

It is as if a one-kilometer road that normally has no traffic lights suddenly has dozens of them. These interruptions explain very well the phenomenon that "all the threads are doing CPU-bound work, yet CPU resources are idle."

Safepoint investigation

Enable the JVM log by adding the following parameters to jvm.options:

-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=10
-XX:+UnlockDiagnosticVMOptions
-XX:-DisplayVMOutput
-XX:+LogVMOutput
-XX:LogFile=<vm_log_path>

While waiting for the problem to recur again, I counted the read and write request frequency per minute from the ES run log:

The read and write request volumes are relatively flat, which rules out a sudden surge in user requests.

When I got the JVM log, I saw a large number of Safepoints of type ForceSafepoint:

Other types of Safepoint, by contrast, look like this:

This puzzled me. I searched the Internet for what triggers a ForceSafepoint and found nothing.

Looking through the HotSpot source code, I found at least five related scenarios:

As you can see, ForceSafepoint is a "hodgepodge", much like the "Others" category in a statistics chart. From this point on, I added the JDK to my list of "key suspects."

I continued analyzing the JVM logs. Each Safepoint log entry records the total number of threads at that moment:

As a recap: when the problem occurred, the total number of threads differed significantly across the three nodes, and the total number of threads kept growing after the problem appeared, especially the Lucene Merge threads.

So between "many Lucene Merge tasks" and "a sharp increase in ForceSafepoint/STW interruptions", which is the "cause" and which is the "effect"?

From the JVM logs I counted the number of ForceSafepoints per minute as well as the change in the total number of threads (a sketch of the counting logic appears below), and plotted the two curves together in the following chart:

From the chart, it appears that the ForceSafepoints start to increase first, and only afterwards do the threads gradually accumulate. In other words, the STW interruptions become more frequent first, and then the Merge task threads begin to pile up, just as a road that suddenly gains dozens of traffic lights becomes congested.
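The ForceSafepoint counting mentioned above followed the same pattern. A sketch of my reconstruction, assuming each record in the safepoint statistics log starts with the VM uptime in seconds followed by the vm-operation name:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Count ForceSafepoint records per minute of VM uptime.
public class ForceSafepointPerMinute {
    private static final Pattern LINE =
            Pattern.compile("^\\s*([0-9]+)(?:\\.[0-9]+)?:?\\s+ForceSafepoint");

    public static void main(String[] args) throws IOException {
        Map<Long, Integer> perMinute = new TreeMap<>();      // uptime minute -> count
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                long minute = Long.parseLong(m.group(1)) / 60;
                perMinute.merge(minute, 1, Integer::sum);
            }
        }
        perMinute.forEach((minute, count) ->
                System.out.printf("uptime minute %d: %d ForceSafepoint(s)%n", minute, count));
    }
}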

At this point, I started consulting a colleague on the Kona JDK team. I shared the GC logs and Thread Dump logs with him and told him all my findings so far. The biggest suspect was these abnormal ForceSafepoints, and I needed to know what causes a ForceSafepoint.

After a while, he replied: this is an ARM build of the JDK, and due to the lack of ARM build resources, he could not provide a DEBUG build of the ARM JDK for the time being.

It suddenly dawned on me that I had fallen into the trap of preconceptions, even though I had recognized from the very beginning that the environment needed to be investigated.

No wonder the problem could never be reproduced in my local x86 environment, and no wonder an online search for ForceSafepoint turned up nothing.

When communicating with the customer on a conference call, I learned:

  1. No similar problem had been seen with similar services running in another x86 environment.
  2. In this ARM environment there is another Elasticsearch cluster with a lower request volume, and no such problem had been observed there.
  3. The ARM JDK used in this environment was downloaded from the Internet, and its provenance is unknown.

As for the other Elasticsearch cluster in this environment mentioned in point 2, I was more interested in whether its GC logs show a similar phenomenon. As for why it does not run into the same trouble, that is easy to understand: such problems are usually systemic, produced by a combination of workload characteristics and the environment.

At this point, the cause of the problem pointed clearly to the ARM environment and the JDK version.

Later, Zang Lin from the TSF (microservice platform) team stepped in and provided a Kona JDK build with debugging information. Although this build was much slower than the normal version, we found that after switching to it the problem's recurrence interval became noticeably longer.

After receiving the latest JVM logs, Zang Lin's analysis showed that these ForceSafepoints were related to Inline Cache Buffers. Of course, this could be a problem common to all JDK versions in ARM environments, or it could simply be a problem with the JDK build that had been downloaded earlier.

Next, we replaced the JDK in the environment with the officially released Kona JDK, and the problem has never recurred since. In other words, switching to Kona JDK solved the problem.

I compared the number and total duration of STW interruptions before and after switching to Kona JDK, and the difference spans orders of magnitude:

  • Original JDK version: 5,000 to 18,000 STW interruptions per minute, with the total interruption time per minute reaching 10 to 30 seconds.
  • Kona JDK: single-digit STW interruptions per minute, with a total interruption time of 100 to 200 ms per minute.

As you can see, Kona JDK delivers an orders-of-magnitude improvement on these metrics over the original problematic JDK version.

Summary and review

Let's walk through the analysis of the whole problem once more:

  1. Stack analysis found a large number of threads doing CPU-bound work while CPU resources sat idle. The many Merge threads were somewhat confusing, but they turned out to be the "effect" of the problem, not its "cause".

  2. GC log analysis found that both GC frequency and GC time were very low, yet the GC log contained a large number of STW-related entries, so the associated Safepoint types needed to be confirmed.

  3. VM/Safepoint log analysis identified the Safepoint type as ForceSafepoint. By comparing the phenomena across different environments and clusters, I began to suspect a problem with the JDK version.

  4. After switching to Kona JDK, the ForceSafepoint frequency dropped from more than 5,000 per minute to single digits.

The analysis of this problem taught me a lesson: in the initial stage, take investigating the cluster environment seriously. Of course, the biggest lesson is: do not casually use a JDK of unknown origin downloaded from the Internet.

Portal

  • Tencent Service Framework (TSF) is a PaaS platform focused on applications and microservices. It provides one-stop application lifecycle management and data-driven operations support, along with multi-dimensional application and service monitoring data to facilitate service performance optimization. It supports two commercial microservice architectures: Spring Cloud (click to learn about the Spring Cloud development framework) and Service Mesh (click to learn about the TSF Mesh development framework).
  • Share your valuable feedback and suggestions: Tencent Cloud middleware Q&A community