Author | Mason MA
The official Arthas community is holding an essay contest with prizes to win. Click to submit.
Arthas is a great tool, but in this case it was used together with other tools such as MAT and YourKit (a paid product) to locate and solve the problem. Along the way I also talked with the development owners several times and gathered different pieces of information from each of them.
Install and start Arthas with one click
- Method 1: One-click remote diagnosis with Arthas via Cloud Toolkit
Cloud Toolkit is a free local IDE plug-in released by Alibaba Cloud that helps developers develop, test, diagnose, and deploy applications more efficiently. The plug-in lets you deploy a local application to any server or to the cloud (ECS, EDAS, ACK, ACR, Mini Program Cloud, and so on) with one click, and it ships with built-in tools such as Arthas diagnostics, Dubbo tools, a Terminal, file upload, Function Compute, and a MySQL executor. Besides the mainstream IntelliJ IDEA, it also supports Eclipse, PyCharm, Maven, and more.
It is recommended to install the Cloud Toolkit plug-in in IDEA to use Arthas: t.tb.cn/2A5CbHWveOX…
- Method 2: Download directly
Address: github.com/alibaba/art…
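For reference, the usual quick start looks roughly like this (a sketch based on the official docs, assuming a recent Arthas version):
# download the bootstrap jar and start it; it lists local Java processes and attaches to the one you pick
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar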
The symptoms
- A background application that builds indexes; it should not need that much memory, yet it now hits OOM every couple of days and has to be restarted;
- The application writes both full data and incremental data. When OOM occurs, most of the writes are full-data writes, and the full-data update is first triggered in the early morning.
- The business application uses the G1 collector (advanced, cutting-edge…).
Inner monologue: uh-oh, I'm still not that familiar with G1. What now? First find a way to send the big guys away so I can study it at my own pace: "I have something else to do right now, I'll look into it later."
Bold assumptions
Make assumptions based on the information at hand; it costs nothing to guess wrong.
- Does the full-data write exceed what the heap can hold, causing the OOM?
- Is there a static container that keeps accumulating objects until it OOMs every two days?
- If memory really is insufficient, which objects are taking up the most space?
- Are there any large objects?
Careful verification
Basic information
Process startup parameters
-Xms12g
-Xmx12g
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=70
-XX:MaxGCPauseMillis=200
-XX:G1HeapWastePercent=20
-XX:+PrintAdaptiveSizePolicy
-XX:+UseStringDeduplication
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintTenuringDistribution
-Xloggc:/home/search/fse2/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=2
-XX:GCLogFileSize=512M
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintNMTStatistics
-XX:NativeMemoryTracking=summary
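Since -XX:NativeMemoryTracking=summary is enabled, native memory can also be checked on the side; a quick, illustrative check (the PID placeholder is yours to fill in) is:
# print the JVM's own native memory accounting (heap, metaspace, threads, GC structures, ...)
jcmd <pid> VM.native_memory summary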
As you can see, the G1 collector is in use, which I did not meet often in my business-development days; after all, it is a low-latency garbage collector built for large heaps. It is worth reviewing G1's basic features, which mainly include the following:
- Region Partitioning mechanism
- SATB (Snapshot-At-The-Beginning), literally a snapshot of the objects that are alive at the start of the GC
- RSet (Remembered Set), a structure that assists the GC process by trading space for time
- Pause Prediction Model
From these attributes you can tell that G1 aims to reduce manual tuning in favor of self-tuning (at this point I already felt the OOM probably had little to do with the collector itself and was not simply a matter of memory volume; the root cause was more likely in the code logic), and to fit a world where hardware gets cheaper and heaps keep growing (later came ZGC and Shenandoah, equally magical collectors that manage very large heaps with very low pauses).
GC logging was already enabled, so let's look at the log; something unusual stands out:
#### The heap here is still only a little over 300 MB
[Eden: 316.0M(956.0M)->0.0B(1012.0M) Survivors: 48.0M->44.0M Heap: 359.2M(12.0G)->42.0M(12.0G)]
[Times: user=0.31 sys=0.01, real=0.05 secs]
2020-06-09T02:59:23.489+0800: 35.922: Total time for which application threads were stopped: ... seconds, Stopping threads took: 0.0000940 seconds
[GC concurrent-root-region-scan-start]
......
#### By this point a single collection reclaims over 11 GB
[Eden: 724.0M(1012.0M)->0.0B(540.0M) Survivors: 44.0M->72.0M Heap: 11.5G(12.0G)->355.6M(12.0G)]
[Times: user=1.51 sys=0.07, real=0.26 secs]
2020-06-09T03:12:36.329+0800: 828.762: Total time for which application threads were stopped: 0.2605902 seconds, Stopping threads took: 0.0000600 seconds
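A quick, non-authoritative way to scan a G1 log for the usual warning signs (evacuation failures, Full GCs, humongous allocations) is a simple grep over the rotated files, for example:
# look for G1 red flags in the log configured by -Xloggc above
grep -E "to-space exhausted|Full GC|humongous" /home/search/fse2/gc.log*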
First round of tuning
Increase -XX:G1ReservePercent to reserve more memory for to-space.
Reduce -XX:InitiatingHeapOccupancyPercent to start the marking cycle earlier.
The suspicion was that during GC the full-data write was pouring a large number of objects into memory, which led to the OOM. So the marking threshold was lowered from 70 to 55 to trigger concurrent marking earlier and to keep more heap space in reserve.
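As a sketch, the adjusted flags might look like this (55 comes from the text above; the G1ReservePercent value is only an illustrative guess, since the article does not state the exact number):
# trigger concurrent marking earlier and keep more to-space in reserve
-XX:InitiatingHeapOccupancyPercent=55
-XX:G1ReservePercent=15   # example value; the default is 10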
GC became more frequent, but that still did not explain the memory footprint, and whether the OOM would reappear in the small hours remained to be seen… so I kept looking for other problems.
Keep digging
With the doubt unanswered, I had to keep digging: grab the keyboard and get to work, a flurry of operations fierce as a tiger.
Arthas
Someone had already installed it (not sure who), so I used it to take a first look, roughly with the following commands:
dashboard
Check the basics: thread count, heap size, old-generation size, and usage ratios. When I first looked, both Eden and the old generation were quite full, around 70–80% (I forgot to take screenshots at the time). Memory usage was clearly high.
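For reference, dashboard can also be run for a fixed number of refreshes so it exits on its own; a typical invocation is something like:
# refresh every 5 seconds, print 3 snapshots, then exit
dashboard -i 5000 -n 3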
thread
Check whether the thread count is abnormal: it was normal, but I also looked for threads with high resource consumption, and some of them show up again in the other statistics below:
YJPAgent-Telemetry
ctrl-bidsearch-rank-shard1
YJPAgent-RequestListener
ctrl-bidsearch-rank-shard1
Two of the threads belong to the YourKit monitoring agent, which appears to have a noticeable performance impact.
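The thread commands used here were roughly the following (a sketch of common usages; the thread id placeholder is yours to fill in):
# top 5 busiest threads by CPU
thread -n 5
# stack of a single thread by id
thread <thread-id>
# threads currently blocked on locks
thread --state BLOCKED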
profiler
Collect flame-graph data for memory and CPU respectively:
# memory (allocation) profiling
profiler start --event alloc
# stop and dump the result
profiler stop
# CPU profiling (the default event)
profiler start
# stop and dump the result
profiler stop
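profiler stop writes the flame graph to a file under the application's working directory by default; you can also choose where it goes, for example (the file path here is just an example):
# stop profiling and write the flame graph to a specific file
profiler stop --file /tmp/cpu-flame-graph.html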
- CPU flame graph
- Memory flame graph
The CPU flame graph shows the G1 collector threads taking up about half of the resources, probably because sampling overlapped with GC; apart from that, a class called IncrementIndexService clearly stands out as consuming a lot of CPU and memory.
If the business code really is defective, it has to be somewhere in this flow. After talking with the team, I confirmed that the main entry point for the application's processing tasks is indeed in this flow, so that is where I focused.
The thread found via thread and the class found via profiler both ultimately point to the same business flow.
Time to test the earlier conjectures:
1. Does a large number of objects flood into memory when the full data is written?
Based on the data volumes calculated in the business code, the raw metadata is about 1.3 GB. Even a full write should not occupy 12 GB of heap. So the guess is that there is a defect in the full-write code path that wraps the 1.3 GB of raw data into objects far larger than 1.3 GB.
2. Is there a static container?
Yes, there is one, but after observing it with watch I did not find evidence of the container being only put into and never removed from. The command is roughly as follows:
watch com.xxx.classname method "{params,target}" -x 3
3. Are there any large objects?
By default G1 aims to split the heap into about 2048 regions. For a 12 GB heap that works out to roughly 6 MB per region (the actual region size is rounded to a power of two), and any object larger than half a region (about 3 MB here) is treated as a humongous object.
jmap -histo 30123
At least from this output, none of the large instances qualify as humongous objects by G1's definition.
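Since the real region size is rounded to a power of two, it is worth confirming what the JVM actually picked; for example (the PID reuses the one from the jmap command above):
# print the region size G1 actually chose for this running process
jinfo -flag G1HeapRegionSize 30123
# or check the default the JVM would pick for these flags
java -XX:+UseG1GC -Xmx12g -XX:+PrintFlagsFinal -version | grep G1HeapRegionSize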
MAT
Since nothing conclusive turned up, let's dump the heap. If you don't want to or cannot take a dump, jmap -histo can also give a view of memory usage; the advantage is that no dump is needed, the disadvantage is that the output is usually hard to correlate with business code.
Warning: take the instance out of traffic before running a heap dump with jmap or Arthas. Fortunately our service takes no online traffic directly; the dump only delays indexing, which may temporarily affect the search experience.
The dump turned out to be about 7 GB; a file that large is hard to pull down to a local machine for analysis, so MAT was run on the server (using about 1 GB of memory there) and only the analysis report was downloaded locally.
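For reference, the dump and the server-side MAT analysis can be done roughly like this (paths are illustrative; ParseHeapDump.sh ships with the MAT distribution):
# dump only live objects to keep the file a bit smaller
jmap -dump:live,format=b,file=/tmp/heap.hprof 30123
# or, from inside the Arthas console
heapdump /tmp/heap.hprof
# run MAT's leak-suspects report on the server, then download only the report
./ParseHeapDump.sh /tmp/heap.hprof org.eclipse.mat.api:suspects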
The thread name matched the information found earlier, and the details held a surprise.
There was a pile of strange strings that looked a bit like XML, apparently being assembled into something. I asked the business owner and confirmed that this is indeed the logic that assembles the SolrDocument. YourKit also printed part of the object values, and from that fragment you can see that most of the content is assembled repeatedly, roughly like this:
<xml>
...
<country>CA</country>
<country>CA</country>
<country>CA</country>
... N times
<country>US</country>
<country>US</country>
<country>US</country>
... N times
<country>UK</country>
<country>UK</country>
...
</xml>
The code change itself was relatively simple; the business developers know the flow well and quickly put the fix together.
After the adjustment, memory usage went down a lot.
Usage dropped to about 4–5 GB, which should leave enough headroom even while the full data is being written and collected. The code logic still feels like it has room for optimization, but solve the problem first; optimization can come in the next iteration.
Analysis
Data from both tools pointed to the logic of the IncrementIndexService class. The root of the problem had nothing to do with the G1 parameter settings; it was flawed code logic, which expanding memory or adjusting JVM parameters could only alleviate, not solve.
Looking back, the overall approach was:
- From process, to thread, to code
- Get basic JVM information: the collector, startup parameters, and so on
- Read the existing logs: GC logs and business logs
- Talk through the business scenario and understand the size of the input data
- Guess at possible causes; make bold assumptions
- Use tools (Arthas, MAT, YourKit, built-in JDK commands, etc.) to mine information: flame graphs, resource-hungry threads, thread stack dumps, large-object footprint analysis…
- Go back over the data to find correlations with the business code and possible defects, and try adjusting parameters
- If the business code has bugs, fix them
Arthas’s essay campaign is in full swing
The Arthas team is officially calling for essays. If you have:
- Problems identified using Arthas
- Source code interpretation of Arthas
- Suggestions or advice for Arthas
- Anything else related to Arthas, no limits
You are welcome to join the essay activity; there are prizes to win. Click to submit.
"Alibaba Cloud Native focuses on microservices, Serverless, containers, Service Mesh, and other technical fields, follows popular cloud-native technology trends and large-scale cloud-native practice, and aims to be the public account that best understands cloud-native developers."