The phenomenon of

Through jstat-gcutil PID 5000, FGC was found to be frequent and frequent. At this time, the proportion of the old age was about 70%, and the memory could not be reclaimed. The FGC threshold set by us was 70% of the old age. At this time because there is still 30% of the old space, so the overall memory is relatively stable, CPU is also relatively stable, but there is a great potential risk, is the memory has been rising, not released.

[service@ZQ-SE-331-V05 ~]$jstat -gcutil 1087 5000 S0 S1 E O M CCS YGC YGCT FGC FGCT GCT 0.00 55.09 88.41 72.10 92.64 85.22 9223 1169.442 435 168.866 1338.307 57.54 0.00 72.24 72.31 92.64 85.22 9224 1169.542 436 168.877 1338.418 0.00 63.75 5.33 72.50 92.64 85.22 9225 1169.642 436 168.877 1338.519 0.00 63.75 34.02 72.50 92.64 85.22 9225 1169.642 436 168.877 1338.519 0.00 63.75 59.26 72.50 92.64 85.22 9225 1169.642 436 168.877 1338.519 0.00 63.75 81.37 72.50 92.64 85.22 9225 1169.642 436 168.877 1338.519 55.60 0.00 11.75 72.71 92.64 85.22 9226 1169.742 436 168.877 1338.619 55.60 0.00 40.07 72.71 92.64 85.22 9226 1169.742 436 168.877 1338.619 55.60 0.00 67.86 72.70 92.64 85.22 9226 1169.742 437 169.541 1339.284 0.00 56.04 4.21 72.59 92.64 85.22 9227 1169.838 437 169.541 1339.379 0.00 56.04 30.01 71.73 92.64 85.22 9227 1169.838 438 169.649 1339.487 0.00 56.04 57.75 71.73 92.64 85.22 9227 1169.838 438 169.649 1339.487 0.00 56.04 79.01 71.73 92.64 85.22 9227 1169.838 438 169.649 1339.487 55.39 0.00 2.54 71.92 92.64 85.22 9228 1169.988 438 169.649 1339.638 55.39 0.00 24.70 71.92 92.64 85.22 9228 1169.988 438 169.649 1339.638 55.39 0.00 47.89 71.92 92.64 85.22 9228 1169.988 438 169.649 1339.638 55.39 0.00 82.01 71.89 92.64 85.22 9228 1169.988 439 170.207 1340.196Copy the code

Jmap-histo /-histo:live PID >> log indicates that a large number of instances of the CarnivalOneDayInfo object were CarnivalOneDayInfo (2kW) and increased continuously

num     #instances         #bytes  class name
----------------------------------------------
   1:      28906840     1387528320  java.util.HashMap
   2:      38675870     1237627840  java.util.HashMap$Node
   3:      18631826      745273040  xxx.CarnivalOneDayInfo

    num     #instances         #bytes  class name
----------------------------------------------
   1:      31092889     1492458672  java.util.HashMap
   2:      35749328     1143978496  java.util.HashMap$Node
   3:      20355334      814213360  xxx.CarnivalOneDayInfo
Copy the code

screening

CarnivalOneDayInfo: The code for CarnivalOneDayInfo will clone a CarnivalOneDayInfo when checking all online players’ time activity in MinuteJob. Therefore, the initial location is CarnivalOneDayInfo continuous clone problem

At the back end, y looked at the related logic, annotated the related clone logic locally, observed histo again, and found that the object instances were not increasing, and confirmed the inference of 1 again. But after looking at the code related to activities, all activities were clone copies, why only carnival activities were leaked? Clone objects are temporary objects and should be released by FGC.

CarnivalOneDayInfo = CarnivalOneDayInfo = CarnivalOneDayInfo = CarnivalOneDayInfo = CarnivalOneDayInfo = CarnivalOneDayInfo However, there are many online players and a lot of heap memory, so exporting once takes a long time and will cause STW. Therefore, directly connect to the development server and check, there is also the problem of CarnivalOneDayInfo leak. Therefore, run jmap-dump :live,format=b,file=2388_heap.bin 2388 to export the heap memory snapshot of the Java process on the development server.

Open directly with MAT. If you are experienced in using MAT, the operation steps are

  1. Choose dominator_tree

  2. Search CarnivalOneDayInfo

  3. List Objects

  4. Select With Incoming References, who holds its references

Direct discovery is by

queue-executor-handler-5

java.lang.ThreadLocal$ThreadLocalMap @ 0x8104eec0

java.lang.ThreadLocal$ThreadLocalMap$Entry[64] @ 0x866710f0

java.lang.ThreadLocal$ThreadLocalMap$Entry @ 0x86671608

java.util.IdentityHashMap @ 0x86671628
Copy the code

Java.lang.Object[], which holds a lot of CarnivalOneDayInfo and HashSet, is directly held by the logical thread’s ThreadLocal.

The hard road of analysis

The first step is to directly query all threadLocal references in the IDE and find the corresponding CarnivalOneDayInfo

Logback Protobuf service’s own threadLocal BeanUtils threadLocal

After the investigation, I did a simple test and checked relevant codes, but found no ideas or complaints. CarnivalOneDayInfo: CarnivalOneDayInfo: CarnivalOneDayInfo: CarnivalOneDayInfo: CarnivalOneDayInfo: CarnivalOneDayInfo: CarnivalOneDayInfo: MinuteJob (MinuteJob, MinuteJob, MinuteJob, MinuteJob, MinuteJob) Observe each ThreadLocal$ThreadLocalMap$Entry and focus on whether or not IdentityHashMap is present because mat analyzes that this map is finally located on the stack.

The stack

MinuJob -> run through all players online, Post a message to the logical thread -> ActivityManager#checkTimeActivity traverses all individual activities -> CarnivalActivityInfo#checkActivityState CarnivalActivityInfo Diff = playerinfo.clone () ActivityBaseInfo. GetCarnivalDaysMap (.) the equals (carnivalDaysMap) / / calls equals to compare to compare in turn each CarnivalOneDayInfo carnivalDaysMap Call CarnivalOneDayInfo -> BaseCarnivalOneDayInfo # equals beanutils.isdirty

Specific code

private static Set<String> getDirtyNamesByBean(AugmentedBean augmentedBean) {
        IdentityHashMap<AugmentedBean, Set<String>> dirtyNamesMap = dirtyNames.get();
        if (dirtyNamesMap == null) {
            dirtyNamesMap = new IdentityHashMap<>();
            dirtyNames.set(dirtyNamesMap);
        }
        return dirtyNamesMap.computeIfAbsent(augmentedBean, k -> Sets.newHashSet());
    }

    public static boolean isDirty(AugmentedBean augmentedBean) {
        return getDirtyNamesByBean(augmentedBean).size() > 0;
    }
Copy the code

It’s very easy to see from the above that we’ve created an IdentityHashMap, and then set it to threadLocal exactly like we did in the previous analysis so it’s exactly where the problem is, it’s the BeanUtils pot.

Review and summary

The beanutils problem was very much suspected, but I didn’t pay much attention to it. Now, it is the same as the above analysis.

  • Comparing hisTO before and after FGC, confirm that the number of object instances has been increasing and is significantly larger
  • Analyze the code, if directly locate the problem, best
  • If you cannot locate the object directly and need to confirm who holds the object reference, dump the heap memory snapshot
  • Instead of dumping online, you need to reproduce on the development server (memory leaks are usually easy to reproduce) and dump on the development server
  • After dump, use MAT tool to analyze the leak, List Objects With incoming References, find references
  • The next step is you need to confirm why it’s referenced here, you can analyze the code, solve the problem, best
  • If not, you can use debug to confirm the thread stack by referring to the code of the related object in the previous step and adding a breakpoint

It also explains why only the Carnival activity was exposed, because it called basexx’s equals method generated by Beanutils, which was a solution to clone objects with rollback and diff incremental updates, but was later abandoned. As the complexity of the object increases, the cost of clone is high, but a large part of generated code is left behind, and the bug this time is also because the method of discarded generated code is called. So the next version will definitely clean up all the generated deprecated code.

Once you’ve found this problem with Beanutils, it’s easy to fix the leak. I wrote a Beanshell script that posted a message to the logical thread and called beanutils.clean to clean up all threadlocals

import x.BeanUtils; import y.HandlerModule; for (int i = 1; i <= 16; i++) { HandlerModule.instance.addQueueTask(i, new Runnable() { public void run() { BeanUtils.clean(); }}); }Copy the code

CarnivalOneDayInfo from 1KW to 1million after FGC, old generation usage from 70 to 30

$jstat -gcutil 1087 5000 74.73 0.00 16.02 72.48 92.61 85.04 10156 1313.117 575 271.355 1584.472 74.73 0.00 34.71 72.48 92.61 85.04 10156 1313.117 575 271.355 1584.472 74.73 0.00 54.42 72.48 92.61 85.04 10156 1313.117 575 271.355 1584.472 74.73 0.00 73.29 72.48 92.61 85.04 10156 1313.117 575 271.355 1584.472 74.73 0.00 89.41 72.48 92.61 85.04 10156 1313.117 575 271.355 1584.472 0.00 71.54 9.25 72.74 92.64 85.06 10157 1313.303 576 272.188 1585.492 0.00 71.54 28.30 72.73 92.64 85.06 10157 1313.303 577 272.188 1585.492 0.00 71.54 55.85 72.73 92.64 85.06 10157 1313.303 577 272.463 1585.766 0.00 71.54 78.05 72.73 92.64 85.06 10157 1313.303 577 272.463 1585.766 69.21 0.00 1.70 70.98 92.64 85.06 10158 1313.438 578 273.320 1586.758 69.21 0.00 19.97 63.09 92.64 85.06 10158 1313.438 578 273.320 1586.758 69.21 0.00 39.82 53.33 92.64 85.06 10158 1313.438 578 273.320 1586.758 69.21 0.00 59.75 41.61 92.64 85.06 10158 1313.438 578 273.320 1586.758 69.21 0.00 75.12 31.79 92.64 85.06 10158 1313.438 578 273.320 1586.758 69.21 0.00 94.13 31.79 92.64 85.06 10158 1313.438 578 273.320 1586.758 0.00 86.02 15.60 32.07 92.64 85.06 10159 1313.761 578 273.320 1587.081 0.00 86.02 94.86 32.07 92.64 85.06 10159 1313.761 578 273.320 1587.081 [service @ ZQ - SE - 331 - V05 config] $jmap histo - 1087 | grep CarnivalOneDayInfo 10:  1408627 56345080 xxx.CarnivalOneDayInfoCopy the code

Another optimization solution is to refactor the active code and avoid leaks from BeanUtils threadlocals by using equals instead of beanUtils.

In terms of memory, before the official launch, it is necessary to accurately estimate the overall memory usage according to the number of concurrent online users and DAU, for example, the actual memory usage of a player, and global static data such as the actual memory usage of a leaderboard. It can be obtained from code and tools. This is a quick way to determine if there is a memory leak or a real memory hog.

Three things to watch ❤️

If you find this article helpful, I’d like to invite you to do three small favors for me:

  1. Like, forward, have your “like and comment”, is the motivation of my creation.

  2. Follow the public account “Java rotten pigskin” and share original knowledge from time to time.

  3. Also look forward to the follow-up article ing🚀