Online Case 1

Frequent full GC scenarios

Let’s talk about which scenarios lead to frequent Full GC:

  1. Memory leak (a code problem where object references are not released in time, so the objects can never be garbage collected).
  2. Infinite loop.
  3. Large objects.

Large objects in particular account for more than 80 percent of these cases. So where do large objects come from?

  1. The result set of a database query (MySQL, or NoSQL databases such as MongoDB) is too large.
  2. A large object returned by a third-party interface.
  3. A message from a message queue is too large.

The vast majority of cases are caused by large database result sets.

Without any recent release, the system suddenly started alerting on Full GC. We checked the heap and saw no obvious memory leak, rolled back to the previous version anyway, and the problem still existed.

On the one hand, one person watches the system monitoring while another exports a heap memory snapshot with jmap (jmap -dump:format=b,file=<filename> <pid>). Then, use MAT or similar tools to analyze which objects occupy a large amount of space, and trace the relevant references to find the problem code. This method takes a long time to locate a problem; if the affected service is a key one, the fault may go unlocated and unresolved for a long time.

To further investigate the cause, we turned on -XX:+HeapDumpBeforeFullGC on one of the online machines. The overall JVM parameters were as follows:

-Xmx2g
-XX:+HeapDumpBeforeFullGC
-XX:HeapDumpPath=.
-Xloggc:gc.log
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100m
-XX:+HeapDumpOnOutOfMemoryError

Note that a Stop-The-World event occurs during a JVM dump operation, which means all user threads are suspended. To keep providing service properly during this period, you are advised to use distributed deployment and an appropriate load-balancing algorithm.

  • top: Find the PID of the Java process with the highest memory usage (RES column).

  • jmap -heap <pid>: Displays heap memory usage.

Prints a summary of the heap, including the GC algorithm in use, the heap configuration, and the memory usage of each region.

C:\Users\jjs>jmap -heap 5932
Attaching to process ID 5932, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.91-b15

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 1073741824 (1024.0MB)
   NewSize                  = 42991616 (41.0MB)
   MaxNewSize               =
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 60293120 (57.5MB)
   used     = 44166744 (42.120689392089844MB)
   free     = 16126376 (15.379310607910156MB)
   73.25337285580842% used
From Space:
   capacity = 5242880 (5.0MB)
   used     = 0 (0.0MB)
   free     = 5242880 (5.0MB)
   0.0% used
To Space:
   capacity = 14680064 (14.0MB)
   used     = 0 (0.0MB)
   free     = 14680064 (14.0MB)
   0.0% used
PS Old Generation
   capacity = 120061952 (114.5MB)
   used     = 19805592 (18.888084411621094MB)
   free     = 100256360 (95.6119155883789MB)
   16.496143590935453% used

20342 interned Strings occupying 1863208 bytes.
  • jps -lv: Lists Java processes along with their JVM parameter configurations.

  • jstat -gc <pid> 1000: Prints GC statistics, including the size and usage of each heap region, every second.

  • jmap -dump:live,format=b,file=heap_dump.hprof <pid>: Exports the heap dump file.

  • Use MAT to open the heap file and analyze the problem.

Meanwhile, check service monitoring and the DB, Redis, and other middleware metrics to see whether there is any anomaly.

If the network I/O of the database server has increased significantly and the time points match, it is almost certain that the Full GC is caused by a large database result set. Once the offending SQL is located, finding the responsible code is very simple.

In this way, we quickly located the problem. It turned out that a required parameter of an interface was not passed in and had no validation, so the SQL statement was missing two conditions and a single query pulled out tens of thousands of records. What a pitfall! Isn’t that much faster? Ha ha, about 5 minutes.

The root cause: the database stores each device's cid (an identifier assigned to every user who downloads the APP). Some devices had no cid available, so many rows were stored with an empty cid. With the cid condition missing from the query, tens of thousands of empty-cid rows were returned at once.
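A minimal sketch of the kind of defensive check that would have prevented this, assuming a hypothetical DeviceRepository and Device type; the names and the 500-row limit are illustrative, not the actual project code:

import java.util.Collections;
import java.util.List;

// Hypothetical types: Device, DeviceRepository, and findByCid are illustrative only.
class Device { String cid; }

interface DeviceRepository {
    // Expected to translate into SQL that always carries the cid condition and a LIMIT clause.
    List<Device> findByCid(String cid, int limit);
}

public class DeviceQueryService {
    private final DeviceRepository repository;

    public DeviceQueryService(DeviceRepository repository) {
        this.repository = repository;
    }

    // Validate the required parameter up front, so a missing cid can never
    // silently drop a WHERE condition and pull tens of thousands of rows.
    public List<Device> findByCid(String cid) {
        if (cid == null || cid.trim().isEmpty()) {
            return Collections.emptyList(); // or throw IllegalArgumentException
        }
        // An explicit row limit is a second line of defense against large result sets.
        return repository.findByCid(cid, 500);
    }
}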

Online Case 2

Memory leak

Before introducing an example, consider the difference between a memory leak and a memory overflow.

Memory overflow (out of memory): occurs when a program cannot obtain enough memory to run. Once memory is exhausted, the program basically cannot work.

Memory leak: occurs when a program fails to release memory that is no longer needed, so memory usage gradually increases. A memory leak generally does not make the program fail immediately. However, as the leak keeps accumulating and reaches the memory limit, a memory overflow occurs. In Java, a memory leak means the GC cannot fully reclaim objects, so heap usage grows a little more after each GC.
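A minimal illustration of the difference, assuming we deliberately hold references in a static collection (this is a toy example, not the production code):

import java.util.ArrayList;
import java.util.List;

public class LeakVsOom {
    // Objects added here stay reachable forever, so GC can never reclaim them:
    // this is a memory leak, and heap usage after each GC keeps creeping upward.
    private static final List<byte[]> LEAKED = new ArrayList<>();

    public static void main(String[] args) {
        while (true) {
            // 1 MB per iteration; once the accumulated leak exceeds -Xmx,
            // the leak finally turns into a memory overflow (OutOfMemoryError).
            LEAKED.add(new byte[1024 * 1024]);
        }
    }
}

Running this with a small heap (for example -Xmx64m) shows heap usage climbing after every GC until the process ends with java.lang.OutOfMemoryError: Java heap space.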

The following JVM monitoring chart shows a memory leak: the heap usage after each GC is higher than after the previous one:

The memory leak scenario was that a local cache (a framework developed by the company's infrastructure team) was used to store product data. There were not that many products, but still hundreds of thousands of them. Storing only the hot items does not take much memory, but storing the full set does not fit in the heap.

We initially added a 7-day expiration time to each cache record to ensure that most of the items in the cache were hot items. However, after a refactoring of the local cache framework, the expiration time was removed. With no expiration time, the local cache grew larger over time and a lot of cold data was loaded into it.
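The local cache framework here is internal to the company; as a stand-in, the following is a rough sketch of the same 7-day expiration idea using Guava Cache (the key/value types and the size cap are assumptions for illustration):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class ProductLocalCache {
    // Each entry expires 7 days after being written, so cold products are
    // evicted over time instead of accumulating in the heap forever.
    // maximumSize is an extra safety net that bounds the cache even if
    // the expiration policy were accidentally removed again.
    private final Cache<Long, String> cache = CacheBuilder.newBuilder()
            .expireAfterWrite(7, TimeUnit.DAYS)
            .maximumSize(100_000)
            .build();

    public void put(Long productId, String productJson) {
        cache.put(productId, productJson);
    }

    public String get(Long productId) {
        return cache.getIfPresent(productId); // null if absent or expired
    }
}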

One day, an alarm was received indicating that heap memory usage was too high. We quickly exported a heap snapshot with jmap (jmap -dump:format=b,file=<filename> <pid>), then analyzed it with the Eclipse MAT tool and found a large number of product records in the local cache. After locating the problem, the architecture team quickly added the expiration time back and restarted the service node by node.

Thanks to the server memory and JVM heap memory monitoring we had added, we caught the memory leak in time. Otherwise, as the leak accumulated over time, an OOM would eventually have occurred, and that would have been miserable.

So for the technical team, in addition to operations-level monitoring such as CPU and memory, JVM monitoring is also very important.