JVM-related problems have always been a headache for frontline developers. Because the JVM is essentially a black box to business code, it is hard to see what is going on and locate the cause when something goes wrong, which is exactly why we keep studying its internals.

This article walks through a recent JVM memory leak from our production environment and analyzes it with you step by step ~
Part1 The alarm from production

1.1 The symptoms first

Old-generation memory usage

One day, a colleague came to me for help: a system that had shown no warning signs suddenly fired a string of alarms because old-generation memory usage on its machines had exceeded the threshold.

As the chart shows, memory usage was fairly normal until mid-July, and each GC reclaimed a significant portion of the old-generation objects.

After mid-July, old-generation usage grew slowly and was never released. Clearly, some objects were no longer being collected properly.

Memory leak
1.2 What to Do
If this kind of problem appears in a project that has just gone live, the impact is still small and the code can simply be rolled back; stopping the bleeding comes first.

This project, however, had clearly been live for ages, with who knows how many requirements shipped in the meantime. And since the problem only surfaced after a recent increase in traffic, the feature was already open to customers.

Rolling back was not an option, so we had to locate the problem quickly and fix it online.
Part2 Locating the problem
The general steps:

- Get a heap dump file.
- Use a tool such as MAT to find the objects occupying abnormal amounts of memory and trace their reference chains.
- Analyze the code associated with those objects for likely problems.
However, because the dump file was more than 10 GB, MAT could barely do anything with it, so we had to print the histograms and analyze them by hand.
2.1 Locating the faulty code
The jmap result

Fortunately, the suspicious objects were fairly obvious: the _Point_ and _GeoDispLocal_ classes each had millions of instances. Let's take a look at how these two objects are used in the code.
```java
private static final CacheMap<String, List<GeoDispLocal>> NEAR_DISTRICT_CACHE =
        new CacheMap<String, List<GeoDispLocal>>(3600 * 1000, 1000);

private static final CacheMap<Integer, Point> LOCAL_POINT_CACHE =
        new CacheMap<Integer, Point>(3600 * 1000, 6000);
```
Both are stored in a static CacheMap (objects held by a static collection and never released are a classic cause of memory leaks), and _CacheMap.Entry_ indeed shows a very high instance count in the dump file as well.

CacheMap is our number-one suspect. Let's take a look at this cache class:
```java
public class CacheMap<K, V> {
    private final long expireMs;
    private LRUMap<K, CacheMap.Entry<V>> valueMap;
    // the rest is omitted
}
```
Internally it relies on a map with LRU semantics:
```java
public class LRUMap<K, V> extends LinkedHashMap<K, V> {

    private static final long serialVersionUID = 1L;
    private static final float LOAD_FACTOR = 0.99f;

    private final int maxCapacity;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public LRUMap(int maxCapacity) {
        super(maxCapacity, LOAD_FACTOR, true);
        this.maxCapacity = maxCapacity;
    }

    @Override
    protected boolean removeEldestEntry(java.util.Map.Entry<K, V> eldest) {
        return size() > maxCapacity;
    }

    @Override
    public V get(Object key) {
        try {
            lock.readLock().lock();
            return super.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    @Override
    public V put(K key, V value) {
        try {
            lock.writeLock().lock();
            return super.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // remove and clear are omitted
}
```
So it is an LRU cache built on top of LinkedHashMap, intended to be a map with a bounded capacity that never resizes, exactly as it is used online. But is it really what it looks like?
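To see the behaviour that LinkedHashMap contributes here, the following minimal, self-contained sketch (not the project's code) shows what the accessOrder=true flag passed in the constructor above actually does: every get() reorders the entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AccessOrderDemo {
    public static void main(String[] args) {
        // accessOrder = true: iteration order is least-recently-accessed first
        Map<Integer, String> lru = new LinkedHashMap<>(16, 0.75f, true);
        lru.put(1, "A");
        lru.put(2, "B");
        lru.put(3, "C");

        // Reading key 1 relinks its node to the tail of the internal linked list
        lru.get(1);

        // Prints {2=B, 3=C, 1=A}: a plain "read" changed the structure of the map
        System.out.println(lru);
    }
}
```

Keep that relinking-on-get in mind; the whole analysis below hinges on it.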
2.2 How LRUMap is built on LinkedHashMap

First, the capacity and resize settings: why did the designer believe this LRUMap would never be resized?
```java
// The maximum number of elements we want the map to hold
private final int maxCapacity;

// Load factor
private static final float LOAD_FACTOR = 0.99f;

// The constructor delegates initialization to LinkedHashMap
super(maxCapacity, LOAD_FACTOR, true);

// Evict the eldest entry once LinkedHashMap's size exceeds our limit
@Override
protected boolean removeEldestEntry(java.util.Map.Entry<K, V> eldest) {
    return size() > maxCapacity;
}
```
Plugging in the values from our actual usage:

- maxCapacity = 6000: the maximum number of elements we want to hold.
- LOAD_FACTOR = 0.99: the load factor.
- The map's internal resize threshold is 8192 * 0.99 = 8110, the size at which the next resize would occur (the table capacity inside the map is rounded up to a power of two, i.e. 8192).
Because removeEldestEntry is overridden, LRU eviction kicks in as soon as size exceeds 6000, so in theory size should never reach 8110 and the map should never resize.
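As a quick sanity check on those numbers (throwaway arithmetic, not project code): HashMap rounds the requested capacity up to the next power of two and only resizes once size exceeds capacity multiplied by the load factor.

```java
public class CapacityMath {
    public static void main(String[] args) {
        int maxCapacity = 6000;      // the capacity passed to LRUMap
        float loadFactor = 0.99f;    // LOAD_FACTOR

        // Round up to the next power of two (what HashMap.tableSizeFor does for this value)
        int tableSize = Integer.highestOneBit(maxCapacity - 1) << 1;
        // A resize would only be triggered once size exceeds this threshold
        int threshold = (int) (tableSize * loadFactor);

        System.out.println("table size = " + tableSize);        // 8192
        System.out.println("resize threshold = " + threshold);  // 8110
        // removeEldestEntry evicts as soon as size > 6000, so 8110 is never reached
    }
}
```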
Next, how are concurrent read/write conflicts resolved?
```java
private final ReadWriteLock lock = new ReentrantReadWriteLock();

public V get(Object key) {
    try {
        lock.readLock().lock();
        return super.get(key);
    } finally {
        lock.readLock().unlock();
    }
}

public V put(K key, V value) {
    try {
        lock.writeLock().lock();
        return super.put(key, value);
    } finally {
        lock.writeLock().unlock();
    }
}
```
To handle concurrent access, the designer wrapped the query and modification methods in locks, and chose a read-write lock for performance: get takes the read lock, while put/remove take the write lock.

At first glance the design neatly solves both of LRUMap's problems, fixed capacity and concurrent access. But what really happens?

This problem has actually been analyzed before [1]: to maintain LRU order, LinkedHashMap modifies the structure during a get, moving the accessed element to the tail of the linked list, which brings the concurrent read/write problem right back. That explanation felt a bit vague to me, though, so let's go into more detail.
2.3 Breaking down the LinkedHashMap memory leak

Why doesn't the read-write lock save us?

A read-write lock allows multiple threads to share the read lock at the same time and is meant for read-heavy workloads, on the premise that a read operation does not change the underlying structure.

So the problem lies in the get operation: LinkedHashMap overrides get to implement LRU, and after the lookup it moves the current node to the tail of the linked list.

Moving a node is plainly a write operation, so does the read lock still offer any protection?

If multiple threads are allowed in at the same time and each of them modifies the list, what happens? Can the races really be avoided?
Let's break down the concurrency problem in detail against the code that moves the node:

After a get, the accessed node is moved to the tail of the list
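The node movement in question happens in LinkedHashMap.afterNodeAccess, which get() triggers when accessOrder is true. Lightly commented and paraphrased here from the JDK 8 source (check your own JDK version for the exact code), it looks roughly like this:

```java
// Paraphrased from java.util.LinkedHashMap (JDK 8); not the project's code.
void afterNodeAccess(Node<K,V> e) { // move the accessed node to the tail
    LinkedHashMap.Entry<K,V> last;
    if (accessOrder && (last = tail) != e) {
        LinkedHashMap.Entry<K,V> p =
            (LinkedHashMap.Entry<K,V>) e, b = p.before, a = p.after;
        p.after = null;                  // unlink p from its successor
        if (b == null) head = a;         // p was the head
        else b.after = a;                // bypass p going forward
        if (a != null) a.before = b;     // bypass p going backward
        else last = b;
        if (last == null) head = p;
        else { p.before = last; last.after = p; }
        tail = p;                        // p becomes the new tail
        ++modCount;
    }
}
```

None of these pointer updates are atomic, so two threads inside this method at the same time can easily leave the list in an inconsistent state.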
Step by step, here is why the memory leak occurs under multithreading:

Interleaved execution of get across threads and time slices

As the interleaving shows, thread 1 executes the first two statements and then loses its time slice; thread 2 then runs p.after = null and loses its time slice as well. The variable a, which should reference the next node <2,B>, ends up null under this interleaving. As a result the last two nodes are kicked off the linked list, the delete operation can no longer reach them, and the memory leaks.
I won't paste the verification code here; anyone interested can try it out themselves ~
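As a starting point, here is a minimal reproduction sketch under stated assumptions: the class names are hypothetical, the LRUMap follows the same read-lock-around-get pattern as above, and depending on timing you may see size() blow past the limit, an exception, or a hang, any of which confirms the race.

```java
import java.util.LinkedHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LruLeakRepro {

    // Same idea as the LRUMap above: accessOrder = true, read lock on get, write lock on put.
    static class LRUMap<K, V> extends LinkedHashMap<K, V> {
        private final int maxCapacity;
        private final ReadWriteLock lock = new ReentrantReadWriteLock();

        LRUMap(int maxCapacity) {
            super(maxCapacity, 0.99f, true);
            this.maxCapacity = maxCapacity;
        }

        @Override
        protected boolean removeEldestEntry(java.util.Map.Entry<K, V> eldest) {
            return size() > maxCapacity;
        }

        @Override
        public V get(Object key) {
            lock.readLock().lock();           // readers run concurrently...
            try { return super.get(key); }    // ...but each get() relinks nodes
            finally { lock.readLock().unlock(); }
        }

        @Override
        public V put(K key, V value) {
            lock.writeLock().lock();
            try { return super.put(key, value); }
            finally { lock.writeLock().unlock(); }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LRUMap<Integer, Integer> map = new LRUMap<>(1000);
        for (int i = 0; i < 1000; i++) map.put(i, i);

        Runnable reader = () -> { for (int i = 0; i < 1_000_000; i++) map.get(i % 1000); };
        Runnable writer = () -> { for (int i = 1000; i < 1_100_000; i++) map.put(i, i); };

        Thread t1 = new Thread(reader), t2 = new Thread(reader), t3 = new Thread(writer);
        t1.start(); t2.start(); t3.start();
        t1.join(); t2.join(); t3.join();

        // With a healthy list, size() stays at 1000; once racing get() calls detach
        // nodes, eviction can no longer reach them and the size creeps upward.
        System.out.println("size = " + map.size());
    }
}
```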
Part3 Summary

Now that the problem has been located, how do we actually fix the memory leak?

One option is to change the read-write lock into a mutex so that get is exclusive as well. Or simply move the cache into distributed storage; how much slower can it really be? It is convenient and simple, and there is no need to hand-roll your own LRUMap just to save some machine memory.
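A hedged sketch of the mutex option (not the exact patch that shipped): wrap the LinkedHashMap so that every access, including get, is exclusive, and the relinking done by reads can no longer race.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Every method is synchronized, so the node relinking performed by get() is serialized.
public class SynchronizedLRUMap<K, V> {
    private final Map<K, V> map;

    public SynchronizedLRUMap(int maxCapacity) {
        this.map = new LinkedHashMap<K, V>(maxCapacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxCapacity;
            }
        };
    }

    public synchronized V get(K key)          { return map.get(key); }
    public synchronized V put(K key, V value) { return map.put(key, value); }
    public synchronized V remove(K key)       { return map.remove(key); }
    public synchronized int size()            { return map.size(); }
}
```

In practice, a well-tested cache library (for example Guava's CacheBuilder or Caffeine) or an external store such as Redis spares you from hand-rolling this kind of concurrency at all.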
Finally, the "eight-legged essay" interview material is not just for interviews; it is the cornerstone of every production investigation. Don't misjudge what it is for...
References

[1] Memory leak caused by LinkedHashMap: blog.csdn.net/yejingtao70…