Problem symptoms
On an ordinary working day, I suddenly received a burst of alerts from an online service: a large number of requests were delayed. Checking the service showed that access to the database was timing out, but the impact lasted only 3-4 seconds before the service recovered. A few minutes later the same flood of alerts appeared again, and again everything returned to normal after 3-4 seconds. Since we are a low-level service that heavier upper-level services depend on, such frequent abnormal fluctuations had a serious impact on the business, so we started troubleshooting.
Troubleshooting process
Was it the DB?
- When the first alert fired, our initial suspicion was that database connections had been exhausted, since the upper-layer services make a large number of interface calls and some of them involve complex SQL. However, after analyzing the call volume, the number of requests before and after the incident showed no significant change, so a sudden traffic spike was ruled out.
- The DB itself looked healthy: load was normal and there were no slow queries, so the DB was ruled out as the cause.
Was it the container or the JVM?
Having ruled out the DB, we looked at the containers. Going back over the abnormal alerts, we noticed that each alert period involved the same container IP, which made GC the prime suspect (easily an 80% probability at this point). The container's CPU load during the alert periods was normal, so we looked at the JVM's memory and GC. The overall memory usage curve looked roughly like this:
(Figure: JVM memory usage curves for Heap and Old Gen)
As the figure shows, some objects in memory are referenced for a long time and cannot be reclaimed by YoungGC, and their total size keeps growing. They are only collected once Old Gen fills up and a FullGC is triggered, and a single FullGC takes an extremely long time.
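In our case this curve came from the monitoring system. If you want to confirm the same numbers directly on a running instance, the JDK exposes them through its management beans; here is a minimal, self-contained sketch (not from the original troubleshooting, just a convenient way to read Old Gen usage and GC counters):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class GcSnapshot {
    public static void main(String[] args) {
        // Old-generation pool names depend on the collector, e.g. "PS Old Gen" or "G1 Old Gen"
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Old Gen")) {
                long usedMb = pool.getUsage().getUsed() >> 20;
                long maxMb = pool.getUsage().getMax() >> 20;   // -1 if the max is undefined
                System.out.printf("%s: used=%d MB, max=%d MB%n", pool.getName(), usedMb, maxMb);
            }
        }
        // Collector names also vary, e.g. "PS Scavenge" (young gen) vs "PS MarkSweep" (old gen)
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, total time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```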
Temporary measures
The problem had now been located. So far only 3 instances had triggered a FullGC, but looking at the memory usage of the other instances, almost all of them had an Old Gen close to the tipping point. The temporary fix was therefore to keep one instance serving and do a rolling restart of all the others, to avoid a large number of instances hitting FullGC at the same time; otherwise a service avalanche would have been very likely.
We do have JVM monitoring alerts configured, and in theory an alert should fire when memory usage crosses a threshold. However, the alert configuration had been invalidated by a service migration, so the problem was not detected in advance.
Problem analysis
What objects are not being reclaimed
Some objects in memory cannot be reclaimed by YoungGC and grow without bound; only a FullGC can reclaim them, and that FullGC takes a long time. The first step is to look at a class histogram of live objects:
jmap -histo:live pid
A quick look on the live instance showed that the #2 entry, HashMap$Node, looked abnormal, but the histogram alone doesn't tell you much more. The better approach is to dump a heap snapshot and analyze it with MAT:
jmap -dump:format=b,file=filename pid
Opening the dump with MAT, the problem is immediately obvious:
class com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
This class retains more than 80% of the heap. So what is it? It is a thread used by the MySQL driver to clean up connections that were never explicitly closed. Let's take a look at its source code:
This class is a singleton that starts a single thread to clean up database connections that were not explicitly closed.
You can see that it maintains a Set of phantom references:
private static final Set<ConnectionFinalizerPhantomReference> connectionFinalizerPhantomRefs = ConcurrentHashMap.newKeySet();
Recall that HashMap$Node had the second-highest memory usage, so a memory leak is almost certain. In MAT, use "list objects" on the suspect to confirm:
Sure enough, the culprit was found! So what does this Set contain, and why does it keep growing without being collected by YoungGC? From the name, ConnectionFinalizerPhantomReference presumably holds a phantom reference to a database connection.
What is a phantom reference? When an object is reachable only through phantom references, it can be reclaimed during GC, and the PhantomReference object itself is then placed on a ReferenceQueue.
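A tiny, self-contained example of this behavior (unrelated to MySQL, just to illustrate the mechanism):

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class PhantomDemo {
    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object referent = new Object();
        PhantomReference<Object> ref = new PhantomReference<>(referent, queue);

        referent = null;   // drop the last strong reference
        System.gc();       // request a GC; not guaranteed, but usually enough for a demo

        // Once the referent has been collected, the PhantomReference itself is enqueued
        Reference<?> enqueued = queue.remove(2000);
        System.out.println("enqueued: " + (enqueued == ref));
    }
}
```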
Let's trace the source code to confirm. It is indeed a PhantomReference, and what it wraps is the MySQL connection. Now look at where the connection is created:
As you can see, every time a new database connection is created, the connection is wrapped in a PhantomReference and added to connectionFinalizerPhantomRefs. The cleanup thread then runs an infinite loop, taking references from the ReferenceQueue and closing the underlying connections.
A connection object can only be garbage-collected and enqueued on the ReferenceQueue once it has no other references left, i.e. only phantom references.
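Putting the two pieces together, the overall pattern looks roughly like the sketch below. This is a simplified illustration of the mechanism described above, with made-up names; it is not the actual Connector/J source:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ConnectionTracker {

    /** The low-level resources a connection holds (e.g. its socket) that must be closed. */
    interface NetworkResources {
        void forceClose();
    }

    /** Phantom reference to a connection, carrying the resources to release once it is collected. */
    static final class FinalizerRef extends PhantomReference<Object> {
        final NetworkResources io;

        FinalizerRef(Object connection, NetworkResources io, ReferenceQueue<Object> queue) {
            super(connection, queue);
            this.io = io;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // Strongly held set: it keeps the phantom references themselves alive.
    // This is the set that kept growing in our heap dump.
    private final Set<FinalizerRef> refs = ConcurrentHashMap.newKeySet();

    /** Called whenever a new physical connection is created. */
    void track(Object connection, NetworkResources io) {
        refs.add(new FinalizerRef(connection, io, queue));
    }

    /** Cleanup loop run by a single background thread. */
    void cleanupLoop() throws InterruptedException {
        while (true) {
            // Blocks until a tracked connection has been collected and its reference enqueued
            Reference<?> ref = queue.remove(5000);
            if (ref instanceof FinalizerRef) {
                FinalizerRef f = (FinalizerRef) ref;
                f.io.forceClose();   // close the abandoned connection's underlying resources
                refs.remove(f);      // and stop tracking it
            }
        }
    }
}
```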
Why does the number of Connection objects keep growing?
The picture is now clearer: every database connection that gets created is put into connectionFinalizerPhantomRefs. Each connection is used normally at first, so it survives several minor GCs and is promoted to the old generation. After a while, however, the connection becomes invalid for some reason, which forces the connection pool to create a new one, and the cycle repeats.
The connection pool used in our project is Druid. The following is the connection pool configuration:
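The original configuration was shown as a screenshot, which is not reproduced here; a Druid data source consistent with the values referenced in this article might be set up roughly like the sketch below (the exact values are reconstructed from the text, not copied from our real config file):

```java
import com.alibaba.druid.pool.DruidDataSource;

public class PoolConfig {
    static DruidDataSource buildDataSource(String url, String user, String password) {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl(url);
        ds.setUsername(user);
        ds.setPassword(password);
        ds.setMinIdle(30);                                 // keep at least 30 connections in the pool
        ds.setKeepAlive(true);                             // validate idle connections instead of silently dropping them
        ds.setMinEvictableIdleTimeMillis(5 * 60 * 1000L);  // a connection counts as "idle" after 5 minutes
        ds.setTimeBetweenEvictionRunsMillis(60 * 1000L);   // the eviction/keepAlive task runs every 60 seconds (Druid's default)
        return ds;
    }
}
```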
You can see that keepAlive is enabled and minEvictableIdleTimeMillis is set. After initialization, as long as the volume of DB requests does not fluctuate wildly, the pool should maintain at least 30 connections, and a keepAlive operation should be performed once a connection has been idle for more than 5 minutes. In theory the pool should not be creating connections frequently, unless the number of active connections is low and fluctuating and keepAlive is not taking effect, i.e. the MySQL connection is already invalid by the time the pool performs keepAlive on it, so the invalid connection is discarded and a new one is created the next time it is needed.
To test this hypothesis, we first looked at the active connection count: on a single instance it fluctuated roughly between 3 and 20 most of the time, and the business has scheduled tasks that generate a burst of DB requests every 30 minutes and every hour. Druid runs a keepAlive check after 5 minutes of idleness, so why would connections become invalid? The default wait_timeout on the MySQL server is 8 hours; had the corresponding configuration been changed?
show global variables like '%timeout%'
Sure enough, the server-side wait_timeout was set to 300 seconds, i.e. 5 minutes! Now the problem is obvious.
Problem summary
- DB connections were being rebuilt frequently, and the corresponding objects could not be reclaimed, so memory grew too fast.
- At minor GC time the connections were still in use and strongly referenced, so they could not be collected and were promoted; connectionFinalizerPhantomRefs therefore grew without bound and could only be cleared by a FullGC, which took too long.
Conclusion
- Keeping idle connections alive relies on Druid's scheduled keepAlive task for heartbeat detection. By default the task runs every 60 seconds, and a heartbeat is only performed on connections whose idle time is greater than or equal to minEvictableIdleTimeMillis.
- Because minEvictableIdleTimeMillis was set to 5 minutes, an idle connection should in theory receive its heartbeat somewhere between 5 and 6 minutes of idleness (5 minutes plus up to one 60-second scheduling interval). But since the MySQL server-side timeout is only 5 minutes, there is a high probability that the connection is already dead by the time Druid performs keepAlive.
- Because the number of active database connections fluctuates and min-idle is set to 30, the pool has to create and hold many connections when activity peaks, and when activity drops back down many idle connections are discarded because their keepAlive fails. This repeats over and over.
- Every time a connection is created, a Connection object is put into connectionFinalizerPhantomRefs. Because the connection is active right after creation, it is not reclaimed by minor GC in the short term and gets promoted to the old generation. The set therefore keeps growing, eventually leading to a FullGC with a very long pause.
Solution
Knowing the cause of the problem, the solution is simple:
- Reduce the number of invalid connections and slow down memory growth: set minEvictableIdleTimeMillis to 3 minutes, so that keepAlive takes effect before the server-side timeout and connections are not constantly rebuilt.
- Reduce the FullGC pause time: switch the garbage collector to G1.
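Continuing the configuration sketch from earlier, the pool-side change amounts to a single setting; the collector change is a JVM startup flag (-XX:+UseG1GC) rather than an application change:

```java
// Continuing the buildDataSource(...) sketch above: mark connections as idle
// after 3 minutes, so Druid's keepAlive task (running every 60 seconds by default)
// reaches them well before MySQL's 5-minute wait_timeout closes them server-side.
ds.setMinEvictableIdleTimeMillis(3 * 60 * 1000L);
```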