Problem symptoms

On an ordinary working day I suddenly received a flood of online service alarms: a large number of requests were being delayed. Checking the service showed that database access was timing out, but the impact lasted only 3-4 seconds before the service recovered. A few minutes later another burst of alarms fired and, again, everything returned to normal after 3-4 seconds. Since we are a low-level service that much heavier upper-layer services depend on, such frequent abnormal fluctuations had a severe impact on the business, so the troubleshooting began.

Troubleshooting process

DB impact?

  1. When the first alarm fired, our suspicion was that database connections had run out because upper-layer services were making a large number of interface calls, some of them involving complex SQL. However, after analyzing the interface calls, request volume before and after the exception did not change significantly, so a sudden traffic spike was ruled out.
  2. The DB itself was checked: load was healthy and there were no slow queries, so the DB was ruled out as the cause.

Container or JVM impact?

After ruling out the DB, we checked the containers. Looking back at the abnormal alarms, we found that every alarm period involved the same container IP, which at that point made GC about 80% likely to be the culprit. The container's CPU load during the alarm period was normal, so we turned to the JVM's memory and GC. The overall memory usage curve looked roughly like this:

[Figures: Heap usage and Old Gen usage over time]

As you can see from the figure above, some objects in memory are referenced for a long time and cannot be reclaimed by YoungGC, and their total size keeps growing. They are only collected once the Old Gen fills up and a FullGC is triggered, and a single FullGC takes an extremely long time.

Temporary measures

The problem was now located. So far only 3 instances had triggered FullGC, but looking at the memory usage of the other instances we found that the Old Gen of almost all of them was nearing the tipping point. So the temporary measure was to keep one instance untouched (preserving it for analysis) and restart all the others in a rolling fashion, to avoid a large number of instances hitting FullGC at the same time. Otherwise a service avalanche would be very likely.

The JVM does have monitoring alarms configured; in theory an alarm should fire when memory usage reaches a certain threshold. However, the alarm configuration had been invalidated by a service migration, so the problem was not detected in advance.

Problem analysis

Which objects are not being reclaimed?

A batch of objects in memory cannot be reclaimed by YoungGC and keeps growing; only FullGC can reclaim them, and that FullGC takes a long time. First, inspect the live objects:

 jmap -histo:live pid

A quick look at the histogram online shows HashMap$Node in second place, which looks suspicious, but the histogram alone doesn't reveal much. The best approach is to dump a heap snapshot and analyze it with MAT:

jmap -dump:format=b,file=filename pid 

Opening the dump with MAT, an obvious problem stands out:

class com.mysql.cj.jdbc.AbandonedConnectionCleanupThread

This class takes up more than 80% of the memory. So what is it for? It is the thread the MySQL driver uses to clean up connections that were never explicitly closed. Let's take a look at the source code.

This class is a singleton that starts a single thread to clean up database connections that have not been explicitly closed.

You can see that this class maintains a Set:

private static final Set<ConnectionFinalizerPhantomReference> connectionFinalizerPhantomRefs = ConcurrentHashMap.newKeySet();

Given that HashMap$Node has the second-highest memory usage, a memory leak here is almost certain. Use MAT's "list objects" on this Set to confirm:

Sure enough, the culprit is found! So what does this Set contain, and why does it keep growing without being collected by YoungGC? From the name ConnectionFinalizerPhantomReference we can guess that it holds phantom references to database connections.

What is a phantom reference? When an object is reachable only through phantom references, the JVM can reclaim it during GC, and the PhantomReference object is then placed into its associated ReferenceQueue.
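As a quick illustration (a minimal standalone sketch using only the standard java.lang.ref API, not code from the MySQL driver), a phantom reference only shows up on its ReferenceQueue after the referent has been collected:

import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class PhantomDemo {
    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object resource = new Object();
        PhantomReference<Object> ref = new PhantomReference<>(resource, queue);

        resource = null;   // drop the last strong reference
        System.gc();       // only a hint, but usually enough for a small demo

        // once the referent has been collected, the reference itself is enqueued
        Reference<?> enqueued = queue.remove(2000);  // wait up to 2 seconds
        System.out.println(enqueued == ref);         // typically prints true after the GC above
    }
}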

Let’s trace the source code to make sure

The PhantomReference is created when a new MySQL connection is established:

As you can see, every time a new database connection is created, it is wrapped in a PhantomReference and added to connectionFinalizerPhantomRefs. The cleanup thread then runs an infinite loop, taking references from the ReferenceQueue and closing the underlying connections.

A connection object can only be garbage-collected, and its reference enqueued, once no references to it remain other than the phantom reference.
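The overall pattern looks roughly like the sketch below. This is a simplified paraphrase of the idea, not the actual Connector/J source; names such as ConnectionPhantomRef, trackConnection and cleanUp are illustrative only.

import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.sql.Connection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the abandoned-connection cleanup pattern.
class ConnectionCleanerSketch implements Runnable {

    static class ConnectionPhantomRef extends PhantomReference<Connection> {
        ConnectionPhantomRef(Connection c, ReferenceQueue<Connection> q) {
            super(c, q);
        }
        void cleanUp() { /* close the underlying network resources */ }
    }

    private static final ReferenceQueue<Connection> refQueue = new ReferenceQueue<>();
    private static final Set<ConnectionPhantomRef> refs = ConcurrentHashMap.newKeySet();

    // called whenever a new physical connection is created
    static void trackConnection(Connection c) {
        refs.add(new ConnectionPhantomRef(c, refQueue));
    }

    // the single cleanup thread: blocks until a collected connection is enqueued
    @Override
    public void run() {
        while (true) {
            try {
                ConnectionPhantomRef ref = (ConnectionPhantomRef) refQueue.remove();
                ref.cleanUp();
                refs.remove(ref);   // only now does the Set shrink
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}

The key point for this incident is the last step: an entry only leaves the Set after its connection has been garbage-collected and processed from the queue, which is exactly why the Set kept growing here.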

Why does Connection grow infinitely?

The problem is now clear: every database connection that is created gets put into connectionFinalizerPhantomRefs. In the early stage the connection is used normally, so it survives several minor GCs and is promoted to the old generation. After a while the connection becomes invalid for some reason, the connection pool creates a new one to replace it, and the cycle repeats.

The connection pool used in our project is Druid. The following is the connection pool configuration:
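The configuration itself is not reproduced here; based on the values discussed below, an equivalent DruidDataSource setup would look roughly like the following sketch (the exact settings and values are assumptions, not the project's actual configuration):

import com.alibaba.druid.pool.DruidDataSource;

public class PoolConfigSketch {
    // Rough reconstruction of the pool settings discussed in this article.
    static DruidDataSource buildDataSource() {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://db-host:3306/app");        // placeholder URL
        ds.setMinIdle(30);                                 // pool keeps at least 30 connections
        ds.setKeepAlive(true);                             // heartbeat idle connections instead of discarding them
        ds.setMinEvictableIdleTimeMillis(5 * 60 * 1000L);  // treat connections idle > 5 minutes as candidates
        ds.setTimeBetweenEvictionRunsMillis(60 * 1000L);   // background check runs every 60 seconds
        return ds;
    }
}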

You can see that keepAlive is enabled and minEvictableIdleTimeMillis is set to 5 minutes. After initialization the pool should maintain a minimum of 30 connections, and as long as DB traffic does not fluctuate sharply, a keepAlive check should be performed once a connection has been idle for more than 5 minutes. In theory the pool should not create connections frequently, unless all of the following hold: there are few active connections, their number fluctuates, and keepAlive does not take effect. If the MySQL connection is already dead by the time the pool runs keepAlive against it, the invalid connection is discarded and a new one is created the next time it is needed.

To test this hypothesis we first looked at the active connection count and found that, on a single instance, the number of active database connections fluctuated between roughly 3 and 20 most of the time; the business also has scheduled tasks that fire a large number of DB requests every 30 minutes and every hour. Druid should be heartbeating idle connections after 5 minutes, so why do connections become invalid? The default wait_timeout on the MySQL server is 8 hours. Had the corresponding configuration been changed?

show global variables like '%timeout%'

Sure enough, the database wait_timeout is set to 5 minutes! Now the problem is obvious.

Problem summary

  1. DB connections are rebuilt frequently, and the associated objects cannot be reclaimed, so memory grows too fast.
  2. At minor GC time the connections are still strongly referenced and cannot be reclaimed, so connectionFinalizerPhantomRefs grows without bound and is only emptied by a FullGC, which takes too long.

Conclusion

  1. Keeping idle connections alive relies on Druid's scheduled keepAlive task for heartbeat checks. By default the task runs every 60 seconds, and a heartbeat is only sent to a connection whose idle time is greater than or equal to minEvictableIdleTimeMillis.
  2. Because minEvictableIdleTimeMillis is set to 5 minutes, an idle connection should in theory get a heartbeat within 5 minutes ± 60 seconds. But since the MySQL server timeout is only 5 minutes, there is a high probability that the connection has already expired by the time Druid performs keepAlive.
  3. Because the number of active connections fluctuates and min-idle is set to 30, the pool has to create and maintain a large number of connections when activity peaks; when activity drops back to a low level, a large number of connections are then removed from the pool because keepAlive fails. And so on, over and over.
  4. Every time a connection is created, a Connection object is put into connectionFinalizerPhantomRefs. Because the connection is active right after creation, it is not reclaimed by minor GC in the short term and is eventually promoted to the old generation, so the Set keeps growing and finally triggers a FullGC with a long pause.

Solution

Knowing the cause of the problem, the solution is simple:

  1. Reduce invalid connections and slow the memory growth: set minEvictableIdleTimeMillis to 3 minutes so that keepAlive fires before the server-side timeout and connections are not constantly re-created.
  2. Reduce FullGC pause time: switch the garbage collector to G1.
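In terms of the earlier configuration sketch, the change amounts to something like this (illustrative only; the real change goes in the project's pool configuration and JVM startup flags):

import com.alibaba.druid.pool.DruidDataSource;

public class PoolConfigFix {
    // Heartbeat idle connections after 3 minutes, well before MySQL's 5-minute wait_timeout.
    // The collector change is a JVM startup flag, e.g. -XX:+UseG1GC.
    static void applyFix(DruidDataSource ds) {
        ds.setMinEvictableIdleTimeMillis(3 * 60 * 1000L);
    }
}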

Further reading

1. Phantom references, weak references, etc.
2. Druid connection pooling
3. The memory analysis tool MAT
4. Analysis of FullGC problems caused by the database connection pool (continued)

[Important] Please indicate the source when reprinting: juejin.cn/post/684490…