Druid Application suspension caused by a Druid connection exception

1. The background

1.1 the phenomenon

2021.12.16 early in the morning, we have taken place in the database for the case of the application of switching, after a continuous error GetConnectionTimeoutException Pod, and the process of the Pod always hangs, unable to provide normal service.

The odd thing is that other pods connected to the same database also reported a few errors (mainly the database connection timeout error), but soon all other pods recovered, except this one.

1.2 Error Information is displayed

Some error information about abnormal Pod is as follows:

[the 2021-12-16 02:10:06. 617] [ERROR] [thread - 30849] com. Alibaba. Druid. Pool. DruidPooledStatement errorCheck: 367 - CommunicationsException, druid version 1.2.5, jdbcUrl: JDBC: mysql: / / XXX: 3306 / XXX? useUnicode=true&characterEncoding=UTF-8&allowMultiQueries=true, testWhileIdle true, idle millis 1785, minIdle 5, poolingCount 0, timeBetweenEvictionRunsMillis 300000, lastValidIdleMillis 1785, driver com.mysql.cj.jdbc.Driver, exceptionSorter com.alibaba.druid.pool.vendor.MySqlExceptionSorter ...... The 2021-12-16 02:10:06. 617] [ERROR] thread - 30849] under Caused by: com. Alibaba. Druid. Pool. GetConnectionTimeoutException: wait millis 100000, active 399, maxActive 100, creating 0 at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1745) at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1415) at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1395) at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1385) at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:100) at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158) at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116) at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:182) ... 26 common frames omittedCopy the code

From the point of error message, GetConnectionTimeoutException obtaining the connection timeout. However, both the database and the network are normal. Why did the database connection time out?

MaxActive = 100; active = 399; active = 399;

In the issues of Druid, GetConnectionTimeoutException is also a very common problem, that this exception of how to produce, and how to avoid?

With these questions in mind, let’s take a closer look at the Druid source code.

2.1.3 Druid configuration

Before analyzing the problem, take a look at the current Druid configuration. Druid’s version is 1.2.5, so I’ll be analyzing the 1.2.5 source code for the rest of this article. Druid’s configuration is as follows:

<bean id="dataSource" class="com.alibaba.druid.pool.DruidDataSource" init-method="init" destroy-method="close"> 
    <property name="url" value="${jdbc_url}" />
    <property name="username" value="${jdbc_user}" />
    <property name="password" value="${jdbc_password}" />
    <property name="maxActive" value="100"/>
    <property name="initialSize" value="1"/>
    <property name="maxWait" value="100000"/>
    <property name="minIdle" value="1"/>
    <property name="timeBetweenEvictionRunsMillis" value="300000"/>
    <property name="validationQuery" value="SELECT 1"/>
    <property name="testWhileIdle" value="true"/>
    <property name="removeAbandoned" value="false"/>
    <property name="removeAbandonedTimeout" value="600"/>
</bean>
Copy the code

2. Key variables

In the error stack, we see pollingCount, active, maxActive, etc. What do these variables mean?

This chapter will take a look at it first. At the same time understand the meaning of these variables, but also more conducive to the subsequent reading of the source code analysis.

2.1 configuration items

Here are some of the key configurations covered in this article. See the DruidDataSource configuration property list for more configurations.

2.1.1 minIdle

The minimum number of connections to be maintained in the connection pool.

2.1.2 maxActive

Maximum number of connections in a connection pool.

2.1.3 maxWait

The maximum waiting time for a single fetch connection.

2.2 DruidDataSource

DruidDataSource is a data source. Each data source has a DruidDataSource instance.

2.2.1 activeCount

Number of current active connections. ActiveCount is incremented each time a connection is successfully acquired and the Holder’s active property is set to true.

private DruidPooledConnection getConnectionInternal(long maxWait) {
    // ...
    if(holder ! =null) {
        if (holder.discard) {
            continue;
        }
        activeCount++;
        holder.active = true; }}Copy the code

2.2.2 connections

The connection pool currently owned by the data source. The type is DruidConnectionHolder[].

2.2.3 poolingCount

The number of connections in the connection pool.

PoolingCount is incrementing each time a connection is placed into the connection pool, for example:

private boolean put(DruidConnectionHolder holder, long createTaskId) {
    // ...
    connections[poolingCount] = holder;
    incrementPoolingCount();
    // ...
}

private final void incrementPoolingCount(a) {
    poolingCount++;
}
Copy the code

Whenever a connection is removed from the pool, the poolingCount is reduced by one, for example:

private DruidConnectionHolder pollLast(long nanos) throws InterruptedException, SQLException {
    // ...
    decrementPoolingCount();
    DruidConnectionHolder last = connections[poolingCount];
    connections[poolingCount] = null;
    // ...
    return last;
}

private final void decrementPoolingCount(a) {
    poolingCount--;
}


Copy the code

2.3 DruidConnectionHolder

Database connection and related properties.

2.3.1 conn

Database Connection of type Connection.

2.3.2 active

Used to indicate whether the connection is active.

2.3.3 lastActiveTimeMillis

Joins the last active time.

2.3.4 lastKeepTimeMillis

Joins the last time it was active. This time is updated each time a connection is kept active through keepAlive.

2. Source code analysis

Next, let’s dig into the source code from the error stack, layer by layer, to find the root of the problem.

2.2.1 How is an exception thrown?

First GetConnectionTimeoutException is thrown by getConnectionInternal method, this method is the main purpose of access to database connection, the current didn’t get to the connection that is, when the holder is null, it throws an exception. Print both the current active and maxActive values.

if (holder == null) {
  // ...
  StringBuilder buf = new StringBuilder(128);
  buf.append("wait millis ")//
    .append(waitNanos / (1000 * 1000))//
    .append(", active ").append(activeCount)//
    .append(", maxActive ").append(maxActive)//
    .append(", creating ").append(creatingCount)//
    ;

  // ...
  String errorMessage = buf.toString();

  if (this.createError ! =null) {
    throw new GetConnectionTimeoutException(errorMessage, createError);
  } else {
    throw newGetConnectionTimeoutException(errorMessage); }}Copy the code

Holder is an instance of DruidConnectionHolder. Why is holder null? Let’s take a look at how the holder is created.

2.2 How are connections created?

Next, look at how the holder is created.

In getConnectionInternal, the holder is created in the for loop and exits the loop once it is successfully created.

When a holder is created, it can be done synchronously or asynchronously.

2.2.1 Creating connections synchronously

if(createScheduler ! =null
    && poolingCount == 0
    && activeCount < maxActive
		&& creatingCountUpdater.get(this) = =0
    && createScheduler instanceof ScheduledThreadPoolExecutor) {
  ScheduledThreadPoolExecutor executor = (ScheduledThreadPoolExecutor) createScheduler;
  if (executor.getQueue().size() > 0) {
    createDirect = true;
    continue; }}Copy the code

To create a connection synchronously, the following conditions must be met:

The thread pool for creating the connection is not configured, that iscreateScheduler ! = null
No connection is currently available in the connection pool, that ispoolingCount == 0
Currently, the number of active connections is less than the maximum number of connections, i.eactiveCount < maxActive
Currently, no connection is being created synchronously. That is, only one thread is creating a connection synchronously at a timecreatingCountUpdater.get(this) == 0

When the above conditions are met, createDirect is set to true and the connection is created synchronously in the next loop. Because of these conditions, fewer threads will enter the process of creating a connection synchronously.

private DruidPooledConnection getConnectionInternal(long maxWait) {
	// ...
    if (createDirect) {
        if (creatingCountUpdater.compareAndSet(this.0.1)) {
            PhysicalConnectionInfo pyConnInfo = DruidDataSource.this.createPhysicalConnection();
            holder = new DruidConnectionHolder(this, pyConnInfo);
            holder.lastActiveTimeMillis = System.currentTimeMillis();
            // ...
            boolean discard = false;
            lock.lock();
            try {
                if (activeCount < maxActive) {
                    activeCount++;
                    holder.active = true;
                    if (activeCount > activePeak) {
                        activePeak = activeCount;
                        activePeakTime = System.currentTimeMillis();
                    }
                    break;
                } else {
                    discard = true; }}finally{ lock.unlock(); }}}}Copy the code

2.2.2 Creating a Connection Asynchronously

If the connection is not created synchronously, it is then created asynchronously through pollLast(Nanos) or takeLast().

Nanos is calculated from maxWait, pollLast(nanOS) is called if maxWait is set, takeLast() is called otherwise.

private DruidPooledConnection getConnectionInternal(long maxWait) {

    final long nanos = TimeUnit.MILLISECONDS.toNanos(maxWait);
    // ...

    if (maxWait > 0) {
        holder = pollLast(nanos);
    } else{ holder = takeLast(); }}Copy the code

2.2.3 `pollLast` 和 `takeLast`

In the pollLast method, if poolingCount is 0 in the current connection pool, emptySignal is called to use the semaphore Empty and an asynchronous thread is created to create the connection.

private DruidConnectionHolder pollLast(long nanos) throws InterruptedException, SQLException {
    long estimate = nanos;

    for (;;) {
        if (poolingCount == 0) {
            emptySignal(); // send signal to CreateThread create connection
            // ...}}}Copy the code

PollLast itself then uses the notEmpty semaphore to enter the wait:

estimate = notEmpty.awaitNanos(estimate); // signal by
                                        // recycle or
                                        // creator
Copy the code

The maximum time estimate for this wait is calculated from maxWait.

It will wake up after a connection has been reclaimed and created. If the thread fails to create a connection after waking up and the total wait time has not reached maxWait, it enters await again.

If maxWait is reached and the connection is not successfully created, poolingCount is 0, then NULL is returned.

if (poolingCount == 0) {
    if (estimate > 0) {
        continue;
    }

    return null;
}
Copy the code

The connection created by the signal is placed into the connection pool connections, and pollLast(Nanos) can retrieve the connection from the pool after receiving a successful signal:

private DruidConnectionHolder pollLast(long nanos) throws InterruptedException, SQLException {
    // ...
    decrementPoolingCount();
    DruidConnectionHolder last = connections[poolingCount];
    connections[poolingCount] = null;
    // ...
    return last;
}
Copy the code

TakeLast (), with no arguments, is much simpler. Its process is almost the same as pollLast(Nanos), except that it does not wait for a timeout and only resumes execution after it has waited for the reclaim thread to wake up.

DruidConnectionHolder takeLast(a) throws InterruptedException, SQLException {
    try {
        while (poolingCount == 0) {
            emptySignal(); // send signal to CreateThread create connection

            try {
                notEmpty.await(); // signal by recycle or creator
            } 
        }
    }

    decrementPoolingCount();
    DruidConnectionHolder last = connections[poolingCount];
    connections[poolingCount] = null;
    return last;
}
Copy the code

2.2.4 emptySignal

The emptySignal() method that creates a connection asynchronously using signals reads as follows:

private void emptySignal(a) {
    if (createScheduler == null) {
        empty.signal();
        return;
    }

    if (createTaskCount >= maxCreateTaskCount) {
        return;
    }

    if (activeCount + poolingCount + createTaskCount >= maxActive) {
        return;
    }
    submitCreateTask(false);
}
Copy the code

If the creation thread pool createScheduler is set to NULL, a signal is sent to create a connection. Otherwise, create a CreateConnectionTask to create the connection.

The creation signal sent here is received by the CreateConnectionThread, which is created when the data source is initialized.

public void init(a) throws SQLException {
    // ...
    createAndStartCreatorThread();
    createAndStartDestroyThread();
}
Copy the code

After receiving the signal, the system checks whether the current number of active connections and the number of connections in the connection pool are greater than maxActive. If the number is greater than maxActive, the system does not create connections. If less, the physical connection is created and then put into the connection pool by calling DruidDataSource’s PUT method.

public void run(a) {

    for (;;) {
        // ...
        try {
            boolean emptyWait = true;

            if(createError ! =null
                && poolingCount == 0
                && !discardChanged) {
                emptyWait = false;
            }

            if (emptyWait
                && asyncInit && createCount < initialSize) {
                emptyWait = false;
            }

            if (emptyWait) {
                // There must be a thread wait before a connection is created
                if (poolingCount >= notEmptyWaitThreadCount //&& (! (keepAlive && activeCount + poolingCount < minIdle)) && ! isFailContinuous() ) { empty.await(); }// Prevent the creation of more connections than maxActive
                if (activeCount + poolingCount >= maxActive) {
                    empty.await();
                    continue; }}}// ...
        PhysicalConnectionInfo connection = null;

        try {
            connection = createPhysicalConnection();
        } 
        // ...
        booleanresult = put(connection); }}Copy the code

CreateConnectionTask has similar logic.

2.2.5 DruidDataSource# put

In the PUT method, it first determines whether the current number of connections is greater than the maximum number, if less, it puts the connection into Connections, and then increments the poolingCount value by one.

private boolean put(DruidConnectionHolder holder, long createTaskId) {
    if (poolingCount >= maxActive) {
        if(createScheduler ! =null) {
            clearCreateTask(createTaskId);
        }
        return false;
    }
    // ...
    connections[poolingCount] = holder;
    incrementPoolingCount();
    / /...
}
Copy the code

2.2.6 summary

This is the basic connection creation process.

In general, when an application needs to execute SQL, it uses getConnectionInternal to fetch connections, which can be created synchronously or asynchronously, or asynchronously via semaphores. Once the connection is successfully acquired, activeCount is incremented by one.

Combined with the exception information in 1.2.1, poolingCount is 0 when an error is reported, so emptySignal() is called to create the connection. However, activeCount + poolingCount + createTaskCount is obviously greater than maxActive, so pollLast will eventually wait for maxWait and then timeout. So eventually raise the GetConnectionTimeoutException getConnectionInternal anomalies.

So the key question then is, why is activeCount greater than maxActive?

2.3 How is the connection released?

To answer this question, we need to know when activeCount decreases.

It has been known that activeCount increases when a connection is acquired, and accordingly, activeCount decreases when a connection is released. So let’s take a closer look at how connections are released.

2.3.1 The Application Actively disconnects the Connection

When it comes to connection release, the first thing that comes to mind is that we actively call connection.close() in our application to release the connection.

Does connection.close() immediately release the physical connection? Obviously not.

public void close(a) throws SQLException {
    // ...
    recycle();
}

public void recycle(a) throws SQLException {
    // ...
    if (!this.abandoned) {
        DruidAbstractDataSource dataSource = holder.getDataSource();
        dataSource.recycle(this); }}Copy the code

Connection.close () will eventually call dataSource. Recycle (DruidPooledConnection pooledConnection) to release the connection. What about dataSource. Recycle?

2.3.2 recycle

In recyle, there are a series of connection state judgments, such as whether the physical connection is closed or not, and then decide whether to release the connection immediately. Most of the time, however, this will run to the logic that if the connection state is active, subtract activeCount by one and set the state back to false. And then putLast.


/** * reclaim connection */
protected void recycle(DruidPooledConnection pooledConnection) throws SQLException {
    // ...
    if (holder.active) {
        activeCount--;
        holder.active = false;
    }
    result = putLast(holder, currentTimeMillis);
    // ...
}
Copy the code

PutLast updates the last active time of the connection, then puts the connection back into the connection pool and increments poolingCount by one.

boolean putLast(DruidConnectionHolder e, long lastActiveTimeMillis) {
    if (poolingCount >= maxActive || e.discard) {
        return false;
    }

    e.lastActiveTimeMillis = lastActiveTimeMillis;
    connections[poolingCount] = e;
    incrementPoolingCount();
    // ...
    return true;
}
Copy the code

That is, actively releasing a connection through the.close method does not actually release the connection, but rather returns it to the connection pool.

2.3.3 Connection Release

Recycle does not close a connection, but the connection pool cannot be infinitely large. This brings us to Druid’s connection release task.

Like CreateConnectionThread, Druid creates a DestroyThread to release the connection when it initializes the data source, and then creates a DestroyTask to release the connection.

DestroyTask is run in destroyScheduler when the destroyScheduler thread pool that the application sets is not empty. Otherwise, a thread pool that consumes connections is created, and the task of releasing the connection is run in that thread pool. Similar to createScheduler, in most cases we will not set destroyScheduler.

protected void createAndStartDestroyThread(a) {
    destroyTask = new DestroyTask();

    if(destroyScheduler ! =null) {
        // ...
        destroySchedulerFuture = destroyScheduler.scheduleAtFixedRate(destroyTask, period, period,
                                                                      TimeUnit.MILLISECONDS);
        return;
    }
    // ...
    destroyConnectionThread = new DestroyConnectionThread(threadName);
    destroyConnectionThread.start();
}
Copy the code

DestroyConnectionThread is essentially an infinite loop, in a loop every timeBetweenEvictionRunsMillis perform a mission to release the connection.

public class DestroyConnectionThread extends Thread {
    // ...
    public void run(a) {
        for (;;) {
            // Delete from the front
            try {
                // ...
                if (timeBetweenEvictionRunsMillis > 0) {
                    Thread.sleep(timeBetweenEvictionRunsMillis);
                } else {
                    Thread.sleep(1000); //
                }

                destroyTask.run();
            } catch (InterruptedException e) {
                break; }}}}Copy the code

The key logic for releasing the connection is in DestroyTask.

2.3.4 DestroyTask

In DestroyTask, the main way to release the connection is through the Shrink method.

    public class DestroyTask implements Runnable {
        public DestroyTask(a) {}@Override
        public void run(a) {
            shrink(true, keepAlive);

            if(isRemoveAbandoned()) { removeAbandoned(); }}}Copy the code

So the next key is the shrink(Boolean checkTime, Boolean keepAlive) method.

2.3.5 DestroyTask# the shrink

Shrink takes two parameters:

checkTimeWhether to verify connection time, such as connection idle time;
keepAliveWhether keepAlive is enabled

In Shrink, all connections in the connection pool are first iterated and processed through the for loop.

In the loop, it first determines whether there is currently a fatalError, and if so, it puts the connection into a list of keepAliveConnections, which stores the connections that need to be kept active. When does a fatalError occur? It may occur during SQL execution, such as database network disconnection or database restart.

for (int i = 0; i < poolingCount; ++i) {
    DruidConnectionHolder connection = connections[i];
    if ((onFatalError || fatalErrorIncrement > 0) && (lastFatalErrorTimeMillis > connection.connectTimeMillis))  {
        keepAliveConnections[keepAliveCount++] = connection;
        continue; }}Copy the code

If fatalError is not present and checkTime is true is turned on, it continues to determine whether the connection should be placed in the keepAliveConnections or evictConnections queue based on the connection time.

The specific logic is as follows:

Check whether the physical connection times out. If it does, put evictConnections. The default phyTimeoutMillis is- 1 ；
And then determine whether connection idle time at the same time less than minEvictableIdleTimeMillis and keepAliveBetweenTimeMillis, if does not meet the conditions, retreat to the release of the task, also is other connection will not be released in the connection pool;
If the connection is idle time is greater than the minEvictableIdleTimeMillis, judge the connection release should be released, the judgment condition isi < checkCount, that is, if the current number of connection pools is smaller than minIdle, the connections in the front of the pool will not be released, but will be released from the back of the pool.
If does not meet the conditions 3, and open the keepAlive, connection idle time is greater than keepAliveBetweenTimeMillis, will be connected into the keepAliveConnections queue. Combined with condition 3, that is, the connections in front of the pool are kept active.

final int checkCount = poolingCount - minIdle;
// ...
for (int i = 0; i < poolingCount; ++i) {
    // ...
    if (phyTimeoutMillis > 0) {
	    long phyConnectTimeMillis = currentTimeMillis - connection.connectTimeMillis;
    	if (phyConnectTimeMillis > phyTimeoutMillis) {
        	evictConnections[evictCount++] = connection;
	        continue; }}long idleMillis = currentTimeMillis - connection.lastActiveTimeMillis;

    if (idleMillis < minEvictableIdleTimeMillis
        && idleMillis < keepAliveBetweenTimeMillis
    ) {
        break;
    }

    if (idleMillis >= minEvictableIdleTimeMillis) {
        if (checkTime && i < checkCount) {
            evictConnections[evictCount++] = connection;
            continue;
        } else if (idleMillis > maxEvictableIdleTimeMillis) {
            evictConnections[evictCount++] = connection;
            continue; }}if(keepAlive && idleMillis >= keepAliveBetweenTimeMillis) { keepAliveConnections[keepAliveCount++] = connection; }}Copy the code

Normally, connections are arranged in lastActiveTimeMillis order, so when releasing connections, release them from front to back, Once found connection free time is less than minEvictableIdleTimeMillis keepAliveBetweenTimeMillis ends and the release process, so that you can reduce cycle times, improve the performance. And when a connection is saved, it is also saved from front to back, because the previous connection is more likely to time out. That’s where Druid’s design comes in.

Of course, this is the “normal case”, so there must be an anomaly. What is the anomaly? I’m not going to go into that for the moment, but I’ll go into that later.

Once you have determined how many connections you need to release and keep active, the next step is to remove connections from the pool by removing the first removeCount connections from connections and then placing the remaining connections back into connections. Finally, null is used to populate connections so that future connections can be stored alive.

int removeCount = evictCount + keepAliveCount;
if (removeCount > 0) {
    System.arraycopy(connections, removeCount, connections, 0, poolingCount - removeCount);
    Arrays.fill(connections, poolingCount - removeCount, poolingCount, null);
    poolingCount -= removeCount;
}

Copy the code

You can then close the connection in evictConnections.

DruidConnectionHolder item = evictConnections[i];
Connection connection = item.getConnection();
JdbcUtils.close(connection);
destroyCountUpdater.incrementAndGet(this);
Copy the code

For connections in keepAliveConnections, it determines whether the connection is available, and if so, it calls the PUT method to put the connection back into the connection pool. Unlike putLast, which was called earlier when the active release of the connection was applied, put does not update the lastActiveTimeMillis of the connection.

If the connection is not available, then call emptySignal() to create a new connection through the signal.

DruidConnectionHolder holer = keepAliveConnections[i];
Connection connection = holer.getConnection();
holer.incrementKeepAliveCheckCount();

boolean validate = false;
try {
    this.validateConnection(connection);
    validate = true;
} catch (Throwable error) {
    // skip
}

booleandiscard = ! validate;if (validate) {
    holer.lastKeepTimeMillis = System.currentTimeMillis();
    boolean putOk = put(holer, 0L);
    if(! putOk) { discard =true; }}if (discard) {
    try {
        connection.close();
    }
    
    // ...
    try {

        if(activeCount + poolingCount <= minIdle) { emptySignal(); }}// ...
}
Copy the code

When the connection is preserved, the earlier connection is preferentially preserved. But what if we expect a connection to be idle for less than a minute, for example, and then release it even if it becomes available?

TestWhileIdle Druid takes this into account by setting testWhileIdle to true.

2.3.6 Idle Connections Are released

In our application we get the connection through getConnection, which calls getConnectionDirect, which in turn calls getConnectionInternal.

public DruidPooledConnection getConnectionDirect(long maxWaitMillis) throws SQLException {
    	// ...
        for (;;) {
            // handle notFullTimeoutRetry
            DruidPooledConnection poolableConnection;
            try {
                poolableConnection = getConnectionInternal(maxWaitMillis);
            }
            
            if (testOnBorrow) {
                // ...
            } else {
                // ...
                if (testWhileIdle) {
                    final DruidConnectionHolder holder = poolableConnection.holder;
                    long currentTimeMillis             = System.currentTimeMillis();
                    long lastActiveTimeMillis          = holder.lastActiveTimeMillis;
                    long lastExecTimeMillis            = holder.lastExecTimeMillis;
                    long lastKeepTimeMillis            = holder.lastKeepTimeMillis;

                    if(checkExecuteTime && lastExecTimeMillis ! = lastActiveTimeMillis) { lastActiveTimeMillis = lastExecTimeMillis; }if (lastKeepTimeMillis > lastActiveTimeMillis) {
                        lastActiveTimeMillis = lastKeepTimeMillis;
                    }

                    long idleMillis                    = currentTimeMillis - lastActiveTimeMillis;

                    long timeBetweenEvictionRunsMillis = this.timeBetweenEvictionRunsMillis;

                    if (timeBetweenEvictionRunsMillis <= 0) {
                        timeBetweenEvictionRunsMillis = DEFAULT_TIME_BETWEEN_EVICTION_RUNS_MILLIS;
                    }

                    if (idleMillis >= timeBetweenEvictionRunsMillis
                            || idleMillis < 0 // unexcepted branch
                            ) {
                        boolean validate = testConnectionInternal(poolableConnection.holder, poolableConnection.conn);
                        if(! validate) { discardConnection(poolableConnection.holder);continue; }}}}returnpoolableConnection; }}Copy the code

In getConnectionDirect, after a connection is acquired, testWhileIdle continues to determine if the connection is idle if it is, discardConnection is called to release the connection, and a new connection is created through the next loop. Connection is idle, mainly according to lastActiveTimeMillis and timeBetweenEvictionRunsMillis judgment.

In discardConnection, if the active state of the connection is true, activeCount is subtracted by one and active is set to false.

public void discardConnection(DruidConnectionHolder holder) {
    // ...
    Connection conn = holder.getConnection();
    if(conn ! =null) {
        JdbcUtils.close(conn);
    }

    lock.lock();
    try {
        if (holder.discard) {
            return;
        }

        if (holder.active) {
            activeCount--;
            holder.active = false;
        }
        discardCount++;

        holder.discard = true;

        if(activeCount <= minIdle) { emptySignal(); }}finally{ lock.unlock(); }}Copy the code

2.3.7 summary

This is the process of connection release.

When the application actively calls.close, it does not really connect or not, but updates the lastActiveTimeMillis of the connection, then returns the connection to the connection pool, and then decreases the activeCount by one.

Then DestroyTask periodically cleans up the connection and determines whether it should be released or saved. If the connection should be saved, the connection is put back into the connection pool.

In addition, when the application obtains the connection, it will also judge the idle time of the connection. If the connection is idle, it will release the connection and reduce the activeCount by one.

So why is activeCount greater than maxActive? Perhaps the only explanation is that activeCount is inaccurate.

Conclusion first, yes. This is a BUG in Druid prior to 1.2.8.

Based on the previous analysis, we can roughly infer that when there are two identical connections in the connection pool, the activeCount will decrease by one. When the application releases the connection for the first time, it sets the active value of the connection to false and activeCount– is executed. When the application releases the connection for the second time, it does not execute activeCount– because active is already false.

So how do you prove the conjecture?

3. Problems recur

To prove the above hypothesis, I wrote a code to verify it.

3.1 Duplicate code

Druid version 1.2.5:

public class Test {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:mysql://localhost:3306/test? useUnicode=true&characterEncoding=UTF-8&allowMultiQueries=true";
        String user = "root";
        String password = "11111111";

        DruidDataSource druidDataSource = new DruidDataSource();
        druidDataSource.setUsername(user);
        druidDataSource.setPassword(password);
        druidDataSource.setUrl(jdbcUrl);
        druidDataSource.setValidationQuery("select 1");
        druidDataSource.setMinEvictableIdleTimeMillis(10 * 1000);
        druidDataSource.setMaxEvictableIdleTimeMillis(DEFAULT_MAX_EVICTABLE_IDLE_TIME_MILLIS);
        druidDataSource.setKeepAliveBetweenTimeMillis(12 * 1000);
        druidDataSource.setMinIdle(2);
        druidDataSource.setMaxWait(1000);
        druidDataSource.setMaxActive(4);
        druidDataSource.setTimeBetweenEvictionRunsMillis(7 * 1000);
        druidDataSource.setKeepAlive(true);
        DruidPooledConnection connection1 = druidDataSource.getConnection();
        DruidPooledConnection connection2 = druidDataSource.getConnection();
        connection2.close();
        Thread.sleep(9 * 1000);
        connection1.close();
        Thread.sleep(14 * 1000); DruidPooledConnection connection3 = druidDataSource.getConnection(); DruidPooledConnection connection4 = druidDataSource.getConnection(); System.out.println(connection3.getConnectionHolder() == connection4.getConnectionHolder()); connection3.close(); connection4.close(); }}Copy the code

The final output is true, which means that connection3 and connection4 get the same connection.

Why is it true? Next, combined with the previous source code understanding, comb through the entire timeline.

3.2 a time line

3.2.1 Point 00: Initialize and close connection2

Get connection1 and connection2 with the following holder:

connection1 {id: 110, useCount: 1, lastActiveTime: 00, active: true}
connection2 {id: 100, useCount: 1, lastActiveTime: 00, active: true}

Then close connection2 with connections set to [connection2] :

connection2 {id: 100, useCount: 1, lastActiveTime: 00, active: false}

The final activeAcount is 1.

3.2.2 Time point 07: First collection

DestroyTask performed for the first time, at this time are less than minEvictableIdleTimeMillis connection2 free time and keepAliveBetweenTimeMillis, skip the recycling.

3.2.3 Time point 09: Disable Connection1

If connection1 is disabled,connection1 returns the connection pool, and connections is [connection2,connection1].

connection2 {id: 100, useCount: 1, lastActiveTime: 00, active: false}
connection1 {id: 110, useCount: 1, lastActiveTime: 09, active: false}

The final activeAcount is 0.

3.2.4 Time point 14: Second collection

DestroyTask execution connection2 leisure time is greater than the minEvictableIdleTimeMillis at this time, but checkCount 0 (poolingCount – minIdle), If I < checkCount is not met, connection2 is placed in the keepAliveConnections list.

And connection1 idle time are less than minEvictableIdleTimeMillis and keepAliveBetweenTimeMillis, not processing.

Connection1,connection2

connection1 {id: 110, useCount: 1, lastActiveTime: 09, active: false}
connection2 {id: 100, useCount: 1, lastActiveTime: 00, lastKeepTime: 14, active: false}

In this case, activeAcount is still 0. But lastActiveTime in Connections is not chronological.

3.2.5 Time point 21: The third collection

DestroyTask executes again, and Connection1 is idle for 11s. Because the conditions are not met, neither evictConnections nor keepAliveConnections are placed. Connection2 free time for 21 s, more than keepAliveBetweenTimeMillis, therefore put keepAliveConnections.

Since removeCount is 1 (removeCount = evictCount + keepAliveCount), the 0th element of connections, connection1, will be removed. Then add keepAliveConnections to Connections.

Connection2,connection2

connection2 {id: 100, useCount: 1, lastActiveTime: 00, lastKeepTime: 14, active: false}
connection2 {id: 100, useCount: 1, lastActiveTime: 00, lastKeepTime: 14, active: false}

3.2.6 Point 23: Connection3 is obtained

Continue to fetch conenction3, preferably from the end of connections.

After obtaining the connection whose ID is 100, continue to determine the idle time of the connection. Because of less than lastKeepTime lastActiveTime lastKeepTime will be used to calculate the free time, leisure time is 9, less than timeBetweenEvictionRunsMillis, therefore, that connection is available, and return.

Conenction3 is {id: 100, useCount: 2, lastActiveTime: 00, lastKeepTime: 14, active: true}

At the same time, activeCount will be increased by one, and finally activeCount is 1.

Now connections is:

connection2 {id: 100, useCount: 1, lastActiveTime: 00, lastKeepTime: 14, active: false}

3.2.7 Point 23: Obtain connection4

{id: 100, useCount: 3, lastActiveTime: 00, lastKeepTime: 14, active: {id: 100, useCount: 3, lastActiveTime: 00, lastKeepTime: 14, active: True}, and will increment activeCount by one, finally activeCount is 2.

So connection3 and connection4 are essentially the same holder.

Also, connections is empty at this point.

3.2.8 Point 23: Disable Connection3

Then call the close method to close connection3, return the connection to the connection pool, and subtract one from the activeCount. Connections is [connections3] :

connection3 {id: 100, useCount: 3, lastActiveTime: 23, lastKeepTime: 14, active: false}

The final activeCount is 1, and the Active property of the holder is false.

3.2.9 Time point 23: Disable Connection4

After closing Connection3, close Connection4 immediately, but since its holder active is already false, activeCount does not decrease by one.

Eventually the connection will be released, but activeCount is reduced one less time.

3.3 summary

Under certain conditions, the keepAlive mechanism can put a connection back into the connection pool, resulting in multiple identical connection holders in connections, which can be acquired by multiple application threads and then closed. Since the first thread to close has marked the holder’s active as false, other threads do not subtract one from activeCount when they close the connection. Finally, activeCount is inaccurate.

When will this BUG be triggered? This BUG is a small probability event. The probability of BUG occurrence will be increased in the following situations:

Database exceptions, such as network interruption, active/standby switchover, and abnormal restart, may result in a fatalError. In this case, all the links in the shrink method are placed into keepAliveConnections, resulting in multiple identical holders in the connection pool.
TimeBetweenEvictionRunsMillis than keepAliveBetweenTimeMillis, may lead to obtain connection, the connection idle time wrong, finally to the normal connection cannot be recycled;
MinEvictableIdleTimeMillis less than keepAliveBetweenTimeMillis, may lead to obtain connection, the connection idle time wrong, finally to the normal connection cannot be recycled.

4. Troubleshooting

So how can this be fixed? Just make sure there are no two identical holders in Connections. Since the problem is triggered by the keepAlive mechanism when the connection is put back, we can add a judgment here that if the connection is already in the connection pool, it will not be put in.

Druid /compare/1.2.5… 1.2.8.

So the ultimate fix for our application is to upgrade Druid to 1.2.8.

5. To summarize

Through in-depth analysis of source code layer by layer, we finally figured out why the application would be suspended after reporting errors, why activeCount is not accurate, and how to solve it.

It appears that the connection has timed out, but in fact the activeCount calculation is not accurate so that a new connection cannot be created.

The reason for inaccurate calculation of activeCount is that there are multiple identical connections in the connection pool. Therefore, when the connection is released, the same connection can only be released once with activeCount–.

There may be multiple identical connections in the connection pool. The main reason is that keepAlive repeatedly adds connections to the connection pool when the database is abnormal or Druid parameters are incorrectly configured. When obtaining the connection again, it happened that the idle time of the connection was incorrectly judged, so the connection was not released but returned to the application.

This is essentially a BUG from before Druid 1.2.8, triggered by a variety of coincidencies such as database exceptions and misjudgment of free time due to configuration. However, it is difficult to locate a problem because the problem can be solved by restarting it.

Finally, upgrade Druid to 1.2.8 to solve the problem for good. Also note the following Druid configuration:

Added connectTimeout and socketTimeout configurations for JDBC URL
Set the maxWait
Less than keepAliveBetweenTimeMillis timeBetweenEvictionRunsMillis
Less than minEvictableIdleTimeMillis keepAliveBetweenTimeMillis

This concludes the article. Thanks for reading!

reference

Github.com/alibaba/dru…
Github.com/alibaba/dru…
Github.com/alibaba/dru…