Recently, after our production database traffic was moved behind HAProxy, we saw many database connection failures. The cause turned out to be a bug in how Druid checks MySQL replication connections. While preparing to file an issue, I found that many people had run into the same problem, so I decided to write it up. Without further ado, let's walk through the problem and how we tracked down the solution.

The DBA at my previous company rolled out a new HAProxy setup to replace the VIP-based high-availability scheme, so we switched our connections from the VIP to HAProxy. Shortly after going live, database connection errors started to appear in production, most often in the early hours of the morning.

com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure 

Experienced Java developers will recognize this error, and its cause is usually easy to pin down: the server side actively closed the connection between the client and the database, and the client, unaware of this, kept using the dead connection to send requests, which then failed.

MySQL has a wait_timeout setting that automatically closes connections that have been idle longer than its value (8 hours by default). We ruled this out first: the configuration of the physical database had not changed, and the minimum idle time configured for the connection pool in our code is much smaller than the database's wait_timeout.

So could HAProxy be the one closing our connections? That seemed the most likely explanation, so we looked at the HAProxy configuration:

timeout connect 60s   # timeout for HAProxy to forward a client request to the back-end server
timeout client 120s   # timeout for an inactive client
timeout server 120s   # timeout for waiting on the server after the client has connected to it

We found that HAProxy actively closes connections that have been idle for more than a minute, so we changed the Druid configuration: we set timeBetweenEvictionRunsMillis, the interval at which idle connections are validated, to 50s (rather than exactly 60s, because at 60s a connection could still slip past the limit in edge cases and not be kept alive). After testing, however, the problem was still there.

With the keepalive interval below 60 seconds, why were connections still being dropped? On reflection, there were only two possible causes:

  • The connection maintained by HAProxy is faulty

  • The keepalive policy is not taking effect

Following this line of thought, we first ruled out HAProxy itself. (In short, HAProxy maintains the session between client and server so that the client-to-HAProxy connection stays consistent with the HAProxy-to-server connection; their idle times are always the same.) So was the keepalive policy not working? We use Druid to manage our database connection pool, so to answer that we need to look at how Druid works and how it performs its liveness checks.

The diagram above shows how Druid obtains a connection from the pool. During initialization, Druid creates two daemon threads, one responsible for creating connections and one for destroying them. When a user thread is waiting for a connection (and the number of connections in the pool is below the maximum number of active connections), the creator thread automatically creates a new connection and puts it into the pool, so a user thread that needs a connection can take it straight from the pool. After taking a connection, the user thread decides, based on the user's configuration, whether to validate it. If the connection is valid it is returned; if not, it is closed (DestroyConnectionThread automatically reclaims closed connections) and the thread tries to take another connection from the pool, repeating until it obtains a valid one. Let's look at the code that implements this.
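The borrow-and-validate loop described above can be approximated with a small model. This is only a sketch under simplifying assumptions: ValidatingPool and its members are hypothetical names, not Druid's API, and the creator and destroyer threads are reduced to a factory callback and a counter.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Simplified, single-threaded model of a validating pool (hypothetical names, not Druid's API). */
final class ValidatingPool<C> {
    private final Deque<C> idle = new ArrayDeque<>();
    private final Supplier<C> factory;     // stands in for the connection-creating thread
    private final Predicate<C> validator;  // stands in for the configured validity check
    private int discarded = 0;             // stands in for connections handed to the destroy thread

    ValidatingPool(Supplier<C> factory, Predicate<C> validator) {
        this.factory = factory;
        this.validator = validator;
    }

    /** Puts an idle connection into the pool. */
    void addIdle(C conn) {
        idle.addLast(conn);
    }

    /**
     * Takes connections until one passes validation. Invalid ones are discarded
     * and the loop retries; if the pool is empty, a new connection is created.
     * Note: this loops forever if the factory only ever produces invalid connections.
     */
    C borrow() {
        while (true) {
            C conn = idle.pollFirst();
            if (conn == null) {
                conn = factory.get();
            }
            if (validator.test(conn)) {
                return conn;
            }
            discarded++;
        }
    }

    int discardedCount() {
        return discarded;
    }
}
```

The point of the model is that a borrowed connection is only as trustworthy as the validator: if the check does not exercise the physical connection that later carries the query, a dead connection passes straight through this loop.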

Through debugging, I found that clazz.isAssignableFrom(conn.getClass()) is false; in other words, conn here is not a com.mysql.jdbc.MySQLConnection. Because the project uses read/write splitting, its database driver is ReplicationDriver (jdbc:mysql:replication://) rather than the default Driver, so the connection is a com.mysql.jdbc.ReplicationConnection. ReplicationConnection implements com.mysql.jdbc.Connection directly and does not inherit from com.mysql.jdbc.MySQLConnection. (This applies only to mysql-connector-java versions before 5.1.38; we were running 5.1.35 in production.)

In mysql-connector-java versions before 5.1.38, com.mysql.jdbc.ReplicationConnection (ReplicationConnection source) is declared as public class ReplicationConnection implements Connection, PingTarget.

Starting with 5.1.38 (GitHub source), com.mysql.jdbc.ReplicationConnection became an interface that extends com.mysql.jdbc.MySQLConnection. Its implementation is the subclass com.mysql.jdbc.JDBC4ReplicationMySQLConnection (JDBC4ReplicationMySQLConnection source), whose work is done internally by the newly added proxy class com.mysql.jdbc.ReplicationConnectionProxy, which takes over what the old com.mysql.jdbc.ReplicationConnection class used to do.

5.1.35 version

5.1.47 version

So here, whatever value usePingMethod is set to, MySqlValidConnectionChecker always ends up executing SELECT 1.
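The effect of the failed type check can be reproduced with stand-in types. The interfaces and classes below are hypothetical placeholders that only mimic the shape of the driver hierarchy described above; they are not the real Connector/J or Druid classes, and the strategy method is a simplified reading of the branch, not Druid's source.

```java
/** Stand-ins for the driver type hierarchy; these are NOT the real driver classes. */
interface DriverConnection {}                               // plays com.mysql.jdbc.Connection
interface MySqlLikeConnection extends DriverConnection {}   // plays com.mysql.jdbc.MySQLConnection
final class PlainConnection implements MySqlLikeConnection {}        // an ordinary connection
final class OldReplicationConnection implements DriverConnection {}  // older replication connection shape

final class CheckerDecision {
    /** Pings only if ping is enabled AND the connection passes the type check. */
    static String strategy(boolean usePingMethod, Class<?> expected, Object conn) {
        if (usePingMethod && expected.isAssignableFrom(conn.getClass())) {
            return "PING";
        }
        // Everything else falls through to the validation query.
        return "SELECT 1";
    }
}
```

With these stand-ins, strategy(true, MySqlLikeConnection.class, new OldReplicationConnection()) yields "SELECT 1" even though ping is enabled, because the replication type sits beside the expected interface in the hierarchy rather than under it.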

To see what actually happens at runtime, enable the MySQL general log:

set global general_log = on; 

We get the following log (I kept only the core entries and masked sensitive information).

It is easy to see that the business queries and the check queries are executed on different threads, so the connection used for the check is not the same database connection that executes business operations. The business connection is never kept alive and its idle time is never refreshed, so once it goes unused for long enough it is closed and becomes unusable.
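The timing argument can be made concrete with a toy model. It assumes the one-minute idle limit and the 50-second validation interval mentioned earlier, and it ignores network latency; IdleTimeoutModel and its members are illustrative names only.

```java
/** Toy model of a proxy idle timeout; names and numbers are illustrative. */
final class IdleTimeoutModel {
    static final long PROXY_IDLE_TIMEOUT_MS = 60_000;  // idle limit described in the text
    static final long VALIDATION_INTERVAL_MS = 50_000; // keepalive interval described in the text

    /**
     * Returns true if a connection last used for business traffic at time 0 is
     * still open at nowMs, depending on whether the periodic validation query
     * actually travels over this physical connection.
     */
    static boolean stillOpen(long nowMs, boolean validationUsesThisConnection) {
        long lastActive = 0;
        if (validationUsesThisConnection) {
            // The most recent validation at or before nowMs refreshed the idle timer.
            lastActive = (nowMs / VALIDATION_INTERVAL_MS) * VALIDATION_INTERVAL_MS;
        }
        return nowMs - lastActive <= PROXY_IDLE_TIMEOUT_MS;
    }
}
```

In this model a connection that carries the validation query is never idle for more than 50 seconds, while one that validation bypasses counts as closed as soon as more than 60 seconds pass without business traffic, which matches the failures seen during quiet hours.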

With the cause identified, the solution is relatively simple: newer versions of Druid support a custom ValidConnectionChecker.
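The author's checker was published as an image, so the following is only a sketch of the kind of logic such a checker might contain, not the original code. It relies on two assumptions: that the raw driver connection (not a pool wrapper) is passed in, and that the driver class exposes a public no-argument ping() method, as Connector/J 5.1 connections do. PingFirstValidation is a hypothetical helper name; in a real setup this logic would sit inside a class implementing Druid's ValidConnectionChecker interface.

```java
import java.lang.reflect.Method;
import java.sql.Connection;
import java.sql.Statement;

/** Hypothetical helper; not part of Druid or Connector/J. */
final class PingFirstValidation {
    /**
     * Calls a public no-arg ping() on the driver connection if its class has one;
     * otherwise falls back to running the validation query.
     * Assumes rawConn is the unwrapped driver connection.
     */
    static boolean isValid(Connection rawConn, String fallbackQuery) {
        try {
            Method ping;
            try {
                ping = rawConn.getClass().getMethod("ping");
            } catch (NoSuchMethodException e) {
                ping = null; // this connection type has no ping(); use the query instead
            }
            if (ping != null) {
                ping.invoke(rawConn);
                return true;
            }
            try (Statement st = rawConn.createStatement()) {
                st.execute(fallbackQuery);
                return true;
            }
        } catch (Exception e) {
            // Either ping() or the validation query failed: treat the connection as invalid.
            return false;
        }
    }
}
```

The design choice here is to let the connection object validate itself rather than assuming a particular class: to my understanding, the replication connection's own ping() checks its underlying master and slave connections, which is what keeps the connection used by business queries alive.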

Due to WeChat's layout limitations, I converted the code into pictures. If you want to see the complete source code, please visit my blog: blog.sunwaiting.com