The incident

One sunny afternoon, programmer Xiao Qi was, as usual, busy writing bugs…

Write the code

Suddenly a message arrived from the customer service team: they were receiving a flood of user complaints that the page would not open. Xiao Qi's heart sank. He had just released new code for a new feature; could that code be the culprit?!

Pretend not to see

First, switch traffic over to the standby database and roll back the code. Then check the error log, which shows the database connection pool reporting a large number of timeout errors. This situation generally has two possible causes:

  • One is that the database, or the network connecting to it, has suffered some kind of failure, so connections cannot be established before the timeout is reached;
  • The other is that a large number of threads are executing a slow query: the old threads are still running it, and new threads wait so long for a connection that they time out.

The fault was eventually traced to a slow query in the database: a high-frequency query hit no index, causing a full table scan that took at least one second. As a result, query requests piled up and timed out.

Analysis

Determined to learn from the pain, Xiao Qi decided to reproduce the fault locally.

Start with a very simple demo table, with a composite index on (age, score):

create table demo
(
    id    int auto_increment
        primary key,
    name  varchar(255) null,
    age   int          null,
    score int          null
);

create index idx_age_score
    on demo (age, score);

Enable slow SQL logging:

SET GLOBAL slow_query_log=1;
-- the default long_query_time threshold is 10 seconds; lower it
-- so that queries around 1 second are actually recorded
SET GLOBAL long_query_time=1;

Then use Python to generate a SQL file containing 5 million rows of random data, roughly the same magnitude as the problematic production table:

import random

if __name__ == '__main__':
    sql_file = open('./batch_jq.sql', 'w', encoding='utf-8')
    # arbitrary fragments used to assemble random names
    a1 = ['张', '王', '李', '孙', '赵']
    a2 = ['玉', '明', '龙', '芳', '军', '玲']
    a3 = ['', '军', '玲', '', '德', '']
    _len = 5000  # 5K batches
    while _len >= 1:
        line = 'insert into demo(name, age, score) values '
        arr = []
        # insert 1K rows per batch
        for i in range(1, 1001):
            name = random.choice(a1) + random.choice(a2) + random.choice(a3)
            arr.append((name, random.randint(1, 100), random.randint(1, 10000000)))
        _sql = line + str(arr).strip('[]')
        sql_file.write(_sql + ';\n')
        _len -= 1
    sql_file.close()

PS: Batch inserts are used here instead of inserting rows one by one, which makes running the SQL file noticeably faster.

Then run the SQL file to insert the 5 million rows:

...
[2020-04-19 20:05:22] 24000 row(s) affected in 636 ms
...
[2020-04-19 20:05:23] 24000 row(s) affected in 638 ms
...
[2020-04-19 20:05:23] 8000 row(s) affected in 193 ms
[2020-04-19 20:05:23] Summary: 5000 of 5000 statements executed in 3 m 42 s 989 ms (106742400 symbols in file)

Then use Spring Boot + JdbcTemplate to build a simple application:

@RestController
public class DemoController {

    private static final Logger LOGGER = LoggerFactory.getLogger(DemoController.class);

    @Resource
    private JdbcTemplate jdbcTemplate;

    // entry point that triggers the slow query
    @GetMapping("trigger")
    public String trigger() {
        long before = System.currentTimeMillis();
        jdbcTemplate.query("select * from demo.demo where score < 20 limit 50", (set) -> {
        });
        long after = System.currentTimeMillis();
        LOGGER.info("Call time: {} ms", after - before);
        return "success";
    }
}

Calling http://localhost:8080/trigger takes just over a second. A little slow, but acceptable.

So, run a load test with ab:

$ ab -n500 -c20 http://localhost:8080/trigger

# represents 500 requests, 20 at a time


In this load test, the logged call times were around 15 seconds. Although very slow, no errors were thrown, the service still worked, and the database recorded no slow queries:

2020-04-19 20:56:21.665  INFO 18908 --- [nio-8080-exec-3]  c.e.s.controller.DemoController : Call time: 15260 ms
2020-04-19 20:56:21.779  INFO 18908 --- [nio-8080-exec-10] c.e.s.controller.DemoController : Call time: 15445 ms
...

Add a little more concurrency:

$ ab -n500 -c50 http://localhost:8080/trigger

# represents a total of 500 requests with 50 concurrent requests


This time the call times printed to the console climb steadily, and then exception messages start to appear:

ERROR 17100 --- [nio-8080-exec-45] o.a.c.c.C.[.[.[/].[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.] with root cause

java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
	at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:689) ~[HikariCP-3.4.2.jar:na]
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:196) ~[HikariCP-3.4.2.jar:na]

The sample application uses Spring Boot's default database connection pool, Hikari, whose default connection timeout is 30 seconds; waiting any longer throws an exception. Every connection pool has some kind of timeout configuration, though the details vary.
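As a sketch, Hikari's timeout and pool size can be tuned through Spring Boot's standard `spring.datasource.hikari.*` properties in application.properties; the values below are illustrative, not recommendations:

```
# maximum time (ms) a caller waits for a connection before
# SQLTransientConnectionException is thrown (default 30000)
spring.datasource.hikari.connection-timeout=10000
# maximum number of connections in the pool
spring.datasource.hikari.maximum-pool-size=20
```

Note that shrinking the timeout only makes failures surface sooner; it does not fix the slow query that exhausts the pool.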

Page error

There are also some slow SQL records in the database at this time:

SHOW GLOBAL STATUS LIKE '%Slow_queries%';

| Slow_queries | 51 |


To locate the problem, run the following SQL statement to print the current connection status. You can see what SQL statement is taking time:

SHOW FULL PROCESSLIST;
SQL details

It is easy to see that our SQL statement takes more than a second to execute. Running EXPLAIN on it shows that it performs a full table scan.
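As a sketch of that check (the comments describe what MySQL's EXPLAIN is expected to show for this demo table; the output was not captured from the incident itself):

```sql
explain select * from demo.demo where score < 20 limit 50;
-- expect type = ALL (full table scan) with key = NULL:
-- the composite index (age, score) cannot serve a filter on score alone
```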

Of course, a real project's tables are not this simple, and its SQL statements and indexes are more complex; the simple example here is only for demonstration.

There are also many excellent database monitoring tools, such as Druid, that display logs more clearly and make troubleshooting database problems easier.

After tuning, the same 50-concurrency load test was run locally again, and response times stayed at a few tens of milliseconds, with no pressure at all.

Tuning

Using the index

Many times, slow SQL can be solved by using indexes.

Through problem location, we found a high-frequency query on a field that had no usable index. MySQL indexes follow the leftmost-prefix principle, so the existing composite index on (age, score) cannot serve a high-frequency query that filters only on score.
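For the demo table, a minimal fix is a secondary index on the queried column (the index name idx_score is made up for this sketch):

```sql
-- with this index, the high-frequency query "where score < 20"
-- becomes an index range scan instead of a full table scan
create index idx_score on demo (score);
```

After creating it, rerun EXPLAIN to confirm the query now hits the new index.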

Of course, too many indexes bring their own costs, such as slower writes and extra storage, so weigh this against your business needs.

Use the cache

Through analysis, we found that these slow queries all had exactly the same conditions. In other words, we can cache the query results so that subsequent identical queries are served straight from the cache, which greatly improves performance.

In fact, mainstream ORM frameworks support caching, even multi-level caching; Spring also provides a cache abstraction, Spring Cache, which can be configured to suit your needs.

Reflection

With the fault reproduced and the tuning done, it's time to reflect.

Hitting a wall

Make good use of EXPLAIN

Before putting a SQL statement into use, try running EXPLAIN on it first to see whether it hits an index. If it doesn't, consider whether it is a high-frequency statement and whether it needs tuning.

Do adequate load testing

Nothing in production is trivial. Don't blindly assume the program you wrote must be fine and deploy it straight to the production environment. Doing some load testing before going live lets you find performance problems early and stop losses in time.

Take advantage of logging and monitoring

We usually release in the evening, when user activity is low. If error log collection, database monitoring, and alerting are configured, we may notice a problem before large numbers of users do and resolve it early, reducing the cost to both users and the company.

If you've read this far, I hope the article was worth your time. If it helped you, please support me.