Online database hung, how should you check? How to guard against it?

Wechat public number [Huan Shao growth path] send sword point offer, MySQL high-performance books, algorithm tutorial

Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.

introduce

Hello everyone, MY name is Leo. I am working as a Java backend engineer in Changzhou. In the last article, we introduced the problems of read-write separation from concept, purpose, single to multiple evolution, security evolution, and six solutions. Today we are going to talk about how you can quickly position yourself if you fail.

Train of thought

According to the feedback of readers and users, draw a writing thinking diagram. Through this figure can better analyze the current article writing knowledge points. Can help readers quickly in the shortest time to determine whether it is effective article!

External statistical

select 1

normal

Select 1 (select 1, select 1); select 1 (select 1, select 1)

If the library is fine, it will return 1, because output 1 is bound to return 1
If the library hangs, output 1 is definitely unresponsive because MySQL is no longer available

Mysql is usually used for stand-alone services when select1 is executed. For a very simple example, if you enter mysql from a CMD console and execute SQL statements, you will only know if the current library is working properly. It is not known whether the entire database cluster is healthy. So in the state of a single machine this scheme is more commonly used, once on some cluster size generally will not use this scheme!

unexpected

Let’s start with the innodb_thread_concurrency parameter that sets the maximum number of concurrent threads. Set it to 3. Once the number of concurrent threads reaches this value, InnoDB will wait for new requests until a thread exits.

Here we can simulate a worst-case scenario where three threads are accessing the database normally to perform a large query operation. Does a select 1 execute successfully?

It will be successful! However, if the user sends another query table request after the test, it will be blocked because the other three threads will be waiting.

Select 1 (); select 1 (); select 1 ();

The innodb_thread_concurrency parameter is 0 by default. Indicates that the maximum number of concurrent threads is not limited. If the number of concurrent threads is too high, the overall performance of MySQL will be affected. So we generally recommend 64 to 128.

The extension 64 to 128 here refers to concurrent query threads, which some people may confuse with concurrent connections.

show processlist
Copy the code

SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > execute SQL > Concurrent connections are just a waste of memory, and concurrent queries are a waste of MySQL’s limit on the number of concurrent threads.

Let’s talk about hot updates and deadlock detection. If I set innodb_thread_concurrency to 128, would I run out of 128 when a hot update occurs on the same row and crash the system?

No, MySQL would never allow this to happen. So while the lock is waiting, the concurrent thread subtracts by one. So the lock wait doesn’t count in thread 128.

A special case

Some lock waits are definitely not included in concurrent threads, so what about queries like the one above that consume a lot of time?

If you do get to 128, using SELECT 1 will cause problems, so the next solution is born

Look-up table to determine

Select 1 has its drawbacks, which gradually evolve into table lookup judgment

Where is the watch?
Certainly can’t put casually in a database!

The table location is set up in the database shown below. We can create a health_check with only one row of data and execute it periodically.

select * from mysql.health_check; 
Copy the code

This does solve the current state of the database from InnoDB’s side. The problem is that InnoDB writes logs, i.e. binlogs, so when the disk usage reaches 100%. All update statements and commit statements are blocked. However, the system can still read data normally.

Above query judgment, obviously is not.

Updating data is simply logging to a transaction. If the disk is full, write binlog.

So the execution is unsuccessful, but the read data is still available. Obviously it’s not going to work.

Update the judgment

I passed another one

Since you want to update, you need to put a meaningful field, the common practice is to put a timestamp field, which indicates the last time the detection was performed. This update statement looks something like:

update mysql.health_check set t_modified=now(a);Copy the code

All master and slave libraries that involve update operations must deal with synchronization issues

Detection of node availability should include both primary and standby libraries. If updates are used to detect the primary database, then updates are also used to detect the standby database. Check the standby database also write binlog. Since we generally design the master/slave relationship between database A and database B as A double M structure, the detection command executed on the standby database B is also sent back to the master database A.

If both primary database A and secondary database B use the same update command, A row conflict may occur, which may cause the synchronization between the primary and secondary databases to stop. So now it seems that mysql.health_check can’t have just one row of data.

If multiple rows are stored, the server_id must be considered in a host

MySQL specifies that the server_id of the primary and secondary databases must be different (otherwise, an error will be reported during the creation of the active/standby relationship). This ensures that the detection commands of the primary and secondary databases do not conflict.

Update judgment is a relatively common scheme, but there are still some problems. Like “Slow judgment.”

As described in our previous articles, when an update operation is slow or fails. You can switch from master to slave, why will there be a slow decision?

I/O Resource Allocation

First, all detection logic requires a timeout N. If an UPDATE statement is executed and does not return after more than N seconds, the system is considered unavailable.

The fault is caused by I/O resource allocation and the I/O utilization of the log disk is 100%. At this time, the whole system response is very slow, and the master/standby switchover is required.

I/O utilization 100% Indicates that the I/O of the system is working. Each request has the opportunity to obtain I/O resources to perform its own tasks. However, the update command used in our detection requires few resources, so it may be able to successfully submit the IO resources when it gets them and return them to the detection system before the timeout period of N seconds is reached.

After checking the system, the update command has not timed out, and you conclude that the system is healthy.

IO problem, SQL execution is very slow, but this time the system is normal is certainly not

Internal statistics

External statistics cannot judge the real demand. We moved on to internal statistical programs.

If MySQL can provide this kind of data, it will be more reliable.

From the performance_SCHEMA library, the time of each IO request is counted in the file_SUMMARy_by_event_name table.

COUNT_STAR: total number of IO counts
SUM_NUMBER_OF_BYTES_READ: Total number of bytes read from the redo log.

The fields in the above table are briefly introduced to the most commonly used. Search for the rest as you use it.

Thinking of the February

Once the table is found, all we need is event_name = “wait/ IO /file/ SQL /binlog”.

Performance_schema requires additional statistics every time we operate on the database, so there is a performance penalty when we turn on the statistics feature.

To enable time monitoring of the redo log, you can execute this statement:

update setup_instruments set ENABLED='YES', Timed='YES' where name like '%wait/io/file/innodb/innodb_log_file%';
Copy the code

When it’s open, it’s for real

The value of MAX_TIMER can be used to determine whether the database is in trouble. For example, you can set a threshold for a single I/O request that takes more than 200 milliseconds to be an exception, and then use a statement like this as the detection logic.

select event_name,MAX_TIMER_WAIT  FROM performance_schema.file_summary_by_event_name where event_name in ('wait/io/file/innodb/innodb_log_file'.'wait/io/file/sql/binlog') and MAX_TIMER_WAIT>200*1000000000;
Copy the code

When you find the exception, get the information you need and use the following statement:

truncate table performance_schema.file_summary_by_event_name;
Copy the code

Clear the previous statistics. This way, if the exception occurs again in later monitoring, you can add the monitoring cumulative value.

conclusion

We started with the basic select 1 method. This method works well with standalone MySQL, but not with a single master and multiple slave cluster.

Table lookup judgment involves innoDB writing transaction logs. If the disk is full, write transactions cannot be written but can be read, resulting in inconsistency.

To update the judgment. I/O utilization 100% Indicates that the I/O of the system is working and each request has a chance to obtain I/O resources. Therefore, the update does not time out, which is considered normal by the system. Therefore, the service can not respond to the side, while the judgment is normal, resulting in inconsistency.

Finally, internal statistics. Using the system library scheme. The event_NAME and MAX_TIMER fields are used to determine whether there is a problem