“This is the fifth day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021”.
This article tells a true story that happened at work (accident).
The story took place
Here’s what happened, a very simple beginning:
- Day 1:
- I am in normal coding, completing the development task schedule
- A 600M DB active/standby delay alarm was received at 5:20 p.m
- In urgent coding, I wish that standby machine will not affect online business.. Wait and see, maybe it will be all right… Have a meal first 🐶
- Continue busy in the evening…
- Day 2:
-
Came to work in the morning, forgot about it…
-
The system starts to habitually check the monitoring alarm emails of various indicators
-
Find email data strange.. These tasks are not done right ah, that timing data calculation task report is not out? .
-
Look at the table in the DB. Everything is fine
-
So why isn’t monitoring alarm emails right? Remind report system read standby DB.. So the master and slave are inconsistent!! (Finally remembered yesterday’s alarm master/slave delay has not recovered)
-
Why so much delay? Did you do any big transaction operation to update and delete data yesterday afternoon? I don’t think so. Where are the other students?
-
Check the monitoring indicator of the DB and find that the active/standby delay is 17 hours. That is to say, the binlog has not recovered at all since the alarm started yesterday.
To find the reason
Everyone said they didn’t do anything big.
We use MYSQL DB services on the cloud. The DB topology we use is very simple, as shown below:
At this point, however, there are still a large number of binlogs between the Master and Slave.
We asked the helper on the cloud to help locate the problem and locate the binlog playback situation on the standby machine. A table that does not have a primary key is still running update statements.
So the delay of 17 hours is due to this..
This table is a statistical table, the same one I was working on yesterday stat_table…
Quick solution: Let the DBA rebuild the standby machine
The analysis reason
The synchronization between the master and the slave was restored, and I began to analyze the reasons:
- I did add a field to stat_table yesterday before the alarm was generated, and I did update it once or twice.
- How can such a simple operation as 5S +5s cause a 17 hour delay in the active/standby mode?
The operation was not taken seriously, the operation of 500,000 data was common, thought to have no consequences.
Describe the situation again:
- Our binlog format is ROW format (because there are other listening requirements as well)
- Stat_table does not have a primary key
- Do it once on the Master
update stat_table set c=0
It produces 500,000 binlogs, 1 million twice - The throughput between the active and standby nodes is not large
So the secondary machine playback each binlog slowly! Why is the playback slow
View the active and standby binlogs
You must have the root permission to view binlog.
First look at the binlog on the Master
show binary logs;
- The last line is the latest file and location of the binlog file, and the file name Log_name and File_size are used for the following events
show binlog events in 'Log_name' from File_size limit 100
- This shows the binlog events since the last point in time, but does not show the specific binlog operation information
mysqlbinlog -uroot -p****** -h****** --read-from-remote-server -vv Log_name --start-position=File_size >binlog.sql
- This can generate a binlog for each row of the specific execution SQL
If you open binlog.sql with vim, you can see that the binlog data looks something like this
### UPDATE `db`.`stat_table`
### WHERE# # #@ 1=85481 /* INT meta=0 nullable=0 is_null=0 */# # #@ 2=419263 /* INT meta=0 nullable=0 is_null=0 */# # #@ 3=1 /* INT meta=0 nullable=0 is_null=0 */# # #SET# # #@ 1=85481 /* INT meta=0 nullable=0 is_null=0 */# # #@ 2=419263 /* INT meta=0 nullable=0 is_null=0 */# # #@ 3=0 /* INT meta=0 nullable=0 is_null=0 */
Copy the code
- Stat_table has three columns: A, B, and C. Note that a, B, and C are not primary keys.
- Because it isROWFormat, so the master will generate a binlog
update stat_table set c=0
We split it into 500,000 statements, so we have a million statements in our binlog
You can see the binlog execution on the Slave in the same way, except that its binlog is copied from the Master.
Analysis of the playback efficiency of the standby server
Assuming a primary key, each update playback execution on the standby machine takes about 0.1ms.
This is already assuming good performance, but this is just an assumption, depending on machine performance.
A million update takes 100,000 ms,100s, or about 1.5 minutes, but this is acceptable (the use of read-only standby machines is limited to reporting and some data monitoring). This is also the cost of the delay incurred by the secondary machine for the occasional update or DELETE operation with the primary key table.
But if there’s no primary key, there’s a normal index (there’s a normal index on A), UPDATE db.stat_table WHERE @1=85481 @2=419263 @3=0 SET @1=85481 @2=419263 @3=1
As shown below:
- Stat_table has columns A, B, and C. A has a common index
- Although stat_table does not have a primary key, a primary key (row_id) and corresponding index are created for it inside the DB
- The ordinary index A finds a matching a value through the B+ tree and obtains a Collection(R1, R2, R3…). R1 is the primary key row_id created in DB
- Assuming an average of 1000 rows of data per a value of index A, this would be 1000*1 million lookups back to the main table
- Collection(dr1,r2,r3,…) Select * from the main table where columns B and C meet where conditions. If so, update; otherwise, no processing is performed
The flow of executing a statement looks something like this. Let’s see what happens in different cases
- A total of 1 million update rows are returned to the table 1000 times each. The execution time is 1000 times larger than that of a primary key table, so it takes 1.5*1000 minutes!! The execution time is longer than 25 hours due to random I/O. Procedure
- If index A averages 100 rows per value, that’s more than 100 minutes, which is a lot better.
- If the table has no indexes at all, that’s a million times, or 2 years!
Follow-up and summary
Our normal index A has about 1000 rows per value on average, but it was wise to rebuild the standby machine without analyzing it, since we didn’t have to wait for a remote master/slave synchronization!
I’ve found that in practice, there are some mistakes that lead to today’s story:
- Although we all have good DB knowledge and literacy, remember to add the primary key when creating tables, but we tend to generate many tables when doing data calculation, sometimes may forget the primary key.
- Another type is log tables, where tasks are only added and not updated, so perhaps not having a primary key is not a problem. But in case of new functions, update data, the master and standby synchronization will also cause a great impact!
Select * from db where no primary key exists and add primary key to db. Looks a lot safer again