Author: Dmall Talker Zhu, Dmall-OS R&D Department

Background

In June 2020, the commodity system was taking in more and more commodity data from SAP, the middle platform, and other sources, with frequent updates. The update volume on the commodity database's primary and its replicas was large, about 540,000 updates per minute, generating more than 1 GB of Binlog files roughly every eight minutes. Since the database's IO capacity is fixed, data synchronization delay occurred, affecting the timeliness of writes and reads and, in turn, the availability of the systems built on commodity data.

Analyzing this purely by reading through application code would take a lot of manpower. Setting any single system aside, since multiple commodity applications all read and write the commodity database, the database layer is where the data converges and can give centralized feedback, so analyzing at that layer is a promising direction.

The analysis model

Parse the Binlog into plain Sql text, for example with mysqlbinlog --base64-output=decode-rows -v, which decodes row events into pseudo-Sql. The parsed text of an Update event looks roughly like the following:
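The sketch below is illustrative rather than taken from the real library: the database, table, and values are hypothetical, columns are rendered positionally as @N, and most columns are omitted.

### UPDATE `mall`.`ware`
### WHERE
###   @1=100023
###   @7='2020-06-01 10:00:00'
###   @8='2020-06-01 10:00:00'
### SET
###   @1=100023
###   @7='2020-06-01 10:05:00'
###   @8='2020-06-01 10:05:00'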

The key to judging, from the parsed text, whether an update is a valid update is to find out which fields were actually changed, that is, which fields have different values between the WHERE block (the row image before the update) and the SET block (the row image after it).
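A minimal Python sketch of this comparison, assuming the @N=value line format shown above (the function and variable names are ours, not from the original scripts):

import re

ROW_RE = re.compile(r"^###\s+@(\d+)=(.*)$")

def parse_block(lines):
    # Turn lines like "###   @7='2020-06-01 10:00:00'" into {7: "'2020-06-01 10:00:00'"}.
    fields = {}
    for line in lines:
        m = ROW_RE.match(line.strip())
        if m:
            fields[int(m.group(1))] = m.group(2)
    return fields

def changed_fields(where_lines, set_lines):
    # Column indexes whose value differs between the before image (WHERE)
    # and the after image (SET).
    before = parse_block(where_lines)
    after = parse_block(set_lines)
    return sorted(i for i in before if after.get(i) != before[i])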

Comparing the WHERE and SET blocks field by field shows that only the 7th and 8th fields, both time fields, actually changed: nothing but the timestamps was refreshed. Presumably, when the commodity system receives data, it generates the update statement with modified = now() regardless of whether any business value changed, so the time always changes and a Binlog entry is always produced.

Based on the above logic, as long as we can work out which fields an Update statement actually updated, and whether updating those fields is meaningful to the business, we can judge whether the Binlog it produced was necessary. The expected result model is as follows:

By analyzing, for a table X, each combination of updated fields and its corresponding update frequency (a counting sketch follows the list below), we can conclude:

1. Whether each updated field combination is a valid update;

2. The corresponding update counts, which give the number and proportion of valid and invalid updates.
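A hedged sketch of the counting step; the table name and counts in the comment are invented for illustration:

from collections import Counter

# (table, tuple of changed field indexes) -> number of occurrences
combo_counts = Counter()

def record_update(table, changed):
    # changed is the output of changed_fields(), e.g. [7, 8]
    combo_counts[(table, tuple(changed))] += 1

# After a full Binlog pass the result model is simply, e.g.:
# Counter({('ware', (7, 8)): 870000, ('ware', (2, 7, 8)): 40000})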

Combination statistics

Because the text parsed from the Binlog is very large, a script first splits each Insert, Delete, and Update statement into its own file, then computes the field differences of the Sql in each file and aggregates them. The process is as follows:
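A rough sketch of the splitting step, assuming the decoded Binlog text arrives on stdin and that each statement starts with a "### INSERT/UPDATE/DELETE" header (the output file layout is hypothetical):

import sys

out, seq = None, 0
for line in sys.stdin:
    if line.startswith(("### INSERT", "### UPDATE", "### DELETE")):
        if out:
            out.close()
        seq += 1
        out = open(f"stmt_{seq:08d}.sql", "w")  # one statement per file
    if out:
        out.write(line)
if out:
    out.close()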

First, record whether updating each field of each table by itself is a valid update: 1 means updating that field alone is meaningful to the business; 0 means updating it alone is not, as in the following sample table:
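Expressed as data, such a flag table might look like this sketch (the table and field names are hypothetical):

# 1 = updating this field alone is a valid update, 0 = it is not
VALID_FLAG = {
    ("ware", "name"):     1,
    ("ware", "price"):    1,
    ("ware", "created"):  0,
    ("ware", "modified"): 0,
}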

Based on this flag table, an "update validity expression" is computed for each update statement, as follows:

For each update statement, numerically add up the per-field flags over the combination of fields it updates.

If every field in the combination has a flag of 0, the sum equals 0;

as long as any field in the combination has a flag of 1, the sum is greater than 0.

An update statement is counted as a valid update exactly when its "update validity expression" sums to more than 0.
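Building on the VALID_FLAG sketch above (field names are used here for readability; the flags could equally be keyed by the @N column index):

def is_valid_update(table, changed, flags=VALID_FLAG):
    # Sum of per-field flags over the updated combination; valid iff > 0.
    return sum(flags.get((table, f), 0) for f in changed) > 0

# is_valid_update("ware", ["modified"])         -> False (sum = 0)
# is_valid_update("ware", ["name", "modified"]) -> True  (sum = 1)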

Taking the commodity database's tables and fields as an example to judge whether each update statement is valid, the statistics table looks like the following:

According to the above table, the proportion of invalid updates is count(updates whose validity sum = 0) / count(all updates). An example of the resulting model is shown below:
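As a worked example with invented numbers: if a table logged 1,200,000 updates in the sampled window and 1,000,000 of them had a validity sum of 0, its invalid-update proportion is 1,000,000 / 1,200,000 ≈ 83%.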

This gives a qualitative and quantitative basis for the analysis and for knowing the expected effect of subsequent optimizations. After optimization, the commodity system reduced invalid updates by more than 90%, easing the data-update pressure on the downstream systems that subscribe to the commodity Binlog.

Code investigation

The preliminary analysis is that after we receive data, the time field = now() written in the SqlMapper file triggers an actual update and generates Binlog. We then modify and verify case by case against actual usage, confirming whether any downstream system would be affected, for example big-data extraction jobs that filter on the modified time, before optimizing.

The preliminary modification method of the commodity system is as follows:

1. It is recommended to define the table's time fields in the following format:

create_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP;

update_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

The advantage of this is that the create_time field does not need to be set on Insert, and the update_time field does not need to be set on Update, which keeps the business code cleaner; and because the database maintains the timestamp itself, an update that changes no data will not touch the time either.

2. When generating the Update statement in the SqlMapper, dynamically append to the SET clause only the fields whose values actually need updating, keeping a no-op assignment such as id=#ID# so that the Sql statement stays syntactically correct even when no business field is set. This follows the rule that if there is no actual data change, no Binlog is generated.
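For illustration (the ware table and its fields are hypothetical): instead of always generating

update ware set name=#name#, modified=now() where id=#ID#;

the mapper emits only changed fields, e.g.

update ware set id=#ID#, name=#name# where id=#ID#;

and, when nothing changed, just

update ware set id=#ID# where id=#ID#;

which updates no actual data and therefore produces no Binlog.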

Other extensions

Using the result model above, we can clearly present the update profile of a database. Beyond judging whether each Sql statement is a valid update, it also gives feedback on the following:

1. Whether the database design is reasonable. For example, after analyzing the database of our company's XX system, the conclusion was: all of its updates are valid, but the most frequently updated table has 98 fields, and the most frequent update touches only that table's yn field. In the plain text parsed from the Binlog, even when only the yn field is updated, all 98 fields still appear in the SET and WHERE blocks, roughly 200 lines of text per update, which also inflates the Binlog files. The follow-up plan is to solve this by splitting the table's fields, separating data fields from status fields.

2. Provide a quantitative basis for caching, for example by showing which fields are updated frequently and hinting at whether that data is suitable for a cache;

3. Give the engineers responsible for a system a field-level view of database updates, so they know which fields are actually being updated, whether that is necessary, and what optimizations it might inspire;

The above general analysis method is especially suitable for systems with a large volume of database updates; with common scripting tools it can quickly produce analysis results.