Preface

In some business iterations we need to update the data in a MySQL table. If the data volume is small (tens of thousands of rows), we can simply run an SQL statement directly. Once the volume reaches a certain scale, however, problems appear. For example, MySQL deployed in a master-slave architecture relies on the binlog for master/slave synchronization. The binlog formats are listed below; pay attention to how the statement and row formats behave for an update:

format    | description
--------- | -----------
statement | Records every SQL statement executed on the master and replays it on the slaves. The log volume is small, which reduces I/O, but some SQL functions, such as RAND(), can behave inconsistently.
row       | Records how each individual row is modified or deleted. The log volume can be very large in some scenarios, such as batch deletes and updates.
mixed     | A mix of the two: ordinary statements are logged in statement format, and row format is used for the cases statement cannot handle safely.
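
As a side note, you can check which binlog format an instance is actually running before planning a bulk update. Below is a minimal JDBC sketch, assuming the MySQL Connector/J driver is on the classpath; the connection settings are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BinlogFormatCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings for the master instance.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW VARIABLES LIKE 'binlog_format'")) {
            if (rs.next()) {
                // Prints ROW, STATEMENT or MIXED.
                System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
            }
        }
    }
}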

Our production MySQL uses the row-format binlog for master-slave synchronization, so a full-table update on a table with millions of rows would inevitably generate a huge amount of binlog on the master. During synchronization the slaves would then have to block while applying that SQL, which is extremely risky, so a direct update is out of the question. In this article I walk through the whole process, from the initial full-table UPDATE statement to the batch update strategy we finally ran online, and explain the optimizations and the thinking behind them.

Main content

Direct update issues

Some time ago we needed to switch some basic user information from HTTP to HTTPS, which meant full-table updates on several large tables holding tens of millions of rows. At first I simply handed my DBA colleague a plain UPDATE statement, planning to run it when traffic was low:

update tb_user_info set user_img=replace(user_img,'http://','https://')

Deep paging problem

So I wrote a script like the one below, doing the replacement in batches as the LIMIT cursor increases. This works while the offset is small, but once the offset grows large it becomes very inefficient, close to a full table scan. This is the well-known "deep paging" problem.

update tb_user_info set user_img=replace(user_img,'http://','https://') limit 1,1000;

The efficiency of IN

Since MySQL has this deep paging problem, what if I first query out the ids and then update the rows in batches with an IN list? So I wrote a script along the lines of the SQL below. It turned out this was still not good enough: although MySQL can turn an IN list on the primary key into a set of key lookups, it was still quite inefficient.


select * from tb_user_info where id > {index} limit 100;

update tb_user_info set user_img=replace(user_img,'http','https') where id in (id1, id3, id2);
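
For illustration, a script driving these two statements might look roughly like the sketch below: page forward with the id cursor, then update each page with an IN list. The table and column names come from this article, while the JDBC connection settings and page size are placeholders, not the script we actually used.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class InListBackfill {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password")) {
            long cursor = 0L;                          // assumes positive, increasing ids
            while (true) {
                // Page forward by primary key (id > cursor) instead of an OFFSET.
                List<Long> ids = new ArrayList<>();
                try (PreparedStatement page = conn.prepareStatement(
                        "SELECT id FROM tb_user_info WHERE id > ? ORDER BY id LIMIT 100")) {
                    page.setLong(1, cursor);
                    try (ResultSet rs = page.executeQuery()) {
                        while (rs.next()) {
                            ids.add(rs.getLong(1));
                        }
                    }
                }
                if (ids.isEmpty()) {
                    break;                             // no rows left to process
                }
                cursor = ids.get(ids.size() - 1);      // advance the cursor to the last id seen

                // Update this page of ids with an IN list built from placeholders.
                StringBuilder in = new StringBuilder();
                for (int i = 0; i < ids.size(); i++) {
                    in.append(i == 0 ? "?" : ",?");
                }
                try (PreparedStatement update = conn.prepareStatement(
                        "UPDATE tb_user_info SET user_img = REPLACE(user_img, 'http://', 'https://')"
                                + " WHERE id IN (" + in + ")")) {
                    for (int i = 0; i < ids.size(); i++) {
                        update.setLong(i + 1, ids.get(i));
                    }
                    update.executeUpdate();
                }
            }
        }
    }
}

Even with the primary key, each batch still performs a separate lookup per id in the IN list; the final version below replaces these per-id lookups with a single contiguous range on id.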

The final version

Finally, after several rounds of discussion with the DBA, we arrived at the SQL and script below. A few points deserve attention. In the SELECT we use the hint /*!40001 SQL_NO_CACHE */, which tells MySQL that this query should not go through InnoDB's buffer pool caching, so its data pages are not kept in the buffer pool as hot data. We also force the query onto the primary key index with FORCE INDEX(PRIMARY) and filter the sorted data with an id cursor. Because that SELECT gives us a sorted primary key boundary, the final UPDATE can then be a simple range condition on id.

select /*!40001 SQL_NO_CACHE */ id from tb_user_info FORCE INDEX(PRIMARY) where id > {1} order by id limit 1000, 1;
update tb_user_info set user_img=replace(user_img,'http','https') where id > {1} and id < {2};

Let's focus on the first SQL statement. The buffer pool holds hot data pages, and the SQL_NO_CACHE keyword forces the pages scanned by this batch job not to be kept there, so the batch does not evict the cached data that normal traffic relies on and drag its efficiency down. MySQL's own backup tooling uses the same keyword in its data-scanning SQL for exactly this reason, to avoid polluting the normal business cache. We also need to force the primary key index for this query and sort by id; otherwise, when MySQL weighs the I/O cost of the candidate indexes, it may pick a different index.
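
To make the flow concrete, here is a minimal sketch of the driver loop around those two statements in plain JDBC. The connection URL, credentials, batch size, and pause are placeholders, and I use an inclusive upper bound (id <= boundary) so the cursor can simply advance to the boundary, a small variation on the strict < shown above; treat it as an illustration of the idea, not the exact script we ran.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RangeBackfill {
    private static final String URL = "jdbc:mysql://localhost:3306/mydb"; // placeholder
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        String boundarySql = "SELECT /*!40001 SQL_NO_CACHE */ id FROM tb_user_info"
                + " FORCE INDEX (PRIMARY) WHERE id > ? ORDER BY id LIMIT " + (BATCH_SIZE - 1) + ", 1";
        String rangeSql = "UPDATE tb_user_info"
                + " SET user_img = REPLACE(user_img, 'http://', 'https://')"
                + " WHERE id > ? AND id <= ?";
        try (Connection conn = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement boundary = conn.prepareStatement(boundarySql);
             PreparedStatement range = conn.prepareStatement(rangeSql)) {
            long cursor = 0L;                          // assumes positive, increasing ids
            while (true) {
                boundary.setLong(1, cursor);
                long upper;
                try (ResultSet rs = boundary.executeQuery()) {
                    if (!rs.next()) {
                        // Fewer than BATCH_SIZE rows remain: update the tail and stop.
                        try (PreparedStatement tail = conn.prepareStatement(
                                "UPDATE tb_user_info SET user_img ="
                                        + " REPLACE(user_img, 'http://', 'https://') WHERE id > ?")) {
                            tail.setLong(1, cursor);
                            tail.executeUpdate();
                        }
                        break;
                    }
                    upper = rs.getLong(1);             // the BATCH_SIZE-th id after the cursor
                }
                range.setLong(1, cursor);
                range.setLong(2, upper);
                range.executeUpdate();                 // touches at most BATCH_SIZE rows
                cursor = upper;                        // next batch starts right after this boundary
                Thread.sleep(100);                     // crude fixed pause between batches
            }
        }
    }
}

Each iteration updates at most one bounded batch of rows, so the binlog written per statement, and therefore the replication pressure on the slaves, stays bounded as well.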

While the job runs, you can watch key metrics such as master/slave replication lag, IOPS, and memory usage, and adjust the backfill rate by hand. The backfill interface can also expose parameters such as the number of threads, so the data is pushed through a thread pool; that raises the upper limit on how fast the whole table can be updated while still keeping the rate under control.
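
As a sketch of what that rate control could look like: before each batch, the job can check replication lag on a replica via the Seconds_Behind_Master field of SHOW SLAVE STATUS (SHOW REPLICA STATUS on newer MySQL versions) and pause while the lag is above a threshold. The replica address, threshold, and polling interval below are assumptions; a configurable thread pool (for example a fixed-size ExecutorService with one id segment per worker) can be layered on top in the same spirit.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReplicationThrottle {
    private static final String REPLICA_URL = "jdbc:mysql://replica-host:3306/mydb"; // placeholder
    private static final long MAX_LAG_SECONDS = 5;    // assumed threshold

    /** Block until the replica's reported lag drops below the threshold. */
    public static void waitForReplica() throws Exception {
        try (Connection conn = DriverManager.getConnection(REPLICA_URL, "user", "password");
             Statement stmt = conn.createStatement()) {
            while (true) {
                long lag = 0;
                try (ResultSet rs = stmt.executeQuery("SHOW SLAVE STATUS")) {
                    if (rs.next()) {
                        lag = rs.getLong("Seconds_Behind_Master");
                        if (rs.wasNull()) {
                            lag = Long.MAX_VALUE;      // replication stopped: treat as unbounded lag
                        }
                    }
                }
                if (lag <= MAX_LAG_SECONDS) {
                    return;                            // safe to run the next batch
                }
                Thread.sleep(1000);                    // back off and check again
            }
        }
    }
}

Calling waitForReplica() before each UPDATE batch keeps the backfill from outrunning replication: the thread count raises the ceiling on throughput, while the lag check keeps the actual rate within what the slaves can absorb.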

Other problems

One more note: this approach relies on primary key ids generated by the Snowflake algorithm or an auto-increment column, so records are inserted in primary key order. New writes are of course handled in the business code: their values are already replaced before they reach the database, and the historical data is backfilled in full after the code goes online.

Conclusion

Backfilling data is tedious work, but after this large-volume update and many rounds of discussion with my DBA colleagues, I came away with some new understanding of MySQL, including but not limited to the points below, which I hope are useful to you as well.

  1. The master/slave synchronization problems caused by the binlog format;
  2. The efficiency problem of deep paging in MySQL;
  3. How to keep a batch job from polluting the hot data that normal services rely on in the buffer pool.