INSERT INTO ... SELECT: after a colleague used it, the company lost nearly 100,000 yuan (10W), and he was fired.
How it happened
The company's transaction volume is fairly large. The database is MySQL, with a daily increment of about one million rows. The databases and tables had never been sharded, so the only way to keep the table performant was to migrate old data out.
A colleague surnamed Li was given the task, and he came up with two schemes:
- Query the data in the application, insert it into the history table, then delete it from the original table
- Let the database do all the work with `INSERT INTO ... SELECT`
When he tried the first scheme, loading everything at once caused an immediate OOM, while doing it in batches cost too much IO and took too long. So he chose the second scheme. It worked fine in testing, he was quite pleased with himself, and then he got fired.
What the hell happened? Let's go through it.
- First scheme: let's look at the pseudocode
```java
// 1. Query the data to be migrated
List<Object> list = selectData();
// 2. Insert it into the history table
insertData(list);
// 3. Delete the migrated rows from the original table by id
deleteByIds(ids);
```
From this code the reason for the OOM is obvious: all the data is loaded into memory at once, and memory is not unlimited.
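A batched version of the first scheme avoids the OOM by keeping each chunk small. Here is a minimal sketch, using Python with SQLite standing in for the real stack; the table, column, and batch-size names (`flow`, `flow_history`, `dt`, `BATCH`) are all invented for the demo:

```python
import sqlite3

# In-memory SQLite stands in for MySQL; names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flow (id INTEGER PRIMARY KEY, dt TEXT, amount INTEGER);
CREATE TABLE flow_history (id INTEGER PRIMARY KEY, dt TEXT, amount INTEGER);
""")
conn.executemany("INSERT INTO flow VALUES (?, ?, ?)",
                 [(i, "2020-01-01", i * 10) for i in range(1, 10_001)])

BATCH = 1000  # keep each chunk small so memory stays bounded

def migrate_batch(cutoff):
    """Move at most BATCH old rows to the history table; return rows moved."""
    rows = conn.execute(
        "SELECT id, dt, amount FROM flow WHERE dt < ? LIMIT ?",
        (cutoff, BATCH)).fetchall()
    if not rows:
        return 0
    conn.executemany("INSERT INTO flow_history VALUES (?, ?, ?)", rows)
    conn.executemany("DELETE FROM flow WHERE id = ?", [(r[0],) for r in rows])
    conn.commit()
    return len(rows)

moved = 0
while (n := migrate_batch("2020-01-11")) > 0:
    moved += n
print(moved)  # 10000
```

The trade-off the article mentions is visible here: each batch is one select, one insert, and one delete, so migrating millions of rows means many round-trips and a lot of IO.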
- Now the second scheme, and what happened with it
To keep the table performant while retaining the useful data, the team agreed on a retention window: keep 10 days of data, leaving roughly 10 million (1kW) rows in the table. The migration became a single `INSERT INTO ... SELECT ... WHERE datetime < (ten days ago)`, so there is no OOM risk, far less code, and no network round-trips.
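In SQL terms the second scheme boils down to two statements inside the database. A minimal runnable sketch, again with SQLite standing in for MySQL and invented table/column names:

```python
import sqlite3

# SQLite stands in for MySQL; `flow`, `flow_history`, `dt` are invented names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flow (id INTEGER PRIMARY KEY, dt TEXT, amount INTEGER);
CREATE TABLE flow_history (id INTEGER PRIMARY KEY, dt TEXT, amount INTEGER);
INSERT INTO flow VALUES (1, '2020-01-01', 100), (2, '2020-01-20', 200);
""")

cutoff = "2020-01-11"  # "ten days ago" in the story
# All the work happens in the database: no application memory, no round-trips.
conn.execute("INSERT INTO flow_history SELECT * FROM flow WHERE dt < ?", (cutoff,))
conn.execute("DELETE FROM flow WHERE dt < ?", (cutoff,))
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM flow_history").fetchone()[0])  # 1
```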
For testing, 10 million rows were generated to simulate production. The test environment had no problems and everything passed smoothly. Since the table holds payment flow records, the migration was made a scheduled task, set to run at 8 p.m.
Traffic was low that evening, so nothing seemed wrong. But the next day the company's finance staff came to work, began reconciling the accounts, and found that the funds did not match: a lot of payment flow had never been stored. The investigation showed that after 8 p.m., inserts into the payment flow table had been failing and a lot of data was lost. The cause was not obvious at first, since there was no problem during the day; suspicion then fell on the nightly task, and once the scheduled task was taken offline, the failures stopped.
Analysis
- What’s the problem?
Why does stopping the migration task make everything fine again? What exactly does this `INSERT INTO ... SELECT` do? Let's run EXPLAIN on the statement. As the figure shows, the query does a full table scan. At this point part of the problem is not hard to guess: a full table scan over such a large table means the migration takes a long time. If the migration runs for, say, an hour, that explains why the problem never appears during the day, when the task isn't running. But is the full table scan the root cause?
We might as well try to reproduce the scene: run some inserts while the migration is in progress, and the failure duly appears. Now we can make a bold assumption: if there were no full table scan, would the problem disappear? After modifying the WHERE condition so the query no longer scans the whole table, the failures stopped. The scene was reproduced and the problem resolved.
Conclusion: the full table scan caused the accident.
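The article's EXPLAIN output isn't reproducible here, but the same diagnosis can be sketched with SQLite's `EXPLAIN QUERY PLAN` (a stand-in for MySQL's EXPLAIN; the table and column names are invented): without an index on the filter column, the SELECT that feeds the insert scans the whole table.

```python
import sqlite3

# SQLite's EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow (id INTEGER PRIMARY KEY, dt TEXT, amount INTEGER)")

# No index on dt: the planner has no choice but a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM flow WHERE dt < '2020-01-11'"
).fetchall()
print(plan[0][3])  # e.g. "SCAN flow": a full table scan
```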
That fixed the immediate failure, but it does not yet explain why it showed up as a run of successive insert failures.
- Why did it fail this way?
Under the default transaction isolation level (REPEATABLE READ in MySQL/InnoDB), `INSERT INTO A SELECT ... FROM B` locks the target table A directly, and locks the source table B record by record as the SELECT reads it. That explains the successive failures: because the SELECT did a full table scan, the flow table was progressively locked row by row. Concurrent payment inserts could not acquire the locks they needed, so they timed out or failed outright, while the few that fell outside the locked range succeeded.
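That locking behaviour can be pictured as two MySQL sessions (a hypothetical walkthrough, not runnable as a single script; the table names are invented):

```sql
-- Session 1: the nightly migration starts
INSERT INTO flow_history
SELECT * FROM flow WHERE dt < '2020-01-11';
-- With no index on dt, the SELECT scans all of `flow`, taking shared
-- next-key locks on every row and gap it reads as it goes.

-- Session 2: a payment arrives while session 1 is still scanning
INSERT INTO flow (dt, amount) VALUES (NOW(), 100);
-- Blocked by session 1's locks; if they are not released within
-- innodb_lock_wait_timeout: ERROR 1205 Lock wait timeout exceeded
```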
- Why did testing not catch it?
The test used a full copy of production data. But one thing should not be ignored: a test environment is, after all, a test environment, and a realistic data volume does not mean a realistic business scenario. What this case lacked was the scenario of a large volume of inserts arriving while the migration was running, and the result was a production bug.
The solution
Since the full table scan can be avoided, avoid it: index the column in the WHERE condition so that the SELECT goes through the index and locks only the range it actually reads.
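A sketch of the fix, again using SQLite's `EXPLAIN QUERY PLAN` as a stand-in for MySQL's EXPLAIN (the table, column, and index names are invented): after indexing the WHERE column, the plan changes from a scan to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow (id INTEGER PRIMARY KEY, dt TEXT, amount INTEGER)")
conn.executemany("INSERT INTO flow VALUES (?, ?, ?)",
                 [(i, f"day-{i:04d}", i) for i in range(1000)])

# The fix: index the column used in the WHERE clause of the SELECT.
conn.execute("CREATE INDEX idx_flow_dt ON flow (dt)")
conn.execute("ANALYZE")  # refresh planner statistics

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM flow WHERE dt < 'day-0010'"
).fetchall()
print(plan[0][3])  # e.g. SEARCH flow USING INDEX idx_flow_dt (dt<?)
```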
Is INSERT INTO ... SELECT still usable?
Yes, provided the SELECT goes through an index so that it locks only the rows it reads rather than the whole table.
Conclusion
Be careful with `INSERT INTO ... SELECT`: under the default isolation level it locks the rows the SELECT reads, so always make sure the query goes through an index.