Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.
introduce
Hi, my name is Leo and I work in Java backend development. The first part of this article gives you an overview of the mechanics of WAL. Today, I’m going to focus on WAL security.
Thinking of writing
Based on the feedback from readers and friends, I will add a writing idea from the beginning of this article. It’s really annoying to let the reader know what they’re going to learn after they’ve read the article, in case they’ve spent half an hour reading and learned nothing.
Get into the business
Binlog write mechanism
Binlog writes to the log this is the easy one. If you mention binlog, you must mention binlog cache. So what is a binlog cache?
It’s worth knowing a little bit about the premise, and then binlog. Those who do can skip this paragraph and take care of friends who don’t understand.
The binlog cache is a buffer for binary log files that is controlled by the parameter binlog_cache_size.
A transaction is not allowed to be broken apart during execution, so no matter how big the transaction is, it should be saved and executed at once. So this is a binary log cache. If the amount of memory used exceeds the binlog_cache_size parameter. The temporary save to disk is used. When a transaction is committed, the data in the binlog cache is first written to the binlog and the binlog cache is emptied.
We can get an idea from the picture below.
Each binlog_cache is owned by a separate thread. This means that multiple threads with multiple binlog_caches will write to a binlog file when entering write operations. Efficiency is very fast because there is no disk IO overhead involved.
Data is persisted to disk only when fsync is performed. This is when the disk IO, or IOPS, is consumed.
Let’s dig into the question of when to write and when to fsync?
Let’s take a look at the timing of write and fsync.
int sync_binlog=0;
if(sync_binlog==0{commit transaction only write, not fsync}if(sync_binlog==1Fsync is executed every time a transaction is committed.if(sync_binlog>1){write each committed transaction, but fsync is not performed until N transactions are accumulated. }Copy the code
Therefore, in IO bottleneck scenarios, setting sync_binlog to a large value can improve performance. In actual service scenarios, it is not recommended to set this parameter to 0 because of the controllability of lost logs. A value ranging from 100 to 1000 is commonly used.
However, if sync_binlog is set to N, the binlog of the last N transactions will be lost if the host restarts abnormally.
Extension: Above we explained how to control the size of a parameter by binlog_cahe_size. Choose between disk and memory cache. So what are the other parameters that control this kind of problem? How many can you think of?
Let’s expand on sort_buffer_size, which was referenced in previous articles. You can go to the article Order by for a detailed review.
The binlog writing mechanism is basically like this, and the optimization will be improved as the technology advances. There are wrong places can also be pointed out! Discuss learning with each other!
Redo log writing mechanism
Same old, same old. I feel the design of MySQL is very similar in many ways. For example, the change buffer and binlog cache are temporary storage areas in memory. So what is redolog’s buffer? redo log buffer
Let’s start with redo log buffer
A redo log buffer is a transaction that writes data to the log before inserting a log. You cannot write to a redo log file when a transaction has not yet been committed. The temporary repository for this log is the redo log buffer. The actual writing of the redo log file is done at the commit step.
This is executed under the start of a transaction. How does this work if we only execute an INSERT statement?
When an update statement is executed separately, InnoDB starts a transaction on its own and commits it when the statement is completed. The process is the same as above, but is “compressed” into a statement.
A buffer is a temporary buffer, so do buffes persist to disk?
If MySQL restarts unexpectedly during a transaction, this part of the log is lost. Since the transaction is not committed, there is no loss of logs at this point.
Is it possible that some of the redo log buffer logs are persisted to disk before a transaction is committed?
There will be. The problem starts with the following diagram of three possible redo log states
- Redo log buffer: This is physically the MySQL process memory
- FS Page cache: written to disk but not persisted. Physically the Page cache file system.
- Persist to hard disk
The red areas in the figure are memory operations that do not involve disk IO. So the performance is very fast. Write is also very fast, as shown in yellow. Fsync is much slower. Because persistence to disk.
MySQL is not that simple. The redo log has a write policy. Let’s take a look at the strategies and cases.
Write policy This one involves a parameter called innodb_flush_log_at_trx_COMMIT which controls the redo log writing to disk. To provide better performance assurance!
- If the value is set to 0, only the redo log is left in the redo log buffer for each transaction committed.
- If the value is set to 1, the redo log is persisted to disk during each transaction.
- If set to 2, only the redo log is written to the page cache each time a transaction commits.
InnoDB has a background thread that writes logs from the redo log buffer to the page cache of the file system every second and then calls fsync to persist them to disk.
The redo log buffer helps redo log solve a very large problem. Is redo log buffer safe or performance safe?
If I’m executing a transaction, I haven’t committed it yet. MySQL must write data from redo log to redo log buffer! We wrote redo log buffers every second. So is it possible that the transaction is not completed, may have written disk?
The answer is yes. Let’s describe the redo log buffer refresh strategy
The parameter that controls this policy is innodb_log_BUFFer_size
- When the size of the redo log buffer reaches half of innodb_log_buffer_size, the background thread writes disks.
(Note that since the transaction is not committed, the write action only writes and does not call fsync, leaving only the page cache in the file system.)
- Alternatively, when a parallel transaction commits, the redo log buffer of that transaction is persisted to disk
Innodb_flush_log_at_trx_commit: innodb_flush_log_at_trx_commit: innodb_flush_log_at_trx_commit: Transaction B persists the redo log buffer to disk. Transaction A’s log in the redo log buffer is persisted to disk.
Note: As shown in the figure above, there is a prepare at the time of the two-phase commit. I’ll say redolog is in prepare. Write the binlog again. Commit at the end.
If innodb_log_buffer_size is set to 1, the redo log must be persisted once after prepare because a crash recovery logic relies on the prepare redo log plus binlog for recovery.
With a once-per-second round of polling and crash recovery, InnoDB assumes that redo logs are written to the page cache of the file system and do not need fsync at commit time.
Sync_binlog and innodb_flush_log_at_trx_COMMIT are both set to 1. In other words, a transaction must wait twice for a redo log and a binlog before it is fully committed.
It’s much more than that, sometimes we hear that the TPS of the big factories is 20,000 per second. That’s 40,000 disk writes per second. However, I use tools to test out, the disk capacity is only about 20,000, how can achieve 20,000 TPS? Group submission mechanism
Although this is the last module, there are still concepts!
This section describes the concept of log sequence number (LSN). The LSN is monotonically increasing and corresponds to each write point in the redo log. Each time a length redo log is written, length is added to the value of the LSN. The LSN is also written to InnoDB data pages to ensure that the data pages are not repeated by redo logs.
We’ll talk about the differences between LSN, redo log, and checkpoint. Checkpoint, you haven’t forgotten that one. This checkpoint is used to flush memory. Redo logs and LSN are the ones being introduced.
As shown in the following figure, trx1, TRx2, and TRx3 concurrent transactions completed the redo log buffer persistence process in the prepare phase.
It can be seen from the picture
- Trx1 is the first to arrive and will be chosen as the leader of the group.
- By the time TRx1 starts to write disk, there are already three transactions in the group, and the LSN has become 160.
- So when TRx1 returns, all redo logs whose LSN is less than or equal to 160 are persisted to disk.
- And then trx2 and trx3 can go straight back.
Therefore, the more members in a group submission, the better the disk IOPS saving effect. But if you only have single-threaded pressure, you can honestly only persist once per transaction.
In concurrent update scenarios, the later the fsync is invoked after the first redo log buffer is written, the more team members there are and the better the IOPS savings.
Optimization:
In order to improve MySQL performance, this area is usually chosen to delay because it can save more IOPS.
We made use of the two phase submissions above! There are two steps to writing a binlog.
- Write the binlog file from the binlog cache to the binlog file on disk.
- Call fsync persistence.
Let’s change it based on the optimization here. MySQL makes group submission work better.
In this way, the binlog can also be submitted as a group. Why do you say that? You can see step 2 in the figure above. If multiple transactions have been written (that is, written to the redo log buffer), they can be persisted to disk together at step 4. This optimization process is not to improve IOPS!
However, step 3 is usually performed quickly, so the time between binlog write and fsync is short, resulting in fewer binlogs that can be aggregated for persistence, so binlog group commits are usually not as good as redo logs.
If you want to improve the performance of binlog group commits, set binlog_group_COMMIT_SYNc_delay and binlog_group_COMMIT_SYNc_NO_delay_count. Fsync is called as soon as either of these two conditions are met.
- The binlog_group_COMMIT_SYNc_delay parameter, indicating the number of microseconds before fsync is called;
- The binlog_group_COMMIT_SYNC_NO_delay_count parameter indicates how many times fsync should be called.
Therefore, when binlog_group_COMMIT_SYNc_delay is set to 0, binlog_group_COMMIT_SYNc_NO_delay_count is invalid.
Going back to the WAL mechanism above, let’s continue our discussion. WAL mechanisms work in two main ways
- Both redo logs and binlogs are written sequentially. Sequential disk write is faster than random write.
- The group submission mechanism greatly reduces disk IOPS consumption.
Practical cases
If your MySQL is currently experiencing IO bottlenecks, what can you do to improve performance?
- Set binlog_group_commit_SYNc_delay and binlog_group_COMMIT_SYNc_no_delay_count to reduce the number of binlog disk writes. This approach is implemented on the basis of “extra deliberate waiting,” so it may increase the response time of the statement without the risk of losing data.
- Set sync_binlog to a value greater than 1 (more commonly 100 to 1000). In this case, binlog logs may be lost when the host is powered off.
- Set innodb_flush_log_at_trx_COMMIT to 2. The risk is that the host will lose data when it loses power.
I don’t recommend setting innodb_flush_log_at_trx_COMMIT to 0. If you set this parameter to 0, the redo log is stored in memory only. If you set this parameter to 0, the MySQL database will lose data if it restarts unexpectedly. The redo log is also written to the page cache of the file system very quickly, so setting this parameter to 2 is almost as fast as setting it to 0, but it is less risky than setting it to 0 so that MySQL does not lose data during an unexpected restart.
conclusion
Today we have covered two major log writing mechanisms. And the advantages and disadvantages of two-phase commit of logs. The flow and performance of the group commit mechanism