1 Write mechanism of binlog
The writing logic of binlog is simple: During transaction execution, the log is written to the binlog cache first, and when the transaction is committed, the binlog cache is written to the binlog file.
The binlog of a transaction cannot be broken apart, no matter how large the transaction is, it must be written at one time. This is where the binlog cache is stored.
The binlog_cache_size parameter is used to control the size of the binlog cache in a single thread. If it exceeds this size, it will be temporarily saved to disk.
When a transaction commits, the executor writes the entire transaction from the binlog cache to the binlog and empties the binlog cache.
Figure 1 Binlog write disk status
As you can see, each thread has its own binlog cache, but shares the same binlog file.
- In the figure, write refers to write logs to the page cache of the file system, without persisting data to disk, so the speed is relatively fast.
- Fsync in the figure is the operation to persist data to disk. Generally, fsync accounts for the IOPS of a disk.
The timing of write and fsync is controlled by the sync_binlog parameter:
- Sync_binlog =0, indicating that each transaction is committed only write, not fsync.
- Sync_binlog =1, indicating that fsync will be performed every time a transaction is committed.
- Sync_binlog =N(N>1), which indicates that every transaction is committed and fsync is performed after N transactions are accumulated.
Therefore, in IO bottleneck scenarios, setting sync_binlog to a large value can improve performance. In actual service scenarios, it is not recommended to set this parameter to 0 because of the controllability of lost logs. A value ranging from 100 to 1000 is commonly used.
If sync_binlog is set to N, the binlog of the latest N transactions will be lost if the host restarts unexpectedly.
2 Redo log writing mechanism
During a transaction, redo logs are written to the redo log buffer.
Do redo log buffers persist to disk every time they are generated?
- Don’t need.
If MySQL restarts unexpectedly during a transaction, this portion of the log is lost. Since the transaction is not committed, there is no loss of logs at this point.
Is it possible that some of the redo log buffer logs are persisted to disk before a transaction is committed?
- Possible.
The problem starts with the three possible states of the redo log, which correspond to the three color blocks in Figure 2.
Figure 2 MySQL redo log storage status
The three states are:
- The redo log buffer is physically stored in MySQL process memory.
- Write to disk, but not persist (fsync), physically in the file system’s Page cache;
- Persisting to disks corresponds to hard disk.
Log writing to the redo log buffer is fast, wirte is fast to the Page cache, but persistence to disk is slow.
To control the redo log write policy, InnoDB provides the innodb_flush_log_at_trx_commit parameter, which has three possible values:
- Set to 0: only redo log is left in the redo log buffer during each transaction commit.
- If the value is set to 1, the redo log is persisted to disk every time a transaction is committed.
- Set to 2: The redo log is only written to the page cache each time a transaction commits.
InnoDB has a background thread that writes logs from the redo log buffer to the page cache of the file system every second and then calls fsync to persist them to disk.
Note: Redo logs are written directly to the redo log buffer, and are persisted to disk by background threads. That is, a redo log of uncommitted transactions may have been persisted to disk.
In addition to the background thread polling once per second, there are two scenarios where redo logs of uncommitted transactions are written to disk:
- When the space occupied by the redo log buffer reaches half of innodb_log_buffer_size, the background thread writes disks. Note that since the transaction is not committed, the write action is only write and does not call fsync, leaving only the page cache of the file system.
- When a parallel transaction commits, persist the redo log buffer of that transaction to disk. If the innodb_flush_log_at_trx_commit value is set to 1, use the following code to write redo logs to the buffer: Transaction B persists the redo log buffer to disk. Transaction A’s log in the redo log buffer is persisted to disk.
Redo log prepare, write binlog, and commit If innodb_flush_log_at_trx_commit is set to 1, the redo log must be persisted once in the prepare phase because the crash recovery logic relies on the prepare redo log plus the binlog.
The redo log is written to the page cache of the file system without fsync at commit time.
Sync_binlog and innodb_flush_log_at_trx_COMMIT are both set to 1. In other words, a transaction must wait twice for a redo log and a binlog before it is fully committed.
Three groups of submission
3.1 Log Sequence Number (LSN)
The LSN is monotonically increasing and corresponds to each write point in the redo log. Each time a length redo log is written, length is added to the value of the LSN.
The LSN is also written to InnoDB data pages to ensure that the data pages are not repeated by redo logs.
As shown in Figure 3, the three concurrent transactions (TRx1, TRx2, and TRx3) completed redo log buffers in the prepare phase and were persistent to disk. The corresponding LSN values were 50, 120, and 160, respectively.
Figure 3 redo log group commit
As you can see from the picture,
- Trx1 is the first to arrive and will be chosen as the leader of the group;
- By the time TRx1 starts to write disk, there are already three transactions in the group, and the LSN has become 160.
- So when TRx1 returns, all redo logs whose LSN is less than or equal to 160 are persisted to disk.
- And then trx2 and trx3 can go straight back.
3.2 group submitted
In a group submission, the more members, the better the performance of disk IOPS saving.
In concurrent update scenarios, the later the fsync is invoked after the first redo log buffer is written, the more team members there are and the better the IOPS savings.
In order to increase the number of members in a fsync session, MySQL has an interesting optimization: drag time. During the two-phase commit, there is a flow chart.
Figure 4 two-phase commit
Actually, writing a binlog is a two-step process:
- Write the binlog file from the binlog cache to the binlog file on disk.
- Call fsync persistence.
MySQL fsync the redo log after Step 1 for better group commits. In other words, the graph above looks like this:
Figure 5 two-phase commit refinement
In this way, the binlog can also be submitted as a group. When performing step 4 in Figure 5 to fsync the binlogs to disk, multiple transactions whose binlogs have been written are persisted together, which also reduces IOPS consumption.
However, step 3 is usually performed quickly, so the time between binlog write and fsync is short, resulting in fewer binlogs that can be aggregated for persistence, so binlog group commits are usually not as good as redo logs.
To improve the performance of binlog group commit_COMMIT_SYNc_delay and binlog_group_COMMIT_SYNC_NO_delay_count, configure binlog_group_COMMIT_SYNc_DELAY and binlog_group_COMMIT_SYNC_NO_delay_count.
- The binlog_group_COMMIT_SYNc_delay parameter, indicating the number of microseconds before fsync is called;
- The binlog_group_COMMIT_SYNC_NO_delay_count parameter indicates how many times fsync should be called.
The two conditions are an or relationship, which means that fsync is called whenever either condition is met. If binlog_group_COMMIT_SYNc_delay is set to 0, binlog_group_COMMIT_SYNc_NO_delay_count is invalid.
WAL mechanisms benefit from two things:
- Both redo logs and binlogs are written sequentially. Sequential disk write is faster than random write.
- The group submission mechanism greatly reduces disk IOPS consumption.