This article is shared in the Huawei Cloud community as “Can MySQL Lose Data?”, by JavaEdge.

It explains how MySQL avoids losing data, and along the way shows a way to temporarily boost performance during peak business hours.

WAL ensures that data can be recovered after an abnormal MySQL restart, as long as the redo log and the binlog have been persisted to disk.

Binlog writing mechanism

During transaction execution:

  • Write logs to binlog cache first
  • When a transaction commits, the binlog cache is written to the binlog file

A transaction’s binlog must not be split up: no matter how big the transaction is, its binlog has to be written in one piece. This is what the binlog cache is for.

The system allocates one binlog cache per thread, while all threads share the same binlog file. The binlog_cache_size parameter controls the size of a single thread’s binlog cache; if a transaction’s log exceeds this size, the excess is temporarily spilled to disk.

When a transaction commits, the executor writes the complete transaction in the binlog cache to the binlog file and clears the binlog cache.

Binlog write-to-disk states (the original article shows a diagram here):

  • write: writes the log to the file system’s page cache without persisting it to disk, so it is fast
  • fsync: persists the data to disk; fsync is generally what consumes disk IOPS
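The write/fsync distinction maps directly onto the POSIX calls of the same names. A minimal Python sketch (the file name and payload are made up for illustration):

```python
import os

# os.write lands in the OS page cache quickly; os.fsync forces the
# cached pages to disk, which is the slow, IOPS-consuming step.
path = "binlog_demo.bin"          # hypothetical scratch file
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
n = os.write(fd, b"transaction-1 events")  # fast: page cache only
os.fsync(fd)                               # slow: forced to disk
os.close(fd)
size = os.path.getsize(path)
os.remove(path)
print(n, size)  # 20 20
```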

The timing of write and fsync is controlled by sync_binlog:

  • sync_binlog = 0: every transaction commit only does write, no fsync
  • sync_binlog = 1: fsync is performed at every transaction commit
  • sync_binlog = N (N > 1): every transaction commit does write, but fsync is called only after N transactions have accumulated
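The accounting behind sync_binlog can be sketched as a toy model (plain Python, not MySQL code; the function name is invented for illustration):

```python
# Toy model of the sync_binlog policy: count how many fsync calls
# a stream of commits triggers for a given sync_binlog value.
def fsyncs_for(commits: int, sync_binlog: int) -> int:
    fsyncs = 0
    pending = 0  # transactions written but not yet fsync'ed
    for _ in range(commits):
        # every commit does a "write" into the page cache first
        if sync_binlog == 0:
            continue                  # never fsync; the OS decides when
        pending += 1
        if pending >= sync_binlog:    # fsync once N commits accumulate
            fsyncs += 1
            pending = 0
    return fsyncs

print(fsyncs_for(1000, 1))    # 1000: one fsync per commit
print(fsyncs_for(1000, 100))  # 10: one fsync per 100 commits
print(fsyncs_for(1000, 0))    # 0: left entirely to the OS
```

The pending counter is also exactly the log you stand to lose on a power failure, which is why large N trades durability for IOPS.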

Therefore, in I/O-bottlenecked scenarios, setting sync_binlog to a large value can improve performance. In real business scenarios, setting it to 0 is not recommended, because the amount of log that can be lost becomes uncontrollable; a value between 100 and 1000 is more common.

However, if sync_binlog is set to N, the binlogs of the most recent N transactions will be lost if the host restarts unexpectedly.

Redo log writing mechanism

Next, let’s talk about the redo log writing mechanism.

During a transaction, redo logs are written to the redo log buffer.

  • Does the redo log buffer need to be persisted to disk every time a log record is generated? No.

If MySQL restarts unexpectedly during a transaction, this part of the log is lost. Since the transaction has not committed yet, losing it does no harm.

  • Can part of the redo log buffer be persisted to disk before the transaction commits? Yes, it can.

To see why, start from the three possible states of a redo log, which correspond to the three colored blocks in Figure 2.

Redo log storage states (the original article shows a diagram here):

  • In the redo log buffer, physically in the MySQL process memory
  • Written to disk but not yet persisted (no fsync), physically in the file system’s page cache
  • Persisted to disk, i.e., on the hard disk

Writing to the redo log buffer is fast, writing to the page cache is about as fast, but persisting to disk is much slower.

InnoDB provides the innodb_flush_log_at_trx_commit parameter to control the redo log write policy:

  • 0: at each transaction commit, the redo log is only left in the redo log buffer
  • 1: at each transaction commit, the redo log is persisted directly to disk
  • 2: at each transaction commit, the redo log is only written to the page cache
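A toy sketch of the three policies (plain Python, not InnoDB code; the function name is invented for illustration):

```python
# Toy model: where a committed transaction's redo log ends up for each
# innodb_flush_log_at_trx_commit value, and what an abnormal event loses.
def redo_location_at_commit(policy: int) -> str:
    if policy == 0:
        return "redo log buffer (MySQL memory)"  # lost even on a MySQL crash
    if policy == 1:
        return "disk"                            # write + fsync: survives power loss
    if policy == 2:
        return "OS page cache"                   # write only: lost on power loss,
                                                 # survives a MySQL-only crash
    raise ValueError("policy must be 0, 1, or 2")

print(redo_location_at_commit(1))  # disk
```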

InnoDB has a background thread that writes logs from the redo log buffer to the file system’s Page cache every 1s, and then calls fsync to persist them to disk.

Note that a running transaction’s redo logs are written directly into the redo log buffer, and the background thread may persist them along the way. So a redo log of an uncommitted transaction may already have been persisted to disk.

In addition to the background thread’s once-per-second polling, there are two other scenarios in which the redo log of an uncommitted transaction is written to disk:

  • When the redo log buffer is about half of innodb_log_buffer_size, the background thread writes it to disk. Since the transaction has not committed, this is only a write without fsync; the log remains in the file system’s page cache.
  • When a parallel transaction commits, it persists the shared redo log buffer to disk along with its own logs. For example, if transaction B commits while innodb_flush_log_at_trx_commit is 1, B persists the whole redo log buffer to disk, which also persists transaction A’s logs sitting in the buffer.

In two-phase commit, the time order is: redo log prepare first, then binlog write, then redo log commit.

If innodb_flush_log_at_trx_commit is set to 1, the redo log must be persisted once in the prepare phase, because the crash-recovery logic relies on the prepared redo log plus the binlog.

Thanks to the once-per-second background polling and to this crash-recovery logic, InnoDB considers it sufficient, at the commit step, to write the redo log only to the file system’s page cache without an fsync.

In the classic “double 1” configuration, sync_binlog and innodb_flush_log_at_trx_commit are both 1. That means a transaction must wait for two fsyncs before its commit completes:

  • redo log (prepare phase)
  • binlog

That raises a question: if MySQL reports a TPS of 20,000, that would mean 40,000 disk fsyncs per second. But when I benchmark the disk with a tool, it can only sustain about 20,000 fsyncs per second. How can MySQL reach 20,000 TPS?

Group commit is the answer.

Log Sequence Number (LSN)

The LSN increases monotonically and marks the redo log write position: after writing a redo log of size length, the LSN advances to LSN + length.

The LSN is also written into InnoDB data pages, to ensure that a data page never applies the same redo log twice.
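A toy illustration of LSN bookkeeping (plain Python, not InnoDB code; the record sizes are chosen to reproduce the LSNs used in the next example):

```python
# Toy LSN bookkeeping: each redo record of `length` bytes advances the
# global LSN monotonically, so a record is identified by the LSN it ends at.
lsn = 0
end_lsns = []
for length in (50, 70, 40):   # three hypothetical redo records
    lsn += length
    end_lsns.append(lsn)
print(end_lsns)  # [50, 120, 160]
```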

As shown in Figure 3, three concurrent transactions (trx1, trx2, and trx3) in the prepare phase have finished writing to the redo log buffer and are about to persist it to disk. Their LSNs are 50, 120, and 160, respectively.

Redo log group commit (the original article shows a diagram here):

  • trx1 arrives first and is chosen as the leader of the group
  • by the time trx1 starts writing to disk, the group already contains three transactions, and the LSN has grown to 160
  • trx1 writes to disk carrying LSN = 160, so by the time trx1 returns, all redo logs with LSN ≤ 160 have been persisted to disk
  • trx2 and trx3 can then return directly
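The leader-based flow above can be sketched as a toy model (plain Python, not InnoDB code):

```python
# Toy model of redo log group commit: the first waiter becomes the leader,
# and one fsync carries the highest LSN in the group, so every follower
# whose LSN is covered returns without its own fsync.
def group_commit(waiting_lsns):
    fsyncs = 0
    flushed_lsn = 0
    while waiting_lsns:
        group = list(waiting_lsns)   # leader takes the whole waiting queue
        waiting_lsns = []
        flushed_lsn = max(group)     # one fsync persists up to the max LSN
        fsyncs += 1
        # all transactions with LSN <= flushed_lsn can now return
    return fsyncs, flushed_lsn

print(group_commit([50, 120, 160]))  # (1, 160): one fsync serves trx1-trx3
```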

Therefore, the more members a group commit gathers, the more disk IOPS it saves. Under a single-threaded workload, however, each transaction can only be persisted on its own.

In concurrent update scenarios, the later the fsync is called after the first transaction finishes writing its redo log buffer, the more group members are gathered and the more IOPS are saved.

To let a single fsync carry more group members, MySQL uses a simple optimization: it stalls for time.

  • Two-phase commit (diagram in the original article)

Writing a binlog actually has two steps:

  1. Write the binlog from the binlog cache to the binlog file on disk
  2. Call fsync to persist it to disk

MySQL’s trick is to delay the redo log’s fsync until after step 1 of the binlog write:

  • Refined two-phase commit (diagram in the original article)

In this way, the binlog can also be group-committed. In step 4 of the figure above, if the binlogs of several transactions have already been written, they are persisted together by one fsync, which also reduces IOPS.

In general, step 3 executes quickly, so the interval between the binlog’s write and fsync is short, and only a few binlogs can be gathered for one persistence call. Therefore, binlog group commit is usually less effective than redo log group commit.

To improve binlog group commit, set the following parameters:

  • binlog_group_commit_sync_delay: the number of microseconds to wait before calling fsync
  • binlog_group_commit_sync_no_delay_count: the number of transactions to accumulate before calling fsync

The two conditions are an or relationship, meaning that fsync is called whenever either condition is met.

If binlog_group_commit_sync_delay is 0, binlog_group_commit_sync_no_delay_count has no effect.
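The or-relationship between the two parameters can be sketched as a toy predicate (plain Python, not MySQL code; the names are invented):

```python
# Toy model of the binlog group-commit trigger: fsync fires when EITHER
# the deliberate delay has elapsed OR enough commits have accumulated.
def should_fsync(waited_us, pending, delay_us, no_delay_count):
    if delay_us == 0:
        return True  # no deliberate waiting: no_delay_count is ignored
    return waited_us >= delay_us or pending >= no_delay_count

print(should_fsync(waited_us=50, pending=10, delay_us=100, no_delay_count=10))  # True: count reached
print(should_fsync(waited_us=120, pending=3, delay_us=100, no_delay_count=10))  # True: delay elapsed
print(should_fsync(waited_us=50, pending=3, delay_us=100, no_delay_count=10))   # False: keep waiting
```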

WAL is often said to reduce disk writes, yet the redo log and the binlog are both written every time a transaction commits, so the number of writes is not actually reduced. Now you can see that WAL’s benefit mainly comes from:

  • Both redo logs and binlogs are written sequentially, and sequential disk writes are faster than random writes
  • The group commit mechanism greatly reduces disk IOPS

Therefore, if the MySQL IO performance bottleneck occurs, perform the following operations to optimize it:

  • Set binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count to reduce the number of binlog disk writes. This scheme is based on “deliberately waiting a bit longer”, so it may increase statement response time, but it loses no data.
  • Set sync_binlog to a value greater than 1 (100 to 1000 is common). The risk is that binlogs are lost if the host loses power.
  • Set innodb_flush_log_at_trx_commit to 2. The risk is that data is lost if the host loses power.

Setting innodb_flush_log_at_trx_commit to 0 is not recommended: the redo log would live only in MySQL’s memory, so even an abnormal MySQL restart (not just a power failure) could lose data. Since writing the redo log to the file system’s page cache is very fast, setting the parameter to 2 performs about the same as 0, but with 2 MySQL loses no data if it alone restarts unexpectedly.

Summary

This article explained how MySQL guarantees that the redo log and the binlog are complete.

Crash-safe

  • After performing an update and then running hexdump on the .ibd file directly, why are no data changes visible? Because of WAL. After the update statement, InnoDB only guarantees that the redo log and memory have been written; the data may not have been flushed to disk yet.
  • Why is the binlog cache maintained per thread while the redo log buffer is globally shared? Because a binlog cannot be “interrupted”: a transaction’s binlog must be written consecutively, so the whole transaction is written to the file together. Redo logs have no such requirement; they are written to the redo log buffer as they are generated, and the buffer’s contents can even “hitch a ride” to disk when other transactions commit.
  • If a crash occurs before the commit phase and the redo log is lost, will this cause master/slave inconsistency? No. At that point the binlog is still in the binlog cache and has not been sent to the standby. After the crash, both the redo log and the binlog are gone; from the business point of view the transaction was never committed, so the data stays consistent.
  • If a crash occurs after the binlog has been written to disk but before a reply is sent to the client, and the client reconnects after the restart to find the transaction successfully committed, is that a bug? No. Imagine an even more extreme case: the whole transaction commits successfully, the redo log commit completes, and the standby receives and applies the binlog, but the network between the primary and the client breaks before the success packet is delivered. The client sees a “network disconnected” exception, yet the transaction succeeded. This must be treated as a successful commit, not a bug.

In fact, what a database’s crash-safe capability guarantees is:

  • If the client receives a successful transaction message, the transaction must be persisted
  • If the client receives a message that the transaction failed (such as a primary key conflict, rollback, and so on), the transaction must have failed
  • If the client receives an “execution exception” message, the application must reconnect and query the current state to decide how to continue. The database only needs to guarantee internal consistency (between data and logs, and between primary and standby).
