Innodb transaction logs include redo log and undo log. Redo log: provides rollback operations. Undo log: provides rollback operations.

Undo logs are not redo logs. They are actually redo logs:

1. A redo log is usually a physical log that records physical changes to a page, not changes to a single row or rows. It is used to restore a physical page after a commit (only to the last committed location).

2. Undo records a version by rolling back and forth rows. The undo log is a logical log that is recorded on a per-row basis.

1.redo log

1.1 Differences between Redo Logs and binary Logs

Redo logs are not binary logs. Many operations on InnoDB tables are also recorded in binary logs and can be redone, but there are big differences between them.

Binary logs are generated on top of the storage engine, regardless of the storage engine. Changes to the database will generate binary logs. The Redo log is generated by the InnoDB layer and only records table changes in the storage engine. And binary logs are logged before redo logs. See group Commit summary below for details.

Binary logging operations are logical statements. Even if it is a row-based record, it is essentially a logical SQL setting, such as the value of each column in the row record. The redo log is a physical log that records changes made to each page in the database.

Binary logs are only written to the log “file” in the cache once for each transaction commit (for non-transaction table operations, they are written directly after each successful statement execution). The redo log is written to the redo log in the cache before data is modified. Ensure that a redo log is written to the redo log before a commit is executed.

4. The binary log is written only once when the binary log is submitted. Therefore, the recording mode of the binary log is related to the submission sequence, and each submission corresponds to one record. In the Redo log, the same transaction may be recorded multiple times, and the last committed transaction overwrites all uncommitted transactions. For example, for transaction T1, the redo log may record four operations (T1-1,T1-2,T1-3, and T1*). T1* indicates the last commit log, so the final status of the corresponding data page is T1*. Redo logs are written concurrently. Different versions of redo logs are written concurrently to redo logs. For example, redo logs are written as follows: T1-1,T1-2,T2-1,T2-2,T2*,T1-3,T1*.

5. Transaction logging, which records physical pages, is idempotent and therefore extremely concise. Idempotent means that the state is the same after multiple operations, such as inserting a new row and then deleting it without changing the state. Binary logs record all operations that affect data. For example, insert a row once and delete it again.

1.2 Basic Concepts of redo logs

The redo log consists of two parts: an in-memory redo log buffer, which is volatile; The second is a persistent redo log file on disk.

Conceptually, InnoDB implements transaction persistence through the force log at commit mechanism, that is, before a transaction commits, all transaction logs must be written to the redo log file and undo log file on disk for persistence.

To ensure that each log is written to the transaction log file, the operating system’s fsync operation (the fsync() system call) is called each time the log from the log buffer is written to the log file. Since MariaDB/MySQL works in user space, MariaDB/MySQL’s log buffer is in user space memory. To write to a logfile on disk (redo:ib_logfileN,undo:share tablespace or. Ibd), the OS buffer must be stored in the kernel space. The purpose of the fsync() call is to flush the logs from the OS buffer to the log file on disk.

Write logs from the redo log buffer to the redo log file on disk as follows:

It is important to note that the log file is not a physical log file on disk, but a log file in the operating system cache, which is what the official manual means (e.g. With a value of 2, the contents of the InnoDB log buffer are written to the log file after each transaction commit and the log file is Flushed to disk approximately once per second). But to be honest, it’s not easy to understand, since it’s called file, it should already be a physical file. Therefore, OS buffer or File System buffer is used to represent the Log file mentioned in the official manual in the subsequent contents of this article. Then, Log file refers to the physical Log file on disk, that is, Log file on disk.

In addition, a layer of OS buffer is passed because open does not use the O_DIRECT flag, which means that the IO is written directly to the underlying storage device, bypassing the OS buffer at the operating system layer. Not using this flag bit means that the log will be buffered to a certain size, or explicitly fsync() will flush the buffer to the storage device. Using this flag bit means that a system call is made every time. For example, writing abcDE would only make one system call without o_direct and five system calls with o_object.

MySQL supports user customization of how logs from the log buffer are flushed to the log file during commit. This control is determined by the value of the variable innodb_flush_LOG_at_trx_COMMIT. This variable has three values: 0, 1, and 2. The default is 1. Note, however, that this variable only controls whether the COMMIT action flusher the log buffer to disk.

When set to 1, a transaction writes logs from the log buffer to the OS buffer and calls fsync() to the log file on disk for each commit. This way, no data is lost even if the system crashes, but IO performance is poor because every commit is written to disk.

When set to 0, instead of writing logs from the log buffer to the OS buffer during transaction commits, the OS buffer is written every second and fsync() is called to log file on disk. This means that 0 is written to disk (approximately) every second, and when the system crashes, 1 second of data is lost.

When set to 2, each commit is written only to the OS buffer, followed by a call to fsync() every second to write the logs from the OS buffer to the log file on disk.

Note that innodb_flush_log_at_timeout has a value of 1 second. This variable refers to the frequency of flushing logs. Many people think it is the frequency of flushing logs in 1 second when innodb_flush_log_at_trx_COMMIT values are 0 and 2. Innodb_flush_log_at_trx_commit sets innodb_flush_log_at_trx_COMMIT to 0 and 2. As for what this frequency controls, it will be explained later in “Rules for brushing logs to Disk”.

In a master-slave replication structure, to ensure transaction persistence and consistency, log related variables need to be set as follows:

If binary logging is enabled, set sync_binlog=1, that is, every commit transaction synchronization is written to disk.

Always set innodb_flush_LOG_at_trx_COMMIT =1, that is, every committed transaction is written to disk.

The setting of the above two variables ensures that the binary log and transaction log are written to each commit transaction and flushed to disk at commit time.

Selecting the time to flush logs seriously affects the performance of data modification, especially the process of flushing logs to disks. The following example tests the difference between innodb_flush_LOG_at_trx_COMMIT values 0, 1, and 2.

In the current environment, the value of innodb_flush_LOG_at_trx_COMMIT is 1, that is, every commit flushes logs to disk. Tests the time it takes to insert 10W records at this point.

The result was 15.48 seconds.

When the value is 2, the OS buffer is flushed every commit but only flushed to disk every second.

The insertion time was reduced to 3.41 seconds.

When the final test value is 0, the OS buffer and disk are flushed every second.

The result was 2.10 seconds.

In the end, you can see that the difference between 2 and 0 is not that big, but 2 is much safer than 0. They are all flushed from OS Buffer to disk every second, and the time difference between them is reflected in the log buffer flushed to OS buffer. Because flushing logs from the log buffer to the OS buffer is just a transfer of memory data without much overhead, there is not much difference between each commit and the flush per second. You can test inserting more data for comparison, and here is the case of inserting 100W rows. As you can see from the results, there is not much difference between 2 and 0, but there is a significant difference in performance with 1.

Although 0 and 2 can greatly improve the insert performance, 1 second of data may be lost in the event of a failure, which is likely to have a large amount of data. According to the above test results, 100W records only take more than 20 seconds, and there are about 4W-5W data in one second. Despite the simplicity of the data inserted above, But it illustrates the enormity of data loss. A better way to insert data is to set the value to 1, and then modify the stored procedure to commit each loop only once. This will ensure data consistency and improve performance as follows:

When the test value is 1.

1.3 Log Blocks

Innodb storage engine stores redo logs in chunks of 512 bytes. This is called a redo log block. So log buffers, OS buffers, and redo log files on disk are all stored in 512-byte blocks.

Each redo log block consists of three parts: the log bulk, the log tail, and the log body. The log bulk occupies 12 bytes and the log tail occupies 8 bytes, so the log body of each Redo log block is 512-12-8=492 bytes.

Because the redo log records changes to a data page, when changes to a data page require a redo log of more than 492 bytes (), multiple redo log blocks are used to record changes to the data page.

The log bulk consists of four parts:

Log_block_hdr_no :(4 bytes) ID of the location of the log block in the redo log buffer.

Log_block_hdr_data_len :(2 bytes) specifies the log size that has been recorded in the log block. When the log block is full, the value is 0x200, indicating 512 bytes.

Log_block_first_rec_group :(2 bytes) the start offset of the first log in the log block.

Lock_block_checkpoint_no :(4 bytes) position to write checkpoint information.

The third part of the log block size is log_block_first_rec_group, because sometimes a data page generates more logs than one log block, which is why multiple log blocks are needed to log the page. For example, if a data page generates a log volume of 552 bytes, then two log blocks, the first 492 bytes and the second 60 bytes, would have a 73 byte (60+12) start for the second log block. If the value of this part is the same as that of log_block_hdr_data_len, it indicates that there is no new log block in this log block, that is, this log block is used to continue the previous log block.

The log tail has only one part: log_block_trl_no, which has the same value as the bulk log_block_hdr_no.

The redo log buffer, or redo log file on disk, consists of log blocks. The diagram below:

1.4 Log Group and redo log file

Log group represents a redo log group. A group consists of redo log files of the same size. The number of redo log files in a group is determined by the innodb_log_files_group variable. The default value is 2. This group is a logical concept, and no real file to say this is a group, but the group can be defined by the variable innodb_log_group_home_dir directory, redo log file in this directory, default is under the datadir.

In the default data directory, you can see that there are two ib_logfile files in the log group. They are the same size as innodb_log_file_size. The first file ibdata1 is a shared tablespace file when innodb_file_per_TABLE is not enabled, corresponding to the. Ibd file when innodb_file_per_table is enabled.

When InnoDB brushes redo log blocks from the log buffer to these log files, it cycles through apwrites. Write to the end of the first logfile (ib_logfile0) until it is full and write to the second logfile (ib_logfile1). When the second log file is full, part of the first log file is cleared and the first log file continues writing.

Since the logs in the log buffer are flushed to the log file, the log file is also recorded in the log block mode.

In the first redo log file for each group, the first 2KB logs four specific sections, and the log blocks start logging after 2KB. None of the other log files in the log group, except the first redo log file, records the 2KB, but does free the 2KB. As follows:

The size of the redo log file has a significant impact on innoDB performance. A large redo log file takes longer to recover, and a small redo log file takes longer to recover.

1.5 Redo log format

Because the InnoDB storage engine stores data in pages (as in SQL Server), the redo log is also recorded in page format. By default, innodb’s page size is 16KB(controlled by the innodb_page_size variable), and a page can hold a large number of log blocks (512 bytes each) that record changes to the data page.

The 492 byte part of the log block is the log body, which is divided into four parts:

Redo_log_type: log type of the redo log, occupying 1 byte.

Space: indicates the ID of the tablespace. The space occupied by the compressed tablespace may be less than 4 bytes.

Page_no: indicates the offset of the page, again compressed.

Redo_log_body represents the data portion of each redo log, which is parsed by calling the corresponding function during recovery. For example, insert statements and DELETE statements write redo logs differently.

The following figure shows the general recording mode of INSERT and DELETE respectively.

1.6 Rules for flushing Logs

The logs in the log buffer that are not flushed to disks are called dirty logs.

As mentioned above, by default the transaction log is flushed to disk every time a transaction commits, because the variable innodb_flush_log_at_trx_COMMIT has a value of 1. However, InnoDB does not only flush logs to disk after a commit action, this is just one of the rules of innoDB storage engine flush logs.

There are several rules for flushing logs to disk:

1. When the COMMIT action is issued. The innodb_flush_log_at_trx_COMMIT variable controls whether logs are flushed after a COMMIT is issued.

2. Brush once per second. The frequency of the flush is determined by the innodb_flush_log_at_timeout variable, which defaults to 1 second. Note that this frequency is independent of the COMMIT action.

3. More than half of the memory in the log buffer has been used.

4. When there is a checkpoint, the checkpoint indicates the LSN position of the log when the disk is flushed to some extent.

1.7 Data page flushing rules and checkpoint

Data in the buffer pool that is not flushed to disks is called dirty data. Because both data and logs exist in the form of pages, dirty pages represent dirty data and dirty logs.

The previous section covered when logs are flushed to disk, and it’s not just logs that need to be flushed, but dirty data pages as well.

In InnoDB, there is only one rule for flushing data: checkpoint. But there are several ways to trigger a checkpoint. Either way, checkpoint fires, flushing both the dirty data pages and the dirty log pages from the buffer to disk.

Innodb storage engine has two types of checkpoint:

Sharp checkpoint: When reusing a redo log file (for example, switching log files), flush all dirty data recorded in the redo log file to disk.

Fuzzy checkpoint: Only a small number of logs are flushed to disk at a time, instead of flushing all dirty logs. This checkpoint can be triggered in the following ways:

Master thread checkpoint: controlled by the master thread, a certain percentage of dirty pages are flushed to disks every second or every 10 seconds.

Flush_lru_list checkpoint: Starting with MySQL5.6, the innodb_page_CLEANERS variable specifies the number of page cleaner threads dedicated to flushing dirty pages to ensure that the LRU list has free pages available.

Async /sync Flush checkpoint: Synchronous or asynchronous flush. For example, there are a lot of dirty pages that have not been brushed to the disk (the number is very much, there is a proportional control), at this time, it will choose synchronous brush to the disk, but this rarely occurs; If the number of dirty pages is not large, you can asynchronously brush the dirty pages to disks. If the number of dirty pages is small, you can not brush the dirty pages to disks temporarily

Dirty pages too much checkpoint: A checkpoint is forcibly triggered when there are too many dirty pages to ensure sufficient free cache space. The ratio of too much is controlled by the variable innodb_max_dirty_pages_pct. The default value for MySQL 5.6 is 75.

Because cleaning dirty pages takes time to complete, the location of the checkpoint is not marked in the redo log until after each brush.

Innodb_fast_shutdown =102 (default: 1) Specifies whether the MySQL database will flush dirty data and logs to disk when it is stopped, as controlled by the variable innodb_fast_shutdown=102 (default: 1). Implement fast shutdown.

1.8 LSN Super detailed Analysis

The LSN is called the log sequence number. In the InnoDB storage engine, the LSN occupies 8 bytes. The value of the LSN increases gradually as logs are written.

According to LSN, several useful information can be obtained:

1. Version information of the data page.

2. Total number of logs to be written. You can calculate the number of logs to be written based on the start and end numbers of the LSN.

3. You can know the location of the checkpoint.

You can actually get a lot of implicit information.

LSN exists not only in redo logs, but also in data pages. At the head of each data page, there is one

fil_page_lsn

Record what the final LSN value was on the previous page. If the LSN in the redo log is smaller than the LSN in the redo log, some data is lost. You can use the redo log to restore the LSN in the redo log.

You can run the show engine Innodb status command to view the LSN of the redo log. Pages uploading up to is not flushed as 3 uploading records in the show result of MySQL 5.5.

Among them:

Log sequence number is the current LOG SN in the redo log.

Log uploading up to is the LSN that is flushed as the redo log file on disk.

Pages flushed up to is the flushed LSN that is uploaded to the disk data page.

Last checkpoint at is the LSN of the last checkpoint.

Innodb starts by executing a change statement:

(1). First modify the data page in memory, and record LSN in the data page, named datA_IN_BUFFer_LSN;

(2). Write redo log to redo log in buffer at the same time (almost simultaneously) as you modify the data page, and record the corresponding LSN, temporarily called redo_log_in_buffer_lsn;

(3). After writing the logs in the buffer, the redo log is flushed to the redo log file on disk when several rules of the log flush are triggered. The corresponding LSN is recorded in the file, which is temporarily called redo_log_ON_DISK_LSN.

Data pages cannot stay in memory forever. In some cases, the checkpoint is triggered to flush dirty data and log pages from memory to disk. Therefore, the CHECKPOINT LSN is recorded in the redo log when the checkpoint dirty pages are flushed. Call it checkpoint_lsn.

(5) to record the checkpoint location quickly, simply set a flag. However, it is not always fast to swipe data pages. For example, at this checkpoint, there are many data pages to swipe. That is to say, it takes a certain amount of time to load all the data pages, and each data page in the process will record the LSN of the current page, which will be called data_page_ON_DISK_LSN.

The details are as follows:

In the figure above, the lines from top to bottom represent: Time axis, LSN recorded in data pages of buffer (data_IN_BUFFer_lSN), LSN recorded in data pages of disk (data_page_ON_DISK_LSN), LSN recorded in redo logs of buffer (redo_log_IN_BUFFer_lsn) , the LSN(redo_LOG_ON_DISK_LSN) recorded in the redo log file on the disk, and the LSN(checkpoint_lsn) recorded at the checkpoint.

Assume that at the initial time (12:0:00) all log pages and data pages are flushed and the LSN of the checkpoint is recorded, at which point their LSN is exactly the same.

Suppose a transaction is started and an update is performed immediately. After the update, the data pages in the buffer and the redo log record the updated LSN value, which is 110. If you run show engine Innodb status to check the value of each LSN, that is, the position status at ① in the figure, the result will be:

logsequence number(110) >logflushed up to(100) = pages flushed up to = last checkpoint at

Then a DELETE statement was executed and the LSN grew to 150. At 12:00:01, the redo log flusher rules (one of which is innodb_flush_log_at_timeout, the default log flusher frequency is 1 second). The LSN of the redo log file on disk is updated to the same as the LSN of the redo log in buffer. The LSN of the redo log file on disk is equal to 150. The LSN of the redo log in buffer is equal to 150.

logsequence number(150) =logflushed up to > pages flushed up to(100) = last checkpoint at

After that, an UPDATE statement is executed, and the LSN in the cache grows to 300, which is position ③ in the figure.

Then checkpoints, assumption of graph (4) position, as said earlier, checkpoint will trigger the data and log pages brush set, but need some time to complete, so has yet to be completed in data page brush set, checkpoint LSN or the last checkpoint LSN, but at the moment, disk data and log pages LSN has increased, namely:

Logsequence number >logflushed up to and Pages flushed up to > last checkpoint at

However, the uploading size of log up to and Pages up to can not be determined because log flushing may be faster than, equal to, or slower than data flushing. However, the checkpoint mechanism protects that the data flush rate is slower than the log flush rate. When the data flush rate exceeds the log flush rate, the system stops data flush temporarily and waits for the log flush rate to exceed the data flush rate.

By the time the data and log pages are flushed, i.e. at position ⑤, all LSNS are equal to 300.

As time goes by, the log flush rule is triggered at 12:00:02 (⑥ in the figure), but the log LSN in the buffer and the log LSN in the disk are the same. Therefore, the log flush rule is not executed, that is, all LSNS are equal when the show Engine Innodb status is displayed.

An INSERT statement is then executed, assuming that the LSN in the buffer grows to 800, at position ⑦ in the figure. In this case, the sizes and positions of various LSNS are the same.

The commit action, position ⑧, is then performed. By default, the commit action triggers a log flush, but not a data flush, so show Engine Innodb status results in:

logsequence number =logflushed up to > pages flushed up to = last checkpoint at

Finally, as time passes, the checkpoint reappears, at position ⑨ in the figure. However, this checkpoint does not trigger a log flush because the LSN of the log is synchronized before the checkpoint occurs. If the data is flushed so fast that the state changes are not captured in a fraction of a second, show Engine InnoDB status will result in equal LSNS.

1.9 InnoDB recovery behavior

When InnoDB is started, it is always restored whether innoDB was shut down normally or abnormally last time.

Because redo logs record physical changes to data pages, recovery is much faster than logical logs (such as binary logs). Also, InnoDB itself has been optimized to make recovery faster.

When InnoDB is restarted, checkpoint indicates that all logs are flushed to the LSN on the data page on the disk. Therefore, only logs from checkpoint are restored. For example, when the LSN of the database at the last checkpoint is 10000, the database is down and the transaction is committed. When the database is started, the LSN of the data page on disk is checked. If the LSN of the data page is smaller than the LSN in the log, the recovery starts from the checkpoint.

In another case, the checkpoint is being flushed before the outage, and the data page flushing progress exceeds the log page flushing progress. In case of a downtime, the LSN recorded in the data page is larger than that recorded in the log page. This is detected during the recovery process of the restart. In this case, the part that exceeds the log progress will not be redone, because this indicates that the work has been done.

In addition, transaction logging is idempotent, so multiple operations that result in the same result are recorded only once in the log. However, binary logs are not idempotent. Multiple operations are recorded, and the recovery process is much slower because records in binary logs are executed multiple times. For example, if a record with an ID originally set to 2 is updated to 3 and then set to 2, the transaction log will record unchanged pages with no need to recover at all. Binary records two update operations, which are also performed during recovery, slower than transaction log recovery.

Several variables related to redo logs

Innodb_flush_log_at_trx_commit = | 2 # 01 specify when painting the transaction log to the disk, the default is 1.

0: Synchronizes “log Buffer “to” OS Buffer “and brushes from” OS buffer” to disk log files every second.

1 indicates that each transaction commit will synchronize “log Buffer “to” OS buffer” and flush from “OS buffer” to disk log file.

2 means that each transaction commit synchronizes the “log buffer” to the “OS buffer” but only flusits from the” OS buffer” to the disk log file every second.

Innodb_log_buffer_size: # Size of log buffer, default 8M

Innodb_log_file_size: # Transaction log size, default 5M

Innodb_log_files_group =2: # Number of transaction log files in the transaction log group, 2 by default

Innodb_log_group_home_dir =./ : # Transaction log group path. The current directory represents the data directory

Innodb_mirrored_log_groups =1: # specifies the number of mirror groups for the transaction log group, but mirroring seemed to be forcibly disabled, so there was only one log group. This variable has been removed in MySQL5.7.

2.undo log

2.1 Basic Concepts

Undo log serves two purposes: it provides rollback and multiple line versioning (MVCC).

During data modification, not only redo but also the corresponding Undo is recorded, which can be used to roll back the transaction if it fails or is rolled back for some reason.

The undo log is a logical log, unlike the redo log. You can assume that when a record is deleted, the Undo log records a corresponding INSERT record, and vice versa, when a record is updated, it records a corresponding reverse update record.

When rollback is performed, the logical records in the Undo log can be read and rolled back. Sometimes, row versioning is also implemented through undo log: When a row is locked by another transaction, it can analyze the previous data of the row from the Undo log, so as to provide the version information of the row, so that users can achieve non-locked consistent read.

The undo log is recorded in the form of segment. Each undo operation occupies an Undo log segment.

In addition, undo log also generates redo logs, because undo log also implements persistence protection.

2.2 Storage Mode of undo Log

Innodb storage engine manages undo in segments. Rollback segments are called rollback segments. Each rollback segment contains 1024 undo log segments.

In previous versions, only one rollback segment was supported, which allowed only 1024 undo log segments to be recorded. MySQL5.5 supports 128 rollback segments You can also customize the number of rollback segments by using the innodb_undo_logs variable (innodb_ROLLback_segments prior to 5.6). The default value is 128.

The undo log is stored in the shared tablespace by default.

If innodb_file_per_table is enabled, it will be placed in an. Ibd file for each table.

In MySQL5.6, the location of undo can also be customized through the innodb_undo_directory variable. The default value is “.” for datadir.

By default, rollback segments are all in one file, but the innodb_unDO_TABLespaces variable can be set to how many files are allocated on average. This variable defaults to 0, which means that all are written to a tablespace file. This variable is static and can only be changed when the database example is stopped, such as writing to a configuration file or starting with the corresponding parameter. Innodb storage engine indicates during startup that it is not recommended to change the value to a non-0 value, as follows:

2017-03-31 13:16:00 7f665bfab720 InnoDB: Expected to open 3 undo tablespaces but was able

2017-03-31 13:16:00 7f665bfab720 InnoDB: to find only 0 undo tablespaces.

2017-03-31 13:16:00 7f665bfab720 InnoDB: Set the innodb_undo_tablespaces parameter to the

2017-03-31 13:16:00 7f665bfab720 InnoDB: correct value and retry. Suggested value is 0

2.3 Undo Log-related Variables

Undo-related variables have become rare in MySQL5.6. As follows: Their meanings have been explained above.

2.4 Internal mechanism of DELETE/Update Operations

When the transaction is committed, InnoDB will not delete the undo log immediately, because the undo log may be used later. For example, when the isolation level is REPEATable Read, the transaction reads the latest version of the commit row when the transaction is started. As long as the transaction does not end, the version of the row cannot be deleted. That is, the undo log cannot be deleted.

However, when a transaction is committed, the undo log corresponding to the transaction will be put into the delete list, which will be deleted through purge in the future. In addition, when committing a transaction, it will determine whether the pages allocated by Undo Log can be reused. If they can be reused, they will be allocated to subsequent transactions to avoid wasting storage space and performance by allocating separate Undo log pages for each independent transaction.

Delete and update operations with undo log :(insert operation does not need analysis, just insert rows)

The delete operation does not actually delete directly, but instead flags the DELETE object as deleted, and the purge thread has done the final deletion.

Update is divided into two cases: whether the update column is a primary key column.

If it is not a primary key column, the undo log directly reverses how the update was done. That is, the update is done directly.

If it is a primary key column, the update is performed in two parts: delete the row and insert the target row.

3. Binlog and transaction log sequence and group commit

If the transaction is not a read-only transaction, that is, if data changes are involved, fsync() is called at commit time to flush the log to disk by default to ensure transaction persistence.

However, one transaction at a time has low logging performance, especially when transactions are concentrated at one point in time and the volume of transactions is very high. Innodb provides group Commit, which can flush transaction logs from multiple transactions to disk in a single fsync().

Because not only are transactions logged at commit time, but also binary logs are logged, but which of them is logged first? Binary logs are the upper layer of MySQL logs and are written before the transaction logs of the storage engine.

Before MySQL5.6, when the transaction commits (i.e. issuing a COMMIT command), MySQL receives the signal and enters the commit stage. After the prepare phase, write binary logs in the memory immediately. After writing binary logs in the memory, the COMMIT operation is determined. Then start writing the transaction log in memory; Finally, binary and transaction logs are flushed. How they are flushed is controlled by the variables SYNc_binlog and Innodb_flush_LOG_at_trx_COMMIT, respectively.

However, because binary and transaction logs are required to be consistent, a Prepare_COMMIT_MUtex lock is enabled during the prepare phase after the commit to ensure that they are sequential and consistent. However, group Commmit becomes invalid after binary log is enabled, especially in the master-slave replication structure, binary log is almost always enabled.

Improvements were made in MySQL5.6. The first transaction in the queue is called the leader, and the other transactions are called followers. The leader controls the behavior of the followers. The order remains the same, binary first and transaction log second, but the mechanism is completely changed: removing the prepare_COMMIT_MUtex behavior also ensures that the group COMMIT is valid even if binary logging is enabled.

MySQL5.6 consists of three steps: Flush phase, sync phase, and commit phase.

Flush phase: Writes a binary log of each transaction to memory.

Sync phase: Flush binary logs in the memory. If there are multiple transactions in the queue, a single fsync operation completes the flush of binary logs. This is called BLGC(Binary log Group Commit) in MySQL5.6.

Commit phase: the leader invokes the commit of the storage engine layer transaction sequentially. Since InnoDB inherently supports group commit, the group commit failure caused by the prepare_COMMIT_MUtex lock is resolved.

The binlog_max_flush_queue_time variable is used to wait for several more binlogs to enter the sync phase. The wait time is determined by the variable binlog_max_flush_queue_time. Setting this variable to a value greater than 0 has the advantage of increasing the number of transactions in the group, resulting in better performance. However, this can lead to slower transaction response times, so it is not recommended to change the value of this variable unless the transaction volume is very high and is constantly being written and updated.

When you enter the sync phase, binlogs are flushed from memory to disk. The number of binlogs flushed is the same as a single binary log flush, controlled by the sync_binlog variable.

When one set of transactions is in the COMMIT phase, other new transactions can flush. They do not block each other, so the Group commit continues. Of course, the performance of a group commit is dependent on the number of transactions in the queue. If there are only one transaction in each queue, then a group commit is no different from a single commit. When there are more transactions in the queue, the effect of a group commit becomes more obvious.

Author: Golden dragon horse

Source: www.cnblogs.com/f-ck-need-u…