1. Redo log introduction
Innodb does not synchronize dirty pages to disk in real time. When a data page is changed, Innodb adds it to the flush list of the Buffer Pool. Instead, the dirty pages will be synchronized to disk at some future time. Mysql database can guarantee the data persistence, without dirty pages from memory to disk synchronously, if the database is down, memory data will be lost, you have to have a mechanism to the loss of dirty pages to recover data, to ensure that the Mysql database startup again be dirty pages can also restore the original appearance, The Redo log provided by Innodb engine is designed to do just that!
As an ancient Chinese saying goes, “Before the troops and horses arrive, feed and grass should come first.” The same applies to database design, but with a twist: “Data not synchronized to disk, log first!” It means: If the dirty pages in the Buffer Pool are not synchronized to the disk, the modification operations of the cached pages are recorded in the form of logs. In this way, even if the database is down at a certain point, the data pages can be reloaded into the Buffer Pool when the database is restarted. The data page is then restored to its original appearance based on the change log. The log that records data page changes is the Redo log described in detail in this section. What we call log first is called WRITE-Ahead Logging (WAL).
Let’s review the Innodb engine architecture again:
We can see that the redo log generated by the Innodb engine during data exchange is persisted to a set of log files on disk (ib_logFile0, IBLogFile1) through the log Buffer in memory. In this section we describe the redo log storage structure in detail, including:
- Redo log format
- How does a Log Buffer store redo Log information
- How do redo log groups on disk store redo log information
In addition, you will learn more about Innodb’s mini-Transaction and redo log blocks. Let’s begin our redo log tour
2. Log format of Redo log
We now know that Redo log records the users are doing on the data changes, to modify the scene has a lot of data, we commonly used to add, delete, change may modify a data page, may also modify multiple data page, the page data also are not necessarily the same table in the database, so there are many kinds of format Redo log. In combination with the data page structure we discussed earlier, we can abstract a general Redo log type, namely, the specific changes we make to the data page in which space. Here is a general Redo log type structure:
Type indicates the log type of Redo log. Innodb engine has dozens of different types of Redo log for different scenarios of data page modification. The Space ID and the Page Number are the table Space ids and the data Page ids. The Space ID and the Page Number are the table Space ids and the data Page ids. The Space ID and the Page Number are the table Space IDS and the data Page ids. Let’s take a look at some of the simpler redo log types and then take a look at the more complex redo log types.
2.1 Several simple onesredo log
Log type
Let’s consider the simplest scenario where we simply update the value of a field in a record on a data page. We call this a very simple redo log called a physical log. We only need to record how many bytes we changed from an offset on the page. We also need to record the contents of these bytes. The Innodb engine divides redo logs into different types based on how much data is modified in the data page:
MLOG_1BYTE (the decimal number corresponding to the type field is 1) : indicates the type of the redo log that is written to an offset of the page.
MLOG_2BYTE (the decimal number corresponding to the type field is 2) : indicates the type of the redo log that is written to 2 bytes at an offset on the page.
MLOG_4BYTE (the decimal number of the type field is 4) : indicates the type of the redo log that is written to 4 bytes at an offset on the page.
MLOG_8BYTE (the decimal number corresponding to the type field is 8) : indicates that an 8-byte redo log is written at an offset on the page.
MLOG_WRITE_STRING (the decimal number corresponding to the type field is 30) : indicates that a string of data is written at an offset on the page.
For the more general type of MLOG_WRITE_STRING log, we need to log three types of information in the data section:
- The offset of the content to be modified in the data page
- How many bytes of data were modified (len)
- What is the content of the modified data?
As you may have noticed, len of the MLOG_WRITE_STRING log type is filled with 1, 2, 4, and 8 in decimal notation, respectively. MLOG_1BYTE, MLOG_2BYTE, MLOG_4BYTE, and MLOG_8BYTE are the log types. This is true, but why does the Innodb engine define redo logs of different types? The len field can be omitted if the different log types are defined. Of course it is! As we can see from the redo log structure above, redo logs take up very little space.
2.2 A little more complicatedredo log
Log type
Let’s look at a slightly more complex scenario, where a batch of data is inserted into a table, leaving aside the fact that this batch of data can be stored on a data page. Do we create a redo log for each log we insert using the simple Redo log type structure described above? If a secondary index is created in the table structure, the Innodb engine changes the corresponding secondary index page. Remember the data page structure we talked about earlier? We also need to update other information about the data Page in the File Header, Page Header, Page Directory, and so on; And the next_record property of the data record, which we say records the offset from the next record, also needs to change. In summary, bulk inserts into a table can involve changes in multiple places on multiple data pages. Let’s use an example diagram to briefly illustrate a single data page:
What I mean by this is that when inserting a batch of data into a data page, there are a lot of changes to be made. In this case, we want to log changes to the page using the simple redo log type mentioned above in one of two ways:
- Logging a redo log for every change in the data page produces a particularly large number of redo logs.
- All changes to a data page are recorded in a single redo log, but the data section records data from the first byte updated to the last byte changed. We know that a data page size is 16kb, and there is a lot of unchanged data. However, redo logs can be very large.
Who are the developers of the Innodb engine? Do they design redo logs this way? Of course not! The Innodb engine has many other redo logs designed to solve this problem. Let’s look at some of the most commonly used:
-
MLOG_COMP_REC_INSERT (decimal number 38) : Indicates the type of the redo log when a compact row was inserted.
-
MLOG_COMP_PAGE_CREATE (the decimal number for the type field is 58) : Represents the type of the redo log that created a page that stored records in compact line format.
-
MLOG_COMP_REC_DELETE (the decimal number corresponding to the type field is 42) : Deletes a redo log type recorded in compact line format.
-
MLOG_COMP_LIST_START_DELETE (the decimal number for the type field is 44) : Indicates a series of compact line-formatted redo log types that are deleted from a page starting with a given log.
-
MLOG_COMP_LIST_END_DELETE (the decimal number corresponding to the type field is 43) : This corresponds to the redo log of type MLOG_COMP_LIST_START_DELETE, which deletes a series of records up to the redo log of type MLOG_COMP_LIST_END_DELETE.
There are a number of redo log types, but there are also redo log types that correspond to the redo log types that are common in our business: insert, batch insert, batch update, and batch delete. These redo logs, however, contain not only physical logs but also logical logs. This means that when redo logs are used for data page restoration, they cannot simply copy the changed data to the corresponding place on the data page. The following is a complex redo log that restores a data page:
3. What isMini-Transaction
A mini-transaction is defined by Mysql as a single atomic access to the underlying page by the Innodb engine. What does that mean? For example, if we insert a record in an indexed table, we need to insert this record in the data page of the clustered index B+ tree of Innodb engine, change the contents of the next_record of the previous record to point to this record; Then insert the index information of this record into the data page of the secondary index B+ tree. The process described above is called an atomic access to the underlying page, meaning that this atomic access may modify the information of multiple data pages, and the process is indivisible. How do we understand this indivisibility? That is, we cannot insert a record into the data page of the clustered index B+ tree without changing the data page of the corresponding secondary index B+ tree. We call this indivisible access process a mini-transaction.
The reader may be wondering how this mini-transaction relates to the redo log described in this section. That’s a big deal! For example, inserting a log into an indexed table generates multiple redo logs. If only one of these redo logs was inserted, the database crashed. We know that redo logs are used to restore data pages. If we restore only a portion of the redo log, then the B+ tree corresponding to the redo log is in an incorrect state. Innodb does not allow this to happen, so every redo log generated by a Mini-Transactoin is grouped into a group. During a crash recovery, all redo logs are either recovered or none are recovered.
Innodb implements a special Redo log type called MLOG_MULTI_REC_END. The type of the Redo log is a decimal number 31. The structure of the Redo log is very simple. There is only one type field, so a set of redo logs from a mini-Transaction operation might look like this:
We mentioned earlier that there are some very simple redo log types. Do redo logs also need a MLOG_MULTI_REC_END redo log, indicating that the log is a separate group? Innodb uses only a few dozen redo log types, so 7 bits is enough for the redo log type. The remaining bits indicate whether the current redo log is a separate log, eliminating the need to group each simple redo log with a redo log of type MLOG_MULTI_REC_END. What a wise decision!
Now that you’ve seen the relationship between mini-Transactions and redo logs, some readers may be wondering what the relationship is between mini-Transactions and transactions provided by Innodb itself. This is an easy question to answer. If a Transaction in Innodb consists of multiple Sql statements, each Sql statement can consist of multiple MTRS, each of which contains an indivisible set of redo logs, So the relationship between Transaction, mini-transaction, and redo log is one of these contained correspondence:
4. How is redo Log information stored in the Log Buffer?
In order to facilitate the management of business data, Innodb engine designed “data pages” to store records; Innodb also has a redo log block that stores the redo logs we mentioned earlier.
4.1 redo log block
The structure
We call the redo log block the redo log pool. It is small because a redo log block is 512 bytes, compared to 16KB data pages. The structure is also very simple:
The only thing that stores redo log data is the 496 bytes of the log block body. LOG_BLOCK_CHECKSUM log block tailer LOG_BLOCK_CHECKSUM log block tailer LOG_BLOCK_CHECKSUM log block tailer LOG_BLOCK_CHECKSUM log block tailer LOG_BLOCK_CHECKSUM Let’s focus on a few properties of the log block header.
-
LOG_BLOCK_HDR_NO: Each block has a unique identifier greater than 0. This attribute represents the value of this identifier.
-
LOG_BLOCK_HDR_DATA_LEN: Indicates how many bytes have been used in the block, starting with 12 (because the log block body starts at the 12th byte). The value of this property increases as more redo logs are written to the block. If the log block body is fully written, the value of this property is set to 512.
-
LOG_BLOCK_FIRST_REC_GROUP: A single REDO log is also called a single redo log record. A SINGLE MTR produces multiple redo logs, which are called a redo log record group. LOG_BLOCK_FIRST_REC_GROUP represents the offset of the redo log group generated by the first MTR in the block (that is, the offset of the first REDO log generated by the first MTR in the block).
-
LOG_BLOCK_CHECKPOINT_NO: indicates the sequence number of a checkpoint. Checkpoint is the focus of the following content.
4.2 How do REDO logs are written to the log buffer
As mentioned earlier, Innodb engine’s memory architecture is mainly divided into two parts. One is the Buffer Pool, which we have already introduced in detail. The Buffer Pool is mainly used to solve the problem that every data page update is synchronized to the disk. Similarly, another part of Innodb’s memory architecture, the Log Buffer, is designed to address the performance loss caused by redo Log writing directly to disk. The size of the log buffer is specified by innodb_log_buffer_size. The default size is 16MB. Innodb also defines a global variable called buf_free to indicate where the redo log was written in the log Buffer, as shown in the following figure:
In Mysql5.7, writes to the log buffer are sequential and are committed in the exact order that mini-transactions are generated. We have previously analyzed that a Transaction may contain multiple Mini-transactions. Each mini-transaction contains a set of redo logs. In Innodb, transactions are executed concurrently, so redo logs from multiple mini-transactions may be written to the log buffer. For example, we have two transactions, T1 and T2, which are called “MTR” for convenience. T1 generates three redo logs, T1-mtr1, T1-mtr2, and T1-mtr3. T2 generates two redo logs, T2-mtr1 and T2-mtr2. The size of the log generated by each MTR is different, as shown in the figure below (we represent the set of logs generated by MTR as a color block structure). During the concurrent execution of T1 and T2 transactions, the log written by the redo log to the log Buffer might look something like this:
The MTR1 of T1 was first written into the Log Buffer, but did not fill up a block. Then THE MTR1 of T2 was also written into the Log Buffer and spread across three blocks. Then MTR was submitted, and the remaining MTR of T1 was submitted, which were small logs and filled into a block.
Mysql8.0 implements the parallel submission of MTR logs. The general idea is that when AN MTR generates a set of redo logs, the space occupied by the MTR is determined, so that you can calculate where each MTR should write the logs to the log Buffer. However, parallel replication can be fast or slow, and additional tools are needed to solve the resulting log holes and dirty order. The official blog has detailed the technology in depth: MySQL 8.0: New Lock Free, Scalable WAL Design
5. Redo log on disk
Writing redo logs to the log Buffer does not persist data. There is no doubt that redo logs fall. In this section we will take a look at the redo log file design on disk.
InnoDB’s redo log can be configured to multiple files with innodb_log_files_in_group. Another parameter, innodb_log_file_size, indicates the size of each file. Therefore, the total redo log size is innodb_log_files_in_group * innodb_log_file_size. The Redo logfile is named ib_logfile[number]. The log directory can be controlled by innodb_log_group_home_dir. The Redo log writes to file files in a sequential fashion. When the Redo log is full, it goes back to the first file and overwrites. The whole process is shown below:
As mentioned in the previous section, a log buffer is essentially a contiguous memory space consisting of 512-byte blocks. Redo logs are written in blocks to a log group file. Therefore, each log group file on disk is also composed of 512-byte blocks, and the first four blocks (2048 bytes) of each log group file are used to store administrative information. As shown below:
Let’s take a look at what the four blocks store:
Since checkpoinT1 and Checkpoint2 have the same structure, only one of them is drawn. If you don’t know what AN LSN and checkpoint are, don’t worry. I’m going to explain them in more detail in a later chapter.
Log file header structure:
The property name | Length (in bytes) | describe |
---|---|---|
LOG_HEADER_FORMAT | 4 | Redo log version |
LOG_HEADER_PAD1 | 4 | It’s for byte padding. It doesn’t make any sense, right |
LOG_HEADER_START_LSN | 8 | Mark the LSN value at the beginning of the redo log file, that is, the LSN value corresponding to the file offset of 2048 bytes |
LOG_HEADER_CREATOR | 32 | A string that identifies the creator of this redo log file. |
LOG_BLOCK_CHECKSUM | 4 | Check value of a block. All blocks have it |
Checkpoint structure:
The property name | Length (in bytes) | describe |
---|---|---|
LOG_CHECKPOINT_NO | 8 | The value is increased by 1 each time the server checkpoint is performed |
LOG_CHECKPOINT_LSN | 8 | LSN when the server checkpoint ends. System recovery starts from this LSN when the system crashes |
LOG_CHECKPOINT_OFFSET | 8 | The offset of the LSN value in the redo log file group |
LOG_CHECKPOINT_LOG_BUF_SIZE | 8 | Specifies the size of the log buffer when the server performs checkpoint operations |
LOG_BLOCK_CHECKSUM | 4 | Check value of block |
6. Summary and preview
In this section we have described the redo log storage structure in detail (how redo logs are stored in the in-memory log buffer and how redo logs are stored in the on-disk log file group). Understanding this is the basis for analyzing memory and disk interactions in the Innodb engine. We have also mentioned LSN and Checkpoint, but this section is not a separate one. It is quite complex and involves a lot of design skills and small details. Learning these will help us improve our business. The next section takes a closer look at LSN and Checkpoint in the Innodb engine. This article may be a bit boring, but I suggest that you come back to the later part of the puzzle. That’s all for today. Thank you