Understanding InnoDB — Architecture — Storage — Transaction — InnoDB — locking

Definition of transactions

Atomicity, the whole database transaction is indivisible unit of work Consistency, Consistency, the isolation of a database from one state to the next consistent state: Isolation, each object reading or writing a transaction can separate from the operating objects of other transactions Durability: Their results are permanent once a transaction is committed

Transaction concurrency issues dirty read: transaction reads the transaction B updated data, then A rollback operation, B so dirty data is read into A data not repeatable read: transaction A repeatedly read the same data, transaction data to make to the B during update and submit, lead to A multiple reads the same data, the results are inconsistent. Phantom read: Transaction A repeatedly queries the same condition, during which transaction B deletes or inserts data that meets the condition, resulting in inconsistent result sets for multiple reads by transaction A.

The ISOLATION levels defined by the SQL standard are:

Transaction isolation level describe Dirty read Unrepeatable read Phantom read
READ UNCOMMITTED A transaction reads data modified by another uncommitted transaction. is is is
READ COMMITTED A transaction can only read data modified by another committed transaction. no is is
REPEATABLE READ A transaction can only read data that has been modified by another committed transaction, and the first time that a transaction reads a record, even if other transactions have changed the value of the record and committed it, the transaction will read the first value of the record again. no no is
SERIALIZABLE Transactions are serialized to execute no no no

REPEATABLE READ InnoDB supports REPEATABLE READ by default, but unlike standard SQL, InnoDB uses the next-key Lock algorithm at the REPEATABLE READ transaction isolation level to avoid phantom READ problems.

Transaction implementation

redo log

Redo logs ensure atomicity and persistence of transactions

Write Ahred Log (WAL) is commonly used in transaction database systems. That is, when a transaction is committed, the redo Log is written first and then the disk page data is modified. In the event of an outage, data recovery is accomplished through redo logs. This is also a D(Durability) requirement in transaction ACID. Actually the first operation is to modify the page data in the memory pool. Write redo logs first as opposed to modifying disk data.

Any update to an Innodb table is converted to a redo log and written to disk. The redo log records details of the changes.

A redo log is a physical log that records data such as offsets and bytes of a page. Redo the redo log is the physical operation of the page that rewrites the committed transaction changes.

The innodb_flush_log_at_trx_COMMIT parameter controls the policy for flushing the redo log cache to disk. 0: The master thread performs the fsync operation every second instead of writing to the redo log. 2: The master thread performs the fsync operation every second. Only redo logs are written to the file system cache during transaction submission, and fsync operation is not performed. When mysql breaks down but the operating system does not break down, transactions will not be lost.

In addition to transaction commit time, there are the following scenarios for flushing redo logs to disk

  1. The redo log buffer has used up half of the memory
  2. Async/Sync Flush Checkpoint
log block

Redo log buffers. Redo log files are stored as blocks, called redo log blocks. Each block is 512 bytes. The redo log block is the same size as the disk sector, so the redo log can be written atomically and does not require doublewrite technology.

A log block consists of three parts: log block header, log body, and log block tailer

Log size content below

variable byte describe
LOG_BLOCK_HDR_NO 4 The log buffer is made up of log blocks. Internally, the log buffer is like an array of log blocks. LOG_BLOCK_HDR_NO is used to mark the subscript of this array, incrementing and iterating. The first bit is used as flush bit. The maximum value is 2G
LOG_BLOCK_HDR_DATA_LEN 2 Represents the size occupied by log blocks. The maximum value is 0x200, which indicates log block512 bytes
LOG_BLOCK_FIRST_REC_GROUP 2 The first log offset for the new transaction. If there is free space in the log block after the transaction L1 is stored, the contents of the next transaction L2 will be stored. LOG_BLOCK_FIRST_REC_GROUP records the offset of transaction L2.

If LOG_BLOCK_FIRST_REC_GROUP is the same as LOG_BLOCK_HDR_DATA_LEN, the current log block contains no new logs. LOG_BLOCK_HDR_NO is calculated using LSN. So InnoDB can also use LSN to locate specific redo logs.

The log block tail contains 4 bytes LOG_BLOCK_TRL_NO, which is the same as LOG_BLOCK_CHECKPOINT_NO

log group

A redo log group is a logical concept consisting of multiple redo log files. Each log group has the same log size. The default redo log group consists of ib_logFILE0 and IB_logFILE1. Before InnoDB 1.2, the total size of redo log groups was less than 4GB. Starting with InnoDB 1.2, the total size of redo log groups was increased to 512GB.

At the end of the redo log file, when a redo log file is full, a redo log file is written to the next redo log file in round-robin mode. As mentioned in the Architecture section, when a page is Checkpoint flushed to disk, the corresponding redo log is no longer needed and its space can be overwritten and reused.

The first 2kb of the first redo log file in each redo group does not store log block information, but log file header and checkpoint information. The redo log is sequential. The redo log is fast, but the checkpoint of the first log file is updated after the redo log is checkpoint. Redo group the first 2KB of the redo log file does not store information and is reserved empty.

The redo log file contains the first 2k files

attribute byte
LOG FILE HEADER 512
CHECKPOINT BLOCK1 512
empty 512
CHECKPOINT BLOCK2 512

Take a look at CHECKPOINT BLOCK

attribute byte parsing
LOG_CHECKPOINT_NO 8 Monotonically increasing value. The value is incremented after each checkpoint operation
LOG_CHECKPOINT_LSN 8 The value of the checkpoint
LOG_CHECKPOINT_OFFSET 4 Checkpoint indicates the offset of the redo log

LOG_CHECKPOINT_LSN Indicates the checkpoint value. Pages whose LSN is smaller than this value have been written to disks. There are two CHECKPOINT blocks, and InnoDB updates CHECKPOINT values alternately. In this way, even if a checkpoint block fails to write data, database transactions can be correctly recovered from the previous checkpoint during crash recovery.

InnoDB reads the two CHECKPOINT blocks, takes the larger LOG_CHECKPOINT_LSN, and restores redo logs larger than this value.

LSN

Log Sequence Number LSN is the Log Sequence Number of each redo Log.

An LSN exists in multiple objects with different meanings

  1. Represents the total number of redo log writes

The LSN is monotonically incremented and stored in log_sys (InnoDB maintains an object that manages the Redo Log Buffer at runtime and is initialized by log_init() at startup). LSN increments the number of bytes written to the Redo Log. For example, if a new Log is len, log_sys-> LSN += len.

  1. Indicates the latest checkpoint location. See LOG_CHECKPOINT_LSN in the checkpoint BLOCK.

When InnoDB is shutdown, a full checkpoint is performed after flush redo log and dirty pages. The checkpoint LSN is written to the FIL_PAGE_FILE_FLUSH_LSN of the FSP HEADER PAGE of the shared tablespace. When Mysql starts, FIL_PAGE_FILE_FLUSH_LSN in the shared tablespace will be read. And the larger LOG_CHECKPOINT_LSN in the CHECKPOINT BLOCK. If the two are the same, the CHECKPOINT BLOCK is shut down. Otherwise, fault recovery is required. Run the LOG_CHECKPOINT_LSN command to find the redo log and scan the redo log to restore the log.

  1. Represents the last refreshed position of the page. Each page has a FIL_PAGE_LSN in the header, which records the LSN size when the page was last refreshed and can be used to determine whether the page needs to be restored.

Innodb_fast_shutdown controls database shutdown operation 0: Shutdown requires all full Purge and merge insert buffer to be completed and all dirty pages flushed to disk 1: default, only dirty pages flushed to disk 2: Ensure that all logs are written to log files. The next startup will restore logs

Parameter: Innodb_force_recovery, controls the database recovery operation by default 0, indicating that all recovery operations are performed when required. Other configuration values are not listed.

undo

Undo log ensures atomicity of transactions, helps with transaction rollback and MVCC functionality. Undo is a logical log, which records each row of data and records the reverse operation of each operation. The rollback actually does the reverse of the previous work, doing a DELETE for INSERT, an INSERT for DELETE, and the reverse update for update.

Undo the storage

InnoDB treats undo logs as data and saves undo logs through Page.

Rollback segment rollback segment is also a segment object stored on page (0,6) (page 6 of the shared tablespace) as follows

variable byte describe
TRX_RSEG_MAX_SIZE Don’t use
TRX_RSEG_HISTORY_SZIE 4 The number of UNDO pages in the HISTORY list
TRX_RSGE_HISTORY 16 The committed undo log linked list can be reclaimed by purge
TRX_RSEG_FSEG_HEADER 10 Roll back the SEGMENT HEADER
TRX_RSEG_UNDO_SLOTS 4 * 1024 The offset to the page where the UNDO SEGMENT HEADER resides

One UNDO segment can manage one transaction, and one rollback segment can manage 1024 UNDO segments. Prior to InnoDB1.1, there was only one rollback segment that supported a maximum concurrent transaction of 1026. Starting with InnoDB1.1, up to 128 rollback segments are supported. FIL_PAGE_TYPE_SYS at (0,5) records the page on which all rollback segments are located.

UNDO segment The UNDO segment is where the UNDO log is stored. It is actually a linked list of UNDO pages. The first UNDO page of the linked list consists of the following sections:

  • UNDO LOG PAGE HEADER
  • UNDO LOG SEGMENT HEADER
  • The UNDO log

The UNDO LOG PAGE HEADER contains the following content

variable byte describe
TRX_UNDO_PAGE_TYPE 2 The type of undo logs is TRX_UNDO_INSERT or TRX_UNDO_UPDATE
TRX_UNOD_PAGE_STARE 2 UNDO page location of the latest transaction UNDO log
TRX_UNDO_PAGE_FREE 2 Offset of the UNDO page free
TRX_UNDO_PAGE_NODE 12 UNDO page linked list node

For TRX_UNDO_PAGE_NODE, refer to the linked list structure of the storage section

The UNDO LOG SEGMENT HEADER content is as follows

variable byte describe
TRX_UNDO_STATE 2 Status of the UNDO segment
TRX_UNDO_LAST_LOG 2 The offset position of the last undo log header in the page
TRX_UNDO_FSEG_HEADER 10 UNDO segment header
TRX_UNDO_PAGE_LIST 16 The linked list header of the UNDO page

Valid values of TRX_UNDO_STATE are TRX_UNDO_ACTIVE,TRX_UNDO_CACHEd,TRX_UNDO_TO_FREE, and TRX_UNDO_TO_PURGE.

UNDO LOG SEGMENT HEADER is only saved on the first UNDO page in the UNDO page list, and the corresponding positions on other UNDO pages are left blank

Each undo record consists of two parts

  • UNDO LOG HEADER
  • UNDO LOG RECORD

There are two types of undo log record: Update undo log Record and INSERT undo log Record. Other DML operations generate update undo log records. According to TRX_UNDO_PAGE_TYPE, an UNDO segment can store only one type of UNDO, insert UNDO log or update UNDO log. If a transaction has both INSERT and UPDATE operations, separate UNDO segments are assigned for each type, which also reduces the maximum number of concurrent transactions supported by InnoDB.

The UNDO LOG HEADER content is as follows

variable byte describe
TRX_UNDO_TRX_ID 8 Id of the transaction that generated the undo log
TRX_UNDO_TRX_NO 8 The ordinal number that identifies the transaction commit order
TRX_UNDO_DEL_MARKS 2 Flag whether undo logs generated by delete mark are included in this group of undo logs
TRX_UNDO_LOG_START 2 Indicates the offset of the first undo log in the undo log group
TRX_UNDO_DICT_OPERATION 2 Whether the operation is DDL
TRX_UNDO_TABLE_ID 8 If DDL operation, operation table ID
TRX_UNDO_NEXT_LOG 2 Next UNDO LOG HEADER location
TRX_UNDO_PREV_LOG 2 The previous UNDO LOG HEADER location
TRX_UNDO_HISTORY_NODE 12 HISTORY linked list node

Since purge may remove some undo log records, TRX_UNDO_LOG_START does not necessarily equal the undo log HEADER end offset.

When a transaction is started, a unique strictly incrementing transaction ID and UNDO segment are assigned and their TRX_UNDO_STATE variable is set to TRX_UNDO_ACTIVE.

Note: InnoDB treats undo logs as data. Undo pages are managed together with regular data pages, flushed out of memory according to LRU rules, and later read from disk. Also, undo log operations need to be logged in the redo log. For an INSERT operation, the redo log logs not only the insert operation, but also the undo INSERT operation.

During recovery, InnoDB will redo all transactions, including uncommitted transactions and rolled back transactions. The uncommitted transactions are then rolled back using undo log.

Innodb_undo_directory: specifies the UNDO independent tablespace location innodb_undo_logs: Set the number of rollback segments to 128. In InnoDB 1.2, Innodb_rollback_segments InnoDB_UNDO_TABLESPACES: Number of undo tablespace files Innodb_undo_LOG_TRUNCate: Innodb_max_undo_log_size: if innodb_max_undo_log_size exceeds this threshold, it will be automatically shrunk

When a transaction commits, UNDO pages need to be processed:

  • If the current undo log occupies only one page and the size of the header page is less than three fourfold (TRX_UNDO_PAGE_REUSE_LIMIT), set the status to TRX_UNDO_CACHED, indicating that the undo page can be reused. The new undo log is recorded after the current undo log.
  • If it is Insert_undo (the undo type is TRX_UNDO_INSERT), the state is set to TRX_UNDO_TO_FREE and the undo log can be deleted
  • If the above is not true, the undo log may need to be Purge, set to TRX_UNDO_TO_PURGE, add the undo log to the TRX_RSGE_HISTORY of the rollback segment, and Purge the undo log.

Purge operations

Purge is used to eventually complete delete and update operations, which is designed because InnoDB supports MVCC, so records cannot be processed immediately upon a transaction commit, and other transactions may be referencing the row. (The delete operation sets the delete flag of the record to 1)

As mentioned earlier, rolling back the TRX_RSGE_HISTORY list links undo logs based on the order in which transactions are committed.

During the purge process, InnoDB finds the first record trx1 from the TRX_RSGE_HISTORY list that needs to be purged. After purge, InnoDB searches the undo log page of trx1 to see if there are any records that can be purged until there are no purge records on the undo log page. Go back to the history list to find the next record that needs to be cleaned. Because it can be reused, an undo log may hold undo logs for different transactions. Thus, the Purge operation is a slow process that involves discrete reads of disks.

MVCC principle

Hidden columns MVCC TransactionID: DB_TRX_ID specifies the TransactionID that operates on this data RollPointer: DB_ROLL_PTR specifies the undo log location pointer to the previous version of the data

When a transaction changes the row data, it puts the data before the modification into the undo log, changes the TransactionID to the current TransactionID, and RollPointer points to the location of the previous version data.

For example, change A field of row data from A -> B -> C, TransactionID, RollPointer as followsNote: This is organized into an undo log chain by RollPointer.

Snapshot At the RR level, after the first SELECT read operation after begin/ Start transaction, a snapshot (read View) is created to record other transaction records that are active in the system. At the RC level, a snapshot is created for each SELECT statement in a transaction.

Visibility Check Set the last committed transaction ID of the row to be read to trx_id_current, the earliest transaction ID in the snapshot read view to up_limit_id, and the latest transaction ID to low_limit_id

  1. Trx_id_current < up_limit_ID, when the current transaction reads the record of this row, the latest transaction ID of the record is smaller than all active transactions in the current system, so the current row data is visible.
  2. Trx_id_current > low_limit_id. After the current transaction is started, this line of records is modified and committed, and the data is not visible.
  3. Up_limit_id <= trx_ID_current <= low_limit_id, this line records that the latest transaction is active,

Check whether trx_ID_current is in the active transaction ID list of the snapshot. If not, the data is visible. If yes, it is not visible, you need to search the undo log chain to get the previous version and then judge the visibility.

group commit

With InnoDB, transaction commits do two things:

  1. Modify the information corresponding to the transaction in memory and write the log to the redo log buffer
  2. Call fsync to write the redo log buffer to disk

Group commit, where redo log buffers for multiple transactions are flushed to disk in a single fsync

With Binlog enabled, InnoDB uses two-phase transactions to ensure consistency between transactions and binlog in the storage engine. Note: Redo logs are generated by InnoDB, in physical format, and record changes made to each page as transactions progress. A binlog is a logical log that is written once a transaction is committed.

The two-phase transaction steps are as follows: In the Prepare phase, the SQL was executed successfully and memory logs of redo and undo were generated. InnoDB sets the rollback section to prepare; Binlog commit phase: Binlog memory log data is written to the file system cache and written to disk through fsync(). Commit phase: fsync() permanently writes binlog file system cache log data to disk.

The recovery operation crashes before the prepare phase and the transaction is rolled back directly. Crash when binlog is fsync() but InnoDB is not committed; On recovery, transaction information will be retrieved from binlog, the transaction will be redone and committed so that InnoDB and Binlog are always the same.

InnoDB needs to ensure that binlog is written in the same order as InnoDB transaction commits.InnoDB detects that the latest transaction T3 has been committed and does not need to be restored. As a result, transaction T1 data is lost.

Before InnoDB1.2, prepare_COMMIT_MUtex was used to ensure the order. Only when the last transaction is committed and the lock is released, can the next transaction prepare. However, after binary logs are enabled, the group commit function becomes invalid, resulting in poor performance.

InnoDB1.2 has been optimized. The prepare stage remains unchanged, and the binlog commit and commit stages are split into three stages. Each stage maintains a queue, with the first one entering the queue as the leader thread and the rest as the follower thread. The leader thread collects the follower transactions and performs the sync. The follower thread waits for the leader to notify it that the operation is complete.

  • Flush phase, writes the binlog for each transaction in the queue to memory
  • In Sync phase, the binlog in the memory queue is flushed to disk. If there are multiple transactions in the queue, only one fsync operation will complete the log writing.

In the Commit phase, the leader invokes the Commit of InnoDB transactions in the order of the queue, at which point the Group Commit function can be used. Since all three phases perform operations in queue order, ensure that binlog is written in the same order as InnoDB transaction commits.

When one set of things is in the Commit phase, other new things can Flush, using the Group Commit to keep taking effect.

InnoDB undo log Roaming InnoDB crash recovery

If you think this article is good, welcome to pay attention to my wechat public number, your attention is my motivation!