Before we start the discussion in earnest, let’s clarify two concepts:

A database is a collection of files stored on disk (data files, log files, and so on). A database instance is a running program; user programs interact with the database files through this program (that is, through the database instance).

Keep the MySQL architecture diagram in mind (not reproduced here): the server layer sits on top of pluggable storage engines, and InnoDB is the engine discussed below.

2.1. InnoDB architecture

InnoDB uses background threads for the actual data manipulation. It also maintains multiple memory blocks that, together with those threads, form a large memory pool used for the following tasks:

  • Maintaining the data structures required by all processes/threads.
  • Caching pages read from disk, and applying modifications here before they are written back to disk.
  • Caching the redo logs.

The main job of the background threads is to read pages from disk into the buffer while ensuring that the cached pages are the most recently used ones (for example, via the LRU algorithm), to flush modified pages from the buffer back to disk, and to use the redo logs for data recovery after a crash.

Now let’s go through them one by one.

2.1.1. Background threads

InnoDB uses multiple threads for actual data processing. These threads include:

  1. Master Thread. The core thread, responsible for asynchronously flushing buffered data to disk to keep it consistent. Its work includes flushing dirty pages (pages modified in the buffer that now differ from their counterparts on disk), merging the insert buffer (see the insert buffer section below), and recycling undo pages.

  2. IO Thread. InnoDB makes heavy use of asynchronous I/O (AIO) to handle its I/O requests, and the IO threads handle the callbacks of those requests.

  3. Purge Thread. The undo log generated by a transaction may no longer be needed once the transaction commits, so the Purge Thread reclaims undo pages that have been allocated and are no longer in use.

  4. Page Cleaner Thread. Moves the flushing of dirty pages out of the Master Thread into a dedicated thread.

2.1.2. Memory

Having covered the four background threads, let’s move on to the memory pools.

  1. The buffer pool.

Pages read from disk are cached here so that subsequent accesses to the same pages can be served from memory. Data is modified by changing the buffered page first (in fact, the redo log is written first, to prevent data loss on a crash) and then periodically flushing it to disk using a technique called Checkpoint.

Let’s look at the composition of the buffer pool: it caches data pages, index pages, undo pages, the insert buffer, the adaptive hash index, lock information, the data dictionary, and so on. Among them, data pages and index pages account for the majority.

Starting with version 1.0.x, multiple buffer pool instances are allowed: each page is assigned to an instance based on a hash value, spreading pages evenly across instances. This reduces contention on internal resources and increases the database’s concurrency.
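To make the idea concrete, here is a minimal sketch of hashing a page to a buffer pool instance. The names, the (space_id, page_no) identification, and the hash itself are illustrative, not InnoDB’s actual code:

```python
N_INSTANCES = 8  # e.g. innodb_buffer_pool_instances = 8

def buffer_pool_instance(space_id: int, page_no: int) -> int:
    """Index of the buffer pool instance that owns this page."""
    return hash((space_id, page_no)) % N_INSTANCES
```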

  2. LRU List, Free List, Flush List.

InnoDB manages the buffer pool with an LRU algorithm, and InnoDB’s default page size is 16KB. When a page must be brought in but no free page is available, the page at the tail of the list is evicted to make room. In the most common form of LRU, every time a page is accessed it moves to the head of the list; InnoDB optimizes this scheme:

Two refinements are introduced: a midpoint, and innodb_old_blocks_time. The midpoint dictates where newly read pages are placed in the LRU list: at the midpoint rather than directly at the head, so that a burst of query activity does not flush out hot data. That alone is not enough (a large scan pushes in many new pages no matter where they land), so innodb_old_blocks_time was added: it specifies how long a page inserted at the midpoint must wait before it may be promoted to the hot part of the list.
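A toy model of the midpoint scheme may help; everything here (the two-sublist representation, the timing) is a simplification of InnoDB’s single LRU list split at the midpoint:

```python
import time

class MidpointLRU:
    """Toy midpoint LRU: new pages enter the 'old' sublist; a page is
    promoted to the hot 'new' sublist only if it is accessed again
    after waiting at least old_blocks_time seconds."""

    def __init__(self, capacity, old_blocks_time=1.0):
        self.capacity = capacity
        self.old_blocks_time = old_blocks_time
        self.new = []   # hot pages, index 0 = most recently used
        self.old = []   # (page, time it entered), index 0 = newest

    def access(self, page):
        if page in self.new:               # already hot: move to head
            self.new.remove(page)
            self.new.insert(0, page)
            return
        for i, (p, t0) in enumerate(self.old):
            if p == page:                  # found in the old sublist
                if time.monotonic() - t0 >= self.old_blocks_time:
                    self.old.pop(i)        # waited long enough: promote
                    self.new.insert(0, page)
                return
        if len(self.new) + len(self.old) >= self.capacity:
            (self.old or self.new).pop()   # evict from the tail
        self.old.insert(0, (page, time.monotonic()))  # enter at midpoint
```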

At startup the LRU list is empty, and free pages are kept on a Free List. Whenever a page is needed, InnoDB first checks the Free List: if a free page exists, it is removed from the Free List and added to the LRU list; otherwise an old page is evicted from the LRU list and reused.

InnoDB began supporting compressed pages in 1.0.x: the original 16KB page can be compressed to 1KB, 2KB, 4KB, or 8KB. Compressed pages are managed through a separate unzip_LRU list rather than the LRU list alone; note that the pages in unzip_LRU are also contained in the LRU list.

For unzip_LRU, allocating (say) a 4KB page works as follows:

A. Check whether a free 4KB page is available; if so, use it.

B. Otherwise, check for a free 8KB page; if one is available, split it into two 4KB pages and add them to unzip_LRU.

C. If no free 8KB page exists either, take a 16KB page from the LRU list and split it into one 8KB page plus two 4KB pages, adding them to unzip_LRU.

This allocation strategy is known as the buddy algorithm.
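A compact sketch of this buddy allocation, with sizes in KB and chunk “addresses” as plain strings purely for illustration:

```python
from collections import defaultdict

free_lists = defaultdict(list)   # size in KB -> free chunks of that size
_pages = 0

def _take_16kb_page_from_lru():
    """Stand-in for taking a fresh 16KB page via the LRU list."""
    global _pages
    _pages += 1
    return f"page{_pages}"

def alloc(size_kb):
    """Allocate a chunk of 1, 2, 4, 8 or 16 KB (steps A-C above)."""
    if free_lists[size_kb]:                  # A: a free chunk of this size
        return free_lists[size_kb].pop()
    if size_kb == 16:                        # base case: go to the LRU list
        return _take_16kb_page_from_lru()
    half = alloc(size_kb * 2)                # B/C: split the next size up
    free_lists[size_kb].append(half + f"/hi{size_kb}")  # one half stays free
    return half + f"/lo{size_kb}"

page_4kb = alloc(4)   # tries 4KB, then 8KB, then a 16KB page from LRU
```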

When data pages in the buffer pool are modified, they differ from their on-disk counterparts and are called “dirty pages”; they must eventually be written back to disk. InnoDB flushes dirty pages periodically using the Checkpoint mechanism. To track them, InnoDB keeps dirty pages on a list called the Flush List, and the pages on it are flushed to disk. Note that a page can be on both the LRU list and the Flush List at the same time; this makes sense, because the LRU list can certainly contain dirty pages.

In general, these lists manage the large areas of memory devoted to data pages and index pages.

  3. Redo log buffer

The redo log records the changes made by transactions so that they are not lost to unexpected events such as a sudden power failure; it is what guarantees the durability of transactions.

InnoDB first writes redo records into this buffer and then flushes them to the redo log file at a certain rate, much like dirty page flushing. Three events trigger a flush of the redo log buffer:

A. The Master Thread flushes the redo log buffer to the redo log file every second.

B. A transaction commit triggers a flush to disk (a commit cannot be considered durable until its redo records are on disk).

C. The buffer is flushed when less than half of its space remains free.
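The three triggers can be summarized in a few lines of pseudocode; the class and its thresholds are illustrative, not InnoDB internals:

```python
class RedoLogBuffer:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0

    def append(self, nbytes):
        self.used += nbytes
        if self.capacity - self.used < self.capacity // 2:
            self.flush()                  # trigger C: less than half free

    def on_master_thread_tick(self):
        self.flush()                      # trigger A: once per second

    def on_transaction_commit(self):
        self.flush()                      # trigger B: commit

    def flush(self):
        # write buffered records to the redo log file, then fsync
        self.used = 0
```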

  4. Additional memory pool

Memory for certain data structures, such as the control objects describing each frame in the buffer pool, is allocated from this additional memory pool. Consequently, when the buffer pool is enlarged, it is worth enlarging this pool along with it.

2.2. Checkpoint technology

InnoDB uses the Write-Ahead Log technique: a transaction’s changes are first recorded in the redo log, then the redo log buffer is written to disk, and only later are the modified pages themselves flushed. This ensures the durability property of ACID.

Even with this mechanism, we cannot rely on the redo log alone for durability, because replaying it during recovery takes time that grows with its size; likewise, flushing dirty data pages takes time, so they must not be allowed to pile up.

Checkpoint technology was therefore introduced to ensure that dirty pages and redo logs are written back to disk at regular intervals.

In InnoDB the redo log is not unbounded: it is a circular file that gets reused, so it can run out of space. Most of that pressure is relieved by flushing dirty pages, which releases the redo log records covering them. Checkpoint is therefore mostly about dirty page processing.

In InnoDB, there are two kinds of Checkpoint: Sharp Checkpoint and Fuzzy Checkpoint.

A Sharp Checkpoint writes all dirty pages back to disk, and happens when the database shuts down.

Using Sharp Checkpoints at runtime would hurt performance, so Fuzzy Checkpoints, which flush only part of the dirty pages at a time, are used instead.

There are several Fuzzy Checkpoint types available today:

  • Master Thread Checkpoint: fired at intervals by the Master Thread.
  • FLUSH_LRU_LIST Checkpoint: the LRU list must keep enough free pages available, so pages at its tail are evicted; if any of them are dirty, this checkpoint flushes them (in later versions this check was moved into the separate Page Cleaner thread).
  • Async/Sync Flush Checkpoint: forces dirty pages to be flushed when the redo log is running out of reusable space, so that log files can be released.
  • Dirty Page Too Much Checkpoint: triggered when there are too many dirty pages, to ensure that enough usable pages remain in the buffer pool.

2.3. Master Thread

The Master Thread implementation in older versions

The Master Thread has the highest priority and consists of four internal loops: the main loop, the background loop, the flush loop, and the suspend loop.

The main loop contains two groups of operations: those performed once per second and those performed once every ten seconds.

Once per second includes:

  • Always flush the log buffer to disk.
  • Possibly merge the insert buffer (if I/O pressure over the past second was low).
  • Possibly flush up to 100 dirty pages to disk.
  • Switch to the background loop if there is no user activity.

The log buffer is flushed to the log file every second even for transactions that have not yet committed, which explains why even very large transactions commit quickly.

Every 10 seconds includes:

  • Possibly flush 100 dirty pages to disk (if system I/O pressure is low).
  • Always merge up to 5 insert buffer entries.
  • Always flush the log buffer to disk.
  • Always delete useless undo pages.
  • Always flush 100 or 10 dirty pages to disk (100 when the dirty page ratio is high, otherwise 10).

The Master Thread switches to the background loop when there is no user activity or when the database is shut down. The background loop does the following:

  • Always delete useless undo pages.
  • Always merge 20 insert buffer entries.
  • Jump to the flush loop if still idle, or back to the main loop otherwise.

The flush loop flushes 100 dirty pages at a time until there are none left to flush, then jumps to the suspend loop.

The suspend loop suspends the Master Thread until an event occurs, at which point it jumps back to the main loop.
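Pulling the description together, here is a rough Python rendering of the old main loop; the engine object and its methods are hypothetical stand-ins for InnoDB internals:

```python
import time

def master_thread_main_loop(engine):
    last_ten = time.monotonic()
    while engine.user_activity():
        time.sleep(1)
        # --- once per second ---
        engine.flush_log_buffer()                     # always
        if engine.io_pressure_last_second_low():
            engine.merge_insert_buffer()              # possibly
        if engine.dirty_page_ratio_high():
            engine.flush_dirty_pages(max_pages=100)   # possibly
        # --- once every ten seconds ---
        if time.monotonic() - last_ten >= 10:
            last_ten = time.monotonic()
            if engine.io_pressure_low():
                engine.flush_dirty_pages(max_pages=100)
            engine.merge_insert_buffer(max_entries=5)  # always
            engine.flush_log_buffer()                  # always
            engine.purge_undo_pages()                  # always
            n = 100 if engine.dirty_page_ratio_high() else 10
            engine.flush_dirty_pages(max_pages=n)      # always
    # no user activity: fall through to the background loop
```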

An improved version of the Master Thread

As you can see, the old Master Thread is limited by its hard-coded I/O assumptions: it decides what to do based on the I/O activity of the previous period, and the number of dirty pages flushed each time is a fixed constant. This is tuned for hard disks and wastes capacity on SSDs, whose reads and writes are far faster than HDDs.

Parameters were added to adjust this. innodb_io_capacity tells InnoDB how much I/O capacity it may use when flushing dirty pages and merging the insert buffer, and innodb_max_dirty_pages_pct is the percentage of dirty pages in the buffer pool at which flushing is triggered; it can be lowered on faster storage.

Another option is adaptive flushing (innodb_adaptive_flushing), which dynamically adjusts how many dirty pages are flushed based on the rate at which redo log is generated.
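These knobs can be inspected and changed at runtime, for example with mysql-connector-python; the connection parameters are placeholders and the values are examples, not recommendations:

```python
import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="root", password="...")
cur = cnx.cursor()

for name in ("innodb_io_capacity",
             "innodb_max_dirty_pages_pct",
             "innodb_adaptive_flushing"):
    cur.execute(f"SHOW VARIABLES LIKE '{name}'")
    print(cur.fetchall())

# e.g. raise the I/O ceiling on SSD storage (needs privileges)
cur.execute("SET GLOBAL innodb_io_capacity = 2000")
cnx.close()
```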

Master Thread in 1.2.x

In this version, dirty page flushing was moved out of the Master Thread into a separate Page Cleaner Thread.

2.4. Key features of InnoDB

The key features that make InnoDB different are:

  • Insert Buffer
  • Doublewrite
  • Adaptive Hash Index
  • Asynchronous IO
  • Flush Neighbor Page

2.4.1. InsertBuffer

2.4.1.1. Description

The InsertBuffer addresses the insertion cost of non-clustered indexes (secondary indexes). Inserts into a clustered index (the primary key index) can be made sequential, effectively O(1), by remembering the position of the last inserted primary key. A secondary index cannot be inserted that way: a newly inserted secondary key generally has no relation to the previous one (unlike the primary key, which simply increments; time-based columns are a rare exception), so each secondary index insert requires random reads of index pages on disk, often costing several I/Os.

To solve this problem, InnoDB uses the InsertBuffer, which exists both on disk (for persistent storage) and in the buffer pool (for fast operation). With it, every secondary index insert first checks whether the target secondary index page is already in the buffer pool. (A secondary index page is a structure holding a number of secondary index records; it is a leaf node of the secondary index’s B+ tree, which lives on disk, and the B+ tree algorithm determines which page each new record belongs to.) If the page is in the buffer pool, the record is simply inserted there and waits to be flushed to disk. If not, the record is placed into the InsertBuffer data structure instead; later, at a certain frequency and under certain conditions, the buffered records are merged into their corresponding secondary index pages. Many separate inserts thus become a single flush to disk.
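The decision path can be sketched as follows; buffer_pool and insert_buffer with these methods are hypothetical stand-ins for the internals:

```python
def insert_secondary_index_record(record, buffer_pool, insert_buffer):
    page_id = record.target_page()           # the B+ tree determines the
                                             # leaf page this record joins
    if buffer_pool.contains(page_id):
        buffer_pool.insert(page_id, record)  # page cached: insert directly;
        return                               # flushed to disk later
    insert_buffer.add(page_id, record)       # page not cached: buffer the
                                             # insert; merged into the real
                                             # page later (see 2.4.1.4)
```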

To use the InsertBuffer, two conditions must be met:

  • The index must be a secondary index.
  • The index is not set to unique.

A quick note on the second point: if the index were unique, every insert would have to check uniqueness, which triggers a random-read I/O to search the index’s B+ tree on disk, which is exactly the cost the InsertBuffer is meant to avoid.

2.4.1.2. ChangeBuffer

InnoDB later generalized this buffering to other DML operations in the form of the ChangeBuffer, which comprises three buffers: InsertBuffer, DeleteBuffer, and PurgeBuffer.
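Which of these operations are buffered is controlled by the innodb_change_buffering variable (values such as all, inserts, deletes, purges, changes, none). A quick check, again with placeholder connection parameters:

```python
import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="root", password="...")
cur = cnx.cursor()
cur.execute("SHOW VARIABLES LIKE 'innodb_change_buffering'")
print(cur.fetchall())   # e.g. [('innodb_change_buffering', 'all')]
cnx.close()
```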

2.4.1.3. Internal implementation of InsertBuffer

InnoDB maintains a single global B+ tree for the InsertBuffer, shared by the secondary indexes of all tables. Its non-leaf node records have the following structure:

space(4) marker(1) offset(4)

Here space identifies the tablespace (table) the record belongs to, marker is kept for compatibility with older InsertBuffer formats, and offset is the offset of the page within the tablespace. The search key used to traverse the tree is (space, offset); within one table, records are ordered by offset.

Now let’s look at the leaf node records:

space(4) marker(1) offset(4) metadata(4) real-index-page(x)

metadata stores, among other things, the order in which each record entered the InsertBuffer; it is followed by the actual record data (laid out like the secondary index page into which the record will eventually be inserted). To ensure that every insert can succeed, a special Insert Buffer Bitmap page is introduced to track the free space of each secondary index page; when a page’s free space runs low, a merge operation is triggered.
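As a concrete illustration of these layouts, the 9-byte non-leaf search key can be packed like this; the byte order is assumed big-endian purely for the sketch:

```python
import struct

def pack_nonleaf_key(space: int, offset: int, marker: int = 0) -> bytes:
    """space(4) + marker(1) + offset(4) = 9 bytes, as described above."""
    return struct.pack(">IBI", space, marker, offset)

key = pack_nonleaf_key(space=5, offset=1234)
assert len(key) == 9
```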

2.4.1.4. InsertBuffer merging

So if a record’s secondary index page is not in the buffer pool and the record is placed into the InsertBuffer B+ tree instead, when does merging happen? That is, when are the records in the InsertBuffer inserted into the secondary index B+ tree on disk? In general, there are three triggers:

  • The secondary index page is read into the buffer pool, for example by a SELECT statement.
  • The Insert Buffer Bitmap finds that a secondary index page is running out of free space.
  • The Master Thread’s periodic merge.

When a SELECT loads a secondary index page from disk into the buffer pool, any records for that page sitting in the InsertBuffer are merged into it, and the page then waits to be written back. As you can see, many inserts to the page are collapsed into a single one thanks to the InsertBuffer, so performance improves greatly (B+ tree maintenance is another story).

If the Insert Buffer Bitmap detects that a secondary index page is about to run out of free space, it forces the corresponding page to be read from disk, merges into it the InsertBuffer records destined for that page along with the record currently being inserted, and lets the page wait to be written back.

The Master Thread also merges on a schedule, selecting pages at random so that every page is treated fairly.

2.4.2. Doublewrite

Doublewrite exists to guarantee page consistency in the event of a crash. The redo log records physical operations on pages; if a page itself has been corrupted by a partial write, replaying the redo log against it is pointless.

Doublewrite consists of two parts: a 2MB doublewrite buffer in memory and a contiguous 2MB area on disk. Whenever dirty pages are flushed, they are first copied into the doublewrite buffer, then written sequentially to the on-disk doublewrite area 1MB at a time, and only after that written to their actual positions in the data files.
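A sketch of that flow under simplified assumptions (one doublewrite file, a callback for the real page writes):

```python
import os

PAGE = 16 * 1024
DW_AREA = 2 * 1024 * 1024   # the 2MB on-disk doublewrite area
MB = 1024 * 1024

def flush_with_doublewrite(dirty_pages, dw_file, write_page_to_datafile):
    """dirty_pages: 16KB byte strings, at most 2MB per batch."""
    batch = b"".join(dirty_pages)
    assert len(batch) <= DW_AREA
    # 1. Sequentially write the batch to the doublewrite area, 1MB at a
    #    time, then fsync: a torn data page can now always be repaired.
    with open(dw_file, "r+b") as dw:
        for i in range(0, len(batch), MB):
            dw.write(batch[i:i + MB])
        dw.flush()
        os.fsync(dw.fileno())
    # 2. Only then write each page to its real, discrete position.
    for page in dirty_pages:
        write_page_to_datafile(page)
```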

2.4.3. Adaptive hash index

For hot index lookups, InnoDB automatically builds hash indexes to speed up access. Instead of traversing the whole B+ tree, a lookup hashes the queried index column values (the query key); the hash maps them to the corresponding B+ tree leaf node, which is typically an index page.

2.4.4. Asynchronous IO

To improve disk performance, InnoDB uses asynchronous I/O (AIO), generally provided by the operating system: requests can be issued without waiting for each one to complete, and adjacent requests can be merged.

2.4.5. Flush neighbor pages

When InnoDB flushes a page, it checks the other pages in the extent that the page belongs to. If any of them are dirty too, they are flushed together in one batch, so that multiple I/O operations are combined into a single sequential I/O.
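A tiny sketch of the neighbor check; the extent size matches 1MB of 16KB pages, and the helper callables are hypothetical:

```python
PAGES_PER_EXTENT = 64   # 64 * 16KB = 1MB extent

def flush_with_neighbors(page_no, is_dirty, write_pages):
    start = (page_no // PAGES_PER_EXTENT) * PAGES_PER_EXTENT
    dirty = [p for p in range(start, start + PAGES_PER_EXTENT)
             if is_dirty(p)]
    write_pages(dirty)   # one sequential write instead of many random ones
```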