The Buffer Pool is one of InnoDB’s most important optimizations. It improves InnoDB’s overall performance by caching hot data for both reads and writes.

  • All information in this article is based on the author’s own practice and verification; if there is any mistake, please point it out. Thank you very much.

InnoDB series articles:

  • 【MySQL】InnoDB – Overall architecture: Memory structure and disk structure – Juejin

For a disk-based database system, the most important goal is to access data efficiently. Because of the unbridgeable speed gap between the CPU and the disk, buffer pool technology must be used to speed up data access and compensate for the difference between the two. The Buffer Pool is therefore the most important component of InnoDB.

The introduction of this middle tier also makes InnoDB’s memory management more complex. The buffer pool mainly involves the following features: the LRU list, Free list, Flush list, the flush policy, the Double Write Buffer, read-ahead and write-ahead, warm-up, dynamic resizing, compressed page memory management, concurrency control, multithreading, and so on.

Buffer Pool Instance

Starting with InnoDB version 1.0.x, the buffer pool can be divided into multiple Buffer Pool instances, and each page is assigned to an instance based on its hash value, so pages are spread evenly across instances. Each instance’s resources are independent: it has its own locks, semaphores, physical blocks, logical linked lists, page hash table, and so on, which improves overall engine performance by reducing resource contention within the buffer pool. The relevant parameters are as follows:

  • innodb_buffer_pool_instances: denoted as n, the number of buffer pool instances.
  • innodb_buffer_pool_size: denoted as m, the total size of all buffer pools.

The size of a single buffer pool instance is m/n, and if m is less than 1GB, n is reset to 1 to prevent too many small instances from degrading performance. The physical blocks of all instances are allocated when the database starts, and this memory is not released until the database shuts down.
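As a rough illustration of the two parameters and the hash-based placement, here is a hypothetical sketch; the function names and the fold formula are assumptions for illustration, not InnoDB’s actual code:

#include <cstdint>
#include <iostream>

// Hypothetical sketch: per-instance sizing (m/n) and page-to-instance mapping.
struct BufferPoolConfig {
    uint64_t pool_size;  // innodb_buffer_pool_size (m)
    uint64_t instances;  // innodb_buffer_pool_instances (n)
};

uint64_t instance_size(BufferPoolConfig cfg) {
    const uint64_t ONE_GB = 1ULL << 30;
    if (cfg.pool_size < ONE_GB) cfg.instances = 1;  // avoid many tiny instances
    return cfg.pool_size / cfg.instances;           // each instance gets m/n
}

uint64_t instance_of_page(uint64_t space_id, uint64_t page_no, uint64_t n) {
    uint64_t key = (space_id << 20) + space_id + page_no;  // assumed fold
    return key % n;  // pages spread evenly across instances
}

int main() {
    BufferPoolConfig cfg{8ULL << 30, 8};  // 8GB pool, 8 instances
    std::cout << "per-instance bytes: " << instance_size(cfg) << '\n';
    std::cout << "page (5, 42) -> instance "
              << instance_of_page(5, 42, cfg.instances) << '\n';
}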

buf_pool_t

Each buffer pool instance has a corresponding buf_pool_t data structure, called the buffer pool control body. It stores the control information of the buffer pool instance, such as the instance’s locks, the instance number, and the page hash table, as well as the root nodes of the various logical linked lists and the Zip Free two-dimensional array. Part of the code is:

struct buf_pool_t {
    ...
    ulint                           instance_no;       // Buffer pool instance number
    ulint                           curr_pool_size;    // Buffer pool instance size
    buf_chunk_t                     *chunks;           // Physical block list of the buffer pool instance
    hash_table_t                    *page_hash;        // Page hash table
    UT_LIST_BASE_NODE_T(buf_page_t) free;              // Free list
    UT_LIST_BASE_NODE_T(buf_page_t) LRU;               // LRU list
    UT_LIST_BASE_NODE_T(buf_page_t) flush_list;        // Flush list
    BufListMutex                    free_list_mutex;   // Free list mutex
    BufListMutex                    LRU_list_mutex;    // LRU list mutex
    BufListMutex                    flush_state_mutex; // Flush state mutex
    ...
};

Page

The InnoDB data page is the smallest data-management unit in InnoDB; the default size is 16KB. Starting with InnoDB 1.2.x, the page size can be set to 4K, 8K, or 16K, but it cannot be changed after the engine has started for the first time.

If a table is compressed, its data pages are called compressed pages. The compressed page size is specified when the table is created and can be 1K, 2K, 4K, 8K, or 16K. To read data from a compressed page, the engine first decompresses it into a regular page of the default page size and then reads it. If the compressed page size is specified as 4K but a data page cannot be compressed below 4K, a page split operation is performed on the data page.

Under normal conditions, the buffer pool caches both the compressed page and its decompressed copy. When the free list runs short, the eviction strategy is decided by the real-time load on the system: if the system bottleneck is I/O, only decompressed pages are evicted; otherwise both are evicted.

Dirty pages are pages in the buffer pool whose data has been modified but not yet flushed to disk. Both regular data pages and compressed pages can be dirty, and dirty pages are linked onto the Flush list. Dirty pages are written back to disk periodically or when the system is idle. When a dirty page is evicted from the buffer pool, it must first be flushed to disk.

Each data page has a corresponding data page control body, which stores all information related to the data page together with pointers to the page itself. The data page control body is composed of two data structures: buf_page_t and buf_block_t.

buf_block_t

struct buf_block_t {
    buf_page_t page;   // The other control body, buf_page_t; must be the first data member
    byte       *frame; // Pointer to the actual data page
    BPageMutex mutex;  // Page mutex
    ...
};

buf_block_t is one of the two control bodies of a data page and describes a small amount of information about it. Its first data member is the other control body, buf_page_t; it must come first so that a buf_block_t pointer can be cast to a buf_page_t pointer at any time. The second data member, frame, is a pointer to the data page itself.

buf_page_t

struct buf_page_t {
    ...
    page_id_t      id;                  // Page id
    page_size_t    size;                // Page size
    ib_uint32_t    buf_fix_count;       // Used for concurrency control
    buf_io_fix     io_fix;              // Used for concurrency control
    buf_page_state state;               // Page state
    lsn_t          newest_modification; // Newest LSN, i.e. the LSN of the latest modification
    lsn_t          oldest_modification; // Oldest LSN, i.e. the LSN of the first modification
    ...
};

Each data page in the buffer pool has a corresponding buf_page_t data structure, called the data page control body. It stores most of the data page’s information, including the page id, page size, page state, newest LSN, oldest LSN, and all information about the compressed page, such as the compressed page size and the compressed page pointer.

Buffer Chunk

A Buffer Chunk is a physical block within a buffer pool instance and is the smallest physical storage unit of the buffer pool. The default size of a physical block is 128MB, so the minimum size of the buffer pool is 128MB; the minimum physical block size is 1MB. Since MySQL 8.0, the physical block size can be adjusted dynamically. Physical blocks are allocated at engine startup and are not fully released until the database shuts down.

Pages are InnoDB’s smallest unit of in-memory data management, but the contiguous units of memory are the physical blocks.

Each physical block contains two areas: a control block array whose elements are buf_block_t, and a data page group whose elements are the data pages. The control block array occupies the first part of the chunk and the data page group the second. Every data page must have a corresponding control block, but a control block does not necessarily have a corresponding data page. Chunks hold almost every type of data page, except pages in the BUF_BLOCK_ZIP_PAGE and BUF_BLOCK_ZIP_DIRTY states. Data pages do not always store user data; control information, row locks, and the adaptive hash index also live in data pages.
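As a back-of-the-envelope sketch of this layout (the control block size below is an assumption; the real sizeof(buf_block_t) and alignment rules differ):

#include <cstddef>
#include <iostream>

// Illustrative chunk layout: control blocks first, then the page frames they describe.
int main() {
    const size_t chunk_size   = 128UL << 20;  // default 128MB chunk
    const size_t page_size    = 16UL << 10;   // default 16KB page
    const size_t control_size = 400;          // assumed sizeof(buf_block_t)

    // Each cached page costs one control block plus one page frame.
    size_t pages        = chunk_size / (page_size + control_size);
    size_t control_area = pages * control_size;  // first part of the chunk
    size_t frame_area   = pages * page_size;     // second part of the chunk

    std::cout << "pages per chunk: " << pages << '\n';
    std::cout << "control area bytes: " << control_area
              << ", frame area bytes: " << frame_area << '\n';
}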

Logical Linked Lists

Each node of a logical linked list is a buf_page_t control body. The various types of logical linked lists make data page management much more convenient.

Free List

The Free list links together all unused data pages. It is initialized when a chunk’s memory is allocated, at which point all data pages are free. When a new data page needs to be brought into the buffer pool, one can be taken directly from the free list. InnoDB ensures that there are enough Free list nodes available; when there are not enough, a certain number of nodes are evicted from the LRU list and Flush list to replenish the inventory.
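A minimal sketch of the "take from the free list, otherwise evict from the LRU tail" idea; all names are invented for illustration, and the real buf_LRU_get_free_block logic is considerably more involved:

#include <cstdint>
#include <list>
#include <mutex>

// Illustrative page and pool; not InnoDB's actual structures.
struct Page { uint64_t id; bool dirty; };

struct BufferPool {
    std::list<Page*> free_list, lru;
    std::mutex free_list_mutex, lru_list_mutex;
};

void replenish_free_list(BufferPool &pool) {
    std::scoped_lock lk(pool.lru_list_mutex, pool.free_list_mutex);
    // Evict from the cold end (tail) of the LRU to refill the free list.
    while (!pool.lru.empty() && pool.free_list.size() < 8) {
        Page *victim = pool.lru.back();
        pool.lru.pop_back();
        if (victim->dirty) victim->dirty = false;  // real engine: flush via Flush list first
        pool.free_list.push_back(victim);
    }
}

Page *get_free_page(BufferPool &pool) {
    {
        std::lock_guard<std::mutex> lk(pool.free_list_mutex);
        if (!pool.free_list.empty()) {             // fast path: reuse a free page
            Page *p = pool.free_list.front();
            pool.free_list.pop_front();
            return p;
        }
    }
    replenish_free_list(pool);                     // slow path: evict, then retry once
    std::lock_guard<std::mutex> lk(pool.free_list_mutex);
    if (pool.free_list.empty()) return nullptr;
    Page *p = pool.free_list.front();
    pool.free_list.pop_front();
    return p;
}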

LRU List

The LRU list is the most important data structure in the buffer pool; almost all data pages read in are buffered on it. As the name implies, the LRU list evicts nodes according to the Least Recently Used algorithm, but InnoDB uses an optimized variant of LRU.
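The optimization MySQL documents is midpoint insertion: a newly read page enters the head of an "old" sublist (by default about 3/8 of the list, controlled by innodb_old_blocks_pct) rather than the head of the whole list, and is only promoted to the "young" sublist on re-access, so one-off scans cannot flush out hot pages. A minimal sketch of that idea, not the engine’s code:

#include <cstdint>
#include <deque>
#include <iostream>

// Midpoint-insertion LRU sketch: two sublists instead of one list.
struct MidpointLRU {
    std::deque<uint64_t> young, old;  // front = most recent in each sublist
    size_t capacity;

    explicit MidpointLRU(size_t cap) : capacity(cap) {}

    void first_read(uint64_t page_id) {
        if (young.size() + old.size() >= capacity && !old.empty())
            old.pop_back();               // evict from the cold tail
        old.push_front(page_id);          // midpoint insertion, not list head
    }

    void touch(uint64_t page_id) {
        for (auto it = old.begin(); it != old.end(); ++it)
            if (*it == page_id) {         // re-access: promote into the hot region
                old.erase(it);
                young.push_front(page_id);
                return;
            }
    }
};

int main() {
    MidpointLRU lru(4);
    lru.first_read(1); lru.first_read(2);  // scanned pages stay in the old sublist
    lru.touch(1);                          // page 1 becomes "young" (hot)
    std::cout << "young: " << lru.young.size() << ", old: " << lru.old.size() << '\n';
}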

Flush List

All dirty pages in the buffer pool are mounted on the Flush list, waiting to be written back to disk. Pages on the LRU list that are modified are also placed on the Flush list, as are modified compressed pages. A page may be modified many times before its changes are flushed to disk; the LSN of the latest modification (newest_modification) and the LSN of the first modification (oldest_modification) are recorded in the data page control body. Entries in the Flush list are sorted by the order in which they entered, with the most recently added pages at the head. The Flush list is locked while a page enters it, which guarantees the correct insertion order. Flushing writes pages starting from the tail of the list.
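This ordering can be stated as an invariant: walking from head to tail, oldest_modification never increases, so the tail always holds the page with the smallest unflushed LSN. A small sketch of that invariant, with illustrative names:

#include <cstdint>
#include <iostream>
#include <list>

struct DirtyPage {
    uint64_t oldest_modification;  // LSN of the first unflushed change
    uint64_t newest_modification;  // LSN of the latest change
};

// head = most recently dirtied, so oldest_modification must be non-increasing
// from head to tail; the flusher takes pages from the tail (smallest LSN first).
bool flush_order_ok(const std::list<DirtyPage> &flush_list) {
    uint64_t prev = UINT64_MAX;
    for (const DirtyPage &p : flush_list) {
        if (p.oldest_modification > prev) return false;
        prev = p.oldest_modification;
    }
    return true;
}

int main() {
    std::list<DirtyPage> flush_list;
    flush_list.push_front({100, 180});  // dirtied at LSN 100
    flush_list.push_front({250, 250});  // dirtied later: inserted at the head
    std::cout << (flush_order_ok(flush_list) ? "ordered" : "broken") << '\n';
}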

Unzip LRU List

Similar to LRU linked lists, but used specifically to store decompressed pages extracted from compressed pages.

Zip Clean List

This list exists only in debug mode and is used to store compressed pages; in normal mode, compressed pages are stored on the LRU list.

Zip Free

Zip Free is a two-dimensional array of five linked lists of 1K, 2K, 4K, 8K, and 16K memory fragments, designed to store the compressed pages read in from disk. The engine manages this structure using a buddy system.
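A toy illustration of the binary-buddy idea (InnoDB’s buf_buddy allocator differs in detail): a 4K request served from a single free 16K block splits it twice, leaving one 8K and one 4K buddy on the fragment lists.

#include <cstddef>
#include <iostream>

int main() {
    const size_t classes[5] = {1024, 2048, 4096, 8192, 16384};
    size_t free_count[5] = {0, 0, 0, 0, 1};  // start with one free 16K block

    size_t want = 4096;  // allocate a 4K compressed page
    int cls = 0;
    while (classes[cls] < want) ++cls;              // smallest fitting class

    int src = cls;
    while (src < 5 && free_count[src] == 0) ++src;  // find a splittable block
    for (; src > cls; --src) {
        --free_count[src];        // take one block of the larger class
        free_count[src - 1] += 2; // split it into two buddies of half size
    }
    --free_count[cls];            // hand out one 4K buddy

    for (int i = 0; i < 5; ++i)
        std::cout << classes[i] << "B free: " << free_count[i] << '\n';
    // Result: one 8K buddy and one 4K buddy remain on the fragment lists.
}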

Mutex

buf_pool_t maintains a number of locks to protect concurrent access to the logical lists and related structures:

| Name                 | Type            | Protects    |
| :------------------- | :-------------: | :---------- |
| lru_list_mutex       | mutex           | LRU List    |
| free_list_mutex      | mutex           | Free List   |
| flush_list_mutex     | mutex           | Flush List  |
| hash_lock            | read-write lock | Page Hash   |
| buffer block mutex   | mutex           | buf_block_t |
| buf_fix_count        | semaphore       | buf_page_t  |
| io_fix               | state variable  | buf_page_t  |
| rw_lock (BPageMutex) | read-write lock | Page        |

List Mutex

These include lru_list_mutex, free_list_mutex, and flush_list_mutex. Every data page is on the free list, the LRU list, or the Flush list, so the corresponding list lock must be acquired before I/O can be performed on the page.

hash_lock

Prior to MySQL 5.6, page hash table operations took a single page-hash-level lock. This was later optimized into slot-level hash_locks: there are as many hash_locks as there are slots in the page hash table, which minimizes lock conflicts.

Once the hash_lock has been acquired and the data page located, the hash_lock is released immediately.

page block mutex

The buffer block mutex is a lock on buf_block_t, used to protect variables such as io_fix, state, and buf_fix_count on buf_page_t. This mutex was introduced to reduce the overhead of the buffer-pool-level lock that earlier versions used directly.

buf_fix_count and io_fix

io_fix indicates the state of any I/O operation in progress on the page frame: BUF_IO_READ, BUF_IO_WRITE, BUF_IO_PIN, or BUF_IO_NONE.

buf_fix_count records how many times the current control block is referenced. Each time the page is accessed, buf_fix_count is incremented; during the final resource-release phase of mtr_commit(), buf_fix_count is decremented and the reference is released.

rw_lock

This is the read/write lock on the data page itself, not on the data page control body. An S lock is taken on the data page when it is read, an SX lock when a write is being prepared, and an X lock when the change is applied. InnoDB uses the page block mutex, io_fix, and buf_fix_count to protect rw_lock: when deciding whether a page can be flushed, the io_fix state is checked first to reduce direct contention on rw_lock.

The usual locking process for accessing a data page, sketched in code after this list, is:

  • Acquire hash_lock, then acquire the page block mutex and release hash_lock.

  • Check and modify io_fix and buf_fix_count, then release the page block mutex.

  • Acquire rw_lock.
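A hedged sketch of this sequence using standard C++ primitives as stand-ins for InnoDB’s hash_lock, block mutex, io_fix, and rw_lock; every name here is illustrative:

#include <atomic>
#include <mutex>
#include <shared_mutex>

enum class IoFix { NONE, READ, WRITE, PIN };

struct Block {
    std::mutex block_mutex;                  // "page block mutex"
    std::atomic<unsigned> buf_fix_count{0};
    IoFix io_fix{IoFix::NONE};
    std::shared_mutex rw_lock;               // lock on the page frame itself
};

void read_page(std::shared_mutex &hash_lock, Block &b) {
    hash_lock.lock_shared();          // 1. locate the block via the page hash,
    b.block_mutex.lock();             //    grab the block mutex...
    hash_lock.unlock_shared();        //    ...and drop hash_lock immediately

    b.buf_fix_count.fetch_add(1);     // 2. mark the block as referenced
    b.io_fix = IoFix::PIN;
    b.block_mutex.unlock();

    b.rw_lock.lock_shared();          // 3. S-lock the frame and read the data
    /* ... read row data from the page frame ... */
    b.rw_lock.unlock_shared();

    b.block_mutex.lock();             // unpin when done
    b.io_fix = IoFix::NONE;
    b.buf_fix_count.fetch_sub(1);
    b.block_mutex.unlock();
}

int main() {
    std::shared_mutex hash_lock;
    Block b;
    read_page(hash_lock, b);
}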

Page Hash and Zip Hash

Pages read into the buffer pool are strung together on the LRU list, but traversing the LRU list for every page lookup would be unthinkable. InnoDB therefore maintains a page hash table for each buffer pool instance, keyed by space_id and page_id, so that a cached page can be located in O(1) time.

Data pages on the LRU list are added to the Page Hash, and data pages on the Unzip LRU list are added to the Zip Hash.
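A minimal sketch of an O(1) lookup keyed by (space_id, page_no); the fold function imitates the spirit of InnoDB’s address fold but is an assumption here:

#include <cstdint>
#include <iostream>
#include <unordered_map>

struct PageKey {
    uint64_t space_id, page_no;
    bool operator==(const PageKey &o) const {
        return space_id == o.space_id && page_no == o.page_no;
    }
};

struct PageKeyHash {
    size_t operator()(const PageKey &k) const {
        // Assumed fold of the page address into a single hash key.
        return static_cast<size_t>((k.space_id << 20) + k.space_id + k.page_no);
    }
};

int main() {
    std::unordered_map<PageKey, const char*, PageKeyHash> page_hash;
    page_hash[{5, 42}] = "frame@0x7f..";      // page cached in the pool

    auto it = page_hash.find({5, 42});        // O(1) instead of an LRU scan
    std::cout << (it != page_hash.end() ? it->second : "miss: read from disk") << '\n';
}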

Double Write Buffer

The Double Write Buffer mainly solves the problem of half-written data pages. The Linux file system typically manages pages in 4K units, which does not match the engine’s default 16K data page size, so a power failure can interrupt a page write after only part of the 16K has been written. If the file system could guarantee atomic writes there would be no half-write problem, and changing the engine’s page size to 4K would also avoid it, but the default 16K data page remains best practice.

A double write buffer is configured both on disk and in memory, each 2MB in size, i.e. 128 data pages; it does not occupy space in the data chunks. Each is divided into two parts: one for batch writes, i.e. batch flushing of dirty pages, and the other for single-page writes, i.e. flushing a single page. The default batch write size is innodb_doublewrite_batch_size = 120.

When a batch flush is performed, the dirty pages are first written to the in-memory double write buffer. When that buffer is full, its contents are flushed to the on-disk double write buffer in one synchronous I/O, which guarantees a safe write. Each data page is then written back to its own tablespace with asynchronous I/O, and buf_dblwr_add_to_batch returns to indicate that the write was submitted. Subsequent write requests to the on-disk double write buffer remain blocked, however: only after the asynchronous writes are confirmed successful is the double write area in the system tablespace cleared so that later requests can proceed. The point of this scheme is that if the system loses power while pages are being written back asynchronously, the copies in the system tablespace’s double write buffer are still intact and can be copied again during recovery. After an asynchronous I/O request completes, the integrity of the data page is checked and the change buffer operation is completed; the I/O helper thread then calls buf_flush_write_complete to remove the data page from the Flush list.
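A hedged pseudocode sketch of the batch flow just described, with the actual I/O steps left as comments; all names are illustrative, not the engine’s API:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

struct Dblwr {
    std::vector<std::vector<uint8_t>> slots;  // in-memory double write buffer
    size_t batch_size = 120;                  // innodb_doublewrite_batch_size
};

void flush_dirty_page(Dblwr &dw, const uint8_t *page, size_t page_size) {
    dw.slots.emplace_back(page, page + page_size);  // 1. copy into the buffer
    if (dw.slots.size() < dw.batch_size) return;

    // 2. one synchronous write of the whole batch to the on-disk double write
    //    area in the system tablespace (sequential, safe), then fsync.
    // sync_write(system_tablespace_dblwr_area, dw.slots); fsync(...);

    // 3. asynchronous writes of each page to its own tablespace location.
    // for (auto &p : dw.slots) async_write(page_home(p), p);

    // 4. only after all async writes complete is the double write area cleared;
    //    if the system crashes during step 3, recovery re-copies the intact
    //    pages from the double write area.
    dw.slots.clear();
}

int main() {
    Dblwr dw;
    std::vector<uint8_t> page(16384, 0xAB);  // one 16K dirty page
    for (int i = 0; i < 120; ++i) flush_dirty_page(dw, page.data(), page.size());
    std::cout << "batch flushed, slots now: " << dw.slots.size() << '\n';
}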