Buffer Pool is one of InnoDB’s most important optimizations. It improves InnoDB’s overall performance by caching hot data for both reads and writes.
- All information in this article is based on the author’s own practice and verification. If there is any mistake, corrections are very welcome.
InnoDB series articles:
- 【MySQL】InnoDB – Overall architecture: Memory structure and disk structure – Juejin
For a disk-based database system, the overriding goal is to access data efficiently. However, because of the unbridgeable speed gap between CPU and disk, buffer pool technology must be used to speed up data access and compensate for the difference between the two. The Buffer Pool is therefore the most important component of InnoDB.
The introduction of this middle layer also makes InnoDB’s memory management more complex. The buffer pool mainly covers the following features: the LRU list, free list, flush list, the flush policy, the double write buffer, prefetch and prewrite, warm-up, dynamic resizing, compressed page memory management, concurrency control, multithreading, and so on.
PAGE_STATE
Before you can understand data page management, you must understand the eight states defined for data pages in the data page control block, which are important for managing data pages in the buffer pool and for changing their state.
buf_page_t, the data page control block, is logically equivalent to the data page, and can even be understood as a pointer to the data page. So when we say “the data page is in the list” we mean “the data page’s control block is in the logical list”; when we say “the data page” we may mean either the data page itself or its control block, although strictly both refer to the control block. Note, however, that not every control block has a data page: some control blocks have a frame pointer that points to null, and I will not refer to this type of control block as a data page.
The state field in buf_page_t defines eight states for a data page:
- BUF_BLOCK_NOT_USED: the data page is free. Pages in this state exist only in the free list, and the free list contains only pages of this type.
- BUF_BLOCK_FILE_PAGE: the data page is in normal use. A compressed page transitions to this state after being decompressed. Pages in this state exist only in the LRU list.
- BUF_BLOCK_MEMORY: the data page stores system information, such as InnoDB row locks, the adaptive hash index, or compressed page data. Pages in this state are not in any logical list.
- BUF_BLOCK_READY_FOR_USE: the data page has just been taken off the free list. This is a very short-lived transient state; pages in this state are not in any logical list.
- BUF_BLOCK_REMOVE_HASH: the data page is about to be added to the free list; it has been removed from the page hash table but not yet added to the free list. A transient state; pages in this state are not in any logical list.
- BUF_BLOCK_POOL_WATCH: the control block is an idle slot in the buf_pool_t::watch array, an array of control blocks provided specifically to purge threads to act as sentinels; each buffer pool has as many watch slots as there are purge threads, so that they can work concurrently. A purge thread uses such a control block to determine whether a data page is being read by another thread.
- BUF_BLOCK_ZIP_PAGE: the data page is a compressed page that has not been decompressed, or a sentinel in the watch array. This covers three cases: a compressed page just read from disk and not yet decompressed; a clean compressed page whose decompressed copy has been evicted; and a BUF_BLOCK_POOL_WATCH control block that a purge thread has transitioned into this state. In the first two cases the page is in the LRU list; in the last case the control block points to no data page and its frame pointer is null.
- BUF_BLOCK_ZIP_DIRTY: the data page is a dirty compressed page whose decompressed copy has been evicted. A transient state: if the page is decompressed again, it transitions back to BUF_BLOCK_FILE_PAGE. Pages in this state exist only in the flush list.
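To keep the eight states straight, here is a minimal sketch of them as an enum, together with the logical list each state may appear in. The state names match InnoDB’s; the helper function and comments are mine, not InnoDB’s actual definitions:

```cpp
#include <cstdio>

enum buf_page_state {
  BUF_BLOCK_POOL_WATCH,     // idle sentinel slot in buf_pool_t::watch
  BUF_BLOCK_ZIP_PAGE,       // compressed page not yet decompressed (or watch sentinel)
  BUF_BLOCK_ZIP_DIRTY,      // dirty compressed page whose decompressed copy was evicted
  BUF_BLOCK_NOT_USED,       // free page, lives on the free list
  BUF_BLOCK_READY_FOR_USE,  // just taken off the free list (transient)
  BUF_BLOCK_FILE_PAGE,      // page in normal use, lives on the LRU list
  BUF_BLOCK_MEMORY,         // holds system data (row locks, AHI, compressed data)
  BUF_BLOCK_REMOVE_HASH     // removed from page hash, about to be freed (transient)
};

// Which logical list a page in each state may appear in, per the rules above.
const char* list_of(buf_page_state s) {
  switch (s) {
    case BUF_BLOCK_NOT_USED:  return "free list";
    case BUF_BLOCK_FILE_PAGE: return "LRU list";
    case BUF_BLOCK_ZIP_PAGE:  return "LRU list (unless it is a watch sentinel)";
    case BUF_BLOCK_ZIP_DIRTY: return "flush list";
    default:                  return "no logical list";
  }
}

int main() { std::printf("%s\n", list_of(BUF_BLOCK_FILE_PAGE)); }
```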
Buffer Pool initialization
Buffer pool memory initialization is mainly the initialization of the physical memory blocks; multiple buffer pool instances are initialized in turn. os_mem_alloc_large allocates memory for each chunk, and buf_chunk_init initializes it. Then the BUF_BLOCK_POOL_WATCH data page control blocks, the page hash, and the zip hash, none of which belong to any chunk, are initialized.
Allocate memory
There are two ways to allocate memory from the operating system, one is HugeTLB, and the other is traditional MMap.
HugeTLB
HugeTLB is a technology for allocating and managing large memory blocks; it raises the operating system page size to 2M or more. A program hands virtual memory addresses to the CPU, which must map them to real physical addresses through page tables. The full page tables live in memory, while hot entries can be cached in the CPU (the TLB) to improve memory access efficiency. However, larger memory pages inevitably lead to more in-page fragmentation, and virtual memory will also slow down if a swap partition is used. This mode is used to allocate memory only when the super-large-pages parameter is specified at MySQL startup.
MMap
MMap can be used to allocate shared memory for multiple processes, and the allocated memory is virtual memory, which is allocated only when it is actually used. Malloc also calls MMap to allocate memory when the allocation exceeds MMAP_THRESHOLD=128K.
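As a point of reference, here is a minimal sketch of what the mmap path boils down to, assuming Linux; the real os_mem_alloc_large naturally does more than this (for example the HugeTLB attempt described above):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  std::size_t bytes = 128UL * 1024 * 1024;  // e.g. one 128M chunk
  // Anonymous private mapping: the kernel hands back virtual address
  // space; physical pages are faulted in only when actually touched.
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
  std::printf("chunk mapped at %p\n", p);
  munmap(p, bytes);
  return 0;
}
```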
Chunk init
Call buf_chunk_init to allocate memory for Chunk:
- First, the whole chunk is treated as a contiguous array of 16K pages, and the array length is stored in the integer variable size, which later becomes the length of the data page array.
- Next, a frame pointer is set to the head of the chunk; it will eventually point to the first data page.
- Then, in a loop, check whether the space before frame can hold one control block for every data page from frame onwards. If not, move frame back one page and check again; if so, break out of the loop. frame now points to the first data page.
- Finally, starting from frame, initialize all the control blocks, point each control block’s frame pointer at its corresponding data page, and put all the control blocks into the free list.
Step 3 is not shown as a code block here because the code is too long; a condensed sketch follows.
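The sketch below implements just that step 3 computation, under assumed sizes; the control block size is made up for illustration (the real one is sizeof(buf_block_t)):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t PAGE_SIZE  = 16 * 1024;  // UNIV_PAGE_SIZE
  const std::size_t BLOCK_SIZE = 400;        // assumed control block size

  std::size_t chunk_bytes = 128UL * 1024 * 1024;  // one 128M chunk
  std::size_t size  = chunk_bytes / PAGE_SIZE;    // chunk as an array of 16K pages
  std::size_t frame = 0;                          // index of the first data page

  // Step 3: move frame back one page at a time until the space before it
  // can hold one control block for every data page from frame onwards.
  while (frame * PAGE_SIZE < (size - frame) * BLOCK_SIZE)
    frame++;

  std::printf("%zu pages total, first frame at page %zu, %zu usable pages\n",
              size, frame, size - frame);
  return 0;
}
```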
LRU List
In the LRU list, data pages are loaded and evicted according to an optimized LRU algorithm. The fundamental purpose of choosing and optimizing the LRU algorithm is to protect hot data from being washed out by cold data.
LRU
The purpose of the buffer pool is to cache hot data, so the data pages it retains should be the ones accessed most often; viewed the other way around, it must be able to evict the least recently used data pages. The Least Recently Used (LRU) algorithm is therefore the natural choice, and the core data structure of the buffer pool is the LRU list, which is responsible for keeping hot data and evicting cold data.
The LRU algorithm used in the engine is an optimized variant; the aim of the optimizations is to evict cold data more efficiently and protect hot data better.
Prefetch
When a page is about to be added to the LRU list, the engine also prefetches several pages of data around it. This is based on the well-known principle of locality in computer optimization:
- Temporal locality: a recently used page is likely to be used again in the near future, which is why LRU is chosen as the eviction algorithm.
- Spatial locality: pages needed in the near future are likely to be adjacent to the current page in logical space, which is why adjacent data pages are loaded by prefetching.
Prefetch comes in two forms, random prefetch and linear prefetch:
- Random prefetch: when a data page is loaded into the Buffer Pool and InnoDB judges it to be a hot page (in the first quarter of the New SubList), it reads all the data pages of that page’s extent into the LRU list in page-number order. Random prefetch loads the data with asynchronous I/O, using OS_AIO_SIMULATED_WAKE_LATER and os_aio_simulated_wake_handler_threads to facilitate I/O merging.
- Linear prefetch: triggered when a data page sits at the boundary of its extent and at least innodb_read_ahead_threshold (default: 56) consecutive data pages in that extent have been accessed sequentially. The engine decides whether to prefetch the previous or the next extent from whether page numbers were rising or falling during the sequential access. The trigger conditions are strict, mainly to address the read performance problems of full table scans (the trigger check is sketched below).
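A hedged sketch of the linear prefetch trigger check; the function and the extent-size constant are illustrative, not InnoDB’s actual code:

```cpp
#include <cstdint>
#include <cstdio>

const uint32_t EXTENT_PAGES = 64;  // 64 x 16K pages = one 1M extent

// Fire only when the accessed page sits at an extent boundary and at
// least `threshold` pages of the extent were accessed in order.
bool should_linear_readahead(uint32_t page_no, uint32_t accessed_in_order,
                             uint32_t threshold = 56) {
  uint32_t pos = page_no % EXTENT_PAGES;
  bool at_boundary = (pos == 0) || (pos == EXTENT_PAGES - 1);
  return at_boundary && accessed_in_order >= threshold;
}

int main() {
  // Page 63 is the last page of its extent; 60 pages were read in order.
  std::printf("%d\n", should_linear_readahead(63, 60));  // prints 1
}
```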
Prefetch failure
If prefetched data pages are never actually accessed, these useless pages occupy space in the list and may even push hot data pages out of it, while themselves lingering in the list for some time before being evicted. This problem is called prefetch failure.
The goal in handling prefetch failure is to evict the useless pages as quickly as possible while evicting as few hot pages as possible. Once the LRU list grows longer than BUF_LRU_OLD_MIN_LEN=512, the engine splits it into a New SubList and an Old SubList: by default the new sublist, about 5/8 of the list, sits at the head and holds hot data pages, while the old sublist, about 3/8, sits at the tail and holds newly loaded pages. The point where the two join is called the midpoint.
A newly loaded page is inserted at the midpoint, i.e. the head of the Old SubList. Any page in the Old SubList that is accessed again is moved to the head of the New SubList, so pages from failed prefetches are quickly evicted from the Old SubList. At the same time, pages in the New SubList are not moved back to the head on every access: a page is moved back to the head only once it has drifted beyond the first 1/4 of the new sublist (see the sketch below).
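To make the midpoint mechanics concrete, here is a toy sketch using std::list in place of InnoDB’s intrusive lists; all names are illustrative, and the 1/4 promotion rule and eviction are omitted:

```cpp
#include <cstddef>
#include <cstdint>
#include <list>

struct Page { uint32_t id; };

struct LruList {
  std::list<Page> lru;              // front = head of the New SubList
  std::list<Page>::iterator mid;    // first page of the Old SubList
  std::size_t old_len = 0;

  LruList() : mid(lru.end()) {}

  // A newly loaded page enters at the midpoint, not the true head.
  void insert_old(Page p) {
    mid = lru.insert(mid, p);
    ++old_len;
  }

  // An old page that is accessed again is promoted to the true head.
  void make_young(std::list<Page>::iterator it) {
    if (it == mid) ++mid;           // keep the midpoint iterator valid
    lru.splice(lru.begin(), lru, it);
    --old_len;
  }
};

int main() {
  LruList l;
  l.insert_old({1});
  l.insert_old({2});    // LRU: [2, 1], both in the Old SubList
  l.make_young(l.mid);  // page 2 is re-accessed and becomes the head
}
```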
Buffer pool contamination
Prefetched pages that are revisited rarely or not at all are simply evicted from the Old SubList. In a full table scan, however, a large number of data pages may be accessed several times within a short period and then never again. This flood of cold data pushes into the New SubList and washes out the hot data; this is called buffer pool pollution.
As with prefetch failure, protecting hot data means raising the bar for entering the new sublist. The engine therefore uses a time window mechanism to raise the entry threshold of the New SubList: a page can enter the New SubList only if it has survived in the Old SubList longer than a set time (default: 1s) and is accessed again after that.
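A minimal sketch of the time-window check (the window corresponds to innodb_old_blocks_time, default 1000 ms):

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// An old-sublist page may be promoted only if it is re-accessed after
// surviving in the Old SubList longer than the window.
bool can_make_young(Clock::time_point first_access,
                    std::chrono::milliseconds window =
                        std::chrono::milliseconds(1000)) {
  return Clock::now() - first_access > window;
}

int main() {
  auto first_access = Clock::now();
  // Re-accessed immediately: still inside the window, stays in the Old SubList.
  return can_make_young(first_access) ? 1 : 0;
}
```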
Data page access
Any query or modification of a data page first loads the page into the buffer pool. The engine retrieves a data page through the following steps (a toy end-to-end sketch follows the list):
- Buffer pool instance: compute the buffer pool instance number from the data page’s identity and locate the corresponding instance.
- Probe hash table: query the page hash table to determine whether the page is already in the buffer pool. If it is, decide whether it needs to be moved toward the head of the LRU list.
- Load page from disk: if the page is not in the pool, it must be loaded from disk.
- Get free page: take a free page from the free list. If there are no free pages, force a flush to produce some.
- Mount to LRU: point the free page’s control block at the free page memory and attach the control block to the head of the old LRU sublist. If the LRU list does not have enough room, the data page at the tail of the old sublist is evicted.
- Flush dirty page: if the page to be evicted is dirty, it is flushed to disk first.
- Write data page: read the data from disk and write it into the prepared blank page.
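Putting the steps together, here is a toy end-to-end sketch of the lookup path; every type and helper here is illustrative rather than InnoDB’s actual API, and locking is omitted:

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

struct Page { uint64_t id = 0; bool dirty = false; };

struct BufferPoolInstance {
  std::unordered_map<uint64_t, Page*> page_hash;
  std::list<Page*> free_list;
  std::list<Page*> lru;

  Page* get_free_page() {
    if (!free_list.empty()) {
      Page* p = free_list.front();
      free_list.pop_front();
      return p;
    }
    // No free page: evict the LRU tail victim, flushing it first if dirty.
    Page* victim = lru.back();
    lru.pop_back();
    if (victim->dirty) victim->dirty = false;  // "flush to disk" here
    page_hash.erase(victim->id);
    return victim;
  }
};

Page* fetch_page(std::vector<BufferPoolInstance>& pools, uint64_t page_id) {
  // 1. Pick the buffer pool instance from the page's identity.
  BufferPoolInstance& bp = pools[page_id % pools.size()];
  // 2. Probe the page hash; a hit may also move the page toward the head.
  auto it = bp.page_hash.find(page_id);
  if (it != bp.page_hash.end()) return it->second;
  // 3-7. Miss: take a free page, "read" the page from disk into it,
  // register it in the hash, and hang it on the LRU list (the real code
  // inserts at the midpoint, i.e. the old-sublist head).
  Page* p = bp.get_free_page();
  p->id = page_id;
  bp.page_hash[page_id] = p;
  bp.lru.push_back(p);
  return p;
}

int main() {
  std::vector<BufferPoolInstance> pools(8);
  pools[0].free_list.push_back(new Page);  // seed instance 0 with one free page
  fetch_page(pools, 0);                    // miss: loaded via the free list
  fetch_page(pools, 0);                    // hit in the page hash
}
```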
Data page initialization
When a data page is loaded from disk into the buffer pool, the newly written page is initialized. This includes acquiring a free page, adding it to the LRU list, and so on:
- Get free page: call buf_LRU_get_free_block to get a free data page from the buffer pool.
- LRU lock: lock the LRU list.
- Initialize page type: initialize the free page to the specified data page type.
- Mount to LRU: add the data page to the LRU list.
- LRU unlock: unlock the LRU list.
Get free pages
The function buf_LRU_get_free_block uses the variable n_iterations to record the number of attempts made to obtain a free page. Based on this count, three escalating strategies apply (a toy sketch follows the list):
- First attempt, n_iterations=0:
  - Call buf_LRU_get_free_only to take a free data page off the free list.
  - If that fails and try_LRU_scan is true, scan the specified number of pages (default: 100) at the tail of the LRU list for a page that can be reclaimed (one that no transaction is using), call buf_LRU_free_page to reclaim it, then call buf_LRU_get_free_only again to take a free page off the free list.
  - If no reclaimable page is found, try to flush one dirty page from the tail of the LRU list by calling buf_flush_single_page_from_LRU, then call buf_LRU_get_free_only again to fetch a free page from the free list.
  - If there is still no free page, increment n_iterations.
- Second attempt, n_iterations=1:
  The steps are the same as for n_iterations=0, except that in step 2 the entire LRU list is scanned rather than just the tail. If everything fails, n_iterations is incremented again.
- Third and later attempts, n_iterations>1:
  The steps are the same as for n_iterations=1, except that the thread sleeps 10ms before the single-page flush, to avoid grabbing dirty pages just produced by other threads’ single-page flushes.
- After the twentieth attempt, n_iterations>20:
  A log message is printed warning that no free page can be obtained.
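The escalation can be summarized in a toy retry loop; the helpers are stand-ins for buf_LRU_get_free_only, the LRU scan, and buf_flush_single_page_from_LRU, with toy bodies so the sketch runs:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Toy stand-ins; the real logic lives in InnoDB's buf0lru.cc.
static int free_pages = 0;
static bool get_free_only()               { return free_pages-- > 0; }
static bool scan_lru_and_free(bool whole) { (void)whole; return false; }
static bool flush_single_page_from_lru()  { free_pages = 1; return true; }

// Escalating search for a free page, mirroring the strategies above.
static void get_free_block() {
  for (int n_iterations = 0;; ++n_iterations) {
    if (get_free_only()) return;  // fast path: straight off the free list
    // Round 0 scans only the LRU tail (~100 pages); later rounds scan it all.
    if (scan_lru_and_free(/*whole=*/n_iterations >= 1) && get_free_only())
      return;
    if (n_iterations > 1)  // back off so we don't race for just-flushed pages
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    if (flush_single_page_from_lru() && get_free_only()) return;
    if (n_iterations > 20)
      std::fprintf(stderr, "warning: difficult to find a free page\n");
  }
}

int main() { get_free_block(); }
```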
Compressed page memory management
InnoDB uses a buddy system to manage these irregular memory allocations. Compressed pages come in sizes of 16K, 8K, 4K, 2K, and 1K. The buddy system allocates a free fragment of the matching size from the corresponding fragment list in zip_free and hands the fragment to the control block’s frame pointer.
An allocation first checks whether the zip_free list of the matching size has a free fragment. If not, it takes a fragment from the next larger list and splits it, and so on; if there are no fragments all the way up to 16K, it calls buf_LRU_get_free_block to get a new data page.
When loading a 2K compressed page, the buddy system first checks the 2K list in zip_free for a free fragment (the split path is sketched after this list):
- If there is no 2K fragment, it takes a 4K fragment from the 4K list and splits it: the high-address 2K half is inserted into the 2K list, and the low-address 2K half is returned to store the compressed page. The two halves of a split are a pair of buddies.
- If there is no 4K fragment either, it takes an 8K fragment from the 8K list and splits it twice: first into two 4K halves, with the high-address half inserted into the 4K list; the low-address half is then split into two 2K halves, with the high-address half inserted into the 2K list and the low-address half returned.
- If there is no 8K fragment, it goes to the 16K list and splits step by step in the same way.
- If there is no 16K fragment, it calls buf_LRU_get_free_block to get a 16K page. Once a fragment has been obtained, the data page that carries it is added to the zip hash.
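A toy sketch of the split path, tracking fragments as raw pointers in per-size free lists; alloc_16k stands in for buf_LRU_get_free_block, and control blocks and the zip hash are omitted:

```cpp
#include <cstdio>
#include <list>

const int MIN_SHIFT = 10;                  // 1K fragments
const int MAX_SHIFT = 14;                  // 16K pages
std::list<char*> zip_free[MAX_SHIFT - MIN_SHIFT + 1];  // one free list per size

// Stand-in for buf_LRU_get_free_block: grab a whole 16K page.
char* alloc_16k() { return new char[1 << MAX_SHIFT]; }

char* buddy_alloc(int shift) {
  std::list<char*>& lvl = zip_free[shift - MIN_SHIFT];
  if (!lvl.empty()) {                      // a fragment of the right size is free
    char* p = lvl.front(); lvl.pop_front(); return p;
  }
  if (shift == MAX_SHIFT) return alloc_16k();  // nothing larger to split
  char* big = buddy_alloc(shift + 1);      // take the next size up and split it:
  lvl.push_back(big + (1 << shift));       // high-address half joins the free list,
  return big;                              // low-address half stores the page
}

int main() {
  char* frag = buddy_alloc(11);            // allocate room for a 2K compressed page
  std::printf("2K fragment at %p\n", static_cast<void*>(frag));
}
```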
The buddy system calls buf_buddy_free to release the memory of a compressed page. All fragments are split off from 16K fragments, so a 16K fragment has no buddy of its own, and reclaiming one is simple: clear the memory and hang it on the 16K list. For a lower-level fragment, such as the 4K memory of a compressed page, the system first looks for its split buddy: a high-address block looks for the low-address fragment, and a low-address block looks for the high-address fragment. It then checks whether the buddy is already free; if so, the pair is merged into a higher-level (8K) fragment, after which it looks for the 8K buddy fragment and tries to merge again. If that buddy is not free, the compressed page data on the buddy is moved onto a free 8K fragment from zip_free, and the pair is then merged into a 16K fragment. If zip_free has no free 8K fragment, the merge is abandoned and the fragment is simply hung on the 8K list. The purpose of merging is to reduce memory fragmentation.
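And a toy sketch of freeing with coalescing: a fragment’s buddy is found by flipping the size bit of its offset within the 16K block. The relocation of a busy buddy’s data described above is omitted here:

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

const int MIN_SHIFT = 10, MAX_SHIFT = 14;  // 1K .. 16K
// Offsets of free fragments within one 16K block, one set per size class.
std::set<uintptr_t> free_frags[MAX_SHIFT - MIN_SHIFT + 1];

void buddy_free(uintptr_t off, int shift) {
  while (shift < MAX_SHIFT) {
    uintptr_t buddy = off ^ (uintptr_t(1) << shift);  // flip the size bit
    auto& lvl = free_frags[shift - MIN_SHIFT];
    if (lvl.count(buddy) == 0) break;  // buddy busy: stop merging here
    lvl.erase(buddy);                  // buddy free: merge the pair upward
    off &= ~(uintptr_t(1) << shift);   // offset of the merged fragment
    ++shift;
  }
  free_frags[shift - MIN_SHIFT].insert(off);
}

int main() {
  buddy_free(0x0800, 11);  // free the 2K fragment at offset 2K
  buddy_free(0x0000, 11);  // free its buddy: the pair merges into 4K
  std::printf("free 4K fragments: %zu\n", free_frags[12 - MIN_SHIFT].size());
}
```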