Preface
The most basic requirement users place on a database is that it can read and write data efficiently, yet reading and writing ultimately involve interacting with slow storage devices. To bridge the speed gap between the two, every database has a buffer pool that manages the corresponding data pages and improves database efficiency; of course, because an intermediate layer is introduced, database memory management becomes relatively complex. This article analyzes the main techniques and implementation principles behind the MySQL Buffer Pool. The source code is based on the MySQL 5.6 branch of Alibaba Cloud RDS, and some of the features described have already been open-sourced in AliSQL. The Buffer Pool source code lives in the buf directory and mainly covers the LRU List, the Flush List, the Double Write Buffer, preread and prewrite (read-ahead and flush neighbors), Buffer Pool warm-up, and compressed-page memory management, including header files and .ic files, about 20,000 lines of code in total.
Basic concepts
Buffer Pool Instance: each instance manages innodb_buffer_pool_size / innodb_buffer_pool_instances of memory. Each instance has its own locks, semaphores, physical chunks, and logical linked lists, so different instances can be read and written concurrently without contention. The physical Buffer Chunks of all instances are allocated when the database starts and are not released until the database shuts down. When innodb_buffer_pool_size is less than 1GB, innodb_buffer_pool_instances is reset to 1, mainly to prevent too many small instances from causing performance problems. Each Buffer Pool Instance has its own page hash table, through which space_id and page_no can be used to quickly find a data page that has already been read into memory, without linearly traversing the LRU List. Note that this hash table is not InnoDB's adaptive hash index: the adaptive hash exists to reduce B-tree scans, while the page hash exists to avoid scanning the LRU List.

Data page: in InnoDB, the minimum unit of data management is the page, 16KB by default. Pages can store user data as well as control information. The minimum unit of read and write for the InnoDB IO subsystem is also a page. If a table is compressed, the corresponding data page is called a compressed page; to read data from a compressed page it must be decompressed to form a decompressed page, which is 16KB. The compressed page size is specified when the table is created, and 16K, 8K, 4K, 2K, and 1K are currently supported. Even with the compressed page size set to 16K there is still some benefit for blob/varchar/text columns. Assume the specified compressed page size is 4K: if a data page cannot be compressed below 4K, the B-tree must be split, which is a time-consuming operation. Under normal circumstances the Buffer Pool caches both the compressed and the decompressed page. When the Free List runs short, the eviction strategy is chosen according to the actual system load: if the bottleneck is IO, only the decompressed page is evicted and the compressed page stays in the Buffer Pool; otherwise both are evicted.

Buffer Chunks: the physical memory, consisting of two parts: the data pages and the data page control bodies, where each control body has a pointer to its data page. Buffer Chunks are the lowest-level physical memory, requested from the operating system at startup and released only when the database shuts down. Through the chunks, almost all data pages can be reached, except pages in two states: compressed pages that have not been decompressed (BUF_BLOCK_ZIP_PAGE) and compressed pages that have been modified and whose decompressed page has been evicted (BUF_BLOCK_ZIP_DIRTY). In addition, a data page does not necessarily store user data; it may store control information, such as row locks or the adaptive hash index.

Logical linked lists: lists whose nodes are data page control bodies (each control body has a pointer to the real data page). All nodes on a given list share some attribute; the lists are introduced for easier management. The following lists are all logical lists.

Free List: the nodes on this list are unused. When the database needs to allocate a new data page, it takes one directly from this list. InnoDB needs to ensure the Free List has enough nodes for user threads to use; otherwise it must evict nodes from the FLU List or the LRU List.
After InnoDB is initialized, all data pages in the Buffer Chunks are on the Free List, indicating that every node is available.

LRU List: the most important linked list in InnoDB. All newly read data pages are placed on it. The list is ordered by a least-recently-used algorithm, with the least recently used nodes towards the tail; when the Free List has no more nodes, pages are evicted from the tail of this list. The LRU List also contains compressed pages that have not yet been decompressed, i.e. pages that have just been read from disk. The LRU List is divided into two parts: by default the first 5/8 is the Young List, which holds frequently used hot pages, and the last 3/8 is the Old List. Newly read pages are by default inserted at the head of the Old List and are moved to the Young List only when certain conditions are met; this is mainly to prevent read-ahead pages and full table scans from polluting the Buffer Pool.

FLU List (Flush List): every node on this list is a dirty page, i.e. a data page that has been modified but not yet flushed to disk. A page on the FLU List must also be on the LRU List, but not vice versa. A data page may be modified many times; the LSN of the oldest (first) unflushed modification, called oldest_modification, is recorded on the page, and different pages have different oldest_modification values. The nodes on the FLU List are ordered by oldest_modification: the tail has the smallest value, i.e. the earliest modified page, and when pages must be flushed from the FLU List they are taken from the tail. Insertion into the FLU List is protected by flush_list_mutex, which guarantees the ordering of the nodes.

Quick List: an Alibaba Cloud RDS extension. Data pages read by statements carrying the ENGINE_NO_CACHE hint are placed on this list instead of polluting the LRU List, and they are removed when the statement or transaction ends (see the section on full table scans below).

Unzip LRU List: the pages on this list are decompressed pages, i.e. 16KB pages produced by decompressing a compressed page.

Zip Clean List: this list exists only in debug mode and holds compressed pages that have not been decompressed. These pages have just been read from disk; once decompressed, they are removed from this list and added to the Unzip LRU List.

Zip Free: compressed pages come in different sizes, such as 8K and 4K, and InnoDB uses a buddy system similar to the one used in memory management to manage them. Zip Free can be understood as a two-dimensional structure made of five linked lists, each holding memory fragments of the corresponding size; the 8K list holds 8K fragments, for example. If a new 8K block is needed, the 8K list is searched first and a block is returned directly if one exists; if not, a 16K block is split into two 8K blocks, one is used and the other is put on the 8K list.
Core data structures
The InnoDB Buffer Pool has three core data structures: buf_pool_t, buf_block_t, and buf_page_t. buf_pool_t stores Buffer Pool Instance level control information, such as the instance's mutex, instance_no, page_hash, old_list_pointer, and so on. It also stores the root nodes of the various logical lists, and the Zip Free two-dimensional array is in it as well. buf_block_t is the control body of a data page; it describes part of the page information (most of the information is in buf_page_t). Its first field is a buf_page_t. This is not arbitrary: it must be the first field, because only then can buf_block_t and buf_page_t pointers be converted into each other. The second field, frame, points to the data page where the data is actually stored. buf_block_t also stores the root node of the Unzip LRU List, and another important field is the block-level mutex. buf_page_t can be understood as another control body of the data page; most of the page information lives here, such as space_id, page_no, the page state, newest_modification, oldest_modification, access_time, and all the information about the compressed page. The compressed page information includes the size of the compressed page and a data pointer to the compressed data (the real compressed data is stored on a page allocated by the buddy system). Note that if a compressed page has been decompressed, the data pointer of the decompressed page is stored in the frame field of buf_block_t.
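To make the pointer-conversion point concrete, here is a simplified layout sketch (these are not the real InnoDB definitions, just an illustration of why buf_page_t must come first in buf_block_t):

```cpp
// Simplified sketch (not the actual InnoDB definitions) showing why buf_page_t
// must be the first field of buf_block_t: a buf_block_t* can then be treated
// as a buf_page_t* and vice versa via a plain cast.
struct buf_page_t {
    unsigned space_id;
    unsigned page_no;
    int      state;              // one of the BUF_BLOCK_* states
    // ... oldest_modification, newest_modification, access_time, zip info ...
};

struct buf_block_t {
    buf_page_t     page;         // MUST be the first field
    unsigned char* frame;        // points to the actual 16KB data page
    // ... Unzip LRU node, block-level mutex ...
};

// Because `page` sits at offset 0, the two pointer types are interchangeable:
static inline buf_page_t* block_to_page(buf_block_t* block) {
    return &block->page;                          // same address as block
}
static inline buf_block_t* page_to_block(buf_page_t* bpage) {
    return reinterpret_cast<buf_block_t*>(bpage); // valid only if bpage really
                                                  // is embedded in a buf_block_t
}
```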
Here is a look at the state field in buf_page_t, which indicates the current state of the page. There are eight states. They may be difficult for beginners to understand, especially the first three; you can skip them if needed.

BUF_BLOCK_POOL_WATCH: pages of this type are provided for the Purge thread. To support multi-versioning, InnoDB records previous versions of data in the undo log; if no read view needs them anymore, they can be deleted by the Purge thread. In other words, the Purge thread needs to know whether a certain data page is currently being read. The solution: first look in the page hash to see whether the page has been read in; if not, take a sentinel page control body of type BUF_BLOCK_POOL_WATCH (allocated with malloc at startup, not inside the Buffer Chunks), add it to the page hash without any real data (buf_block_t::frame is empty), and set its type to BUF_BLOCK_ZIP_PAGE (to indicate it is in use, so other Purge threads do not take the same control body); the related function is buf_pool_watch_set. If the page is found in the page hash, it is only necessary to check whether the address of its control body lies inside the Buffer Chunks; if it does, the page has already been read in by some other thread; the related function is buf_pool_watch_occurred. On the other side, if a user thread needs this data page, it looks in the page hash to see whether the entry is of type BUF_BLOCK_POOL_WATCH; if it is, it reclaims that sentinel, allocates a free control body from the Free List (i.e. inside the Buffer Chunks), and fills in the data. The trick here is to use the location of the control body in memory to decide whether the data page is in use.

BUF_BLOCK_ZIP_PAGE: when a compressed page is read from disk, a temporary buf_page_t is allocated with malloc, space for the compressed data is allocated from the buddy system to hold the data read from disk, and the temporary buf_page_t is marked BUF_BLOCK_ZIP_PAGE (buf_page_init_for_read). Only when the compressed page is decompressed does the state change to BUF_BLOCK_FILE_PAGE and the page get added to the LRU List and the Unzip LRU List (buf_page_get_gen). A compressed page is also marked BUF_BLOCK_ZIP_PAGE when its decompressed page is evicted but the compressed page must be kept and is not dirty (buf_LRU_free_page). So, normally, not many pages are in the BUF_BLOCK_ZIP_PAGE state. Compressed pages marked BUF_BLOCK_ZIP_PAGE in both of the preceding cases are on the LRU List. A third use: a BUF_BLOCK_POOL_WATCH node, once used by a Purge thread, is also marked BUF_BLOCK_ZIP_PAGE.

BUF_BLOCK_ZIP_DIRTY: if the decompressed page corresponding to a compressed page is evicted, but the compressed page must be kept and is dirty, the page is marked BUF_BLOCK_ZIP_DIRTY (buf_LRU_free_page). If the compressed page is decompressed again, the state changes back to BUF_BLOCK_FILE_PAGE, so BUF_BLOCK_ZIP_DIRTY is also a transient state. Pages of this type are on the Flush List.

BUF_BLOCK_NOT_USED: pages on the Free List are in this state. It is a state that can persist for a long time.

BUF_BLOCK_READY_FOR_USE: when a free page is taken from the Free List, its state changes from BUF_BLOCK_NOT_USED to BUF_BLOCK_READY_FOR_USE (buf_LRU_get_free_block). This is also a transient state; a page in this state is not on any logical list.

BUF_BLOCK_FILE_PAGE: normally used data pages are in this state.
Most of the pages on the LRU List are in this state. After a compressed page is decompressed, its state also changes to BUF_BLOCK_FILE_PAGE.

BUF_BLOCK_MEMORY: pages in the Buffer Pool can store not only user data but also system information, such as InnoDB row locks, the adaptive hash index, and compressed page data; such pages are marked BUF_BLOCK_MEMORY.

BUF_BLOCK_REMOVE_HASH: before a page is added to the Free List, it must first be removed from the page hash. This state therefore means the page has been removed from the page hash but has not yet been added to the Free List; it is a transient state.

Overall, most pages are in the BUF_BLOCK_NOT_USED state (all on the Free List) or the BUF_BLOCK_FILE_PAGE state (mostly on the LRU List; the LRU List also contains pages in the BUF_BLOCK_ZIP_PAGE state, including those marked by the Purge thread); a small portion are in the BUF_BLOCK_MEMORY state, and very few are in other states. InnoDB treats the first three states as invalid states (see buf_block_state_valid): their data pages are not in the Buffer Chunks and their control bodies are allocated separately. Once you understand these eight states and the transitions between them, reading the Buffer Pool code in detail becomes much easier.
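For reference, the eight states can be summarized as an enum like the following (a condensed sketch based on the description above; the authoritative definition is in buf0buf.h):

```cpp
// The eight states, summarized as an enum with one-line reminders.
enum buf_page_state {
    BUF_BLOCK_POOL_WATCH,     // sentinel reserved for the Purge thread
    BUF_BLOCK_ZIP_PAGE,       // compressed page, clean, not decompressed
    BUF_BLOCK_ZIP_DIRTY,      // compressed page, dirty, decompressed page evicted
    BUF_BLOCK_NOT_USED,       // on the Free List
    BUF_BLOCK_READY_FOR_USE,  // just taken from the Free List (transient)
    BUF_BLOCK_FILE_PAGE,      // normal page holding a file page
    BUF_BLOCK_MEMORY,         // holds system data (row locks, AHI, zip data, ...)
    BUF_BLOCK_REMOVE_HASH     // removed from page hash, not yet on the Free List
};
```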
Next, a brief introduction to the buf_fix_count and io_fix variables in buf_page_t, which are mainly used for concurrency control and to reduce the scope of mutex locking. buf_page_t::buf_fix_count is incremented whenever a thread accesses (pins) the page and decremented when it is done. buf_page_t::io_fix indicates whether the page is currently being read or written by the IO subsystem (for example BUF_IO_READ or BUF_IO_WRITE, and BUF_IO_NONE when no IO is in progress). A page may be relocated or evicted only if buf_fix_count is zero and io_fix is BUF_IO_NONE (see buf_page_can_relocate). The trick here is to reduce contention on the control body's mutex, while the contents of the data page itself are still protected by read/write locks.
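A minimal sketch of that rule, with hypothetical names (the real check lives in buf_page_can_relocate):

```cpp
// A minimal sketch (not the actual InnoDB code) of the rule described above:
// a page may only be relocated/evicted when nobody has it pinned and no IO
// is in flight on it.
enum buf_io_fix { BUF_IO_NONE, BUF_IO_READ, BUF_IO_WRITE };

struct buf_page_t_sketch {
    unsigned   buf_fix_count;  // number of threads currently pinning the page
    buf_io_fix io_fix;         // IO state of the page
};

static inline bool buf_page_can_relocate_sketch(const buf_page_t_sketch* bpage) {
    return bpage->buf_fix_count == 0 && bpage->io_fix == BUF_IO_NONE;
}
```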
Buffer Pool memory initialization
The memory initialization of the Buffer Pool is mainly the initialization of the Buffer Chunks, and the Buffer Pool instances are initialized one by one. The core functions are buf_chunk_init and os_mem_alloc_large. Reading the code, you can see there are currently two ways to allocate memory from the operating system: HugeTLB and the traditional mmap.

HugeTLB is a large-page memory management technique. The default page size is 4KB; HugeTLB raises it to 2MB or more. Programs hand virtual addresses to the CPU, which must translate them into physical addresses through the TLB (translation lookaside buffer). The full page table is stored in memory, while the hot entries can be cached close to the CPU to speed up translation. Assume the cache for TLB entries is 100KB, each entry occupies 1KB, and the page size is 4KB; then 100 entries fit in the cache, covering 100 * 4KB = 400KB of memory. If the page size is 2MB instead, the same cache covers 100 * 2MB = 200MB. When a mapping is not found in the cache, it must first be fetched from memory into the cache, and only then can the required data be read, so physical memory is accessed twice. In other words, a large-page technique like HugeTLB improves the TLB hit ratio and therefore memory access performance. Of course, it is not a silver bullet: larger pages inevitably mean more internal fragmentation, and loading pages back from a swap partition also becomes slower. And, of course, 4KB pages have been the industry standard for years, so there is no need to take the risk unless there is a specific need. In InnoDB, to use this technique, start MySQL with the super-large-pages parameter.

mmap allocation: on Linux, mmap can be used to allocate memory shared between multiple processes, so it certainly works when only the single MySQL process uses it. Memory allocated with mmap is virtual memory: in the top command it shows up in the VIRT column, not the RES column; only when the memory is actually touched is it counted in RES. This improves apparent memory utilization, and it is why MySQL is often seen with a very large VIRT right after startup while RES grows slowly. One might wonder why malloc is not used: in fact, when the requested size is larger than MMAP_THRESHOLD (128KB by default), malloc itself calls mmap under the hood. InnoDB uses mmap for this allocation by default.

The buf_chunk_init function divides the allocated memory into two parts. The first part holds the data page control bodies (buf_block_t); in the Alibaba Cloud RDS MySQL 5.6 release each buf_block_t is 424 bytes, and there are innodb_buffer_pool_size / UNIV_PAGE_SIZE of them in total. The second part holds the actual data pages, laid out at UNIV_PAGE_SIZE intervals. Assuming a 16KB page size, the ratio of control body memory to data page memory is roughly 1:38.6, which means that if innodb_buffer_pool_size is set to 40GB, roughly an extra 1GB is needed to store the control bodies. Once the space is partitioned, the control bodies are iterated over, each buf_block_t::frame pointer is set to its data page, and those data pages are added to the Free List.
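The ratio quoted above can be reproduced with some back-of-the-envelope arithmetic; the snippet below is purely illustrative and assumes the 424-byte control body size mentioned for this build:

```cpp
// Back-of-the-envelope sketch of the chunk split described above.
// This is illustrative arithmetic, not the actual buf_chunk_init code.
#include <cstdio>

int main() {
    const double UNIV_PAGE_SIZE   = 16 * 1024;                  // 16KB data page
    const double BLOCK_CTRL_SIZE  = 424;                        // sizeof(buf_block_t), per the text
    const double buffer_pool_size = 40.0 * 1024 * 1024 * 1024;  // 40GB example

    double n_pages  = buffer_pool_size / UNIV_PAGE_SIZE;
    double ctrl_mem = n_pages * BLOCK_CTRL_SIZE;

    printf("control body : data page ratio = 1 : %.1f\n",
           UNIV_PAGE_SIZE / BLOCK_CTRL_SIZE);          // ~1 : 38.6
    printf("extra memory for control bodies = %.2f GB\n",
           ctrl_mem / (1024.0 * 1024 * 1024));         // ~1.04 GB for a 40GB pool
    return 0;
}
```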
After the Buffer Chunks are initialized, the data page control bodies of type BUF_BLOCK_POOL_WATCH, the page hash structure, and the zip hash structure (all data pages allocated through the compressed-page buddy system are added to this hash table) also need to be initialized. Note that this memory is allocated separately and is not part of the Buffer Chunks. Besides buf_pool_init, reading the buf_pool_free function is recommended for a further understanding of Buffer Pool memory.
buf_page_get function analysis
As the external interface that other modules use to obtain a data page, this function is extremely important. If the requested data page is already in the Buffer Pool, the corresponding information is updated and the page pointer is returned directly; if it is not, the page is first read from disk. buf_page_get is a macro; the real function is buf_page_get_gen, and its main parameters are space_id, page_no, lock_type, mode, and mtr. This section mainly introduces the mode parameter, which indicates the read mode. Six modes are currently supported; the first three are the most commonly used.

BUF_GET: the default way to obtain a data page. If the page is not in the Buffer Pool it is read from disk; if it is already there, the function decides whether to move it to the Young List and whether to perform linear read-ahead. A read lock is taken for reads and a write lock for modifications.

BUF_GET_IF_IN_POOL: the page is looked up only in the Buffer Pool; if found, the function decides whether to move it to the Young List and whether linear read-ahead is needed; if not found, NULL is returned. Locking is similar to BUF_GET.

BUF_PEEK_IF_IN_POOL: similar to BUF_GET_IF_IN_POOL, except that the page is not moved to the Young List and linear read-ahead is not performed even if the conditions are met. Locking is similar to BUF_GET.

BUF_GET_NO_LATCH: the page is not locked, whether it is being read or modified; otherwise similar to BUF_GET.

BUF_GET_IF_IN_POOL_OR_WATCH: the page is looked up only in the Buffer Pool; if found, the function decides whether to move it to the Young List and whether linear read-ahead is needed; if not found, a watch is set. Locking is similar to BUF_GET. This mode exists for the Purge thread.

BUF_GET_POSSIBLY_FREED: similar to BUF_GET, but the page is allowed to be freed while the function executes. It is mainly used when estimating the number of rows between two positions in a B-tree.

Next, let us briefly walk through the main logic of this function.
- First, buf_pool_get determines, from space_id and page_no, which Buffer Pool Instance the page belongs to. The algorithm is very simple: instance_no = ((space_id << 20) + space_id + (page_no >> 6)) % instance_num, i.e. a fold value is computed from space_id and page_no and taken modulo the number of instances (see the sketch after this list). There is a small detail here: the low 6 bits of page_no are dropped, which guarantees that the pages of one extent are cached in the same Buffer Pool Instance, making later read-ahead easier.
- Next, buf_page_hash_get_low looks up the page hash to see whether the page has already been loaded into that Buffer Pool Instance. If it has not and mode is BUF_GET_IF_IN_POOL_OR_WATCH, a watch is set on the page (buf_pool_watch_set).
- Next, if no page was found and mode is BUF_GET_IF_IN_POOL, BUF_PEEK_IF_IN_POOL, or BUF_GET_IF_IN_POOL_OR_WATCH, the function returns NULL, indicating that the page was not found. If no page was found but mode is something else, the page is read from disk synchronously (buf_read_page).
- Before reading from disk, if the page to be read is an uncompressed page, a free control block is first taken from the Free List; if the Free List has none, eviction is needed to release pages (some of the details here are analyzed later in the LRU module). After a free block is obtained, it is added to the LRU List (buf_page_init_for_read).
- Before reading from disk, if the page to be read is a compressed page, a temporary buf_page_t is allocated as the control body, space for the compressed data is allocated through the buddy system, and it is finally added to the LRU List (buf_page_init_for_read).
- Once this is done, the IO subsystem interface is called to read the page synchronously. If the read fails, it is retried up to 100 times (BUF_PAGE_READ_MAX_RETRIES) before an assertion is triggered; if it succeeds, the function decides whether random read-ahead is needed (the details of random read-ahead are analyzed in the preread/prewrite module).
- Then, after the data has been read successfully, we check whether the page read is a compressed page. If so, since the control body of a compressed page read from disk was temporarily allocated, a new block must be allocated (buf_LRU_get_free_block), the information in the temporary buf_page_t copied into it (buf_relocate), and the page decompressed; after decompression succeeds, the state is set to BUF_BLOCK_FILE_PAGE and the page is added to the Unzip LRU List.
- Next, we check whether the page is being accessed for the first time; if so, buf_page_t::access_time is set; if not, we check whether the page is on the Quick List, and if it is on the Quick List while the current transaction is not the one carrying the Hint, the page is removed from the Quick List, because it is being accessed by other statements and should no longer stay there.
- Then, if mode is not BUF_PEEK_IF_IN_POOL, we decide whether to move the page to the Young List, which will be analyzed later in the LRU module.
- Then, if mode is not BUF_GET_NO_LATCH, we add a read or write lock to the data page.
- Finally, if mode is not BUF_PEEK_IF_IN_POOL and the page is being accessed for the first time, we decide whether linear read-ahead is needed (the details of linear read-ahead are also analyzed in the preread/prewrite module).
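As promised in the first step above, here is a small sketch of the instance selection fold (simplified from the behaviour of buf_pool_get as described; not the exact source):

```cpp
// Sketch of the fold that maps (space_id, page_no) to a Buffer Pool Instance.
#include <cstdint>

uint64_t buf_pool_index_sketch(uint64_t space_id, uint64_t page_no,
                               uint64_t n_instances) {  // innodb_buffer_pool_instances
    // The low 6 bits of page_no are dropped, so all 64 pages of one extent
    // fold to the same value and land in the same instance, which keeps later
    // read-ahead within a single Buffer Pool Instance.
    uint64_t fold = (space_id << 20) + space_id + (page_no >> 6);
    return fold % n_instances;
}
```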
Maintenance of the Young List and Old List in the LRU List
When the LRU List is longer than 512 pages (BUF_LRU_OLD_MIN_LEN), it is logically divided into two parts: the front part stores the hottest data pages and is called the Young List, and the back part stores the colder pages and is called the Old List. Once the Free List has no more pages, the cold pages are evicted. The relative length of the two parts is controlled by the innodb_old_blocks_pct parameter, and the lengths of the Young and Old Lists are adjusted every time a page is added or evicted (buf_LRU_old_adjust_len); BUF_LRU_OLD_TOLERANCE is introduced to prevent the lists from being adjusted too often. When the LRU List is shorter than 512 pages, there is only an Old List. A newly read page is inserted at the head of the Old List by default; only if it is accessed again after innodb_old_blocks_time has elapsed is it moved to the head of the Young List. A page that is accessed many times within innodb_old_blocks_time of being read in and is then never accessed again is quickly evicted: the design regards such pages as unhealthy and banishes them. In addition, a page that is already on the Young List is not unconditionally moved to the head when it is accessed again: it is moved only if it currently sits beyond roughly the first 1/4 of the Young List. The purpose is to reduce modifications to the LRU List; modifying the list on every page access would be inefficient, and since the fundamental purpose of the LRU List is to guarantee that frequently accessed pages are not evicted, it is enough to keep the hot pages within a controllable range near the head. The related logic can be found in the function buf_page_peek_if_too_old.
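The promotion policy can be summarized roughly as follows; this is a simplified sketch of the rules described above, not the actual buf_page_peek_if_too_old implementation:

```cpp
// A simplified policy sketch of when a page is moved to the Young List head.
#include <cstdint>

struct page_lru_info {
    bool     in_old_list;          // currently on the Old List?
    uint64_t first_access_ms;      // time the page was first read in
    uint64_t distance_from_head;   // position within the Young List
};

bool should_move_to_young_head(const page_lru_info& p,
                               uint64_t now_ms,
                               uint64_t old_blocks_time_ms,   // innodb_old_blocks_time
                               uint64_t young_list_len) {
    if (p.in_old_list) {
        // Old List page: promote only if it survives innodb_old_blocks_time
        // and is then accessed again.
        return now_ms - p.first_access_ms > old_blocks_time_ms;
    }
    // Young List page: move to the head only if it has drifted beyond the
    // first quarter of the Young List, to avoid touching the list too often.
    return p.distance_from_head > young_list_len / 4;
}
```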
buf_LRU_get_free_block function analysis
This function and the functions it calls are arguably the most important in the whole LRU module, and they play a pivotal role in the whole Buffer Pool module. Once they are thoroughly understood, the other functions are easy to follow.
- First, if the request is issued by a statement using the ENGINE_NO_CACHE hint, a free data page is obtained from the Quick List (buf_quick_lru_get_free).
- Then, the lengths of the Free List and the LRU List are counted. If together they occupy too little of the Buffer Chunks, it means too much space is taken by internal structures such as row locks and the adaptive hash index, which is usually caused by large transactions, and a warning is issued.
- Next, check whether the Free List has any free data pages (buf_LRU_get_free_only); if so, the work is done, and in most cases this step already finds a free page. If the Free List has no free pages, try to evict data pages from the tail of the LRU List. If the system uses compressed pages, things get a bit more complicated: InnoDB calls buf_LRU_evict_from_unzip_LRU to decide whether to evict from the Unzip LRU List; if the Unzip LRU List is longer than 1/10 of the LRU List and the current load is IO-bound, only decompressed pages are evicted from the Unzip LRU List (the compressed pages are kept), otherwise decompressed and compressed pages are evicted from the LRU List together. Whichever path is taken, the eviction is ultimately performed by buf_LRU_free_page, a function that is extremely complex because it must handle every combination of compressed and decompressed pages. It first checks whether the page can be released at all; if it is still in use, it is not evicted. Otherwise, the page is removed from the LRU List and, if necessary, from the Unzip LRU List, and it is removed from the page hash (buf_LRU_block_remove_hashed). Then, if the compressed page is to be kept, a new compressed-page control body is created and inserted into the LRU List, and, if the compressed page is dirty, also into the Flush List. Finally, the removed block is inserted into the Free List (buf_LRU_block_free_hashed_page).
- If the previous step found no free data page, a single dirty page must be flushed from the tail of the LRU List (buf_flush_single_page_from_LRU). This is a single-page flush because buf_LRU_get_free_block is called by user threads, and even when there are many dirty pages the user thread must not be blocked by flushing too many of them.
- If the flush in the previous step could not be carried out, for example because the data page is being read by another thread, jump back to the second step above. The difference between iterations is that in the first iteration at most innodb_lru_scan_depth pages are scanned, while from the second iteration the entire LRU List is scanned. If, unfortunately, no free data page is found in that round either, then from the third iteration onward the thread waits 10ms before flushing again.
- When a free page is finally found, its state is BUF_BLOCK_READY_FOR_USE. A structural sketch of this loop follows.
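The sketch below shows the overall shape of this loop; the helper functions are placeholders, not real InnoDB signatures, so it only illustrates the control flow under the assumptions described above:

```cpp
// High-level sketch of the buf_LRU_get_free_block loop (pseudo-structure).
struct buf_pool_t {};
struct buf_block_t {};

// Placeholder helpers (stubs, for illustration only).
buf_block_t* take_from_free_list(buf_pool_t*)        { return nullptr; }
bool         evict_from_lru_tail(buf_pool_t*, bool)  { return false;   }
void         flush_single_page_from_lru(buf_pool_t*) {}
void         sleep_ms(int)                           {}

buf_block_t* get_free_block_sketch(buf_pool_t* buf_pool) {
    for (int n_iterations = 0; ; ++n_iterations) {
        // 1. Free List first: the common, fast path.
        if (buf_block_t* block = take_from_free_list(buf_pool)) {
            return block;   // caller sees it in the BUF_BLOCK_READY_FOR_USE state
        }
        // 2. Try to evict from the tail of the LRU List. The first iteration
        //    scans at most innodb_lru_scan_depth pages, later iterations scan
        //    the whole list.
        bool scan_whole_list = (n_iterations > 0);
        if (evict_from_lru_tail(buf_pool, scan_whole_list)) {
            continue;       // eviction freed a block, retry the Free List
        }
        // 3. Nothing evictable: flush a single dirty page from the LRU tail
        //    (single-page flush, so the user thread is not blocked for long).
        flush_single_page_from_lru(buf_pool);
        // 4. From the third iteration onward, back off before trying again.
        if (n_iterations > 1) {
            sleep_ms(10);
        }
    }
}
```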
Keeping full table scans from polluting the Buffer Pool
Full table scans have a big impact on the Buffer Pool: even with the Old List mechanism, the Old List by default still accounts for 3/8 of the Buffer Pool. Therefore Alibaba Cloud RDS introduces a new hint, ENGINE_NO_CACHE (for example, SELECT ENGINE_NO_CACHE count(*) FROM t1). If a SQL statement contains ENGINE_NO_CACHE, the data pages it reads into memory are put on the Quick List, and when the statement ends, the data pages it used exclusively are removed. Two parameters are introduced at the same time. innodb_rds_trx_own_block_max controls how many data pages each transaction using the hint can own; beyond that value the transaction starts evicting its own data pages, which prevents large transactions from occupying too many pages. innodb_rds_quick_lru_limit_per_instance controls the length of the Quick List of each Buffer Pool Instance; beyond that length, subsequent requests first evict a page from the Quick List and then obtain a free data page.
Delete all data pages of the specified tablespace
This function (buf_LRU_remove_pages) provides three modes. The first, BUF_REMOVE_ALL_NO_WRITE, deletes all data pages of the tablespace from the Buffer Pool (from both the LRU List and the Flush List) without writing them back to the data files. This is suitable for rename table and the tablespace-transfer feature new in 5.6, because the space_id may be reused, so everything in memory must be cleared to prevent reading wrong data in the future. The second, BUF_REMOVE_FLUSH_NO_WRITE, only deletes the pages on the Flush List and does not write them back to the data files. This is appropriate for drop table: the pages remaining on the LRU List will be evicted naturally over time. The third, BUF_REMOVE_FLUSH_WRITE, does not delete anything from any list and only flushes the dirty pages of the Flush List back to disk. This is appropriate for closing a tablespace, for example during a normal database shutdown. It is also worth mentioning a small optimization InnoDB makes: because modifying the logical lists requires holding the lock, and deleting all pages of a tablespace is a large operation that could starve other requests, the Buffer Pool Instance mutex is released once every BUF_LRU_DROP_SEARCH_SIZE pages (1024 by default) so that other threads can run.
LRU_Manager_Thread
This is a system thread started together with InnoDB. It periodically cleans up the tail of the LRU List (innodb_lru_scan_depth pages at a time) and adds the freed pages to the Free List, to prevent user threads from having to do synchronous flushes. At regular intervals, the thread evicts or flushes pages from the tail of the LRU List (buf_flush_LRU_tail) to replenish the Free List. The execution frequency of the thread is calculated with the following policy: let max_free_len = innodb_lru_scan_depth * innodb_buffer_pool_instances. If the number of pages on the Free List is less than 1% of max_free_len, the sleep time is zero, meaning free pages are far too few and buf_flush_LRU_tail must keep running to produce them. If the number is between 1% and 5% of max_free_len, the sleep time is reduced by 50ms (the default base being 1000ms). If the number is between 5% and 20% of max_free_len, the sleep time stays unchanged. If the number is greater than 20% of max_free_len, the sleep time is increased by 50ms, but never beyond rds_cleaner_max_lru_time. This is an adaptive algorithm (lru_manager_adapt_sleep_time) that ensures enough free data pages are available even under heavy pressure.
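A sketch of this adaptive sleep policy, assuming a 1000ms base and the thresholds quoted above (illustrative only; the real logic is in lru_manager_adapt_sleep_time):

```cpp
// Sketch of the LRU_Manager_Thread adaptive sleep policy.
#include <cstdint>

uint64_t adapt_lru_manager_sleep_ms(uint64_t free_list_len,
                                    uint64_t lru_scan_depth,   // innodb_lru_scan_depth
                                    uint64_t n_instances,      // innodb_buffer_pool_instances
                                    uint64_t cur_sleep_ms,     // default base: 1000ms
                                    uint64_t max_sleep_ms) {   // rds_cleaner_max_lru_time
    const uint64_t max_free_len = lru_scan_depth * n_instances;

    if (free_list_len < max_free_len / 100) {
        return 0;                                           // <1%: run constantly
    } else if (free_list_len < max_free_len * 5 / 100) {
        return cur_sleep_ms > 50 ? cur_sleep_ms - 50 : 0;   // 1%-5%: speed up
    } else if (free_list_len < max_free_len * 20 / 100) {
        return cur_sleep_ms;                                // 5%-20%: keep the pace
    }
    // >20%: plenty of free pages, slow down, capped at the maximum.
    return cur_sleep_ms + 50 > max_sleep_ms ? max_sleep_ms : cur_sleep_ms + 50;
}
```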
Hazard Pointer
Strictly speaking, a Hazard Pointer is a pointer that, once owned by a thread, cannot be modified by other threads until it is released. In InnoDB, however, the concept is used the other way around: the thread that owns it adjusts it to a valid value for other threads to use. Here it is used to speed up reverse traversal of the logical lists. Some background: InnoDB may have multiple threads flushing the Flush List at the same time, such as LRU_Manager_Thread and Page_Cleaner_Thread, and, in order to reduce lock hold time, InnoDB releases the Flush List lock while the page is being written to disk. The combination of these two points means that after flushing one page, a flushing thread has to go back to the tail of the Flush List and rescan for the next flushable dirty page. On top of that, page flushing is asynchronous: while a page is being flushed, it is io-fixed so that other threads do not operate on it. Now suppose the machine has a very slow mechanical disk and every page on the Flush List is currently flushable (buf_flush_ready_for_replace returns true). One flushing thread takes the last page at the tail, io-fixes it, and hands it to the IO thread, then goes back to the tail to scan for the next flushable dirty page. In this scan it finds that the last page (the one just handed to the IO thread) is io-fixed, so it cannot be flushed; it skips it and starts flushing the second-to-last page, again io-fixing it, handing it to the IO thread, and rescanning the Flush List. It then finds that neither of the two tail pages can be flushed (possibly still unfinished because the disk is slow), and so on until it reaches the third-to-last page. So, in the extreme case of a slow disk, the performance of the flushing algorithm degrades from O(N) to O(N^2). The fundamental way to fix this is not to rescan from the tail every time a page has been dealt with, and a Hazard Pointer solves it as follows: while traversing and finding a flushable page, and before releasing the lock, the thread adjusts the Hazard Pointer to point to the next node of the Flush List; this must be done while the lock is still held. It then releases the lock and flushes the page; afterwards it reacquires the lock, reads the Hazard Pointer, sets it to the following node, releases the lock, flushes that page, and so on. If another thread flushes pages in between, it uses the Hazard Pointer in the same way to get a reliable node and resets it to the next valid node. This mechanism guarantees that every read of the Hazard Pointer yields a valid Flush List node, so even with a slow disk the flushing algorithm stays O(N). The same approach can also be used in the LRU List eviction algorithm to make eviction more efficient. The corresponding patch first appeared in MySQL 5.7, and Alibaba Cloud RDS has ported it to our 5.6 version to keep flushing efficient under high concurrency.
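The following is a simplified sketch of how a flushing thread can use such a hazard pointer so that it never has to rescan the Flush List from the tail; the names and locking are placeholders, not the MySQL 5.7 implementation:

```cpp
// Sketch of a flush loop driven by a shared hazard pointer.
#include <mutex>

struct page_node {
    page_node* prev_in_flush_list;   // towards the head (more recent modifications)
};

void do_flush_io(page_node*) { /* placeholder for the actual disk write */ }

struct flush_list_t {
    std::mutex  mtx;
    page_node*  tail   = nullptr;    // oldest modification; flushing starts here
    page_node*  hazard = nullptr;    // next node a flusher should look at
};

void flush_loop_sketch(flush_list_t* list) {
    std::unique_lock<std::mutex> lock(list->mtx);
    list->hazard = list->tail;                    // start the shared cursor at the tail
    while (page_node* victim = list->hazard) {
        // Advance the hazard pointer to the next node BEFORE dropping the lock,
        // so any concurrent flusher always sees a valid place to continue from.
        list->hazard = victim->prev_in_flush_list;
        lock.unlock();

        do_flush_io(victim);                      // slow write happens without the lock

        lock.lock();                              // then re-read the hazard pointer
    }
}
```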
Page_Cleaner_Thread
This is also an InnoDB background thread; it is responsible for flushing the Flush List, so that user threads do not have to flush dirty pages themselves. Like LRU_Manager_Thread, it flushes at regular intervals, and its sleep time is also adaptive (page_cleaner_adapt_sleep_time), mainly influenced by three factors: the current LSN, the oldest_modification on the Flush List, and the current synchronous-flush point (log_sys->max_modified_age_sync, which depends on the size and number of redo logs). Simply put, the larger the gap between (LSN - oldest_modification) and the synchronous-flush point, the longer the sleep time, and vice versa. Adaptive sleep can be turned off with the rds_page_cleaner_adaptive_sleep variable, in which case the sleep time is fixed at one second. Unlike LRU_Manager_Thread, which always processes innodb_lru_scan_depth pages per round, Page_Cleaner_Thread computes the number of dirty pages to flush per round by itself, and the calculation is somewhat complicated (page_cleaner_flush_pages_if_needed). It depends on the rate at which dirty pages are produced, the rate at which redo log is generated, and several parameters. innodb_io_capacity and innodb_max_io_capacity bound the number of dirty pages flushed per second, the former as a soft limit and the latter as a hard limit. innodb_max_dirty_pages_pct and innodb_max_dirty_pages_pct_lwm control the dirty page ratio, i.e. at what proportion of dirty pages InnoDB needs to flush more aggressively. innodb_adaptive_flushing_lwm is the low water mark of redo usage beyond which LSN-driven adaptive flushing kicks in. innodb_flushing_avg_loops controls how quickly the computed flush rate reacts to changes in the workload: a larger value makes the flush rate adjust slowly and smoothly, while a smaller value lets it ramp up within a very short time. This variable exists to make the system run more smoothly, shaving peaks and filling valleys. The related functions are af_get_pct_for_dirty and af_get_pct_for_lsn.
Preread and prewrite
If a data page is read into the Buffer Pool, there is a high probability that the pages around it will be read soon as well, so it is better to read them all into memory in one go rather than in several separate reads, to reduce disk seek time. Official InnoDB has two kinds of read-ahead: random read-ahead and linear read-ahead.

Random read-ahead: this happens when a page is successfully read into the Buffer Pool (buf_read_ahead_random). If, within one extent (1MB, i.e. 64 consecutive pages when the page size is 16KB), there are more than a certain number of hot pages, all remaining pages of the extent are read into the Buffer Pool (traversed from low to high page_no). Two questions arise. First, how many is "a certain number"? By default, 13 pages. Second, what counts as a hot page? Reading the code shows that only pages within the first 1/4 of the Young List count as hot. The read requests are issued with OS_AIO_SIMULATED_WAKE_LATER and os_aio_simulated_wake_handler_threads is then called, to make IO merging easier. Random read-ahead can be switched on and off with the innodb_random_read_ahead parameter. In addition, the mode parameter of buf_page_get_gen does not affect random read-ahead.

Linear read-ahead: this happens only on a boundary page of an extent (the first or the last page) (buf_read_ahead_linear). If, within one extent, more than a certain number of pages (controlled by innodb_read_ahead_threshold, 56 by default) have been accessed sequentially (judged by whether the access times of the pages are in ascending or descending order), all pages of the next extent are read into the Buffer Pool. Asynchronous IO and the IO-merging strategy are still used for the reads. The trigger condition of linear read-ahead is strict: it requires a boundary page and strictly sequential access at the same time, which mainly targets the performance of full table scans. Linear read-ahead can be tuned or effectively disabled with the innodb_read_ahead_threshold parameter. In addition, linear read-ahead is not triggered when the mode of buf_page_get_gen is BUF_PEEK_IF_IN_POOL.

Besides read-ahead, InnoDB can also write ahead (buf_flush_try_neighbors) when flushing dirty pages: when a page is about to be written to disk, its neighboring pages are checked, and those that are dirty and flushable (not io-fixed and in the Old List) are flushed to disk together, to reduce disk seeks. This behavior is controlled by the innodb_flush_neighbors parameter; with today's SSDs it can be turned off.
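A simplified sketch of the linear read-ahead trigger described above (illustrative only; the real logic, including the exact sequentiality check, is in buf_read_ahead_linear):

```cpp
// Sketch of the linear read-ahead trigger condition.
#include <cstdint>

const uint32_t PAGES_PER_EXTENT = 64;   // 1MB extent with 16KB pages

// access_times[i] is the recorded access time of page i of the extent,
// 0 if the page has not been accessed; `threshold` plays the role of
// innodb_read_ahead_threshold (default 56).
bool should_trigger_linear_readahead(uint32_t page_no,
                                     const uint64_t access_times[PAGES_PER_EXTENT],
                                     uint32_t threshold) {
    // Only a boundary page of the extent can trigger read-ahead.
    uint32_t offset_in_extent = page_no % PAGES_PER_EXTENT;
    if (offset_in_extent != 0 && offset_in_extent != PAGES_PER_EXTENT - 1) {
        return false;
    }
    // Count pages of this extent whose access times form an ascending run
    // (the descending check for backward scans would be symmetric).
    uint32_t sequential = 0;
    uint64_t prev = 0;
    for (uint32_t i = 0; i < PAGES_PER_EXTENT; i++) {
        if (access_times[i] != 0 && access_times[i] >= prev) {
            sequential++;
            prev = access_times[i];
        }
    }
    return sequential >= threshold;
}
```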
Double Write Buffer (dblwr)
If a data page is only half written to disk (for example, the directory information inside the page is corrupted), it cannot be repaired by crash recovery alone: InnoDB's redo log is not a fully physical log (part of it is logical), so recovery cannot bring a broken page back to a consistent state and must first restore a complete copy of the page from the Double Write Buffer. The Double Write Buffer exists to solve this half-write problem; if the file system can guarantee that page writes are atomic, the feature can be turned off, in which case every write request goes directly to the corresponding tablespace. The Double Write Buffer is 2MB by default, i.e. 128 data pages, and consists of two parts: one for batch writes and one for single page writes. The former serves batch flushing, while the latter is reserved for single-page flushes initiated by user threads. The batch-write size can be controlled with the innodb_doublewrite_batch_size parameter; for example, if it is set to 120, 8 pages remain for single page writes. During batch flushing, dirty pages are first copied into the dblwr buffer (buf_dblwr_add_to_batch); when the buffer is full, it is written in one shot to its reserved location in the system tablespace, and then the pages are written asynchronously to their own tablespaces. Subsequent write requests still have to wait: the dblwr area in the system tablespace is reused only after all those asynchronous writes have completed successfully. The point of this is that if an asynchronous page write fails and a half-written page results, an intact copy of the page still exists in the system tablespace and can simply be copied back from there (buf_dblwr_init_or_load_pages). After an asynchronous IO request completes, the IO helper thread checks the integrity of the page and completes the change buffer operations, then calls buf_flush_write_complete to remove the page from the Flush List; when all pages of the batch have been written, the dblwr space is released.
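The batch-write path can be pictured roughly as follows; this is an illustrative pseudo-structure with placeholder IO helpers, not the real dblwr code:

```cpp
// Sketch of the batch doublewrite flow: stage pages, one sequential write to
// the system tablespace, then asynchronous writes to the real locations.
#include <cstdint>
#include <vector>

struct page_image { uint64_t space_id, page_no; /* 16KB payload omitted */ };

struct dblwr_batch_sketch {
    std::vector<page_image> slots;            // staged dirty pages
    size_t                  capacity = 120;   // example innodb_doublewrite_batch_size

    // Stage a dirty page; flush the whole batch once the buffer is full.
    void add_to_batch(const page_image& pg) {
        slots.push_back(pg);
        if (slots.size() == capacity) {
            flush_batch();
        }
    }

    void flush_batch() {
        // 1. One synchronous write of all staged pages into the reserved
        //    dblwr area of the system tablespace.
        write_to_system_tablespace_dblwr_area(slots);
        // 2. Asynchronous writes of each page to its real location.
        for (const page_image& pg : slots) {
            async_write_to_own_tablespace(pg);
        }
        // 3. Only after every async write completes may the dblwr area be
        //    reused; until then, further batches must wait.
        wait_for_all_async_writes();
        slots.clear();
    }

    // Placeholder IO helpers (not real InnoDB functions).
    static void write_to_system_tablespace_dblwr_area(const std::vector<page_image>&) {}
    static void async_write_to_own_tablespace(const page_image&) {}
    static void wait_for_all_async_writes() {}
};
```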
Buddy system
Similar to the buddy algorithm used in memory management, InnoDB's buddy system manages allocations of irregular sizes, mainly the data of compressed pages. As mentioned above, compressed pages in InnoDB can have five sizes: 16K, 8K, 4K, 2K, and 1K, and the compressed page size is a per-table setting, so many tables with different compressed page sizes may coexist; using a buddy system for allocation and reclamation keeps this efficient. The allocation function is buf_buddy_alloc. It first checks the zip free list of the requested size; if there is no free block of that size, it allocates from a larger list, causing a series of splits. For example, if a 4K block is needed, the 4K list is searched first and a block is returned directly if one exists; if not, the 8K list is searched. If the 8K list has a free block, it is split in two: the lower-address 4K is handed to the caller and the higher-address 4K is inserted into the 4K list for later use. If the 8K list is also empty, allocation falls back to 16K: the 16K block is first split into two 8K blocks, the higher-address one goes onto the 8K list, the lower-address 8K is split again into two 4K blocks, the lower-address 4K is returned to the caller, and the higher-address 4K goes onto the 4K list. If even the 16K list has no free blocks, buf_LRU_get_free_block is called to get a new data page; the page is added to the zip hash and its state is set to BUF_BLOCK_MEMORY, indicating that it stores compressed page data. The freeing function is buf_buddy_free, which is a bit more complicated than allocation. Suppose a 4K block is freed: it is first put onto the 4K list, and then its buddy is examined (if the freed block is the lower address, the buddy is the higher address, and vice versa). If the buddy has also been freed, the two are merged into an 8K block, whose buddy is then examined in turn, attempting a merge into a 16K block. If the buddy has not been freed, the function does not simply give up: it tries to relocate the buddy (buf_buddy_relocate). For example, for an 8K buddy that is still in use, it looks for another free 8K block on the 8K list and moves the buddy's contents there, which then allows the merge into a 16K block; if that is not possible, the merge is abandoned and the function returns. Relocation reduces memory fragmentation, but it involves memory copies and is therefore relatively expensive.
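A minimal sketch of the split-on-allocate behaviour described above (illustrative only, not the real buf_buddy_alloc; it uses malloc where InnoDB would take a whole page via buf_LRU_get_free_block):

```cpp
// Sketch of buddy-style allocation over the five zip free lists.
#include <cstddef>
#include <cstdlib>
#include <vector>

struct zip_free_lists_sketch {
    static const int N_SIZES = 5;                 // 1K, 2K, 4K, 8K, 16K
    std::vector<char*> free_list[N_SIZES];        // free fragments of each size

    static size_t bytes_of(int idx) { return size_t(1024) << idx; }

    char* alloc(int idx) {
        // Use a free fragment of the requested size if one exists.
        if (!free_list[idx].empty()) {
            char* blk = free_list[idx].back();
            free_list[idx].pop_back();
            return blk;
        }
        if (idx == N_SIZES - 1) {
            // No free 16K fragment: InnoDB would take a whole page via
            // buf_LRU_get_free_block and mark it BUF_BLOCK_MEMORY; here we malloc.
            return static_cast<char*>(std::malloc(bytes_of(idx)));
        }
        // Otherwise split a larger block: the low-address half is returned
        // (or split further), the high-address half goes onto this list.
        char* bigger = alloc(idx + 1);
        if (bigger == nullptr) return nullptr;
        free_list[idx].push_back(bigger + bytes_of(idx));   // high half kept
        return bigger;                                      // low half returned
    }
};
```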
Buffer Pool preheating
This is a new feature in official 5.6 that dumps the pages currently in the Buffer Pool to an external file by space_id and page_no, so that after a restart the Buffer Pool can be restored directly to its previous state. Buffer Pool dump: traverse the LRU List of every Buffer Pool Instance and, for each page, write a 64-bit number composed of space_id and page_no to the external file (buf_dump). Buffer Pool load: read the external file into memory, sort all entries with merge sort, merge IOs within ranges of 64 pages, and then issue the real read requests; the purpose of the sorting is to make IO merging easier (buf_load).
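A sketch of the 64-bit encoding and the sort that makes IO merging easy on load (illustrative; see buf_dump/buf_load for the real code):

```cpp
// Encode (space_id, page_no) into one 64-bit number so that sorting groups
// pages of the same tablespace together, in page order.
#include <algorithm>
#include <cstdint>
#include <vector>

inline uint64_t dump_encode(uint32_t space_id, uint32_t page_no) {
    return (uint64_t(space_id) << 32) | page_no;
}

inline void dump_decode(uint64_t v, uint32_t* space_id, uint32_t* page_no) {
    *space_id = uint32_t(v >> 32);
    *page_no  = uint32_t(v & 0xFFFFFFFFu);
}

void load_sorted(std::vector<uint64_t>& dumped) {
    std::sort(dumped.begin(), dumped.end());   // adjacent entries become adjacent pages
    // ... issue reads, merging requests that fall within the same 64-page range ...
}
```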
Conclusion
The basic idea of the InnoDB Buffer Pool, an LRU List plus a Flush List, may seem simple, but InnoDB has made many performance optimizations, such as reducing lock scope and speeding up page hash lookups, so the implementation details are fairly complex. With compressed pages added on top, some of the core code becomes obscure and requires repeated careful reading.
About me
Graduated from Xi'an Dongda Men's Technical School with a bachelor's degree.
For my master's I roamed Zhongguancun in the imperial capital, haunting the Rongke computer training school, spending my days monitoring for pornographic, violent, and reactionary images; to put it plainly, a censorship service, haha.
Since my family lives in Ningbo, Zhejiang, I joined Alibaba (in July 2016), in the Alibaba Cloud business unit, in the source code team of the cloud database ApsaraDB, mainly responsible for MySQL kernel development.
I love computers and programming, especially databases; I am familiar with MySQL, database fundamentals, database middleware, and so on.
I have also dabbled in high-performance server development and high-performance code optimization.
In addition, I love photography; I currently maintain a LOFTER photo-sharing page and am determined to become the best photographer among programmers ~
If you have any questions, you can leave a comment here or visit my Sina Weibo.