InnoDB has been MySQL's default storage engine since version 5.5. Compared with MyISAM, InnoDB supports transactions, and it is designed to use memory and CPU efficiently. After continuous optimization, the InnoDB engine's read and write performance improved further with MySQL 8.0. Understanding how the InnoDB engine works is very helpful for enterprise developers doing business architecture design. The InnoDB architecture is huge and complex, however, so I drew up a study plan for 2020. In the first stage, I take the InnoDB engine apart module by module, the way a skilled butcher carves an ox (basic modules). In the second stage, I sort out how these modules cooperate and summarize some tuning experience (how the modules work together). In the third stage, I will dig into the MySQL 8.0.20 source code to explore some of the InnoDB engine's new features and optimization details (exploring the source code of new features). I am writing these posts for two reasons. First, to plant a flag publicly and accept supervision (honestly, I'm afraid I won't stick with it otherwise~). Second, to keep study notes; if there are mistakes or missed key points, I hope readers will point them out in the comments.

This section focuses on the Buffer Pool. Before diving in, we briefly describe the structure of data pages and data records, and then give an overview of the InnoDB engine architecture based on the architecture diagram from the official website. The upcoming chapters on the basic modules will be organized around the modules shown in that diagram.

1. Basic structure of data records and data pages

In the InnoDB engine, the basic unit of data exchange between disk and memory is the data page. The default page size is 16KB, and a data page stores some basic information about the page as well as data records. Data records are stored on disk in a row format. Here is the basic structure of the data page and of a user record in the most common Compact row format:

A Compact user record is divided into two parts: the record's extra information and the real data. The extra information consists of the variable-length field length list, the NULL flag bits, and the record header. The record header uses 40 bits to record useful identifying information, and the real data part stores the business data as well as some hidden columns. The following table shows the information carried by those 40 bits; a small SQL example after the table makes the other pieces concrete. Don't be discouraged if the exact meanings are unclear for now, just get acquainted with the names; the useful bits will be explained later.

Name | Bits | Description
Reserved bit 1 | 1 | Not currently used
Reserved bit 2 | 1 | Not currently used
delete_mask | 1 | Flags whether the record has been deleted
min_rec_mask | 1 | Set on the smallest record in each level of the B+ tree's non-leaf nodes
n_owned | 4 | Number of records owned by the current slot
heap_no | 13 | Position of the current record in the page's record heap
record_type | 3 | Type of the current record: 0 = ordinary record, 1 = B+ tree non-leaf node record, 2 = minimum record, 3 = maximum record
next_record | 16 | Relative position of the next record
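
Here is the small SQL example promised above: a toy table (all names hypothetical) whose Compact records exercise the pieces just described, with a VARCHAR column contributing an entry to the variable-length field length list and a nullable column contributing a NULL flag bit:

-- A toy table stored in the Compact row format.
CREATE TABLE rec_demo (
  id INT PRIMARY KEY,
  c1 VARCHAR(10) NOT NULL,  -- length recorded in the variable-length field list
  c2 CHAR(10) NULL          -- nullability tracked by the NULL flag bits
) ENGINE=InnoDB ROW_FORMAT=COMPACT;

-- Confirm the row format InnoDB actually uses for the table (MySQL 8.0).
SELECT NAME, ROW_FORMAT
FROM information_schema.INNODB_TABLES
WHERE NAME LIKE '%rec_demo%';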

Turning to the data page itself, the following table describes each part of its basic structure:

Name | Size | Description
File Header | 38 bytes | General information about the page
Page Header | 56 bytes | Information specific to data pages
Infimum + Supremum | 26 bytes | Two virtual row records: the minimum and maximum records
User Records | variable | The row records actually stored
Free Space | variable | Unused space within the page
Page Directory | variable | Relative positions of some records within the page
File Trailer | 8 bytes | Used to verify page integrity

The space occupied by the File Header, Page Header, Infimum + Supremum records, and File Trailer is fixed. The total space taken by User Records, Free Space, and the Page Directory is fixed too, but each of the three parts changes dynamically, like a spring: as records are added to the User Records part, the Free Space part is squeezed, and the information recorded in the Page Directory changes accordingly. Every data page has two fixed virtual records, the minimum and maximum records. The user records in the User Records part are chained one after another into a linked list, using next_record in the record header as a logical pointer, and each data page can hold many user records.

The File Header stores some general information about the page. Several of its fields matter for this article:

Name | Size | Description
FIL_PAGE_SPACE_OR_CHKSUM | 4 bytes | Page checksum
FIL_PAGE_OFFSET | 4 bytes | Page number
FIL_PAGE_PREV | 4 bytes | Page number of the previous page
FIL_PAGE_NEXT | 4 bytes | Page number of the next page
FIL_PAGE_LSN | 8 bytes | The Log Sequence Number (LSN) at which the page was last modified
FIL_PAGE_TYPE | 2 bytes | The type of the page
FIL_PAGE_FILE_FLUSH_LSN | 8 bytes | Defined only on one page of the system tablespace; indicates that the file has been flushed to at least this LSN
FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID | 4 bytes | The tablespace the page belongs to

The InnoDB engine uses a B+ tree structure to store data. The leaf nodes of the tree are the data pages holding user records, while the non-leaf nodes hold directory entries for those data pages (directory pages). The forward and backward pointers of the doubly linked page list are the FIL_PAGE_NEXT and FIL_PAGE_PREV fields. Here is a B+ tree structure diagram of a clustered index:

Every page has a page number, which is stored in FIL_PAGE_OFFSET. FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID records which tablespace the data page belongs to; within a database, tablespace number + page number uniquely identifies a data page. The InnoDB engine has many kinds of pages: besides the data pages that store user records there are index pages, undo pages, inode pages, system pages, BLOB pages, and so on, and the page type is marked in FIL_PAGE_TYPE. FIL_PAGE_SPACE_OR_CHKSUM is used to verify the integrity of the page. FIL_PAGE_FILE_FLUSH_LSN can be ignored for now. But FIL_PAGE_LSN is very, very important!! It holds the Log Sequence Number and will be explained in a later article.
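
As an aside, the tablespace numbers recorded in FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID are visible from SQL; in MySQL 8.0, information_schema.INNODB_TABLESPACES maps each space ID to a tablespace name:

-- List a few tablespaces and their space IDs.
SELECT SPACE, NAME
FROM information_schema.INNODB_TABLESPACES
ORDER BY SPACE
LIMIT 5;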

With a basic understanding of data records and data pages in hand, I'll leave the reader a question to ponder: why does the InnoDB engine use the data page as the basic unit of interaction between disk and memory?

I'm sure readers already have their own answers, but I hope you understand the concept of locality, because it helps explain this question.

The principle of locality comes in two forms: temporal locality and spatial locality. Temporal locality means that if a piece of data is accessed, it is likely to be accessed again in the near future; spatial locality means that once a program accesses a storage location, nearby storage locations are likely to be accessed soon as well.

2. Overview of Innodb engine architecture

InnoDB's architecture is complex, but don't be intimidated by it. Once we break it down into modules, conquer them one by one, and figure out how they work together, we'll feel quite differently about it 💢 (this is my honest feeling).

Here is the overall architecture of Innodb from the official website:

From the architecture diagram we can see that the InnoDB architecture divides into in-memory structures and on-disk structures. The memory architecture is relatively simple, consisting of the Buffer Pool and the Log Buffer. The disk architecture is slightly more complex: log files, undo log files, system tablespace files, general tablespace files, as well as temporary tablespaces, the doublewrite buffer, and so on.

The InnoDB engine uses data pages to move data between disk and memory: the CPU operates directly on data in memory, while MySQL persists user data in disk files under the tablespaces. Loading a page from disk for every single read or write would be far too expensive, so the InnoDB engine sets aside a large region of memory to cache the data pages from disk. This is the Buffer Pool, which we now introduce.

3. Buffer Pool architecture

Earlier we mentioned that in the InnoDB engine's disk structure, data pages sit on the leaf level of the B+ tree and are managed as a doubly linked list, which makes range scans by condition convenient. Once loaded into the Buffer Pool, however, the pages are scattered about in memory.

3.1 Internal Composition of a Buffer Pool

The InnoDB engine creates a control block for each cached page. The control block holds the cached page's tablespace number, page number, address within the Buffer Pool, linked list node information, lock information, and LSN information (LSN? That's the Log Sequence Number we met when describing the data page; the same information lives in the control block, so keep an eye on this concept, and I'll cover it in a later chapter). The internal layout of the Buffer Pool looks roughly like this:

The control blocks of the cached pages are placed at the front of the Buffer Pool and the cached pages at the back, in one-to-one correspondence. When control blocks and cached pages fill up the Buffer Pool, the leftover memory may be too small to hold the next cached page, leaving an unusable fragment.

3.2 The free list in the Buffer Pool

When the MySQL server starts up, the InnoDB engine initializes the Buffer Pool: it first requests the required memory space from the operating system and then divides it into many control blocks and cached pages. Since no data pages are actually cached in the Buffer Pool at this point, the InnoDB engine maintains a free list to keep track of which cached pages are available:

The InnoDB engine defines a dedicated base node for this list, which records the head node, the tail node, and the number of nodes in the list.

With the free list in place, whenever a page needs to be loaded from disk into the Buffer Pool, InnoDB takes a free cached page from the free list, fills its control block with the page's meta information (as mentioned earlier: tablespace number, page number, the cached page's address in the Buffer Pool, linked list node information, lock information, LSN, and so on), and removes the control block from the free list to indicate that the cached page is now in use.
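
Although the free list itself lives in memory and cannot be queried directly, the server does report its current length; if you want to watch it shrink as pages get used, a status counter tracks the number of free cached pages:

-- Current length of the free list, in pages.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';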

3.3 Buffer Pool parameters and the initialization process

The following parameters control how the Buffer Pool is created and sized:

  • innodb_buffer_pool_size: specifies the total size of the Buffer Pool
  • innodb_buffer_pool_instances: number of Buffer Pool instances
  • innodb_buffer_pool_chunk_size: specifies the size of a chunk in the Buffer Pool; the default is 128MB

The Buffer Pool is divided into instances, with the number controlled by innodb_buffer_pool_instances, while innodb_buffer_pool_size specifies the total size of the Buffer Pool. The actual size of each Buffer Pool instance is the total size divided by the number of instances:

innodb_buffer_pool_size/innodb_buffer_pool_instances
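
You can check this arithmetic on a live server; system variables are readable through @@, so the per-instance size is one SELECT away:

-- Total pool size, instance count, and the resulting per-instance size.
SELECT @@innodb_buffer_pool_size AS total_bytes,
       @@innodb_buffer_pool_instances AS instances,
       @@innodb_buffer_pool_size / @@innodb_buffer_pool_instances AS bytes_per_instance;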

However, creating as many Buffer Pool instances as possible is not always best: managing the instances itself carries extra overhead. If innodb_buffer_pool_size is less than 1GB, configuring multiple instances has no effect; the InnoDB engine silently resets innodb_buffer_pool_instances to 1. When the Buffer Pool is 1GB or larger, multiple Buffer Pool instances are worth using.

Prior to MySQL 5.7.5, the Buffer Pool size could only be set at server startup through the innodb_buffer_pool_size startup parameter and could not be changed while the server was running. Starting with MySQL 5.7.5, the InnoDB engine supports resizing the Buffer Pool at runtime. This is great, but it would be extremely time-consuming if every resize meant requesting a new contiguous memory area from the operating system and then copying the contents of the old Buffer Pool into the new space. Therefore, since MySQL 5.7.5, the InnoDB engine no longer requests one large contiguous memory area up front when initializing the Buffer Pool; it introduces the chunk structure and requests memory from the operating system one chunk at a time. A Buffer Pool instance thus consists of several chunks, which store the control blocks and cached pages mentioned above as well as the linked list information that manages those cached pages.
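
A minimal online-resize sketch (the 4GB target below is an arbitrary example; the resize runs in the background, chunk by chunk, and reports its progress through a status variable):

-- Resize the Buffer Pool while the server is running (MySQL >= 5.7.5).
SET GLOBAL innodb_buffer_pool_size = 4294967296;  -- 4GB

-- Watch the progress of the background resize operation.
SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';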

The size of a chunk in the Buffer Pool can be customized; the default is 128MB. The memory allocated for the whole Buffer Pool then follows this formula, where n is the number of chunks per instance:

innodb_buffer_pool_size = n * innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances

The value of n is worked out during Buffer Pool initialization. If the configured size does not yield an integer n, the InnoDB engine automatically adjusts innodb_buffer_pool_size to a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances.
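
A worked example of this adjustment, assuming for illustration a server with 16 instances and the default 128MB chunk: the sizing unit is 128MB × 16 = 2GB, so a requested 9GB pool is rounded up to 10GB. You can verify the unit and the effective size on your own server:

-- The sizing unit and the effective (possibly adjusted) pool size.
SELECT @@innodb_buffer_pool_chunk_size * @@innodb_buffer_pool_instances AS sizing_unit_bytes,
       @@innodb_buffer_pool_size AS effective_size_bytes;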

Let's briefly walk through the Buffer Pool initialization process. As described above, the Buffer Pool holds cached pages, and each cached page corresponds to a control block; the Buffer Pool uses buf_block_t as its basic unit:

// code path (mysql8.0.20): storage/innobase/include/buf0buf.h
struct buf_block_t {
  buf_page_t page; /* page meta information: id, size, etc. */
  byte *frame;     /* the page data */
};

The Buffer Pool is initialized with the buf_pool_init() method as the entry point; each instance is then initialized by the buf_pool_create() method, which initializes the linked lists, creates the associated mutexes, creates the chunks, and so on. The buf_chunk_init() method then performs the actual initialization, which breaks down into several steps:

  • Chunk initialization: memory is allocated with the os_mem_alloc_large() function
  • buf_block_t initialization: each block's block->frame points to its data page, and block->page holds the page management information
  • Each block's page structure, block->page, is added to the buf_pool->free list

The initialization process in the code above is consistent with our earlier analysis; the only difference after introducing multiple instances and the chunk structure is the manner and granularity of memory allocation, which changed from requesting one large memory area to requesting memory from the operating system chunk by chunk.

3.4 The cached page hash table

The free list we described earlier manages the cached pages that are not yet in use in the Buffer Pool; when a data page is loaded from disk into the Buffer Pool, the chosen cached page is removed from the free list. Since the page now lives in the Buffer Pool, the next time we need its data we can fetch it straight from memory. The question is: how do we know whether a given data page is in the Buffer Pool?

The File Header of each data page contains its tablespace number and page number, and tablespace number + page number uniquely identifies a data page in the database. So we can build a hash table with tablespace number + page number as the key and the cached page as the value. When we need to access a page, we first look up tablespace number + page number in the hash table: if a cached page is found, we use it directly; otherwise we load the corresponding data page from disk.
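
Although the hash table itself is internal, MySQL lets you ask whether a particular page is currently cached: information_schema.INNODB_BUFFER_PAGE exposes the control blocks, keyed by the same tablespace number + page number. A quick sketch (the SPACE and PAGE_NUMBER values below are placeholders, and note that scanning this table is expensive on busy servers):

-- Is page 3 of tablespace 2 currently in the Buffer Pool?
SELECT POOL_ID, SPACE, PAGE_NUMBER, PAGE_TYPE
FROM information_schema.INNODB_BUFFER_PAGE
WHERE SPACE = 2 AND PAGE_NUMBER = 3;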

The cached page hash table is easy to understand, so let's look at another important list in the Buffer Pool: the flush list.

3.5 The flush list in the Buffer Pool

We frequently update data records with DML statements (inserts, deletes, updates). Once we modify data in the Buffer Pool, it no longer matches the data on disk; such pages are called dirty pages. To keep memory and disk consistent, the simplest approach would be to synchronize every modification to the corresponding disk page immediately. But the InnoDB engine is famous for its performance; it would never do that! Dirty pages must be synchronized to disk sooner or later, but InnoDB is in no hurry to do it on every change. When and how that synchronization happens is not our concern for now and will be covered in detail later. Instead, consider this question: if dirty pages are not flushed to disk immediately, how do we know, when the time comes, which pages are dirty and which were never changed? InnoDB maintains another linked list for this, the flush list: a cached page's control block is added to it when the page is modified, and removed from it once the dirty page has been flushed to disk. Its structure is similar to the free list's.
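
As with the free list, you cannot inspect the flush list itself, but its length is visible as the count of dirty pages (and the equivalent in bytes):

-- Number of dirty pages currently awaiting flush.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';

-- The same figure in bytes.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_bytes_dirty';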

3.6 LRU list management in the Buffer Pool

The size of the Buffer Pool is limited while user data keeps growing, so there will come a time when the free list cannot produce a usable cached page. So what should the InnoDB engine do then? The answer is simple: it has to evict some cached pages from the Buffer Pool.

The question is which cached pages to evict. Could we flush some dirty pages from the flush list to disk and reuse those? That can work, but the reader should not assume the flush list has contents just because the free list has no cached pages left. When we query a large table, all of its pages may be loaded into the Buffer Pool and use up the memory, yet we only read the data: no cached page was modified, and the flush list is empty.

You can't always go on gut feeling; you need some theoretical basis. Remember the locality principle we discussed earlier? Temporal locality says that if data is accessed, it is likely to be accessed again shortly afterwards. For this reason, many pieces of software use the LRU (Least Recently Used) algorithm to manage data eviction, as readers familiar with Redis will know.

The InnoDB engine likewise creates its own LRU list in the Buffer Pool, designed to maximize the Buffer Pool's cache hit ratio. As with the free list, a base node is created for the LRU list. When accessing a page, we would do the following:

  • If the data page is not in the Buffer Pool, the cached page's control block is inserted as a node at the head of the linked list when the page is loaded from disk into the Buffer Pool.

  • If the page is already cached in the Buffer Pool, its control block is moved directly to the head of the LRU list.

Our imagined LRU list would look like this:

If you really think that's how the InnoDB engine implements it, think again! Take a moment to examine our imagined LRU list, and you'll find plenty of problems. In the example above, we ran a full table scan on a large table, and a flood of new cached pages entered the Buffer Pool, pushing out the hot cached pages that were there before (hot pages being the ones used most recently), even though such a large scan rarely happens. As a result, the Buffer Pool's hit ratio drops and takes a long time to recover. The InnoDB engine took this problem into account when it designed the LRU list. Let's look at how the InnoDB LRU list is actually implemented:

The InnoDB engine splits the LRU list into two parts by ratio: the new sublist at the head, which stores hot cached pages, followed by the old sublist at the tail, which stores cold cached pages. The split between the new sublist and the old sublist can be adjusted with the following parameter:

mysql> show variables like 'innodb_old_blocks_pct';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_old_blocks_pct | 37    |
+-----------------------+-------+
1 row in set (0.01 sec)

By default, the old sublist takes 37% of the LRU list, i.e. about 3/8. A cached page's control block records the time the page was first loaded into the Buffer Pool. On a subsequent access, if the access time falls within a certain interval of that first access, the page is not moved from the old section to the head of the new section; otherwise it is moved to the head of the new sublist.

The interval described above can also be configured with a parameter:

mysql> show variables like 'innodb_old_blocks_time';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| innodb_old_blocks_time | 1000  |
+------------------------+-------+
1 row in set (0.00 sec)

The default value of innodb_old_blocks_time is 1000, in milliseconds. It means that a page loaded from disk into the old section of the LRU list will not be moved to the head of the new sublist if the interval between its first access and a later access is less than 1s. We can also set innodb_old_blocks_time to 0, in which case every access to a cached page moves it to the head of the LRU's new sublist.
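
The old/new split can be observed from SQL as well: information_schema.INNODB_BUFFER_PAGE_LRU lists pages in LRU order, and its IS_OLD column says which sublist each page currently sits in. A rough way to see the proportions (again, this scan is heavyweight, so treat it as a diagnostic, not something to run routinely):

-- Count pages in the new sublist (IS_OLD = 'NO') vs the old sublist (IS_OLD = 'YES').
SELECT IS_OLD, COUNT(*) AS pages
FROM information_schema.INNODB_BUFFER_PAGE_LRU
GROUP BY IS_OLD;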

The InnoDB engine's LRU list design has many other considerations, such as the read-ahead (prefetch) strategy, and there are further optimization strategies for improving the Buffer Pool's cache hit ratio that are not described here. Interested readers can study them further.
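
In the end, all of these strategies serve the hit ratio, and you can gauge it from two status counters: Innodb_buffer_pool_read_requests (logical reads) versus Innodb_buffer_pool_reads (reads that had to go to disk). A rough estimate, assuming a warmed-up server:

-- Logical reads vs reads satisfied from disk.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

-- Hit ratio ≈ 1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests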

3.7 Other data structures of the Buffer Pool

The Buffer Pool has other data structures designed to better manage cached pages in memory. For example, zip_free is an array of pointers that forms a buddy system supplying memory for compressed pages; the essence of a buddy system is to merge and split adjacent memory blocks in powers of two, achieving efficient management with low code complexity. The array has four levels by block size, 1024, 2048, 4096, and 8192 bytes, and each level's base node manages only blocks of that size. There are also the unzip_LRU list, the zip_clean list, and others. It doesn't matter if this description is unclear; it is not the point of this article. Just know that the Buffer Pool creates these structures to manage cached pages better.

Through the preceding sections, I believe readers now have a comprehensive picture of the Buffer Pool architecture, which will be very helpful when we analyze the working principles of the InnoDB engine later; we will refer back to it in later chapters.

4. Summary and preview

This section gave a brief introduction to the structure of data pages and data records, then outlined the architecture of the InnoDB engine; the upcoming sections on the basic modules will follow that outline. The bulk of this section summarized the structure of the Buffer Pool and analyzed the design of several important linked lists. Attentive readers may have noticed that the InnoDB architecture diagram shows a part of the Buffer Pool called the Change Buffer, which this section has not touched, precisely because I consider it very important: I don't want this article to run too long, so the next one will focus on it.