Innodb engine overview
Corresponding to “MySQL Technology Insider InnoDB Storage Engine Volume 2” chapter 2
The InnoDB engine contains background threads, memory modules, and files on the hard disk, as shown below:
1. Background threads
Background threads are classified into IOThreads, MasterThreads, PurgeThreads, and PagecleanerThreads.
1.1 IOThread
InnoDB makes heavy use of AIO and IOThread handles callbacks to these requests. Iothreads are divided into four types: Read, Write, InsertBuffer, and logIO. Read and Write are responsible for reading and writing respectively. After InnoDB version 1.0.x, there are 4 and 1 respectively before InnoDB.
The id of the reader thread is smaller than that of the writer thread.
1.2 MasterThread
MasterThread is an important background thread. It refreshes dirty pages, merges insert buffers, cleans undo logs, and writes redo logs in two cycles to ensure data consistency. There are two kinds of cycles, which are cycles per second and cycles every 10 seconds. The flow is as follows (since the MD syntax of nuggets does not seem to support flow charts, I will directly use words to express it. In addition, this flow is written according to innoDB version 1.0.x before, and later optimized. Innodb cycles every 10 seconds only if it is idle, otherwise cycles every second) :
Repeat every second: write redo log-> merge insert buffer -> refresh dirty pages -> Undo log cleanup, merge insert buffer.
Innodb_max_dirty_pages_pct, innodb_max_dirty_pages_pct, innodb_max_dirty_pages_pct, innodb_max_dirty_pages_pct, innodb_max_dirty_pages_pct, innodb_max_dirty_pages_pct If the value is greater than that, the dirty page is refreshed
Every 10 seconds: Dirty page refresh -> Merge insert buffer -> write redo log-> Undo log cleanup -> Dirty page refresh
Redo log: Redo log. Transactions are written to the redo log buffer in memory and then to disk redo logs. Redo logs are stored in blocks of 512 bytes, the same as disk sectors, ensuring atomicity of writes and therefore not requiring double write protection.
Undo log: Undo log records the “reverse operation” of the data operation in the transaction. When the transaction is rolled back, undo log can be used to roll back. At the same time, when InnoDB is MVCC, undo log can be used to ensure the isolation of data and achieve consistent non-locked read. Insert data rollback is very simple, because there will be no other transactions refer to current data insert, so delete it is ok to insert data directly, but not be the same update, and delete, update is divided into two kinds, if the update is a primary key, actually perform a delete key corresponding record, and then insert a new record, if the update is not the primary key, In this case, you need to check whether there are other transactions in the current Undo log (in RR or RC transaction isolation mechanism, you need to find different versions of data through undo log). If there are any transactions, you cannot delete them; if there are no transactions, you can delete them. This also involves undo page recycling, which will be mentioned in the next few articles.
1.3 PurgeThread
This thread is used to clear undo logs of the MasterThread to share the workload of the MasterThread.
1.4 PageCleanerThread
This thread is used to refresh dirty pages in MasterThread and share the workload of MasterThread.
2. Allocate memory
InnoDB memory structure is as follows:
As an added bonus, InnoDB has shared table Spaces. If independent table Spaces are enabled, each table has its own table space. Independent table Spaces hold data, indexes, Insert buffer bitmaps, etc.
2.1 LRU List&Flush List&Free List
InnoDB manages memory in the form of pages, just as the operating system does with pages. All pages are mounted in the LRU List. This List, as the name implies, is the least recently used List. A free page is fetched from a FreeList and then allocated. When the FreeList cannot allocate memory for a new page, the pages at the end of the LRU are allocated to the new page. In order to prevent frequent hot pages from being discarded and read (such as full table scans), InnoDB uses MIDPoint technology, which places newly mounted pages in the LRU List not at the head of the List, but at midpoint (determined by Innodb_old_blocks_pct, such as 37, Means that the new page is placed at 37% of the end, and the portion following midpoint is the old list, and the previous is the new list.
Innodb_old_block_time innodb_old_block_time innodb_old_block_time innodb_old_block_time innodb_old_block_time innodb_old_block_time innodb_old_block_time innodb_old_block_time innodb_old_block_time If innodb_olD_block_time exceeds innodb_olD_block_time when the next page is read, the page is considered as hot data and placed in the new list. Here’s an example: When innodb_old_block_time=1s, one of the pages is read. Because it met the requirement, it is put into the new list. At this time, a large number of pages are preread, the remaining 9 pages are discarded due to the memory is not enough. This means that the unnecessary data does not occupy the memory, saving the memory overhead. At the same time, because the hotspot data is in the new list, it ensures that the hotspot data does not need to be read and eliminated frequently, improving the efficiency.
Innodb page size is 16KB by default, but it can also be compressed, with 2, 4, 8KB and so on. These three pages are corresponding to three unzip_LRU lists. Here is how to allocate different size pages:
When the LRU List page is modified, the page becomes dirty (memory data is inconsistent with disk data), and the page is mounted to the Flush List (note that the page exists in both the LRU List and Flush List). The Master Thread or PageCleanerThread flushed dirty pages using the Flush List.
2.2 Redo Logs
Redo logs, which I’ve just outlined, are primarily used to record the data changes that occur per second in transactions. Note that redo logs are physical logs, not logical changes, and are different from binary logs, which will be covered in the next few articles. Redo logs have redo log buffers in memory. Redo log buffer data is flushed to redo log files under the following three conditions:
- MasterThread writes redo logs every second and every 10 seconds.
- There is less than half of the redo log buffer.
- When the transaction commits.
Checkpoint technology
When the data of a page in the LRU List is modified, the page becomes a dirty page (mounted to the Flush List), and the dirty page also needs to be flushed to disk (otherwise, it takes a long time to recover the data using redo logs if the database goes down).
Note that dirty pages with dirty read, there is a difference between, dirty pages are revised data in memory, but the disk data is not modified, is in conformity with the data consistency (anyway by innodb engine can read out the correct data), dirty pages refers to data in RU isolation mode, a transaction read not commit another transaction data, A phantom read is when a transaction reads data, and another transaction commits some new data, which then reads the newly committed data.
Checkpoint is used to solve the following problems:
- Shorten the database recovery time
- Flush dirty pages to disk when the buffer pool is insufficient
- Flush dirty pages to disk when redo logs are not available
There are two types of checkpoint:
- Sharp checkpoint: Refreshes all dirty pages.
- Fuzzy checkpoint: Only some dirty pages are flushed.
Because sharp Checkpoint can cause poor database availability when running, fuzzy checkpoint refresh mechanism is introduced as follows:
- Master thread checkpoint: the dirty page refreshes performed by the master thread every second and every 10 seconds.
- Flush LRU List checkpoint: InnoDB engine needs to ensure that the LRU List has more than 100 free pages available. If not, Flush through the Flush List (because the pages in the Flush List also exist as LRU List).
- Async/Sync Flush Checkpoint: Because redo logs store user operations on data, dirty pages in memory that are not flushed to disk for various reasons (such as downtime) can be recovered using redo logs. Therefore, unflushed redo logs cannot be overwritten until dirty pages are flushed to disk. The redo log of a page flushed to disk can be considered largely useless, so it can be overwritten to other redo logs. If redo logs are unavailable, dirty pages in memory must be flushed to disk.
- Dirty Pages too much checkpoint: The proportion of Dirty pages in the buffer pool (controlled by innodb_max_dirty_pages_pct) is too high, some of them will be flushed to disk.
Write Ahead Log (WAL) is commonly used in databases, where logs precede data falls (ext4 file systems also support logs precede data falls).
3 InnoDB key features
3.1 Insert Buffer
Insert buffer innnoDB insert buffer innNODB insert buffer innNODB insert buffer innNODB insert buffer innNODB Since the secondary index insertion may not be continuous, it may be necessary to read different index pages discreetively and then modify the index page when modifying the index page. In this way, the I/O times are large and the efficiency is low. Insert Buffer is a good way to solve this problem. If you need to modify a secondary index page, if the index page is in memory, you can modify it directly. If it is not in memory, you can put it into the Insert Buffer. Insert into a page together (refers to merging of insert buffers that can fit into an index page).
The insert buffer is for non-unique secondary indexes. Imagine innoDB traversing all index pages to determine the uniqueness of secondary indexes before inserting them into the insert buffer
The insert buffer is a B+ tree and is located in a shared table space (sometimes restoring a separate table space fails because the insert buffer for secondary indexes is in the shared table space). Note that the insert buffer for all table Spaces is in a B+ tree. In order to keep track of whether each index page has room to Insert an index, we need an auxiliary structure called the Insert Buffer Bitmap. Merge insert buffer timing:
- MasterThread requires merge insert buffers every second and every 10 seconds.
- When the secondary index page is traced to no extra space through the Insert Buffer Bitmap, the merge Insert Buffer is performed. If the secondary index page is less than 1/32 (InnoDB requires the secondary index page to have at least 1/32 free space), then a merge insert buffer is enforced to insert the merged result along with the index to be inserted.
- After the secondary index page is read into the buffer pool, the merge insert buffer is performed.
3.2 write twice
Double Write gives InnoDB reliability because data may fail when it is written from memory to disk, and incomplete data may be written, even if redo log is used to restore data. Because redo log restores data directly to physical page, if the physical page itself fails, There is no way to restore the redo log. Therefore, a copy of the page is required so that data can be restored with two writes.
- Dirty pages are copied to the doublewrite buffer.
- Dirty pages are written twice, 1MB at a time, to a contiguous 2MB space in the shared table space (on disk). Because 2MB in a shared table space is contiguous, contiguous writes are inexpensive.
- After writing to the shared table space, the 2MB pages are written to each table space separately.
3.3 Adaptive hash index
Adaptive Hash Index (AHI), because the Index uses B+ tree, the height is usually 3-4 layers, one query needs to be queried three to four times on disk, so InnoDB can effectively improve the query efficiency by creating Hash Index on hot pages (hot data) in memory. The establishment conditions are:
- The query conditions are the same.
- The number of consecutive queries exceeds 100.
- Not a range query, but an equivalent query.
- Page is accessed through this mode N times, N= records on the page *1/16.
The key value of the hash index is obtained from space_id<<20+space_id+offset. For hash collisions, the linked list method is used, and for every page with the same hash value, there is a pointer to the page with the same hash value.
3.4 asynchronous I/o
Refer to AIO, for example: Index scanning can be divided into multiple I/O requests. You can send one I/O request and immediately send another one. There is no need to wait for one to finish before starting the next one, which improves efficiency.
3.5 Refresh adjacent pages
When InnoDB refreshes a dirty page, it can see if all the pages in the area of the page are still dirty. If there are any dirty pages, it will refresh all the pages together, so that multiple I/O requests can be combined into one, which improves efficiency.
Innodb storage is divided into sections, areas and pages. The area is a contiguous 1MB space, so pages in an area can be regarded as adjacent to each other, which is equivalent to sequential reading and writing, which is efficient. In fact, it can be similar to the operating system refresh dirty page, which also combines the BIO into a request, which is also conducive to improve IO efficiency.