As the saying goes, behind every successful man there is a woman quietly supporting him; for MySQL, that role is played by the InnoDB storage engine.

The most important feature that differentiates MySQL from other databases is its pluggable storage engine architecture. Among the many storage engines, InnoDB is the most widely used, and starting with MySQL 5.5.8 it is the default storage engine.

The InnoDB storage engine supports transactions and is designed primarily for online transaction processing (OLTP) applications. It features a row-lock design, foreign key support, and non-locking reads, meaning that the default read operation does not acquire any locks.

InnoDB achieves high concurrency through multi-version concurrency control (MVCC) and uses REPEATABLE READ, one of the four isolation levels defined by the SQL standard, as its default isolation level. Meanwhile, a strategy called next-key locking is used to avoid phantom reads. In addition, the InnoDB storage engine provides high-performance and high-availability features such as the insert buffer, doublewrite, the adaptive hash index, and read ahead.

The buffer pool

The InnoDB storage engine is disk-based and manages its records in pages. But because of the gap between CPU speed and disk speed, disk-based database systems usually use a buffer pool, a cache of pages in memory, to improve the overall performance of the database.

During a read operation, a page read from disk is first placed in the buffer pool. The next time the same page is needed, the database first checks whether it is in the buffer pool; if it is, the page is said to be hit in the buffer pool and is read directly from memory. Otherwise, the page is read from disk.
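
If you want to see how often reads are actually served from the buffer pool, here is a rough sketch of the idea, assuming MySQL 5.7 or later where status counters are exposed through performance_schema:

```sql
-- Logical read requests vs. reads that had to go to disk.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

-- Approximate hit rate: 1 - (disk reads / logical read requests).
SELECT 1 - (SELECT VARIABLE_VALUE
            FROM performance_schema.global_status
            WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads')
         / (SELECT VARIABLE_VALUE
            FROM performance_schema.global_status
            WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')
       AS buffer_pool_hit_rate;
```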

For page modification operations, the page is first modified in the buffer pool and then flushed to disk at a certain frequency. This flushing of dirty pages back to disk is not triggered on every page update; instead, it is driven by a mechanism called Checkpoint.
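
Checkpoint behavior is influenced by a few server parameters. As a hedged illustration (the variable names are standard InnoDB parameters, but their defaults vary by version), they can be inspected like this:

```sql
-- Flushing of dirty pages is driven by parameters such as these.
SHOW VARIABLES LIKE 'innodb_max_dirty_pages_pct';         -- flush when dirty pages exceed this percentage
SHOW VARIABLES LIKE 'innodb_io_capacity';                 -- I/O budget for background flushing
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; -- dirty pages currently in the buffer pool
```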

Therefore, the buffer pool size directly affects the overall performance of the database, which can be set by the innodb_buffer_pool_size configuration parameter.
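
As a small sketch of how this parameter is typically inspected and, on MySQL 5.7 and later where the buffer pool can be resized online, adjusted (the 4GB value is just an example):

```sql
-- Current buffer pool size, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- MySQL 5.7+ allows resizing online; earlier versions require
-- editing my.cnf and restarting the server.
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- e.g. 4GB
```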

Specifically, the types of pages cached in the buffer pool are: index pages, data pages, undo pages, the insert buffer, the adaptive hash index, InnoDB lock information, and the data dictionary.

As you can see from the architecture diagram, besides the buffer pool, the memory area of the InnoDB storage engine also contains the redo log buffer and an additional memory pool. The InnoDB storage engine first puts redo log information into this buffer and then flushes it to the redo log files on a regular basis. The redo log buffer generally does not need to be large; its size is controlled by the innodb_log_buffer_size configuration parameter.

Data pages and index pages

The page is the most basic structure in InnoDB storage and the smallest unit of InnoDB disk management. All database-related content is stored in page structures. There are several types of pages, and data pages and index pages are two of the most important.
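
For reference, the page size can be checked directly; by default it is 16KB, and it can only be chosen when the data directory is initialized (this assumes MySQL 5.6 or later, where innodb_page_size is configurable):

```sql
-- The page size is fixed at initialization time; the default is 16KB.
SHOW VARIABLES LIKE 'innodb_page_size';
SHOW GLOBAL STATUS LIKE 'Innodb_page_size';  -- same value, exposed as a status counter
```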

Insert Buffer

As we all know, InnoDB inserts usually need to follow primary key order to achieve high insert performance. When a table has non-clustered (secondary) indexes in addition to the clustered index, inserts into the data pages still follow primary key order, but inserts into the leaf nodes of the non-clustered indexes are no longer sequential. These discrete accesses to the non-clustered index pages introduce random I/O, which degrades insert performance.

InnoDB designed the Insert Buffer for exactly this situation. For an insert or update on a non-clustered index, InnoDB does not write to the index page directly every time. Instead, it first checks whether the target non-clustered index page is already in the buffer pool; if it is, the record is inserted directly. If not, the record is first placed into an Insert Buffer, so that from the outside the record appears to have reached its leaf node even though it is actually stored elsewhere. The Insert Buffer is then merged with the non-clustered index pages at a certain frequency and under certain conditions. Since multiple inserts can often be merged into a single operation, this greatly improves insert performance for non-clustered indexes.
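
In modern MySQL versions the Insert Buffer has been generalized into the Change Buffer. A hedged sketch of how its behavior can be inspected (the variable names are standard, but values and defaults are version-dependent):

```sql
-- Which operations are buffered (all / none / inserts / deletes / changes / purges).
SHOW VARIABLES LIKE 'innodb_change_buffering';
-- Maximum share of the buffer pool the change buffer may occupy (MySQL 5.6+).
SHOW VARIABLES LIKE 'innodb_change_buffer_max_size';

-- The "INSERT BUFFER AND ADAPTIVE HASH INDEX" section of this output shows
-- merge statistics; \G is the mysql client's vertical-output terminator.
SHOW ENGINE INNODB STATUS\G
```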

Double Write

If the Insert Buffer brings a performance improvement to the InnoDB storage engine, Double Write brings reliability to its data pages.

As shown in the figure above, Double Write consists of two parts: the doublewrite buffer in memory, which is 2MB in size, and 128 contiguous pages in the shared tablespace on disk, also 2MB in size. When dirty pages are flushed from the buffer pool, they are not written straight to their tablespace on disk. Instead, the memcpy function first copies them into the in-memory doublewrite buffer, which then writes them sequentially to the doublewrite area of the shared tablespace in two passes of 1MB each, immediately calling fsync to synchronize the disk and avoid problems with operating-system buffered writes. Once the doublewrite pages have been written, the pages in the doublewrite buffer are written to their actual tablespace files.

If the operating system crashes while a page is being written to disk, the InnoDB storage engine can, during recovery, find a copy of the page in the doublewrite area of the shared tablespace, copy it to the tablespace file, and then apply the redo log.
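
A quick, non-authoritative way to confirm that doublewrite is enabled and to see how heavily it is used is through the following variable and status counters:

```sql
-- Is doublewrite enabled?
SHOW VARIABLES LIKE 'innodb_doublewrite';
-- Innodb_dblwr_pages_written / Innodb_dblwr_writes gives the average
-- number of pages written per doublewrite operation.
SHOW GLOBAL STATUS LIKE 'Innodb_dblwr%';
```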

Redo Log Buffer

When the version of a page in the buffer pool is newer than the one on disk, the database needs to flush the new version from the buffer pool to disk. But if a page were flushed every time it changed, the overhead would be enormous, so InnoDB adopts the Write Ahead Log strategy: when a transaction commits, the redo log is written first, and the dirty pages are flushed to disk later at a suitable time. If data is lost due to a crash, it can be recovered from the redo log.

The InnoDB storage engine first puts redo log information into the redo log buffer and then flushes it to the redo log file on a regular basis. The redo log buffer generally does not need to be large, as it is typically flushed to the log file every second. Its size is controlled by the innodb_log_buffer_size configuration parameter, which defaults to 8MB.
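
A minimal check of the redo log buffer size (the 8MB default applies to the versions discussed here; recent releases default to larger values, and the parameter is set in my.cnf and takes effect at startup):

```sql
-- Redo log buffer size in bytes (set under [mysqld] in my.cnf; applied at startup).
SHOW VARIABLES LIKE 'innodb_log_buffer_size';
```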

In addition to the once-per-second flush, the redo log buffer is also flushed to the log file every time a transaction commits. InnoDB is a transactional storage engine and implements transaction durability through the Force Log at Commit mechanism: when a transaction is committed, all of its logs must first be written to the redo log file and persisted, and only then is the commit considered complete. InnoDB's write mechanism is roughly shown below.

To ensure that each log record reaches the redo log file, every time the redo log buffer is written to the redo log file an fsync operation must be invoked, so that the log is actually written from the file system cache to disk.

You can control the policy for flushing redo logs to disk with innodb_flush_log_at_trx_commit. The default value of this parameter is 1, meaning that every transaction commit must call fsync once; it can also be set to 0 or 2. A value of 0 means that no redo log write is performed at transaction commit; the write and fsync are done only by the master thread, roughly once per second. A value of 2 means that at commit the redo log is written only to the file system cache, without calling fsync. Thus performance is highest when the value is 0, but the durability of committed transactions is no longer guaranteed.
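
As a sketch, the parameter can be inspected and changed at runtime like this (setting it to 2 is only an example of trading durability for speed, not a recommendation):

```sql
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

-- 1 (default): write and fsync the redo log on every commit  -> full durability
-- 0: write and fsync roughly once per second from the master thread
-- 2: write to the file system cache on every commit, fsync roughly once per second
SET GLOBAL innodb_flush_log_at_trx_commit = 2;  -- example: trade durability for speed
```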

Adaptive Hash Index

InnoDB builds hash indexes for hot pages based on the frequency and pattern of access, in order to improve query efficiency. The InnoDB storage engine monitors queries against the indexes on each table, and if it observes that building a hash index would speed things up, it builds one automatically; hence the name adaptive hash index.

Adaptive hash indexes are built from the B+-tree pages already in the buffer pool, so they are fast to build and never need to cover an entire table. There is one requirement: consecutive accesses to a page must follow the same pattern, that is, the query condition (the WHERE clause) must be exactly the same and must be used repeatedly in succession.
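
The adaptive hash index can be observed and toggled at runtime; a small example, assuming a version where innodb_adaptive_hash_index is dynamic (MySQL 5.6 and later):

```sql
SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';
-- Disable it if profiling shows it does not help your workload.
SET GLOBAL innodb_adaptive_hash_index = OFF;

-- SHOW ENGINE INNODB STATUS reports "hash searches/s" vs. "non-hash searches/s"
-- in its "INSERT BUFFER AND ADAPTIVE HASH INDEX" section.
```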

Lock Info

As we all know, InnoDB storage engine locks table data at the row level. However, InnoDB also uses locks in many other places within the database, allowing concurrent access to many different resources. Database systems use locks to support concurrent access to shared resources, providing data integrity and consistency. We’ll learn more about locks later.
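
For completeness, a few hedged examples of where lock and transaction information can be inspected (the exact views differ between MySQL 5.x and 8.0):

```sql
-- Currently running InnoDB transactions, including any lock they are waiting for.
SELECT * FROM information_schema.INNODB_TRX;

-- MySQL 5.x: information_schema.INNODB_LOCKS and INNODB_LOCK_WAITS
-- MySQL 8.0: performance_schema.data_locks and data_lock_waits
-- The TRANSACTIONS section of SHOW ENGINE INNODB STATUS also lists lock waits.
```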

Data Dictionary information

InnoDB has its own table cache, which can be called a table definition cache or data dictionary. When InnoDB opens a table, it adds a corresponding object to the data dictionary.

A data dictionary is the collection of meta-information about the data, library objects, table objects, and so on in a database. In MySQL, data dictionary information includes table structures, database and table names, field data types, views, indexes, table field information, stored procedures, triggers, and so on. The MySQL INFORMATION_SCHEMA library provides access to metadata, statistics, and access information about MySQL Server (for example, database and table names, field data types, and access permissions). The information stored in this library can also be regarded as MySQL's data dictionary.
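
A simple sketch of querying this metadata through INFORMATION_SCHEMA; the schema and table names below are placeholders to be replaced with your own:

```sql
-- Tables in a given schema ('your_database' is a placeholder).
SELECT TABLE_NAME, ENGINE, TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database';

-- Column metadata for one table ('your_table' is a placeholder).
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
  AND TABLE_NAME = 'your_table';
```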

Afterword.

This article is only a brief introduction to the concepts and principles behind InnoDB's in-memory structures. If you want to know more about InnoDB, please follow my WeChat official account.

References

  • InnoDB Storage Engine
  • High Performance MySQL
  • InnoDB architecture drawing by Jiang Chengxiao