InnoDB is a storage engine that supports transaction security and is the default storage engine for mysql. This paper mainly introduces InnoDB row record format and data page realization principle in detail from the perspective of data structure, and see InnoDB storage engine clearly from the bottom.

The main content of this article is based on the nuggets pamphlet “from the root to understand MySQL” collation. If you want to understand in detail, it is recommended to buy the nuggets brochure to read.

InnoDB profile

It is well known that data in mysql is stored on physical disk, and the real data processing is performed in memory. The disk reads and writes slowly. If the disk reads and writes frequently for each operation, the disk performance will be poor. To address these issues, InnoDB divides data into several pages, using the page as the basic unit of disk and memory interaction. Typically, the page size is 16KB. In this case, at least one page of data is read to memory or written to disk at a time. Improves performance by reducing the number of interactions between memory and disk.

In fact, this is essentially a typical cache design idea. Generally, the design of cache is basically considered from the time dimension or space dimension:

  1. Time dimension: If a piece of data is being used, there is a high probability that it will be used again for some time to come. You can thinkHot data cacheAre all implementations of this idea.
  2. Spatial dimensions: If a piece of data is in use, the data stored near it will most likely be used soon.InnoDB data pageandPage caching for the operating systemIs the embodiment of this idea.

InnoDB row format

Mysql inserts data into a table in the form of a record (row). These records are stored on disk in a format called row format. Mysql supports four different types of row formats: Compact, Redundant (older than this article), Dynamic, and Compressed. We can specify the row format in the statement creating or modifying the table:

CREATE TABLETable name (column information) ROW_FORMAT=Line format nameALTER TABLEThe table name ROW_FORMAT=Line format nameCopy the code

Record_format_demo; record_format_demo; record_format_demo;

mysql> CREATE TABLE record_format_demo (
    ->     c1 VARCHAR(10),
    ->     c2 VARCHAR(10) NOT NULL.->     c3 CHAR(10),
    ->     c4 VARCHAR(10)
    -> ) CHARSET=ascii ROW_FORMAT=COMPACT;
Query OK, 0 rows affected (0.03 sec)
Copy the code

Suppose we insert 2 rows of data into the record_format_demo table:

mysql> SELECT * FROM record_format_demo;+------+-----+------+------+ | c1 | c2 | c3 | c4 | +------+-----+------+------+ | aaaa | bbb | cc | d | | eeee | fff | NULL | NULL | + -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- - + 2 rows in the set (0.00 SEC)Copy the code

COMPACT row format

As can be seen from the figure above, a complete record contains two major parts: the additional information recorded and the real data recorded.

Additional information recorded

There are three main types of additional information for records: variable-length field lists, NULL value lists, and record header information.

Variable-length field list

Mysql supports variable length data types (such as VARCHAR(M) and TEXT). The storage space occupied by these data types is not fixed, but changes with the storage content. To accurately describe this data, the storage space occupied by this variable length field should include both:

  1. Real data content
  2. Number of bytes occupied

In the Compact row format, the actual data length of all variable-length fields is stored at the beginning of the record, resulting in a variable-length field length list with the number of bytes occupied by variable-length field data in reverse column order.

We take therecord_format_demoTake the first row of data as an example. Due to thec1,c2andc4Are converted to data types (VARCHAR(10)), so save the three column values at the beginning of the record.

The other thing to note is,The variable-length field length list stores only the length of columns with non-NULL values, not the length of columns with NULL values. That is, for the second record, since the value of c4 column is NULL, the variable length field length list for the second record only needs to store the length of c1 and C2 columns.

A NULL value list

For nullable columns, mysql does not store NULL values in the real data portion of the record to save storage space. Instead, it is stored in a list of NULL values in the extra information of the record.

Each null-allowed column has a binary bit (1: the value is NULL, 0: the value is not NULL) to indicate whether or not to store NULL values, and is sorted in reverse order. MySQL specifies that the list of NULL values must be represented as integer bits. If the number of binary bits used is not integer bytes, 0 will be added to the high part of the byte. C1, C3, and C4 are allowed to store NULL values in the record_format_demo table. The first two records look like this after filling in the NULL value list:

Record header information

The record header consists of a fixed set of 5 bytes (40 bits), each of which represents a different meaning:I won’t go into all the details.

Recorded real data

In addition to the column specific data, some hidden column data is automatically added to the recorded real data.

The column name Whether must Take up the space describe
row_id no 6 bytes A row ID that uniquely identifies a record
transaction_id is 6 bytes Transaction ID
roll_pointer is 7 bytes Rollback Pointers

The actual columns are DB_ROW_ID, DB_TRX_ID, and DB_ROLL_PTR (row_id, transaction_id, and roll_pointer).

Only if the database is not definedA primary keyorThe only keyWhen, hide the columnrow_idExists and uses it as a data tableA primary key. Because the tablerecord_format_demoThere is no primary key defined, so the MySQL server adds the above three columns to each record. Now let’s seeRecorded real dataData structures for two records of:

CHAR(M) column storage format

For CHAR(M) columns, the number of bytes used by the column in a fixed-length character set is not added to the variant-length field length list, as is the number of bytes used by the column in a variant-length character set. It is also important to note that the CHAR(M) columns of variable length character sets require at least M bytes, whereas VARCHAR(M) does not. For example, for a CHAR(10) column using the UTF8 character set, the size of the data stored in the column ranges from 10 to 30 bytes, even if we stored an empty string into the column, it would take 10 bytes.

Row overflow data

VARCHAR(M) Maximum number of data that can be stored

MySQL has a limit on the maximum storage space a record can occupy. All columns (excluding hidden columns and record headers), except BLOB or TEXT columns, can occupy a maximum of 65535 bytes. It can be loosely considered that the storage space occupied by a mysql row cannot exceed 65535 bytes. Storage overhead: VARCHAR(M) type VARCHAR(M) type VARCHAR(M) type VARCHAR(M)

  1. Real data
  2. The length of bytes occupied by real data
  3. NULL indicates that the column may NOT have the space if it has the NOT NULL attribute

If varchar_size_demo has only one field of type VARCHAR, the field occupies a maximum of 65532 bytes. Because the length of real data may be 2 bytes, a NULL value indicates that 1 byte is required. If the VARCHAR column does NOT have a NOT NULL attribute, it can only store 65532 bytes of data at most. If the column is an ASCII character set, the maximum number of characters is 65532. If the character set is UTF8, the maximum number of characters is 21844.

Overflow caused by too much data in the record

Let’s use the varchar_size_demo table in the ASCII character set as an example and insert a record:

mysql> CREATE TABLE varchar_size_demo(
    ->       c VARCHAR(65532)
    -> ) CHARSET=ascii ROW_FORMAT=Compact;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO varchar_size_demo(c) VALUES(REPEAT('a'.65532));
Query OK, 1 row affected (0.00 sec)
Copy the code

The basic unit of disk-memory interaction in mysql is a page, which is usually 16KB, 16,384 bytes, and a single record can occupy the maximum65535Bytes, and that createsMore than one row of data can be stored on a page. In Compact and Redundant line format, to take up the storage space is very large, in the record of the real data will only be stored in the column part of the data, to spread out the rest of the data stored in a few other pages, and then record the real data in the address of a storage point to these pages with 20 bytes, which can find the remaining data in the page, As shown in the figure:This kind ofOnly the first 768 bytes of the column and an address pointing to another page are stored in the actual data of the record, and then the remaining data is stored in another pageLine overflowPages that store more than 768 bytes are also calledOverflow page.

Critical point of row overflow

MySQL specifies that a page contains at least two rows of records. For example, the varchar_size_demo table has only one column c. If we insert two records into the table, how many bytes must be inserted into each record before the row overflow occurs? We need to analyze how the space on the page is being used.

  1. In addition to storing our records, each page also stores some additional information, about 132 bytes.
  2. Each record requires 27 bytes of additional information.

Assume that the number of data bytes stored in a column is N. To ensure that the column does not overflow, the following requirements must be met:

132 + 2X (27 + n) < 16384
Copy the code

So n is less than 8099. That is, if a column stores less than 8099 bytes of data, it is not an overflow column. This value is even smaller if there are multiple columns in the table.

Dynamic and Compressed row format

The default row format in mysql isDynamic.DynamicandCompressedRow format andCompactThe line format is very similar, just processingLine overflowThere is a discrepancy in the data.DynamicandCompressedThe line format will not be inRecorded real dataInstead of storing the first 768 bytes, you store all the bytes on other pages.CompressedThe line format uses a compression algorithm to compress the page to save space.

InnoDB data page structure

We already know that pages are the basic unit of storage InnoDB manages, and the size of a page is typically 16KB. InnoDB has many different types of pages designed for different purposes, but here we focus on pages that store data records, officially called index pages. Since we haven’t introduced indexes yet, let’s call them data pages.

A quick look at the data page structure

The data page can be structurally divided into multiple parts, and different parts have different functions, as shown below:

An InnoDB data page is divided into 7 sections, which are roughly described below.

The name of the Chinese name Space occupied A brief description
File Header The file header 38 byte Page for some general information
Page Header Page header 56 bytes Some information unique to the data page
Infimum + Supremum Minimum record and maximum record 26 bytes Two virtual row records
User Records User record Not sure The contents of the row records that are actually stored
Free Space Free space Not sure Unused space in a page
Page Directory Page directory Not sure The relative positions of some records in a page
File Trailer File the tail 8 bytes Verify page integrity

Storage of records in a page

The user’s own stored data will follow the correspondingRow formatThere areUser RecordsIn the. In fact, the newly generated page does notUser RecordsOnly when we insert data for the first time will theFree SpaceMark a space for the record sizeUser Records. whenFree SpaceWhen this is done, it means that the current data page is also used.In order to be able toUser RecordsTo be clear, we have to understand what was mentioned earlierRecord header information.

Understand the record header information

Firstly, the description of each attribute of record header information is briefly introduced:

The name of the Size (unit: bit) describe
Set aside a 1 1 Don’t use
Set aside a 2 1 Don’t use
delete_mask 1 Flags whether the record has been deleted
min_rec_mask 1 This tag is added to the smallest record in each layer of the B+ tree’s non-leaf nodes
n_owned 4 Indicates the number of records owned by the current record
heap_no 13 Represents the current record location information in the record heap
record_type 3 Indicates the type of the current record. 0 indicates the common record, 1 indicates the B+ tree non-leaf node record, 2 indicates the minimum record, and 3 indicates the maximum record
next_record 16 Represents the relative position of the next record

Next, take the page_demo table as an example and insert some data to detail the record header information.

mysql> CREATE TABLE page_demo(
    ->     c1 INT.->     c2 INT.->     c3 VARCHAR(10000),
    ->     PRIMARY KEY (c1)
    -> ) CHARSET=ascii ROW_FORMAT=Compact;
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO page_demo VALUES(1.100.'aaaa'), (2.200.'bbbb'), (3.300.'cccc'), (4.400.'dddd');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0
Copy the code

These 4 rows are recorded in InnoDB in the following format (only the header and real data are displayed), and the data in the columns are expressed in decimal:Let’s highlight the details of a few properties against this diagram:

  • delete_mask: indicates whether the current record is deleted. 0 indicates that the record is not deleted and 1 indicates that the record is deleted. Undeleted records are not immediately removed from the disk. Instead, they are marked for deletion. All deleted records form oneWaste list. Newly inserted records may later be reusedWaste listSpace occupied, so the storage space occupied by garbage lists is also calledReusable space.
  • heap_no: indicates the positions of the current records on the page. For example, the positions of the previous four records on the page areTwo, three, four, five. In fact, InnoDB automatically adds two virtual records per page, one of which isMinimum recordAnd the other one isThe biggest record. The structure of these two records is very simple5 bytes of record header informationandFixed portion of 8 bytes(infimum or supremum). The two records are kept separatelyInfimum + SupremumPart.We can see from the figure that the minimum record and the maximum recordheap_noThe values are 0 and 1, which means they’re at the front.
  • next_recordFrom:Address offset from the real data in the current record to the real data in the next record. It can be simply understood as a one-way linked list. The next smallest record is the first record, and the next last record is the largest record. To visualize this, we can replace the address offset in next_record with an arrow:As you can see from the picture,User records are actually a one-way linked list, sorted positively by primary key size. If one entry is deleted from the list, the linked list will also change. For example, if we delete the second entry:
    • The second record is not removed from the storage spacedelete_maskThe value is set to 1.
    • Article 2next_recordThe value changes to 0, meaning that there is no next record for this record.
    • Article 1next_recordThis points to record 3.

Page Directory

We already know that records are concatenated in a single linked list in positive order of primary key size in a page. If we want to find a specific record based on the primary key, the simple way is to traverse the list. However, in the case of large amount of data, this method is obviously inefficient. So mysql uses Page Directory to solve this problem. The principle of Page Directory is as follows:

  1. Divide all normal records (including maximum and minimum records, excluding those marked as deleted) into groups. We don’t care how we divide it.
  2. The n_owned attribute in the header of the last record in each group (that is, the largest record in the group) indicates how many records there are in the group.
  3. The address offset of the last record of each group is separately extracted and sequentially stored near the end of the page, which is calledPage Directory.

Mysql allows only one record in the smallest group, only 1-8 records in the largest group, and only 4-8 records in the remaining groups. Like nowpage_demoThere are 18 normal records in the table, and InnoDB divides them into 5 groups. The first group has only one minimum record, as shown below:

The process of finding a record with a specified primary key value in a data Page through Page Directory is two-step:

  1. Determine the slot where the record resides by dichotomy and find the record with the smallest primary key value in the group where the record resides.
  2. recordednext_recordProperty traverses the records in the group that the slot is in.

The optimization of query performance of linked list is basically realized by dichotomy. This is true of Page Directory, skip lists, and lookup trees described above.

Page Header

Page headers are specifically used to store various state information related to the data Page, such as how many records have been stored in the Page, what is the address of the first record, how many slots have been stored in the Page directory, and so on. It occupies 56 bytes. The attributes of each part of the bytes are as follows:

The name of the Space occupied describe
PAGE_N_DIR_SLOTS 2 – Number of slots in the page directory
PAGE_HEAP_TOP 2 – The minimum address of the space not yet used, that is, after that addressFree Space
PAGE_N_HEAP 2 – Number of records on this page (including minimum and maximum records and records marked for deletion)
PAGE_FREE 2 – The address of the first record that has been marked for deletion (each deleted record also forms a single linked list through next_record, which can be reused)
PAGE_GARBAGE 2 – Number of bytes occupied by deleted records
PAGE_LAST_INSERT 2 – The position where the record was last inserted
PAGE_DIRECTION 2 – The insertion direction of the last entry
PAGE_N_DIRECTION 2 – The number of consecutive insertions in one direction. If the insertion direction of the last record changes, the value of this state will be reset.
PAGE_N_RECS 2 – The number of records on the page (excluding minimum and maximum records and records marked for deletion)
PAGE_MAX_TRX_ID 8 bytes Modifies the maximum transaction ID of the current page, which is defined only in the secondary index
PAGE_LEVEL 2 – The level of the current page in the B+ tree
PAGE_INDEX_ID 8 bytes Index ID, indicating which index the current page belongs to
PAGE_BTR_SEG_LEAF 10 bytes Header information for the leaf segment of the B+ tree, defined only on the Root page of the B+ tree
PAGE_BTR_SEG_TOP 10 bytes The header information of the non-leaf segment of the B+ tree, defined only on the Root page of the B+ tree

Here is just listed, do not need to understand all temporarily.

File Header

The File Header is used to describe general information that applies to various pages and consists of the following:

The name of the Space occupied describe
FIL_PAGE_SPACE_OR_CHKSUM 4 bytes Page checksum
FIL_PAGE_OFFSET 4 bytes Page number
FIL_PAGE_PREV 4 bytes The page number of the previous page
FIL_PAGE_NEXT 4 bytes Page number of the next page
FIL_PAGE_LSN 8 bytes The position of the Log Sequence corresponding to the last modification of the page
FIL_PAGE_TYPE 2 – The type of the page
FIL_PAGE_FILE_FLUSH_LSN 8 bytes Defined only on one page of the system tablespace, representing that the file has been flushed to at least the corresponding LSN value
FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID 4 bytes Which table space the page belongs to

Here is just listed, do not need to understand all temporarily. Let’s focus on a few attributes:

  1. FIL_PAGE_SPACE_OR_CHKSUM

The checksum of the current page. For a long string of bytes, we can use some algorithm to calculate a short value to represent the long string of bytes. The short value is called the checksum. The efficiency of string equivalence comparison can be greatly improved by checksum. 2. FIL_PAGE_OFFSET Each page has a unique page number that InnoDB uses to locate a page. 3. FIL_PAGE_TYPE represents the current page type. As we mentioned earlier, InnoDB divides pages into different types for different purposes.

Type the name hexadecimal describe
FIL_PAGE_TYPE_ALLOCATED 0x0000 New assignment, not used yet
FIL_PAGE_TYPE_ALLOCATED 0x0000 New assignment, not used yet
FIL_PAGE_UNDO_LOG 0x0002 The Undo log page
FIL_PAGE_INODE 0x0003 Segment information node
FIL_PAGE_IBUF_FREE_LIST 0x0004 Insert Buffer free list
FIL_PAGE_IBUF_BITMAP 0x0005 The Insert Buffer bitmap
FIL_PAGE_TYPE_SYS 0x0006 The system page
FIL_PAGE_TYPE_TRX_SYS 0x0007 Transaction system data
FIL_PAGE_TYPE_FSP_HDR 0x0008 Table space header information
FIL_PAGE_TYPE_XDES 0x0009 Extended Description page
FIL_PAGE_TYPE_BLOB 0x000A Overflow page
FIL_PAGE_INDEX 0x45BF Index pages are what we call data pages
4. FIL_PAGE_PREVAnd FIL_PAGE_NEXT
Indicates the number of the previous and next pages of the page. Each page passesFIL_PAGE_PREVAnd FIL_PAGE_NEXTForm a bidirectional linked list.

File Trailer

The basic unit of memory and disk interaction in mysql is the page. If an in-memory page is modified, it must be synchronized to disk at some point. If there is a problem with the system during the synchronization process, the page data on the disk may not be fully synchronized, that is, dirty pages may occur. To avoid this problem, mysql adds File Trailer to the end of each page to verify page integrity. File Trailer consists of 8 bytes:

  1. The first four bytes represent the checksum of the page and this part corresponds to the checksum in the File Header. That’s the simple answerFile HeaderandFile TrailerBoth have checksums, and if they agree, the data page is complete. Otherwise, the data page isDirty pages.
  2. The last four bytes represent the log sequence position (LSN) corresponding to the last modification of the page.

Original is not easy, feel that the article is written well small partners, a praise 👍 to encourage it ~

Welcome to my open source project: a lightweight HTTP invocation framework for SpringBoot