“This article has participated in the call for good writing activities, click to view: the back end, the big front end double track submission, 20,000 yuan prize pool waiting for you to challenge!”
This article is based on MySQL 8
In the previous two articles, we analyzed two row record storage formats for the MySQL InnoDB engine:
- Compact format
- Redundant format
Here is a brief summary:
- Compact formatStructure:
- Variable-length field length table: Contains the length of each variable-length field whose data is not NULL, in reverse order of column
- NULL value list: A BitMap is used to identify NULL fields that can be NULL
- Record header information: Fixed 5 bytes, including:
- Useless bits: 2 bits
- Deleted_flag: 1 bits, which indicates whether a record is deleted
- Min_rec_flag: 1 bits. It is the minimum record flag of a non-leaf node in the B+ tree
- N_owned: 4 bits indicates the number of records in the slot
- Heap_no: 13 bits. The number of the record in the heap can also be interpreted as the position of the record in the heap
- Record_type: 3 bits record type. The common data record is 000. The node pointer type is 001
- Next_record pointer: 16 bits, relative position of the next record on the page
- Hidden columns:
- DB_ROW_ID: 6 bytes. This column may not be generated. A user-defined primary key is preferred. If no Unique key is defined, a Unique key is selected as the primary key. If no Unique key is defined, a hidden column DB_ROW_ID is added as the primary key by default
- DB_TRX_ID: 6 bytes, specifies the id of the transaction that generated the current entry. The system version is automatically incrementalized each time a new transaction is started. The system version at the start of the transaction is used as the transaction ID
- DB_ROLL_PTR: 7 bytes. The undo log pointer points to the undo log of the current entry. If the transaction is rolled back, the original value is read from the undo Log and placed in the record
- Data column:
- Bigint: if the value is not NULL, it occupies 8 bytes. The first byte is the sign bit, and the rest bits store digits. The number range is -2^63 to 2^63-1 = -9223372036854775808 to 9223372036854775807. If NULL, no storage space is occupied
- Double: a non-null column that complies with IEEE 754 floating-point “Double format” bit Layout. If the column is NULL, it does not occupy any storage space
- For fixed-length fields, the data can be directly stored without the length information. If the length is insufficient, the data can be supplemented. For example, char, supplemented by 0x20, corresponds to a space.
- Varchar storage: Since the data starts with a list of variable-length field lengths, vARCHAR only needs to store the actual data and does not need to populate additional data. But we haven’t considered the case of storing extremely long data
- Redundant formatThe structure andCompact formatThe difference between:
- All field length list: Unlike the Compact row format, Redundant begins with an all field length list: Records the length offset of all fields, including hidden columns. The offset is, if the first field is of length A and the second field is of length B, then the first field in the list is a and the second field is a + b. All fields are in reverse order
- Record header information: Fixed 6 bytes
- Useless bits: 2 bits
- Deleted_flag: 1 bits, which indicates whether a record is deleted
- Min_rec_flag: 1 bits. It is the minimum record flag of a non-leaf node in the B+ tree
- N_owned: 4 bits indicates the number of records in the slot
- Heap_no: 13 bits. The number of the record in the heap can also be interpreted as the position of the record in the heap
- N_field: 10 bits, the number of columns in the record, ranging from 1 to 1023
- 1byte_offS_flag: 1 bit. 1 indicates that the storage length of each field is 1byte, and 0 indicates that the storage length is 2 bytes
- Next_record pointer: 16 bits, relative position of the next record on the page
- Data column:
- CHAR storage: Regardless of whether the field is NULL or of any length, CHAR (M) takes up as many bytes as the maximum M * byte encoding length. If it is NULL, 0x00 is filled in. If the length is insufficient, 0x20 is added at the end.
We did not analyze how to store long fields, but we will do so in this article.
Review the previous page again. Since each piece of data is a hard disk addressing read, we want to reduce the number of hard disk addressing reads and consider reading the data piece by piece, so that we are likely to need the data in memory for the next request, rather than reading from the hard disk. Based on this idea, InnoDB divides the data of a table into pages, which are linked by b-tree indexes. The default page size is 16384 Bytes, or 16KB (innodb_page_size).
For large fields, such as those of Text type, if they also exist on the clustered index, the node data will be too large and many pages will be read at once, which will be inefficient (for example, if we don’t have a request to read the Text column). As a result, InnoDB tends to store long variable-length fields elsewhere, which refers to the off-page column design mode. Different line formats are handled differently.
InnoDB is a persistent storage engine, which means data is stored on disk. But reading and writing data, processing data, that happens in memory. That is, data needs to be read from disk into memory. So how does this read read? It is also inefficient to read every piece of data that is processed into memory. Since each piece of data is a hard disk addressing read, we want to reduce the number of hard disk addressing reads and consider reading the data piece by piece, so that we are likely to need the data in memory for the next request, rather than reading from the hard disk. Based on this idea, InnoDB divides the data of a table into pages, which are linked by b-tree indexes. The default page size is 16384 Bytes, or 16KB (innodb_page_size). The value can be changed when MySQL is started. The value can only be one of 4096,8192,16384.
Redundant off-page column processing
For long columns in the Redundant row format, only the first 768 bytes are stored on the data row and the rest of the data is put into other pages. For an example, run the following SQL to create a test table and insert test data:
drop table if exists long_column_test; CREATE TABLE `long_column_test` ( `large_content` varchar(32768) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=REDUNDANT; Insert into long_column_test values (repeat("az", 384)); insert into long_column_test values (repeat("az", 384)); Insert into long_column_test values (repeat("az", 4050)); Insert into long_column_test values (repeat("az", 16384)); insert into long_column_test values (repeat("az", 16384));Copy the code
We use 64 hexadecimal encoder to see table file long_column_test. Ibd, you can see the first data is a normal data, the storage and before we speak of Redundant storage, no special:
List of all field lengths (8 bytes, 4 columns, 1 data column, 3 hidden columns) : 03 13(768+7+6+6), 00 13(7+6+6), 00 0C (6+6), 00 06(6) Record header (6 bytes) : 00 00 10 08 03 AC Hide column DB_ROW_ID (6 bytes) : 00 00 00 00 02 22 Hide column DB_TRX_ID (6 bytes) : 00 00 00 00 58 B7 Hide column DB_ROLL_PTR (7 bytes) : 82 00 00 01 0C 01 10 Large_content (768 bytes) : 61 7a......Copy the code
For the second row, we find that this rowlarge_content
Instead of storing all of the data in a column in one row, some of it is stored in one row and some of it is stored elsewhereoff-pageColumns stored elsewhere are calledOverflow page, its structure is as follows:
The first is the data column
List of all field lengths (8 bytes, 4 columns, 1 data column, 3 hidden columns) : The first two bits of the first byte do not represent the length, the highest bit is NULL, the second mark is whether the record is on the same page, because it is not NULL, so the highest bit is 0, because it is not on the same page because it exists on overflow page, so the second bit is 1, followed by 3, 27 represents the length, That is, 20+768+7+6+6), 00 13(7+6+6), 00 0C (6+6), 00 06(6) Record header (6 bytes) : 00 00 10 08 03 AC Hidden column DB_ROW_ID (6 bytes) : 00 00 00 00 02 22 DB_TRX_ID (6 bytes) : 00 00 00 00 58 B7 DB_ROLL_PTR (7 bytes) : 82 00 00 01 0C 01 10 Large_content (768 bytes) : 61 7a...... Pointer to the address where the remaining data resides (20 bytes) : 00 00 05 23 00 00 00 05 00 00 00 01 00 00 00 00 00 00 00 00 1c A4Copy the code
For the off-page column, there is a pointer at the end of the column data to the address of the rest of the data. This pointer takes 20 bytes and has the structure:
Then the column stores the rest of the data to the Overflow page:
Data column large_content (remaining 7332 bytes) : 61 7a......Copy the code
What about when the field is longer than the one page data limit? Let’s look at the third data structure:
As you can see, excessively long columns of data are stored on the Overflow page as linked list links.
The off-page structure of the Redundant line format is:
This leads us to three questions:
- When does a column become an off-page column?
- When are Overflow pages stored as linked list nodes?
- For which column types are stored this way?
1. When does a column become an off-page column?
First of all, innoDB’s default page size is 16KB, or 16384 bytes, and InnoDB’s data is loaded on a per-page basis. Then, the data structure that organizes innoDB engine data is the B+ tree. Scanning B+ trees for data is also page by page loaded search. The more rows of data that can be contained on a page, the more efficient the search is obviously. But if there is only one data on a page, then the B+ tree is about as efficient as a linked list. So, for efficiency, you need to have at least two pieces of data on a page. So there are:
The innoDB page also has some metadata of its own (132 bytes, which we will examine in detail in a later section). Here we use long_column_test as an example:
It can be deduced that:
In practice, more than one column of data may be long. Also, since data is not stored together with row data, search reads are inefficient, so the Redundant row format minimizes the number of columns that become off-page and minimizes the number of columns that become off-page.
2. When do overflow pages split into linked list nodes?
Unlike table data, the Overflow page does not organize data through B+ trees and does not do complex searches; it is simply a linked list. So we just need to ensure that the data size is no more than one page, that is:
The data node also has some additional information, and the page also has its own additional information, which will be seen in a later article. So, the actual data size would have to be less than 16,384 after all this extra information is stripped out. If not, the storage is divided into multiple pages, and the nodes are linked by a linked list.
3. For which column types are stored this way?
Variable length fields, such as VARCHar, varbinary, text, blob, etc., are stored using this mechanism. Fields of fixed length, such as char, are also stored like vARCHar if they are too long, in which case the char is not padded with whitespace. However, this is unusual; char is limited to 255 characters in length and must be larger than three bytes to be larger than 768, such as UF8MB4, where each character is larger than three bytes.
Off-page column processing in Compact
Off-page handling in Compact is basically the same as Redundant, just because the data structure is different:
Long_column_test = long_column_test; long_column_test = long_column_test;
It can be deduced that:
Dynamic processing of the off-page column
Dynamic is basically the same as Compact except that the off-page column handling is different from Compact.
The main difference between Dynamic and off-page column handling is thatAll data is stored on overflow pages, with only 20-byte Pointers stored in the off-page column, this pointer has the same structure as a 20-byte pointer in the Redundant format:
Compressed off-page column processing
Compressed line format is basically the same as Dynamic, including the off-page column processing, in fact, on the basis of Dynamic, add compression processing. For compression processing, this will be discussed in detail in the compression page section below.
Wechat search “my programming meow” public account, a daily brush, easy to improve skills, won a variety of offers