The article goal

Through the introduction of row storage structure and data page structure, it helps you master the storage structure and provides a solid theoretical foundation for index optimization

Let’s take a look at the general content of the whole

Compact row format diagram

Relationships between rows on the same page (via linked lists)

Relationships between pages and records

Page to page relationships (done by bidirectional linked lists)

Schema diagram for index lookup

Further refinement

InnoDB row storage structure

InnoDB page Introduction

InnoDB divides data into pages, using pages as the basic unit of interaction between disk and memory, which is the basic unit of storage space managed by MySQL. InnoDB pages are typically 16kb in size (you can set them, of course). That is, at least 16kb is read from the disk to the memory at a time, and at least 16kb is flushed to the disk at a time. The physical addresses within the page are contiguous.

InnoDB page structure diagram

InnoDB row format type, row format syntax, COMPACT row format

InnoDB row format type

  • Compact (focus)
  • Redundant
  • Dynamic
  • Compresseed

Additional information recorded (key points)

Variable-length field length list

MySQL supports variable length data types, such as VARCHAR(M), VARBINARY(M), TEXT, BLOB, etc. These variable length data types take up storage space in two parts:

  1. Real data content
  2. Number of bytes occupied

If you do not save the number of bytes occupied by the real data, the MySQL service cannot determine how long the real data is, resulting in an incorrect count. In Compact row format, the lengths of all variable-length types are stored in reverse column order in a table of parameters at the beginning of the row record.

Insert two pieces of data for later explanation

Schematic diagram:

Actual data storage diagram:

We can see that aAAA, BBB and CC are stored together. If we do not record the number of bytes occupied, we cannot distinguish them

It is important to note that the variable-length field length list stores only the occupied length of non-NULL values, not NULL values.

A NULL value list

Some columns in a table may store NULL values, which would take up too much space if they were stored in the real data of the record. All Compact row formats manage these null-valued columns and store them in the NULL value list.

  1. First count which columns can be NULL

If there are null-allowed columns in the table, each null-allowed column corresponds to a binary bit, and the binary bit is sorted in reverse order of the column

The record_format_demo table has three columns whose values are allowed to be NULL

  1. The list of Null values is populated with either 1 or 0, depending on the actual value of the column
  • When the value of the binary bit is 1, the value of the column is NULL
  • When the value of the binary bit is 0, the value of the column is not NULL
  • The list of NULL values must be represented as integer byte bits, and if the number of binary bits used is not an integer byte, zeros are added at the high end of the byte
  • NULL values occupy only flag bit space, not real data space

Two records filled with a list of NULL values would look like this:

  1. A NULL value list is not allocated if there is no NULL allowed column in the table

Record header information

In addition to the variable-length field length list and the NULL value list, there is another record header that describes the row, which consists of a fixed five bytes. Five bytes is equal to 40 binary bits. Different bits mean different things, as shown in the figure below:

What is each bit used for

Recorded real data

MySQL adds some columns (also known as hidden columns) to each record by default.

Note that the MySQL server adds transaction_id and roll_pointer columns to each record, but row_id is added to the record only if the table does not have a primary key defined.

VARCHAR(M) for COMPACT row format

  • Real data
  • Variable length field length
  • NULL indicates that there is no storage space if the column has a NOT NULL attribute

If a VARCHAR column does NOT have a NOT NULL attribute, it can only store 65532 bytes of data at most, because variable length fields require two bytes and NULL values require one byte.

If a VARCHAR column has a NOT NULL attribute, it can only store 65533 bytes of data at most, because variable length columns require two bytes and do NOT require NULL identification.

Line overflow

The size of a page is generally 16KB, that is, 16384 bytes, and a VARCHAR (M) type column can store 65532 bytes at most. Thus, a page may not be able to store one record, so multiple pages need to store one record.

As you can see in the Compact row format, if there is too much data in a column, only the first 768 bytes of that column and an address pointing to another page are stored in the real data of the record, and the rest of the data is stored in another page, a process also known as row overflow.

Here’s the schematic:

Question: Will rows overflow if data exceeds 768 bytes?

A: Look at the row overflow critical point below

  • MySQL specifies that a page contains at least two rows of records
  • In addition to records, each page needs to store some metadata information (136 bytes of space)
  • Each record requires 27 bytes of metadata

Calculate the critical point of row overflow

InnoDB page storage

Let’s take a look at the InnoDB page structure diagram

Take a look at the page structure field description

Storage of records in a page

Earlier we talked about the different bits of the record header representing different information. If you don’t have an image, you can check the record header information above.

Delete_mask: This property marks whether the current record was deleted and takes up a binary bit. A value of 0 indicates that the record was not deleted and a value of 1 indicates that the record was deleted.

Min_rec_mask: This attribute marks whether the modified record is the smallest record in a non-leaf node of the B+ tree. A value of 0 indicates that it is not the smallest record in a non-leaf node of the B+ tree.

Heap_no: This property represents the position of the current record on the page. As you can see from the figure, the four records we inserted are at positions 2, 3, 4, and 5 on the page. InnoDB automatically adds two records, also called pseudo-records or virtual records, to each page, one for the minimum record and one for the maximum record.

Next_record: indicates the address offset from the current record to the next record. The next entry does not refer to the order in which we inserted, but to the primary key. The next record with the minimum primary key value is the record with the minimum primary key value. The next record with the maximum primary key value is the record with the maximum primary key value. Let’s see the following figure.

InnoDB maintains a single linked list of records no matter how many entries we add, delete, or delete on a page. Each node in the linked list is connected in ascending order of primary key values.

If data is deleted from a page, it will not be recycled for a short time. If data is deleted, it will be recycled for a long time, otherwise it will waste too much storage space. If data is deleted, it will be immediately added back to the page, so the record point will be restored to the original.

Page directory

  • Divide all normal records (including maximum and minimum records, excluding those marked as deleted) into groups
  • The n_owned attribute in the header of the last record in each group indicates how many records there are in the group
  • The address offset of the last record of each group is stored sequentially. Each address offset is also called a Slot. These address offsets are stored near the end of the Page. The part of the Page where the address offsets are stored is also called Page Directory.

  • There are now two slots in the Page Directory section, which means that our records are divided into two groups. Slot 0 contains the address offset of the smallest record. The maximum recorded address offset is recorded in slot 1.
  • Notice the n_owned attribute in the headers for the smallest and largest records
  1. The minimum record has an N_owned value of 1, which means that there is only one record in the group, the minimum record itself
  2. The largest record has an n_owned value of 5, which means that there are only five records in the group, including the largest record itself and the four inserted records
  • Why does the smallest record have an N_owned value of 1 and the largest record have an N_owned value of 5?

InnoDB has rules on the number of records in each group. For the smallest group, there can only be 1 record, for the largest group, there can only be 1 or 8 records, and for the rest of the group, there can only be 4 or 8 records. So the grouping is carried out according to the following steps:

  1. Initially, there are only two records in a data page, the minimum record and the maximum record, which belong to two groups
  2. After each record is inserted, it is placed in the group with the largest record until the number of records in the group with the largest record equals 8
  3. When the number of records in the group where the maximum record is located is equal to 8, the group where the maximum record is located is split into 2 groups on average when a record is inserted, and then there are 4 records left in the group where the maximum record is located, and then the inserted record can be put into the corresponding group.

Insert 6 entries into the table

How do I now find records from this page directory? Because the primary keys of the records represented by each slot are sorted from smallest to largest, you can use dichotomy for quick lookup.

Summary: Each data page can form a bidirectional linked list, and the records in each data page can form a one-way linked list. Each data page will generate a page directory for the records stored in it. When searching for a record through the primary key, dichotomy can be used in the page directory to quickly locate the corresponding slot. Then iterate through the records in the corresponding group of the slot to quickly find the specified record.

Relationships between pages and records

MySQL index

Look in a page

Search by primary key

Use dichotomy to quickly locate the corresponding slot in the page directory, and then iterate through the records in the corresponding group of the slot to quickly find the record

Use other columns as search criteria

There is no page directory for non-primary key columns in the data page, so it is not possible to quickly locate the corresponding slot by dichotomy. In this case, you can only traverse each record in the single linked list starting from the smallest record and then compare whether each record meets the search criteria.

Look in multiple pages

In most cases, the number of records in a table is very large and many data pages are required to store the records. Looking for records in multiple pages can be divided into two steps:

  1. Navigate to the page where the record resides
  2. Find the corresponding record from the page you are on

No matter by the value of the primary key column or other columns, we cannot quickly locate the page where the record is located, so we can only find the specified record from the first page according to the corresponding search method. All of this is obviously super time consuming because you have to traverse all the data pages

Why traverse all the data pages when looking in multiple pages?

Because the records in each page are not regular, mysql does not know which page the records that meet the search criteria are in, so it has to be traversed one by one. What should I do if I want to quickly locate the records I need to find in which data pages? Just like creating a directory for records in a data page, we can create a directory for all data pages. This directory must do the following:

  • The primary key value in the next data page must be greater than the primary key value in the previous page
  • Create a directory entry for all pages

InnoDB uses pages as the basic unit for managing storage space, with a maximum of 16KB of contiguous storage space. As the number of records in a table increases, it requires very large contiguous storage space to put all the entries in a table, which is not practical for a very large number of records.

We often add and delete records. If we delete all the records on a page, the page no longer needs to exist, which means that the corresponding directory entry no longer needs to exist, which means that we need to move the directory entry after the directory entry forward

InnoDB needs a flexible way to manage all directory entries. The designer found that these directory entries are actually similar to user records, except that the two columns of the directory entry are the primary key and the page number. Therefore, the designer reused the user records to store directory entries. In order to distinguish them from user records, we call these records used to represent directory entries as directory entry records. How does InnoDB distinguish between a user record and a directory entry record? The record_type attribute in the record header, whose values represent the following:

  • 0: indicates common user records
  • 1: records of directory entries
  • 2: minimum record
  • 3: indicates the maximum record

Differences between directory entry records and user records:

  • The record_type value of a directory entry record is 1, whereas the record_type value of a common user record is 0
  • The directory entry record has only two columns: primary key and page number, whereas the normal user record column is user-defined and may contain many columns, as well as hidden columns added by InnoDB itself.
  • Remember we talked about a property called min_rec_mask earlier when we were nagging about logging headers? Only the directory entry record with the smallest primary key value in the page where the directory entry record is stored has a min_rec_mask value of 1, and all other records have a min_rec_mask value of 0

The data to find

There is more than one page that stores directory entry records. If you want to find a user record based on the primary key value, there are roughly 3 steps:

  1. Identify the catalog entry record page. At present, there are two pages storing directory entry records, namely page 30 and page 32. Because the primary key value range of the directory entry represented by page 30 is [1,320], and the primary key value of the directory entry represented by page 32 is not less than 320, the directory entry corresponding to the record with the primary key value of 20 is recorded in page 30.
  2. Use the catalog entry record page to determine which page the user record actually resides on
  3. Locate the specific record in the page where the actual user record is stored

In the first step of this query, we need to locate the page that stores the directory entry records, but these pages cannot be discontiguous in the storage space. If we have too much data in the table, there will be many pages that store the directory entry records. How can we quickly locate a page that stores the directory entry records according to the primary key value? In fact, it is very simple to create a more advanced directory for the pages that store the entries, like a multi-level directory, with smaller directories nested within the larger directory where the actual data is stored, so now the schematic diagram of each page is as follows:

A more advanced directory entry page 33 is generated, with two records representing page 30 and page 32. If the primary key value of the user record is between [1,320], look for a more detailed directory entry record on page 30. The directory level continues to increase as the table records are added, and if you simplify, So we can use this graph to describe it:

Because the data pages are stored in the B+ tree data structure, they are called nodes. As can be seen from the figure, user records are actually stored in the lowest nodes of the B+ tree, which are also called leaf nodes or leaf nodes. The rest nodes are used to store directory items, and these nodes are all called inner nodes or non-leaf nodes. The top node is also called the root node.

Clustering index

  • Sorting records and pages using the size of the record primary key has three implications
  1. The records in the page are arranged in a single linked list by the size of the primary key
  2. Each page that stores user records is also arranged in a bidirectional linked list according to the size of the primary key recorded in the page
  3. The pages of each directory entry are also arranged in a bidirectional linked list according to the size of the primary key recorded in the page
  • The leaves of the B+ tree store the complete user record

We call the B+ tree with these two features a clustered index. The user records are stored in the leaf node of the clustered index. The clustered index does not need to be created by the mysql statement. In addition, in the InnoDB storage engine, clustered indexes are how data is stored (all user records are stored in leaf nodes), so called indexes are data.

Secondary indexes

The clustering index described above only works if the search criteria are primary keys, because the data in a B+ tree is sorted by primary keys. How do we iterate if we want to search for other columns?

You can build several more B+ trees, and the data in different B+ trees can be sorted differently. For example, if we use the size of the C2 column value as the sorting rule for the records in the data page, then we create a B+ tree, as shown in the following figure:

There are several differences between secondary indexes and clustered indexes:

  • Ordering records and pages using the size of the record C2 column has three implications:
  1. The records in the page are arranged in a one-way linked list in order of size of c2 column values.
  2. Each page storing user records is also arranged in a bidirectional linked list according to the size of C2 columns recorded in the page
  3. Each page storing directory items is also arranged in a bidirectional linked list according to the size of C2 columns recorded in the page
  • Instead of storing a complete user record, the leaves of the B+ tree store c2 columns + primary keys
  • Instead of the primary key + page number in the directory entry record, c2 column + page number

If you want to find some records by the value of c2, you can use the B+ tree. For example, if you want to find some records by the value of C2, you can use the B+ tree as follows:

  1. Identify the catalog entry record page
  2. Use the catalog entry record page to determine which page the user record actually resides on
  3. Locate the specific record in the page where the actual user record is stored
  4. The records in the leaf node of the B+ tree store only c2 and primary key (C1) columns, so we must look up the complete user records in the cluster index again based on the primary key (C1) value

According to this sort of B + tree to c2 column size can only be sure we want to find the record of the primary key, so if we want to consult to complete according to the c2 column values user record, still need to clustering index check, again in this process is also called back to the table, that is, the value of the column according to the c2 query a complete user record need to 2 B + tree.

Union/composite index

Let’s say we want the B+ tree to be sorted by the size of the C2 and C3 columns. This contains two levels:

  1. Start by sorting individual records and pages into c2 columns
  2. In the case that the C2 columns of the records are the same, the C3 column is used for sorting

Note that a B+ tree sorted by the size of the C2 and C3 columns is called a joint index, which means something different from the statement that indexes the C2 and C3 columns, respectively.

  • Creating a joint index only creates a B+ tree like the one above.
  • Indexing the C2 and C3 columns creates two B+ trees sorted by the size of the C2 and C3 columns, respectively.

Index in MyISAM

InnoDB’s primary key index is data. The leaf node of the B+ tree already contains all the complete user records. MyISAM’s index scheme also uses B+ tree, but it stores indexes and data separately:

  • Records in a table are stored in the storage space in the order of insertion time. You can access a record quickly by address.
  • MyISAM creates a separate B+ tree index for the primary key of the table, except that instead of storing the complete user record in the leaf of the B+ tree, it stores a combination of primary key values and addresses.

Advantages and disadvantages of clustered indexes

Advantage: When querying an entry based on a primary key, there is no need to go back to the table, and the data record is under the primary key node

Disadvantages: frequent page splitting (row migration) in the event of irregular primary key data inserts (e.g., 1,3,2,8,7,6,4 inserts), which can cost performance.

The primary key value of the clustered index should be continuously increasing rather than random (do not use random strings and UUID), which can cause a lot of page splitting and page movement.

Leftmost matching principle

For example: federated indexes (C2, C3)

  • Sort records and pages by column C2 (left-most first)
  • In the case that the C2 columns of the record are the same, the C3 column is sorted

At this point, there are two orders in the B+ tree: C2 and C2,c3

At this point, a B+ tree contains two index trees :(c2), (c2,c3)

For example: federated indexes (C2, C3,C4)

  • Sort records and pages by column C2 (left-most first)
  • In the case that the C2 columns of the record are the same, the C3 column is sorted
  • In the case that the C3 columns of the records are the same, the C4 column is sorted

At this point, there are three orders in the B+ tree: C2 and C2, C3 and C2, C3 and C4

At this point, a B+ tree contains three index trees :(c2), (c2,c3), (c2,c3,c4)

When retrieving data such as (c2,c3,c4), B+ tree searches from left to right. For example, when retrieving data such as (5,2,8), B+ tree will compare c2 first to determine the direction to search for the next step. If c2 is the same, it will compare c3 and c4 successively to obtain the retrieved data

When there is no C2 data retrieval for (c3, C4), the B+ tree does not know which node to search in the first step, because C2 is the first comparison factor when the B+ tree is established, and it is necessary to search according to C2 first to know where to search next

When searching for (c2,c4) data, b+ tree can specify the search direction with C2, but the next field c3 is missing, so we have to find all data equal to C2, and then match the data with C4 (ICP feature).

There are two instructions for the use of the left-most prefix:

  • MySQL matches from left to right until it encounters a range query (>,<,between,like)

Example: Where a = 1 and b = 2 and C >3 and d = 4

If you create an index in (a, B, c, d) order, d will not use the index

If you want to create an index in order (a, B, D, c), you can use both

  • = and in in the WHERE condition can be out of order, mysql’s query optimizer will help you optimize it into a form that can be recognized by the index

Example: Create (a, B,c) indexes

where a=1 and b=2 and c=3

where b=2 and c=3 and a=1

Case Study:

The index length is 16 bytes, all used

Using 8 bytes, the extra: Using index condition is filtered in the index, which is the ICP feature

I’m using 4 bytes, so I don’t have to sort twice, because C1, C2, and C3 are all ordered, and c1 is equal to x,c2 and C3 are ordered

It took four bytes to sort it twice

It takes 4 bytes, and instead of doing a second sort, we’re going to do a table lookup of C5

I don’t have to double sort, because C2 is fixed, and it’s actually c1, C2,c3

Indexes cover

  • If an index contains (or overwrites) the values of all the fields that need to be queried, it is called a ‘overwrite index’. That is, you only need to scan the index without returning to the table.

Advantages:

  • No need to return to the table, reduce secondary query or reduce random query IO
  • Reduced the number of data pages traversed. Because the index entry is smaller than the size of the data row, the number of data pages consumed by the index leaf node is much smaller than the number of data pages consumed by the original table record.

Extra: Using index means index overwriting

Index selectivity

Since the index can speed up the query, so is not as long as the query statement to build an index?

Answer: No, because while indexes speed up queries, they come at a cost: indexes themselves take up storage space, and indexes increase the burden of inserting, deleting, and modifying records, so more indexes aren’t always better. Indexes are generally not recommended in two cases.

  • The first case is that the table records are relatively small, such as one thousand two thousand or even hundreds of records of the table, there is no need to create an index.
  • Another case where indexes are not recommended is when they are less selective. Selectivity of indexes is the ratio of non-repeating Index values (Cardinality) to the number of table records (#T) : Index Selectivity = Cardinality / #T

Summary: Prefix indexes have both index size and selectivity, but their disadvantages are that they cannot be used for ORDER BY and GROUP BY operations, and cannot be used for index overwriting because index values are incomplete.

Big data paging query and deferred association

If offset is set to N, the statement time increases as offset increases. Because mysql takes offset+ N rows, and then drops the previous offset and returns n rows, when offset is very large, it becomes very inefficient.

Deferred association: Use index override queries to return the desired primary key, and then associate the original table with the primary key to obtain the desired data.

Use deferred association to solve the bottleneck of big data page query

  1. The required primary key is first obtained through index overrides
Select emp_no from employees limit 100000,10; Alter table employees add index idx_EMP_NO (EMP_NO); -- Index value + primary key valueCopy the code
  1. The main table associated
Select * from employees m inner join (select emp_no from employees limit 100000,10) as s on m.ep_no = s.ep_noCopy the code

Indexing and sorting

Two ways of sorting

  • The result of a query is ordered by the Order of an index
  • Obtain data first and form a temporary table to make filesort (files may be stored on disk or in memory). You are not advised to use this table because of poor performance

Order by will use index in both cases:

  • The order BY statement uses the left-most prefix of the index
  • The combination of the WHERE condition and the Order BY condition satisfies the left-most index prefix

Principles for creating federated indexes

  • The higher the overall selectivity of a federated index, the better
  • When grouping and sorting, sort along an index and place a second sort
  • When grouping and sorting are not considered, the leading column of the joint index should be selected with high differentiation as far as possible. B+ trees are sorted by leading columns, and the most selective columns are placed first to filter out the desired records fastest
  • Union indexes should not be too long. Prefix indexes can be used to reduce the length of index entries