1. The cost of indexing

Before we look at the cost of indexing, we need to revisit the index’s data structure, the B+ tree

As shown above, it is a b + tree, about the definition of b + tree can see b + tree, said only a few key here, light blue piece of what we call a disk block, you can see every disk block contains several data items (shown in blue) and pointer (shown in yellow), such as disk blocks 1 contains data item 17 and 35, contains a pointer (P1, P2, P3, P1 indicates a disk block smaller than 17, P2 indicates a disk block between 17 and 35, and P3 indicates a disk block larger than 35. Real data exists in leaf nodes 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, 99. Non-leaf nodes only store the real data and only the data items that guide the search direction, such as 17 and 35, which do not really exist in the data table.

1.1 B + tree search process

As shown in the figure, if data item 29 is to be searched, disk block 1 will be loaded from disk to memory first. At this time, I/O will occur. In memory, binary search is used to determine that 29 is between 17 and 35. By disk 1 piece of P2 pointer disk address the disk block 3 by disk loaded into memory, in the second IO, 29 between 26 and 30, locking disk blocks of the 3 P2 pointer, pointer loading disk blocks through eight into memory, in the third IO, binary search to find memory do 29 at the same time, the end of the query, a total of three IO. The reality is that a 3-tier B + tree can represent millions of data. If millions of data lookups take only three IO, the performance improvement is huge. If there is no index and each data item takes one IO, the total number of IO will be millions, which is obviously very, very expensive.

The cost in space

Each node of the index is a data page. By default, a page occupies 16KB of storage space. A large B+ tree consists of many data pages, which is a lot of storage space.

The cost in time

Each B+ tree index needs to be modified every time data in the table is added, deleted, or modified. And as we have said, each level of the B+ tree nodes are sorted in ascending order according to the value of the index column to form a bidirectional linked list. Both the records in the leaf node and the records in the inner node (that is, the user record and the directory entry record) form a one-way linked list in ascending order of the index column values. Add, delete, and modify operations may damage the order of nodes and records, so the storage engine needs extra time to perform some operations such as record shifting, page splitting, and page reclamation to maintain the order of nodes and records.

1.2 When should indexes be used?

Any use of an index has a price, so we should not blindly use the index

  • Primary key The primary key index is automatically created
  • Frequently used as a query condition in WHERE
  • The foreign key relationship is used to index the fields associated with other tables in the query
  • The sorted columns need to be indexed, and the sorted fields are accessed by the index, which greatly improves the sorting speed
  • Tendency of composite index under high concurrency;
  • Columns that are counted or grouped in a query or used for an aggregate function can be indexed. For example, column_1 needs to be indexed if Max (Column_1) or count(column_1) is used

1.3 When not to build indexes

  • Too few table records (full table scan is also fast and unnecessary)
  • Do not index fields that are frequently added or deleted
  • Columns with a large number of duplicate and evenly distributed data are not indexed

2. Multi-column index

In each of the above examples, the index is a single column

Multi-column index refers to a composite index, which combines multiple columns to create an index. Many people do not understand the multi-column index, and it is common to create an independent index for each column, or create a composite index in the wrong order.

Creating a single column index over multiple columns does not improve MySQL query performance in most cases. MySQL5.0 and newer versions introduce a strategy called index merge, which partially uses multiple single-column indexes on a table to locate a given row.

In earlier versions of MySQL, only one of these single-column indexes can be used. But in MySQL5.0 and later, queries can scan simultaneously with multiple single-column indexes and merge the results. The special new is mainly applied to the following three scenarios:

  1. SELECT * FROM TB1 WHERE c1= “XXX” OR c2=” XXX “; SELECT * FROM TB1 WHERE c1= “XXX” OR c2=” XXX “; SELECT * FROM TB1 WHERE c1= “XXX” OR c2=” XXX”
  2. SELECT * FROM TB1 WHERE c1= “XXX” AND c2= “XXX” SELECT * FROM TB1 WHERE c1= “XXX” AND c2= “XXX” Get the end result
  3. Evaluate the combination of AND AND OR statements

2.1 Index Applicable queries

  • All values match

If the column in our search criteria is the same as the index column, this is called a full-value match

  • Matches the column on the left

In fact, our search statement could have included only the left column instead of all the columns in the union index

  • Matching column prefix

For a string indexed column, we can also quickly locate the record by matching only its prefix

  • Matching range value

All records in the B + tree are sorted according to the value of the index column in ascending order, so it is very convenient for us to find records with the value of the index column in a certain range. However, when using the joint range search, it should be noted that if the range search is carried out for multiple columns at the same time, B+ tree indexes are only used for range look-up of the leftmost column of the index.

2.2 Precautions for Sorting by Using federated Indexes

One thing to note about joint indexes is that the ORDER of the columns following the ORDER BY clause must also be given in the ORDER of the index columns

2.3 Cases in which indexes cannot be used for sorting

2.3.1 ASC and DESC are mixed

For scenarios that use a federated index to sort, we require that the sorting order of the columns be consistent, that is, either the columns are sorted by ASC or DESC.

2.3.2 A row contains columns with different indexes

There are times when multiple columns used for sorting are not in the same index, and indexes cannot be used for sorting

2.3.3 Complex expressions are used for sorting

To use indexes for sorting operations, you must ensure that the index columns appear as individual columns, not as embellished ones

2.4 Multi-column index grouping

If no index, grouping process all need to implement in memory, and if the index, happened to the grouping sequence and we are in the B + tree index column order is consistent, and our B + tree index is in accordance with the indexed column sorted, it’s not just yao, so can be used directly to group B + tree index.

2.5 Cost of returning to the table

The query optimizer calculates statistics on the records in the table and then uses those statistics to calculate the number of records that need to be returned to the table based on the query conditions. The more records that need to be returned to the table, the more likely it is to use full table scan, and the less likely it is to use secondary index + return. In general, limiting the number of records retrieved by a query makes the optimizer more inclined to use the secondary index + back table approach, because the fewer records returned to the table, the higher the performance improvement

2.6 Overwriting an Index

To completely eliminate the performance penalty associated with back-table operations, we recommend that only indexed columns be included in the query list.

2.6.1 Overwriting indexes is a very useful tool that can greatly improve performance

Index entries are usually far less than the data line operations, index if need only read the index, the MySQL will greatly reduce the data traffic Because the index is stored in the order column values (at least within a single page), so the range queries for the I/O intensive than random each line of data read from the disk I/O is much less

Some storage engines, such as MyISAM, only cache indexes in memory and rely on the operating system to cache data, so accessing the data requires a system call. This can cause serious performance problems

Overwrite indexes are especially useful for InnoDB tables because of InnoDB’s clustered indexes. Since InnoDB’s secondary index holds the primary key of rows in leaf nodes, if the secondary key overwrites the query, it can face a secondary query of the primary key index

Note: Not all index can be overwritten index, overwrite index must store the value of the index column, index hash index, spatial index, full text index does not store the value of the index column, so MySQL can only use the B-tree index for overwrite index

3. How to select indexes

3.1 Create indexes only for columns used for searching, sorting, or grouping

Only columns that appear in the WHERE clause, join columns in the join clause, or columns that appear in the ORDER BY or GROUP BY clause are indexed. Columns that appear in the query list do not need to be indexed

3.2 Consider the cardinality of the column

The cardinality of a column refers to the number of numbers in a column that are not duplicated. For example, if a column contains values 2, 5, 8, 2, 5, 8, 2, 5, 8, even though there are nine records, the cardinality of that column is 3. In other words, in the case of a certain number of recorded rows, the larger the cardinality of the column, the more scattered the values in the column, and the smaller the cardinality of the column, the more concentrated the values in the column. The cardinality of this column is very important and directly affects whether we can use the index effectively. Assume that the base of a column is 1, that is, all the records in the column value is the same, that it is no use for the column index, because all the value will not be able to sort, to quickly find the ~ and if secondary indexes is established in a column of duplicate values much more special, so records also identified by using the secondary indexes can be operated to do back to the table, So the performance loss is even greater. So the conclusion is that it is better to index columns with a large cardinality than columns with a small cardinality.

3.3 Index column types should be as small as possible

If we want to index an integer column, we should try to index the column with a smaller type if the represented integer range allows. For example, we should not use BIGINT if we can use INT and we should not use INT if we can use MEDIUMINT. This is because:

The smaller the data type, the faster the comparison at query time (this is CPU level stuff)

The smaller the data type, the less storage space the index occupies, and the more records you can fit into a data page, which reduces disk I/O performance costs. This means that more data pages can be cached in memory, which increases read and write efficiency.

3.4 Index string value prefix

Only the first few characters of the string are indexed, that is, only the first few characters of the string are retained in the records of the secondary index. In this way, although the location of the record cannot be accurately located, the location of the corresponding prefix can be located. Then, the complete string value can be queried back to the table according to the primary key value of the record with the same prefix, and then the comparison is good. In this way, only the encoding of the first few characters of the string is stored in the B+ tree, which not only saves space, but also reduces the string comparison time, and probably solves the sorting problem

  • Advantages: This method greatly saves index space and improves index efficiency.
  • Disadvantages: Using prefixed indexes reduces index selectivity, and you cannot use prefixed indexes for ORDER BY and GROUP BY, and you cannot use prefixed indexes for overlay scanning

3.4.1 How to Select the Prefix Index Length

The rule for selecting a prefix index is to choose a length that is sufficient to ensure high selectivity. The selectivity of a prefix index should be close to the entire column of the index, but not too long.

We can determine the length of the cardinality by saying that the cardinality of the prefix should be close to the cardinality of the whole column, and we can find the appropriate length by taking characters of different lengths and comparing them to the whole column. Another way is to calculate the selectivity of the whole column, and use the selectivity of the prefix to approximate the selectivity of the whole column

3.5 Make index columns appear separately in comparison expressions

If the index column appears in a comparison expression not as a single column, but as an expression, or as a function call, the index is not needed

3.6 Insertion Sequence of primary keys

If the primary key values of the records we insert are increasing, then we move on to the next page after each page is filled. If the primary key values of the records we insert are increasing and decreasing, this becomes more troublesome. What if the data page is already full? We need to split the current page into two pages and move some records from this page to the newly created page