1. What is an index

An Index is a data structure that helps MySQL obtain data efficiently. You get the essence of an index: an index is a data structure.

For example, when you read a book, the first thing you see is the table of contents. It is very quick to search the contents of the book through the table of contents, as follows:

The table of contents of the books is arranged in order, there is chapter one, chapter two… It is a sequential data structure. It is a sequential structure.

But what if we have to go to the library to find a book? The best way to do this is with the following guidelines:

As can be seen from above, the whole index structure is an upside-down tree, in fact, it is a kind of data structure, this kind of data structure is better than the linear directory mentioned above to increase the speed of the query, or you may find a floor of the floor.

For example, the default primary key index is created when the primary key ID is created. When we use the ID to query the content, we first check the index library. After finding the index, we can quickly locate the specific location of the data according to the index.

So indexing is an important aspect of application design and development. Too many indexes can affect application performance, while too few indexes can affect query performance. Finding the right balance is critical to the performance of your application.

Note: The following description of indexes refers to indexes in InnoDB unless otherwise specified.

2. Classification of indexes

I don’t know if you’ve ever had a situation where every time you go on the back of an exam it looks like there are all sorts of indexes, and you can’t make sense of them, but we actually sort the indexes.

Data structure perspective

  • B + Tree index
  • A Hash index
  • Full-text index

Physical Storage Angle

  • Clustered index (clustered index)
  • Secondary index (secondary index, non-clustered index)

The logical point of view

  • The primary key index
  • The only index
  • Normal index
  • The prefix index

Actual service Angle

  • Single index
  • Combined index (composite index)

3. Index data structure

InnoDB storage engine supports the following common indexes: B+Tree index, full-text index, hash index, among which the key is B+Tree index,

B+Tree index is the traditional index, which is the most commonly used and most effective index in the current relational database system. The structure of a B+Tree index is similar to that of a binary Tree. It can quickly find data by Key Value. Note that B in B+Tree does not stand for binary, but for balance, because B+Tree evolved from the earliest balanced binary Tree, but B+Tree is not a binary Tree.

Knowledge supplement

3.1 Binary tree

An ordered tree in which nodes have no more than 2 children.

3.2 Binary search tree

Binary Sort Tree (BST) is a Binary Sort Tree. A binary search tree satisfies the following conditions:

  • First of all, it’s a binary tree
  • All values of the left subtree are less than those of the root node
  • All values of the right subtree are greater than those of the root node
  • The left and right subtrees satisfy both of these

3.3 Balanced binary trees

Through the search operation of binary search tree, it can be found that the search efficiency of a binary search tree depends on the height of the tree. If the height of the tree is the lowest, the search efficiency of the tree will become higher.

For example, the following binary tree is composed of right subtrees:

Now, you can see that the binary search tree actually degenerates into something like a linked list, and if we want to find 60 we have to go all the way to the last one, and the search time is O(n), so it’s not very efficient.

So what is a balanced binary tree?

Balanced binary tree is a binary sort tree, at the same time difference in height between the two subtrees around any node (or balance factor, referred to as “BF”) the absolute value of not more than 1, and about two subtrees also meet, balanced binary search trees query performance close to the binary search method, the time complexity is O (logn), so search performance is higher than the normal binary search trees.

So why not use a balanced binary Tree instead of B+Tree?

The lookup performance of balanced binary trees is relatively high, but not the highest, just close to the highest. The best performance requires the establishment of an optimal binary tree, but the establishment and maintenance of the optimal binary tree requires a lot of operations, so users generally only need to establish a balanced binary tree.

The query speed of balanced binary tree is very fast, but the cost of maintaining a balanced binary tree is very high. In general, one or more left and right rotations are required to maintain balance after insertion, update, and deletion of the tree. (The AVL adjustments can be made as I wrote earlier: AVL tree adjustments).

What’s the problem with balancing binary trees for a database?

Because each node of a binary tree has at most two child nodes, the height of the binary tree increases rapidly when the number of nodes is large, such as 1000 nodes, the height of the tree is almost 9 to 10 layers. We know that the database is persistent, data is to be read from the disk, the general mechanical disk can do at least 100 IO per second, the time of one IO is basically in 0.01 seconds, 1000 nodes in the search will need 0.1 seconds, if it is 10000 nodes, 100000 nodes? So for the database, in order to reduce the height of the Tree, put forward B+Tree data structure, before this, first look at B-tree.

3.4 B – Tree

In binary Tree, each node has a data item can be understood as the value of the node (key), and each node has at most two child nodes, allowing each node can have more items of data and child nodes, is more Tree, if this Tree still balanced Tree, you can call him a b-tree (B Tree, don’t call B minus Tree), The main characteristics of B-tree are as follows:

  1. The nodes of B tree store multiple elements, and each inner node has multiple branches.
  2. The elements in the node contain key values and data, and the key values in the node are arranged from largest to smallest. That is, data is stored at all nodes;
  3. Elements in the parent node do not appear in the child node;
  4. All leaf nodes are in the same layer, leaf nodes have the same depth, and there are no pointer connections between leaf nodes.

For example, a two-three-four tree is a kind of B tree.

A 2-3-4 tree is a B-tree with order (order, which can be understood as the branches of nodes) of 4, so it can also be called a 4-fork tree. It mainly has the following properties:

  1. A node with one key has two children, a node with two keys has three children, and a node with three keys has four children. The nodes corresponding to keys are called two nodes, three nodes, and four nodes, and that’s why it’s called a 2-3-4 tree.

  2. All leaf nodes are always in the same layer;

  3. The values of each node remain in ascending order from left to right. All keys in the subtree between the two keys must be larger than the left key of its parent node and smaller than the right key of its parent node. As follows:

Like this two-three-four tree:

For example, if there are 1-100 numbers, the binary tree can only be divided into two ranges, 0-50 and 51-100, while the B-tree can be divided into four ranges, 1-25, 25-50, 51-75, 76-100, and three quarters of the data can be filtered once. So why choose B+Tree instead of B-tree?

Let’s look at a B-tree:

As we know, database indexes are stored on disk, and when the data volume is large, the whole index cannot be loaded into memory, only each disk page (corresponding to the node of the index tree) can be loaded one by one. So we want to reduce the number of I/OS, and for trees, the number of I/OS is the height of the tree.

For example, in the figure above we want to check the value of the target element 8:

  1. In the first IO, the node 5 and 20 are loaded into memory, and the target element is compared with 5 and 20 to locate the node corresponding to the p2 pointer in the middle region.
  2. In the second IO, the node where 7 and 10 are located is loaded into memory, and the target element is compared with 7 and 10 to locate the node corresponding to p2 pointer in the middle region.
  3. On the third IO, load nodes 8 and 9 into memory, compare the target element with 8 and 9, and find the target element.

It can be seen that the comparison times of B-tree query are not less than that of binary Tree, especially when the number of nodes is very large. However, the memory comparison speed is very fast and the time consuming is negligible. Therefore, the query performance can be improved as long as the Tree height is low and THE I/O is low, which is one of the advantages of B-tree.

Let’s look at B+Tree.

3.5 B + Tree index

B+ tree is a classical data structure, like binary tree and balanced binary tree. B+ trees evolved from B-tree and indexed sequential access methods, but there is almost no use of B-trees in real life.

The definition of B+Tree can be found in many data structure books. It is very complicated. Let’s outline its definition:

B+Tree is a variant of B-tree. Leaf nodes on B+Tree store keywords and corresponding recorded addresses, and the layers above the leaf nodes serve as indexes. A B+Tree of order M is defined as follows:

  • Each node can have up to m elements;
  • Each node has at least (m/2) elements except the root node;
  • If the root node is not a leaf node, then it has at least 2 child nodes;
  • All the leaf nodes are in the same layer;
  • A non-leaf node with K child nodes has (K-1) elements, in ascending order;
  • All the elements in the left subtree of an element are smaller than it, and all the elements in the right subtree are greater than or equal to it;
  • Non-leaf nodes only store keywords and indexes pointing to the next child node. Records are only stored in leaf nodes.
  • Adjacent leaf nodes are connected by Pointers.

The concept is very complicated, but we can summarize the difference between B+Tree and B-tree:

  1. B+Tree non-leaf nodes do not store data;
  2. Bidirectional Pointers are used to connect leaf nodes of B+Tree, and the lowest leaf nodes form a bidirectional ordered linked list.

Look at this B+Tree:

If we want to find the target element 8, we must also do 3 IO operations (same as b-tree), this is the equivalent query case, what if we want to do a range query? For example, we want to query data between 8 and 25.

  1. The first step must be to find the element with the minimum value of 8, and then do three IO.
  2. Since each leaf node is connected by a pointer, it is an ordered linked list, so we only need to traverse the linked list composed of leaf nodes until we find the maximum value of 25.

How about comparing b-tree range lookups?

Since there is data in each node of b-tree, when 8 is found in leaf node, but the maximum value cannot be found, we need to go back and search for traversal, that is, we need to carry out middle-order traversal.

There are three ways to traverse a tree. Preorder, inorder, postorder.

So why use B+Tree instead of B-tree for MySQL index?

In general, affect the mysql to find the main or the number of disk I/o performance, while the b-tree whether leaf node or a leaf node, will save the data, this leads to the number of Pointers in the leaf node can save less, pointer to save a large amount of data, under the condition of less can only increase the height of the Tree, lead to IO operations more, The query performance deteriorates.

3.6 a Hash index

Hash indexes and B+ trees have the following differences:

  1. For a single data query, the Hash index query time is O(1) and the B+Tree index time is O(logN).
  2. Hash indexes are only suitable for equivalent queries, but cannot be used for range queries because they are randomly distributed.
  3. Hash indexes can’t be sorted using indexes, again because the data is randomly distributed;
  4. Hash indexes do not support the leftmost matching rule for federated indexes;
  5. If there are a large number of duplicate key values, the efficiency of hash index will be very low, because there are hash collisions, the larger the amount of data, the higher the probability of hash collisions.

3.7 Full-text Index

It is used primarily to find keywords in the text, not to compare them directly to values in the index. The fulltext index is very different from other indexes in that it is more like a search engine than a simple where statement parameter matching. The fulltext index is used with the match against operation, rather than the usual WHERE statement plus like. It can be used on create TABLE, ALTER TABLE, or CREATE Index, but only char, vARCHar, and TEXT columns can be used to create full-text indexes. It is worth mentioning that when there is a large amount of data, it is much faster to put the data into a table with no global index and then CREATE the fulltext index with CREATE Index than to CREATE the fulltext for a table and then write the data.

Now that you know the data structure of an index, let’s look at the differences between indexes.

4. Classify physical storage

4.1 Clustered Index (Clustered Index)

InnoDB indexes are also organized as B+ trees. The leaves of a B+ tree are used to store data, but what data? B+ tree is a data structure designed to quickly retrieve data. Why not put an index? But in the database table, data is the data we really need, index is only secondary data, even a table can not have a custom index. So how is data organized in InnoDB?

InnoDB uses a clustered index, that is, the primary key of a table is used to construct a B+Tree, and the row record data of the whole table is stored in the leaf node of the B+Tree, that is, the leaf node stores “primary key and current row data”. The index is the data, the data is the index. Since clustered indexes are built using the primary key of the table, each table can have only one clustered index.

B + Tree structure

The leaf node of the clustered index is the data page. In other words, the data page contains a complete row of records. So the advantages of clustered indexes are:

  1. The whole row of data can be obtained by clustering index.
  2. Sort and range lookups for primary keys are very fast.

What if we didn’t define a primary key? MySQL uses unique indexes. Without unique indexes, MySQL also creates an implied column RowID as the primary key, and then uses this primary key to create the clustered index.

4.2 Secondary Index (Secondary Index)

The clustered index is only useful if the search criteria are the primary key, because the data in a B+Tree is sorted by the primary key. What if we want to search by other columns? We typically create multiple indexes, called secondary indexes (secondary indexes, or non-clustered indexes), and leaf nodes store “primary key and current index column values.”

For secondary indexes, leaf nodes do not contain all the data for row records. In addition to containing key values, each leaf node contains a bookmark in the index row. The booktab is used to tell the InnoDB storage engine where to find the row data corresponding to the index. Thus, the bookmarks of InnoDB storage engine’s secondary indexes are clustered index keys for the corresponding row data.

Create index for name, then the leaf node stores the following data:

Back to the table 4.3

Since the secondary index node does not store all the data, how to query all the data by name?

The presence of secondary indexes does not affect the organization of data in clustered indexes, so there can be multiple secondary indexes on each table. When looking for data through the secondary index, InnoDB storage engine iterates through the secondary index and obtains the primary key to the primary key index through leaf level Pointers, and then finds a complete row record through the primary key index (clustered index). This process is also known as table back. To query a complete user record based on the value of the secondary index, we need to use two B+ trees: one secondary index, one clustered index, as follows:

4.4 Overwriting an Index

Obviously, retrieving the table requires additional B+Tree search, which inevitably increases the query time. So when do you not need to return to the table?

We know that the primary key and the current index column value are stored in the leaf node of the secondary index. If we only need to query the value in the leaf node, then there is no need to return to the table. This situation is called overwriting index or triggered index overwriting.

select id,name from account where name='H';
Copy the code

An overwrite index is not strictly an index structure, but can be understood as a means of optimization, such as creating a federated index.

5. Categorize logically

5.1 Primary Key Indexes

An index built on a primary key field. The values in the index column must be unique. No null values are allowed.

For example: order table (order number, name, price, etc.), the order number is uniquely identified, can be used as the primary key.

5.2 Unique Index

An index created on a UNIQUE field is a UNIQUE index. A table can have multiple UNIQUE indexes. The index column value is allowed to be NULL to avoid duplicate values in a data column of the same table.

Such as: id number and so on.

5.3 Common Indexes

A primary key index or a UNIQUE index requires a primary key or UNIQUE field. A common index does not require a primary key or UNIQUE field. A common index can be set for either index or key.

5.4 Prefix Index

A prefix index is an index that indexes the first few characters of a character type field or the first bytes of a binary type field, rather than the entire field.

For example, you can index the first five characters of the name field in the table above.

Prefix indexes can be created on char, vARCHar, text, binary, and Varbinary columns, which greatly reduces the storage space occupied by indexes and improves index query efficiency.

6. Actual use Angle

All of the above indexes are built on a single column, so they can be called single-column indexes. In practice, we often build indexes with two or more columns.

6.1 Joint Index

Combining multiple columns on a table to create an index is called a joint index or compound index.

For example, index(a,b) is the combination of a and B columns to form an index.

If a joint index is created, only one B+Tree will be created. If multiple columns are indexed separately, a B+Tree will be created for each column. For example, index(a) and index(B), two B+ trees will be created for each column.

Index (a,b);

  1. First, sort each record according to column A;
  2. In the case of the same a column of records, column B is used for sorting.

Therefore, for a jointly indexed B+Tree, the nodes are as follows:

According to the above statement, it can be seen that the values of a are ordered, i.e., 1,1,2,2,3,3, while the values of b are 1,2,1,4,1,2.

At the same time, we can also find that when a values are equal, b values are arranged in order, but this order is relative. So the left-most matching rule stops with range queries, and the remaining fields cannot be indexed. For example, if a = 1 and b=2, both a and B fields can be indexed, because B is relatively ordered when a value is determined, while a>1 and b=2, A field can match the index, but b value cannot, because a value is a range, in which B is unordered.

6.2 Left-most matching Principle

The above matching method is called the leftmost matching principle:

Leftmost takes precedence, starting from the leftmost any consecutive index will match. Matching stops when a range query (>, <, between, like) is encountered at the same time.

Suppose we create such a joint index index index(a,b,c), equivalent to creating a, A-b, a-b-C three indexes.

1, full value matching query

SELECT * FROM users WHERE a=1 AND b=3 AND c=1;
​
SELECT * FROM users WHERE b=3 AND a=1 AND c=1;
​
SELECT * FROM users WHERE c=1 AND a=1 AND b=3;
Copy the code

Analyze whether the index is used by executing the plan

EXPLAIN SELECT * FROM users WHERE c=1 AND a=1 AND b=3; .Copy the code

After testing, it is found that all indexes are used:

I thought it was the left-most principle? Why should c be indexed on the far left?

This is because Mysql has the query optimizer, which automatically optimizes the order of the query. It determines the most efficient order in which to correct the SQL statement before producing the actual execution plan.

2. Match the left column

Equivalent query:

① Follow the leftmost principle, use index.

select * from users where a = '1' 
​
select * from users where a = '1' and b = '2'  
​
select * from users where a = '1' and b = '2' and c = '3'
Copy the code

② Do not follow the left principle, no index, full table scan.

select * from users where  b = '2'; 
​
select * from users where  c = '3';
​
select * from users where  b = '1' and c = '3'; 
Copy the code

③ If it is discontinuous, only the index of A is used.

select * from users where a = '1' and c = '3';
Copy the code

3. Matching range value

Scope query status:

① Range query on leftmost column, using index.

select * from users where  a > 1 and a < 3;
Copy the code

Select * from B+Tree where 1< A <3 and B > 1; select * from B+Tree where 1< A <3 and B > 1;

select * from users where  a > 1 and a < 3 and b > 1;
Copy the code

③ The column on the left matches the value, and the range matches the other column, using the index.

select * from users where  a = 1 and b > 3;
Copy the code

7. Other indexes

7.1 Adaptive Hash Index

InnoDB storage engine in addition to the various indexes we mentioned earlier, there is an adaptive hash index. We know that the number of searches for B+Tree depends on the height of the B+Tree. In production environment, the height of the B+Tree is usually 3 or 4 levels, so it requires 3 or 4 IO queries.

Therefore, InnoDB storage engine monitors Index tables internally. If an Index is frequently used, it is considered as hot data, and then creates a hash Index internally, which is called Adaptive Hash Index (AHI). If the index is queried again next time, the hash algorithm is used to derive the address of the record and the data can be queried at one time, which is much more efficient than repeatedly querying nodes in the B+Tree index three or four times.

InnoDB storage engine uses a division hash function, and its collision mechanism uses a linked list. Note that adaptive hash indexes are only created and used by the database itself, and we cannot interfere with them.

The show engine Innodb status\G command is used to see the current use of adaptive hash indexes.

8. Indexes in MyISAM

We know that InnoDB index is data, that is, the leaf node of the B+Tree that gathers indexes already contains all the complete user records, while MyISAM index scheme uses Tree structure, but it stores indexes and data separately.

MyISAM stores the records in a table in a separate file, called a data file, in the order in which the records were inserted. The file is not divided into several data pages, just as many records are crammed into the file. We can quickly access a record by line number.

We can’t use binary look-up on the data because we didn’t intentionally sort the data by primary key size when inserting it.

Tables using the MyISAM storage engine store index information in a separate file called an index file. MyISAM creates a separate index for the primary key of the table, but instead of storing the complete user record in the leaf node of the index, it stores a combination of the primary key value and the row number. That is, first find the corresponding row number through the index, and then find the corresponding record through the row number.

This is completely different from InnoDB. In InnoDB storage engine, we only need to search the clustered index according to the primary key value to find the corresponding record, while in MyISAM, we need to perform a table back operation, which means that the index created in MyISAM is equivalent to all secondary indexes. If necessary, we can also create separate or joint indexes for other columns, similar to InnoDB indexes, but the corresponding column + row number is stored in the leaf node, and these indexes are all secondary indexes.

Index files are stored in. In MYI, the data files are stored in.myd.

9. Index creation policy

Proper creation and use of indexes are fundamental to high-performance queries. Now that you’ve seen the data structures associated with indexes, the various types of indexes and their relative advantages and disadvantages, how to create and use indexes correctly?

9.1 Type size of index columns

When we define the table structure, we need to explicitly specify the type of the column. For example, the integer types include TTNYINT, NEDUMNT, INT and BIGTNT, which occupy increasing storage space. The type size refers to the size of the data range represented by the type. If we want to index an integer column, we should keep the index column smaller if the integer range allows. For example, we should not use BIGINT if we can use INT and we should not use INT if we can use NEDIUMINT. This is because:

  • The smaller the data type, the faster the comparison at query time (CPU level)
  • The smaller the data type, the less storage space the index occupies, and the more records that can be placed on a data page, thus reducing disk I/O performance loss, which means that more data pages can be cached in memory, thus speeding up read and write efficiency.

This recommendation is especially true for primary keys of tables, because not only do clustered indexes store primary keys, but all secondary indexes store primary keys of a record. If the primary key is applicable to smaller data types, it means more storage space and more efficient IO.

Note: In InnoDB, the default B+Tree node size is 16KB, which means that if a Key is 8 bytes, then a node can hold about 1000 keys, which means that B+Tree can have 1000 forks. At the same time, InnoDB reads 16KB data in integer multiples every disk IO. In other words, InnoDB can make full use of the high-speed read and write feature of sequential DISK I/O on nodes.

9.2 Searching, Sorting, or Grouping

You can generally create indexes for columns that are used only for searching, sorting, or grouping.

That is, only columns that appear in the WHERE clause and join columns that appear in the join clause are indexed. Columns that appear in a query list are not necessarily indexed unless an overwrite index is required. Or create indexes for columns that appear in the ORDER BY or GROUP BY clause.

9.3 Index Selectivity

In the book High Performance MySql:

Index selectivity refers to the ratio of non-repeating index values (also known as cardinality) to the total number of records in the table (#T), ranging from 1/#T to 1.

The more selective an index is, the more efficient the query is, because a more selective index allows Mysql to filter out more rows in a lookup. Unique index selectivity is 1, which is the best index selectivity and the best performance.

Poor index selectivity is when the data in a column is very repetitive, such as a gender field, and generally there are only two possibilities, male or female. So when we query, even if we use this index, from the point of view of probability, we can still find half of the data.

Take this table for example:

So which of the above table is suitable for indexing, of course is the name field, because there is no duplication in the data, gender field is the least suitable for indexing, because the data duplication is very high.

How to calculate index selectivity? As the table above:

Sex selectivity:

Selection of names:

9.4 Creating a Joint Index

The most important aspect of creating a federated index is the order of the indexes. There is a rule of thumb for selecting the column order of an index: put the most frequently used columns and the most distinguished columns first.

When sorting and grouping are not a concern, it is usually good to put the most selective columns first. The purpose of the index is to optimize the lookup of the WHERE condition, and in this case, the index is designed to filter out the desired rows the fastest.

In addition, syndication is recommended if overwriting indexes are available.

PS: What is differentiation? My understanding is selectivity, but it involves prefix indexing.

9.5 Creating Prefix Indexes

As mentioned above, the prefix index can be set for the character type. Generally, the first few characters of the field will be set as the index, that is, the length of the index is not the longer the better.

For example: king, king two, king two eggs

  • If the length of the prefix index is 1, then each row of the index is king, there is no distinction at all;
  • The length of the index is 3 and the distinction is 100%.

Prefix index selectivity (distinction) calculation:

select count(distinct left(field,5))/count(*) from table;   
Copy the code

For some special fields, such as the URL in the above table, are already http://www. At the beginning, obviously the prefix index here needs to be more than 11, the index length is too long is not what we want, but is there a suffix index?

MySQL does not natively support reverse indexing, but it is possible to store a string reversed and build a prefix index based on it.

In addition, MySQL cannot use prefix indexes to do ORDER BY and GROUP BY, nor can it use prefix indexes to do override scans.

9.6 Redundant and Duplicate Indexes

MySQL allows multiple indexes to be created on the same column, intentionally or unintentionally. MySQL needs to maintain duplicate indexes separately, and the optimizer also needs to consider them individually when optimizing queries, which can affect performance.

Duplicate indexes are indexes of the same type that are created in the same order on the same columns. You should avoid creating duplicate indexes in this way and remove them as soon as they are discovered.

9.7 Deleting An Unused Index

In addition to redundant and duplicate indexes, there may be some indexes that the server never uses. Such an index is completely redundant and is recommended to be removed.

10. Summary

Just because an index can help us make our queries more efficient doesn’t mean it’s always good or that we need to use it all the time, so we need to be clear about its pros and cons.

  • Advantages: Can greatly speed up the retrieval of data

  • Disadvantages:

    • Time: It takes time to create and maintain indexes
    • Spatial: Indexes need to occupy physical space

When to create an index:

  1. The primary key automatically creates a unique index
  2. Fields frequently used as query criteria
  3. Associated fields in multi-table associated queries
  4. Sorted fields
  5. For frequently searched fields, you can create a joint index to cover the index
  6. Statistics or group fields in the query

When you do not need to create indexes:

  1. Too few table records
  2. Frequently add, delete, change and check the operation of the field
  3. Infrequently used fields in the WHERE condition

Reference 11.

  • B tree and B + tree
  • Discussion on the classification of MySQL index
  • Mysql left-most matching principle