What is an index?

Indexing is for faster query of data. For example, the table of contents of a book is the index of the contents of the book. Readers can quickly find the content they want in the table of contents and then find specific chapters according to the page number.

The same is true for a database. If a query statement uses an index, it first searches the index to obtain the physical address of the row where the data resides, and then accesses the data.

Advantages and disadvantages of indexes

Advantages: fast retrieval, reduce I/O times, speed up retrieval; Grouping and sorting by index can speed up grouping and sorting;

Disadvantages: Indexes are tables themselves and therefore take up storage space. Index maintenance and creation require time costs, which increase with the volume of data; Building an index reduces the efficiency of table modification operations (delete, add, modify) because the index table needs to be modified at the same time as the table is modified.

Classification of indexes

In MySQL, the common index types are: primary key index, unique index, normal index, full text index, composite index. The create syntax is:

Among them, composite index is also called multi-column index, the last example in the above code is to create a 3-column index. MySQL follows the “leftmost matching” principle when querying by index, that is, first querying by col1, then querying by COL2, and then querying by COL3.

If you skip a column and go straight to a later column, as in the following statement, you cannot use the index created above:

A little trick here is that if your previous column is a simple enumerated type, such as gender, you can “skip” the COL1 column by adding col1 in(MALE, FEMALE) to the WHERE statement and using the index above.

If a column is a long string (such as a UUID), it is recommended to use a prefix index that matches the first n characters. The value of n depends on your data. In high Performance MySQL, there is a trick: use the LEFT function to query from 1, and keep increasing the value of n until the number of rows in the query is close to the number of rows in the full column.

Index implementation principle

MySQL indexing is implemented by the storage engine. Due to different storage engines, there are different index types, such as BTree index, B+Tree index, hash index, full-text index, etc. Since we mainly introduce BTree index and B+Tree index, we usually use the most InnoDB engine is based on B+Tree index.

The current version of MySQL InnoDB engine already supports full text indexing. Since MySQL 5.7.6, MySQL has built-in Ngram full text parser to support Chinese, Japanese, and Korean word segmentation.

Let’s start with binary search trees

For those of you who know data structures, there’s a data structure called binary trees. Binary trees have different variants depending on their use, such as heaps, such as binary search trees.

In binary search trees, in order to prevent the extreme tree height from affecting the query efficiency, some balanced binary search trees are derived, the most typical of which are AVL and red-black trees.

But the binary tree in the large amount of data, the depth is too deep, not suitable for the database query, so the database uses the multi-fork tree.

BTree

BTree (also called B-tree) is a balanced search multi-fork Tree. The structure of BTree is shown as follows:

If the degree of the tree is 2d (d>1) and the height is h, then the BTree has the following properties:

  • Each leaf node has the same height, equal to h;
  • Each non-leaf node is composed of N-1 keys and N Pointers. Keys and Pointers are isolated from each other, and both ends of the node must be keys.
  • The leaf pointer is null;
  • The key of a non-leaf node is [key,data], where key represents the key as an index, and data is the data of other columns in the row where the key value resides.

In BTree, the index columns are stored sequentially, so it is good for looking up range data and ORDER BY operations.

B+Tree

B+Tree is a variant of BTree. The differences between B+Tree and BTree are as follows:

  • Non-leaf nodes in B+Tree do not store data, only store key values;
  • The leaf node of B+Tree has no pointer. All key values appear on the leaf node, and the key value stored by key corresponds to the physical address of data data.
  • Each non-leaf node of B+Tree is composed of N key values and N pointer points.

Structure:

Advantages of B+Tree compared with BTree:

In general, B+Tree is more suitable than BTree to implement the index structure of external memory, because the storage engine design experts cleverly take advantage of the external memory (disk) storage structure.

The minimum storage unit of a disk is sector. For a sector, the number of blocks in an operating system is an integer multiple. An operating system manages memory by page. Therefore, the nodes of the index structure are designed to be the size of a page, and then the “read ahead” principle of external memory is used to read the data of the entire node into memory each time, and then look up in memory.

It is known that the memory read speed is hundreds of times faster than the external memory read I/O speed, so the key to improve the search speed is to minimize the disk I/O, so it can be known that the more the number of keys in each node, the smaller the Tree height, the fewer I/O times, so generally speaking, B+Tree is faster than BTree. Since no data is stored in the non-leaf nodes of B+Tree, more keys can be stored.

B+Tree with sequential index

Generally, the B+Tree structure used in database system or file system is optimized on the basis of classical B+Tree, adding sequential access Pointers.

A B+Tree with sequential access Pointers is formed by adding a pointer to each leaf node of a B+Tree that points to adjacent leaves. The purpose of this optimization is to improve the performance of interval access. For example, if you want to query all data records with keys from 18 to 49, when 18 is found, you can access all data nodes at one time by traversing the sequence of nodes and Pointers, instead of having to query again from the beginning, which greatly improves the interval query efficiency.

Clustered index and non-clustered index

The two most common storage engines in MySQL are MyISAM and InnoDB, which implement non-clustered index and clustered index respectively.

Do you know why InnoDB non-primary key indexes are generally slower than primary key indexes? The answer is that InnoDB uses clustered indexes, primary indexes need to be queried once, and non-primary indexes need to be queried twice.

Why should a non-primary key index be queried twice? Let’s see what happens next.

Primary and secondary indexes

First, introduce the basic concepts. In the index classification, we can be divided into “primary index” and “secondary index” according to whether the key of the index is the primary key. The index established by the key value of the primary key is called “primary index”, and the other is called “secondary index”. Therefore, there can only be one primary index and many secondary indexes.

Why do I need a secondary index? As we explained earlier, queries that want to use indexes need to satisfy the leftmost matching rule. Sometimes our queries do not use primary key columns, so we need to create indexes on other columns, known as secondary indexes.

Non-clustered index

The primary index of a non-clustered index is almost the same as the secondary index, except that the primary index is not allowed to duplicate, null values are not allowed, and the keys of their leaf nodes store the physical address of the data corresponding to the key value.

Data tables with non-clustered indexes are stored separately from index tables. Data in a non-clustered index is stored according to the insertion order of the data. Therefore, non-clustered index is more suitable for single data query. Insert order is not affected by key values.

Clustering index

The leaf node of the primary index of the cluster index stores the data corresponding to the key value, while the leaf node of the secondary index stores the primary key value of the data corresponding to the key value. Therefore, the primary key value length should be as small as possible and the type should be as simple as possible.

The data for the clustered index is stored together with the primary key index.

The data in the cluster index is stored according to the order of the primary key. Therefore, it is suitable to search by the interval of the primary key index, which can have less disk I/O and speed up the query. However, for this reason, it is better to insert the clustered index in the same order as the primary key, otherwise it will frequently cause page splitting (an operation during BTree insertion), which seriously affects performance.

In InnoDB, if you only need to find the column of the index, try not to add other columns. This will improve query efficiency.

A diagram illustrates the difference between clustered and non-clustered indexes: