Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.

concept

Index is a data structure that can speed up data query. The most common index is B+ tree index or hash index, mainly from MYSQL database for example

Create index syntax

Syntax for creating a single INDEX: CREATE INDEX name

The index classification

Clustering index

Clustered indexes store both primary key values and row data. Clustered indexes are usually built according to the following principles:

  1. MySQL InnoDB uses the primary key as the cluster index column
  2. When no primary key is specified, columns with unique keys are automatically selected as the cluster index
  3. None of the above, generate a hidden cluster index

Secondary index

Secondary index, InnoDB USES way is to preserve the primary key in the leaf node, back and forth through the primary key table query to a complete record, so by the auxiliary index retrieval will appear the second query, so efficiency is not high clustering index, if the query to the query result is auxiliary index, then don’t return back table query. An index of a non-clustered index is a secondary index, which is used to optimize the query conditions other than the non-clustered index column. The secondary index can be divided into single-column index and joint index. Joint index has the principle of final matching

Index left-most matching principle

If you create a joint index, the order is the name, age, phone, so in the SQL query, if only use the name, age, MYSQL will use the name, age, index to look for the element, if the name, phone, MYSQL will only according to the name index to check, MYSQL does not select an index based on age,phone, or the left-most matching rule. MYSQL does not select an index based on age,phone, or name because MYSQL has a query optimizer that matches the most appropriate query selection.

The prefix index

Prefix index is aimed at the index column value we selected too long, will lead to the index tree height, so you can choose the front part of the large field as the index generation condition

B+tree Indicates the factors affecting the index tree height

  1. Index field long: prefix index
  2. Too many rows: partitioned tables, archived tables, distributed architectures
  3. Data type: Select the appropriate data type

Characteristics of each storage engine index

In MYSQL, the storage engine of MYISAM and InnoDB storage engine is B+ tree data structure by default, MYISAM does not support hash index, while InnoDB will automatically hash index according to the situation of the table, artificially cannot hash index

Underlying principles of indexing

B tree

When we talk about B+ trees, we’re going to talk about B trees, which is a balanced multi-fork search tree that can have at most m crosses (m >=2), called m-order B trees, as shown here

  • Each node can have at most M subtrees
  • The root node has at least two nodes
  • Non-root and non-leaf nodes have at least m/2 subtrees
  • Each path from the root to the leaf node has the same length.
  • All nodes store a keyword

B + tree

The B+ tree is an enhanced version of the B tree

Non-leaf node keywords do not store data, they are used only for indexing, and all data is stored in the leaf node.

In MySQL, B+ trees add sequential access Pointers

As shown in the figure, a B+tree with sequential access Pointers is formed by adding a pointer to each leaf node of B+Tree pointing to adjacent leaf nodes. In this way, interval access performance is improved. For example, if all data records from 18 to 49 are queried, when 18 is found, all data nodes can be accessed at one time by traversing the sequence of nodes and Pointers

B trees are different from B+ trees

Non-leaf nodes of B+ tree do not store data, so disk pages can store more node elements, while non-leaf nodes of B+ tree store data. Leaf nodes must be found in B+ tree query, as long as B tree matches, so data may be found before traversing leaf nodes. For range lookup, B+ trees only need to traverse the linked list of leaf nodes, while B trees need to repeat the middle order traverse

Why use B+ trees as indexes

Data structures such as red-black trees can be used to implement indexes, but MySQL uses B+Tree as the index structure. In general, indexes themselves are too large to be stored in memory, so they are often stored on disk as index files. Index lookups incur disk I/ O costs. I/0 costs are much higher than memory accesses, so indexes should be structured to minimize disk I/ O accesses. For B+ tree, compared with red black tree, balanced binary tree, the whole tree is very low, a node can store more than one element, so it can improve disk I/O efficiency, can also improve the efficiency of range search

InnoDB index implementation

For clustered indexes, the key held by the node in the B+ tree is the primary key of the data table, and the leaf node holds the complete record of the data

For the secondary index, the data field of the secondary index stores the value of the primary key of the response record. Therefore, the query data of the secondary index references the primary key and searches for specific data in the cluster index through the primary key value

MYISAM index implementation

MYISAM uses a B+ tree as its index structure. Leaf nodes store the address of the data record, not the value, because MYISAM’s index and data four are in two separate files.

In MYISAM, there is no difference in structure between primary and secondary indexes, except that clustered indexes require keys to be unique, while secondary keys can be repeated.

The index optimization

Advantages and disadvantages of indexes

advantages

  • Creating a unique index ensures the uniqueness of each row in a database table
  • Can speed up the data query speed
  • Speed up joins between tables
  • Grouping and sorting time in queries can be reduced when grouping and sorting clauses are used for retrieval

disadvantages

  • Creating and maintaining indexes takes time and increases with the volume of data
  • Indexes take up physical memory
  • When table data is added, deleted, or modified, indexes are also maintained dynamically, which slows down data maintenance

Principles for indexing

  1. Fields that are frequently queried should be indexed
  2. The foreign key relationship is used to index the fields associated with other tables in the query
  3. A single index or a combination of indexes, which are more cost-effective
  4. Sort fields in queries that are accessed more efficiently through indexes
  5. Fields that are frequently counted or grouped in a query should be indexed
  6. The table has too few records for indexing
  7. Fields that are frequently added or deleted are not suitable for indexing
  8. There is a large number of duplicate field data that is not suitable for indexing

Problems with a large number of indexes

  1. Each index occupies disk I/O space. The more indexes there are, the more disk space is required
  2. Refactoring and updating indexes is cumbersome when modifying tables. The more indexes there are, the more time it takes to update an index
  3. The optimizer burden is heavier, which may affect the choice of optimizer.

Index optimization specification

  • If a function is used in SQL, the added index is invalid
  • Do not perform calculations on index columns, which will invalidate the index
  • If the result set of the query exceeds 25% of the total rows, the optimizer will evaluate that no index walk is necessary and will not run the index.
  • Table contents change frequently, and statistics are inaccurate. Indexes may fail
  • ! = <> is null,is not null
  • If the SQL statement uses LIKE, the index can only be used for ‘33%’, but not for ‘%33%’ or ‘%33’.
  • If the index column has a range query, the following fields will not move the index, so try to move the index column of the range query to the end
  • When using “like” for fuzzy matching, it cannot be preceded by % such as %123%. If it is added, the index will not be moved. If % is added, the index will be moved normally.
  • Or >in>union. If there is an index in the column before or and there is no index in the column after or, then the related index will not go to the index. For example, SELECT * FROM payment WHERE customer_id = 203 OR amount = 3.96; If customer_id has an index, and amount does not, then the index will not be indexed, because subsequent queries will definitely have a full table scan, so there is no need to create an index.
  • Range conditions can be indexed by name, such as >=,between, and so on
  • You are advised not to create an index column that is null. If a column is null, the expected result may be obtained.

Cover index

Select col1,col2,col3 from test where col1=1 and col2=2. In this way, MySQL can directly obtain data by traversing the index. Overwrite index is the index to be created, which is the data to be queried. This avoids the query back to the table, reduces many random I/O operations, and improves the query efficiency

Data is empty

An index is invalid if the indexed field data has a null value