Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.
concept
Index is a data structure that can speed up data query. The most common index is B+ tree index or hash index, mainly from MYSQL database for example
Create index syntax
Syntax for creating a single INDEX: CREATE INDEX name
The index classification
Clustering index
Clustered indexes store both primary key values and row data. Clustered indexes are usually built according to the following principles:
- MySQL InnoDB uses the primary key as the cluster index column
- When no primary key is specified, columns with unique keys are automatically selected as the cluster index
- None of the above, generate a hidden cluster index
Secondary index
Secondary index, InnoDB USES way is to preserve the primary key in the leaf node, back and forth through the primary key table query to a complete record, so by the auxiliary index retrieval will appear the second query, so efficiency is not high clustering index, if the query to the query result is auxiliary index, then don’t return back table query. An index of a non-clustered index is a secondary index, which is used to optimize the query conditions other than the non-clustered index column. The secondary index can be divided into single-column index and joint index. Joint index has the principle of final matching
Index left-most matching principle
If you create a joint index, the order is the name, age, phone, so in the SQL query, if only use the name, age, MYSQL will use the name, age, index to look for the element, if the name, phone, MYSQL will only according to the name index to check, MYSQL does not select an index based on age,phone, or the left-most matching rule. MYSQL does not select an index based on age,phone, or name because MYSQL has a query optimizer that matches the most appropriate query selection.
The prefix index
Prefix index is aimed at the index column value we selected too long, will lead to the index tree height, so you can choose the front part of the large field as the index generation condition
B+tree Indicates the factors affecting the index tree height
- Index field long: prefix index
- Too many rows: partitioned tables, archived tables, distributed architectures
- Data type: Select the appropriate data type
Characteristics of each storage engine index
In MYSQL, the storage engine of MYISAM and InnoDB storage engine is B+ tree data structure by default, MYISAM does not support hash index, while InnoDB will automatically hash index according to the situation of the table, artificially cannot hash index
Underlying principles of indexing
B tree
When we talk about B+ trees, we’re going to talk about B trees, which is a balanced multi-fork search tree that can have at most m crosses (m >=2), called m-order B trees, as shown here
- Each node can have at most M subtrees
- The root node has at least two nodes
- Non-root and non-leaf nodes have at least m/2 subtrees
- Each path from the root to the leaf node has the same length.
- All nodes store a keyword
B + tree
The B+ tree is an enhanced version of the B tree
Non-leaf node keywords do not store data, they are used only for indexing, and all data is stored in the leaf node.
In MySQL, B+ trees add sequential access Pointers
As shown in the figure, a B+tree with sequential access Pointers is formed by adding a pointer to each leaf node of B+Tree pointing to adjacent leaf nodes. In this way, interval access performance is improved. For example, if all data records from 18 to 49 are queried, when 18 is found, all data nodes can be accessed at one time by traversing the sequence of nodes and Pointers
B trees are different from B+ trees
Non-leaf nodes of B+ tree do not store data, so disk pages can store more node elements, while non-leaf nodes of B+ tree store data. Leaf nodes must be found in B+ tree query, as long as B tree matches, so data may be found before traversing leaf nodes. For range lookup, B+ trees only need to traverse the linked list of leaf nodes, while B trees need to repeat the middle order traverse
Why use B+ trees as indexes
Data structures such as red-black trees can be used to implement indexes, but MySQL uses B+Tree as the index structure. In general, indexes themselves are too large to be stored in memory, so they are often stored on disk as index files. Index lookups incur disk I/ O costs. I/0 costs are much higher than memory accesses, so indexes should be structured to minimize disk I/ O accesses. For B+ tree, compared with red black tree, balanced binary tree, the whole tree is very low, a node can store more than one element, so it can improve disk I/O efficiency, can also improve the efficiency of range search
InnoDB index implementation
For clustered indexes, the key held by the node in the B+ tree is the primary key of the data table, and the leaf node holds the complete record of the data
For the secondary index, the data field of the secondary index stores the value of the primary key of the response record. Therefore, the query data of the secondary index references the primary key and searches for specific data in the cluster index through the primary key value
MYISAM index implementation
MYISAM uses a B+ tree as its index structure. Leaf nodes store the address of the data record, not the value, because MYISAM’s index and data four are in two separate files.
In MYISAM, there is no difference in structure between primary and secondary indexes, except that clustered indexes require keys to be unique, while secondary keys can be repeated.
The index optimization
Advantages and disadvantages of indexes
advantages
- Creating a unique index ensures the uniqueness of each row in a database table
- Can speed up the data query speed
- Speed up joins between tables
- Grouping and sorting time in queries can be reduced when grouping and sorting clauses are used for retrieval
disadvantages
- Creating and maintaining indexes takes time and increases with the volume of data
- Indexes take up physical memory
- When table data is added, deleted, or modified, indexes are also maintained dynamically, which slows down data maintenance
Principles for indexing
- Fields that are frequently queried should be indexed
- The foreign key relationship is used to index the fields associated with other tables in the query
- A single index or a combination of indexes, which are more cost-effective
- Sort fields in queries that are accessed more efficiently through indexes
- Fields that are frequently counted or grouped in a query should be indexed
- The table has too few records for indexing
- Fields that are frequently added or deleted are not suitable for indexing
- There is a large number of duplicate field data that is not suitable for indexing
Problems with a large number of indexes
- Each index occupies disk I/O space. The more indexes there are, the more disk space is required
- Refactoring and updating indexes is cumbersome when modifying tables. The more indexes there are, the more time it takes to update an index
- The optimizer burden is heavier, which may affect the choice of optimizer.
Index optimization specification
- If a function is used in SQL, the added index is invalid
- Do not perform calculations on index columns, which will invalidate the index
- If the result set of the query exceeds 25% of the total rows, the optimizer will evaluate that no index walk is necessary and will not run the index.
- Table contents change frequently, and statistics are inaccurate. Indexes may fail
- ! = <> is null,is not null
- If the SQL statement uses LIKE, the index can only be used for ‘33%’, but not for ‘%33%’ or ‘%33’.
- If the index column has a range query, the following fields will not move the index, so try to move the index column of the range query to the end
- When using “like” for fuzzy matching, it cannot be preceded by % such as %123%. If it is added, the index will not be moved. If % is added, the index will be moved normally.
- Or >in>union. If there is an index in the column before or and there is no index in the column after or, then the related index will not go to the index. For example, SELECT * FROM payment WHERE customer_id = 203 OR amount = 3.96; If customer_id has an index, and amount does not, then the index will not be indexed, because subsequent queries will definitely have a full table scan, so there is no need to create an index.
- Range conditions can be indexed by name, such as >=,between, and so on
- You are advised not to create an index column that is null. If a column is null, the expected result may be obtained.
Cover index
Select col1,col2,col3 from test where col1=1 and col2=2. In this way, MySQL can directly obtain data by traversing the index. Overwrite index is the index to be created, which is the data to be queried. This avoids the query back to the table, reduces many random I/O operations, and improves the query efficiency
Data is empty
An index is invalid if the indexed field data has a null value