Why is there a need for indexes?
In the real world, an index is like the table of contents of a book: it lets us jump to a specific page instead of flipping through everything, which cuts search time. The table of contents itself takes up some pages, but usually only a small fraction of the whole book. An index works the same way: it is a space-for-time structure, sacrificing some storage to speed up queries. Inside the computer, an index is just a data structure built from the two basic building blocks, arrays and linked lists, combined differently for different scenarios. So how do we choose an appropriate index structure for a database?
What data structure should an index use to organize data in a database scenario?
When choosing a data structure, we first need to identify the main usage scenarios and their characteristics, so let’s review the database operations we use most often:
- select cols from table where col = ?;
- select cols from table where col = ? order by col;
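For concreteness, here is a minimal sketch of the two patterns against a hypothetical t_user table (table and column names are made up for illustration):

```sql
-- Hypothetical table used only to illustrate the two query patterns.
CREATE TABLE t_user (
  id   BIGINT PRIMARY KEY,
  name VARCHAR(64),
  age  INT
);

-- Equality query: find the rows where a column equals a given value.
SELECT id, name FROM t_user WHERE name = 'alice';

-- Sorted query: filter on one column and order the result by another.
SELECT id, name FROM t_user WHERE age = 30 ORDER BY name;
```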
Database queries generally fall into two kinds: equality queries and queries that sort by some column. Both come up constantly in day-to-day database use. Given these two scenarios, what data structure should we choose?
- Hash table: a hash table is fast. Built on top of an array, it offers O(1) query, insert, and delete on average (ignoring the cost of heavy hash collisions). Its big drawback is that it cannot keep data ordered: since a hash table is unordered, sorting by a given field takes at least O(n). For that reason hash tables are rarely used as the main structure for organizing index data in a database; they appear only in specific scenarios. For example, InnoDB has an adaptive hash index, which automatically builds a hash index for hot rows once equality queries on them exceed a certain threshold.
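The adaptive hash index can be observed and toggled with standard InnoDB statements; a minimal sketch:

```sql
-- Is the adaptive hash index enabled? (ON by default in InnoDB.)
SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';

-- The "INSERT BUFFER AND ADAPTIVE HASH INDEX" section of the engine status
-- reports hash searches/s versus non-hash (B+Tree) searches/s.
SHOW ENGINE INNODB STATUS;

-- It can be switched off globally if it does not pay off for the workload.
SET GLOBAL innodb_adaptive_hash_index = OFF;
```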
Is there a data structure that handles both equality queries and sorted queries with good time complexity?
Of course: link the nodes together into a tree structure to organize the data. There are many tree implementations, such as the binary search tree, the AVL tree, and the red-black tree. A red-black tree performs query, insert, and delete in O(log N) average time. Although it is complicated to implement, it is widely used in many fields; for example, Java's TreeSet is backed by a red-black tree.
Does the database use a binary search tree or red-black tree to organize its data?
No. With only two children per node, a binary search tree or red-black tree becomes very deep as the data grows; a balanced binary tree holding 500 million keys, for example, is roughly 29 levels deep. Since MySQL stores its data in files on disk, every extra level on the search path can mean another disk I/O, so such deep trees query inefficiently. These structures are better suited to organizing data in memory.
What data structure does the database use as an index to store data?
B+Tree
Why B+Tree
According to the above description, the data structure we choose must have the following characteristics:
- The time complexity of equality queries and sorted queries should not be too high
- The depth of the tree should not be too high
The B-tree index is an N-ary tree. The more children each node has, the "fatter" and shallower the tree becomes: a single node grows larger, but the depth drops, and with it the number of I/O requests. Each node stores many entries and maps to a page on disk, which also gives better locality.
For example, if the primary key is a BIGINT, each key occupies 8 bytes. With a 4 KB page, a single node can hold on the order of 500 keys, so the fanout M is in the hundreds to roughly a thousand. A tree of just three levels can then cover about 500 × 1000 × 1000 = 500 million keys.
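A rough back-of-the-envelope check of the numbers quoted above (note that real InnoDB pages default to 16 KB and non-leaf entries also carry child-page pointers, so the actual fanout differs):

```latex
% Keys that fit in one node, with 8-byte BIGINT keys and a 4 KB page:
\frac{4096\ \text{B}}{8\ \text{B}} = 512 \approx 500

% Capacity of a three-level tree, taking a fanout of about 1000 for the two
% internal levels and about 500 entries per leaf page, as in the text:
1000 \times 1000 \times 500 = 5 \times 10^{8}\ \text{keys}
```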
Building on the B-tree, MySQL uses the B+Tree. The B+Tree is an upgrade of the B-tree with two main changes: non-leaf nodes store only keys, not row data, and the leaf nodes are chained together in a linked list.
- Non-leaf nodes store only keys and no data, so all row data sits on a single level (the leaves) and a range query never has to climb back up to a parent node, which reduces the number of page accesses
- Because the data lives in the leaf nodes and the B+Tree keeps keys ordered, a range query can simply walk the leaf-level linked list
- The trade-off is that leaf pages, which do hold the row data, fit fewer entries per page
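A sketch of a range scan that benefits from the ordered, linked leaf level (table and data are hypothetical):

```sql
-- Hypothetical table; the clustered B+Tree is built on the BIGINT primary key.
CREATE TABLE orders (
  id          BIGINT PRIMARY KEY,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
);

-- The start of the range is located once in the tree, then the rest is
-- satisfied by walking the leaf-level linked list; no parent pages need
-- to be revisited.
SELECT id, amount FROM orders WHERE id BETWEEN 1000 AND 2000;
```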
What is page splitting in a B+Tree? Why does it occur? How can it be avoided?
In a B+Tree the page size is fixed and data is allocated page by page. When a key must go into a page that is already full, the page has to split so the tree stays valid: part of its records move to a new page and the parent is updated, which costs extra disk I/O. Inserting keys in sequential order keeps splits rare, because new records always land at the tail and a page splits only when it fills up. With random (unordered) inserts, records land in the middle of already-full pages, so splits happen in many more cases.
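This is why a monotonically increasing primary key (for example AUTO_INCREMENT) is generally preferred over a random one; a hedged sketch of the two insert patterns:

```sql
-- Sequential keys: every new row lands at the right-hand edge of the
-- clustered B+Tree, so a page splits only when it is completely full.
CREATE TABLE event_seq (
  id      BIGINT AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(255)
);

-- Random keys (e.g. a UUID string as the primary key): inserts hit arbitrary
-- pages in the middle of the tree, so full pages must split far more often,
-- which means more page splits and more disk I/O.
CREATE TABLE event_rand (
  id      CHAR(36) PRIMARY KEY,  -- filled with UUID() by the application
  payload VARCHAR(255)
);
```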
Clustered indexes and secondary indexes
Clustered index
The data is stored with the index
- The InnoDB storage engine uses index-organized tables: the rows are stored in primary-key order, and each table's primary key forms a B+Tree whose leaf nodes hold the full row records. The leaf nodes of the clustered index are therefore also called data pages
- The data pages can be ordered by only one B+Tree, so each table has exactly one clustered index
- The logical order of the index entries matches the order in which the row data itself is stored
Secondary index
- All indexes other than the clustered index are secondary indexes. The difference is that a secondary index's leaf nodes do not contain the full row record
- Their leaf nodes do not point directly to the data pages; they store the primary-key value, which is then used to look the row up in the clustered index
- Each table can have multiple secondary (non-clustered) indexes. They cost extra disk space and memory, and having many of them slows down INSERT and UPDATE, since every index must be maintained
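A minimal sketch of the "back to table" lookup that a secondary index implies (table and column names are hypothetical):

```sql
-- Hypothetical table: the clustered index is the B+Tree on `id`;
-- idx_name is a secondary index whose leaf entries store (name, id).
CREATE TABLE employee (
  id   BIGINT PRIMARY KEY,
  name VARCHAR(64),
  dept VARCHAR(32),
  KEY idx_name (name)
);

-- The lookup goes through idx_name to find the matching primary-key values,
-- then back to the clustered index to fetch `dept` (the back-to-table step).
SELECT dept FROM employee WHERE name = 'alice';
```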
Joint (composite) indexes and covering indexes
Joint index: an index over multiple columns of a table, e.g. idx_col1_col2
Why did MySQL design the leftmost-prefix matching principle?
The core reason is that a joint index keeps its entries ordered primarily by the leftmost column. Since an index on (a, b) is sorted by a first, the values of b on their own are not in order; a query on b alone therefore cannot do an ordered search and can only scan.
Within idx_col1_col2, rows that share the same col1 value are ordered by col2.
Therefore a joint index can avoid an in-memory sort (no Using filesort in EXPLAIN when the ORDER BY follows the index order). A covering index reduces back-to-table lookups: the index already contains every field the query needs.
When the Extra column of EXPLAIN shows Using index, a covering index is being used.
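A sketch of both the leftmost-prefix rule and a covering index, using a hypothetical table t with the joint index idx_col1_col2:

```sql
-- Hypothetical table with a joint (composite) index on (col1, col2).
CREATE TABLE t (
  id   BIGINT PRIMARY KEY,
  col1 INT,
  col2 INT,
  col3 INT,
  KEY idx_col1_col2 (col1, col2)
);

-- Uses the index: the leftmost column col1 is constrained.
SELECT col1, col2 FROM t WHERE col1 = 1;
SELECT col1, col2 FROM t WHERE col1 = 1 AND col2 = 2;

-- Cannot use the index for an ordered search: col2 alone is not a leftmost
-- prefix, so the entries are not ordered by col2 and at best get scanned.
SELECT col1, col2 FROM t WHERE col2 = 2;

-- Covering index: every selected column lives in idx_col1_col2 (plus the
-- primary key stored in its leaves), and rows with col1 = 1 are already
-- ordered by col2, so EXPLAIN typically reports "Using index" and no
-- filesort or back-to-table lookup is needed.
EXPLAIN SELECT id, col2 FROM t WHERE col1 = 1 ORDER BY col2;
```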
Cardinality
Indexes have a maintenance cost: inserting one row means updating every index on the table, which lowers insert efficiency. So it is only worth indexing fields with few duplicate values, such as an ID or a name.
SHOW INDEX FROM table shows this estimate in the Cardinality column: the estimated number of distinct values in the index.
Cardinality is not recomputed on every insert or update; the database refreshes it by sampling:
- The update policy is to refresh once roughly 1/16 of the rows in the table have changed
- The sampling works by randomly selecting several leaf pages and counting the distinct values on each page
Running ANALYZE TABLE, SHOW TABLE STATUS, or SHOW INDEX also updates Cardinality.
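The relevant statements, run here against the hypothetical employee table from the earlier sketch:

```sql
-- The Cardinality column is the sampled estimate of distinct values
-- in each index.
SHOW INDEX FROM employee;

-- Force the statistics, and therefore Cardinality, to be recomputed now.
ANALYZE TABLE employee;

-- As noted above, this statement also refreshes the estimate as a side effect.
SHOW TABLE STATUS LIKE 'employee';
```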
What would make an index invalid?
- Wrapping the indexed field in a function, such as sum()
- A JOIN where the joined columns have different character sets (under the hood this implicitly applies a character-set conversion function to the column)
- Negative conditions, such as !=, NOT IN, NOT EXISTS
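Hedged examples of each case (tables, columns, and character sets are hypothetical):

```sql
-- Hypothetical tables; t_order.name uses utf8mb4, legacy_order.name uses utf8.
CREATE TABLE t_order (
  id         BIGINT PRIMARY KEY,
  name       VARCHAR(64) CHARACTER SET utf8mb4,
  created_at DATETIME,
  KEY idx_name (name),
  KEY idx_created_at (created_at)
);

CREATE TABLE legacy_order (
  id   BIGINT PRIMARY KEY,
  name VARCHAR(64) CHARACTER SET utf8,
  KEY idx_name (name)
);

-- 1. A function wrapped around the indexed column: the B+Tree order of
--    created_at can no longer be used, so the query degrades to a scan.
SELECT * FROM t_order WHERE DATE(created_at) = '2024-01-01';

-- 2. Joining on columns with different character sets: MySQL implicitly
--    applies a conversion function to one side of the comparison, and the
--    index on that converted column is skipped.
SELECT a.id FROM t_order a JOIN legacy_order b ON a.name = b.name;

-- 3. Negative conditions usually cannot do an ordered seek and fall back
--    to scanning.
SELECT * FROM t_order WHERE name != 'alice';
SELECT * FROM t_order WHERE id NOT IN (1, 2, 3);
```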