Writing in the front

This is just a reading note of “high performance MySQL”. If you feel water, please close the upper right corner. Thank you ~

Index types include B-tree, hash index, R-tree, and full-text index. This section summarizes b-tree and hash index.

B-Tree

Before we talk about indexes, we have to talk about B-tree.

A B-tree is a balanced search Tree. Its structure is similar to a binary Tree, except that each node can have more children.

B – Tree structure

B-tree, the figure here directly quotes the figure of the first article in the reference ~ if there is infringement, please send me a private letter ~

B + Tree structure

B+Tree is a variant of B-tree, which is also a balanced binary Tree. B+Tree is shown in the figure below. The figure here directly refers to the figure in the first article in reference

The definition of the b-tree

This section refers to the method of calculating derivatives

A B-tree T has the following properties,

  • Each node X has the following attributes:
    • X.n indicates the number of keywords stored in node x
    • X.n keywords are stored in non-descending order
    • X. leaf indicates whether X is a leaf node
  • The inner node contains x.n+1 Pointers to its children
  • Each leaf node has the same depth, the height of the tree
  • The number of keywords in each node has upper and lower bounds, which is represented by t, the minimum degree of b-tree.
    • Except for root and leaf nodes, the number of children in each node is
    • Except for root node and leaf node, the number of keywords of each node is
  • B+Tree Has the same number of children and keywords as non-leaf nodes

B-Tree vs B+Tree

Look at the picture, to sum up ~

B-tree,

  • Each node has corresponding data storage
  • Each keyword appears and only appears on one node
  • The search may end at a non-leaf node

B + Tree,

  • Non-leaf nodes do not store data
  • The leaf node adds a bidirectional linked list, so it can also be seen from the figure that the leaf node contains all the keywords

Advantages of B+Tree,

  • Because non-leaf nodes do not store actual data, more keywords are stored in the memory. In other words, the disk I/O information at a time is larger than that of b-trees
  • Leaf nodes add sequential linked list, more suitable for interval query

Why B-Tree

In fact, red black trees can also be used as indexes, why MySQL uses B/B+ Tree to implement indexes? Because MySQL is a disk-based database, index lookups inevitably involve disk IO, so the two key points of index performance are:

  • Number of DISK I/O operations
  • CPU computing speed

Therefore, when designing indexes, we should minimize the number of disk I/O, while MySQL will manage records according to the way of pages. When B-Tree creates nodes, it directly applies for the space of a page, so that a node is physically stored in a page, and computer storage allocation is aligned according to the page. So you have a node that only needs one IO. A search in a Tree requires h-1 IO at most, because the root node is resident in memory. B-tree can have many nodes, so H is very small, and the number of nodes is related to the size of the page. For the same data, h of red-black Tree is obviously much deeper, so B/B+Tree is usually used as the index structure.

The progressive time complexity of B-tree is, where N is the number of keywords, d is 1/2 of the output degree of internal nodes,

The progressive time complexity of red-black tree is,

So, it is obvious why MySQL uses B/B+Tree to implement indexes.

The operation of the b-tree

For b-tree operations, we don’t want to show the diagrams, because we have given the pseudo code and very detailed diagrams.

  • Insert keyword: first find the leaf node where the keyword should be inserted. If the node is full, that is, the number of keywords isThe middle keyword will be moved up to the parent node of the node. If the parent node is also full, repeat the above operation until the root node is reached. If the root node is reached, it means that the height has been increased by one layer
  • Delete keyword: delete keyword is more complicated than insert keyword, because delete not only leaf nodes, but also can be internal nodes, at this time we need to consider how to place the children of internal nodes, and ensure that the deleted B-tree meets the requirements, so delete keyword can be divided into several situations.
    • If k is in node X and x is a leaf, then simply delete k from x
    • If the number of keywords in node U1 before K in x is T, find the largest key in U1, delete the key in U1, and replace K with key in X
    • If the number of keywords in the node U1 before K in X is less than T, find the node U2 after K in X. If the number of keywords in U2 is T, find the smallest key in U2, delete the key in U2, and replace K with key in X
    • If the number of keywords of the two nodes before and after is t-1, combine U1 and U2, delete k in X, and point the pointer in X to the new node
    • If k is not in the current inner node, perform the above operation after finding the inner node where K is located

B-tree indexes

B-tree index is the most common index type. For example, InnoDB and MyISAM both use b-tree index structure. In fact, b-tree index structure is B+Tree. B+ Tree adds sequential access Pointers to leaf nodes to facilitate the range traversal of leaf nodes. InnoDB and MyISAM are two different indexes.

InnoDB

InnoDB support clustering index, index of clustering and the clustering index is not a strictly index, but a way of data storage, the name has something to do with its own storage, “cluster” said data line and the adjacent keys stored together, in short, is stored in the leaf node is actually the real data. InnoDB aggregates data by primary key, so a table can only have one clustered index and must have a primary key. If no primary key is defined and there is no non-empty index to replace it, InnoDB implicitly defines a primary key as the clustered index.

The secondary index of the cluster index stores the primary key value of the row rather than the pointer to the physical position of the row. Therefore, if a row is searched through the secondary index, it needs to find the leaf node of the secondary index to obtain the corresponding primary key value, and then search for the corresponding row. With InnoDB, an adaptive hash index can reduce such rework.

MyISAM

MyISAM supports non-clustered indexes, which differ from InnoDB in that leaf nodes store physical addresses pointing to rows, meaning that indexes and data are actually stored separately.

MyISAM uses prefix compression so that more indexes can be put into memory. By default, only strings can be compressed, but integers can also be compressed through parameter Settings. The compression method is to completely save the first IE in the index block, and then compare other values with the first value to get the number of bytes with the same prefix and the rest of the different suffix parts, and store this part. For example, if the first value in the index block is “Perform” and the second value is “performance”, the prefix of the second value is compressed to store something like “7, ance”. MyISAM uses similar prefix compression for line Pointers.

Compression can improve performance to some extent by making indexes use less space, but at the expense of slower operations. Because the compression prefix of each value depends on the previous value, MyISAM cannot use binary search in the index block, and can only scan from the beginning. The positive scan speed is good, and the reverse scan operation to find a row needs to scan half of the index block on average. For CPU-intensive applications, scanning requires random lookups, so compressing indexes makes index lookups much slower, while for I/O intensive applications, optimization for some queries is more pronounced.

The lock

InnoDB uses row locks and therefore supports transactions, while MyISAM uses table locks and does not support transactions.

Scope of application

B-tree index is suitable for interval query, because the leaf nodes after b-tree storage are themselves ordered, and THE B+ Tree structure also adds sequential Pointers to leaf nodes, which is more convenient for interval query.

The hash index

Hash indexes are implemented based on hash tables, and only queries that exactly match all columns of the index are valid. The method is to compute a Hash code for all index columns, which acts as an index and holds a pointer to each row in the hash table.

advantages

  • The index itself only stores hash code, so the structure is compact and lookup is fast

limit

  • The hash codes in the index are stored sequentially, but the data corresponding to the hash code is not sequential and therefore cannot be used for sorting
  • Partial index column match lookups are not supported because a hash index uses the entire contents of the index column to compute a hash code
  • Only equivalence comparison is supported, but range query is not supported
  • If hash conflicts are serious, all row Pointers in the linked list must be traversed
  • If hashing conflicts are severe, index maintenance operations can be costly

InnoDB’s adaptive hash index

First, note that adaptive hash indexing is insensitive to the user; it is a completely automatic, internal behavior that the user cannot control or configure, but can turn off.

When InnoDB notices that an index value is being used very frequently, it creates a hash index in memory based on the B-tree index, so that the B-tree can have some of the benefits of hash indexes, such as fast hash lookups.

Of course, if the storage engine does not support the hash index, users can also customize the hash index, so that the performance will be relatively high, the defect is that you need to maintain the hash value, if this method is adopted, do not use SHA1() and MD5() as the hash function, because these two are strong encryption functions, the design goal is to minimize the conflict. The resulting Hash code is a very long string that wastes a lot of space, and there are less conflicting requirements for indexes in a hash index.

Advantages of indexes

  • Using indexes reduces the amount of data that the server needs to scan
  • Using indexes helps the server avoid sorting and temporary tables
  • You can use indexes to turn random I/ OS into sequential I/ OS

However, indexes are not always the best solution. For very small tables, a simple full table scan is more efficient in most cases, for medium to large tables, indexes are more efficient, and for very large tables, partitioning is more efficient.

Reference

B tree and B plus tree

From B trees, B+ trees, B* trees to R trees

Introduction to Algorithms

High Performance MySQL