Locality principle
To understand the principle behind MySQL indexing, you first need to understand the principle of locality. Reading data from disk requires disk I/O, and as we all know, disk I/O is time-consuming, so improving read efficiency means reducing the number of I/O operations.

For this reason, a disk does not read only what was requested. Each read fetches a contiguous chunk of data starting at the requested position, even if only one byte was actually needed, and loads the whole chunk into memory. The extra data is not wasted: experiments have shown that when a piece of data is used, the data near it is very likely to be used soon afterwards. This is the principle of locality.

Because sequential disk reads are efficient (on a mechanical drive, once the head has sought to the data, reading the neighboring data requires no further seek and only a little rotational delay), read-ahead based on the locality principle improves I/O efficiency. Accordingly, the operating system reads from disk in units of pages. A page is usually 4 KB, and a single I/O typically reads an integer multiple of a page.
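MySQL applies the same idea at its own layer. For example, InnoDB's linear read-ahead prefetches a whole extent of pages into the buffer pool once enough pages of the current extent have been read sequentially; the trigger threshold is a server variable you can inspect:

```sql
-- InnoDB linear read-ahead: once this many pages of an extent have been
-- read sequentially, InnoDB prefetches the entire next extent.
-- The default value is 56.
SHOW VARIABLES LIKE 'innodb_read_ahead_threshold';
```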
Why not Hash indexes
Mention hashing, and anyone who has studied Java will immediately think of HashMap. Since JDK 1.8, HashMap has been backed by a hash table plus linked lists and red-black trees, which makes equality lookups such as the one below very fast:
```sql
select * from user where name='zhangsan';
```
Hash the string 'zhangsan', locate its bucket, then search the short linked list or red-black tree there: very efficient. But a HashMap lives in memory, while an index is itself large and cannot be held entirely in memory. More importantly, most database lookups are range queries, and a hash does not preserve the order of the data, so it cannot serve a range scan. Hash is therefore not a good fit for a general-purpose index.
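To make the limitation concrete, here is a minimal sketch (the table and index names are made up for illustration). MySQL's MEMORY engine supports genuine hash indexes, and the contrast between an equality predicate and a range predicate shows exactly where a hash falls down:

```sql
-- Hypothetical table: the MEMORY engine supports true HASH indexes.
CREATE TABLE user_mem (
    id   INT NOT NULL,
    name VARCHAR(50),
    PRIMARY KEY (id),
    INDEX idx_name (name) USING HASH
) ENGINE = MEMORY;

-- Equality lookup: the hash of 'zhangsan' locates the row directly.
SELECT * FROM user_mem WHERE name = 'zhangsan';

-- Range lookup: a hash does not preserve order, so idx_name cannot be
-- used here and the query degrades to a full scan.
SELECT * FROM user_mem WHERE name > 'zhangsan';
```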
Why aren't indexes stored in binary or red-black trees
Because an index is itself large, it is usually stored on disk as an index file. Index lookups therefore incur disk I/O, which is several orders of magnitude more expensive than a memory lookup, so the key metric for judging how well a data structure suits index storage is the number of disk I/O operations a lookup requires.

When a binary tree or red-black tree is stored on disk, nodes that are logically adjacent may be physically far apart, so the locality principle cannot help: reading each node costs one I/O. And because each node holds only a single key, even a modest amount of data makes the tree very deep, so a lookup needs many I/O operations. This data structure is therefore unsuitable.
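A rough worst-case estimate makes the depth problem concrete: a balanced binary tree over $N$ keys has height about $\log_2 N$, and in the worst case a lookup reads one node (one disk I/O) per level:

$$
\text{height} \approx \log_2 N, \qquad N = 10^6 \;\Rightarrow\; \log_2 10^6 \approx 20
$$

So a single lookup over a million keys can cost on the order of 20 disk I/Os, versus the two or three of the B+ tree discussed below.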
Why not use B-trees to store indexes
The figure above shows the structure of a B-tree. As with a binary or red-black tree, reading one B-tree node costs one disk I/O, but a single B-tree node holds multiple entries, and those entries can be stored in one contiguous disk area.

A typical read fetches four operating-system pages at a time, i.e. 16 KB of data. To take full advantage of locality, the B-tree node size is also set to 16 KB, and such a node is called a disk block.
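In MySQL this 16 KB unit is the InnoDB page size, which you can confirm directly (16384 bytes is the default; it is configurable when the instance is initialized):

```sql
-- InnoDB reads and writes in pages; the default page size is 16 KB.
SHOW VARIABLES LIKE 'innodb_page_size';
```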
However, B-tree nodes store not only pointers but also keys and data. As a result, one disk block cannot hold many pointers and keys, the tree becomes too deep, and lookups require too many I/O operations. So the plain B-tree is not suitable either.
B+ tree advantages
The B+ tree is an optimization of the B-tree. The optimizations are as follows:

- Non-leaf nodes store no data, only keys and pointers. A non-leaf node can therefore hold far more entries, which splits the data into more segments at each level, reduces the depth of the tree, and speeds up retrieval
- The leaf-node disk blocks are linked together in a doubly linked list, which matches the disk's read-ahead behavior and makes range and paging queries efficient
- If a key plus a pointer takes 10 B and a disk block is 16 KB, one disk block can hold about 1,600 keys and pointers; 1600 × 1600 entries can then be indexed in two I/Os, so millions of rows can be indexed in three I/Os, which is very efficient (the arithmetic is spelled out below)
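Writing out that fan-out arithmetic (with the 10 B entry size assumed in the last bullet):

$$
\frac{16\,\mathrm{KB}}{10\,\mathrm{B}} \approx 1600, \qquad
1600^2 = 2{,}560{,}000, \qquad
1600^3 \approx 4.1 \times 10^9
$$

Two levels of branching already address about 2.56 million entries, and a depth-3 tree reaches billions; since the root node usually stays cached in memory, a typical lookup really does cost only two or three disk I/Os.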
InnoDB index implementation
An InnoDB table's data file is itself an index file organized as a B+ tree: the key of that index is the table's primary key, and the leaf nodes contain the full rows, so this kind of index is called a clustered index.

Because of this, InnoDB requires every table to have a primary key. If a table has no primary key, InnoDB uses a unique column as the primary key; if there is no unique column either, it automatically generates a hidden field to serve as the primary key.

For a normal (secondary) index, the leaf nodes store the primary-key value rather than the whole record. A lookup through a secondary index therefore traverses two indexes: the first finds the record's primary-key value, and the second uses that value to find the row in the clustered index. This is also why the primary key should not be too large: every secondary index repeats it, so a large primary key makes every index large.
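A minimal sketch of both index types on one table (the table, column, and index names here are hypothetical):

```sql
-- Clustered index: the B+ tree on id stores the full rows in its leaves.
CREATE TABLE user (
    id   INT NOT NULL AUTO_INCREMENT,
    name VARCHAR(50),
    age  INT,
    PRIMARY KEY (id),        -- clustered index
    INDEX idx_name (name)    -- secondary index: leaves store (name, id)
) ENGINE = InnoDB;

-- Two traversals: idx_name yields the primary-key value, then the
-- clustered index is searched with that id to fetch the whole row.
SELECT * FROM user WHERE name = 'zhangsan';

-- One traversal: id is already present in idx_name's leaves, so no
-- second lookup against the clustered index is needed.
SELECT id FROM user WHERE name = 'zhangsan';
```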
MyISAM index implementation
The MyISAM storage engine also uses a B+ tree for its indexes, but its index file and data file are separate: the leaf nodes of the index store the address of the actual record rather than the record itself. This kind of index is called a non-clustered index.

In MyISAM there is no structural difference between the primary-key index and a normal index; the only difference is that the primary-key index requires unique keys, while a normal index allows duplicates.
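For comparison, a hypothetical MyISAM table; on disk its data and indexes live in separate files, and both indexes point into the data file by record address:

```sql
-- Hypothetical table: MyISAM keeps data and indexes in separate files
-- (user_my.MYD for rows, user_my.MYI for indexes); every index leaf
-- stores the address of the row inside the data file.
CREATE TABLE user_my (
    id   INT NOT NULL,
    name VARCHAR(50),
    PRIMARY KEY (id),      -- same structure as idx_name, only unique
    INDEX idx_name (name)
) ENGINE = MyISAM;
```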
Compare MyISAM with InnoDB
| | MyISAM | InnoDB |
| --- | --- | --- |
| Transactions | Not supported | Supported |
| Table locks | Supported | Supported |
| Row locks | Not supported | Supported |
| Foreign keys | Not supported | Supported |
| Full-text indexing | Supported | Supported (since 5.6) |
| Usage scenarios | Many SELECTs | Many INSERTs, UPDATEs, DELETEs |
MyISAM manages non-transactional tables and offers fast storage and full-text retrieval, so it suits read-heavy workloads with a large number of SELECTs. InnoDB supports transactions and row-level locking, so it performs better under concurrency with a large number of INSERT, UPDATE, and DELETE operations.
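To see which engine a table uses, or to switch it (table name hypothetical):

```sql
-- Inspect the engine of an existing table.
SHOW TABLE STATUS LIKE 'user';

-- List the engines this server supports.
SHOW ENGINES;

-- Convert a table to InnoDB (this rebuilds the table).
ALTER TABLE user ENGINE = InnoDB;
```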