This is my third article on getting Started

I had an interview with the interviewer about the principle and use of MySQL index, which was completely broken. I was determined to summarize it, but I didn’t have time (actually it is lazy ha ha ha). Here is a brief summary.

What is the index?

A database index is a sorted data structure in a database management system (DBMS) that sorts the values of one or more columns in a database table to facilitate faster access to specific data in the database table. Colloquially speaking, we can compare the database index to the table of contents at the front of a book, it can speed up the database query speed.

Why do you need indexes?

Consider: How to find a book in a library? Imagine, if there is no other auxiliary means in the library, only a way to go to the dark, book by book, after 3 hours of continuous search, finally found the book you need to read, but, the day is dark.

In order to avoid such a thing, every library is equipped with a set of library management system, we want to look for books, to look for to the books in the system of house number, shelf number and the position of the books on the shelf a few layers, and then you can walk directly to get the book, you can very quickly find the books we need. This is the principle of indexing, which helps us to retrieve data quickly.

The general application system to the operation of the database, encountered the most, the most easy to problem is some complex query operation, when the data in the database is large, the search data will become very slow, so it will affect the efficiency of the entire application system, we can use the index to improve the efficiency of the database query.

B-tree and B + Tree

At present, most database systems and file systems use B-tree or its variant B+Tree as the index structure, I will talk about them here:

A b-tree is a multipath search Tree. Using the B-Tree structure can significantly reduce the intermediate process when locating records, thus speeding up access.

B-tree has the followingCharacteristics of the:

(1) Define that any non-leaf node has at most M child nodes, and M>2. ② The number of sons of the root node is [2, M]. ③ The number of sons of non-leaf nodes except root nodes is [M/2, M], rounded up. (4) Each node stores at least M/2-1 (take the upper integer) and at most m-1 keywords; (At least 2 keywords). ⑥ Number of keywords in non-leaf nodes = number of Pointers to sons -1. ⑦ Keywords of non-leaf nodes: K[1], K[2]… , K[m-1], and K[I] <= K[I +1]. Non-leaf pointer: P[1], P[2]… ,P[M] (where P[1] points to the subtree where the keyword is less than K[1],P[M] points to the subtree where the keyword is greater than K[m-1], and other P[I] points to the subtree where the keyword belongs to (K[i-1], K[I])). ⑨ All leaves are on the same layer.

Something about b treesfeatures:

① The keyword set is distributed in all nodes of the whole tree. ② Any keyword appears in only one node. ③ It is possible for a search to end at a non-leaf node. (4) Its search performance is equivalent to a binary search in the complete set of keywords.

B tree search: start from the root node, binary search is carried out for the keyword (ordered) sequence in the node. If the keyword is matched, the search will end. Otherwise, the search will enter the son node of the range of the keyword; This operation is repeated until the corresponding node pointer is empty or is already a leaf.

For example, in the following B tree, the process of finding element 43 is as follows:

Locate nodes 18 and 37 according to the root node pointer, read these nodes into the memory, and perform the first disk I/O. When 43 is greater than 37, locate p3.

Locate nodes 42 and 51 according to pointer p3, read these nodes into the memory, and perform the second disk I/O. At this time, find that 42<43<51. Locate p2.

According to the pointer p2, find the node where 43 and 46 are located, read this node into memory, and perform the third disk IO. At this time, we have found the element 43.

A total of three disk I/OS were performed during this process.

B+Tree B+Tree is a variant of B-tree. Compared with b-tree, B+Tree has the following differences:

(1) Non-leaf nodes with n subtrees contain n keywords (b-tree is n-1), that is, non-leaf nodes have the same number of subtree Pointers and keywords. These keywords do not hold data, only indexes, and all data is stored in leaf nodes (b-trees hold data for each keyword). (2) All leaf nodes contain the information of all keywords, and Pointers to records containing these keywords, and leaf nodes themselves are linked in order of the size of keywords. (3) All non-leaf nodes can be regarded as the index part of leaf nodes. (4) The same number will appear repeatedly in different nodes, and the largest element of the root node is the largest element of the B + tree.

The advantage of B+ trees as indexes over B trees

(1) Lower disk I/O cost of B+ tree: Non-leaf nodes of B+ tree have no Pointers to data rows, so the same disk capacity can store more nodes, and the corresponding I/O read and write times must be reduced. ② The query efficiency of B+ tree is more stable: because all the data is stored in the leaf node. The length of all keyword query paths is the same, and the query efficiency of each data is similar. ③ All leaf nodes form an ordered linked list, which is easier to find.

MyISAM and InnoDB, two common storage engines for MySQL, use B+ trees as their indexes, but they are different.

MyISAM uses B+ tree as its index structure. The data field of the leaf node holds the address of the data stored. The primary key index has a unique key value and the secondary key can be repeated. Therefore, the index retrieval algorithm in MyISAM firstly searches the index according to the B+Tree search algorithm. If the Key to be found exists, the value of its data field will be retrieved, and then the corresponding data records will be read with the value of the data field as the address. Therefore, the index file is separate from the data file, and the address of the data is retrieved from the index, not the data.

Innodb also uses B+ trees as its index structure, but the implementation is quite different from MyISAM. First, the tables themselves are organized in B+ trees, so the data files themselves are primary key index files. The leaf key is the primary key of the table, and the Data field is the complete data record, so the InnoDB table data file itself is the primary key index (this is why MyISAM allows no primary key, but InnoDB must have a primary key). The second difference with MyISAM indexes is that InnoDB’s secondary indexes have a data field that stores the primary key of the corresponding data record rather than the address. In other words, all InnoDB secondary indexes refer to the primary key as the data field.

The index type

Ordinary indexes :(indexes defined by the keyword KEY or INDEX) whose only job is to speed up access to data.

Unique index: A common index allows columns to contain duplicate values, while a unique index does not, but can be null. So the task is to ensure access speed and avoid data duplication.

Primary key index: An index created on a primary key field. A table has only one primary key index.

Composite index: Multiple column values form an index that is used exclusively for composite searches.

Full-text index: Segmentation of text content, search. (In MySQL5.6 and later, MyISAM and InnoDB storage engines support full-text indexing.)

Use strategies and advantages and disadvantages of indexes

Using the index

The primary key automatically creates a unique index. Columns that are frequently used as a query condition in a WHERE or ORDER BY statement are indexed. To create indexes for fields associated with other tables in the query, use the foreign key relationship. Often used for aggregate functions whose columns are indexed, such as min(), Max (), and so on.

Not using indexes

Do not index columns that are frequently added or deleted. A large number of duplicate columns are not indexed. If the table has too few records, do not create an index. Because there is too little data, it may take less time to query all the data than it takes to traverse the index, and the index may not be optimized.

Leftmost matching principle

When creating a joint index, the index column starts from the left, so the order of the index column is important. When creating an index, the most commonly used column should be placed on the left. When using select, the same is true.

advantages

The uniqueness of data for each row in a database table can be guaranteed. Can greatly speed up the data index. Connections between accelerometers and tables. Can significantly reduce the time to group and sort queries.

disadvantages

Index creation and index maintenance take time, which increases with the amount of data. Indexes need to occupy physical space. In addition to the data space occupied by data tables, each index needs to occupy a certain amount of physical space. If a cluster index needs to be built, the space required will be larger. When data in a table is added, deleted, or modified, indexes must be maintained dynamically, which reduces maintenance efficiency.

Verify that the index improves query performance

Create test table index_testAdd 100,000 pieces of data to the table using a Python script using the PyMSQL module

import pymysql


def main() :
    # Create Connection
    conn = pymysql.connect(host='localhost', 
                           port=3306,
                           database='db_test',
                           user='root',
                           password='deepin',
                           charset='utf8')
    # Get Cursor object
    cursor = conn.cursor()
    Insert data 100,000 times
    for i in range(100000):
        cursor.execute("insert into index_test values('haha-%d')" % i)
    # Submit data
    conn.commit()


if __name__ == "__main__":
    main()
Copy the code

Set profiling=1; Find the 10,000th data, ha-99999

select * from index_test where name='haha-99999';
Copy the code

Check the execution time:

show profiles;
Copy the code

Create index for name column of table index_test:

create index name_index on index_test(name(10));
Copy the code

Run the query again to check the execution time:You can see that the execution time is significantly reducedThat said, proper indexes can significantly improve query efficiency for certain fields.

conclusion

In fact, I don’t use indexes very much in my work, probably because I haven’t worked on projects with large volumes of data. Hahaha, but it’s all about understanding how and even using them. Don’t ask why.The interview will be asked! Finally, I would like to thank my girlfriend for her tolerance, understanding and support in work and life.