Ali Architect lecture interview: Database indexing

Why index

In general application system, the read and write ratio is about 10:1, and insert operation and general update operation rarely appear performance problems, encountered most, is also the most prone to problems, or some complex query operation, so the optimization of query statement is obviously a top priority.

If a table has one hundred million data, need to find one pair data, according to the normal logic, a one to match, the worst case need to match one hundred million times to get the results, with big O notation is O (n) is the worst time complexity, this is unacceptable, and this article one hundred million data obviously cannot one-time read into memory for the use of procedures, So this 100 million matches is 100 million I/O costs without cache optimization, and with current disk I/O capacity and CPU power, it can take months to get results. ** If the table is converted to a balanced tree structure (a very dense tree with many nodes), assuming that the tree has 10 levels, it only takes 10 IO overhead to find the data needed, and the speed increases exponentially, using the big O notation is O(log n), where n is the total tree and the base is the number of branches of the tree. The result is the number of levels of the tree. ** In other words, the number of searches is the logarithm of the total number of records based on the number of branches in the tree, which can be expressed by the formula:

Programmatically, this is math.log (100000000,10), 100000000 is the number of records,10 is the number of branches in the tree (in real life, there are far more than 10 branches), and the result is the number of lookups, down from hundreds to single digits. Therefore, the use of indexes can provide surprising performance improvements for database queries.

However, things are has two sides, can let a database query data rate index, and decrease the speed of writing data, the reason is very simple, because the balanced tree this structure must be maintained in a correct state, increases the censored data will change in a balanced tree each node index data content, destroy the tree structure, therefore, at every time of data change, The DBMS must rearrange the structure of the index tree to make sure it is correct, which incurs a significant performance cost, which is why indexes can have side effects on operations other than queries.

The index principle

Data structure B+ tree (multi-path balanced lookup tree)

Single index

** The index node is a single disk block, and memory loads one disk block data per I/O. ** Each node contains an index key value and a next pointer. Therefore, the smaller the key value in the index column is, the more data can be stored in each disk block. The more branches there are in the B+ tree, the flatter the B+ tree is.

B+ tree building process

Reference: www.jianshu.com/p/1775b4ff1…

Joint index

The difference from the single-column index is that the data in the joint index column is used as the value value, and the value comparison weight decreases from the first column to the next. The index key is a collection of federated index fields (in order).

The federated index BCD looks like the figure in the index tree. In the process of comparison, b is judged first, then C is judged and then D is judged. So there is the leftmost matching principle for federated indexes.

According to the joint index tree, find the data element under the index, namely the ID value, and then find the final data from the primary key index tree. (Non-clustered index)

Clustered index

The data for all nodes of the tree (except the bottom) is made up of data from the primary key field, which is the ID field that we normally specify as the primary key. The bottom part of the leaf node is the data in the actual table. The leaf node of the cluster index contains a complete record row. Suppose we execute an SQL statement:

select * from table where id = 1256;

First, locate the leaf node where the value 1256 is located according to the index, and then get the data row with ID equal to 1256 through the leaf node.

Nonclustered index

Non-clustered index leaf nodes only contain index columns and primary key values. To search for records through non-clustered index, first find the primary key, then go to the clustered index to find the corresponding record row. This process is called back table.

The difference between a non-clustered index and a clustered index is that the data to be searched can be found through a clustered index, while the primary key value corresponding to a record can be found through a non-clustered index, and then the required data can be found through the clustered index using the primary key value, as shown in the following figure:

No matter how the table is queried, the data is eventually located using the primary key by clustering the index, which is the only path to the real data.

Cover index

One exception, however, is that you can query the data you want without using a clustered index. This alternative is called a “covered index” query, or compound or multi-field index query. As stated above, when a field is indexed, the contents of the field are synchronized to the index. If you specify two fields for an index, the contents of both fields are synchronized to the index.

If an index contains (or overwrites) the values of all the fields to be queried, it is called a ‘overwrite index’.

Because non-clustered indexes do not contain complete data information, searching for complete data records needs to return to the table, so a query operation actually requires two index queries. If all index queries have to be found twice, then it will definitely cause a loss of efficiency, after all, there is no less than one search.

For example, the age index above is a non-clustered index. If I want to query the user ID by age, I execute the following statement:

select id from userinfo where age = 10;
Copy the code

Is it necessary to return to the table in this case? Because I only need the value of id, I can already get the id through the index age, if I go back to the table once, I will do useless operation. In fact, it is not necessary. In an indexed query, there is no need to go back to the table if the secondary index already has all the information about the query.

Index analysis and optimization

The index rules

The left matching

Mysql > select * from (a,b,c,d); mysql > select * from (a,b,c,d); mysql > select * from (a,b,c,d); If the index (a,b,d,c) can be used, the order of a,b,d can be arbitrarily adjusted. B = 2 If the index (a,b) is created, the index (a,b) cannot be matched.

Keep index fields as small as possible

The size of a disk block, that is, the size of a data page, is fixed. The smaller the space occupied by data items, the more keys there are, the more branches there are, and the lower the height of the tree. This is why each data item, the index field, should be as small as possible.

Select a highly differentiated column as the index column

Low differentiation results in fewer branches, higher tree height, more I/OS, and lower efficiency.

Theta and in can be out of order

For example, if a = 1 and b = 2 and c = 3, you can create (a,b,c) indexes in any order. Mysql’s query optimizer will help you optimize them into a form that the indexes recognize.

Expand indexes as much as possible, do not create new ones.

For example, if you want to add (a,b) to a table that already has an index of A, you only need to modify the original index.

An index pushdown

A new feature was introduced in MySql 5.6 called index condition pushdown, also known as index pushdown. So what does the index push down look like? In fact, the name “index condition push” indicates that this feature can be used to judge the conditions of the index field, filter the records that do not meet the conditions, and reduce the number of times to return to the table. 台湾国

Such as:

select * from person where name like 'A%' and age = 19; (Index fields are name, age)

If there is no index pushdown, all records whose names start with A will be queried according to the index, then ID will be queried, and then the corresponding ID records will be queried back to the table. Finally, age=19 will be judged, and the statement meeting the condition will be returned. Since there are two records that satisfy the start of A, the table is returned twice in this case.

In the index push case, InnoDB directly determines whether age=19 meets the index condition and filters out records that do not meet the condition, so only one entry is returned, that is, only one entry is returned. This improves performance.

The index application

The only index

A unique index is an index that is not allowed to have the same index value. The system checks for duplicate key values when creating the index, and checks for this each time a pair is updated or added. A primary key index is a unique index.

Joint index

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Ali Architect lecture interview: Database indexing

Why index