preface

  • What is the index? What are the pros and cons? Once asked in an interview, it can be a tricky question for a newbie.
  • This article will detail what an index is, its pros and cons, data structures, and other common knowledge.

What is an index

  • An index is a data structure that stores and sorts the values of a particular column in a table, so it is created above the columns of a table. Indexes speed up searches by narrowing the number of records that need to be queried in a table. If there is no index, the database has to do a full table scan. An index is like a table of contents in a book. It helps you find the content faster.

Advantages of indexes

  1. By creating unique indexes, you can ensure that each row of data in a database table is unique.
  2. It can greatly speed up data retrieval, avoid data scanning of the entire table, and greatly reduce the number of rows traversed to match, which is also the main reason for creating indexes.
  3. Can speed up joins between tables, especially in terms of achieving referential integrity of data.
  4. When using grouping and sorting clauses for data retrieval, the grouping and sorting time in queries can also be significantly reduced.
  5. By using indexes, you can improve the performance of the system by using optimization hiders during the query process.

Disadvantages of indexes

  1. Creating and maintaining indexes takes time, which increases with the volume of data.
  2. Indexes need to occupy physical space. In addition to the data table occupies data space, each index also needs to occupy a certain physical space. If the cluster index is to be established, the space will be larger.
  3. Indexes also need to be maintained dynamically as data is added, deleted, or modified to a table, which slows down data maintenance.

Which columns to index

  1. You can speed up searches on frequently searched columns;
  2. Enforces uniqueness of primary key columns and organizes the arrangement of data in the table;
  3. Columns that are often used to join are mainly foreign keys that speed up the join;
  4. Create indexes on columns that often need to be searched by range because the indexes are sorted and specify a contiguous range;
  5. Create an index on a column that often needs to be sorted because the index is already sorted so that queries can take advantage of the index sort to speed up the sort query time;
  6. Create indexes on columns that are often used in the WHERE clause to speed up the determination of conditions.

Which columns are not indexed?

  1. Indexes should not be created for columns that are rarely used or referenced in queries. This is because, since these columns are rarely used, having an index or no index does not improve query speed. On the contrary, due to the increase of indexes, the maintenance speed of the system is reduced and the space requirements are increased.
  2. You should also not add indexes to columns that have very few data values. This is because, because the values of these columns are very small, such as the gender column of the personnel table, in the result of the query, the data rows of the result set account for a large proportion of the data rows in the table, that is, a large proportion of the data rows that need to be searched in the table. Adding indexes does not significantly speed up retrieval.
  3. Columns defined as text, image, and bit data types should not be indexed. This is because the data volume of these columns is either quite large or has very few values.
  4. Indexes should not be created when the modification performance is much greater than the retrieval performance. This is because modification performance and retrieval performance are incompatible. When indexes are added, retrieval performance is improved, but modification performance is reduced. When indexes are reduced, modification performance is improved and retrieval performance is reduced. Therefore, indexes should not be created when the modification performance is much greater than the retrieval performance.

The data structure of the index

Common index data structures include B+Tree, Hash, FullText, and R-tree.

A Hash index

1. Overview:

In MySQL, only the Memory storage engine supports Hash indexing, which is the default index type for Memory tables. Hash indexes organize data indexes in the form of hash values. Therefore, the search efficiency is very high and can be located at one time, unlike B-/+Tree indexes that require multiple I/O operations from the root node to the leaf node.

2. Disadvantages of Hash indexes:

(1) Hash indexes can only meet equivalent queries, not range queries. This is because the size relationship of the data may change after the Hash algorithm. ② The Hash index cannot be sorted. Sorting is meaningless because the size relationship may change after the data is Hash.

③ Hash indexes cannot avoid table data scanning. When a Hash collision occurs, it is not enough to compare the Hash value. You need to compare the actual value to determine whether it meets the requirements.

④ The Hash index may not perform better than the B-tree index when a large number of Hash values are the same. Collisions may result in multiple scans of table data, resulting in poor overall performance. This problem can be resolved to some extent by using appropriate Hash algorithms.

⑤ Hash indexes cannot be queried using partial index keys. Because the Hash value is computed by combining multiple database columns when using composite indexes, it does not make sense to compute the Hash value for individual column data.

FullText indexing

1. Overview:

Full text index, currently only supported by MyISAM storage engine, and only supported by char, VARCHar, text type. It is used to replace the less efficient like fuzzy matching operation, and it can fully fuzzy match multiple fields at once through the full-text index of multi-field combination.

2. Storage structure:

B-tree is also used to store index data, but a specific algorithm is used. The field data is divided and then indexed (generally divided every 4 bytes). The index file stores the set of index strings before segmentation, and the index information after segmentation. The node corresponding to the Btree structure stores the word information after segmentation and its position in the index string set before segmentation.

– / B + Tree index

  • B+Tree is one of the most frequently used index data structures in mysql. It is the index type of Innodb and Myisam storage engine schema. Compared to Hash indexes, B+ trees are not as fast at finding a single record as Hash indexes, but are better for sorting and other operations.

1. B+Tree index

  • B+Tree: all index data of B+Tree are on leaf nodes, and sequential access Pointers are added. Each leaf node has Pointers to neighboring leaf nodes. This is to improve the interval query efficiency. For example, all data records whose key is from 18 to 49 can be queried. When 18 is found, all data nodes can be accessed at one time by traversing the sequence of nodes and Pointers, which greatly improves the interval query efficiency.
  • Greatly reduces the number of disk I/O reads.

– / B + Tree index:

  • File systems and database systems generally use B-/+Tree as the index structure: generally, the index itself is too large to be stored in memory, so the index is often stored on disk in the form of index files. Therefore, the disk I/O consumption of index lookups is several orders of magnitude higher than memory access, so the most important indicator of a data structure as an index is the progressive complexity of disk I/O operations during lookups. In other words, indexes should be structured to minimize the amount of disk I/O needed to access a lookup.

Local processing and disk prefetch

  • Due to the characteristics of the storage medium, the disk itself access is much slower than the main memory, coupled with the cost of mechanical movement, the disk access speed is often one hundred percent of the main memory, so in order to improve efficiency, to minimize the disk I/O. To achieve this goal, the disk does not read data strictly on demand. Instead, the disk prereads each time. Even if only one byte is required, the disk reads a certain length of data backward from this position into memory.
  • Because sequential disk reads are efficient (no seek time required, very little spin time required), preread can improve I/O efficiency for localized programs. The prefetch length is an integer multiple of a page. A page is a logical block of computer-managed memory. Hardware and operating systems often divide the main memory and disk storage areas into contiguous, equal-sized blocks. Each block is called a page (in many operating systems, the page size is usually 4k). When the data that the program is trying to read is not in main memory, a page-missing exception will be triggered. At this time, the system will send a disk read signal to the disk. The disk will find the starting location of the data and load one or more pages back into memory successively.

B-/+Tree index performance analysis

  • As mentioned above, disk I/O counts are generally used to evaluate index structures. Based on the definition of the B-tree, a maximum of H nodes need to be accessed in a search. The designers of the database system took advantage of disk prefetch by setting the size of a node equal to one page so that each node could be fully loaded with only one I/O. In order to achieve this, the following techniques are also needed to implement the B-tree in practice:
  • Each time a new node is created, space for a page is requested directly. This ensures that a node is physically stored on a page, and computer storage allocation is page-aligned, thus achieving only one I/O per node.
  • A search in b-tree requires at most H-1 I/ OS (root node resident memory), and the progressive complexity is O(h)=O(logdN). In general practice, the degree D is a very large number, usually more than 100, so h is very small (usually no more than 3).
  • In summary, using B-tree as index structure is very efficient.
  • In red-black trees, the H is significantly deeper. Because nodes (parent and child) that are logically close may be physically far away and cannot take advantage of locality, the I/O progressive complexity of red-black Tree is also O(h), and the efficiency of red-black Tree is significantly lower than that of B-tree.
  • In addition, B+Tree is more suitable for external storage index, because of the inner node out degree D. As can be seen from the above analysis, the larger D is, the better the index performance is, while the upper limit of out-degree depends on the size of key and data in the node. Since the data field is removed from the node in B+Tree, it can have a larger out-degree and better performance. (See point 3 of this section for details)

Comparison between B-tree and B+Tree

  • According to the structure of B-tree and B+Tree, we can find that B+Tree has more advantages than B Tree in file system or database system, for the following reasons:

1. The disk read and write cost of B+ tree is lower

The internal nodes of the B+ tree do not have Pointers to specific information about the keyword. So the internal nodes are smaller than the b-tree. If all the keywords of the same internal node are stored in the same disk block, then the disk block can contain more keywords. Read into memory at a time to find more keywords. The number of I/O reads and writes is relatively low.

2. The query efficiency of B+ tree is more stable

Because the internal node is not the node that ultimately points to the content of the file, it is only the index of the keyword in the leaf node. So any keyword lookup must take a path from root to leaf. The length of all keyword query paths is the same, resulting in the same query efficiency of each data.

3. B+ tree is more conducive to database scanning

B tree not only improves disk I/O performance, but also does not solve the problem of low efficiency of element traversal. However, B+ tree only needs to traverse leaf nodes to solve the scanning of all keyword information. Therefore, for range Query frequently used in the database, B+ tree has higher performance.

MySQL index implementation

  • In MySQL, index belongs to the concept of storage engine level. Different storage engines implement index in different ways. This part mainly discusses the index implementation of MyISAM and InnoDB storage engines.

MyISAM index implementation

1. Primary key index

MyISAM engine uses B+Tree as the index structure, and the data field of the leaf node stores the address of the data record. Here is the MyISAM index schematic:

 

  • Select Col1 as Primary key from MyISAM; select Col1 as Primary key from MyISAM; You can see that MyISAM’s index file only holds the address of the data record.

2. Secondary indexes

In MyISAM, there is no difference in structure between primary and Secondary keys, except that the primary index requires a unique key, while Secondary keys can be repeated. If we create a secondary index on Col2, the structure of this index is as follows:

 

  • Also a B+Tree, data field holds the address of the data record. Therefore, the index retrieval algorithm in MyISAM is to search the index according to THE B+Tree search algorithm. If the specified Key exists, the value of its data field is extracted, and then the corresponding data record is read with the value of the data field as the address.
  • MyISAM’s index is also called “non-clustered” to distinguish it from InnoDB’s clustered index.

InnoDB index implementation

  • InnoDB also uses B+Tree as an index structure, but the implementation is different.

1. Primary key index

The first big difference from MyISAM is that InnoDB’s data files are themselves index files. MyISAM index files and data files are separate, and the index file only holds the address of the data record. In InnoDB, the table data file itself is an index structure organized by B+Tree, and the data field of the leaf node of this Tree holds complete data records. The key of this index is the primary key of the table, so the InnoDB table data file itself is the primary index.

 

Above is a diagram of the main index (which is also a data file) of InnoDB. You can see that the leaf node contains the complete data record. This kind of index is called a clustered index. InnoDB requires tables to have primary keys (MyISAM does not have primary keys) because InnoDB data files themselves are aggregated by primary keys. If this is not explicitly specified, the MySQL system automatically selects a column that uniquely identifies the data record as the primary key. MySQL automatically generates an implied field for the InnoDB table as the primary key. This field is 6 bytes long and of type long integer.

2. Secondary indexes

The second difference from MyISAM index is that InnoDB’s secondary index data field stores the value of the primary key of the corresponding record rather than the address. In other words, all InnoDB secondary indexes refer to the primary key as the data field. For example, here is a secondary index defined on Col3:

 

  • The ASCII code of English characters is used as the comparison criterion. Clustered index implementations make searching by primary key very efficient, but secondary index searches require retrieving the index twice: first, retrieving the secondary index for the primary key, and then using the primary key to retrieve records from the primary index.
  • InnoDB tables are built based on clustered indexes. So InnoDB indexes provide very fast primary key lookup performance. However, its secondary index will also contain primary key columns, so if the primary key uses too long a field, it will cause the other secondary strings to become larger. If you want to define many indexes on a table, try to keep the primary key definition as small as possible. InnoDB does not compress indexes.