preface

In the last article we discussed the underlying principles of indexing and why we often use B+ trees as data structures for indexing. In this article we will look at the principles of index creation and ask when to create an index. See when is not a good time to create an index?

When is it appropriate to create an index

1. The value of a field has unique restrictions, such as the user name

The index itself can play a role of constraint, such as unique index, primary key index can play a unique constraint, so in our data table, if a field is unique, we can directly create a unique index, or primary key index.

2. Fields that are frequently used as criteria for WHERE queries, especially if the table is large

In the case of large data volumes, a field that is frequently used in the WHERE condition of an SQL query needs to be indexed. Creating plain indexes can greatly improve the efficiency of data queries.

3. The GROUP BY and ORDER BY columns are always required

Indexing is to store or retrieve data in a certain ORDER, so when we use GROUP BY to query data, or use ORDER BY to sort data, we need to index the grouped or sorted fields.

4. Create an index for the UPDATE and DELETE WHERE columns

When data is queried according to certain criteria and then updated or deleted, creating an index for the WHERE field can greatly improve efficiency. This works because we need to retrieve the record based on the WHERE condition column before updating or deleting it. The efficiency gains are even greater when non-indexed fields are updated, because non-indexed field updates do not require index maintenance.

However, in practical work, we also need to pay attention to balance, if too many indexes, when updating data, if it involves index update, it will cause a burden.

5. Indexes for DISTINCT fields need to be created

Sometimes, a field needs to be de-duplicated. If DISTINCT is used, creating an index for this field can also improve query efficiency.

6. Pay attention to the following principles when creating indexes for multiple table joins

  1. The number of join tables should not exceed 3, because each additional table is equivalent to adding a nested loop, which increases by an order of magnitude and seriously affects the efficiency of the query.
  2. Index WHERE conditions because WHERE is the filter for data conditions. Without WHERE conditional filtering can be very scary if the data volume is very large.
  3. Create an index for the field used for the join, and the field must be of the same type across multiple tables.

When is it appropriate to create an index

1. Frequently updated fields are not suitable for indexes.

Why is that? The reason why an index improves query efficiency is because it sorts the data when it inserts it. Therefore, if frequently updated fields are involved, this is not a good time to build indexes.

Here, take B+ tree index as an example (B+ tree index is the most common index data structure in the database, almost all relational databases support it). When B+ tree is inserted, it needs to sort the data. In fact, the cost of sorting is not much. That’s because sorting is a CPU operation (the CPU can process hundreds of millions of instructions in a current clock cycle). If face data sequence or reverse insert is good, this time B + tree index maintenance cost is lower, because the leaf node is from left to right to insert, such as on the ID insertion, time insert (if create indexes on the increase ID, create indexes on time column, the B + tree insert is usually more quickly).

However, if the inserted data is unordered, the B+ tree requires expensive operations such as splitting and rotating pages to maintain sorting, and even for solid-state drives, random writes are not as good as sequential writes, so disk performance suffers. For example, when a user registers with a nickname, most nicknames are randomly selected. If an index is created on a nickname, the insertion will be disordered, and the index maintenance will cost a lot.

2. The columns that are not used in the WHERE condition are not suitable for creating indexes.

Why is that? The value of an index is that it can be quickly located. If you cannot locate a field, you usually do not need to create an index.

3. A small amount of data is not suitable for index building.

Why is that? This is easy because there is no need to use an index for a small amount of data. Full table scans are already very fast, and even if you use an index, it won’t work.

4. Fields with poor uniqueness are not suitable for indexing.

Why is that? This is also easy, but you can imagine a very extreme case where if all the data in the table is the same (impossible), you can use indexes just as well.

We can use the gender field as an example. Gender can be either male or female, because you need to pay extra IO to access the index. All you get from the index is the address. If you are trying to fetch a few data from a million rows in a table, the IO overhead of accessing the index is also well worth using the index to quickly locate it. But if you’re taking 500,000 rows out of a million rows, like a gender field, then you’re accessing the index 500,000 times and the table 500,000 times, and you can imagine the overhead. No less than a full scan of the table. So even if you use the index, there’s no egg.

5. Columns that participate in column calculation are not suitable for index construction.

Why is that? This is because if there is a function operation on the index column, the index will not move, i.e. the index will be invalidated.

SELECT * FROM T WHERE FUN(Y) = XXX; SELECT * FROM T WHERE FUN(Y) = XXX; The index is not used at this point, because all rows in the column you want to query need to be computed.

conclusion

Today we have a brief overview of the principles of indexing. Remember: indexes can improve the efficiency of SQL queries, but indexes are not a panacea.