The easiest way to understand the concept of indexing is through an example, such as this one.

Suppose we need to design an online dating site with a user profile that has many columns, such as nationality, province, city, gender, age, eye color, and so on. The site must support searching user profiles in a variety of combinations. You also need support for sorting and limited returns based on the last time a user has been online and other user evaluations. How do we design indexes for this complex scenario?

Somewhat oddly, the first thing to do is to decide whether we must use index sort or whether it is acceptable to sort after retrieval. Index sorting limits how indexes and queries can be built. For example, we cannot use the same index in a query condition such as WHERE Age BETWEEN 18 AND 25 AND a scenario that sorts based on other user ratings. If MySQL uses one index in a range query, it cannot use another index in a sort. Assuming this is the most common WHERE condition, we also need to support most queries to be sorted.

Supports multiple types of filtering

Now we need to look at which columns have scattered values and which columns are most common in the WHERE condition. Data column values are more scattered and filter well. This is usually a good thing because it allows MySQL to efficiently filter out irrelevant rows.

The nationality column may not be very selective, but it is probably the most frequently queried. The gender column is usually not filtered, but it is often used in queries. Based on this understanding, we create a series of indexes for many different combinations of columns that start with (sex, country).

The conventional wisdom is that it is useless to build indexes for low-filter columns. So why do we have a non-filter column at the beginning of every index? There are two reasons for this. The first reason is that, as mentioned earlier, gender is used for almost every query. We even designed users to search for only one gender at a time. But more importantly, there’s not much downside to adding such columns, because we’re using a trick.

This is our trick: even if we don’t limit the gender query, we can make sure that the index works by adding AND sex IN(‘m’, ‘f’) to the WHERE statement. This does not filter out the rows we need, so it has the same effect as not including gender in the WHERE statement. However, because MySQL prefixes this column in the index of more columns, we need to include this column. This trick works IN such scenarios, but it doesn’t work if the column has many different values, because it leads to too many columns IN IN().

This example illustrates a basic principle: keep all options in your data table design. When you design an index, don’t just think about the type of index in that query, but also think about optimizing the query. When you need an index and find that other queries may be affected by it, you should first ask yourself if you can change the query. You should optimize both queries and indexes to find solutions. You don’t have to design perfect indexes.

Next, we need to consider WHERE conditions for other combinations we might use, and then consider which of these combinations would be slower without proper indexing. Indexes like (sex, country, age) are the obvious choice, but we might also need indexes like (sex, country, region, age) and (Sex, country, region, City, age).

This leads to a lot of indexing. If we can reuse the index, there won’t be too many combinations. We can remove (sex, country, age) and (sex, country, region, age) indexes using the IN() trick. If these are not specified in the search form, we can use a list of countries, or a list of regions, to ensure that the pre-index constraint is met (all countries, all regions, and all genders can be many combinations).

These indexes will satisfy most of the specified search queries, but how do we design for less common filters such as uploaded images (has_pictures), eye color (eye_color), hair color (hair_color), and education level? If these columns are not so filter and not so frequently used, we can just skip them and let MySQL scan for some additional rows. Accordingly, we can add them before the age column and use the IN() technique to describe them IN advance to handle cases where the columns are not specified.

You may notice that we put age at the end of the index. Why treat this column in particular? We are trying to ensure that MySQL uses as many indexed columns as possible. Since MySQL uses the leftmost matching rule until the first range query condition is encountered. All of the columns we mentioned can use equality conditions in the WHERE statement, but age is most likely a range query.

We can also change the range query to a list AND use IN queries, such as age IN(18, 19, 20, 21, 22, 23, 24, 25) instead of age BETWEEN 18 AND 25, but this is not always possible. A general rule of thumb is that we try to put the scope decision condition at the end of the index, so the optimizer uses the index as much as possible.

We mentioned that you can use the IN query with as many columns as possible to override index conditions that are not specified IN the WHERE condition. But you can go too far and cause new problems. Using more of these IN query lists results IN the optimizer needing to evaluate a large number of combinations, which IN turn can slow down the query. Consider the following query condition statement:

WHERE eye_color 	IN('brown'.'blue'.'hazel')
	AND hair_color	IN('black'.'red'.'blonde'.'brown')
  	AND sex 	IN('M'.'F')
Copy the code

The optimizer translates to 432=24 combinations, and the WHERE condition checks each case. 24 isn’t a big combination yet, but if it’s in the thousands. Older versions of MySQL can have more problems with too many IN queries. The query optimizer executes more slowly and consumes a lot of memory. Newer versions of MySQL will stop evaluating when there are too many combinations, but this will affect MySQL’s use of indexes.

Avoid multiple range queries

Let’s assume that we have a last_online column and we need to display the users who have been online in the last week:

WHERE eye_color		IN('brown'.'blue'.'hazel')
	AND hair_color	IN('black'.'red'.'blonde'.'brown')
  	AND sex 	IN('M'.'F')
 	AND last_online	 > DATE_SUB(NOW(), INTERVAL 7 DAY)
 	AND age		BETWEEN 18 AND 25 
Copy the code

The problem with this query is that it has two scope queries. MySQL can use the last_online or age condition, but not both. If the last_online constraint appears without the age constraint, or if last_online is more selective than age, we might want to add another set of indexes and place last_online at the end. But what if we can’t convert age to IN and we also want to be able to speed up the query when we have both last_OINline and AGE range queries? At this point, we don’t have a direct way. But we can convert a range to an equality comparison. To do so, we add a pre-computed active column that we maintain periodically. If the user logs in, we mark it as 1, and if there is no consecutive login within 7 days, we mark it as 0 again.

This method allows MySQL to use indexes such as (active, sex, country, age). This column may not be that precise, but this type of query may not require a great deal of precision. If we need a precise query, we can keep last_Online in the WHERE condition without adding the index. This technique is similar to the case with URL lookups. This condition does not use any indexes because it is unlikely to filter out rows that the index hits. Adding indexes does not necessarily make queries profitable.

Now, you can see the pattern: if the user wants to find both active and inactive results, we can use the IN query. We have added many of these list queries. One workaround is to index the queries separately from each other. For example, we could use the following index: (Active, sex, country, age), (Active, country, age), (sex, country, age) and (Country, age). While such indexes may be a better choice for a particular query, the negative effects of maintaining these combinations, and the extra storage required for the combination, can result in a weak strategy.

This is a case where an optimizer change can really affect index optimization. If index scans are truly discarded IN a future MySQL release, it may be possible to use multiple range conditions on an index, and we no longer need to solve such problems with IN queries.

Optimization of sorting

The final issue is sorting. Results with small amounts of data are fast with filesort, but what about millions of rows? For example, if only gender is specified in the WHERE condition.

For such low-filter scenarios, we can add specific indexes for sorting. For example, an index (sex, rating) could be used for the following query:

SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 10;
Copy the code

This query has both sort and LIMIT clauses and can be slow without an index. Even with indexes, this query can be slow when the user interface is paginated and the page number is not near the starting position. The following example of ORDER BY and LIMIT makes for a bad combination:

SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000.10;
Copy the code

Even with indexes, such queries can cause serious problems. This is because very high offsets result in a lot of time being spent scanning a lot of data that is discarded. Anti-paradigm design, pre-computation, and caching may be able to solve the problem of such queries. A better strategy is to limit the number of pages a user can query. This is unlikely to detract from the user experience, as no one will actually care about the search results on page 10,000.

Another good strategy is to use inferential federated queries, which is how we get rows of data after using an override index to get primary key columns. You can combine all the columns you need to fetch, which reduces MySQL’s effort to collect data that needs to be discarded. Here’s an example:

SELECT <cols> FROM profiles INNER JOIN (
  SELECT <primary key cols> FROM profiles
  WHERE x.sex='M' ORDER BY rating LIMIT 100000.10
AS x USING(<primary key cols>);
Copy the code