The query that optimizes the COUNT function ranks among the top 10 most misunderstood topics in MySQL. You can search the web for more misconceptions about COUNT optimizations. Before tuning, it’s important to understand what COUNT does.
What does the COUNT function do?
COUNT is a specialized function, usually in two different ways: counting values and rows of data. A value is a non-null expression (NULL means a value is missing). If we specify a column name or other expression in the argument to COUNT, the COUNT function counts the number of times that expression has a value. This confuses a lot of people, in part because the concepts of value and NULL are vague.
Another form of COUNT is simply to COUNT the number of rows of data in the result set. This is how MySQL evaluates if it knows that the expression of the COUNT function argument cannot be NULL. The most typical example is COUNT(*), which you might expect to be an alternative to expanding all columns of a table. In fact, it ignores all columns and simply counts the number of rows of data.
A common mistake we make is to specify the column name in the COUNT argument and then think we are counting the rows. If you want to get the number of rows in the result, you should always use COUNT(*), which makes your query more explicit and avoids performance problems.
The magic of MyISAM
A common misconception is that MyISAM is very fast for COUNT queries. MyISAM’s COUNT query is fast, but the fast scenarios are limited: COUNT() queries with no WHERE condition can achieve this effect, which is rare. The reason MySQL can optimize this statement is that the storage engine always knows the exact number of rows in the data table. If MySQL knows that a column col cannot be NULL, it will also optimize by converting COUNT(col) to COUNT().
There is nothing “magical” about MyISAM having WHERE conditions in COUNT queries, or other counts of values. It can be faster or slower than other storage engines, depending on many other factors.
Simple COUNT optimization
When you want low index coverage of rows and need to COUNT all rows, you can use MyISAM’s COUNT(*) to optimize. The following example uses a standard world database to show how much optimization is needed to find the number of cities with ids greater than 5. You might write an SQL statement like this:
SELECT COUNT(*) FROM world.City WHERE ID > 5;
Copy the code
If you check the query with SHOW STATUS, 4079 rows are scanned. If you use a negative conditional query and subtract the number of cities with ids less than or equal to 5, you can reduce the scan result to 5 rows.
SELECT (SELECT COUNT(*) FROM world.City) - COUNT(*) FROM world.City WHERE ID < = 5;
Copy the code
This query reads fewer rows because the query was converted to a constant during the query optimization phase, as you can see using EXPLAIN:
id | select_type | table | rows | Extra |
---|---|---|---|---|
1 | PRIMARY | City | 6 | Using where; Using index |
2 | SUBQUERY | NULL | NULL | Select tables optimized way |
A common problem is how to perform a query on the number of different values of the same column in a single query statement. For example, you want to find the number of different colors in a query. You can’t use things like SELECT COUNT(color =’ blue’ OR color=’red’) FROM items to complete the query, because it doesn’t distinguish the corresponding quantities of different colors. You can’t put colors in a WHERE condition, such as SELECT COUNT(*) FROM items WHERE color = ‘blue’ AND color = ‘red’. Since colors are mutually exclusive, you can use the following method to solve this problem:
SELECT SUM(IF(color = 'blue'.1.0)) AS blue,
SUM(IF(color = 'red'.1.0)) as red FROM items;
Copy the code
A workaround is not to use SUM, but to use COUNT, but to ensure that the decision of an expression with no value is false:
SELECT COUNT(color = 'blue' OR NULL) as blue,
COUNT(color = 'red' OR NULL) as red FROM items;
Copy the code
Use approximations
Sometimes an exact amount is not required, so approximations can be used. The estimated number of rows given in the EXPLAIN optimizer usually satisfies this scenario, in which EXPLAIN can be used instead of the real query.
In many cases, an accurate number is much less efficient than an approximation. A client once asked to count the number of active users on their site. User numbers are cached and updated every 30 minutes. This is inherently inaccurate, so using estimates is acceptable. This query uses multiple WHERE conditions to ensure that inactive users or default users (with special ids) are not counted. Removing these conditions and slightly modifying the count operation can be more efficient. A further optimization is to remove unnecessary DISTINCT operations, thereby removing a Filesort operation. The optimized queries are faster and return nearly accurate results.
More complex optimizations
In general, COUNT queries are difficult to optimize because they usually COUNT a lot of rows (accessing a lot of data), and the alternative in MySQL is to use overridden indexes. If that’s not enough, the entire system application architecture may need to be adjusted. Consider statistics tables, for example, or use an external caching system, such as Memcached. We tend to face a similar dilemma: fast, accurate, and easy — you can only choose two!