What happens when MySQL executes Group By and Union

When MySQL processes grouping or subqueries, it often relies on internal temporary tables to collect intermediate results. group by and union are two cases where an internal temporary table may be used.

Group By execution process

The group field has no index

Execute the statement

select id%10 as m, count(*) as c from t1 group by m;

This statement computes id%10 for every row, groups the rows by that result, and counts the rows in each group. In MySQL 5.7 and earlier the output is also sorted by m; MySQL 8.0 no longer sorts group by results implicitly.

EXPLAIN shows the following in the Extra column:
- Using temporary: a temporary table is used for grouping
- Using filesort: a sort is required (done in memory or with disk files, depending on whether the data fits in the sort buffer)
Execution process
- Create an in-memory temporary table with two fields, m and c, where m is the primary key
- Scan index a of t1 and read the id value from each leaf node (if the table has no other index, the primary key index is scanned directly; either way this is a covering-index scan). Compute id%10 and call the result x
  - If the temporary table already has a row with primary key x, add 1 to its c
  - If the temporary table has no row with primary key x, insert a new row (x, 1)
- After the scan is complete, sort by field m and return the result set to the client
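
The steps above can be sketched in Python. This only illustrates the logic and is not how MySQL is implemented: a dict stands in for the in-memory temporary table keyed on m, and the function name `group_by_with_temp_table` and the sample ids are made up for the example.

```python
def group_by_with_temp_table(ids, modulus=10):
    """Mimic GROUP BY id % modulus using a keyed temporary table (a dict)."""
    temp_table = {}  # key plays the role of primary key m, value is the counter c
    for row_id in ids:  # stands in for the index scan over t1
        x = row_id % modulus
        if x in temp_table:
            temp_table[x] += 1  # a row with key x exists: c = c + 1
        else:
            temp_table[x] = 1   # no row with key x yet: insert (x, 1)
    # Final step: sort by m before returning the result set
    return sorted(temp_table.items())


# Hypothetical data: a table holding ids 1..25
rows = group_by_with_temp_table(range(1, 26))
```

Dropping the `sorted` call and returning `list(temp_table.items())` would correspond to reading the temporary table back without the final sort.
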
Skipping the sort

As described above, once the grouped result set has been built, group by sorts it by the grouping field by default before returning it. If the result does not need to be ordered, you can skip this step. Doing so speeds up the statement and avoids spending time on sorting a large result set.
- order by null
  
  select id%10 as m, count(*) as c from t1 group by m order by null;
  
  This skips the final sorting stage and returns the data directly from the temporary table
In-memory temporary tables and on-disk temporary tables

In the preceding example, an in-memory temporary table is used because the data set is small and fits in memory. The size of an in-memory temporary table is capped by the tmp_table_size parameter, which defaults to 16 MB.

When a large amount of data is grouped and no longer fits in the in-memory temporary table, MySQL converts it to an on-disk temporary table. Sorting an on-disk temporary table is very slow because it requires disk I/O.
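
To make the conversion concrete, here is a toy Python model. Everything in it is an assumption made for illustration: the class `ToyTempTable`, the row-count limit standing in for the byte limit tmp_table_size, and the `on_disk` flag. MySQL's real accounting is in bytes and is considerably more involved.

```python
class ToyTempTable:
    """Toy model: starts in memory, switches to 'disk' once it outgrows a limit."""

    def __init__(self, max_rows_in_memory):
        self.max_rows_in_memory = max_rows_in_memory  # stand-in for tmp_table_size
        self.rows = {}
        self.on_disk = False

    def add(self, key):
        self.rows[key] = self.rows.get(key, 0) + 1
        # Once the table no longer fits, it is converted and stays on disk.
        if not self.on_disk and len(self.rows) > self.max_rows_in_memory:
            self.on_disk = True


small = ToyTempTable(max_rows_in_memory=100)
for i in range(1000):
    small.add(i % 10)    # only 10 distinct groups: stays in memory

large = ToyTempTable(max_rows_in_memory=100)
for i in range(1000):
    large.add(i % 500)   # 500 distinct groups: exceeds the limit
```

The point of the sketch is that the number of distinct groups, not the number of scanned rows, decides whether the table outgrows memory.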

The group field has an index

As shown above, whether the temporary table lives in memory or on disk, group by builds a temporary table with a unique index to do the grouping. With a large data set, the statement becomes slow.

The temporary table is needed because the input is not ordered. While scanning, MySQL cannot tell whether more rows for the same group will appear later, so it cannot return a group's count to the client until the whole scan has finished. The temporary table holds the running counts in the meantime.

Indexes, however, are ordered. If the grouping field is indexed, rows of the same group arrive next to each other, and the grouping can be done without a temporary table.

Group fields with indexes

alter table t1 add column z int generated always as(id % 100), add index(z);

MySQL 5.7 supports generated columns, whose values are computed from other columns and kept up to date automatically.

Query statement:

select z, count(*) as c from t1 group by z;

The EXPLAIN output for this statement shows neither Using temporary nor Using filesort.
Execution process
- Because the scan follows the index on z, the values arrive in order. MySQL accumulates a count while the value stays the same. As soon as a different value appears, the previous value is known to be complete, since it cannot appear again later, so its count is added to the result set and counting starts for the new value.
  
  Because the start and end of each group are known, no temporary table is needed to hold intermediate counts, and no extra sort is needed because the output is already ordered.
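
A minimal Python sketch of this streaming approach, assuming the input is already sorted the way an index scan on z would deliver it. The function name `group_by_sorted` and the sample data are hypothetical, not part of MySQL.

```python
def group_by_sorted(values):
    """Count runs of equal values in an already sorted sequence."""
    result = []
    current, count = None, 0
    for v in values:  # stands in for scanning index z in order
        if count and v != current:
            result.append((current, count))  # group is finished, emit it
            count = 0
        current = v
        count += 1
    if count:
        result.append((current, count))  # emit the last group
    return result


# Hypothetical data: ids 1..250, z = id % 100, read in index order
z_values = sorted(i % 100 for i in range(1, 251))
rows = group_by_sorted(z_values)
```

Only one running counter is kept at any time, which is why neither a temporary table nor a final sort is required.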

As can be seen, when group by columns have indexes, group statistics need neither temporary tables nor additional sorting, which improves query efficiency. This is also why group by columns are often indexed when optimizing SQL.

Large data sets without a suitable index

Adding an index is the best way to speed up group by. But how can group by be optimized when an index is not appropriate?

During group by execution, rows first go into an in-memory temporary table. If it runs out of space, it is converted to an on-disk temporary table. If the data set is known in advance to be too large, can the in-memory stage be skipped and an on-disk temporary table used from the start?

MySQL provides the SQL_BIG_RESULT hint, which tells the optimizer that the result set is large so that it uses an on-disk temporary table directly.

select SQL_BIG_RESULT id%100 as m, count(*) as c from t1 group by m;

The execution flow of Union

Execute the statement

(select 1000 as f) union (select id from t1 order by id desc limit 2);

This statement merges the results of the two subqueries and removes duplicates: when the same row appears more than once, only one copy is kept.

EXPLAIN result:
- Using temporary: a temporary table is used when the result sets are merged
Execution process
- Create an in-memory temporary table with a single field f, which is the primary key
- Execute the first subquery and put its result into the in-memory temporary table
- Execute the second subquery, reading rows one at a time directly from the primary key index to get each id
  - If the id is not yet in the temporary table, insert it; if it is already there, the insert fails on the unique key and the row is skipped
  - Fetch the id of the next row and repeat: on success the row is inserted, on failure execution simply moves on to the next row
- Read the rows back from the temporary table one by one, put them in the result set, return it to the client, and delete the temporary table
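
The de-duplication steps can be sketched in Python as follows. This is an illustration only: the dict keys stand in for the unique primary key f, the function name `union_distinct` is made up, and the contents of t1 (ids 1 to 1000) are an assumption chosen so that the two subqueries overlap.

```python
def union_distinct(first, second):
    """Mimic UNION: a keyed temporary table rejects duplicate values."""
    temp_table = {}  # dict keys act as the unique primary key f
    for value in first:
        temp_table.setdefault(value, None)
    for value in second:
        if value in temp_table:
            continue  # unique-key violation: insert fails, move to the next row
        temp_table[value] = None
    return list(temp_table)  # read rows back, then the table is discarded


t1_ids = range(1, 1001)                    # assumed: t1 holds ids 1..1000
second = sorted(t1_ids, reverse=True)[:2]  # order by id desc limit 2
rows = union_distinct([1000], second)
```

Under that assumption the second subquery yields 1000 and 999; the 1000 is rejected as a duplicate and only 999 is added.
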
Optimization

As shown above, union uses an in-memory temporary table to hold rows temporarily and to remove duplicates, which guarantees that each row in the result is unique.

If you do not need to de-duplicate the data, you can change union to union all so that temporary tables are not used.

union all also merges the result sets of the two subqueries. The difference is that it does not deduplicate: duplicate rows are returned to the client as they are.

The EXPLAIN output for the union all version shows only a covering-index scan (Using index) and no temporary-table step.

So with union all no temporary table is created. The subqueries are executed one after another, and each subquery's rows are sent straight to the client as part of the result set.
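
For contrast with union, a rough Python sketch of union all: rows are streamed through in order and nothing is stored or checked. The function name `union_all` is hypothetical, and the contents of t1 (ids 1 to 1000) are an assumption for the example.

```python
from itertools import chain


def union_all(*subquery_results):
    """Mimic UNION ALL: stream each subquery's rows in order, duplicates included."""
    return chain.from_iterable(subquery_results)


t1_ids = range(1, 1001)                    # assumed: t1 holds ids 1..1000
second = sorted(t1_ids, reverse=True)[:2]  # order by id desc limit 2
rows = list(union_all([1000], second))
```

Under that assumption the value 1000 appears twice in `rows`, once from each subquery.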

Summary

- When using group by, try to group on indexed fields; the query is much faster.
- If group by has to run without an index and the result does not need to be sorted, add order by null to avoid the extra sort.
- If group by handles a large amount of data, use SQL_BIG_RESULT to tell the optimizer to use an on-disk temporary table directly, which avoids the extra step of converting an in-memory temporary table to an on-disk one.
- If the union result set does not need to be de-duplicated, use union all.
