Whether the above execution order is correct or not, we can see it through explain execution plan. The content is too much, so we can see it in stages.

  1. SQL statement dependencies:

We see that stage-5 is the root, so stage-5 is executed first, stage-2 depends on stage-5, and stage-0 depends on stage-2.

  1. Perform stage-5 first:

Note that table B is scanned first, that is, the table behind left JOIN, and then is filtered (marked ② in the figure). In our SQL statement, table A is filtered, but Hive also automatically filters table B. This reduces the amount of associated data.

  1. Next, execute stage-2:
  • The first is the mapside operation:

First scan table A (marked ① in the figure); Next, filter idNO > ‘112233’ (marked ② in figure); Then, left join is performed, and the associated key is IDNO (marked ③ in the figure). After performing the association operation, the output operation will be three fields, including two fields of SELECT and one field of group by (marked ④ in the figure). Then the group by operation is hash (marked ⑤ in the figure). Then the sorting operation is carried out, and forward sorting is carried out according to IDNO (marked by ⑥ in the figure).

  • Then the Reduce side operation:

First, the group by operation is performed. Note that the grouping method is Mergepartial merging (marked ① in the figure). Select (); select (); select (); Having (count_user>1, 10101); Then limit the number of output lines (marked ④ in figure); Mark ⑤ in the figure indicates whether to compress the file. False does not compress the file.

The amount of data in the execution plan is only the predicted amount of data, not the actual operation, so the data may be inaccurate!

  1. Finally, stage-0:

Limit the number of lines in the final output to 10.

conclusion

Through the above analysis of SQL execution plan, the following points are summarized:

  1. Each stage is an independent MR, and complex Hive SQL statements can produce multiple stages. You can see what the steps are by looking at the description of the execution plan.

  2. For a group by key, it must be a field in the table. For a HAVING key, it must be a select field.

  3. Order by is executed after select, so the key of order BY must be the select field.

  4. Select * will add a lot of unnecessary consumption (CPU, IO, memory, network bandwidth).