Before we get into the text, let’s understand two concepts: drill-down and roll-up
- Drill-down: Also known as drill-down, to Drill down to see further detailed data by adding dimensions. For example, when looking at the total sales for different years, you can Drill down to see sales for all months of the year. Drill-down is designed to increase dimensions for detailed data
- Roll-up: Also known as roll-up, as opposed to drilling, to view further summary data by reducing the dimensions, such as annual sales, while viewing total sales for all months. Rolling up is designed to reduce dimensions to get summary data
GROUPING SETS, CUBE, and ROLLUP are used in OLAP analysis to calculate the GROUPING SETS, CUBE, and ROLLUP of uVs in hours, days, weeks, and months
GROUPING SETS
The GROUPING SETS statement in GROUP BY allows you to specify multiple GROUP BY options in the same result SET. All GROUPING SETS statements can be represented logically based on the GROUPING BY queries of a UNION join
SQL > select * from GROUP BY
GROUPING SETS aggregate query | Peer GROUP BY aggregate queries |
---|---|
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) ) | SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b |
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a) | SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b UNION SELECT a, null, SUM( c ) FROM tab1 GROUP BY a |
SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b) | SELECT a, null, SUM( c ) FROM tab1 GROUP BY a UNION SELECT null, b, SUM( c ) FROM tab1 GROUP BY b |
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) ) | SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b UNION SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null UNION SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b UNION SELECT null, null, SUM( c ) FROM tab1 |
GROUPING SETS can be represented BY as many ordinary groups BY unions as the GROUPING SETS specify. In addition, blank parentheses () indicate that the aggregate is computed
CUBE and a ROLLUP
GROUPING SETS specifies the GROUPING dimension. CUBE creates a subtotal of all possible combinations of column sets in its dimensions. Once we compute CUBE on a set of dimensions, we can get the answers to all possible aggregation problems on those dimensions. ROLLUP is used to compute aggregations at the hierarchy level of dimensions, such as GROUP by A, B, c with ROLLUP, which drills down from A to B to C, and generally drills down from the left-most dimension to the right-most dimension
Note that CUBE and ROLLUP are used only with GROUP BY
Look at CUBE and ROLLUP versus GROUPING SETS
CUBE/ROLLUP | GROUPING SETS |
---|---|
GROUP BY a, b, c WITH CUBE | GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ()) |
GROUP BY a, b, c, WITH ROLLUP | GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ()) |
Compared with CUBE and ROLLUP, GROUPING SETS is more flexible and allows you to freely define the GROUPING of dimensions. However, CUBE and ROLLUP are more concise for the same result set
Grouping__ID
Using GROUPING SETS, CUBE, and ROLLUP, there is also a Grouping__ID function that tells you which GROUPING the result set belongs to. When multiple dimensions are aggregated, some dimensions are represented as NULL in the result set, but conflicts may occur if the dimensions in the original data set themselves have NULL values. Grouping__ID is 1 if the NULL dimension is in the aggregate dimension, 0 otherwise
Look at an example
Column1(key) | Column2(value) |
---|---|
1 | NULL |
1 | 1 |
2 | 2 |
2 | 3 |
3 | 3 |
3 | NULL |
4 | 5 |
Execute SQL
SELECT key.value, GROUPING__ID, count(*) cnt
FROM T1
GROUP BY key.value WITH ROLLUP;
Copy the code
The results are as follows
Column1(key) | Column2(value) | GROUPING__ID | cnt |
---|---|---|---|
NULL | NULL | 3 | 6 |
1 | NULL | 0 | 2 |
1 | NULL | 1 | 1 |
1 | 1 | 0 | 1 |
2 | NULL | 1 | 1 |
2 | 2 | 0 | 1 |
3 | NULL | 0 | 2 |
3 | NULL | 1 | 1 |
3 | 3 | 0 | 1 |
4 | NULL | 1 | 1 |
4 | 5 | 0 | 1 |
SQL uses ROLLUP GROUPING, which is equivalent to GROUP BY key, value GROUPING ((key, value), key, ()). A group aggregated by key and value. In the source data, we found two rows (1, NULL) and (3, NULL). We can imagine that the dimensions in the result set are displayed the same as those in the key, value aggregation. (1, NULL) and (3, NULL) each correspond to two rows of data, so how to distinguish which row belongs to which set of aggregation. Just look at their GROUPING__ID, GROUPING__ID 1 represents the NULL dimension in the aggregate dimension, which is the key, value aggregate in the example; The dimension whose GROUPING__ID is 0 means NULL is not in the aggregation dimension, which is the key-by-key aggregation in the example
SQL Example recommended reading:GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
hive.new.job.grouping.set.cardinality
This parameter specifies whether the GROUPING SETS, CUBE, and ROLLUP aggregation starts a new Map-Reduce Job
For a query like this: select a, b, c, count(1) from T group by a, b, c with rollup (a, B, C), (A, B, NULL), (A, NULL, NULL), (NULL, NULL, NULL). If the cardinality of table T is large, map-Reduce jobs may crash and the aggregation effect on the Map side is poor
Resources: Apache Hive Wiki documentation