Before we get into the text, let’s understand two concepts: drill-down and roll-up

  • Drill-down: Also known as drill-down, to Drill down to see further detailed data by adding dimensions. For example, when looking at the total sales for different years, you can Drill down to see sales for all months of the year. Drill-down is designed to increase dimensions for detailed data
  • Roll-up: Also known as roll-up, as opposed to drilling, to view further summary data by reducing the dimensions, such as annual sales, while viewing total sales for all months. Rolling up is designed to reduce dimensions to get summary data

GROUPING SETS, CUBE, and ROLLUP are used in OLAP analysis to calculate the GROUPING SETS, CUBE, and ROLLUP of uVs in hours, days, weeks, and months

GROUPING SETS

The GROUPING SETS statement in GROUP BY allows you to specify multiple GROUP BY options in the same result SET. All GROUPING SETS statements can be represented logically based on the GROUPING BY queries of a UNION join

SQL > select * from GROUP BY

GROUPING SETS aggregate query Peer GROUP BY aggregate queries
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) ) SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a) SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b

UNION

SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b) SELECT a, null, SUM( c ) FROM tab1 GROUP BY a

UNION

SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) ) SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b

UNION

SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null

UNION

SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b

UNION

SELECT null, null, SUM( c ) FROM tab1

GROUPING SETS can be represented BY as many ordinary groups BY unions as the GROUPING SETS specify. In addition, blank parentheses () indicate that the aggregate is computed

CUBE and a ROLLUP

GROUPING SETS specifies the GROUPING dimension. CUBE creates a subtotal of all possible combinations of column sets in its dimensions. Once we compute CUBE on a set of dimensions, we can get the answers to all possible aggregation problems on those dimensions. ROLLUP is used to compute aggregations at the hierarchy level of dimensions, such as GROUP by A, B, c with ROLLUP, which drills down from A to B to C, and generally drills down from the left-most dimension to the right-most dimension

Note that CUBE and ROLLUP are used only with GROUP BY

Look at CUBE and ROLLUP versus GROUPING SETS

CUBE/ROLLUP GROUPING SETS
GROUP BY a, b, c WITH CUBE GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ())
GROUP BY a, b, c, WITH ROLLUP GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ())

Compared with CUBE and ROLLUP, GROUPING SETS is more flexible and allows you to freely define the GROUPING of dimensions. However, CUBE and ROLLUP are more concise for the same result set

Grouping__ID

Using GROUPING SETS, CUBE, and ROLLUP, there is also a Grouping__ID function that tells you which GROUPING the result set belongs to. When multiple dimensions are aggregated, some dimensions are represented as NULL in the result set, but conflicts may occur if the dimensions in the original data set themselves have NULL values. Grouping__ID is 1 if the NULL dimension is in the aggregate dimension, 0 otherwise

Look at an example

Column1(key) Column2(value)
1 NULL
1 1
2 2
2 3
3 3
3 NULL
4 5

Execute SQL

SELECT key.value, GROUPING__ID, count(*) cnt
FROM T1
GROUP BY key.value WITH ROLLUP;
Copy the code

The results are as follows

Column1(key) Column2(value) GROUPING__ID cnt
NULL NULL 3 6
1 NULL 0 2
1 NULL 1 1
1 1 0 1
2 NULL 1 1
2 2 0 1
3 NULL 0 2
3 NULL 1 1
3 3 0 1
4 NULL 1 1
4 5 0 1

SQL uses ROLLUP GROUPING, which is equivalent to GROUP BY key, value GROUPING ((key, value), key, ()). A group aggregated by key and value. In the source data, we found two rows (1, NULL) and (3, NULL). We can imagine that the dimensions in the result set are displayed the same as those in the key, value aggregation. (1, NULL) and (3, NULL) each correspond to two rows of data, so how to distinguish which row belongs to which set of aggregation. Just look at their GROUPING__ID, GROUPING__ID 1 represents the NULL dimension in the aggregate dimension, which is the key, value aggregate in the example; The dimension whose GROUPING__ID is 0 means NULL is not in the aggregation dimension, which is the key-by-key aggregation in the example

SQL Example recommended reading:GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

hive.new.job.grouping.set.cardinality

This parameter specifies whether the GROUPING SETS, CUBE, and ROLLUP aggregation starts a new Map-Reduce Job

For a query like this: select a, b, c, count(1) from T group by a, b, c with rollup (a, B, C), (A, B, NULL), (A, NULL, NULL), (NULL, NULL, NULL). If the cardinality of table T is large, map-Reduce jobs may crash and the aggregation effect on the Map side is poor



Resources: Apache Hive Wiki documentation