GROUPING SETS

First, let’s talk about the background. We know that GROUP BY can be used for data grouping statistics, and we call the GROUP as the statistical dimension. For example, GROUP BY school is considered as the dimension of school, and GROUP BY also supports multiple fields for grouping statistics. For example, ‘ ‘GROUP BY school,grade’ is the combination of school and grade. Although it is a combination or a single dimension, it is still a single dimension when combined together, because the statistical data can only get the information of each grade in each school. You don’t get separate statistics for schools and you don’t get separate statistics for grades

School Name Number of grade students Tsinghua Primary School 6 1000 Peking University Primary School 6 1300 Peking University Primary School 5 1600 Nanjing Primary School 5 500............Copy the code

Just like the statistics above, if you don’t do a second count, you can’t get the number of students in the whole Tsinghua Primary school, and you can’t get the number of students in the whole sixth grade

Because GROUP BY is a one-dimensional query, it can only calculate information in one dimension, not multiple dimensions at the same time. In a GROUPING SETS query, it aggregates information based on different unidimensional combinations. It is equivalent to UNION ALL for GROUP BY results of different dimensions.

GROUPING SETS is a convenient way to GROUPING multiple GROUP BY logical unions into a HIVE SQL statement.

GROUPING SETS NULL columns that are not represented BY GROUP BY in a GROUP BY logic. GROUPING SETS SETS NULL columns that are not represented BY GROUP BY.

demo

Let’s prepare this data first

Zhang San1, 1 Middle School 2, zhang San2, 1 Middle School 3, Zhang San4, 1 Middle School 5, Zhang San5, 1 Middle School 6, Zhang San6, 2 Middle School 1, Li 41, 2 middle school 2, 2 grade Li 4 2 Li 4 3 Li 4 4 Li 4 4 Li 4 4 Li 4 5 Li 4 5 Li 4 6 Li 4 6 Wang 2 1 Li 2 3 Wang 2 3 3 Middle School 4 Wang 2 4 3 Middle School 5 Wang 2 5 Wang Er6, Grade 6, No.3 Middle SchoolCopy the code
CREATE TABLE ods.ods_student (
  school string,
  grade string,
  `user` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
load data local inpath '/Users/liuwenqiang/workspace/hive/students.data' overwrite into table ods.ods_student;
Copy the code

First, let’s query how many students there are in each grade in each school, and return the results in the format of school, grade, and number

select school,grade,count(1) as userCnt from ods.ods_student group by school,grade;
Copy the code

Next we want to find out how many students are in each school, and then return with the above query result, the position of grade is null

select school,grade,count(1) as userCnt from ods.ods_student group by school,grade union all select school,null,count(1)  as userCnt from ods.ods_student group by school;Copy the code

Based on the above, we want to find how many students are in each grade at the same time, with the school location set to null

select school,grade,count(1) as userCnt from ods.ods_student group by school,grade union all select school,null,count(1)  as userCnt from ods.ods_student group by school union all select null,grade,count(1) as userCnt from ods.ods_student group by grade;Copy the code

As soon as we add dimensions, we add a new statistical SQL on top of the original SQL, and set the dimensions that do not participate in the calculation to NULL

Hive provides GROUPING SETS, which you can think of as a set of group BY dimensions. SETS are grouped according to the GROUPING by dimensions of Hive

select school,grade,count(1) as userCnt from ods.ods_student group by school,grade grouping sets((school,grade),school,grade);
Copy the code

The result is the same, except that the return order is not the same as above. For example, the first line means the number of students in the first grade (so the school)

select nvl(school,'All grades'),nvl(grade,'All School'),count(1) as userCnt from ods.ods_student group by school,grade grouping sets((school,grade),school,grade);
Copy the code

So the output is much nicer and easier to understand

Grammar rules

Grouping sets is a method of combining the grouping results of multiple groups by. Grouping sets is an obvious method, but there are some precautions

  1. grouping sets(dim1,dim2,dim3 ….) Each dimension is either present in group BY or a combination of word-ends in Group BY, for examplegroup by school,gradeAnd the grouping sets always show up(school,grade)This combination
  2. grouping sets(dim1,dim2,dim3 …. An empty group group can appear in (), this means that all group fields are NULL, i.e. all group fields are NULL
select nvl(school,'All grades'),nvl(grade,'All School'),count(1) as userCnt from ods.ods_student group by school,grade grouping sets((school,grade),school,grade,());
Copy the code

In this case, it’s the whole grade school

Realize the principle of

The Grouping sets implementation of Hive does not use group by grouping as kylin does. The grouping sets implementation of Hive does not use Group by grouping as Kylin does

Hive implementation is mindless replication, defined as group by grouping sets and then union

Grouping__id field

In grouping sets, an ID is assigned to each of the smaller and larger dimensions, and an ID is assigned to each of the smaller dimensions instead of each record. The ID starts at 0

select grouping__id, nvl(school,'All grades'),nvl(grade,'All School'),count(1) as userCnt from ods.ods_student group by school,grade grouping sets((school,grade),school,grade,());
Copy the code

Grouping sets((school,grade), School,grade,()); There are four dimensions, namely (school,grade),,school,grade and (), so the ids of these four dimensions are 0, 1, 2 and 3 in order

When we see that the records of the same dimension are not all placed together, we can use this field to sort

select grouping__id, nvl(school,'All grades'),nvl(grade,'All School'),count(1) as usercnt from ods.ods_student group by school,grade grouping sets((school,grade),school,grade,()) order by grouping__id ;
Copy the code

conclusion

  1. Grouping sets is a simple way to write SQL and is easier to scale. The advantage is obvious when there are more dimensions. The grouping sets implementation does not improve the query performance of SQL

  2. Grouping sets in a group by query, the grouping sets are grouped according to different dimensions. You can aggregate the dimensions selectively or all of the dimensions. When the dimensions are aggregated, the value is null and the NVL function is used to assign values to the grouping sets to obtain a meaningful dimension name

  3. Grouping__id is a grouping sets() field that is sometimes used for filtering and sorting. The grouping sets() function sets the arguments in the same order