row_number
When we introduced window functions earlier, we talked about how window functions are used, and we gave them a name to distinguish between general window functions and special window functions, and today we’re going to look at sort related window functions, because they’re window functions, and we’re going to say sort, We can probably guess that it is used to sort the data in the window
Order by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by If you’ve ever used mysql, you’ll know to write stored procedures or use custom variables to do this. The same goes for row Number, which returns the value of the sort order in which it was sorted by our custom sorting rules
So we think of row_number as a window sort function, but Hive doesn’t provide a non-window sort function either, but we’ve already said that if there’s no partition by in the definition of a window, then the entire data input is treated as a window, So in this case we can also use the window sort function to do global sort.
The test data
Here is the test data ID, DEPT, Salary, and we will use this test data to learn our window sorting function
1, Sales 10,000 2, sales 14,000 3, sales 10,000 4, back end 20,000 5, back end 25,000 6, back end 32,000 7,AI 40,000 8,AI 35,000 9,AI 60,000 10, count warehouse 20,000 11, count, 30,000 12, count, 32,000 13, count, 42,000Copy the code
create table ods_num_window(
id string,
dept string,
salary int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ', ';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/number.txt' OVERWRITE INTO TABLE ods_num_window;
Copy the code
Learn row_number from the example
Employees in each department are ranked in descending order of salary
select
*.row_number(a)over(partition by dept order by salary desc) as rn
from
ods_num_window
;
Copy the code
We see that each department has its own number one, and it is clear that the ranking is happening within each department
All employees are in descending order of salary
select
*.row_number(a)over(order by salary desc) as rn
from
ods_num_window
;
Copy the code
Partition by: partition by: partition by: partition by: partition by: partition by: partition by: partition by: partition by If there’s no child window, it’s one window, if all the data is in one window it’s global sort
Take the top two salaries in each department
Partition by (row_number()) partition by (row_number()) partition by (row_number()) partition by (row_number()
select
*
from(
select
*.row_number(a)over(partition by dept order by salary desc) as rn
from
ods_num_window
) tmp
where
rn < =2
;
Copy the code
And the way we do that is we sort the data in the subwindow, and we pick out the data that we need, which is Rn <=2 here
Rank and dense_rank
In fact, these two window functions are the same as row_number, they are window sorting functions, so why are there two window sorting functions? If we look at row_number, this time we’re going to sort in ascending order
select
*,row_number() over(partition by dept order by salary) as rn
from
ods_num_window
;
Copy the code
We saw two people in the sales department with the same salary of 10000, but different ranking
When we look at rank, we find that the two people in the sales department with the same salary are tied for first place, and the next person is straight in third place
If we look at dense_rank, the two people with the same salary are still ranked equally, but the next person is still ranked second
Usage scenarios
Top-N
Top-n we’ve covered it before, but I won’t cover it here
Calculation of continuous
So what is continuous, which is kind of a weird name, but I’m going to give you an example, so let’s say I have a log table of users, and I want to filter out the users that have been accessed for more than 7 consecutive days, or I want to count the 10 users that have been accessed for the most consecutive days
Below is a copy of the test data user ID, access date
1,2020-12-01,2020-12-02,2020-12-03,2020-12-04,2020-12-05,2020-12-06,2020-12-07,2020-12-08,2020-12-09 1,2020-12-10,2020-12-01,2020-12-02,2020-12-03,2020-12-04,2020-12-06,2020-12-07,2020-12-08Copy the code
Here is our construction sentence
CREATE TABLE ods.ods_user_log (
id string,
ctime string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
load data local inpath '/Users/liuwenqiang/workspace/hive/user_log.txt' overwrite into table ods.ods_user_log;
Copy the code
Now we analyze the problem and how to calculate continuous, computing is certainly on the same user, and then we can according to user’s access time to sort, then we use date minus the corresponding order Numbers will get a value, if the access time is continuous, we can get the same value
select
id,ctime,
row_number(partition by id order by ctime ) as rn
from
ods_user_log
;
Copy the code
Here to demonstrate the effect is more noticeable, so the design of data is a little special, you can see for the user id is 1, we found that from December 1 to December 10, our ranking in turn from 1 to 10, this time we just will become for the digital date, then it is equal to 20201200 minus the corresponding rankings, At this point, we only need to count the number of 20201200, which is the number of consecutive login days, so instead of converting the date to the number and subtracting it, we just use the date to subtract it.
select
id,ctime,
date_sub(cast(ctime as date),row_number() over(partition by id order by ctime)),
row_number() over(partition by id order by ctime ) as rn
from
ods_user_log
;
Copy the code
In my case, it is 7 days, so I just need to calculate the number of the same date is greater than or equal to 7
select
id,kt,count(1) as loginCnt
from (
select
id,ctime,
date_sub(cast(ctime as date),row_number(a)over(partition by id order by ctime)) as kt,
row_number(a)over(partition by id order by ctime ) as rn
from
ods_user_log
) tmp
group by
id,kt
having
count(1)> =7
;
Copy the code
Let’s try to understand this data. It means that user 1 has been visiting the site for 10 consecutive days starting from (2020-11-30+1)
If there are more than one number per day, then the code is not correct. So we don’t need to use dense_rank. We need to de-rank
Packet sampling
In fact, sampling is something you’ve all seen before, and random sampling is something you’ve seen before, so today we’re going to look at grouping random sampling, and it’s actually very simple, we’re going to randomly sort the subwindow using row_number, and then we’re going to pull out the sample data that we need, and we’re going to use the same data, and we’re going to randomly select each user for three days to log in
select
*
from (
select
id,ctime,
row_number(a)over(partition by id order by rand() ) as rn
from
ods_user_log
) tmp
where rn< =3
;
Copy the code
conclusion
- If the rank() is the same, it will repeat, and the total number will not change.
- If dense_rank() is the same, it will be repeated and the total number will be reduced.
- Row_number () is evaluated sequentially and does not repeat or decrease
- The Row_number function is used for three scenarios, top-n, to calculate continuous and group sampling
Follow the public account: Java big data and data warehouse, learn big data technology.