row_number

When we introduced window functions earlier, we talked about how window functions are used, and we gave them a name to distinguish between general window functions and special window functions, and today we’re going to look at sort related window functions, because they’re window functions, and we’re going to say sort, We can probably guess that it is used to sort the data in the window

Order by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by,sort by If you’ve ever used mysql, you’ll know to write stored procedures or use custom variables to do this. The same goes for row Number, which returns the value of the sort order in which it was sorted by our custom sorting rules

So we think of row_number as a window sort function, but Hive doesn’t provide a non-window sort function either, but we’ve already said that if there’s no partition by in the definition of a window, then the entire data input is treated as a window, So in this case we can also use the window sort function to do global sort.

The test data

Here is the test data ID, DEPT, Salary, and we will use this test data to learn our window sorting function

1, Sales 10,000 2, sales 14,000 3, sales 10,000 4, back end 20,000 5, back end 25,000 6, back end 32,000 7,AI 40,000 8,AI 35,000 9,AI 60,000 10, count warehouse 20,000 11, count, 30,000 12, count, 32,000 13, count, 42,000Copy the code
create table ods_num_window(
    id string,
    dept string,
    salary int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ', ';
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/number.txt' OVERWRITE INTO TABLE ods_num_window;
Copy the code

Learn row_number from the example

Employees in each department are ranked in descending order of salary

select
    *.row_number(a)over(partition by dept order by salary desc) as rn
from
    ods_num_window
;
Copy the code

We see that each department has its own number one, and it is clear that the ranking is happening within each department

All employees are in descending order of salary

select
    *.row_number(a)over(order by salary desc) as rn
from
    ods_num_window
;
Copy the code

Partition by: partition by: partition by: partition by: partition by: partition by: partition by: partition by: partition by If there’s no child window, it’s one window, if all the data is in one window it’s global sort

Take the top two salaries in each department

Partition by (row_number()) partition by (row_number()) partition by (row_number()) partition by (row_number()

select
    *
from(
   select
       *.row_number(a)over(partition by dept order by salary desc) as rn
   from
       ods_num_window
) tmp
where
    rn < =2
;
Copy the code

And the way we do that is we sort the data in the subwindow, and we pick out the data that we need, which is Rn <=2 here

Rank and dense_rank

In fact, these two window functions are the same as row_number, they are window sorting functions, so why are there two window sorting functions? If we look at row_number, this time we’re going to sort in ascending order

select
    *,row_number() over(partition by dept order by salary) as rn
from
    ods_num_window
;
Copy the code

We saw two people in the sales department with the same salary of 10000, but different ranking

When we look at rank, we find that the two people in the sales department with the same salary are tied for first place, and the next person is straight in third place

If we look at dense_rank, the two people with the same salary are still ranked equally, but the next person is still ranked second

Usage scenarios

Top-N

Top-n we’ve covered it before, but I won’t cover it here

Calculation of continuous

So what is continuous, which is kind of a weird name, but I’m going to give you an example, so let’s say I have a log table of users, and I want to filter out the users that have been accessed for more than 7 consecutive days, or I want to count the 10 users that have been accessed for the most consecutive days

Below is a copy of the test data user ID, access date

1,2020-12-01,2020-12-02,2020-12-03,2020-12-04,2020-12-05,2020-12-06,2020-12-07,2020-12-08,2020-12-09 1,2020-12-10,2020-12-01,2020-12-02,2020-12-03,2020-12-04,2020-12-06,2020-12-07,2020-12-08Copy the code

Here is our construction sentence

CREATE TABLE ods.ods_user_log (
  id string,
  ctime string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
load data local inpath '/Users/liuwenqiang/workspace/hive/user_log.txt' overwrite into table ods.ods_user_log;
Copy the code

Now we analyze the problem and how to calculate continuous, computing is certainly on the same user, and then we can according to user’s access time to sort, then we use date minus the corresponding order Numbers will get a value, if the access time is continuous, we can get the same value

select
	id,ctime,
	row_number(partition by id order by ctime ) as rn
from
	ods_user_log
;
Copy the code

Here to demonstrate the effect is more noticeable, so the design of data is a little special, you can see for the user id is 1, we found that from December 1 to December 10, our ranking in turn from 1 to 10, this time we just will become for the digital date, then it is equal to 20201200 minus the corresponding rankings, At this point, we only need to count the number of 20201200, which is the number of consecutive login days, so instead of converting the date to the number and subtracting it, we just use the date to subtract it.

select
    id,ctime,
    date_sub(cast(ctime as date),row_number() over(partition by id order by ctime)),
    row_number() over(partition by id order by ctime ) as rn
from
    ods_user_log
;
Copy the code

In my case, it is 7 days, so I just need to calculate the number of the same date is greater than or equal to 7

select
    id,kt,count(1) as loginCnt
from (
    select
        id,ctime,
        date_sub(cast(ctime as date),row_number(a)over(partition by id order by ctime)) as kt,
        row_number(a)over(partition by id order by ctime ) as rn
    from
        ods_user_log
) tmp
group by
    id,kt
having
    count(1)> =7
;
Copy the code

Let’s try to understand this data. It means that user 1 has been visiting the site for 10 consecutive days starting from (2020-11-30+1)

If there are more than one number per day, then the code is not correct. So we don’t need to use dense_rank. We need to de-rank

Packet sampling

In fact, sampling is something you’ve all seen before, and random sampling is something you’ve seen before, so today we’re going to look at grouping random sampling, and it’s actually very simple, we’re going to randomly sort the subwindow using row_number, and then we’re going to pull out the sample data that we need, and we’re going to use the same data, and we’re going to randomly select each user for three days to log in

select
    *
from (
    select
        id,ctime,
        row_number(a)over(partition by id order by rand() ) as rn
    from
        ods_user_log
) tmp
where rn< =3
;
Copy the code

conclusion

  1. If the rank() is the same, it will repeat, and the total number will not change.
  2. If dense_rank() is the same, it will be repeated and the total number will be reduced.
  3. Row_number () is evaluated sequentially and does not repeat or decrease
  4. The Row_number function is used for three scenarios, top-n, to calculate continuous and group sampling

Follow the public account: Java big data and data warehouse, learn big data technology.