sampling

Sampling is also a common method in Hive, which is mainly used in the following scenarios

  1. In some machine learning scenarios, data warehouses serve as data providers to provide sample data
  2. The calculation result of the data is abnormal or the index is abnormal. At this time, we often need to confirm whether the data of the data source itself is abnormal
  3. When the PERFORMANCE of SQL is in question, we will also use the method area of sampling to view the data and then perform SQL tuning
  4. In large-scale data analysis and modeling tasks, mining and analyzing the full data is time-consuming and takes up cluster resources. Therefore, only a small part of data needs to be extracted for analysis and modeling.

Random sampling (Rand () function)

We usually use the sorting function and the rand() function to do random sampling. The limit keyword limits the data returned by the sampling. The difference is which sorting function we use

The rand() function is used for extraction because rand() returns a random value of type double between 0 and 1.

Here we use a table we used before about 4603089 records, here I will not prepare data for you, you can see Hive advanced data storage format to obtain test data

create table ods_user_bucket_log(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
CLUSTERED BY (`id` ) INTO 5 BUCKETS 
row format delimited fields terminated by '\t'
stored as textfile;
insert overwrite table ods_user_bucket_log select * from ods_user_log;
Copy the code

order by rand()

Order by enables only one Reduce, so it is time consuming. Why did we explain common sorting methods in Hive syntax in previous articles

Because order by is global, random sampling can be achieved

select * from ods_user_bucket_log order by rand() limit 10;

sort by rand()

Sort by provides the sorting function in a single reducer, but does not guarantee the overall order. At this time, real randomness cannot be achieved, because the randomness at this time is targeted at partitions. Therefore, if we can control the data entering each partition to be random, we can achieve randomness

select * from ods_user_bucket_log sort by rand() limit 10;

distribute by rand() sort by rand()

Distribute and sort keywords before the RAND function can ensure that data is randomly distributed in the Mapper and Reducer stages. At this time, we can also achieve real randomness. And as we mentioned earlier, cluster BY is basically the same thing as distribute by sort by

select * from ods_user_bucket_log distribute by rand() sort by rand() limit 10;

cluster by rand()

Distribute by is a combination of distribute by and sort by, distribute by rand() sort by rand() randomized twice, cluster by rand() randomized only once, So it’s going to be faster than the previous method

select * from ods_user_bucket_log cluster by rand() limit 10;

Tablesample () Sampling function

Barrel sampling (barrel table sampling)

In Hive, buckets are modeled based on a Hash field and placed into buckets with specified data. For example, a table is divided into 100 buckets based on ID. The algorithm is Hash (ID) % 100. The hash(ID) % 100 = 1 record is placed in the second bucket.

TABLESAMPLE (BUCKET X OUT OF Y [ON colname])

Where x is the bucket number to be sampled, the bucket number starts from 1, colname is the column to be sampled (that is, buckets are divided according to that field), and Y is the number of buckets. So the expression means that the bucket is divided into Y bucket according to the colname field, and the x bucket is extracted

SELECT
	*
FROM
	ods_user_bucket_log
TABLESAMPLE (BUCKET 1 OUT OF 100000 ON rand()) ;
Copy the code

Data block sampling

Block sampling is provided starting from Hive 0.8. Tablesample is used to extract the specified number of rows, proportion, and size

SELECT * FROM ods_user_data TABLESAMPLE(1000 ROWS);
SELECT * FROM ods_user_data TABLESAMPLE (20 PERCENT); 
SELECT * FROM ods_user_data TABLESAMPLE(1M); 
Copy the code
Sample ABLESAMPLE (20 PERCENT)

This will allow Hive to fetch at least N % of the data

SELECT
    *
FROM
    ods_user_bucket_log
TABLESAMPLE(0.0001 PERCENT);
Copy the code

Extract data of a specific size TABLESAMPLE(100M)
SELECT
    *
FROM
    ods_user_bucket_log
TABLESAMPLE(1M);
Copy the code

Notice that this has to be an integer M, because I got an error when I tried zero

Extract a specific number of ROWS TABLESAMPLE(10 ROWS)
SELECT
    *
FROM
    ods_user_bucket_log
TABLESAMPLE(10 rows);
Copy the code

extension

How does random sampling achieve proportional sampling

TABLESAMPLE can be used for proportional sampling. Random sampling can be used for sampling a specific number of records with the help of limit. In fact, if we modify random sampling, we can also use proportional sampling because rand() is random. So we can do conditional filtering on the return value to achieve sampling to scale

select
    *
from(
    select
        * ,rand() as radix
    from
        ods_user_bucket_log
) tmp
where
    radix> =0.0
    and radix< =0.0001
;
Copy the code

Stratified sampling (group sampling)

Stratified sampling, there are two kinds of sampling, one is the number of stratified sampling and the other is the ratio of stratified sampling

Stratified number

select
    *
from (
    select
        id,ctime,
        row_number(a)over(partition by id order by rand() ) as rn
    from
        ods_user_log
) tmp
where rn< =3
;
Copy the code

Stratified and proportional sampling can also be implemented in the same way as above

conclusion

  1. The TABLESAMPLE sampling function itself does not take MR, so it is fast.

  2. The random sampling function requires MR. Therefore, the performance of the random sampling function is not as fast as TABLESAMPLE, and the expression capability is limited. Only a certain number of items can be obtained (limit n).

  3. Implement hierarchical sampling with row_number