sampling
Sampling is also a common method in Hive, which is mainly used in the following scenarios
- In some machine learning scenarios, data warehouses serve as data providers to provide sample data
- The calculation result of the data is abnormal or the index is abnormal. At this time, we often need to confirm whether the data of the data source itself is abnormal
- When the PERFORMANCE of SQL is in question, we will also use the method area of sampling to view the data and then perform SQL tuning
- In large-scale data analysis and modeling tasks, mining and analyzing the full data is time-consuming and takes up cluster resources. Therefore, only a small part of data needs to be extracted for analysis and modeling.
Random sampling (Rand () function)
We usually use the sorting function and the rand() function to do random sampling. The limit keyword limits the data returned by the sampling. The difference is which sorting function we use
The rand() function is used for extraction because rand() returns a random value of type double between 0 and 1.
Here we use a table we used before about 4603089 records, here I will not prepare data for you, you can see Hive advanced data storage format to obtain test data
create table ods_user_bucket_log(
id int,
name string,
city string,
phone string,
acctime string)
CLUSTERED BY (`id` ) INTO 5 BUCKETS
row format delimited fields terminated by '\t'
stored as textfile;
insert overwrite table ods_user_bucket_log select * from ods_user_log;
Copy the code
order by rand()
Order by enables only one Reduce, so it is time consuming. Why did we explain common sorting methods in Hive syntax in previous articles
Because order by is global, random sampling can be achieved
select * from ods_user_bucket_log order by rand() limit 10;
sort by rand()
Sort by provides the sorting function in a single reducer, but does not guarantee the overall order. At this time, real randomness cannot be achieved, because the randomness at this time is targeted at partitions. Therefore, if we can control the data entering each partition to be random, we can achieve randomness
select * from ods_user_bucket_log sort by rand() limit 10;
distribute by rand() sort by rand()
Distribute and sort keywords before the RAND function can ensure that data is randomly distributed in the Mapper and Reducer stages. At this time, we can also achieve real randomness. And as we mentioned earlier, cluster BY is basically the same thing as distribute by sort by
select * from ods_user_bucket_log distribute by rand() sort by rand() limit 10;
cluster by rand()
Distribute by is a combination of distribute by and sort by, distribute by rand() sort by rand() randomized twice, cluster by rand() randomized only once, So it’s going to be faster than the previous method
select * from ods_user_bucket_log cluster by rand() limit 10;
Tablesample () Sampling function
Barrel sampling (barrel table sampling)
In Hive, buckets are modeled based on a Hash field and placed into buckets with specified data. For example, a table is divided into 100 buckets based on ID. The algorithm is Hash (ID) % 100. The hash(ID) % 100 = 1 record is placed in the second bucket.
TABLESAMPLE (BUCKET X OUT OF Y [ON colname])
Where x is the bucket number to be sampled, the bucket number starts from 1, colname is the column to be sampled (that is, buckets are divided according to that field), and Y is the number of buckets. So the expression means that the bucket is divided into Y bucket according to the colname field, and the x bucket is extracted
SELECT
*
FROM
ods_user_bucket_log
TABLESAMPLE (BUCKET 1 OUT OF 100000 ON rand()) ;
Copy the code
Data block sampling
Block sampling is provided starting from Hive 0.8. Tablesample is used to extract the specified number of rows, proportion, and size
SELECT * FROM ods_user_data TABLESAMPLE(1000 ROWS);
SELECT * FROM ods_user_data TABLESAMPLE (20 PERCENT);
SELECT * FROM ods_user_data TABLESAMPLE(1M);
Copy the code
Sample ABLESAMPLE (20 PERCENT)
This will allow Hive to fetch at least N % of the data
SELECT
*
FROM
ods_user_bucket_log
TABLESAMPLE(0.0001 PERCENT);
Copy the code
Extract data of a specific size TABLESAMPLE(100M)
SELECT
*
FROM
ods_user_bucket_log
TABLESAMPLE(1M);
Copy the code
Notice that this has to be an integer M, because I got an error when I tried zero
Extract a specific number of ROWS TABLESAMPLE(10 ROWS)
SELECT
*
FROM
ods_user_bucket_log
TABLESAMPLE(10 rows);
Copy the code
extension
How does random sampling achieve proportional sampling
TABLESAMPLE can be used for proportional sampling. Random sampling can be used for sampling a specific number of records with the help of limit. In fact, if we modify random sampling, we can also use proportional sampling because rand() is random. So we can do conditional filtering on the return value to achieve sampling to scale
select
*
from(
select
* ,rand() as radix
from
ods_user_bucket_log
) tmp
where
radix> =0.0
and radix< =0.0001
;
Copy the code
Stratified sampling (group sampling)
Stratified sampling, there are two kinds of sampling, one is the number of stratified sampling and the other is the ratio of stratified sampling
Stratified number
select
*
from (
select
id,ctime,
row_number(a)over(partition by id order by rand() ) as rn
from
ods_user_log
) tmp
where rn< =3
;
Copy the code
Stratified and proportional sampling can also be implemented in the same way as above
conclusion
-
The TABLESAMPLE sampling function itself does not take MR, so it is fast.
-
The random sampling function requires MR. Therefore, the performance of the random sampling function is not as fast as TABLESAMPLE, and the expression capability is limited. Only a certain number of items can be obtained (limit n).
-
Implement hierarchical sampling with row_number