An overview of the

This section describes hive SQL writing methods in some common data analysis scenarios, including range analysis, data conversion by condition, data column conversion, calculation of consecutive days, and obtaining top N in group sorting.

1. Merge multiple rows

Multi-line merging is often used for interval statistics, which reduces the dimension of hundreds of millions of records to the total amount in different intervals by defining a certain amount level. In general, it’s mapping multiple to one. Typical scenario: Based on the daily transaction flow of the user, calculate the amount of transactions in different amount segments every day.

For example, the daily transaction flow table structure of the user is as above, the number of transactions between 0-100, 100-200, 200-300 and more than 300 districts should be calculated.

CREATE VIEW t_deal_tmp_view_1 AS
SELECT
    CASE
        WHEN rcv_amount <= 100 THEN 1
        WHEN rcv_amount <= 200 THEN 2
        WHEN rcv_amount <= 300 THEN 3 
        ELSE 4 END AS amount_range,
    receiver
FROM t_transfer_info


SELECT
amount_range,
COUNT(receiver) AS cnt
FROM t_deal_tmp_view_1
GROUP BY amount_range   


DROP VIEW t_deal_tmp_view_1 
Copy the code

Why don’t you write it this way

SELECT
    CASE
        WHEN rcv_amount <= 100 THEN 1
        WHEN rcv_amount <= 200 THEN 2
        WHEN rcv_amount <= 300 THEN 3 
        ELSE 4 END AS amount_range,
    COUNT(receiver)
FROM t_transfer_info
GROUP BY 
    CASE
        WHEN rcv_amount <= 100 THEN 1
        WHEN rcv_amount <= 200 THEN 2
        WHEN rcv_amount <= 300 THEN 3
        ELSE NULL END

Copy the code

In Hive, if Group By is used, aggregation functions must be used for non-group By fields. Only Group By fields can be retrieved. The main reason is that there is no method to name new fields using Case When after Group By. So you need to use a temporary view for processing.

2. Use a conditional statement to convert NULL to 0

In hive tables, some records may be NULL. If an operation or logical judgment is performed on the record, the expected result cannot be obtained. In this case, NULL can be converted to 0 before processing. If NULL = 0, you can use NVL (NVL).

If the previous table records the user’s daily income and expenditure, the daily income and expenditure may be empty, and the user’s total income and expenditure for two consecutive days need to be calculated. Use JOIN to calculate the two-day table link, replace NULL with 0, SQL is as follows:

SELECT 
t1.uin,
t1.income + CASE WHEN t2.income IS NULL THEN 0 ELSE t2.income END AS income, 
t1.expend + CASE WHEN t2.expend IS NULL THEN 0 ELSE t2.expend END AS expend
FROM
(
    SELECT 
    uin,
    income,
    expend
    FROM t_user_trans_inf_day
    WHERE statis_day=20180812
)t1
LEFT JOIN
(
    SELECT 
    uin,
    income,
    expend
    FROM t_user_trans_inf_day
    WHERE statis_day=20180811
)t2
ON(t1.uin=t2.uin)

Copy the code

Biographies. 3

If there is A table A, as shown above, which records the consumption records of users with one column for each type of consumption, the columns of the table need to be converted into rows. For example, table B, the original multiple columns are converted into multiple rows. The following

There are two ways to do this, distribution using union and posexplode.

Method one uses union

The implementation method of union is to distribute and take out single columns, and then merge the results. The SQL is as follows.

SELECT uin, 1 AS type, of_amt
FROM t_user_trans
UNION ALL
SELECT uin, 2 AS type, lf_amt
FROM t_user_trans
UNION ALL
SELECT uin, 3 AS type, on_amt
FROM t_user_trans
UNION ALL
SELECT uin, 4 AS type, cr_amt
FROM t_user_trans
Copy the code
Method two, use posexplode

Explode is a built-in function. Explode (ARRAY) can be used to generate a row for each element in the list. Explode (MAP) Explode (MAP) generate a row from each key-value pair in the MAP. If explode(ARRAY) is not a type column, use posexplode instead. If explode(ARRAY) is not a type column, explode(ARRAY) is not a type column.

SELECT 
uin 
t.pos+1 AS type, 
t.value AS amount
FROM t_user_tans
LATERAL VIEW 
posexplode(
ARRAY(
of_amt,
lf_amt,
on_amt,
cr_amt
)) t as pos, value
Copy the code

4. Calculate consecutive days

There is a user login history table, which needs to calculate the consecutive login days of the user. In this case, you can use the Group number, Group By UIN + time minus the Group number, so that the consecutive days are aggregated together, and the final result can be calculated By the aggregation function.

SELECT
uin,
COUNT(uin) AS continuity_days
FROM(
	SELECT
	uin,
	statis_day,
	row_number() OVER(PARTITION BY uin order by statis_day asc) AS rn
	FROM
	(
	    SELECT 
	    uin,
	    statis_day  
	    FROM t_user_login_log 
	    WHERE statis_day>= 20170101    
	    AND statis_day <= 20180809))GROUP BY uin, date_sub(statis_day,CAST(rn AS INT))
Copy the code

5. Select topN for group sorting

If t_user_score is used to record students’ scores in all subjects, select the subject in which each student has the highest score. The row_number() function is used primarily here.

SELECT
uin
FROM
(
	SELECT 
	uin, 
	course, 
	row_number() OVER(PARTITION BY uin order by score asc) AS rn
	FROM
	t_user_score
)
WHERE rn = 1
Copy the code