SQL is an essential skill for big data practitioners, and most big data technology frameworks provide SQL solutions. It can be said that SQL is an enduring and enduring programming language. Especially in the warehouse area, the use of SQL is common. This article will share four common techniques used in interviews and on the job, including:

  • Use of dates and periods
  • Temporary Tables and Common Table Expression (WITH)
  • Aggregation is combined with CASE WHEN
  • Other uses of the Window Function

Number of warehouse? Write SQL… !

First: the use of dates and periods

Date and time filtering is often used in the workplace because weekly, monthly, quarterly, and annual performance is often the focus of analysis when pulling reports, dashboards, and various kinds of analysis.

Time segment extraction: Extract

  • grammar
-- Field can be day, hour, minute, month, quarter, etc
-- Source can be of the date or timestamp type
extract(field FROM source)
Copy the code
  • use
SELECT extract(year FROM 'the 2020-08-05 09:30:08');   The result is 2020
SELECT extract(quarter FROM 'the 2020-08-05 09:30:08');   That's 3
SELECT extract(month FROM 'the 2020-08-05 09:30:08');   -- the result is 8
SELECT extract(week FROM 'the 2020-08-05 09:30:08');   The result is 31, the week of the year
SELECT extract(day FROM 'the 2020-08-05 09:30:08');  -- the result is 5
SELECT extract(hour FROM 'the 2020-08-05 09:30:08');   That's 9
SELECT extract(minute FROM 'the 2020-08-05 09:30:08');   The result is 30
SELECT extract(second FROM 'the 2020-08-05 09:30:08');   -- the result is 8
Copy the code

The fields available for extraction above vary slightly from database to database. Hive supports day, dayofweek, hour, minute, month, quarter, Second, Week, and Year. Among them, weekly, monthly and annual are the most widely used, because no matter the company’s internal products or commercial products provide background statistics of data, weekly and monthly reports (such as the last 7 days and the last 30 days) pay most attention to the cycle of performance.

Note:

Impala supports: YEAR, QUARTER, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, EPOCH

Hive supports day, dayofweek, hour, minute, Month, quarter, Second, Week, and Year

Hive introduces this function from Hive2.2.0

The extraction of weeks

  • grammar

The following function is often used to identify the date of Monday and the date of Sunday when performing statistics on a weekly basis:

next_day(STRING start_date, STRING day_of_week)
-- Returns the date of the next week corresponding to the current date
-- 2020-08-05 is Wednesday
SELECT next_day('2020-08-05'.'MO') -- Next Monday corresponding date: 2020-08-10
SELECT next_day('2020-08-05'.'TU') -- Next Tuesday corresponding date: 2020-08-11
SELECT next_day('2020-08-05'.'WE') -- Next Wednesday corresponding date: 2020-08-12
SELECT next_day('2020-08-05'.'TH') -- The corresponding date of next Thursday: 2020-08-06, that is this Thursday
SELECT next_day('2020-08-05'.'FR') -- Next Friday corresponding date: 2020-08-07, namely this Friday
SELECT next_day('2020-08-05'.'SA') -- The corresponding date of next Saturday: 2020-08-08, that is, this Saturday
SELECT next_day('2020-08-05'.'SU') -- The date of next Sunday: 2020-08-09, that is, this Sunday
Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
Copy the code
  • use

How do you get the Monday of the current date? You just need to get the date corresponding to the current date next Monday and subtract 7 days to get:

SELECT date_add(next_day('2020-08-05'.'MO'),7 -);
Copy the code

Similarly, to obtain the date corresponding to the Sunday of the current date, you only need to obtain the date corresponding to the next Monday of the current date, and then subtract 1 day to obtain:

select date_add(next_day('2020-08-05'.'MO'),- 1) 
- the 2020-08-09
Copy the code

The extraction of month

  • grammar

To extract a month from a single date, the LAST_DAY function changes the date of each month to the last day of the month (28th, 29th, 30th or 31st) as follows:

last_day(STRING date)
Copy the code
  • use
SELECT last_day('2020-08-05'); - the 2020-08-31
Copy the code

In addition to the above methods, you can also use the date_format function, for example:

SELECT date_format('2020-08-05'.'yyyy-MM');
- 2020-08
Copy the code

Range of dates

Window of months: application that uses add_months plus trunc()

Returns the corresponding date after the month
- the 2020-07-05
select add_months('2020-08-05'.- 1)

Returns the date at the beginning of the current date
- the 2020-08-01
select trunc("2020-08-05",'MM')

Copy the code

As you can see from the example above, using add_months and subtracting N months alone will fetch the integer months, but trunc() will start from the first of the previous N months.

-- Select all data from 2020-07-05 to 2020-08-05
BETWEEN add_months('2020-08-05'.- 1) AND '2020-08-05' 
-- Select all data from 2020-07-01 to 2020-08-05
BETWEEN add_months(trunc("2020-08-05",'MM'),- 1) AND '2020-08-05' 

Copy the code

Temporary tables and Common Table Expression (WITH)

These two methods are often used in daily work. For some complicated computing tasks, in order to avoid too many joins, some data to be extracted are usually extracted in the form of temporary table or CTE before the main query block.

Temporary table method:

CREATE TEMPORARY TABLE table_1 AS  
    SELECT 
        columns
    FROM table A;
CREATE TEMPORARY table_2 AS 
    SELECT
        columns
    FROM table B;

SELECT
    table_1.columns,
    table_2.columns, 
    c.columns 
FROM table C JOIN table_1
     JOIN table_2;

Copy the code

CTE method:

Note that Hive and Impala support this syntax. WITH employee_by_title_count AS (SELECT t.name AS job_title, COUNT(e.id) as amount_of_employees FROM employees e JOIN job_titles t on e.job_title_id = t.id GROUP BY 1 ), salaries_by_title AS ( SELECT name as job_title , salary FROM job_titles ) SELECT * FROM employee_by_title_count e JOIN salaries_by_title s ON s.job_title = e.job_titleCopy the code

As you can see, the use of TEMP tables and CTE WITH are very similar, both to make your Query more transparent and elegant. Many people are used to write all queries in a single block, using too many joins or subqueries, resulting in the loss of logic and their own do not know where to write, timely use of TEMP TABLE and CTE as auxiliary, is definitely a plus.

The third is the combination of Aggregation and CASE WHEN

Combining the Aggregation function (SUM/COUNT/COUNT DISTINCT/MIN/MAX) with CASE WHEN is the most powerful and interesting way to use it. This creates an effect similar to the SUMIF/COUNTIF effect in EXCEL, which can be used to do a lot of efficient analysis.

  • Table Name: order
  • Column: register_date, order_date, user_id, country, order_sales, order_id

Data preparation

CREATE TABLE order(
    register_date string,
    order_date string,
    user_id string,
    country string,
    order_sales decimal(10.2),
    order_id string);

INSERT INTO TABLE order VALUES(" 2020-06-07 ", "2020-06-09", "001",'c0'.210,"o1");
INSERT INTO TABLE order VALUES(" 2020-06-08 ", "2020-06-09", "002",'c1'.220,"o2");
INSERT INTO TABLE order VALUES(" 2020-06-07 ", "2020-06-10", "003",'c2'.230,"o3");
INSERT INTO TABLE order VALUES(" 2020-06-09 ", "2020-06-10", "004",'c3'.200,"o4");
INSERT INTO TABLE order VALUES(" 2020-06-07 ", "2020-06-20", "005",'c4'.300,"o5");
INSERT INTO TABLE order VALUES(" 2020-06-10 ", "2020-06-23", "006",'c5'.400,"o6");
INSERT INTO TABLE order VALUES(" 2020-06-07 ", "2020-06-19", "007",'c6'.600,"o7");
INSERT INTO TABLE order VALUES(" 2020-06-12 ", "2020-06-18", "008",'c7'.700,"o8");
INSERT INTO TABLE order VALUES(" 2020-06-07 ", "2020-06-09", "009",'c8'.100,"o9");
INSERT INTO TABLE order VALUES(" 2020-06-15 ", "2020-06-18", "0010",'c9'.200,"o10");
INSERT INTO TABLE order VALUES(" 2020-06-15 ", "2020-06-19", "0011",'c10'.250,"o11");
INSERT INTO TABLE order VALUES(" 2020-06-12 ", "2020-06-29", "0012",'c11'.270,"o12");
INSERT INTO TABLE order VALUES(" 2020-06-16 ", "2020-06-19", "0013",'c12'.230,"o13");
INSERT INTO TABLE order VALUES(" 2020-06-17 ", "2020-06-20", "0014",'c13'.290,"o14");
INSERT INTO TABLE order VALUES(" 2020-06-20 ", "2020-06-29", "0015",'c14'.203,"o15");

Copy the code

CASE WHEN to analyze retention/usage

Allow multiple columns to be de-weighted
set hive.groupby.skewindata = false
Allows grouping or sorting using location numbers
set hive.groupby.orderby.position.alias = true

SELECT
    date_add(Next_day(register_date, 'MO'),- 1) AS week_end,
    COUNT(DISTINCT CASE WHEN order_date BETWEEN register_date AND date_add(register_date,6) THEN user_id END) AS first_week_order,
    COUNT(DISTINCT CASE WHEN order_date BETWEEN date_add(register_date ,7) AND date_add(register_date,13) THEN user_id END) AS sencod_week_order,
    COUNT(DISTINCT CASE WHEN order_date BETWEEN date_add(register_date ,14) AND date_add(register_date,20) THEN user_id END) as third_week_order
FROM order
GROUP BY 1

Copy the code

The above example shows whether the user created an order after registering. For example, how many users placed orders in the first week, the second week and the third week after registration, so as to analyze the usage and retention of users.

Note: For the above usage, two parameters need to be configured:

Hive.groupby. skewinData = false: Deduplication is allowed for multiple columns. Otherwise, an error message is displayed:

SemanticException [Error 10022]: DISTINCT on different columns not supported with skew in data

Hive. Groupby. Orderby. Position. Alias = true: allows the use of position number group or sort, otherwise an error:

SemanticException [Error 10025]: line 79:13 Expression not in GROUP BY key ''MO''

CASE WHEN to analyze the amount spent by each user

SELECT
    user_id,
    SUM (CASE WHEN order_date BETWEEN register_date AND date_add(register_date,6) THEN order_sales END) AS first_week_amount,
    SUM (CASE WHEN order_date BETWEEN date_add(register_date ,7) AND date_add(register_date,13) THEN order_sales END) AS second_week_amount
    FROM order
GROUP BY 1

Copy the code

By screening the date of registration and consumption, and consumption amount statistics, each user in each period of time (the first week after registration, the second week… And so on) consumption amount, can observe whether the user has sustained consumption habit or consumption amount becomes low analysis.

CASE WHEN the amount consumed exceeds a certain quota

SELECT
    user_id,
    COUNT(DISTINCT CASE WHEN order_sales >= 100 THEN order_id END) AS count_of_order_greateer_than_100
FROM order
GROUP BY 1

Copy the code

The example above is similar to countif. For each user, count the number of orders whose value is greater than a certain value and analyze it to screen out high-value customers.

A. WHEN B. WHEN C. WHEN D. WHEN

SELECT
    user_id,
    MIN(CASE WHEN order_sales > 100 THEN order_date END) AS first_order_date_over1000,
    MAX(CASE WHEN order_sales > 100 THEN order_date END) AS recent_order_date_over100
FROM order
GROUP BY 1

Copy the code

CASE WHEN plus MIN/MAX time, the date of the first purchase over a certain amount of money and the date of the last purchase over a certain amount of money can be obtained during the entire use of the user.

Fourth: Other uses of Window Function

Window Function is often used in work and is often asked in interviews. The common usage scenario is to group topN. Another usage described in this article is to use windowing functions to analyze user access sessions.

A session is a series of user interactions that occur on a web site during a specified period of time. For example, a session can contain multiple Web browsing, events, social interactions, and e-commerce transactions. A session is a container that contains the user’s actions on the web site.

A session has an expiration time, such as 30 minutes. If the session is inactive for more than 30 minutes, it will become obsolete.

Suppose John visits the site, and the time starts from the moment he arrives. If 30 minutes pass and Zhang SAN still does not interact with each other in any form, this session is deemed to end. However, whenever John interacts with an element (such as an event, social interaction, or opening of a new web page), an additional 30 minutes will be added to the time of that interaction, resetting the expiration time.

Data preparation

  • Table Name: user_visit_action
  • Columns: user_id, session_id , page_url, action_time
CREATE TABLE user_visit_action( 
    user_id string,
    session_id string,
    page_url string,
    action_time string);
    
INSERT INTO TABLE user_visit_action VALUES(" 001 ", "ss001", "http://a.com", "the 2020-08-06 13:34:11. 478");INSERT INTO TABLE user_visit_action VALUES(" 001 ", "ss001", "http://b.com", "the 2020-08-06 13:35:11. 478");INSERT INTO TABLE user_visit_action VALUES(" 001 ", "ss001", "http://c.com", "the 2020-08-06 13:36:11. 478");INSERT INTO TABLE user_visit_action VALUES(" 001 ", "ss002", "http://a.com", "the 2020-08-06 14:30:11. 478");INSERT INTO TABLE user_visit_action VALUES(" 001 ", "ss002", "http://b.com", "the 2020-08-06 14:31:11. 478");INSERT INTO TABLE user_visit_action VALUES(" 001 ", "ss002", "http://e.com", "the 2020-08-06 14:33:11. 478");INSERT INTO TABLE user_visit_action VALUES(" 001 ", "ss002", "http://f.com", "the 2020-08-06 14:35:11. 478");INSERT INTO TABLE user_visit_action VALUES(" 002 ", "ss003", "http://u.com", "the 2020-08-06 18:34:11. 478");INSERT INTO TABLE user_visit_action VALUES(" 002 ", "ss003", "http://k.com", "the 2020-08-06 18:38:11. 478");Copy the code

User access session analysis

The sample data sheet is shown above, with links and times for users, visits, and pages. The following uses partition by to express the browsing behavior of each user between different visits.

SELECT
    user_id,
    session_id,
    page_url,
    DENSE_RANK(a)OVER (PARTITION BY user_id, session_id ORDER BY action_time ASC) AS page_order,
    MIN(action_time) OVER (PARTITION BY user_id, session_id) AS session_start_time,
    MAX(action_time) OVER (PARTITION BY user_id, session_id) AS session_finisht_time
FROM user_visit_action

Copy the code

The above query returns the order of each user, each visit, page browsing activity, and the start and end times of the session. Based on this, the results can be stored in a TEMP TABLE or CTE for further analysis.

summary

This article focuses on four common SQL usage tips that you will encounter at work and in interviews. Of course, all of this is relevant to the specific business of analysis. Finally, whether you are SQL boy or SQL Girl, Happy SQL Querying 😊 can be believed with a few skills.

The public account “Big Data Technology and Data Warehouse”, reply to “information” to receive the big data data package