1. Order information table

Suppose you have a large terabyte table (order information table) oeder_info with the following fields:

Prov_id ID (province) City_id (city ID) Area_id (ID) Town_id streets (ID) Date (order date) Order_type (Order type: TC same-city/NOT_TC non-same-city) Order_money (Order Amount: YUAN)
001 00101 0010101 001010101 2021-12-06 tc 100
001 00101 0010102 001010203 2021-12-06 tc 100
002 00202 0020202 002020202 2021-12-07 not_tc 500
003 00303 0030303 003030303 2021-12-08 tc 800
. . . . . . .

2. Dimension table of order administrative region

Another large table (order administrative area dimension table), DIM_ORDER_area, has the following fields:

Prov_id ID (province) Prov_name (Province name) City_id (city ID) City_name (city name) Area_id (ID) Area_name (zone name) Town_id streets (ID) Town_name (Street name)
001 Henan province 00101 The city of luoyang 0010101 jianxi 001010101 Xiyuan streets
001 Henan province 00101 The city of luoyang 0010102 Los dragon area 001010203 Kaiyuan avenue.
001 . . . . . . .
002 Zhejiang province 00202 hangzhou 0020202 lake 002020202 Jiang village streets
002 . . . . . . .
003 Shanghai 00303 Shanghai 0030303 Yangpu district 003030303 Pentagonal Street
003 . . . . . . .
. . . . . . . .

3. Splicing requirements of two tables

Now we need to associate the two tables and connect oeder_INFO to DIM_ORDER_area to get the complete order list with the province/city/region ID and the province/city/region name. The requirements list contains the total number of orders for each street dimension, the total amount of orders, the total number of non-city orders, grouped by order date (day dimension), and only orders for 2021 are screened. | | prov_id ID (province) prov_name (province) | city_id (city ID) | city_name (city name) | area_id area (ID) | area_name area (name) | town_id | ID (street) Town_name (street name) | date (order date) | not_tc_order_count city streets (dimension of the door orders) | street total_order_money (dimensions of the total transaction amount) | | — — — — — — — – | ——– | ——– | ——– | ——– | ——– | ——– | ——– | ——– | ——– | ——– | ——– | — — — — — — — – | — — — — — — — – | — — — — — — — – | | 001 | | 00101 | in henan luoyang | | 001010203 | 0010102 | dragon area kaiyuan road | | 98 | 9800 2021-12-06 |

I used the following SQL statement to query (not optimized) :

SELECT
    order_tab.prov_id AS prov_id, ID - province
    area_tab.prov_name AS prov_name, - name of province
    order_tab.city_id AS city_id, - city ID
    area_tab.name AS city_name, - the city name
    order_tab.area_id AS area_id, - area ID
    area_tab.area_name AS area_name, - the area name
    order_tab.town_id AS town_id, ID - street
	  area_tab.town_name AS town_name, -- Street name
    order_tab.date AS sign_date, -- Order Date
    COUNT(*) AS town_total, -- Count the total order quantity
	-- Count the total number of non-city orders
    SUM(CASE WHEN order_tab.order_type = 'not_tc' THEN 1 ELSE 0 END) AS not_tc_order_count, 
	SUM(order_tab.order_money) AS total_order_money Count the total amount of orders from the street dimension
FROM 
    order_info AS order_tab -- Order information sheet
LEFT JOIN 
     Join the order administrative area dimension table outside left
    dim_order_area AS area_tab
ON 
    order_tab.prov_id = area_tab.prov_id AND 
	order_tab.city_id = area_tab.city_id AND 
	order_tab.area_id = area_tab.area_id AND 
	order_tab.town_id = area_tab.town_id
GROUP BY Group by the following fields
     order_tab.city_id
    , order_tab.prov_id
    , order_tab.area_id
    , order_tab.town_id
    , order_tab.date
    ,area_tab.prov_name
    ,area_tab.city_name
    ,area_tab.county_name
    ,area_tab.town_name
HAVING
    YEAR(sign_date) = 2021; -- Screen orders for 2021 only
Copy the code

The above SQL can solve the above query requirements, but will encounter a problem, as shown below:

.LEFT JOIN 
     Join the order administrative area dimension table outside left
    dim_order_area AS area_tab
ON  -- 
    order_tab.prov_id = area_tab.prov_id AND 
	order_tab.city_id = area_tab.city_id AND 
	order_tab.area_id = area_tab.area_id AND 
	order_tab.town_id = area_tab.town_id
...
Copy the code

Here, through a series of AND splicing judgment conditions (union), the result may have errors. If town_id in an order information table is empty AND other province ids are normal, then it cannot be matched by the LEFT JOIN splicing condition AND will be directly abandoned.

Improvement plan:

Change the judgment condition of LEFT JOIN union stitching to single successive stitching:

LEFT JOIN 
     -- Left outer join order administrative area dimension table (concatenate province name, * and duplicate by DISTINCT *)
    (SELECT prov_name,prov_id FROM dim_order_area) AS area_tab1 
ON 
    order_tab.prov_id = area_tab1.prov_id
LEFT JOIN 
     -- Left outer join order Administrative area dimension table (concatenate city name, * and duplicate by DISTINCT *)
    (SELECT city_id,city_name FROM dim_order_area) AS area_tab2  
ON 
    order_tab.city_id = area_tab2.city_id
LEFT JOIN 
     -- Left outer join order administrative area dimension table (concatenate area name, * and duplicate by DISTINCT *)
    (SELECT county_id,county_name FROM dim_order_area) AS area_tab3  
ON 
    order_tab.area_id = area_tab3.county_id
LEFT JOIN 
     -- Left outer join order Administrative area dimension table (concatenate street names, * and duplicate by DISTINCT *)
    (SELECT town_id,town_name FROM dim_order_area) AS area_tab4  
ON 
    order_tab.town_id = area_tab4.town_id
Copy the code

But there is also a problem: the query time is too long!

  • Reason: Multiple timesLEFT JOIN, each sub-query will concatenate two tables to generate an intermediate table, and then perform cartesian product concatenation for four consecutive times.
  • Solution:DISTINCTKeyword deduplication: remove the repeated data in each sub-query (eg: select only provinces to repeat the first time, select only cities to repeat the second time……) , which can greatly reduce the cost of intermediate table generation and speed up the query time.

Adjust the optimized SQL statement:

SELECT
    order_tab.prov_id AS prov_id, ID - province
    area_tab1.prov_name AS prov_name, - name of province
    order_tab.city_id AS city_id, - city ID
    area_tab2.name AS city_name, - the city name
    order_tab.area_id AS area_id, - area ID
    area_tab3.area_name AS area_name, - the area name
    order_tab.town_id AS town_id, ID - street
	  area_tab4.town_name AS town_name, -- Street name
    order_tab.date AS sign_date, -- Order Date
    COUNT(*) AS town_total, -- Count the total order quantity
	-- Count the total number of non-city orders
    SUM(CASE WHEN order_tab.order_type = 'not_tc' THEN 1 ELSE 0 END) AS not_tc_order_count, 
	SUM(order_tab.order_money) AS order_money Count the total amount of orders from the street dimension
FROM 
    order_info AS order_tab -- Order information sheet
LEFT JOIN 
     -- Left outer join order administrative area dimension table (concatenate province name, * and duplicate by DISTINCT *)
    (SELECT DISTINCT prov_name,prov_id FROM dim_order_area) AS area_tab1 
ON 
    order_tab.prov_id = area_tab1.prov_id
LEFT JOIN 
     -- Left outer join order Administrative area dimension table (concatenate city name, * and duplicate by DISTINCT *)
    (SELECT DISTINCT city_id,city_name FROM dim_order_area) AS area_tab2  
ON 
    order_tab.city_id = area_tab2.city_id
LEFT JOIN 
     -- Left outer join order administrative area dimension table (concatenate area name, * and duplicate by DISTINCT *)
    (SELECT DISTINCT county_id,county_name FROM dim_order_area) AS area_tab3  
ON 
    order_tab.area_id = area_tab3.county_id
LEFT JOIN 
     -- Left outer join order Administrative area dimension table (concatenate street names, * and duplicate by DISTINCT *)
    (SELECT DISTINCT town_id,town_name FROM dim_order_area) AS area_tab4  
ON 
    order_tab.town_id = area_tab4.town_id
GROUP BY Group by the following fields
     order_tab.city_id
    , order_tab.prov_id
    , order_tab.area_id
    , order_tab.town_id
    , order_tab.date
    ,area_tab1.prov_name
    ,area_tab2.city_name
    ,area_tab3.county_name
    ,area_tab4.town_name
HAVING
    YEAR(sign_date) = 2021; -- Screen orders for 2021 only
Copy the code