1. Order information table
Suppose you have a large terabyte table (order information table) oeder_info with the following fields:
Prov_id ID (province) | City_id (city ID) | Area_id (ID) | Town_id streets (ID) | Date (order date) | Order_type (Order type: TC same-city/NOT_TC non-same-city) | Order_money (Order Amount: YUAN) |
---|---|---|---|---|---|---|
001 | 00101 | 0010101 | 001010101 | 2021-12-06 | tc | 100 |
001 | 00101 | 0010102 | 001010203 | 2021-12-06 | tc | 100 |
002 | 00202 | 0020202 | 002020202 | 2021-12-07 | not_tc | 500 |
003 | 00303 | 0030303 | 003030303 | 2021-12-08 | tc | 800 |
. | . | . | . | . | . | . |
2. Dimension table of order administrative region
Another large table (order administrative area dimension table), DIM_ORDER_area, has the following fields:
Prov_id ID (province) | Prov_name (Province name) | City_id (city ID) | City_name (city name) | Area_id (ID) | Area_name (zone name) | Town_id streets (ID) | Town_name (Street name) |
---|---|---|---|---|---|---|---|
001 | Henan province | 00101 | The city of luoyang | 0010101 | jianxi | 001010101 | Xiyuan streets |
001 | Henan province | 00101 | The city of luoyang | 0010102 | Los dragon area | 001010203 | Kaiyuan avenue. |
001 | . | . | . | . | . | . | . |
002 | Zhejiang province | 00202 | hangzhou | 0020202 | lake | 002020202 | Jiang village streets |
002 | . | . | . | . | . | . | . |
003 | Shanghai | 00303 | Shanghai | 0030303 | Yangpu district | 003030303 | Pentagonal Street |
003 | . | . | . | . | . | . | . |
. | . | . | . | . | . | . | . |
3. Splicing requirements of two tables
Now we need to associate the two tables and connect oeder_INFO to DIM_ORDER_area to get the complete order list with the province/city/region ID and the province/city/region name. The requirements list contains the total number of orders for each street dimension, the total amount of orders, the total number of non-city orders, grouped by order date (day dimension), and only orders for 2021 are screened. | | prov_id ID (province) prov_name (province) | city_id (city ID) | city_name (city name) | area_id area (ID) | area_name area (name) | town_id | ID (street) Town_name (street name) | date (order date) | not_tc_order_count city streets (dimension of the door orders) | street total_order_money (dimensions of the total transaction amount) | | — — — — — — — – | ——– | ——– | ——– | ——– | ——– | ——– | ——– | ——– | ——– | ——– | ——– | — — — — — — — – | — — — — — — — – | — — — — — — — – | | 001 | | 00101 | in henan luoyang | | 001010203 | 0010102 | dragon area kaiyuan road | | 98 | 9800 2021-12-06 |
I used the following SQL statement to query (not optimized) :
SELECT
order_tab.prov_id AS prov_id, ID - province
area_tab.prov_name AS prov_name, - name of province
order_tab.city_id AS city_id, - city ID
area_tab.name AS city_name, - the city name
order_tab.area_id AS area_id, - area ID
area_tab.area_name AS area_name, - the area name
order_tab.town_id AS town_id, ID - street
area_tab.town_name AS town_name, -- Street name
order_tab.date AS sign_date, -- Order Date
COUNT(*) AS town_total, -- Count the total order quantity
-- Count the total number of non-city orders
SUM(CASE WHEN order_tab.order_type = 'not_tc' THEN 1 ELSE 0 END) AS not_tc_order_count,
SUM(order_tab.order_money) AS total_order_money Count the total amount of orders from the street dimension
FROM
order_info AS order_tab -- Order information sheet
LEFT JOIN
Join the order administrative area dimension table outside left
dim_order_area AS area_tab
ON
order_tab.prov_id = area_tab.prov_id AND
order_tab.city_id = area_tab.city_id AND
order_tab.area_id = area_tab.area_id AND
order_tab.town_id = area_tab.town_id
GROUP BY Group by the following fields
order_tab.city_id
, order_tab.prov_id
, order_tab.area_id
, order_tab.town_id
, order_tab.date
,area_tab.prov_name
,area_tab.city_name
,area_tab.county_name
,area_tab.town_name
HAVING
YEAR(sign_date) = 2021; -- Screen orders for 2021 only
Copy the code
The above SQL can solve the above query requirements, but will encounter a problem, as shown below:
.LEFT JOIN
Join the order administrative area dimension table outside left
dim_order_area AS area_tab
ON --
order_tab.prov_id = area_tab.prov_id AND
order_tab.city_id = area_tab.city_id AND
order_tab.area_id = area_tab.area_id AND
order_tab.town_id = area_tab.town_id
...
Copy the code
Here, through a series of AND splicing judgment conditions (union), the result may have errors. If town_id in an order information table is empty AND other province ids are normal, then it cannot be matched by the LEFT JOIN splicing condition AND will be directly abandoned.
Improvement plan:
Change the judgment condition of LEFT JOIN union stitching to single successive stitching:
LEFT JOIN
-- Left outer join order administrative area dimension table (concatenate province name, * and duplicate by DISTINCT *)
(SELECT prov_name,prov_id FROM dim_order_area) AS area_tab1
ON
order_tab.prov_id = area_tab1.prov_id
LEFT JOIN
-- Left outer join order Administrative area dimension table (concatenate city name, * and duplicate by DISTINCT *)
(SELECT city_id,city_name FROM dim_order_area) AS area_tab2
ON
order_tab.city_id = area_tab2.city_id
LEFT JOIN
-- Left outer join order administrative area dimension table (concatenate area name, * and duplicate by DISTINCT *)
(SELECT county_id,county_name FROM dim_order_area) AS area_tab3
ON
order_tab.area_id = area_tab3.county_id
LEFT JOIN
-- Left outer join order Administrative area dimension table (concatenate street names, * and duplicate by DISTINCT *)
(SELECT town_id,town_name FROM dim_order_area) AS area_tab4
ON
order_tab.town_id = area_tab4.town_id
Copy the code
But there is also a problem: the query time is too long!
- Reason: Multiple times
LEFT JOIN
, each sub-query will concatenate two tables to generate an intermediate table, and then perform cartesian product concatenation for four consecutive times. - Solution:
DISTINCT
Keyword deduplication: remove the repeated data in each sub-query (eg: select only provinces to repeat the first time, select only cities to repeat the second time……) , which can greatly reduce the cost of intermediate table generation and speed up the query time.
Adjust the optimized SQL statement:
SELECT
order_tab.prov_id AS prov_id, ID - province
area_tab1.prov_name AS prov_name, - name of province
order_tab.city_id AS city_id, - city ID
area_tab2.name AS city_name, - the city name
order_tab.area_id AS area_id, - area ID
area_tab3.area_name AS area_name, - the area name
order_tab.town_id AS town_id, ID - street
area_tab4.town_name AS town_name, -- Street name
order_tab.date AS sign_date, -- Order Date
COUNT(*) AS town_total, -- Count the total order quantity
-- Count the total number of non-city orders
SUM(CASE WHEN order_tab.order_type = 'not_tc' THEN 1 ELSE 0 END) AS not_tc_order_count,
SUM(order_tab.order_money) AS order_money Count the total amount of orders from the street dimension
FROM
order_info AS order_tab -- Order information sheet
LEFT JOIN
-- Left outer join order administrative area dimension table (concatenate province name, * and duplicate by DISTINCT *)
(SELECT DISTINCT prov_name,prov_id FROM dim_order_area) AS area_tab1
ON
order_tab.prov_id = area_tab1.prov_id
LEFT JOIN
-- Left outer join order Administrative area dimension table (concatenate city name, * and duplicate by DISTINCT *)
(SELECT DISTINCT city_id,city_name FROM dim_order_area) AS area_tab2
ON
order_tab.city_id = area_tab2.city_id
LEFT JOIN
-- Left outer join order administrative area dimension table (concatenate area name, * and duplicate by DISTINCT *)
(SELECT DISTINCT county_id,county_name FROM dim_order_area) AS area_tab3
ON
order_tab.area_id = area_tab3.county_id
LEFT JOIN
-- Left outer join order Administrative area dimension table (concatenate street names, * and duplicate by DISTINCT *)
(SELECT DISTINCT town_id,town_name FROM dim_order_area) AS area_tab4
ON
order_tab.town_id = area_tab4.town_id
GROUP BY Group by the following fields
order_tab.city_id
, order_tab.prov_id
, order_tab.area_id
, order_tab.town_id
, order_tab.date
,area_tab1.prov_name
,area_tab2.city_name
,area_tab3.county_name
,area_tab4.town_name
HAVING
YEAR(sign_date) = 2021; -- Screen orders for 2021 only
Copy the code