Abstract: GaussDB(DWS) supports the MERGE INTO function, which can simultaneously update and insert large amounts of data. It is a very important technology for data warehouse.
This article is shared by Acydy from A Tip on How to Efficiently Import and Update Data in Batches.
preface
What if we have a table and we want to update it and insert it? You can use updates and inserts to accomplish your goals.
If you have a large amount of data and want to complete the task as soon as possible, is there any other solution? Do not miss the MERGE INTO function of GaussDB(DWS).
The MERGE INTO concept
MERGE INTO is a standard introduced in SQL 2003.
If a table T, as well as being updatable, is insertable-into, then rows can be inserted into it (subject to applicable Access Rules and Conformance Rules). The primary effect of an on T is to insert into T each of the zero or more rows contained in a specified table. The primary effect of a on T is to replace zero or more rows in T with specified rows and/or to insert into T zero or more specified rows, depending on the result of a and on whether one or both of and are specified.
A table can be either updated or inserted in a statement. Whether it is updated or inserted depends on the result of the search condition and the specified merge when matched clause and merge when not matched Clause (what to do when condition does not match)
SQL 2008 has been extended to use multiple MATCHED and NOT MATCHED.
MERGE has been extended to support multiple MATCHED and NOT MATCHED clauses, each accompanied by a search condition, that gives much greater flexibility in the coding of complex MERGE statements to handle update conflicts.
The MERGE INTO command involves two tables. Target table: the table to be inserted or updated. Source table: The table used to match the target table. The source of the target table’s data.
The MERGE INTO statement matches the data in the target table and the source table for association conditions. If the association conditions match, the target table is updated. If the association conditions do not match, the target table is inserted.
Usage scenario: When a large amount of data in a table needs to be added to an existing table, MERGE INTO can be used to efficiently import data, avoiding multiple INSERT and UPDATE operations.
The MERGE INTO grammar
GaussDB(DWS) MERGE INTO the syntax is as follows:
MERGE INTO table_name [ [ AS ] alias ] USING { { table_name | view_name } | subquery } [ [ AS ] alias ] ON ( condition ) [ WHEN MATCHED THEN UPDATE SET { column_name = { expression | DEFAULT } | ( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] ) } [, ...] [ WHERE condition ] ] [ WHEN NOT MATCHED THEN INSERT { DEFAULT VALUES | [ ( column_name [, ...] ) ] VALUES ( { expression | DEFAULT } [, ...] ) [, ...] [ WHERE condition ] } ];Copy the code
-
INTO Specifies the target table.
-
USING Specifies the source table. The source table can be a normal table or a subquery.
-
ON association conditions, which specify the association conditions of the target and source tables.
-
WHEN MATCHED WHEN the data in the source table and target table can match the association conditions, select the WHEN MATCHED clause to perform the UPDATE operation.
-
WHEN NOT MATCHED WHEN the data in the source table and target table cannot match the association conditions, select the WHEN NOT MATCHED clause to perform INSERT.
-
WHEN MATCHED, WHEN NOT MATCHED can default one, NOT multiple.
-
WHEN MATCHED, WHEN NOT MATCHED can use WHERE to perform conditional filtering.
-
WHEN MATCHED, WHEN NOT MATCHED order can be exchanged.
Practical application
Start by creating the following tables to perform the MREGE INTO operation.
gaussdb=# CREATE TABLE dst (
product_id INT,
product_name VARCHAR(20),
category VARCHAR(20),
total INT
) DISTRIBUTE BY HASH(product_id);
gaussdb=# CREATE TABLE dst_data (
product_id INT,
product_name VARCHAR(20),
category VARCHAR(20),
total INT
) DISTRIBUTE BY HASH(product_id);
gaussdb=# CREATE TABLE src (
product_id INT,
product_name VARCHAR(20),
category VARCHAR(20),
total INT
) DISTRIBUTE BY HASH(product_id);
gaussdb=# INSERT INTO dst_data VALUES(1601,'lamaze','toys',100),(1600,'play gym','toys',100),(1502,'olympus','electrncs',100),(1501,'vivitar','electrnc',100),(1666,'harry potter','dvd',100);
gaussdb=# INSERT INTO src VALUES(1700,'wait interface','books',200),(1666,'harry potter','toys',200),(1601,'lamaze','toys',200),(1502,'olympus camera','electrncs',200);
gaussdb=# INSERT INTO dst SELECT * FROM dst_data;
Copy the code
WHEN MATCHED and WHEN NOT MATCHED
- Review the plan to see how MERGE INTO is executed.
MERGE INTO JOIN To MERGE the two tables. The association conditions are specified after ON.
gaussdb=# EXPLAIN (COSTS off)
MERGE INTO dst x
USING src y
ON x.product_id = y.product_id
WHEN MATCHED THEN
UPDATE SET product_name = y.product_name, category = y.category, total = y.total
WHEN NOT MATCHED THEN
INSERT VALUES (y.product_id, y.product_name, y.category, y.total);
QUERY PLAN
--------------------------------------------------
id | operation
-----+--------------------------------------------
1 | -> Streaming (type: GATHER)
2 | -> Merge on dst x
3 | -> Streaming(type: REDISTRIBUTE)
4 | -> Hash Left Join (5, 6)
5 | -> Seq Scan on src y
6 | -> Hash
7 | -> Seq Scan on dst x
Predicate Information (identified by plan id)
------------------------------------------------
4 --Hash Left Join (5, 6)
Hash Cond: (y.product_id = x.product_id)
(14 rows)
Copy the code
Why is this converted to LEFT JOIN?
Because the target table needs to be updated when it matches the source table, data is inserted into the target table when it does not match. That is, some of the data from the source table is used to update the target table and some of the data is used to insert into the target table. The semantics are similar to LEFT JOIN.
5 --Seq Scan on public.src y
Output: y.product_id, y.product_name, y.category, y.total, y.ctid
Distribute Key: y.product_id
6 --Hash
Output: x.product_id, x.product_name, x.category, x.total, x.ctid, x.xc_node_id
7 --Seq Scan on public.dst x
Output: x.product_id, x.product_name, x.category, x.total, x.ctid, x.xc_node_id
Distribute Key: x.product_id
Copy the code
- Run MERGE INTO to view the result.
The two tables can be related if product_id is 1502,1601,1666, so the three records are updated. If SRC table product_id is 1700, no match, insert this record. Others are not modified.
gaussdb=# SELECT * FROM dst ORDER BY 1; product_id | product_name | category | total ------------+--------------+-----------+------- 1501 | vivitar | electrnc | 100 1502 | olympus | electrncs | 100 1600 | play gym | toys | 100 1601 | lamaze | toys | 100 1666 | harry potter | dvd | 100 (5 rows) gaussdb=# SELECT * FROM src ORDER BY 1; product_id | product_name | category | total ------------+----------------+-----------+------- 1502 | olympus camera | electrncs | 200 1601 | lamaze | toys | 200 1666 | harry potter | toys | 200 1700 | wait interface | books | 200 (4 rows) gaussdb=# MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = y.product_name, category = y.category, total = y.total WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total); MERGE 4 gaussdb=# SELECT * FROM dst ORDER BY 1; product_id | product_name | category | total ------------+----------------+-----------+------- 1501 | vivitar | electrnc 1502 | | 100 -- unmodified push around camera | electrncs 1600 | | 200 - update play gym | toys 1601 | | 100 -- unmodified lamaze | toys | 200-1666 | update harry potter | toys | 200 -- updated 1700 | wait interface | books | 200 - insert (6 rows)Copy the code
- Check the number of updates and inserts
You can view the number of updates and inserts using EXPLAIN PERFORMANCE or EXPLAIN ANALYZE. (Only necessary parts are shown here)
You can see in the Predicate Information section that a total of one is inserted and three are updated.
You can see the Information for each node in the Datanode Information section. Datanode1 update 2, datanode2 insert 1, update 1.
gaussdb=# EXPLAIN PERFORMANCE MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = y.product_name, category = y.category, total = y.total WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total); Predicate Information (identified by plan id) ------------------------------------------------ 2 --Merge on public.dst x Merge Inserted: 1 Merge Updated: 3 Datanode Information (identified by plan id) --------------------------------------------------------------------------------------- 2 --Merge on public.dst x datanode1 (Tuple Inserted 0, Tuple Updated 2) datanode2 (Tuple Inserted 1, Tuple Updated 1)Copy the code
WHEN NOT MATCHED
- Since there is no WHEN NOT MATCHED part, there is no need to perform any operation WHEN the two tables do NOT match, so the data of the source table is NOT needed, so only inner join is needed.
- View the result. MERGE INTO only operates on three pieces of data.
WHEN NOT MATCHED
- Insert only if there is a mismatch. No data was updated in the results.
WHERE filter criteria
The semantics are to determine whether the current row meets the filtering criteria before update or insert. If not, no update or insert is performed. If you do not want a field to be updated, you need to specify filtering conditions.
The following example will update only rows of product_name = ‘Olympus’ when two tables are associable. When two tables cannot be associated and the source table product_id! = 1700 before insertion.
gaussdb=# truncate dst; gaussdb=# INSERT INTO dst SELECT * FROM dst_data; gaussdb=# MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = y.product_name, category = y.category, total = y.total WHERE x.product_name = 'olympus' WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total) WHERE y.product_id ! = 1700; MERGE 1 gaussdb=# SELECT * FROM dst ORDER BY 1; SELECT * FROM dst ORDER BY 1; product_id | product_name | category | total ------------+----------------+-----------+------- 1501 | vivitar | electrnc | 100 1502 | olympus camera | electrncs | 200 1600 | play gym | toys | 100 1601 | lamaze | toys | 100 1666 | harry potter | dvd | 100 (5 rows)Copy the code
The subquery
Subqueries can be used in the USING section for more complex association operations.
-
The result of the aggregation operation on the source table is then matched with the target table
MERGE INTO dst x USING ( SELECT product_id, product_name, category, sum(total) AS total FROM src group by product_id, product_name, category ) y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = x.product_name, category = x.category, total = x.total WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total + 200);
-
The result of the UNION of multiple tables is then matched with the target table
MERGE INTO dst x USING ( SELECT 1501 AS product_id, ‘vivitar 35mm’ AS product_name, ‘electrncs’ AS category, 100 AS total UNION ALL SELECT 1666 AS product_id, ‘harry potter’ AS product_name, ‘dvd’ AS category, 100 AS total ) y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = x.product_name, category = x.category, total = x.total WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total + 200);
The stored procedure
gaussdb=# CREATE OR REPLACE PROCEDURE store_procedure1()
AS
BEGIN
MERGE INTO dst x
USING src y
ON x.product_id = y.product_id
WHEN MATCHED THEN
UPDATE SET product_name = y.product_name, category = y.category, total = y.total;
END;
/
CREATE PROCEDURE
gaussdb=# CALL store_procedure1();
Copy the code
The principle behind MERGE INTO
As mentioned above, MREGE INTO is converted to a LEFT JOIN or INNER JOIN to associate the target table with the source table. So how do you know if a row is to be updated or inserted?
Run EXPLAIN VERBOSE to view the output of the operator. Ctid columns were output when both tables were scanned. So what does the CTID column do?
5 --Seq Scan on public.src y
Output: y.product_id, y.product_name, y.category, y.total, y.ctid
Distribute Key: y.product_id
6 --Hash
Output: x.product_id, x.product_name, x.category, x.total, x.ctid, x.xc_node_id
7 --Seq Scan on public.dst x
Output: x.product_id, x.product_name, x.category, x.total, x.ctid, x.xc_node_id
Distribute Key: x.product_id
Copy the code
Ctid identifies the specific location of the row in storage. After knowing this location, you can update the data at this location. As an MPP distributed database, GaussDB(DWS) also needs to know node information (xc_node_id). The UPDATE operation requires both values.
In MREGE INTO, cTID has another trick. When the target table matches and needs to be updated, the cTID value of the row is preserved. If not, insert it. Ctid is NULL. Depending on whether the CTID output of the LEFT JOIN is NULL, the row should be updated or inserted.
In this way, a row can be updated or inserted according to the CTID column output after the JOIN operation on the two tables.
Matters needing attention
When using MERGE INTO, pay attention to whether the matching conditions are appropriate. If you are not careful, data may be updated unexpectedly, and the entire table may be updated.
conclusion
GAUSSDB(DWS) provides efficient data import function MERGE INTO, which is a key function for a data warehouse. You can use MERGE INTO to update and insert a table at the same time and import data quickly even with very large volumes of data.
For more information about GuassDB(DWS), please search “GaussDB DWS” on wechat to follow the wechat public account, and share with you the latest and most complete PB series warehouse black technology, the background can also obtain many learning materials oh ~
Click to follow, the first time to learn about Huawei cloud fresh technology ~