Abstract: Clickhouse, as an OLAP database, has very limited transaction support. This article mainly introduces how to update and delete Clickhouse data by ReplacingMergeTree.
This article is from the huawei cloud community “How to Implement Clickhouse Data Update”, by Xiao Bawang.
Clickhouse, as an OLAP database, has very limited support for transactions. Clickhouse provides MUTATION operations (via ALTER TABLE statements) to UPDATE and DELETE data, but these are “heavy” operations. Unlike the standard SQL syntax of UPDATE and DELETE, they are performed asynchronously and are useful for infrequent updates or deletes of bulk data. Refer to altinity.com/blog/2018/1… . In addition to the MUTATION operation, Clickhouse can also through CollapsingMergeTree, VersionedCollapsingMergeTree, ReplacingMergeTree combining with specific business data structure to implement the data update, delete, All three methods use INSERT statements to INSERT the latest data. The new data “cancels” or “replaces” the old data, but the “cancel” or “replace” occurs when the data file is merged. That is, before the Merge, the new data and the old data exist together. Therefore, we need to do some processing when querying to avoid querying old data. Clickhouse official documentation provided the guidance of using CollapsingMergeTree, VersionedCollapsingMergeTree, clickhouse.com/docs/en/eng… . Compared with CollapsingMergeTree, VersionedCollapsingMergeTree need to mark a field, version, use ReplacingMergeTree to realize data update delete will be more convenient, Here is how to use ReplacingMergeTree to update and delete data.
Let’s assume a scenario requiring frequent data updates, such as the statistics on the electricity consumption of users in a city. As we know, the electricity consumption of users may change every minute and every second, so frequent data updates will be involved. First, create a table to record the electricity consumption of all users in a city.
CREATE TABLE IF NOT EXISTS default.PowerConsumption_local ON CLUSTER default_cluster (User_ID UInt64 COMMENT 'user ID', Record_Time DateTime DEFAULT toDateTime(0) COMMENT 'District_Code UInt8 COMMENT ', Address String COMMENT 'user Address ', Power UInt64 COMMENT' electricity consumption ', Deleted BOOLEAN DEFAULT 0 COMMENT 'Whether data has been Deleted') ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/default.PowerConsumption_local/{shard}', '{replica}', Record_Time) ORDER BY (User_ID, Address) PARTITION BY District_Code; CREATE TABLE default.PowerConsumption ON CLUSTER default_cluster AS default.PowerConsumption_local ENGINE = Distributed(default_cluster, default, PowerConsumption_local, rand());Copy the code
PowerConsumption_local is the local table, and PowerConsumption is the corresponding distributed table. Including PowerConsumption_local using ReplicatedReplacingMergeTree watch engine, the third argument ‘Record_Time said more than the same primary key data, will only retain Record_Time the biggest one, We use this feature of ReplacingMergeTree to update and delete data. Therefore, when selecting a primary key, we need to ensure that the primary key is unique. Here, we select (User_ID, Address) as the primary key, because the user ID plus the user’s Address can determine a unique electricity meter, and the second one will not be the same. Therefore, for multiple data of an electricity meter, only the latest electricity record time will be retained.
Then we insert 10 entries into the table:
INSERT INTO default.PowerConsumption VALUES (0, '2021-10-30 12:00:00', 3, 'Yanta', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (1, '2021-10-30 12:10:00', 2, 'Beilin', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (2, '2021-10-30 12:15:00', 1, 'Weiyang', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (3, '2021-10-30 12:18:00', 1, 'Gaoxin', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (4, '2021-10-30 12:23:00', 2, 'Qujiang', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (5, '2021-10-30 12:43:00', 3, 'Baqiao', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (6, '2021-10-30 12:45:00', 1, 'Lianhu', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (7, '2021-10-30 12:46:00', 3, 'Changan', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (8, '2021-10-30 12:55:00', 1, 'Qianhan', rand64() % 1000 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (9, '2021-10-30 12:57:00', 4, 'Fengdong', rand64() % 1000 + 1, 0);
Copy the code
The data in the table is as follows:
Suppose we now want all user data with administrative code 1 to be updated, we insert the latest data:
INSERT INTO default.PowerConsumption VALUES (2, now(), 1, 'Weiyang', rand64() % 100 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (3, now(), 1, 'Gaoxin', rand64() % 100 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (6, now(), 1, 'Lianhu', rand64() % 100 + 1, 0);
INSERT INTO default.PowerConsumption VALUES (8, now(), 1, 'Qianhan', rand64() % 100 + 1, 0);
Copy the code
After the latest data is inserted, the data in the table is as follows:
ArgMax (a, b) takes the value of a, which is the maximum value of b. ArgMax (a, b) takes the value of a, which is the maximum value of B. So you can get only the latest data by using the following query:
SELECT
User_ID,
max(Record_Time) AS R_Time,
District_Code,
Address,
argMax(Power, Record_Time) AS Power,
argMax(Deleted, Record_Time) AS Deleted
FROM default.PowerConsumption
GROUP BY
User_ID,
Address,
District_Code
HAVING Deleted = 0;
Copy the code
The query results are as follows:
To make our query easier, we can create a view:
CREATE VIEW PowerConsumption_view ON CLUSTER default_cluster AS
SELECT
User_ID,
max(Record_Time) AS R_Time,
District_Code,
Address,
argMax(Power, Record_Time) AS Power,
argMax(Deleted, Record_Time) AS Deleted
FROM default.PowerConsumption
GROUP BY
User_ID,
Address,
District_Code
HAVING Deleted = 0;
Copy the code
Using this view, you can query the latest data:
If we now need to delete user 0, we need to insert a row where User_ID is 0 and User_ID is 1:
INSERT INTO default.PowerConsumption VALUES (0, now(), 3, 'Yanta', null, 1);
Copy the code
Select * from User_ID where User_ID = 0;
Clickhouse data can be updated and deleted as if using an OLTP database, but it is important to remember that the real deletion of old data occurs when data files are merged.
Click to follow, the first time to learn about Huawei cloud fresh technology ~