Abstract: This article presents our simplest OBS-enabled adaptation of ClickHouse and our storage-compute separation practice on Huawei Cloud.

This article is from the Huawei Cloud community post “ClickHouse Storage-Compute Separation Practice on Huawei Cloud” by He Lifu.

ClickHouse is an excellent OLAP database system. Its performance has been widely recognized since its launch in 2016, and over the past two years it has been widely adopted and promoted by Chinese Internet companies.

Summarizing information shared openly on the Internet and our customers' use cases, ClickHouse is mainly used in two scenarios: real-time data warehousing and offline acceleration. Some real-time businesses that pursue extreme performance deploy all-SSD configurations; since real-time data sets are of limited size, the cost is still acceptable. For offline acceleration, however, data sets tend to be so large that a full SSD configuration becomes expensive. Is there a way to achieve high performance while keeping the cost as low as possible? Our idea is elastic scaling: store the data on cheap object storage, meet high-performance demand during peak periods by dynamically adding computing resources, and control cost during off-peak periods by reclaiming them. This is why we are targeting storage-compute separation.

I. Current state of storage-compute separation

ClickHouse is an integrated storage-compute database system in which data is stored directly on local disks (including cloud disks). Recent versions of ClickHouse support persisting data to object storage and HDFS. Below is the simplest OBS-enabled adaptation of ClickHouse, using the native S3 support:

1. Configure an S3 disk (see the configuration and SQL sketches after this list)

2. Create a table and insert data

3. Check the data on the local disk

4. Check the object storage data
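As a rough illustration of steps 1 and 2 (the endpoint, credentials, and table schema below are placeholders, not the exact configuration used in the article; older 21.x releases use `<yandex>` instead of `<clickhouse>` as the config root), an S3-compatible disk pointing at an OBS bucket is declared in the server configuration and exposed through a storage policy:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <obs_disk>
                <type>s3</type>
                <!-- placeholder bucket/endpoint: OBS exposes an S3-compatible API -->
                <endpoint>https://my-bucket.obs.cn-north-4.myhuaweicloud.com/clickhouse/</endpoint>
                <access_key_id>ACCESS_KEY</access_key_id>
                <secret_access_key>SECRET_KEY</secret_access_key>
            </obs_disk>
        </disks>
        <policies>
            <obs_policy>
                <volumes>
                    <main>
                        <disk>obs_disk</disk>
                    </main>
                </volumes>
            </obs_policy>
        </policies>
    </storage_configuration>
</clickhouse>
```

A table is then bound to that policy and filled with some data:

```sql
-- placeholder table; the storage policy ties its parts to the OBS-backed disk
CREATE TABLE test_obs
(
    `id`  UInt64,
    `val` String
)
ENGINE = MergeTree
ORDER BY id
SETTINGS storage_policy = 'obs_policy';

INSERT INTO test_obs SELECT number, toString(number) FROM numbers(1000000);
```

Step 3 then amounts to inspecting the table's data directory on the local disk, and step 4 to listing the objects under the bucket prefix.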

Checking the local data files (step 3) shows that each of them only records the name (a UUID-like key) of the corresponding object on OBS. In other words, ClickHouse associates local data files with OBS objects through a “mapping” relationship, and this mapping is persisted locally, which means local redundancy is still required for reliability.
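As a rough illustration (not a byte-for-byte reproduction; the exact layout varies across ClickHouse versions), such a local metadata file holds only bookkeeping information plus the object key, not the column data itself:

```text
2                                        # metadata format version
1    8389734                             # number of remote objects, total size in bytes
8389734    gyxmledpcrvyujxgzqnqfuqyxcs   # size and key (random name) of the object on OBS
0                                        # reference count
0                                        # read-only flag
```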

Further, we can see the community pushing ClickHouse in the direction of storage-compute separation:

v21.3 added the ability to back up/restore metadata files for DiskS3. It allows the mapping between local data files and OBS objects, the directory structure of the local data, and other attributes to be stored in the properties of the OBS objects (object metadata). This decouples the restriction that the data directory must be local, and also removes the requirement that the local files maintaining the mapping be kept reliable with at least two copies;

v21.4 added S3 zero-copy replication, which lets multiple replicas share a single copy of the remote data, significantly reducing the storage cost of keeping multiple replicas with the integrated storage engine.
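As a rough sketch of how this is switched on (the setting was introduced around 21.4 under the name allow_s3_zero_copy_replication and later generalized to allow_remote_fs_zero_copy_replication; check the exact name for your version), it is a MergeTree-level setting that can be set in the server configuration:

```xml
<clickhouse>
    <merge_tree>
        <!-- name used by recent versions; early 21.x used allow_s3_zero_copy_replication -->
        <allow_remote_fs_zero_copy_replication>1</allow_remote_fs_zero_copy_replication>
    </merge_tree>
</clickhouse>
```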

In fact, however, verification tests show that there is still a long way to go before this storage-compute separation can be put into production. For example: how to move tables of the Atomic database engine onto object storage (the unique UUID recorded in the table-definition SQL file must stay consistent with the UUID of the data directory); how to rebalance data quickly and efficiently during elastic scaling (copying data greatly lengthens the operation window); how to guarantee consistency between local disk files and remote objects; how to recover quickly when a node goes down; and so on.

II. Our practice

In the cloud-native era, storage-compute separation is the trend and the direction of our work. The following discussion focuses on Huawei Cloud Object Storage Service (OBS).

1. Introduce file semantics

What distinguishes OBS from its competitors is that it supports file semantics, including rename operations on files and folders. This proved very valuable for our subsequent system design and elastic-scaling implementation, so we integrated the OBS driver into ClickHouse and modified the ClickHouse logic so that the data layout on OBS is exactly the same as the local one:

Local Disk:

OBS:
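As a rough illustration (the table UUID and part name are placeholders), both the local disk and the OBS bucket end up holding the familiar MergeTree layout:

```text
store/
└── d5a/d5a1b2c3-.../        # table data directory, named by the table UUID
    └── all_1_1_0/           # one data part
        ├── checksums.txt
        ├── columns.txt
        ├── count.txt
        ├── primary.idx
        ├── data.bin         # compact part; a wide part has <column>.bin/.mrk files instead
        └── data.mrk3
```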

Once the complete data directory structure is available on OBS, it is easy to support operations such as merge, detach, attach, mutation, and part reclamation.
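For reference, these are the standard ClickHouse statements behind those operations (the table from the earlier sketch is reused; partition expressions are placeholders):

```sql
-- trigger a merge of the table's data parts
OPTIMIZE TABLE test_obs FINAL;

-- detach a partition and re-attach it
ALTER TABLE test_obs DETACH PARTITION tuple();
ALTER TABLE test_obs ATTACH PARTITION tuple();

-- a mutation, which rewrites the affected parts
ALTER TABLE test_obs UPDATE val = concat(val, '_x') WHERE id < 100;
```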

2. Offline scenario

Let's look at the system structure from the bottom up. In the offline acceleration scenario, the dependency on ZooKeeper is removed and each shard has a single replica:

The OBS rename operation is used to move data between nodes at the part level at very low cost (unlike clickhouse-copier, which physically copies data). When a node goes down, the replacement node rebuilds its local data file directory from the object storage side.
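The article does not show the internal implementation; as a purely conceptual sketch (bucket name, prefixes, and UUIDs are hypothetical), moving a part from one node's namespace to another's reduces to a single server-side rename of the part's directory prefix on OBS, with no data transfer:

```text
# before: the part sits under node-a's prefix (hypothetical layout)
obs://ch-bucket/node-a/store/d5a/d5a1b2c3-.../all_1_1_0/

# after a server-side rename: the same objects now sit under node-b's prefix
obs://ch-bucket/node-b/store/d5a/d5a1b2c3-.../all_1_1_0/
```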

3. Fusion scenario

Building on the offline scenario above, we now extend to the real-time scene. The “real-time cluster” part (below) consists of the ClickHouse clusters of different businesses. Through hot-cold tiered storage (a relatively mature feature that the industry generally uses to reduce storage cost), cold data is moved out of the real-time cluster and mounted into the “offline cluster” with the OBS rename operation. In this way we cover the full life cycle of data from real-time to offline (including the ETL process from Hive to ClickHouse):
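As a rough sketch of the hot-cold tiering mentioned above (the table, columns, interval, and volume names are placeholders, and a storage policy named hot_cold_policy with a local 'hot' volume and an OBS-backed 'cold' volume is assumed to exist in the server configuration), ClickHouse expresses it with a TTL ... TO VOLUME clause:

```sql
-- recent data stays on the local SSD volume; older parts move to the OBS-backed volume
CREATE TABLE events
(
    `event_time` DateTime,
    `payload`    String
)
ENGINE = MergeTree
ORDER BY event_time
TTL event_time + INTERVAL 30 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold_policy';
```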

III. Future outlook

The practice above is our first attempt in the direction of storage-compute separation and is still being improved and optimized. From a macro point of view, OBS is still used as a remote disk, but thanks to its high throughput, with the same computing resources the latency of running the Star Schema Benchmark on OBS is about 5x that of SSD, while the storage cost drops by roughly 10x. In the future we will build on this work: moving beyond using OBS as a remote disk, consolidating the data of a single table into a single data directory, and separating out ClickHouse's metadata so that it becomes a stateless compute node, in the style of SQL-on-Hadoop MPP engines such as Impala.
