Big data platforms are evolving very rapidly. Apache retired 13 Hadoop-related projects this week, a blow to those who still champion the Hadoop big data ecosystem. In the last few years, ClickHouse has become the new darling of big data, as companies in the community have begun to replace the Hadoop ecosystem with it. In this section, I reflect on the direction of ClickHouse and of big data storage in general, as a reference for you.

First let’s look at ClickHouse’s cluster architecture:

The official ClickHouse cluster uses a shared-nothing architecture. Writing directly to distributed tables puts too much pressure on ZooKeeper (ZK), so in this architecture data is mainly written to local tables. Data loading is also a very challenging task when massive volumes of data are written.
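
Writing to local tables means the loader has to decide which shard each row belongs to. A minimal sketch of that client-side routing follows. The names `shard_for` and `group_by_shard`, the `user_id` sharding key, and the choice of CRC32 are all assumptions made for illustration, not part of ClickHouse; the actual insert into each local table is left out.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a sharding key to a shard index with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def group_by_shard(rows, key_field: str, num_shards: int) -> dict:
    """Split rows into one batch per shard, keyed by shard index."""
    batches = defaultdict(list)
    for row in rows:
        batches[shard_for(str(row[key_field]), num_shards)].append(row)
    return dict(batches)


# Hypothetical input: 1000 events sharded by user_id across 4 shards.
rows = [{"user_id": i, "event": "click"} for i in range(1000)]
batches = group_by_shard(rows, "user_id", num_shards=4)

for shard, batch in sorted(batches.items()):
    # In a real loader, each batch would become one bulk INSERT into the
    # local table on that shard (the insert itself is omitted here).
    print(f"shard {shard}: {len(batch)} rows")
```

Grouping rows first lets the loader send a few large batches per shard instead of many small writes, which is one reason data loading becomes a project of its own in this architecture.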

This architecture has the following advantages:

  1. The structure is clear, and expensive shared storage is replaced by commodity PC-class local storage

  2. Each table can be horizontally partitioned across nodes, and each node has its own local storage

  3. The high availability architecture is simple and convenient for operation and maintenance

  4. The star model has good scalability

  5. It has become the basic architecture of many modern distributed data warehouses

This architecture also has the following disadvantages:

  1. Nodes combine computing and storage, which inevitably makes capacity expansion expensive. For example, slave nodes do little computing work, yet they still have to be provisioned with strong computing power.

  2. Insufficient flexibility: data distribution must be adjusted whenever a cluster is expanded or scaled down.

  3. The computing resources of slave nodes are mostly idle.

  4. When data is written at high speed, there is great pressure on ZooKeeper and on network data synchronization.

  5. If a performance problem or failure occurs during a structural upgrade or during operation and maintenance, the whole service can easily become unavailable.
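
To get a feel for the second disadvantage, the sketch below estimates how many rows would have to change nodes when a cluster grows. It assumes rows are placed by a simple hash-modulo rule, which is an illustrative assumption rather than a description of how any particular ClickHouse deployment is configured; `stable_hash` and `fraction_moved` are hypothetical helper names.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash a key to a 64-bit integer that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Share of keys whose node changes under hash-modulo placement."""
    moved = sum(
        1 for k in keys
        if stable_hash(k) % old_nodes != stable_hash(k) % new_nodes
    )
    return moved / len(keys)


# Hypothetical workload: 10,000 row keys, cluster grows from 4 to 5 nodes.
keys = [f"row-{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_nodes=4, new_nodes=5)
print(f"{moved:.0%} of rows change node when growing from 4 to 5 nodes")
```

With a uniform hash, about 80% of rows are expected to move in this case, because a key stays put only when its hash gives the same remainder modulo 4 and modulo 5. When storage is bound to the node, that movement is real data copied over the network.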

For these reasons, ClickHouse in its original form does not fit directly into a massive data storage architecture, mainly because of the high storage and computing costs. ClickHouse is, however, very simple to run as a single node, especially in HTAP scenarios that pair ClickHouse with MySQL, and it is also very convenient for simple environments without joins. There is also an approach that puts ClickHouse's storage on HDFS to separate computing from storage (Kuaishou), but that architecture carries high operational and in-house development costs.

So how is storage being improved in the new generation of big data and data platforms?

When it comes to the next generation of big data platforms, the current star product, Snowflake, has to be mentioned. Snowflake has further defined the direction of modern data warehousing. It offers Data-Warehouse-as-a-Service (DaaS): a cloud-native data warehouse delivered as a SaaS service (no hardware, no software, and maintained entirely in the cloud).

Snowflake proposes a multi-cluster, shared data architecture. Here's a look at Snowflake's architecture:

Snowflake's architecture shows that the bottom layer is a shared data layer, namely the S3-style object storage provided by cloud vendors. S3 itself is a cross-data-center storage service with almost unlimited capacity for expansion, so users basically do not need to worry about data storage and security.

Above the storage, Snowflake's computing layer consists of compute clusters located in the same data center. As long as network transmission performance is guaranteed, parsed queries can be distributed to the compute clusters, which fetch their data from the shared data layer. Using cloud infrastructure resources at this layer removes the IT headaches associated with building a private deployment.

Next up is the cloud services layer, or metadata management layer, which is stateless, so high availability is easy to achieve.

The top layer is the load balancer, which handles service redundancy, load distribution, connection persistence and so on.

In terms of architecture and layering, Snowflake is built entirely as a cloud-native database, although its cloud services layer is more complex.

Advantages of this architecture:

  1. Data storage is unified on S3, so there are no data silos, and storage is separated from computing.

  2. S3 object storage can store structured and unstructured data. S3 object storage can be expanded indefinitely.

  3. Compute nodes in a cluster are stateless and can be scaled out and in quickly. Multiple instance sizes are supported.

  4. The top layer provides a data lake management service: a complete SaaS platform that can be used to manage storage, computing, machine learning, and other management roles.

  5. All layers are relatively independent, scalable on demand and easy to manage.

  6. Computing resources work out of the box and can be turned off when they are not needed.

  7. Each computing resource can access all data.

  8. The upgrade management is transparent to users.
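
Several of these advantages (stateless compute, resources that can be turned off, every compute resource seeing all data) follow from one design choice: data lives only in the shared store. The toy model below illustrates that idea. It is not Snowflake's API; `ObjectStore`, `ComputeCluster`, and the `sales/2021` key are hypothetical names, and a Python dict stands in for S3-style object storage.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Stand-in for S3-style shared storage: one copy of the data for everyone."""
    objects: dict = field(default_factory=dict)

    def put(self, key: str, rows: list) -> None:
        self.objects[key] = list(rows)

    def get(self, key: str) -> list:
        return self.objects[key]


@dataclass
class ComputeCluster:
    """Stateless compute: holds a reference to the store, never the data itself."""
    name: str
    store: ObjectStore

    def total(self, key: str, column: str) -> int:
        return sum(row[column] for row in self.store.get(key))


store = ObjectStore()
store.put("sales/2021", [{"amount": 10}, {"amount": 32}])

# Two independent clusters read the same data without copying it.
etl = ComputeCluster("etl", store)
bi = ComputeCluster("bi", store)
total_before = etl.total("sales/2021", "amount")

# Shutting down a cluster removes compute only; no data has to be moved.
del etl
total_after = bi.total("sales/2021", "amount")
print(total_before, total_after)
```

Contrast this with a shared-nothing node, where removing the node also removes the only local copy of its shard unless the data is redistributed first.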

Disadvantages of this architecture:

  1. Because it is developed on top of cloud architecture, it is difficult to deploy privately.

  2. It is not currently open source (admittedly a somewhat forced disadvantage).

It can be said that big data platforms and data platforms are moving from shared-nothing to shared-data architectures, and from the traditional OS-plus-database model to the separation of computing and storage. There will also be a gradual transition to SaaS applications (cloud companies will become more specialized providers of IT infrastructure, and more SaaS providers will emerge).

The transition in China may be a little slower. At present, cloud adoption in China is still relatively low. Most companies still want to do everything themselves, and many hope to develop their own technology in-house and then start an AWS-like business. However, AWS, Google, Aliyun and Tencent are the only ones in the world that have succeeded at this; other companies are content to serve their own business. Even so, this is arguably the best time to start a cloud-based startup.

Conclusion:

  1. ClickHouse suits environments with small data volumes, no need to scale, little sensitivity to machine cost, and a phase in which getting something implemented matters more than cost. ClickHouse is excellent and delivers outstanding single-machine performance in typical environments.

  2. Separating computing from storage has become the trend for the new generation of big data platforms.

  3. Storage developed in-house in a short time cannot match the S3 object storage provided by cloud vendors. S3 object storage will also become the storage architecture for a new generation of databases and data platforms.