Abstract: This article is compiled from a talk given by Huawei Cloud NoSQL database architect Yu Wenlong at this year's China System Architect Conference (SACC).

The talk is organized into four parts:

1. What is GaussDB(for Redis)?

2. Why choose storage-compute separation?

3. Design and implementation

4. Summary of competitive strengths

1. What is GaussDB(for Redis)

1.1 What are the disadvantages of open source Redis?

To explain what GaussDB(for Redis) is, let's start with the background. Open-source Redis is an excellent KV cache, but as businesses have boomed, data volume, throughput, and business complexity keep rising, and open-source Redis has exposed a number of problems:

1. AOF expansion problem

Open-source Redis is positioned as a cache, but to allow fast recovery after a business outage it adds an AOF (append-only file) log to provide a degree of persistence. Unfortunately, Redis has no mechanism to compact the AOF incrementally; instead it relies on AOF rewriting to periodically re-merge old log entries. The rewrite depends on a fork() call, which brings problems such as doubled memory usage and performance stalls.
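As a quick illustration of how this rewrite is typically managed, the sketch below (Python with the redis-py client, assuming a local test instance on the default port) inspects the rewrite thresholds and triggers a rewrite manually; it is not part of GaussDB(for Redis).

```python
import redis

r = redis.Redis(host="127.0.0.1", port=6379)   # assumed local test instance

# The thresholds that decide when Redis forks a child to rewrite the AOF.
print(r.config_get("auto-aof-rewrite-*"))

# Trigger a rewrite by hand and check whether the last rewrite succeeded.
r.bgrewriteaof()
print(r.info("persistence")["aof_last_bgrewrite_status"])
```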

2. Snapshot backup problem

As businesses rely more and more on Redis, data backup becomes very important. Redis is not an MVCC architecture, so backing up data would normally mean copying memory under a pessimistic lock. Instead, the Redis author chose a copy-on-write scheme: call fork() and let a child process copy the data, avoiding locks in user space. However, this still takes locks on the kernel side and causes significant performance jitter.
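The fork cost described here is directly observable. A minimal sketch with redis-py, assuming a local instance: trigger a background snapshot and read back how long the most recent fork took.

```python
import redis

r = redis.Redis()   # assumed local instance on the default port

r.bgsave()          # forks a child process to dump an RDB snapshot (copy-on-write)

stats = r.info("stats")
# latest_fork_usec grows with dataset size, which is exactly the jitter source described above.
print("last fork took", stats["latest_fork_usec"], "microseconds")
```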

3. Master-replica divergence

Open-source Redis uses a master-replica high-availability architecture in which data is replicated asynchronously, so a master outage can easily cause data loss or inconsistency. In addition, when the master is under heavy write pressure, single-threaded replication may be unable to keep up with the incremental writes, causing the replication buffer to pile up, which can lead to write failures and even OOM. Redis can try to close a large master-replica gap by creating a temporary snapshot and transferring the full file, but that in turn triggers the fork problems mentioned above.
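The master-replica gap can be watched from the replication section of INFO. A small sketch with redis-py; the host name is hypothetical.

```python
import redis

master = redis.Redis(host="redis-master.example")   # hypothetical master address

info = master.info("replication")
print("master_repl_offset:", info["master_repl_offset"])

# Each connected replica reports the offset it has acknowledged; the difference
# from master_repl_offset is the data that would be lost if the master died now.
for key, value in info.items():
    if key.startswith("slave"):
        print(key, value)
```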

4. The fork problem

fork() is a very heavy system call. Although it is copy-on-write, twice the memory is usually reserved for it, and fork also locks and copies the process page tables and other metadata, which has a large impact on the service. All three problems above stem from fork, and avoiding them usually requires DBAs to take complex operational measures such as disabling AOF on the primary and taking backups only from replicas. With frequent primary/replica switchovers and a large number of nodes, this kind of operation becomes very difficult, and in the master-replica divergence scenario there is, in principle, no way around fork at all.
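A sketch of the kind of mitigation mentioned above, using redis-py with hypothetical host names: keep the primary fork-free by disabling AOF there and take snapshots only on a replica.

```python
import redis

primary = redis.Redis(host="redis-primary.example")   # hypothetical hosts
replica = redis.Redis(host="redis-replica.example")

# Keep the primary fork-free: no AOF rewrites, no snapshots on it.
primary.config_set("appendonly", "no")

# Take backups on a replica instead, where fork jitter does not hit client traffic.
replica.bgsave()
print("fork cost on replica:", replica.info("stats")["latest_fork_usec"], "microseconds")
```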

5. Capacity

Open-source Redis is not well suited to large-scale deployments; two factors limit its scalability. First, fork limits vertical scale-up: the larger the dataset, the slower the fork and the bigger the impact on the business, so the amount of data a single Redis process can hold is very limited. Second, the inefficient gossip-based cluster management limits horizontal scale-out: the more nodes there are, the longer failure detection takes, and internal gossip traffic grows sharply with cluster size, making very large clusters almost unusable.
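Both limits can be checked on a running cluster. A small sketch with redis-py, assuming one reachable cluster node: read the node count and the failure-detection timeout that together bound how quickly faults are noticed.

```python
import redis

node = redis.Redis(host="cluster-node.example", port=6379)   # hypothetical cluster node

# cluster_known_nodes shows how many peers the gossip protocol has to cover.
print(node.execute_command("CLUSTER INFO"))

# cluster-node-timeout is how long a node must be unreachable before it is flagged as failing.
print(node.config_get("cluster-node-timeout"))
```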

1.2 What are the industry’s solutions?

These are the classic problems that major enterprises run into when operating open-source Redis in production, and they limit its large-scale adoption. As a result, many alternative solutions have appeared in recent years, as shown in the figure below.

In essence, Redis is a KV store, and its usage can be divided into two camps by scenario: caching and persistence.

Cache scenario: the cache stores data for flash sales and hotspot events, such as trending searches on Weibo. This kind of data is short-lived and can tolerate being lost.

Persistence scenario: because Redis has a simple interface and rich data structures, users naturally want to store more important data in it persistently, such as historical orders, feature-engineering data, location coordinates, and machine-learning features. This kind of data is often very large, stays valid for a long time, and generally cannot be lost.

The cache scenario is relatively simple, and open-source Redis handles it well. For persistence scenarios, the industry has produced many self-developed products, such as 360's SSDB/Pika, Alibaba's Tair, and Tencent's Tendis; Huawei Cloud's GaussDB(for Redis) also belongs to this self-developed persistent Redis camp.

Cost is another driver toward persistence: 256 GB of memory costs nearly 30 times as much as a 256 GB SSD, and there is also a huge difference in the capacity that can be attached.

1.3 What is Huawei Cloud's solution?

Drawing on its experience with open-source Redis, the Huawei Cloud database team chose to build a self-developed persistent Redis, the protagonist of today's talk: GaussDB(for Redis), hereafter Gauss Redis.

Its one-sentence positioning: a NoSQL database that speaks the Redis protocol, not a cache. It has two features that set it apart from the rest of the industry:

1. Storage-compute separation. Built on Huawei's self-developed distributed storage DFV, Gauss Redis provides powerful data storage capabilities, including advanced features such as strong consistency and flexible capacity expansion. What is DFV? It is the cornerstone of Huawei's full-stack data services: block storage (EVS), object storage (OBS), file storage, the database family, and the big data family all depend on it, which says a lot about its strength and stability.

2. Multi-model architecture. Gauss Redis is in fact one member of the multi-model database GaussNoSQL. GaussNoSQL provides a full-stack distributed KV engine, a user-mode file system, a shared storage pool, and other common technologies; by wrapping the Redis protocol around the interface layer, a new NoSQL product can be implemented easily. In the same way, engines compatible with MongoDB, Cassandra, and InfluxDB are also available.
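To make "wrapping the Redis protocol" concrete, here is a minimal, purely illustrative sketch of RESP encoding, the wire format such an interface layer has to speak; it is not GaussNoSQL code.

```python
# RESP (REdis Serialization Protocol) replies, the format a Redis-compatible
# interface layer has to produce.

def encode_bulk_string(value: bytes) -> bytes:
    """Encode one bulk string reply, e.g. the result of GET."""
    return b"$" + str(len(value)).encode() + b"\r\n" + value + b"\r\n"

def encode_array(items: list) -> bytes:
    """Encode an array reply, e.g. the result of LRANGE."""
    header = b"*" + str(len(items)).encode() + b"\r\n"
    return header + b"".join(encode_bulk_string(i) for i in items)

print(encode_array([b"order:1001", b"order:1002"]))
# b'*2\r\n$10\r\norder:1001\r\n$10\r\norder:1002\r\n'
```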

2. Why choose storage-compute separation?

Nowadays the concept of cloud native is everywhere, and databases are gradually moving toward cloud native as well. One important characteristic of a cloud-native database is storage-compute separation, which also represents the latest trend in cloud databases.

First-generation database service: as shown in the figure below, in a traditional self-built IDC the database runs on bare metal. Because the database is such a sensitive, special service, the DBA or R&D team has to worry about server models, RAID configuration, networking, and even procurement, among many other things.

Second-generation database service: as virtualization became widespread, large numbers of applications moved to the cloud, and databases followed. The simplest approach is to run the database inside a virtual machine or container. The advantages are obvious, but there are two drawbacks: first, a typical cloud disk already keeps three copies, and the database adds its own replicas on top, which wastes resources badly; second, the standby node also consumes resources while serving no traffic. There are further issues as well, such as cloud-disk I/O performance.

Third-generation database service: with a storage-compute separation architecture, the database is split into a CPU-intensive compute layer and an I/O-intensive storage layer. Replica management is delegated entirely to the storage layer, while the compute layer does stateless request handling. This fully exploits the elasticity of the cloud and lets every node share the load. The drawback is equally obvious: it is hard to retrofit onto an old architecture.

With storage and compute separated, the database service follows a divide-and-conquer idea: the compute layer handles all the service- and product-level processing and stays completely stateless, while the storage layer focuses on data maintenance, including replication, disaster recovery, hardware awareness, capacity expansion, and so on.

3. Design and implementation

Next comes the overall design and implementation, starting with the software architecture. The compute layer of Gauss Redis consists of the following modules: CFGSVR, Proxy, and Datanode. Between compute and storage sit RocksDB and GeminiFS (a self-developed user-mode file system), which are responsible respectively for converting KV data into SST files and for pushing those SST files down to the storage pool of DFV.
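The write path can be pictured with a toy model: keys accumulate in a sorted in-memory table and are flushed as immutable sorted files, which is the role RocksDB's SST files play before GeminiFS pushes them into DFV. Everything below (file names, flush threshold) is made up for illustration and is not the real Datanode code.

```python
import json

MEMTABLE_LIMIT = 4               # illustrative flush threshold
memtable: dict = {}
sst_files: list = []

def put(key: str, value: str) -> None:
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush() -> None:
    """Dump the sorted memtable as one immutable file, standing in for an SST."""
    path = f"sst_{len(sst_files):06d}.json"
    with open(path, "w") as f:
        json.dump(dict(sorted(memtable.items())), f)
    sst_files.append(path)       # in the real system this file is handed down to GeminiFS/DFV
    memtable.clear()

for i in range(10):
    put(f"key:{i}", f"value:{i}")
print(sst_files)                 # ['sst_000000.json', 'sst_000001.json']
```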

Next is the network design. The database resources a tenant requests are distributed across containers on different physical machines in an anti-affinity manner, and all belong to that tenant's VPC. Although database resources of different users may share the same physical machine, VPC isolation guarantees data isolation. In addition, compute-layer resources run in dedicated containers, while storage resources share the physical hardware.

Next, the disaster recovery (DR) architecture. Since Gauss Redis is positioned as a database rather than a cache, it takes data seriously: it implements not only intra-region 3-AZ DR but also cross-region DR.

Intra-region DR provides a high-availability solution that tolerates AZ-level failures. Even after such a failure, data remains strongly consistent, which is a very strong data-safety guarantee for enterprise applications. This architecture achieves RPO = 0 and RTO < 10 s.

The implementation relies on the strongly consistent three-replica replication of DFV, with the compute layer also deployed across three AZs in an anti-affinity manner. When a user writes data through the Proxy to Datanode1, Datanode1 invokes the DFV SDK via the GeminiFS user-mode file system to reach a DFV storage node in the local AZ; that node, together with a DFV storage node in the nearest remote AZ, forms a majority. Once the majority write succeeds, the result is returned to the user (a simplified sketch follows). With this architecture, an AZ-level compute or storage failure has no impact on data safety.
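The majority-write idea can be sketched generically as below. The replica names and the write_to_replica() helper are hypothetical stand-ins, not the DFV SDK; the point is only that an acknowledgement requires two of the three AZ copies.

```python
from concurrent.futures import ThreadPoolExecutor

REPLICAS = ["az1-storage", "az2-storage", "az3-storage"]   # anti-affinity: one copy per AZ

def write_to_replica(replica: str, key: bytes, value: bytes) -> bool:
    """Stand-in for persisting one copy; a real system would call the storage SDK here."""
    return True

def quorum_write(key: bytes, value: bytes) -> bool:
    with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
        acks = list(pool.map(lambda r: write_to_replica(r, key, value), REPLICAS))
    # Acknowledge the client only once a majority of replicas confirm the write,
    # so losing any single AZ cannot lose acknowledged data (RPO = 0).
    return sum(acks) >= len(REPLICAS) // 2 + 1

print(quorum_write(b"user:1001", b"beijing"))   # True
```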

Next, cross-region disaster recovery. In addition to the strongly consistent 3-AZ solution, Gauss Redis provides cross-region DR, that is, asynchronous DR between two instances. In this scheme we add an Rsync-server module that subscribes to the new logs on the primary instance, decodes them into the appropriate format, and forwards them to the standby instance on the remote side, where they are replayed. The scheme supports bidirectional synchronization, resumable transfer, conflict resolution, and more. For conflict resolution, different algorithms are used for different Redis data structures to guarantee eventual consistency (an illustrative sketch follows).
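As an illustration of per-type merge rules, the sketch below assumes two simple policies, last-writer-wins for strings and union for sets; they are examples of how different structures can be made to converge, not the actual algorithms used by Gauss Redis.

```python
def merge_string(local: tuple, remote: tuple) -> bytes:
    """Each side carries (value, write timestamp); keep the most recent writer."""
    return local[0] if local[1] >= remote[1] else remote[0]

def merge_set(local: set, remote: set) -> set:
    """Set additions commute, so taking the union converges on both regions."""
    return local | remote

print(merge_string((b"v1", 100.0), (b"v2", 101.5)))    # b'v2'
print(sorted(merge_set({b"a", b"b"}, {b"b", b"c"})))   # [b'a', b'b', b'c']
```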

4. Summary of competitive strengths

The last section summarizes the strengths of Gauss Redis: strong consistency, high availability, hot/cold separation, elastic scaling, and high performance.

The first is strong consistency.

This comes mainly from DFV's three-replica mechanism: by the time the client receives a reply, the data written to Gauss Redis is already consistent across three replicas. Strong consistency is business-friendly, since applications neither have to tolerate inconsistencies nor verify the data themselves. Open-source Redis, by contrast, replicates asynchronously, so there is always a gap in the replication buffer between master and replica; if power is lost, that data is gone, and under heavy write pressure the buffer piles up and can, in severe cases, cause OOM. Strong consistency is therefore a very important feature of Gauss Redis: it gives the business a consistent state and removes the worries about consistency and data loss that follow an open-source Redis master/replica switchover.

The second feature is high availability.

High availability is a basic capability of any database; it is worth emphasizing here because the availability of Gauss Redis differs from other databases: it can tolerate the failure of N-1 nodes. This comes from the shared-storage DFV layer. When a compute node fails, its slot routing information is automatically taken over by the remaining nodes, and since no underlying data has to move, the takeover is very fast. By extension, up to N-1 node failures can be tolerated without affecting reads and writes on any data (see the sketch below), although fewer compute nodes naturally means lower performance.
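A toy sketch of routing-only takeover: because every node reads the same shared storage, failover just reassigns slot ownership in metadata. The node names and the round-robin policy are assumptions for illustration.

```python
SLOT_COUNT = 16384   # the usual Redis-style slot space

def rebalance(slot_owner: dict, failed: str, survivors: list) -> dict:
    """Reassign the failed node's slots to the survivors; no data moves."""
    new_map = dict(slot_owner)
    orphaned = [slot for slot, node in slot_owner.items() if node == failed]
    for i, slot in enumerate(orphaned):
        new_map[slot] = survivors[i % len(survivors)]
    return new_map

owners = {slot: f"datanode-{slot % 3}" for slot in range(SLOT_COUNT)}
owners = rebalance(owners, failed="datanode-2", survivors=["datanode-0", "datanode-1"])
print(sorted(set(owners.values())))   # ['datanode-0', 'datanode-1']
```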

The third feature is hot/cold separation.

A classic use of open-source Redis is to pair it with MySQL for hot/cold separation, but that forces the business code to move data between the hot and cold tiers and keep them consistent, which is complicated logic to deliver (the sketch below shows the kind of boilerplate involved). Gauss Redis implements hot/cold separation itself: data that was just written or is frequently accessed is kept in memory as hot data, while infrequently accessed data is evicted to persistent storage. A business using Gauss Redis therefore no longer has to write application-layer code to manage hot/cold exchange, and gets better consistency as well.
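For contrast, this is roughly the cache-aside boilerplate that a hand-rolled Redis + MySQL split forces into application code; with built-in hot/cold tiering it disappears. The connection details and the load_from_mysql() helper are hypothetical.

```python
import redis

r = redis.Redis()   # assumed local instance

def load_from_mysql(key: str) -> bytes:
    # A real implementation would query the cold store (MySQL) here.
    return b"cold-value"

def get_order(key: str) -> bytes:
    value = r.get(key)
    if value is None:                  # cache miss: fall back to the cold tier ...
        value = load_from_mysql(key)
        r.set(key, value, ex=3600)     # ... and re-warm the cache with a TTL
    return value
```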

The fourth feature is elastic scaling.

With storage and compute separated, Gauss Redis can be scaled on demand: add compute when compute is short, add storage when storage is short. Scaling compute resources is simple. As mentioned earlier, the process involves no data copying or relocation, only metadata changes: the relevant slot routing information (less than 1 MB) is migrated to the newly added node, so it is very fast and completes in seconds. Scaling storage resources is even easier. Because the underlying layer is shared storage, most expansions are purely logical: the user just modifies the quota on the console, with no data relocation or copying. Physical expansion does happen too, but in that case the warning watermark is usually detected in advance and a smooth migration and expansion is performed beforehand, transparently to users.

The fifth feature is high performance.

A storage-compute separated architecture looks heavyweight, with a long request path, but it actually allows bolder, more aggressive choices in hardware and software optimization, such as RDMA networking, user-mode protocol stacks, and persistent memory. Thanks to this dedicated storage hardware and the full load-sharing compute layer (there are no idle standby nodes, so performance effectively doubles), Gauss Redis performs well in storage scenarios where the dataset is larger than memory. Compared with open-source Redis it also has a clear advantage in point-lookup scenarios where the data fits in memory, although range queries still need further optimization.

5. Conclusion

That concludes this share on Gauss Redis. For more information, please see the official Gauss Redis blog (Official Blog) and the official homepage (Official homepage).
