KV storage is an important online storage service at Meituan, handling trillions of requests from online services every day. At the 2019 QCon Global Software Development Conference (Shanghai), Qi Zebin, a senior technical expert at Meituan, gave the talk "The Architecture and Practice of Meituan Dianping's Trillion-Level KV Storage". This article is a compilation of that talk and is divided into four parts: the first describes the evolution of Meituan's KV storage; the second elaborates on the architecture and practice of the in-memory KV store Squirrel; the third introduces the architecture and practice of the persistent KV store Cellar; finally, we share our future plans and new trends in the industry.

The development of KV storage at Meituan

Meituan's first generation of distributed KV storage is shown in the architecture on the left side of the picture below, a stage I believe many companies have gone through. The basic distributed KV design was achieved by doing consistent hashing on the client side and deploying many Memcached instances on the back end. However, this design has obvious problems: data is lost when a node is removed after an outage, and when the cache space is insufficient and needs to be expanded, consistent hashing also loses some data, which causes a lot of trouble for business development.

As the Redis project matured, Meituan introduced Redis to solve the problems mentioned above and evolved the architecture on the right. As you can see, the client side is the same, still using a consistent hashing algorithm, while the server side has become a master-slave Redis structure. When any node goes down, we can use Redis Sentinel to complete the failover and achieve high availability. However, one problem remains: resizing the consistent hash ring still loses data, so how do we solve this?

At this point we found a relatively mature open source KV storage project: Alibaba's Tair. In 2014 we introduced Tair to meet our business's KV storage requirements. The architecture of the open source Tair version has three main parts: the storage nodes at the bottom of the image above report heartbeats to the central node, which consists of two configuration management nodes that monitor all storage nodes. When a storage node goes down or the cluster is expanded, the central node rebuilds the cluster topology. When a client starts, it pulls a routing table directly from the central node; the routing table is simply a map of how data is distributed across the cluster. Clients then read and write data directly on the storage nodes based on the routing table. To address the data loss problem during capacity expansion, Tair also has a data migration mechanism to ensure data integrity.

However, in the process of using it we ran into some other issues. For example, although the central node has a master-slave design for high availability, it has no real distributed arbitration mechanism, so in the case of a network partition a "split brain" is possible, which had a big impact on our business. In addition, during disaster recovery or capacity expansion, data migration could affect service availability. Moreover, having used Redis before, our businesses found that Redis offers particularly rich data structures that Tair did not yet support. So although Tair solved some problems, it could not fully meet our business requirements. After all, given Meituan's large scale and high business complexity, it is difficult for any open source system to meet our needs well. In the end, we decided to build on the open source systems we already had in place.

Just in time, in 2015 Redis officially released Redis Cluster. So we kept up with the community and did a lot of development work around internal requirements, evolving Squirrel, a fully in-memory, high-throughput, low-latency KV store. In addition, based on Tair, we added many features of our own and evolved Cellar, a persistent, high-capacity KV store with high data reliability. Because the open source version of Tair had not been updated in four or five years, Meituan has driven Cellar's iteration entirely in-house, while the Redis community has remained very active. In general, Squirrel's iteration combines in-house development with following the community, and our self-developed features are designed to be as compatible with the official architecture as possible. As you will see later, because of these differences, Cellar and Squirrel chose different solutions to the same problems.

These two stores are in fact different solutions within the KV storage domain. In practice, if the data volume is small and the business is latency-sensitive, we recommend Squirrel; if the data volume is large and the business is not particularly latency-sensitive, we recommend the lower-cost Cellar. At present, the daily call volume of these two KV storage systems at Meituan exceeds one trillion, and their peak requests exceed one hundred million per second.

In-memory KV Squirrel: architecture and practices

Before getting started, let's describe what the two storage systems have in common. Take the classic question of distributed storage: how is data distributed? In the KV storage domain, the question is how Keys are distributed to storage nodes. Squirrel and Cellar are the same here: when we get a Key, we use a fixed hash algorithm to compute a hash value, and then take that value modulo the number of slots to get a Slot ID. Both of our KV stores pre-allocate 16,384 slots. After obtaining the Slot ID, you can look up in the routing table which storage node the Slot lives on; the routing table is simply a slot-to-storage-node mapping table. A minimal sketch of this lookup follows.
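
As an illustration, here is a minimal sketch of that routing flow in Python. The CRC32 hash, the node addresses and the in-memory routing table are stand-ins; the real systems use their own fixed hash function and fetch the routing table from the cluster's metadata store.

```python
# Key -> Slot -> storage node lookup, assuming CRC32 as a stand-in hash and a
# plain dict as the routing table.
import zlib

SLOT_COUNT = 16384  # both Squirrel and Cellar pre-allocate 16,384 slots

def key_to_slot(key: str) -> int:
    """Hash the key and take it modulo the slot count to get a Slot ID."""
    return zlib.crc32(key.encode("utf-8")) % SLOT_COUNT

# Hypothetical routing table: slot ID -> storage node address.
nodes = ["10.0.0.1:6379", "10.0.0.2:6379", "10.0.0.3:6379"]
routing_table = {slot: nodes[slot % len(nodes)] for slot in range(SLOT_COUNT)}

def route(key: str) -> str:
    """Look up which storage node serves the given key."""
    slot = key_to_slot(key)
    return routing_table[slot]

print(route("user:1001"))  # e.g. '10.0.0.2:6379', depending on the hash value
```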

Next, I will talk about my understanding of high-availability architecture. I think high availability can be viewed from both macro and micro perspectives. From a macro point of view, high availability is about disaster recovery: if a node fails, what do you do? If a machine room, or all the machine rooms in a region, goes down, what do you do? From a micro point of view, high availability is about maintaining a high end-to-end success rate: when we do operation and maintenance upgrades, or migrate data for scaling in and out, can we still meet the availability the business requires? This article shares some of Meituan's high-availability work from both perspectives.

Here is our Squirrel architecture. The middle section is consistent with the official Redis Cluster: it has a master-slave structure, and Redis instances communicate with each other through the Gossip protocol. On the right we added a cluster scheduling platform, including a scheduling service, a scaling service, a high-availability service and so on. It manages the whole cluster and writes the management results to ZooKeeper as metadata. Our client subscribes to the metadata changes on ZooKeeper, obtains the cluster topology in real time, and reads and writes directly against the Redis cluster.

Squirrel node disaster recovery

Let's take a look at Squirrel's node disaster recovery. For Redis Cluster, node outages already have a complete handling mechanism, but according to the official scheme it generally takes 30 seconds from the moment a node goes down until it is marked as FAIL and removed. Removing a master can affect data integrity, so caution is warranted there. But what about a slave? We think this waiting period is completely unnecessary for slaves. On the other hand, we all know that in-memory KV clusters are usually relatively small in capacity, so a company with a large business tends to have many clusters. If a switch fails, it can affect many clusters at once, and replenishing replicas after such an outage is cumbersome. To solve these two problems, we built the HA high-availability service.

Its architecture is shown below: it monitors all nodes in the cluster in real time. In case of network jitter or an outage (say Redis 2 goes down), it updates ZooKeeper in real time, telling ZooKeeper to remove Redis 2; when the client receives the message, the read traffic is routed directly to Redis 3. If Redis 2 only experienced a few tens of seconds of jitter, the HA node will add it back once it detects that the node has recovered.

If, after some time, the HA service determines that this is a permanent outage, the HA node applies for a new container instance, Redis 4, directly from the Kubernetes cluster and adds it to the cluster. After updating the cluster topology, the HA node writes it to ZooKeeper to inform the client to update its routes, and the client can then read from the new slave Redis 4.

With the above solution, we reduced slave removal time from 30 seconds to 5 seconds. In addition, because HA automatically requests container instances and adds them to the cluster, replenishing replicas after a failure became a minute-level automatic operation, without any human intervention.

Squirrel cross-region disaster recovery

We solved the single-node outage problem, but what about cross-region problems? Let's start with what is different across regions. First, compared with the network between machine rooms in the same region, cross-region dedicated lines are very unstable. Second, cross-region dedicated line bandwidth is very limited and expensive. Replication within a single cluster does not take such extreme network environments into account: if the master is deployed in Beijing and two slaves are deployed in Shanghai, the same data is transmitted twice over the Beijing-Shanghai line, which wastes a lot of dedicated-line bandwidth. In addition, as the business grows and evolves, we are also working on unitized deployment and multi-region active-active architectures. These demands cannot be met by the official master-slave synchronization, so we built an inter-cluster replication scheme.

As shown in the figure above, the master cluster in Beijing and the slave cluster in Shanghai are drawn here; what we need to do is synchronize data from the Beijing master cluster to the Shanghai slave cluster through a cluster synchronization service. Following the process: first, we issue an "establish a sync link between the two clusters" task to the sync scheduling module; based on the topology of the master and slave clusters, the sync scheduling module issues master-slave sync tasks to the sync cluster; after receiving a sync task, a synchronizer in the sync cluster disguises itself as a Redis slave and, through the Redis replication protocol, fetches data from the slave nodes of the master cluster, including the RDB and the subsequent incremental changes. When the synchronizer receives the data, it converts it into client write commands and writes them to the master nodes of the Shanghai slave cluster. In this way we synchronize data from the Beijing master cluster to the Shanghai slave cluster. Similarly, implementing multi-region active-active is also simple: we just add a reverse synchronization link to achieve bidirectional synchronization between the clusters.
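
The following is only a conceptual sketch of the synchronizer, not the actual sync service: it assumes the replication stream has already been parsed into (command, key, args) tuples and models the target masters as plain Python callables, whereas the real synchronizer speaks the Redis replication protocol (RDB plus incremental stream) and issues ordinary client writes.

```python
# Replay a parsed replication stream as client writes on the target cluster.
from typing import Callable, Iterable, Tuple

Change = Tuple[str, str, tuple]  # (command, key, extra args), e.g. ("SET", "k", ("v",))

def replay_changes(changes: Iterable[Change],
                   route_to_target_master: Callable[[str], Callable[..., None]]) -> None:
    """Turn replicated changes into client write commands on the target cluster."""
    for command, key, args in changes:
        write = route_to_target_master(key)   # pick the target master owning this key
        write(command, key, *args)            # issue it as a normal client write

# Hypothetical usage with an in-memory stand-in for the Shanghai slave cluster.
target_store = {}

def fake_master_for(key):
    def write(command, key, *args):
        if command == "SET":
            target_store[key] = args[0]
        elif command == "DEL":
            target_store.pop(key, None)
    return write

replay_changes([("SET", "order:1", ("paid",)), ("DEL", "cart:1", ())], fake_master_for)
print(target_store)  # {'order:1': 'paid'}
```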

Now let's talk about high availability from the micro perspective, that is, how to maintain a high end-to-end success rate. For Squirrel, we mainly discuss the following three problems that affect the success rate:

  • Timeout jitter caused by data migration.
  • Timeout jitter caused by persistence.
  • Single-node overload caused by hot Key requests.

Squirrel Intelligent Migration

For data migration, we encountered three main problems:

  • Redis Cluster provides data migration capabilities, but it does not decide which slots to move, or where to move them from and to.
  • When migrating data, you want to move it as fast as possible, but migrating too fast can affect normal business requests.
  • Redis's Migrate command blocks the worker thread, especially when migrating large Values.

To address these issues, we made a new migration service.

Let's take a look at how it works, following the workflow. The core of generating a migration task is the "proximity principle": for example, migrating between two nodes in the same machine room is faster than migrating across machine rooms. After a migration task is created, it is sent to a batch of migration machines, which have several characteristics when migrating:

  • First, migrations between different nodes in the cluster run concurrently; for example, migration commands are sent to Redis 1 and Redis 3 simultaneously.
  • Second, each Migrate command migrates a batch of Keys.
  • Third, we use the monitoring service to collect the client's success rate, latency, load and QPS in real time, and feed that status back to the migration machine. The migration process is similar to TCP slow start: it keeps ramping the speed up, and if metrics such as the request success rate drop, it slows back down, so the migration speed eventually settles into a dynamic balance, achieving the fastest possible migration while affecting normal business requests as little as possible. A sketch of this feedback loop follows the list.
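
Below is a rough sketch of that speed-control loop in Python. The thresholds, step sizes and batch limits are illustrative assumptions, not Squirrel's actual parameters.

```python
# Tune the migration batch size from the observed client success rate.
def next_batch_size(current: int, success_rate: float,
                    min_size: int = 16, max_size: int = 4096) -> int:
    """Ramp up like TCP slow start; back off when business requests start failing."""
    if success_rate >= 0.999:          # business traffic healthy: speed up
        return min(current * 2, max_size)
    if success_rate >= 0.99:           # mild impact: hold steady
        return current
    return max(current // 2, min_size) # noticeable impact: slow down

# Hypothetical simulation of a migration converging to a dynamic balance.
batch, rates = 16, [1.0, 1.0, 0.999, 0.995, 0.98, 0.999, 0.999]
for rate in rates:
    batch = next_batch_size(batch, rate)
    print(f"observed success rate {rate:.3f} -> next batch size {batch}")
```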

Next, let's look at the migration of large Values. We implemented an asynchronous Migrate command: while it executes, the Redis main thread can continue to process other normal requests. If a write request arrives for a Key that is currently being migrated, Redis returns an error. This maximizes the normal processing of business requests without blocking the main thread.
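
A simplified sketch of that write-path behavior is shown below; the set of in-flight Keys and the error type are illustrative, since the real logic lives inside the modified Redis engine rather than in application code.

```python
# Reject writes to Keys that are currently being migrated instead of blocking on them.
migrating_keys = set()

class KeyBeingMigrated(Exception):
    """Returned to the client so it can retry after the key has moved."""

def handle_write(store: dict, key: str, value: str) -> None:
    if key in migrating_keys:
        # The main thread refuses writes to in-flight keys rather than waiting.
        raise KeyBeingMigrated(key)
    store[key] = value

db = {"a": "1"}
migrating_keys.add("big_list")
handle_write(db, "a", "2")          # normal request, served immediately
try:
    handle_write(db, "big_list", "x")
except KeyBeingMigrated as e:
    print("client gets an error and retries later:", e)
```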

Squirrel persistence refactoring

Redis generates an RDB during master-slave synchronization. Generating the RDB calls fork to create a child process that writes the data to disk. Although fork benefits from the operating system's copy-on-write (COW) mechanism, when memory usage reaches 10 or 20 GB the whole process can still block for close to a second, which is almost unacceptable for an online service. We also enable AOF for services that require high data reliability, but with AOF enabled, I/O jitter can block the process, which also hurts the request success rate. Our solution to these two problems with the official persistence mechanism was to refactor it.

Above is our latest version of Redis persistence. Write requests are written to the DB first and then to the in-memory Backlog, just like the official version. At the same time, the request is sent to an asynchronous thread, which is responsible for flushing the changes into a hard disk Backlog. When the hard disk Backlog grows too large, we proactively generate an RDB during off-peak hours and then delete the Backlog entries written before that RDB.

What if we need to find a synchronization point for master-slave synchronization? The first step is the same as the official version: we look for the required sync point in the in-memory Backlog; if it is not there, we look in the hard disk Backlog. Because disk space is large, the hard disk Backlog can store a great deal of data, so it is rare for a sync point to be missing. If the disk Backlog also lacks it, we trigger an operation similar to a full resynchronization, but this full resync does not require generating an RDB on the spot: it can be served directly from the existing RDB on disk plus the disk Backlog written after it. With this design, we greatly reduced full resynchronizations. In addition, by restricting RDB generation to off-peak periods, we removed much of the jitter caused by RDB, and we avoided the write jitter caused by AOF at the same time. However, because AOF is written completely asynchronously, this solution offers slightly lower data reliability than the official one; we think that is a price well worth paying for the improved availability.
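
To make the lookup order concrete, here is a small sketch that models the memory Backlog, the disk Backlog and the on-disk RDB as objects with known offset ranges; the offsets and the string results are illustrative only, while the real code works on Redis replication offsets.

```python
# Decide how to serve a slave's sync request: memory Backlog, disk Backlog,
# or "full" resync built from the existing RDB plus the disk Backlog after it.
class Backlog:
    def __init__(self, start_offset: int, end_offset: int):
        self.start, self.end = start_offset, end_offset
    def contains(self, offset: int) -> bool:
        return self.start <= offset <= self.end

def plan_sync(requested_offset: int, mem: Backlog, disk: Backlog, rdb_offset: int) -> str:
    if mem.contains(requested_offset):
        return "partial resync from memory backlog"
    if disk.contains(requested_offset):
        return "partial resync from disk backlog"
    # No RDB has to be generated on the spot for the fallback path.
    return f"full resync: ship RDB at offset {rdb_offset} + disk backlog after it"

mem, disk = Backlog(9_000, 10_000), Backlog(2_000, 10_000)
for offset in (9_500, 5_000, 100):
    print(offset, "->", plan_sync(offset, mem, disk, rdb_offset=2_000))
```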

Squirrel hot Key

Let's take a look at Squirrel's hot Key solution. As shown in the following figure, the ordinary master and slave nodes belong to the normal cluster, while the hotspot master and slave nodes sit outside it. Let's see how they work together.

When a request comes in to read or write an ordinary node, the node collects statistics on the requested Keys. If a Key reaches a certain amount of traffic or bandwidth usage, flow control is automatically triggered to restrict access to the hot Key and prevent the node from being overwhelmed by hotspot requests. At the same time, the monitoring service periodically collects the hot Key statistics from all Redis instances. If a hot spot exists, the monitoring service reports the Slot of the hot Key to our migration service, which then adds the hotspot master and slave nodes to the cluster and migrates the hot Slot onto them. Because the hotspot master and slave only serve requests for hot Slots, the capacity for hot Keys is greatly improved. Through this design, we achieve real-time hotspot monitoring and timely flow control to stop losses, and with hotspot migration we achieve automatic hotspot isolation and rapid capacity expansion. A sketch of the per-node statistics and flow control follows.
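
Here is a rough sketch of the per-node hot Key statistics and flow control, assuming a fixed one-second window and an illustrative QPS threshold; the real node also tracks bandwidth and reports hot Slots to the monitoring and migration services.

```python
# Per-node hot Key counting with simple window-based flow control.
import time
from collections import Counter

class HotKeyGuard:
    def __init__(self, qps_limit: int = 5000):
        self.qps_limit = qps_limit
        self.window_start = time.monotonic()
        self.counts = Counter()

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 1.0:        # roll the one-second window
            self.window_start, self.counts = now, Counter()
        self.counts[key] += 1
        return self.counts[key] <= self.qps_limit # throttle once the key is too hot

    def hot_keys(self):
        """What the monitoring service would periodically collect from this node."""
        return [k for k, c in self.counts.items() if c > self.qps_limit]

guard = HotKeyGuard(qps_limit=3)
for _ in range(5):
    print("allowed" if guard.allow("sku:42") else "throttled")
print("hot keys this window:", guard.hot_keys())
```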

Persistent KV Cellar: architecture and practices

Let’s take a look at the architecture and practice of persistent KV Cellar. This is our latest Cellar architecture diagram.

There are two major architectural differences from Alibaba's open source Tair: the OB and ZooKeeper. Our OB serves a role similar to ZooKeeper's Observer, providing query services for the Cellar central node's metadata. It synchronizes the latest routing table from the Master of the central node in real time, and clients fetch the routing table from the OB. There are two main benefits: first, the large number of business clients is naturally isolated from the Master, the brain of the cluster, preventing routing table requests from affecting cluster management; second, because the OB only serves routing table queries and does not participate in cluster management, it can scale horizontally, which greatly improves our routing table query capacity. In addition, we introduced ZooKeeper for distributed arbitration to solve the Master/Slave split-brain problem under network partition that I mentioned earlier, and by storing the cluster's metadata in ZooKeeper we ensure its high reliability.

Cellar node disaster recovery

Having introduced the overall architecture, let's take a look at how Cellar handles node disaster recovery. A cluster node outage is usually temporary, as is a node's network jitter; the node will quickly recover and rejoin the cluster. Completely removing a node because of a temporary absence, and then performing data replenishment, consumes a lot of resources and affects business requests. Therefore, we implemented a Handoff mechanism to deal with the impact of such short-term node failures.

As shown in the figure above, if node A goes down, the Handoff mechanism is triggered: the central node notifies the client that node A is down and asks the client to send the requests for shard 1 to node B as well. After node B processes these read and write requests from the client, it also writes the shard 1 and shard 2 data that should have been written to node A into a local Log.

If node A recovers after 3 to 5 minutes of downtime, or after 30 to 50 seconds of network jitter, node A sends a heartbeat to the central node, and the central node notifies node B: "Node A has recovered; please send it the data written during its absence." Node B then writes the locally stored Log back to node A. Once node A has the full data for the period of the failure, the central node tells the client that node A has fully recovered, and the client can send shard 1 requests back to node A.

In this way, we can remove a node within seconds, and a recovered node only needs to catch up on a small amount of incremental data before rejoining. In addition, if node A needs to be upgraded, the central node first switches node A's traffic to node B through an active Handoff; after node A is upgraded, it replays the incremental Log and then takes the traffic back to rejoin the cluster. By actively triggering the Handoff mechanism, we achieve transparent, unnoticed upgrades. A minimal sketch of the Handoff Log follows.
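
Below is a minimal sketch of the Handoff Log idea, with node B modeled as a small Python class and the Log as an in-memory list per shard; the real Log is persisted and replayed through Cellar's normal replication path rather than a dict assignment.

```python
# Absorb writes for an absent node, then replay them when it recovers.
from collections import defaultdict

class NodeB:
    def __init__(self):
        self.store = {}                       # node B's own data
        self.handoff_log = defaultdict(list)  # shard -> writes absorbed for node A

    def write(self, shard: int, key: str, value: str, handoff_for_a: bool) -> None:
        self.store[key] = value               # serve the client either way
        if handoff_for_a:                     # traffic redirected from node A
            self.handoff_log[shard].append((key, value))

    def replay_to(self, node_a_store: dict, shard: int) -> None:
        """When A recovers, push the incremental writes back, then clear the Log."""
        for key, value in self.handoff_log.pop(shard, []):
            node_a_store[key] = value

b, a_store = NodeB(), {}
b.write(1, "k1", "v1", handoff_for_a=True)    # during A's outage
b.write(3, "k3", "v3", handoff_for_a=False)   # B's own shard, no Log needed
b.replay_to(a_store, shard=1)                 # A is back; catch it up
print(a_store)  # {'k1': 'v1'}
```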

Cellar cross-region disaster recovery

Let me show you how Cellar's cross-region disaster recovery works. Cellar and Squirrel face the same cross-region problem, and the solution is likewise inter-cluster replication. The figure shows a cross-region scenario with a master cluster in Beijing and a slave cluster in Shanghai. For example, when a client writes to node A in the Beijing master cluster, node A replicates the data to nodes B and D as normal intra-cluster replication. At the same time, node A replicates the data to node H in the slave cluster. After node H handles this inter-cluster replication write, it also replicates the write to nodes I and K within the slave cluster. By establishing such a replication link between nodes of the master and slave clusters, we achieve inter-cluster replication with the lowest possible cross-region bandwidth consumption. Similarly, by configuring two such replication links in opposite directions between the clusters, we achieve bidirectional cross-region synchronization.

Cellar strong consistency

After we completed node and cross-region disaster recovery, businesses raised a higher requirement: strongly consistent storage. Our data replication so far had been asynchronous, so when a faulty node was removed, data that had not yet been replicated could be lost. For scenarios such as financial payments, data loss is not acceptable. Facing this hard problem, how should we solve it? The dominant solution in the industry is strongly consistent replication based on the Paxos or Raft protocols, and we chose Raft, mainly because the Raft paper is very detailed and engineering-oriented, and there are quite a few mature open source Raft implementations we could build on to shorten the development cycle.

The diagram below shows the architecture of the current Cellar cluster in Raft replication mode. The central node does the Raft group scheduling and decides which nodes have the three copies of each Slot.

You can see that Slot 1 lives on storage nodes 1, 2 and 4, and Slot 2 on storage nodes 2, 3 and 4. Each Slot forms a Raft group, and the client reads and writes against that Raft group. Since we pre-allocated 16,384 Slots, a storage node may host hundreds or even thousands of Slots when the cluster is small. If each Raft replication group had its own replication threads, replication requests and Logs, resource consumption would be very high and write performance would suffer. So we implemented Multi-Raft: Cellar writes the Logs of all Raft groups on the same node into one shared Log and replicates them with the same set of threads, and replication packets destined for the same target node are batched across different Raft groups, ensuring that write performance does not degrade just because there are many Raft groups. Raft has its own leader election mechanism, so it controls its own leadership, and if a Leader goes down it elects a new one by voting. Does that mean the central node no longer needs to manage the Raft groups? It does. In a typical scenario, if some nodes in a cluster go through several rounds of outages and recoveries, the Raft Leaders end up extremely unevenly distributed across storage nodes. Since strong consistency requires the client's reads and writes to go to the Raft Leader, the traffic across cluster nodes becomes very unbalanced. So our central node also schedules Leaders for the Raft groups. For example, Slot 1 is stored on nodes 1, 2 and 4, with node 1 as the Leader. If node 1 fails, Raft elects node 2 as Leader; when node 1 recovers and rejoins the cluster, the central node asks node 2 to hand the leadership back to node 1. In this way, even after a series of outages and recoveries, the number of Leaders on each storage node stays balanced. A sketch of this Leader rebalancing follows.
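
Here is an illustrative sketch of that Leader scheduling decision, assuming each Slot records a preferred Leader and that a leadership-transfer primitive exists; the node numbers and the data layout are assumptions for illustration, not Cellar's internal representation.

```python
# Compute the leadership transfers needed to move each Slot's Leader back to
# its preferred node after a round of outages and recoveries.
def plan_leader_transfers(slots: dict) -> list:
    """slots: slot -> {'replicas': [...], 'leader': node, 'preferred': node}.
    Returns (slot, current_leader, preferred_leader) transfer operations."""
    transfers = []
    for slot, info in slots.items():
        if info["leader"] != info["preferred"] and info["preferred"] in info["replicas"]:
            transfers.append((slot, info["leader"], info["preferred"]))
    return transfers

# After node 1 recovers from an outage, Slot 1's Leader is still on node 2.
slots = {
    1: {"replicas": [1, 2, 4], "leader": 2, "preferred": 1},
    2: {"replicas": [2, 3, 4], "leader": 3, "preferred": 3},
}
for slot, src, dst in plan_leader_transfers(slots):
    print(f"slot {slot}: ask node {src} to transfer Raft leadership back to node {dst}")
```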

Next, let's see how Cellar ensures a high end-to-end success rate. Here, too, three problems affect the success rate. Cellar faces the same data migration and hot Key issues as Squirrel, but solves them differently, because Cellar evolves on its own path and does not need to stay compatible with an official upstream, so it can make more aggressive architectural changes. The third problem is slow requests blocking the service queue and causing a wave of timeouts, which is specific to Cellar's network and multi-threaded work model.

Cellar Smart Migration

This diagram shows Cellar's smart migration architecture. We split slot migration into three states. The first is the normal state, with no migration. If Slot 2 is to be migrated from node A to node B, node A takes a snapshot of Slot 2 and sends the full snapshot to node B. During the migration, node B's response packets carry back node B's status: engine pressure, network adapter traffic, queue length and so on. Node A adjusts its migration speed according to node B's status; like Squirrel, after a period of adjustment the migration speed reaches a dynamic balance, achieving the fastest migration while affecting normal business requests as little as possible.

After Slot 2 finishes migrating, it enters the state shown for Slot 3 in the figure. The client may not have updated its routing table yet, so when it still requests node A, node A finds that the client has asked the wrong node; instead of returning an error, it proxies the request to node B and sends B's response back to the client. At the same time it tells the client to update its routing table, after which the client accesses node B directly. This avoids the request errors caused by client route update delays. A sketch of this proxying behavior follows.
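
The proxying behavior can be sketched as follows; the request and response shapes and the forward() callback are hypothetical, but the control flow (serve locally, or proxy to the new owner and flag the client to refresh its routes) mirrors the description above.

```python
# Serve a request for a Slot that may already have moved to another node.
def handle_request(node_id: str, slot_owner: dict, slot: int, request: dict,
                   forward) -> dict:
    if slot_owner[slot] == node_id:
        return {"data": f"served locally by {node_id}", "refresh_route": False}
    # Slot has moved: proxy to the new owner instead of returning an error.
    response = forward(slot_owner[slot], request)
    response["refresh_route"] = True          # tell the client to update its routes
    return response

slot_owner = {2: "node_b", 3: "node_a"}
fake_forward = lambda node, req: {"data": f"served by {node} via proxy"}
print(handle_request("node_a", slot_owner, 2, {"op": "get"}, fake_forward))
# {'data': 'served by node_b via proxy', 'refresh_route': True}
```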

Cellar fast and slow queues

At the top of the figure below is a standard thread-queue model: the network thread pool receives network traffic and parses the request packets, then puts the requests into a work queue; the worker thread pool takes requests from the work queue, processes them, and hands the response packets back to the network thread pool to send out.

When we analyzed online timeout cases, we found that in a batch of timed-out requests, usually only one or two were actually caused by slow engine processing; most requests simply waited too long in the queue, so their overall response time exceeded the timeout. From our online analysis, genuinely slow requests accounted for only about one in twenty of the timed-out requests.

Our solution is very simple: split the thread pool and split the queue. After parsing a packet, our network threads put the request into one of four queues according to its characteristics: read or write, fast or slow. Read and write are easy to distinguish, but how do we distinguish fast from slow? We classify requests by the number of Keys, the Value size, the number of data structure elements and so on. Four corresponding worker thread pools then process the four queues, isolating fast and slow reads and writes so that a slow request cannot hold up the other three kinds. But doesn't going from one thread pool to four quadruple the number of threads? Not really: when one thread pool is idle, it helps the other pools handle requests, so we end up with four thread pools but the same total number of threads. In our online validation, this design reduced the service's TP999 latency by 86% and significantly reduced the timeout rate. A sketch of the queue split follows.
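
A minimal sketch of the four-queue split with work stealing is shown below; the size-based fast/slow rule and the thresholds are illustrative assumptions.

```python
# Classify requests into four queues (read/write x fast/slow) and let idle
# workers steal from other queues so the total thread count stays unchanged.
from queue import Queue, Empty

QUEUES = {name: Queue() for name in ("read_fast", "read_slow", "write_fast", "write_slow")}

def classify(request: dict) -> str:
    kind = "read" if request["op"] in ("GET", "MGET") else "write"
    # "Slow" if it touches many keys or a large value; the real rules also look
    # at data-structure element counts.
    slow = request.get("key_count", 1) > 16 or request.get("value_bytes", 0) > 64 * 1024
    return f"{kind}_{'slow' if slow else 'fast'}"

def dispatch(request: dict) -> None:
    QUEUES[classify(request)].put(request)

def worker_take(own: str):
    """Prefer our own queue; steal from others when idle."""
    try:
        return QUEUES[own].get_nowait()
    except Empty:
        for name, q in QUEUES.items():
            if name != own:
                try:
                    return q.get_nowait()
                except Empty:
                    continue
    return None

dispatch({"op": "GET", "key_count": 1})
dispatch({"op": "MGET", "key_count": 100})
print(worker_take("write_fast"))  # an idle write worker helps drain a read queue
```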

Cellar hot Key

The diagram above shows the architecture of Cellar's hot Key solution. You can see that the central node has taken on an extra responsibility: managing hotspot areas. It is now responsible not only for the normal distribution of data replicas but also for the distribution of hotspot data; in this cluster, nodes C and D host the hotspot area. Let's walk through the read and write flow. When a client writes a Key to node A, node A determines from its real-time hot Key statistics whether the Key is hot. If it is, then in addition to normal intra-cluster replication, node A also replicates the data to the hotspot-area nodes, namely C and D in the figure. When the storage node returns the result to the client, it also tells the client that the Key is hot, and the client caches that fact; subsequent read requests for this Key go directly to the hotspot area. In this way we can expand capacity for hotspot data only, unlike Squirrel, which has to move an entire Slot out for expansion. If necessary, the central node can also spread the hotspot area across all nodes in the cluster, so that hot read requests are evenly distributed over every node. In addition, because the hotspot data is replicated in real time, we avoid the consistency problems of schemes that simply cache hot KVs on the client side. A sketch of the client-side behavior follows.
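
Finally, here is a rough sketch of the client-side behavior, assuming the storage node's response carries an is_hot flag and the hotspot-area address; those field names and the cache policy (a plain dict, TTL omitted) are assumptions for illustration.

```python
# Client learns from the response that a Key is hot, then routes later reads
# for that Key to the hotspot area directly.
class CellarClient:
    def __init__(self, send):
        self.send = send          # function(target, request) -> response
        self.hot_cache = {}       # key -> hotspot-area node to read from

    def get(self, key: str):
        if key in self.hot_cache: # hot Key: read straight from the hotspot area
            return self.send(self.hot_cache[key], {"op": "GET", "key": key})["value"]
        resp = self.send("normal_node", {"op": "GET", "key": key})
        if resp.get("is_hot"):    # node says this Key is hot; remember where it lives
            self.hot_cache[key] = resp["hot_node"]
        return resp["value"]

def fake_send(target, request):
    if target == "normal_node":
        return {"value": "v1", "is_hot": True, "hot_node": "hot_node_c"}
    return {"value": "v1"}

client = CellarClient(fake_send)
print(client.get("sku:42"))   # first read: normal node, learns the Key is hot
print(client.get("sku:42"))   # later reads go to the hotspot area directly
```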

Development plans and industry trends

Finally, let's look at our project's roadmap and the technology trends in the industry, covering the service, system and hardware layers. At the service layer, there are three main points:

  • First, Redis Gossip protocol optimization. As is well known, as the cluster grows larger the number of Gossip messages increases and failover takes longer and longer, so cluster availability suffers greatly once the cluster reaches TB scale. We will therefore focus on optimization in this area.
  • Second, we have already implemented Raft replication between the data replicas on Cellar storage nodes to guarantee strong consistency. We will also implement Raft replication inside Cellar's central node, so that we no longer depend on ZooKeeper for distributed arbitration and metadata storage, making our architecture simpler and more reliable.
  • Third, Squirrel and Cellar are both KV stores, but because they grew out of different open source projects, their APIs and access protocols are different. In the future we will consider unifying Squirrel and Cellar at the SDK layer: there will still be different storage clusters on the back end, but business teams will be able to access both through a single SDK.

At the system level, we are investigating and implementing kernel-bypass technologies such as DPDK and SPDK, user-space I/O stacks for the network and disks. They bypass the kernel and access devices through polling, which can greatly improve the system's I/O capability. As storage is an I/O-intensive service, its performance can be greatly improved this way.

At the hardware level, smart NICs with RDMA support can significantly reduce network latency and improve throughput; 3D XPoint-based storage such as Intel's new AEP has access latency much closer to memory, and the line between flash and memory will become increasingly blurred. Finally, consider computational hardware: for example, by putting an FPGA card on flash storage, tasks that would otherwise be done by the CPU, such as data compression and decompression, can be offloaded to the card. This kind of hardware not only frees up the CPU but also lowers the service's response latency.

About the author

Qi Zebin is a senior technical expert at Meituan Dianping. He joined Meituan in 2014.

Recruitment information

The Storage Technology Center of Meituan's Basic Technology Department is looking for senior engineers and technical experts in C/C++, Go and Java. You are welcome to join the Basic Technology Department family. Interested candidates are welcome to send their resumes to [email protected] (subject line: Basic Technology Department - Storage Technology Center).
