As a side note: RISE Lab, formerly known as the AMP Lab at Berkeley, has developed a number of highly successful distributed technologies that have had a profound impact on high-performance computing, including Spark, Mesos, and Tachyon. Matei Zaharia, a former AMP Lab PhD student and one of the core authors of Spark and Mesos, has since moved on to Stanford, where late last year he launched DAWN, an open research project aimed at making machine learning practical for everyone (see AI Front’s earlier coverage). Soon after, RISE Lab released Ray, a new distributed execution framework that aims to replace Spark (see AI Front’s earlier coverage).
For the past few years, RISE Lab has focused its research on how to design distributed systems that do not require coordination. The lab laid out the underlying theory (http://db.cs.berkeley.edu/papers/sigrec10-declimperative.pdf), designed a new programming language, Bloom (http://bloom-lang.net/), developed a cross-platform program analysis framework, Blazes (https://arxiv.org/pdf/1309.3324.pdf), and published the HATs transaction protocol (http://www.vldb.org/pvldb/vol8/p185-bailis.pdf). But prior to Anna, they had never evaluated the performance of these theories, languages, frameworks, and protocols in a multi-core or cloud environment.
Anna’s launch validates that line of research. Anna’s paper shows that on a single AWS instance, Anna is 10 times faster than Redis, and in a standard benchmark it also beats Cassandra by a factor of 10. Going further, the researchers compared Anna’s performance with other major key-value systems: it is 700 times faster than Masstree and 800 times faster than Intel’s “lock-free” TBB hash table. To be clear, Anna does not provide the linearizability that those systems do. However, Anna’s use of thread-private state pays off even at weaker consistency levels: it is 126 times faster than a “hogwild”-style C++ hash table. And when it comes to the cloud, Anna leads the pack: the other systems do not really offer linear scaling, but Anna does.
Anna’s performance and scalability can be attributed to its complete lack of coordination. Anna’s worker threads spend 90% of their time processing requests, while in most other systems (such as Masstree and Intel’s TBB) threads spend less than 10% of their time processing requests and the remaining 90% waiting for coordination. On top of that, those systems use shared memory and therefore suffer processor cache misses from cache-coherence traffic.
Anna is not only fast; it also achieves strong levels of consistency. The HATs transaction protocol the lab published years ago showed that there is ample room for coordination-free distributed consistency and transaction isolation, including causal consistency and the read committed isolation level. Anna ports Bloom’s lattice composition design pattern to C++ and is the first system to achieve all of these consistency levels. It is also this simplicity of design that makes such speed possible.
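To make the lattice idea concrete, here is a minimal sketch of the kind of C++ lattice Anna composes, assuming a simple last-writer-wins design (the LwwLattice name and layout are invented for this example; they are not Anna’s actual code):

```cpp
#include <cstdint>
#include <string>

// A minimal last-writer-wins (LWW) lattice sketch. Merge is
// associative, commutative, and idempotent, so replicas converge to
// the same state no matter how updates are ordered or duplicated.
struct LwwLattice {
    uint64_t timestamp = 0;  // logical clock of the last write
    std::string value;       // payload (1KB values in the benchmarks)

    // Keep the write with the larger timestamp; break ties on the
    // value itself so the outcome is deterministic on every replica.
    void merge(const LwwLattice& other) {
        if (other.timestamp > timestamp ||
            (other.timestamp == timestamp && other.value > value)) {
            timestamp = other.timestamp;
            value = other.value;
        }
    }
};
```

Richer guarantees such as causal consistency come from composing lattices like this one (for example, pairing a vector-clock lattice with a value lattice) rather than from locks or coordination.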
The RISE researchers learned a great deal while designing Anna, and the lessons reach far beyond key-value databases: they can be applied to any distributed system. The lab is now developing a follow-on system based on Anna called Bedrock. Bedrock runs in the cloud, provides a low-cost key-value storage service that requires no manual intervention, and is open source.
Figure 1: Anna architecture
The diagram above shows Anna’s single-node architecture. An Anna server consists of a set of independent threads, each running an uncoordinated actor. Each thread corresponds to one CPU core, and the number of threads never exceeds the total number of CPU cores. Client proxies distribute remote requests to the actors, and each actor keeps a private hash table in shared memory. Updates are exchanged between threads via in-memory broadcast, and between servers via protobuf messages.
This one-to-one mapping between threads and CPU cores avoids context-switching overhead. Actors share no key-value state: the key space is partitioned by consistent hashing, and data partitions are replicated across actors using multi-master replication with a configurable replication factor. Actors notify one another of key updates with timestamped messages, and each actor’s private state is stored in a data structure called a lattice, which guarantees consistency even when messages are delayed, reordered, or duplicated.
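The partitioning just described can be sketched as follows: an illustrative consistent hash ring with a configurable replication factor. The class and method names are invented for the example, not taken from Anna’s codebase:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Illustrative consistent hashing: actors sit at hashed positions on
// a ring, and each key is owned by the next `replication_factor`
// actors clockwise from the key's own hash.
class ConsistentHashRing {
public:
    ConsistentHashRing(const std::vector<std::string>& actors,
                       std::size_t replication_factor)
        : replication_factor_(replication_factor) {
        for (const auto& actor : actors)
            ring_[std::hash<std::string>{}(actor)] = actor;
    }

    // Returns the actors responsible for `key`, walking clockwise
    // around the ring starting at the key's hash.
    std::vector<std::string> owners(const std::string& key) const {
        std::vector<std::string> result;
        auto it = ring_.lower_bound(std::hash<std::string>{}(key));
        while (result.size() < replication_factor_ &&
               result.size() < ring_.size()) {
            if (it == ring_.end()) it = ring_.begin();  // wrap around
            result.push_back(it->second);
            ++it;
        }
        return result;
    }

private:
    std::size_t replication_factor_;
    std::map<uint64_t, std::string> ring_;  // ring position -> actor
};
```

The useful property is that adding or removing an actor remaps only the keys adjacent to it on the ring, so membership changes do not reshuffle the whole key space.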
Below, Anna’s uncoordinated actor model is evaluated for its parallelism on multi-core CPUs and its scalability across multiple servers, and compared with other popular key-value databases.
Uncoordinated actor model scalability
In the uncoordinated actor model, each actor corresponds to a thread that keeps its own private copy of any shared state and notifies other actors of updates via asynchronous broadcast. On multi-core servers, this model outperforms the traditional shared-memory model by an order of magnitude.
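A rough sketch of one such actor’s event loop, reusing the LwwLattice type sketched earlier, might look like the following. The queues and broadcast buffer are simplified stand-ins (plain standard containers) for Anna’s real message channels, and the 100 ms broadcast period is an arbitrary illustrative choice:

```cpp
#include <chrono>
#include <deque>
#include <string>
#include <unordered_map>
#include <vector>

// A hypothetical update message carrying a key and its lattice value.
struct Update {
    std::string key;
    LwwLattice payload;  // the LWW lattice sketched earlier
};

void actor_loop(std::deque<Update>& client_puts,   // incoming writes
                std::deque<Update>& peer_updates,  // gossip from peers
                std::vector<Update>& out_broadcast) {
    std::unordered_map<std::string, LwwLattice> local;    // private state
    std::unordered_map<std::string, LwwLattice> pending;  // unsent deltas
    auto last_broadcast = std::chrono::steady_clock::now();

    for (;;) {
        // 1. Apply client writes to private state: no locks, no atomics,
        //    and no waiting on any other thread.
        while (!client_puts.empty()) {
            Update u = client_puts.front();
            client_puts.pop_front();
            local[u.key].merge(u.payload);
            pending[u.key] = local[u.key];
        }
        // 2. Merge updates broadcast by other actors; merge() makes
        //    delayed, reordered, or duplicated messages harmless.
        while (!peer_updates.empty()) {
            Update u = peer_updates.front();
            peer_updates.pop_front();
            local[u.key].merge(u.payload);
        }
        // 3. Periodically broadcast accumulated deltas to the other
        //    replicas of each key (multi-master replication).
        auto now = std::chrono::steady_clock::now();
        if (now - last_broadcast > std::chrono::milliseconds(100)) {
            for (const auto& kv : pending)
                out_broadcast.push_back({kv.first, kv.second});
            pending.clear();
            last_broadcast = now;
        }
    }
}
```

The point to notice is that steps 1 and 2 touch only thread-private state, so a request never waits on another thread; agreement across actors happens lazily in step 3 through lattice merges.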
To this end, the RISE researchers designed a comparison experiment pitting Anna against the shared-memory-based TBB and Masstree, along with a key-value store of their own design (call it “Ideal”). They ran the experiment on AWS m4.16xlarge instances, each with 32 CPU cores. The workload consisted of 1 million key-value pairs with 8-byte keys and 1KB values, and load was generated from Zipfian distributions with various coefficients to simulate different levels of contention.
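For reference, a skewed workload of this kind can be generated along the following lines; this is a reconstruction of the setup, not the authors’ benchmark code:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Draws key indices from a Zipf distribution over n keys with skew
// coefficient s: key k is chosen with probability proportional to
// 1 / (k+1)^s. Built on std::discrete_distribution.
class ZipfKeys {
public:
    ZipfKeys(std::size_t n, double s) : gen_(std::random_device{}()) {
        std::vector<double> weights(n);
        for (std::size_t k = 0; k < n; ++k)
            weights[k] = 1.0 / std::pow(static_cast<double>(k + 1), s);
        dist_ = std::discrete_distribution<std::size_t>(weights.begin(),
                                                        weights.end());
    }
    std::size_t next() { return dist_(gen_); }  // 0-based key index

private:
    std::mt19937 gen_;
    std::discrete_distribution<std::size_t> dist_;
};
```

With n = 1,000,000, a coefficient of 4 concentrates almost all requests on a handful of hot keys (the high-contention runs below), while 0.5 spreads requests far more evenly (the low-contention runs).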
In the first experiment, they compared Anna’s single-server throughput with that of TBB, Masstree, and Ideal, incrementally increasing the number of threads up to the number of CPU cores and observing the resulting throughput.
Figure 2
Figure 3
Figure 2 shows throughput as a function of the number of threads under high contention, with a Zipf coefficient of 4. Figure 3 shows the CPU time breakdown under the same high-contention load. CPU time is divided into five categories: request handling (RH), atomic instructions (AI), lattice merging (LM), broadcast (M), and other. The rightmost column shows the number of L1 cache misses (CM).
As the figures show, TBB and Masstree lose almost all of their parallelism under this load. Because most operations are updates, concurrent updates to the same key are serialized: a synchronization mechanism must prevent multiple threads from updating the same key at the same time. As a result, their performance flattens out as the number of threads grows. Ideal does 6 times better than TBB and Masstree, but still falls short of Anna: although it uses no synchronization, it still pays a cache-coherence cost whenever multiple threads modify shared memory locations.
In contrast, Anna performs updates against local state, needs no synchronization, and exchanges changes through periodic broadcasts. Under high contention its performance is still bounded by its replication factor, but it is far better than the shared-memory systems. Anna spends 90% of its CPU time processing requests and very little on anything else. Anna’s uncoordinated actor model thus solves the scalability problem of key-value stores in multi-core environments.
Figure 4
Figure 5
Figure 4 shows throughput as a function of the number of threads under low contention, with a Zipf coefficient of 0.5. Figure 5 shows the CPU time breakdown under low contention, with the rightmost column showing the memory footprint (MF).
With a replication factor of 1, Anna scales better, thanks to its very low memory footprint. However, when the replication factor is raised to 3, throughput drops significantly (by about three-quarters), for two reasons. First, a higher replication factor consumes more memory. Second, under low contention the updates are spread over many more distinct keys, so they can no longer be coalesced into merged broadcasts; Figure 5 shows that with a replication factor of 3, Anna spends 69% of its CPU time processing broadcast changes. With full replication, Anna stops scaling altogether, because every thread must apply every update. Meanwhile, although TBB and Masstree have no broadcast overhead, they still pay significant memory-footprint and synchronization costs. The conclusion from this experiment is that, for a system supporting multi-master replication, a high replication factor under low contention hurts performance.
Figure 6
Figure 6 shows how throughput changes as threads are added across multiple servers. Anna’s replication factor was set to 3, with the first 32 threads started on the first server, the next 32 on a second server, and all remaining threads on a third. As the figure shows, Anna’s throughput grows linearly with the number of threads. There is a slight dip when the 33rd thread is started, because that thread belongs to the second server and introduces cross-server communication, but overall the growth is very stable. With Anna’s uncoordinated actor model, throughput scales linearly.
Comparison with other systems
To compare Anna’s performance with that of other popular key-value systems, the RISE researchers designed two experiments: the first against Redis on a single node, and the second against Cassandra in a large distributed deployment.
Anna is multi-threaded, while Redis uses a single-threaded model. So, in the first experiment, they set up a Redis cluster on a single server and compared Anna against it. The experiment ran on an AWS EC2 instance, with the Redis cluster sized to use as many threads as possible.
Figure 7
As Figure 7(a) shows, under high contention the overall throughput of the Redis cluster stays nearly flat, since each hot key is pinned to a single single-threaded shard, whereas Anna can spread hot keys across replicas. Anna’s throughput rises as the replication factor increases, until it levels off; if hot keys are fully replicated, throughput keeps growing as threads are added. Figure 7(b) shows that under low contention both Anna and the Redis cluster parallelize well, their throughput growing with the number of threads.
This experiment shows that under high contention Anna can beat the Redis cluster by replicating hot keys, while under low contention Anna achieves performance comparable to the Redis cluster’s.
In the second experiment, the RISE researchers set Cassandra’s consistency level to the weakest setting (ONE), meaning a write succeeds as soon as a single node acknowledges it. They ran the experiment across four AWS EC2 regions (Oregon, Northern Virginia, Ireland, and Tokyo) and measured scalability by varying the number of nodes in each region. Both systems used a replication factor of 3.
Figure 8
As Figure 8 shows, both Anna and Cassandra scale linearly as nodes are added, but Anna performs significantly better. In fact, Anna beats Cassandra using only four threads per node, and with all threads in use, Anna outperforms Cassandra by more than 10 times.
References:
https://rise.cs.berkeley.edu/blog/anna-kvs/
Original paper:
http://db.cs.berkeley.edu/jmh/papers/anna_ieee18.pdf