
Overview

Dynamo is a highly available key-value (KV) storage system. To achieve high availability and high performance, Dynamo adopts an eventual consistency model, offering developers a new API with a versioning mechanism and client-side assistance in conflict resolution. Dynamo aims to provide uninterrupted service while maintaining performance and scalability. Because Amazon widely uses a decentralized, loosely coupled microservice architecture, the storage systems that hold the state of those services have particularly high availability requirements.

Simple Storage Service (S3) is another well-known Amazon storage service. Although it can also be seen as KV storage, it does not match Dynamo's target scenario. S3 is an object store for large files: it mainly holds binary blobs and provides no cross-object transactions. Dynamo is a document store for small items: it stores structured data (such as JSON), indexes it, and supports transactions across data items.

Dynamo can be thought of as trading the rich query model of a traditional relational database for better performance and scalability, keeping only primary-key access.

To achieve scalability and high availability while guaranteeing eventual consistency, Dynamo combines the following techniques:

  1. Partitioning and replication of data with consistent hashing.
  2. Vector clocks to handle data versioning and consistency.
  3. A quorum scheme plus a decentralized anti-entropy protocol based on Merkle trees to keep replicas consistent.
  4. Gossip-based failure detection and membership maintenance.

In terms of implementation, Dynamo has the following characteristics:

  1. Completely decentralized: there is no central node and all nodes are peers.
  2. Eventually consistent: conflicts are resolved with version numbers (vector clocks), sometimes with the user's participation.
  3. Keys are hashed to partition data, place replicas, and balance load.

Author: A Miscellany of Wood Birds, www.qtmuniao.com/2020/06/13/… Please credit the source when reposting.

Background

Goals and assumptions

Different design assumptions and requirements lead to completely different designs, and Dynamo’s design goals are as follows:

Query model. Dynamo supports only primary-key access, and operations generally do not span multiple data items, so a relational model is unnecessary. In addition, Dynamo assumes the objects it stores are relatively small, usually under 1 MB.

ACID properties. Traditional relational databases (DBMSs) usually provide ACID properties to guarantee the correctness and reliability of transactions, but full ACID support significantly hurts availability and performance. For high availability, Dynamo offers only weaker consistency (the C), provides no isolation guarantees (the I), and permits only single-key updates.

Efficiency. To meet the SLAs of Amazon services, which have stringent latency requirements, Dynamo must be configurable, letting users trade off among performance, cost efficiency, availability, and durability.

Other. Dynamo is used only by Amazon's internal services, so security is not a concern. In addition, since many services run their own separate Dynamo instances, the initial scalability target was on the order of hundreds of machines.

SLA

Because of the microservice architecture, rendering a single page of the Amazon shopping site typically involves hundreds of services. To guarantee the user experience, the latency of each service must be strictly bounded. Amazon expresses its SLAs at the 99.9th percentile (for example, 99.9% of requests must complete within 300 ms). A key element of Dynamo's design is letting each service tune parameters such as durability and consistency on demand, choosing its own balance among performance, cost, and correctness.

Design considerations

For multi-replica systems, high availability and strong consistency are at odds. While traditional commercial systems tend to sacrifice some availability for consistency, Dynamo is built for high availability and therefore replicates asynchronously. However, because network and server failures happen frequently, the system has to deal with the inconsistencies, or conflicts, that such failures cause. Two questions shape how conflicts are handled: when they are resolved, and who resolves them.

When are conflicts resolved? To keep reads simple, traditional storage systems usually resolve conflicts at write time, that is, they reject a write when a conflict exists. Dynamo, however, must provide an "always writable" guarantee so that the store is available to users at all times (for example, customers must always be able to add items to their shopping carts), so conflict resolution is deferred to read time.

Who resolves conflicts, Dynamo itself or the application? If the data store resolves them, it can only apply simple policies such as "last write wins", overwriting older changes with newer ones. If resolution is left to the application, it can act on the application's semantics, for example merging several concurrent add-to-cart operations and returning the combined cart to the user. This is optional, of course: for many applications the generic "last write wins" policy is good enough.

Other key design principles include:

Incremental scalability. Nodes can be added and removed dynamically with minimal impact on the system and its operators.

Symmetry. Every node has the same responsibilities; there are no special nodes, which keeps construction and maintenance simple.

Decentralization. There is no central control node; peer-to-peer techniques keep the system highly available and easy to scale.

Heterogeneity. The system must be able to exploit heterogeneous resources, distributing load in proportion to each node's capacity.

System architecture

This section focuses on the distributed techniques used in Dynamo: the partitioning algorithm, the replication strategy, the versioning mechanism, membership management, failure handling, and scaling.

System interface

Dynamo exposes two interfaces: put() and get():

get(key): returns the single object associated with the key, or a list of objects with conflicting versions, together with a context.

put(key, context, object): selects the replica nodes where the object should be placed based on the key and writes the data to disk. The context carries system metadata that is opaque to the caller, such as the object's version (vector clock); it is stored together with the object so the system can validate the put request.

Dynamo treats both the key and the value as opaque byte arrays, and applies MD5 to the key to produce a 128-bit identifier that is used to select the storage nodes.
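As a rough illustration of this interface (the class and helper names below are assumptions, and everything runs in a single process with no routing, replication, or quorums), the client-facing API might look like:

```python
import hashlib

def key_to_ring_position(key: bytes) -> int:
    """MD5 the key to get the 128-bit identifier used to pick storage nodes."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")

class TinyDynamoClient:
    """Single-process sketch of the put()/get() interface; only the shape of the API is shown."""

    def __init__(self):
        self._store = {}  # ring position -> (object, context)

    def get(self, key: bytes):
        """Return (object_or_conflicting_versions, context); the context would hold e.g. a vector clock."""
        pos = key_to_ring_position(key)
        obj, context = self._store.get(pos, (None, {}))
        return obj, context

    def put(self, key: bytes, context: dict, obj: bytes) -> None:
        """`context` is the opaque metadata obtained from a previous get() on this key."""
        pos = key_to_ring_position(key)
        self._store[pos] = (obj, context)

# Usage: read-modify-write, passing the context back on the put.
client = TinyDynamoClient()
value, ctx = client.get(b"cart:42")
client.put(b"cart:42", ctx, b"apple,banana")
```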

Partitioning algorithm

To support incremental scaling, Dynamo distributes load using consistent hashing. The basic form of consistent hashing, however, has two drawbacks:

  1. The load is not distributed evenly.
  2. Differences in the capacity of different nodes are not taken into account.

To address these problems, Dynamo uses a variant of consistent hashing that introduces virtual nodes. The algorithm is as follows:

  1. When a node joins the system, a number of virtual nodes is created for it according to its capacity, and each virtual node is assigned a random position (token) in the hash space.
  2. All virtual nodes are organized into a ring ordered by their tokens.
  3. When a request arrives, the key is hashed into the same space as the tokens.
  4. Starting from the key's hash, the ring is searched clockwise; the physical node behind the first virtual node found serves the request.
  5. When a node leaves, only its virtual nodes need to be removed, and its load automatically migrates to the next nodes around the ring.

Differences in node capacity are handled by assigning each node a different number of virtual nodes, and the random placement of virtual nodes keeps the load evenly spread when nodes are added or removed.
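To make the ring concrete, here is a minimal consistent-hashing sketch with virtual nodes (the hash function, the way tokens are derived, and the class name are assumptions for illustration only):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string into the same 128-bit space used for keys."""
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

class ConsistentHashRing:
    def __init__(self):
        self.tokens = []       # sorted virtual-node positions on the ring
        self.token_owner = {}  # token -> physical node name

    def add_node(self, node: str, num_virtual: int) -> None:
        """More capable nodes get more virtual nodes; tokens could equally be drawn at random."""
        for i in range(num_virtual):
            token = ring_hash(f"{node}#vnode{i}")
            self.token_owner[token] = node
            bisect.insort(self.tokens, token)

    def remove_node(self, node: str) -> None:
        """Dropping a node's tokens lets its load migrate to the ring successors."""
        self.tokens = [t for t in self.tokens if self.token_owner[t] != node]
        self.token_owner = {t: n for t, n in self.token_owner.items() if n != node}

    def route(self, key: str) -> str:
        """Walk clockwise from the key's hash to the first virtual node."""
        h = ring_hash(key)
        idx = bisect.bisect_right(self.tokens, h) % len(self.tokens)
        return self.token_owner[self.tokens[idx]]

# Usage: a node with twice the capacity gets twice the virtual nodes.
ring = ConsistentHashRing()
ring.add_node("node-a", num_virtual=8)
ring.add_node("node-b", num_virtual=16)
print(ring.route("user:42"))
```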

Dynamo has used three partitioning strategies for placing and replicating data and for handling node addition and removal:

  1. Each node is assigned T random tokens, one per virtual node, and the interval between the tokens of two adjacent virtual nodes on the hash ring forms a partition.

    This initial strategy had several drawbacks:

    • Migration scans. When a new node joins the system, it needs to take over ("steal") key ranges from other nodes. The affected nodes must scan their local storage to find the items to hand off (presumably because, to serve get requests, data on a node is indexed by the application key rather than by the key's hash, so a full scan is required). This is a heavy operation, and to keep the system available the migration task has to run at reduced priority, which in turn makes the migration take much longer.

    • Merkle tree recalculation. A Merkle tree, discussed below, can be roughly understood as a hierarchical signature of the data in a partition. When nodes join or leave the cluster, key ranges are split or merged and the corresponding Merkle trees must be recomputed. This is also a computationally intensive operation, adding a load that an online system cannot tolerate.

    • Global snapshots are difficult. Because data placement follows the hash of the key, the key ranges stored on each physical node are fragmented. Taking a snapshot of the whole key space is therefore hard: the data on all nodes would have to be merged and sorted globally, which is inefficient.

    As you can see, the fundamental problem with this strategy is that data partitioning and data placement are coupled: nodes cannot be added or removed without changing the partitioning itself. The natural improvement is therefore to separate data partitioning from data placement.

  2. Each node is still assigned T random tokens, but the hash space is divided into equal-sized partitions.

    Under this strategy, node tokens are only used to build the ring of virtual nodes and no longer define partition boundaries. The hash space is divided into Q equal parts, with Q >> S*T, where S is the number of physical nodes, so each virtual node covers many partitions. Put another way, the smallest unit a node hosts is no longer a single key but a whole partition, and when nodes are added or removed, partitions move as a whole. This eliminates the migration scans and Merkle tree recalculation caused by membership changes.

    Keys are placed as follows: the hash of the key is computed, the partition (key range) containing that hash is found, and starting from the upper boundary of that partition the ring is walked clockwise; the first N distinct physical nodes encountered form the preference list.

  3. Each node is assigned Q/S tokens, and the hash space is divided into equal-sized partitions.

    This strategy builds on the previous one and additionally forces every physical node to own a (nearly) equal number of partitions, Q/S. Because Q, and even Q/S, is much larger than the number of nodes S, when a node leaves, its partitions can be handed out to the remaining nodes while preserving this property, and when a node joins, every existing node can just as easily give up a few partitions to it. A sketch of this strategy is given after the list.
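Below is a rough sketch of the third strategy under stated assumptions (the values of Q and N, the round-robin partition assignment, and the helper names are all illustrative, not Dynamo's actual scheme): the hash space is cut into Q fixed, equal-sized partitions, each physical node owns roughly Q/S of them, and a key's preference list is the first N distinct physical nodes met walking clockwise from its partition.

```python
import hashlib

Q = 64          # number of fixed, equal-sized partitions (Q >> S * T in practice)
N = 3           # replication factor
HASH_BITS = 128

def partition_of(key: bytes) -> int:
    """Which of the Q equal slices of the 128-bit hash space the key falls into."""
    h = int.from_bytes(hashlib.md5(key).digest(), "big")
    return (h * Q) >> HASH_BITS

def assign_partitions(nodes):
    """Strategy 3: spread the Q partitions over the S nodes as evenly as possible
    (Dynamo randomizes this assignment; round-robin keeps the sketch short)."""
    return {p: nodes[p % len(nodes)] for p in range(Q)}

def preference_list(key: bytes, owners):
    """Walk clockwise partition by partition, collecting the first N distinct physical nodes."""
    start = partition_of(key)
    picked = []
    for step in range(Q):
        node = owners[(start + step) % Q]
        if node not in picked:   # skip partitions owned by an already-chosen machine
            picked.append(node)
        if len(picked) == N:
            break
    return picked

# Usage: four physical nodes, so each owns Q/S = 16 partitions.
owners = assign_partitions(["node-a", "node-b", "node-c", "node-d"])
print(preference_list(b"cart:42", owners))   # three distinct physical nodes
```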

Replication strategy

Dynamo replicates each data item on N nodes, where N is configurable. For each key, a coordinator node is responsible for replicating it across multiple nodes; more precisely, each coordinator is responsible for a key range.

During replication, the coordinator stores each data item locally and on its N-1 clockwise successors on the consistent hash ring, giving N copies in total (as shown in Figure 2 of the Dynamo paper). These N nodes form the key's preference list. Note that:

  1. The key-to-node mapping differs under the three partitioning strategies described above.
  2. Because nodes can go down and come back, the preference list may sometimes contain more than N nodes.
  3. Since virtual nodes are used, N consecutive positions on the ring may map to fewer than N physical machines. To avoid this, positions are skipped while building the preference list so that the N replicas end up on N distinct physical machines.

Data Versioning mechanism

Dynamo provides eventual consistency, allowing replicas to be synchronized asynchronously, which improves availability. If there are no machine or network failures, all replicas converge within a bounded time; if failures occur, some replicas may remain out of sync for an extended period.

Dynamo aims to be writable at all times, so even when the latest version of an object is unreachable, a write must still produce a new version. To make this possible, Dynamo treats the result of every change as a new, immutable version of the data. Multiple versions can coexist; in most cases a newer version subsumes the older ones and the system can pick the authoritative version automatically (syntactic reconciliation). But when failures combine with concurrent updates, conflicting version branches can appear that the system cannot merge on its own, and the client has to merge the versions (semantic reconciliation).

Dynamo uses a kind of logical clock, the vector clock, to capture causality between versions of the same object. A vector clock is a list of <node, counter> pairs, one per node that has coordinated an update to the object. By comparing the vector clocks of two versions of the same data, we can tell whether they are causally ordered or parallel branches:

  1. If every counter in vector clock A is less than or equal to the corresponding counter in vector clock B, then A is an ancestor of B and can be discarded. For example, A = [<node1, 1>] is an ancestor of B = [<node1, 1>, <node2, 2>, <node3, 1>].
  2. If A is not an ancestor of B and B is not an ancestor of A, then versions A and B conflict and need to be reconciled.

In Dynamo, when a client updates an object, it must indicate which version it is updating. It does so by passing, in the context of the put, the vector clock it previously obtained from a get on the same object. Similarly, when a client reads an object and the system cannot merge the versions automatically (syntactic reconciliation fails), all conflicting versions are returned along with their contexts. If the client later issues an update using that context, the system assumes the client has merged those versions (semantic reconciliation). A sketch of this is shown below.
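The following minimal vector clock sketch (class and method names are assumptions) replays a scenario in the spirit of the paper's example: two concurrent updates coordinated by different nodes produce conflicting versions that a later reader must merge.

```python
class VectorClock:
    """Sketch of a vector clock: a mapping from coordinator node name to a counter."""

    def __init__(self, counters=None):
        self.counters = dict(counters or {})
        # A real implementation also stores a timestamp per entry and truncates the
        # list once it grows past a threshold, as described in the notes below.

    def increment(self, node):
        """The coordinator bumps its own counter before applying a write."""
        new = dict(self.counters)
        new[node] = new.get(node, 0) + 1
        return VectorClock(new)

    def descends_from(self, other):
        """True if self is a (possibly equal) successor of other, so other can be discarded."""
        return all(self.counters.get(n, 0) >= c for n, c in other.counters.items())

    def concurrent_with(self, other):
        """Neither clock descends from the other: conflicting branches that need reconciliation."""
        return not self.descends_from(other) and not other.descends_from(self)

def merged(a, b):
    """Client-side (semantic) reconciliation of the clocks: element-wise maximum."""
    names = set(a.counters) | set(b.counters)
    return VectorClock({n: max(a.counters.get(n, 0), b.counters.get(n, 0)) for n in names})

# A scenario in the spirit of the paper's example:
d1 = VectorClock().increment("Sx")     # first write, coordinated by node Sx
d2 = d1.increment("Sx")                # second write through Sx; d2 supersedes d1
d3 = d2.increment("Sy")                # concurrent update coordinated by Sy
d4 = d2.increment("Sz")                # concurrent update coordinated by Sz
assert d2.descends_from(d1)            # d1 can be discarded (syntactic reconciliation)
assert d3.concurrent_with(d4)          # a read now returns both; the client must merge
d5 = merged(d3, d4).increment("Sx")    # reconciled version written back through Sx
assert d5.descends_from(d3) and d5.descends_from(d4)
```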

There are a few things to note:

  1. Each server node maintains its own counter and increments it before applying a change request that it coordinates.
  2. To prevent the vector clock from growing without bound, especially under network partitions or server failures, Dynamo records a timestamp with each <node, counter> pair and discards the oldest pair once the list exceeds a certain threshold (say, 10 entries).

get() and put()

This section describes the interaction when the system does not fail. It is mainly divided into two processes:

  1. Select a coordinator in some way.
  2. The coordinator uses a quorum mechanism to keep the replicas of the data in sync.

Selecting a coordinator

Dynamo is exposed as a service over HTTP. A coordinator can be selected using one of two strategies:

  1. Use a load balancer to select a lightly loaded node.
  2. A partition-aware client is used to route directly to the coordinator responsible for the key (that is, the first one in the preference list).

With the first approach, the client does not need to keep any information about the server nodes. With the second, requests skip a forwarding hop and therefore have lower latency.

With the first approach, if a put() request lands on a node S that is not among the N nodes in the key's preference list, S forwards the request to a node in the preference list, which acts as the coordinator. For a get() request, S can act as the coordinator directly, whether or not it is in the preference list.

Quorum mechanism

Quorum is an interesting read/write scheme with two key parameters, R and W. Classically, R and W are chosen to satisfy (1) R + W > N and (2) W > N/2, where N is the number of replicas. This can be understood in two ways. One is by analogy with read/write locks: a read quorum always overlaps a write quorum, so reads cannot miss the latest write, while a small R still allows many concurrent reads. The other is that more than half of the replicas must acknowledge a write for it to be considered durable. For example, with N = 3, W = 2, and R = 2, any two replicas contacted by a read include at least one that received the latest write. In Dynamo, however, none of these constraints is mandatory; users can configure R and W to suit their needs.

When a put() request arrives, the coordinator generates a new vector clock version for the data, writes it to its local replica, and sends it to the N nodes in the preference list. Once W-1 of them reply, the write is considered successful.

When a get() request arrives, the coordinator sends the request to the N preferred nodes holding the key (which may or may not include itself) and, once R of them respond, returns the list of versions to the user. Causally ordered versions are merged using the vector clock rules (syntactic reconciliation) and the reconciled version is written back.
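A much-simplified sketch of the coordinator's quorum logic (replicas are plain in-process objects, failure handling and reconciliation are omitted, and all names are assumptions):

```python
N, R, W = 3, 2, 2       # a common configuration; R + W > N gives read/write quorum overlap
assert R + W > N

class Replica:
    """Stand-in for one storage node; real replicas are reached over the network."""
    def __init__(self):
        self.versions = {}  # key -> list of (clock, value) versions

    def write(self, key, clock, value):
        self.versions.setdefault(key, []).append((clock, value))
        return True         # ack

    def read(self, key):
        return list(self.versions.get(key, []))

def coordinator_put(pref_list, key, clock, value):
    """Write to the N preferred replicas (the coordinator's local write is one of them)
    and report success once at least W acks are in."""
    acks = sum(1 for replica in pref_list[:N] if replica.write(key, clock, value))
    return acks >= W

def coordinator_get(pref_list, key):
    """Collect replies from R replicas and return every version they hold; causally
    ordered versions would then be pruned with the vector clock rules shown earlier."""
    replies = []
    for replica in pref_list[:R]:    # sketch: just take the first R replies
        replies.extend(replica.read(key))
    return replies

# Usage: three replicas, one write, then a quorum read.
nodes = [Replica() for _ in range(N)]
assert coordinator_put(nodes, "cart:42", clock={"Sx": 1}, value=b"apple")
print(coordinator_get(nodes, "cart:42"))
```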

Handling temporary failures: hinted handoff

If a strict quorum were used for reads and writes, even a few node failures or a network partition would make the system unavailable. Dynamo therefore uses a "sloppy quorum": reads and writes are performed on the first N healthy nodes encountered on the preference list of the consistent hash ring.

When the node that should hold a replica (say, A) is unavailable, the request is routed to another node (say, D), and A's identity is recorded in the replica's metadata as a hint. A background thread on D keeps checking whether A is back online; once it is, D forwards the hinted data to A and deletes its local copy. Hinted handoff lets Dynamo complete requests even when a node or the network fails.
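A toy illustration of hinted handoff (the data layout and names are assumptions): a write intended for node A lands on stand-in node D, which stores it with a hint naming A and hands it back once A is reachable again.

```python
class HintedStore:
    """Sketch of a node that keeps hinted replicas in a separate local table."""

    def __init__(self, name):
        self.name = name
        self.data = {}     # key -> value: replicas this node really owns
        self.hinted = {}   # intended_node -> {key: value}: replicas held temporarily

    def write(self, key, value, intended_node=None):
        if intended_node is None or intended_node == self.name:
            self.data[key] = value
        else:
            # The hint in the metadata records which node this replica was meant for.
            self.hinted.setdefault(intended_node, {})[key] = value

    def handoff(self, recovered_node, send):
        """Called by a background thread once `recovered_node` is seen alive again."""
        for key, value in self.hinted.pop(recovered_node, {}).items():
            send(recovered_node, key, value)   # deliver, then drop the local hinted copy

# Usage: A is down, so the coordinator routes A's replica to D with a hint.
d = HintedStore("D")
d.write("cart:42", b"apple", intended_node="A")
# later, when A is detected as alive again:
d.handoff("A", send=lambda node, k, v: print(f"handing {k} back to {node}"))
```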

Of course, to maximize availability W can be set to 1, so that a write succeeds as long as any single node in the preference list is up. In practice W is not set that low, in order to preserve durability. The configuration of N, R, and W is discussed further in later sections.

In addition, to survive data-center-level failures, each key's preference list is spread across multiple data centers for disaster recovery.

Handling permanent failures: replica synchronization

Hinted handoff only copes with occasional, temporary node outages. To handle the inconsistencies left behind by more serious, long-lasting failures, Dynamo uses a decentralized anti-entropy protocol to synchronize data between the replicas of each shard.

Dynamo uses a Merkle tree (also known as a hash tree, and also used in blockchains) to sign all the data in a shard, shard by shard, so that it can quickly check whether replicas agree and pinpoint the ranges where they differ. Every leaf is the hash of a data item, and every intermediate node is the hash of its children. Such a tree has the following benefits:

  1. Comparing only the root hashes tells whether two replicas of the shard hold the same data.
  2. Each intermediate node is a signature of all the data in a sub-range; if two intermediate nodes are equal, the corresponding data is identical and need not be compared further.
  3. If only a few items differ, the inconsistent data can be located quickly by descending from the root.

Dynamo maintains a Merkle tree per shard (a key range, the smallest logical unit of storage). Using the properties of the tree, Dynamo can quickly check whether two replicas of a shard are consistent and, if not, narrow down where they differ, minimizing the amount of data that has to be transferred.

The drawback is that when nodes join or leave the cluster, many key ranges change and the Merkle trees for the affected ranges must be recomputed. As discussed earlier, the improved partitioning strategy mitigates this problem.
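A compact Merkle tree sketch over one shard's items (the hash choice and tree layout are assumptions) shows how comparing hashes from the root down localizes the differing key ranges:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Bottom-up list of levels; levels[0] holds the leaf hashes, levels[-1] the root."""
    levels = [[h(x) for x in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        if len(prev) % 2:                  # duplicate the last hash on odd-sized levels
            prev = prev + [prev[-1]]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(a, b, level=None, idx=0):
    """Walk down from the roots, descending only into subtrees whose hashes differ,
    and return the indices of the leaves (key ranges) that are out of sync."""
    if level is None:
        level = len(a) - 1
    if a[level][idx] == b[level][idx]:
        return []                          # identical subtree: nothing below needs comparing
    if level == 0:
        return [idx]
    left = diff_leaves(a, b, level - 1, 2 * idx)
    right = diff_leaves(a, b, level - 1, 2 * idx + 1) if 2 * idx + 1 < len(a[level - 1]) else []
    return left + right

# Usage: two replicas of the same shard, one stale item.
replica1 = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica2 = [b"k1=v1", b"k2=OLD", b"k3=v3", b"k4=v4"]
t1, t2 = build_tree(replica1), build_tree(replica2)
print(diff_leaves(t1, t2))                 # -> [1]: only that range needs to be exchanged
```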

Membership and failure detection

Membership is managed explicitly. In Amazon's environment, a node leaving the cluster because of a failure or human error is rare and usually short-lived. If data placement were rebalanced automatically every time a node went offline, the data would churn unnecessarily. Dynamo therefore manages membership explicitly and provides an interface for administrators to add and remove physical nodes; data placement does not change just because a node goes down due to a fault.

Membership changes are spread with a gossip-style protocol. A change is first observed by the node that handles the join or removal request and persisted locally, and is then propagated via gossip: each node periodically picks a random peer to reconcile membership views with, and eventually all members converge on the same view. The same mechanism is also used to exchange partition and data placement information when nodes start up.

When a node starts up, it knows only its own identity and token set. As nodes come up and exchange information through the gossip protocol, the full ring topology (the mapping from key ranges to virtual nodes, and from virtual nodes to physical nodes) is built up incrementally on every node, so an incoming request can be forwarded directly to the node that should handle it.
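A toy gossip round over membership and token information (the view layout, a map from member to a (tokens, version) pair, is an assumption made for this sketch):

```python
import random

class GossipNode:
    def __init__(self, name, tokens):
        self.name = name
        # membership view: member name -> (tokens on the ring, version of that entry)
        self.view = {name: (tokens, 1)}

    def gossip_with(self, peer):
        """Both sides merge views, keeping the higher-versioned entry for each member."""
        for member in set(self.view) | set(peer.view):
            mine = self.view.get(member, (None, 0))
            theirs = peer.view.get(member, (None, 0))
            best = mine if mine[1] >= theirs[1] else theirs
            self.view[member] = best
            peer.view[member] = best

def gossip_round(nodes):
    """Each node reconciles with one randomly chosen peer; repeated rounds converge all views."""
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        node.gossip_with(peer)

# Usage: after a few rounds every node has learned the whole ring topology.
nodes = [GossipNode(f"node-{i}", tokens=[i * 100, i * 100 + 50]) for i in range(4)]
for _ in range(3):
    gossip_round(nodes)
print(sorted(nodes[0].view))
```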

Seed nodes prevent logical partitions. A few nodes are designated as seeds for discovery; every node gossips with the seed nodes, so a newly joined node quickly becomes known to the others. This avoids the logical partitions that could arise if several nodes joined the cluster at the same time without knowing about one another.

Failure detection. To avoid repeatedly forwarding put/get requests and membership exchanges to unreachable nodes, purely local failure detection is sufficient: if node A gets no response to a request it sent to B, A marks B as failed and starts probing it periodically to notice its recovery. If A receives a request that should be served by B and finds B marked as failed, it picks an alternative node from the key's preference list.

As you can see, Dynamo treats permanent and temporary departures of nodes differently. Permanent membership changes go through an explicit interface and are broadcast via gossip; occasional failures are handled with simple local marking plus heartbeats, diverting traffic as needed. In an environment where failures are rare, this divide-and-conquer approach speeds up agreement on membership and avoids unnecessary data movement caused by transient node failures and network jitter.

Adding and removing nodes

Consider a ring containing nodes ... F, G, A, B, C, D ... with three replicas (N = 3) and the simplest partitioning strategy. When a node X is added between A and B, X becomes responsible for the key ranges (F, G], (G, A], and (A, X]. At the same time, B no longer needs to store (F, G], C no longer needs (G, A], and D no longer needs (A, X]. B, C, and D therefore push the relevant key ranges to X; to avoid duplicate pushes, they wait for X's confirmation before transferring.
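The following sketch reproduces that hand-off with made-up token positions (the numbers and helper names are assumptions); it computes, for every key range, which node drops out of the range's preference list when X joins:

```python
import bisect

N = 3  # replication factor

def owners(pos, ring):
    """The first N nodes met walking clockwise from position `pos`: the preference list
    of the key range that ends at `pos`."""
    tokens = sorted(ring.values())
    names = {tok: name for name, tok in ring.items()}
    start = bisect.bisect_left(tokens, pos)
    return [names[tokens[(start + i) % len(tokens)]] for i in range(N)]

before = {"F": 10, "G": 20, "A": 30, "B": 40, "C": 50, "D": 60}   # made-up token positions
after = dict(before, X=35)                                        # X joins between A and B

tokens_after = sorted(after.values())
names_after = {tok: name for name, tok in after.items()}
for i, hi in enumerate(tokens_after):
    lo = tokens_after[i - 1]               # the key range (lo, hi], wrapping at the ends
    dropped = set(owners(hi, before)) - set(owners(hi, after))
    if dropped:
        print(f"range ({names_after[lo]}, {names_after[hi]}] moves from {sorted(dropped)} to X")
# -> (F, G] from B, (G, A] from C, (A, X] from D, matching the description above
```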

Implementation

Each Dynamo node consists of four components: Request coordination, membership management, failure detection, and a local persistence engine. All components are implemented in Java.

Dynamo's local persistence component is pluggable and supports several engines, including Berkeley DB (BDB), MySQL, and an in-memory buffer backed by persistent storage. Services choose an engine based on their workload; most production instances use BDB.

The request coordination component is built on event-driven messaging implemented with Java NIO channels, with the processing of a message split into stages. A state machine is created for each incoming read or write request. For a read request, for example, the state machine goes through the following steps (a sketch follows the list):

  1. Send read requests to all nodes holding replicas of the shard the key belongs to.
  2. Wait for the minimum required number of replies (R).
  3. If fewer than R replies arrive within the time limit, return a failure to the client.
  4. Otherwise, gather all returned versions and decide which ones to return.
  5. If versioning is enabled, perform syntactic reconciliation and write the reconciled vector clock into the context.
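A skeletal version of that read state machine (the state names and the stubbed messaging hooks are assumptions, not Dynamo's actual Java code):

```python
from enum import Enum, auto

class ReadState(Enum):
    SEND = auto()       # 1. send the read to every replica of the key's shard
    WAIT = auto()       # 2. wait for the minimum number (R) of replies
    FAIL = auto()       # 3. deadline passed with fewer than R replies
    GATHER = auto()     # 4. collect all versions and decide what to return
    RECONCILE = auto()  # 5. syntactic reconciliation; stale replicas get read-repaired afterwards

def run_read(replicas, r, rpc, deadline_replies):
    """Drive one read request through the states; `rpc` and `deadline_replies` stand in
    for the event-driven messaging layer (Dynamo uses Java NIO channels)."""
    state, replies = ReadState.SEND, []
    while True:
        if state is ReadState.SEND:
            for node in replicas:
                rpc(node)                      # fire off the requests
            state = ReadState.WAIT
        elif state is ReadState.WAIT:
            replies = deadline_replies()       # whatever arrived before the deadline
            state = ReadState.GATHER if len(replies) >= r else ReadState.FAIL
        elif state is ReadState.FAIL:
            return None                        # report the failure to the client
        elif state is ReadState.GATHER:
            state = ReadState.RECONCILE
        elif state is ReadState.RECONCILE:
            return replies                     # conflicting versions go back with the context

# Usage with trivial stubs: two replies arrive, which meets R = 2.
print(run_read(["n1", "n2", "n3"], r=2, rpc=lambda n: None,
               deadline_replies=lambda: ["v1", "v1"]))
```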

If the coordinator notices during a read that some replicas returned stale versions, it sends them the latest version; this is called read repair.

For write requests, one of the top N nodes in the preference list acts as the coordinator, usually the first. To improve throughput and spread load, however, any of the N nodes may coordinate. In particular, since a write usually follows immediately after a read (the client reads, obtains the version, then writes with that version), the write is often sent to the node that responded fastest to the preceding read; that node still holds the context from the read, which speeds up the response and improves throughput.

References

serverless.pub/s3-or-dynam…

Optimistic replication: en.wikipedia.org/wiki/Optimi…


Welcome to follow the author's public account, A Miscellany of Wood Birds, for more articles on distributed systems.