This article is a reading note for a paper on Dynamo, Amazon’s high availability key database. Dynamo was designed to reduce the need for consistency and high performance and availability in large clusters where failures are common.

background

The data storage technology proposed in this paper must meet the following requirements:

  • Query model: Data objects are specified by keys;
  • Dynamo provides only weak consistency and only operations on a single key, not transactions.
  • High efficiency: system delay needs to meet requirements;
  • Others: Dynamo needs to be used only within Amazon, so there are no security concerns.

In order for the application to function within a limited time, the client and the service need to reach a service standards agreement (SLA), and the client and server need to agree on some system characteristics. In this paper, 99% of the delay distribution is used as SLA to make performance requirements for the service.

When server failures and network failures are common, and concurrent writes exist, how to detect and resolve conflicts is a huge challenge. Dynamo is designed to be ultimately consistent, with all updates eventually reaching all replicas. To ensure the always-writable feature, the system leaves conflict resolution to the read operation.

Dynamo provides flexible conflict resolution policies, including simple built-in policies and custom policies. The system design process also needs to consider the following quasi-test:

  • Incremental scalability: Expand one node at a time to minimize o&M work and system operating burden;
  • Symmetry: All nodes perform the same function;
  • Decentralization: there is no centralized node;
  • Heterogeneity: The ability to take full advantage of machines with different performance.

System architecture

The system interface

Dynamo provides only two operations get() and put() :

  • get(key)Get the data object corresponding to the key, or a list of data objects with conflicting versions;
  • put(key, context, object)The data and context are stored in the key, and the context contains the metadata of the data object such as the version number.

Partitioning algorithm

Dynamo uses a hash to determine where the key is stored, which maps the key and node to a ring using a hash function, and then stores the key on the first node encountered clockwise on the ring.

But there is a problem with the most basic consistent hash:

  1. The distribution of each node in the ring is not uniform;
  2. Differences in node performance are not considered;

The solution is to have each node map to multiple virtual nodes on the ring. Using virtual nodes has the following benefits:

  • If a node is unavailable, its load can be evenly redistributed to other nodes;
  • If the node becomes available, it can take a similar amount of load from multiple nodes;
  • You can set the number of virtual nodes based on the node performance.

A copy of the

In addition to saving the data object on the node corresponding to the key, you also need to save the copy clockwiseOn a node. The list of nodes that hold keys is calledPreference listIn addition, skip the virtual node corresponding to the physical node whose data has been saved.

Data version

Dynamo treats each change as a new, unchangeable new release, and the client needs to provide a conflict resolution mechanism to merge multiple versions of data. Dynamo tracks different versions of data through a vector clock.

runget()andput()

The client needs to select a node for each read/write request. You can:

  1. Select a node according to the load condition through the load balancer;
  2. Select a node based on zone information.

The node that responds to the read/write operation is called the coordinator. Dynamo uses the consensus protocol of majority voting and provides two parameters:

  • The minimum number of nodes that need to participate in the read
  • The minimum number of nodes that need to participate in a write

The user can adjustandTo strike a balance between delay and consistency.

Prompt switch

Some nodes cannot be accessed due to a failure, when the preference list is selectedThe healthy nodes read and write, and the copies are returned to the recovered nodes after the failed nodes recover.

A copy of the synchronization

Dynamo uses the anti-entropy protocol for copy synchronization and uses Merkle trees to organize hashes of data objects to speed up data comparison.

Membership and fault detection

  • Ring membership: Dynamo relies primarily on gossip based protocols to propagate membership changes and maintain an ultimately consistent membership;
  • External discovery: If two new nodes are added at the same time, there will be a period of time when they do not know each other’s existence, so it is necessary to set up some seed nodes, so that all nodes communicate with it to speed up the discovery of new nodes;
  • Fault detection: If node B does not respond to node A, node B is considered faulty. Dynamo uses the myth-based protocol to sense when other nodes go offline.

Add or delete a storage node

When a new node is added, some key range is assigned to the new node, and the data corresponding to these keys needs to be passed to the new node. Similarly, when a node is deleted, the storage of data also needs to be adjusted.

implementation

Each storage node consists of three software components: request coordination, membership and fault detection, and a local persistence engine. Dynamo allows you to use different storage engines, such as Berkeley DB, MySQL, or memory storage with persistence. The request coordination component is built on an event-driven model and processes messages in a pipelined manner.

The request coordination component runs as a state machine, and a read operation runs as follows:

  1. Sends the read request to the node;
  2. Waiting for the minimum number of replies;
  3. If a sufficient number of responses are not received within a given time, the request fails;
  4. Then aggregate all data versions to determine the data returned;
  5. If versioning is enabled, semantic conflict resolution is performed, producing a context containing the vector clock.

If an outdated version is found during the communication between nodes, update it in time.

reference

  1. DeCandia, Giuseppe, et al. “Dynamo: amazon’s highly available key-value store.” ACM SIGOPS operating systems review. Vol. 41. No. 6. ACM, 2007.