1. Errors in the distributed system

Many problems in a distributed system can cause failures.

  • The simple approach: take the entire service offline and show the user an error message.
  • The harder approach: make the service itself fault-tolerant, so that it keeps working even when some internal component fails.

The best way to build fault-tolerant services is to implement some generic abstraction with guarantees, and then have applications rely on those guarantees. For example, with transactions, an application can:

  • Not worry about data being left inconsistent by a crash, because the transaction either committed or aborted (atomicity)
  • Not worry about other clients concurrently accessing the database (isolation)
  • Trust that successfully committed data has been stored reliably (durability)

Note: Transaction abstraction ensures that data remains valid even in the event of crashes, race conditions, and disk failures.

One of the most important abstractions of distributed systems is consensus: getting all the nodes to agree on something. Once agreed, it can be used for a variety of purposes:

  • Database nodes can use consensus to elect a new leader.

2. Consistency guarantees

The “Problems with replication lag” section mentioned data inconsistencies caused by timing issues during database replication.

  • For example, if you look at two database nodes at the same moment, you may see different data on the two nodes, because write requests arrive at different nodes at different times.

However, most replicated databases provide at least eventual consistency: if you stop writing to the database and wait for some unspecified length of time, then eventually all read requests will return the same value.

Note: The inconsistency is temporary and we expect that all copies will eventually converge to the same value.

The distributed consistency model bears some resemblance to the transaction isolation level hierarchy discussed in previous chapters. But:

  • Transaction isolation is primarily about avoiding race conditions caused by concurrently executing transactions
  • Distributed consistency is mainly about how to coordinate state between replicas in the face of delays and failures

Linear consistency

In a linearly consistent system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value that was just written. To maintain the illusion of a single copy of the data, the system must guarantee that the value read is the most recent, up-to-date value, not one from a stale cache or replica.

Linear consistency (linearizability) is also called atomic consistency, strong consistency, immediate consistency, or external consistency.

An example of a violation of linear consistency

  • Alice and Bob are sitting in the same room, each looking at their phone, following the result of the World Cup final. After the final score is announced:
  • Alice refreshes the page, sees the result, and tells Bob
  • Bob refreshes his phone, but his request is routed to a lagging database replica, and his phone says the game is still in progress.

2.1 What makes the system linearly consistent

The basic idea behind linear consistency is simple: make the system look like there is only one copy of the data. The following example shows the complexity of linear consistency:

Unrestricted concurrent read and write requests may cause inconsistent return values

  • Client A’s first read operation completes before the write operation begins, so it must return the old value 0

  • The last read operation of client A starts after the write operation is complete.

    • If the database is linearly consistent, it must return a new value of 1

    Note: Each read or write operation takes effect at some point within its start/end interval. A read that begins after a write has completed must therefore see the newly written value.

  • Any read operation that overlaps with the write operation may return 0 or 1 because the client does not know whether the write operation is in effect at the time of reading. These operations are concurrent.

The linear consistency of the system after adding constraints

In a linearly consistent system, there must be some point in time at which the value of x atomically flips from 0 to 1. Therefore, if one client's read returns the new value 1, all subsequent reads must also return the new value, even if the write operation has not yet completed.

Note: The black arrows in the figure illustrate this timing dependency: client A is the first to read the new value 1. After A's read returns, B begins a new read. Since B's read starts strictly after A's read completes, it must also return 1, even though C's write is still in progress.

Adding a compare-and-set operation

In the figure below, each operation is marked with a vertical line at the point where we think the operation took effect. These markers, joined in sequence, must form a valid sequence of reads and writes of a register.

Linear consistency requires that the line connecting the markers always moves forward in time (from left to right), never backward. This captures the recency guarantee described earlier: once a new value has been written or read, all subsequent reads see that value until it is overwritten again.

2.2 Linear consistency and serializability

Serializability: an isolation property of transactions, where every transaction may read and write multiple objects (rows, documents, records). It guarantees that transactions behave as if they had executed in some serial order.

Linear consistency: a recency guarantee on reads and writes of a single object. It does not group operations into transactions, so it does not prevent problems such as write skew.

A database may provide both serializability and linear consistency; this combination is known as strict serializability or strong one-copy serializability. Implementations of serializability based on two-phase locking or actual serial execution are usually linearly consistent. Serializable snapshot isolation, however, is not linearly consistent: it is designed to read from a consistent snapshot in order to avoid lock contention between readers and writers. The whole point of a consistent snapshot is that it does not include writes that happened after the snapshot, so reads from the snapshot are not linearly consistent.

2.3 Dependence on linear consistency

For a few domains, linear consistency is a prerequisite for the system to work correctly.

2.3.1 Locking and leader election

In a single-master replication system, you need to ensure that there is only one leader, rather than multiple leaders with a split brain.

  • One way to elect a leader is to use a lock: every node tries to acquire the lock on startup, and the one that succeeds becomes the leader. The lock must be linearly consistent: all nodes must agree on which node owns the lock.

Coordination services such as Apache ZooKeeper and etcd are often used to implement distributed locks and leader election. They use consensus algorithms to implement linearly consistent operations in a fault-tolerant way.

Note: Strictly speaking, ZooKeeper and etcd provide linearly consistent writes, but reads may be stale, since by default they can be served by any replica. You can optionally request a linearly consistent read: etcd calls this a quorum read.

2.3.2 Constraints and uniqueness guarantees

Uniqueness constraints are common in databases:

  • A username or e-mail address must uniquely identify a user
  • A bank account balance must never go negative
  • You must not sell more items than you have in stock in the warehouse
  • No two people may book the same seat on a flight or in a theater at the same time

Note: All of the above require linear consistency.
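To make this concrete, below is a minimal sketch of enforcing a uniqueness constraint with a linearizable compare-and-set operation. The LinearizableRegister class is an illustrative stand-in (backed here by a local lock); a real system would back it with a consensus service such as ZooKeeper or etcd.

```python
import threading

class LinearizableRegister:
    """Illustrative stand-in for a linearizable register. A real system would
    back this with a consensus service (ZooKeeper, etcd), not a local lock."""
    def __init__(self, initial=None):
        self._value = initial
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        # Atomically: install `new` only if the current value equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# One register per username; claiming it can succeed for exactly one client.
username_owner = LinearizableRegister(initial=None)

def claim_username(user_id):
    # Only the first CAS from None -> user_id wins; later claimants get False.
    return username_owner.compare_and_set(None, user_id)

print(claim_username("alice"))  # True  - alice claims the name
print(claim_username("bob"))    # False - the name is already taken
```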

2.3.3 Cross-channel timing dependencies

Suppose you have a website where users can upload photos, and a background process will resize the photos and reduce the resolution to speed up the download. The architecture and data flow of the system are shown below:

The figure above contains a possible race condition:

  • Suppose the message queue is faster than the internal replication inside the storage service
  • In that case, when the resizer fetches the image (step 5), it may see an older version of the image, or nothing at all
  • If it processes an old version of the image, the full-size image and the thumbnail in the file storage become permanently inconsistent with each other

Note: This problem arises because there are two different communication channels between the web server and the resizer: the file storage and the message queue.

2.4 Implementing linearly consistent systems

Linear consistency essentially means behaving as if there is only one copy of the data and all operations on it are atomic. The simplest answer would therefore be to really use only a single copy of the data, but that approach is not fault-tolerant: if the node holding that copy fails, the data is lost.

The most common way to make a system fault tolerant is replication.

2.4.1 Single-leader replication (potentially linearly consistent)

In a system with single-leader replication, the leader holds the primary copy of the data used for writes, while followers keep backup copies on other nodes. If reads are served from the leader, or from synchronously updated followers, they have the potential to be linearly consistent.

Note: Partitioning (sharding) a single master database so that each partition has a separate leader does not affect linear consistency, as linear consistency is only a guarantee for a single object.

2.4.2 Consensus Algorithm (Linear Consistency)

Consensus algorithms bear a resemblance to single-leader replication. However, consensus protocols include measures to prevent split brain and stale replicas, which is how ZooKeeper and etcd are able to implement linearly consistent operations safely.

2.4.3 Multi-leader replication (not linearly consistent)

Systems with multi-leader replication are generally not linearly consistent, because they process writes on several nodes concurrently and replicate them asynchronously to the other nodes. As a result they can produce conflicting writes that must be resolved.

2.4.4 Leaderless replication (probably not linearly consistent)

For leaderless replication systems (Dynamo-style), people sometimes claim “strong consistency” when quorum reads and writes are used (w + r > n). Whether that holds depends on how the quorums are configured and on how you define strong consistency.

Conflict resolution based on time-of-day clocks and last write wins (as in Cassandra) is almost certainly not linearly consistent, because clock skew means the clock timestamps are not guaranteed to match the actual order of events.
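The sketch below illustrates the problem with clock-timestamp-based last-write-wins merging; the node names, timestamps, and record format are invented for illustration.

```python
def lww_merge(versions):
    """Keep the version with the largest timestamp (last write wins)."""
    return max(versions, key=lambda v: v["ts"])

# The write on node B actually happened *after* the write on node A,
# but node B's clock is about 1.5 seconds slow, so its timestamp is smaller.
versions = [
    {"node": "A", "ts": 1_700_000_010.0, "value": "earlier write"},
    {"node": "B", "ts": 1_700_000_008.5, "value": "later write (slow clock)"},
]

print(lww_merge(versions)["value"])  # "earlier write": the later write is silently lost
```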

Note: Sloppy quorums also destroy any chance of linear consistency.

2.4.5 Linear consistency and quorums

Intuitively, strict quorum reads and writes in a Dynamo-style model should be linearly consistent. But when network delays are variable, race conditions can occur:

  • A writer client sends requests to write x = 1 to all three replicas
  • Reader A concurrently reads from two nodes (satisfying the quorum requirement) and sees the new value 1 on one of them
  • Reader B also concurrently reads from two nodes, and gets back the old value 0 from both

The quorum condition is satisfied, but the execution is non-linear: B’s request begins after A’s request completes, but B returns the old value and A returns the new value.

2.5 Cost of linear consistency

Multi-leader replication is often a good choice for replication across multiple data centers.

Consider this: What if there is a network outage between two data centers?

  • With a multi-leader database, each data center can continue operating normally: since writes made in one data center are asynchronously replicated to the other, writes simply queue up and are exchanged when connectivity is restored.
  • With a single-leader configuration, if the network between the data centers is interrupted, clients that can only reach the follower data center cannot write to the database, so the application is unavailable to them.

Note: Assume that the network in each data center works properly and clients can access the data center, but data centers cannot connect to each other.

2.5.1 CAP theorem

CAP stands for Consistency, Availability, and Partition tolerance, and is often stated as “pick two out of three.” When a network fault occurs, a choice must be made between linear consistency and total availability.

Note: Unfortunately, the “pick two” formulation is misleading, because a network partition is a kind of fault rather than something you can choose: when the network is working you can have both consistency and availability, and when a partition occurs you must choose one of the two.

2.5.2 Linear Consistency and network latency

Although linear consistency is a useful guarantee, surprisingly few systems are actually linearly consistent in practice. For example, even memory on a modern multi-core CPU is not linearly consistent.

If a thread running on one CPU core writes to a memory address and another thread running on another CPU core reads the same address a short time later, there is no guarantee that the value written by the first thread will be read. (Unless memory barriers are used).

The reason for this behavior is that every CPU core has its own memory cache and store buffer. Memory accesses go to the cache first by default, and changes are written back to main memory asynchronously.

The same trade-off applies to many distributed databases: they sacrifice linear consistency for performance, not for fault tolerance.

3. Ordering guarantees

Order is an important concept, so it comes up repeatedly throughout this book:

  • In Chapter 5 we saw that the main purpose of the leader in single-leader replication is to determine the order of writes in the replication log, that is, the order in which followers apply those writes. If there is more than one leader, concurrent operations can cause write conflicts.
  • Serializability, discussed in Chapter 7, is about ensuring that transactions behave as if they were executed in some order. It can be implemented literally by executing transactions in serial order, or by running parallel executions while preventing serialization conflicts.
  • The use of timestamps and clocks in distributed systems, discussed in Chapter 8, is another attempt to introduce order into an unordered world, for example, by determining which of two write operations occurs later.

3.1 Order and causality

Ordering keeps coming up for the following reason:

  • Order helps maintain causality, which imposes an order on events: cause before effect, message sent before message received.

    Note: A system is causally consistent if it obeys the ordering imposed by causality. For example, snapshot isolation provides causal consistency: when you read some data from the database, you must also be able to see any data that causally precedes it.

3.1.1 The causal order is not a full order

Full order (total order): any two elements can be compared; given two elements, you can always say which is greater and which is smaller. The natural numbers, for example, are totally ordered.

Partial order: mathematical sets are only partially ordered: {a, b} and {b, c} cannot be compared with each other. In some cases one set is greater than another (when it contains all of the other's elements), but in other cases the two are incomparable.

The difference between full and partial ordering is reflected in different database consistency models:

Linear consistency

In a linearly consistent system we have a full order of operations: the system behaves as if there were only one copy of the data and every operation is atomic, so for any two operations we can always say which one happened first.

Causality

Two events are ordered if they are causally related (one happened before the other), but they are incomparable if they are concurrent. This means that causality defines a partial order rather than a full order: some operations are ordered with respect to each other, while others are incomparable.

Note: It follows that there are no concurrent operations in a linearly consistent data store: there is a single timeline on which all operations lie, forming a full order.

3.1.2 Linear consistency is stronger than causal consistency

So what is the relationship between causal order and linear consistency?

Answer: Linear consistency implies causality: any linearly consistent system correctly maintains causality. In particular, if there are multiple communication channels in the system, linear consistency automatically guarantees causality without requiring any special operations on the system.

The fact that linear consistency ensures causality makes linearly consistent systems easier to understand and more attractive. However, as discussed in “The cost of linear consistency,” making a system linearly consistent can hurt its performance and availability, especially when network delays are significant. Linear consistency is not the only way to preserve causality, though: a system can be causally consistent without incurring the performance penalty of linear consistency. In fact, causal consistency is the strongest consistency model that does not slow down because of network delays and that remains available in the face of network failures.

In many cases, systems that appear to require linear consistency actually only require causal consistency, which can be implemented more efficiently. Based on this observation, researchers are exploring new kinds of databases that guarantee causal consistency while offering performance and availability characteristics similar to those of eventually consistent systems.

3.1.3 Capture causality

The techniques for determining which operation happened before which are similar to those we discussed for detecting concurrent writes in leaderless data stores: there, to prevent lost updates, concurrent writes to the same key had to be detected. Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key.

In order to determine the causal order, the database needs to know which version of the data the application read; for example, the version number from the earlier read is passed back to the database with the write.

3.2 Sequence number ordering

Although causality is an important theoretical concept, actually keeping track of all causal dependencies can be impractical. A better approach is to use sequence numbers or timestamps to order events. The timestamp need not come from a time-of-day clock; it can come from a logical clock, an algorithm that generates a sequence of numbers to identify operations, typically using a counter that is incremented for every operation.

In a single-leader replicated database, the replication log defines an order of writes that is consistent with causality. The leader can simply increment a counter for each operation, assigning a monotonically increasing sequence number to each operation in the replication log. If a follower applies the writes in the order they appear in the log, the follower's state is always causally consistent.

3.2.1 Non-causal sequence number generators

Several ways to generate sequence numbers:

  • Each node can generate its own independent set of sequence numbers. For example, with two nodes, one node generates only odd numbers and the other only even numbers.
  • You can attach a time-of-day timestamp to each operation. Such timestamps are not sequential, but if they have high enough resolution they may be sufficient to define a full order over operations; this is the basis of the last-write-wins conflict resolution method.
  • Blocks of sequence numbers can be pre-allocated. For example, node A might claim the block of numbers 1 to 1,000 and node B the block 1,001 to 2,000. Each node then independently assigns numbers from its block and claims a new block when its supply runs low.

Note: These sequence number generators do not correctly capture the order of operations across nodes, so the sequence numbers they produce are not consistent with causality.
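Below are illustrative sketches of the first and third schemes from the list above (odd/even interleaving and pre-allocated blocks); the class names are made up for illustration. They produce unique numbers, but nothing about those numbers reflects which operation causally preceded which.

```python
import itertools

class OddEvenGenerator:
    """Scheme 1: two nodes share the number space; node 0 hands out even
    numbers, node 1 odd numbers."""
    def __init__(self, node_id, num_nodes=2):
        self._counter = itertools.count(start=node_id, step=num_nodes)

    def next(self):
        return next(self._counter)

class BlockGenerator:
    """Scheme 3: a node pre-allocates a block (e.g. A: 1-1000, B: 1001-2000)."""
    def __init__(self, start, end):
        self._next, self._end = start, end

    def next(self):
        if self._next > self._end:
            raise RuntimeError("block exhausted; a real node would claim a new block")
        n = self._next
        self._next += 1
        return n

node_a, node_b = OddEvenGenerator(node_id=0), OddEvenGenerator(node_id=1)
print(node_a.next(), node_b.next(), node_a.next())  # 0 1 2 - unique, but says nothing about causal order
print(BlockGenerator(1001, 2000).next())            # 1001
```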

3.2.2 Lamport timestamps

The Lamport timestamp was proposed by Leslie Lamport in 1978, in what is now one of the most-cited papers in the field of distributed systems.

A Lamport timestamp bears no relationship to a physical time-of-day clock, but it provides a full order: given two timestamps, the one with the greater counter value is the greater timestamp; if the counter values are equal, the one with the greater node ID is the greater timestamp.

Note: The key idea that makes Lamport timestamps consistent with causality is this: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response whose maximum counter value is greater than its own counter, it immediately increases its counter to that maximum.
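Here is a minimal sketch of the algorithm described in the note above, assuming timestamps are compared as (counter, node ID) pairs:

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        # Tick the counter for an operation originating on this node.
        self.counter += 1
        return (self.counter, self.node_id)

    def on_receive(self, remote_counter):
        # Fast-forward to the largest counter seen so far, then tick.
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

# Timestamps are compared as (counter, node_id) pairs, giving a full order.
a, b = LamportClock("A"), LamportClock("B")
t1 = a.local_event()        # (1, 'A')
t2 = b.on_receive(t1[0])    # (2, 'B') - causally after t1, and also larger
print(t1 < t2)              # True
```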

Lamport timestamps are sometimes confused with the version vectors we saw when detecting concurrent writes, but they serve different purposes: version vectors can distinguish whether two operations are concurrent or whether one causally depends on the other, whereas Lamport timestamps always enforce a full order.

3.2.3 Timestamp ordering alone is not sufficient

Although Lamport timestamps define a full order that is consistent with causality, this alone is not sufficient to solve many common problems in distributed systems.

Consider a system that needs to ensure that user names uniquely identify user accounts. If two users simultaneously try to create a user with the same username, one should succeed and the other should fail.

With multiple nodes, each node would have to check with every other node to make sure no other node is concurrently creating an account with the same username and a smaller timestamp. If one of those nodes has failed or is unreachable due to a network problem, the whole system grinds to a halt.

Note: In summary, to implement something like a uniqueness constraint on usernames, it is not enough to have a full order over operations; you also need to know when that order is finalized.

3.3 Full-order broadcast

If your program runs on a single CPU core, it is easy to define a full order of operations: it is simply the order in which the CPU executed them. In a distributed system, however, getting all nodes to agree on the same global order of operations can be tricky.

Scope of the ordering guarantee

Partitioned databases with one leader per partition typically maintain ordering only within each partition, which means they provide no consistency guarantees across partitions. A full ordering across all partitions is possible, but requires additional coordination.

Full-order broadcast is usually described as a protocol for exchanging messages between nodes. It must satisfy two safety properties:

  • Reliable delivery: no messages are lost; if a message is delivered to one node, it is delivered to all nodes
  • Fully ordered delivery: messages are delivered to every node in the same order

3.3.1 Using full-order broadcast

Full-order broadcast is exactly what database replication needs: if every message represents a write to the database, and every replica processes the same writes in the same order, the replicas will remain consistent with each other. This principle is known as state machine replication (sketched below).
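A minimal sketch of the state machine replication idea: if every replica applies the same deterministic writes in the same delivered order, their states converge. The message format here is invented for illustration.

```python
def apply_write(state, msg):
    """Deterministically apply one delivered message to a replica's state."""
    if msg["op"] == "set":
        state[msg["key"]] = msg["value"]
    elif msg["op"] == "delete":
        state.pop(msg["key"], None)

# The same totally ordered log is delivered to every replica...
log = [
    {"op": "set", "key": "x", "value": 1},
    {"op": "set", "key": "y", "value": 2},
    {"op": "delete", "key": "x"},
]

replica_1, replica_2 = {}, {}
for msg in log:
    apply_write(replica_1, msg)
    apply_write(replica_2, msg)

print(replica_1 == replica_2)  # True: same order + deterministic ops => identical state
```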

An important aspect of full-order broadcast is that the order is fixed at the time the messages are delivered: a node is not allowed to retroactively insert a message at an earlier position in the order once later messages have already been delivered. This makes full-order broadcast stronger than timestamp ordering.

3.3.2 Implementing linearly consistent storage using full-order broadcast

Full-order broadcast is asynchronous: messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee of when they will be delivered. Linear consistency, in contrast, is a recency guarantee: a read must see the most recently written value.

If you have full order broadcasting, you can build linearly consistent storage on top of it. For example, you can ensure that a user name uniquely identifies a user account.

Such a linearly consistent compare-and-set operation can be implemented by using full-order broadcast as an append-only log (a sketch follows the list):

  • Append a message to the log, tentatively claiming the username you want
  • Read the log, waiting for the message you appended to be delivered back to you
  • Check for messages claiming the desired username: if the first such message is your own, the claim succeeds; if it is someone else's, abort
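A sketch of the three steps above. The broadcast object with append() and deliver() methods is a hypothetical full-order broadcast API, not a real library.

```python
import uuid

def claim_username(broadcast, username, user_id):
    """Claim `username` using full-order broadcast as an append-only log.
    `broadcast` is a hypothetical API: append(msg) broadcasts a message,
    deliver() yields messages to this node in the globally agreed order."""
    my_msg_id = str(uuid.uuid4())
    # 1. Append a message tentatively claiming the username.
    broadcast.append({"id": my_msg_id, "username": username, "user": user_id})

    # 2./3. Read the log; the first delivered claim for this username wins.
    for msg in broadcast.deliver():
        if msg["username"] == username:
            if msg["id"] == my_msg_id:
                return True   # our claim was delivered first: success
            return False      # someone else's claim came first: abort
```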

3.3.3 Implementing full-order broadcast using linearly consistent storage

The previous section described how to build a linearly consistent CAS operation from full-order broadcasting. We can reverse this and build full-order broadcasting based on linearly consistent storage.

The simplest approach assumes a linearly consistent register that stores an integer and supports an atomic increment-and-get operation. The algorithm is then simple: for every message you want to send through full-order broadcast, you first perform an increment-and-get on the register, attach the value returned as a sequence number to the message, and send the message to all nodes; recipients deliver messages strictly in sequence-number order, filling any gaps before delivering later messages.
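A sketch of this construction, assuming a register object with an atomic increment_and_get() method and a send_to_all transport callback; both are hypothetical placeholders.

```python
import heapq

class TotalOrderBroadcast:
    """Stamp each message with a sequence number from a linearly consistent
    increment-and-get register, then deliver strictly in sequence order."""
    def __init__(self, register, send_to_all):
        self._register = register        # assumed to offer an atomic increment_and_get()
        self._send_to_all = send_to_all  # transport: sends (seq, msg) to every node
        self._pending = []               # min-heap of (seq, msg) awaiting delivery
        self._next_seq = 1

    def broadcast(self, msg):
        seq = self._register.increment_and_get()   # the only coordination point
        self._send_to_all((seq, msg))

    def on_receive(self, seq, msg):
        # Buffer out-of-order messages and release them in sequence-number order.
        heapq.heappush(self._pending, (seq, msg))
        delivered = []
        while self._pending and self._pending[0][0] == self._next_seq:
            delivered.append(heapq.heappop(self._pending)[1])
            self._next_seq += 1
        return delivered
```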

It can be shown that a linearly consistent compare-and-set register, full-order broadcast, and consensus are all equivalent problems: a solution to any one of them can be transformed into a solution for the others.

4. Distributed transactions and consensus

Consensus is one of the most important and fundamental problems in distributed computing. The goal is to get several nodes to agree.

Node agreement is important in many scenarios, such as:

Leadership election: In a single-master replication database, all nodes need to agree on which node is the leader. If some nodes are unable to communicate with other nodes due to a network failure, there may be disputes over the ownership of the leadership. In this case, consensus is important to avoid faulty failover.

Atomic commit: In a database that supports transactions across multiple nodes or partitions, a transaction may fail on some nodes but succeed on others. To maintain the atomicity of a transaction, all nodes must agree on the outcome of the transaction: either all abort/rollback, or all commit.

Note: The formalization of atomic commit differs slightly from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort, whereas consensus may decide on any one of the values proposed by participants. However, atomic commit and consensus can be reduced to each other.

4.1 Atomic commit and two-phase commit (2PC)

4.1.1 From single-node to distributed atomic commit

For transactions executed on a single database node, atomicity is typically implemented by the storage engine. When a client asks the database node to commit a transaction, the database makes the transaction's writes durable and then appends a commit record to the log on disk.

But what if a transaction involves multiple nodes?

In these cases, it is not sufficient to simply send commit requests to all nodes and commit transactions for each node independently. This makes it easy to violate atomicity: commits succeed on some nodes and fail on others.

4.1.2 Overview of two-phase commit

Two-phase commit (2PC) is an algorithm for achieving atomic transaction commit across multiple nodes, i.e. ensuring that either all nodes commit or all nodes abort. It is a classic algorithm in distributed databases. 2PC is used internally by some databases and is also made available to applications in the form of XA transactions (supported, for example, by the Java Transaction API).

Don’t confuse 2PC with 2PL

Two-phase commit and two-phase locking are two very different things: two-phase commit provides atomic commit in a distributed database, while two-phase locking provides a serializable isolation level.

When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each node, asking whether it is able to commit. The coordinator then tracks the participants' responses (see the sketch after this list):

  • If all participants reply “yes,” indicating they are ready to commit, the coordinator sends a commit request in phase 2 and the commit actually takes place
  • If any participant replies “no,” the coordinator sends an abort request to all nodes in phase 2
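A sketch of the coordinator's two phases. Each participant is assumed to expose hypothetical prepare/commit/abort calls; a real coordinator would also keep retrying phase 2 forever and recover its decisions from the log after a crash.

```python
def two_phase_commit(coordinator_log, participants, tx_id):
    """Phase 1: ask every participant to prepare. Phase 2: commit only if all said yes.
    Participants are assumed to expose hypothetical prepare/commit/abort methods."""
    # Phase 1: prepare
    votes = []
    for p in participants:
        try:
            votes.append(bool(p.prepare(tx_id)))  # participant promises it *can* commit
        except Exception:
            votes.append(False)                   # unreachable or failed counts as "no"

    decision = "commit" if all(votes) else "abort"
    coordinator_log.append((tx_id, decision))     # durable decision = the commit point

    # Phase 2: announce the decision; a real coordinator retries this forever.
    for p in participants:
        if decision == "commit":
            p.commit(tx_id)
        else:
            p.abort(tx_id)
    return decision
```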

4.1.3 A system of promises

The description above does not explain why two-phase commit guarantees atomicity while one-phase commit across several nodes does not.

To understand why, let's walk through the process in more detail:

  • When the application wants to start a distributed transaction, it asks the coordinator for a transaction ID. The transaction ID is globally unique
  • The application starts a single-node transaction on each participant, attaching the global transaction ID to it. All reads and writes are performed within one of these single-node transactions. If anything goes wrong at this stage, the coordinator or any participant can abort
  • When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants
  • When a participant receives a prepare request, it must make sure the transaction can definitely be committed under all circumstances. This includes writing all of the transaction's data to disk and checking for any conflicts or constraint violations. By replying “yes” to the coordinator, the participant promises that the transaction can be committed without error whenever the coordinator asks for it
  • When the coordinator has received responses to all prepare requests, it makes a definitive decision to commit or abort the transaction. The coordinator must write this decision to its transaction log on disk, so that if it crashes later it can recover and know what it decided. This is called the commit point
  • Once the decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must keep retrying forever until it succeeds. If a participant crashes in the meantime, the transaction will be committed after it recovers: since the participant voted “yes,” it cannot refuse to commit after recovery

4.1.4 Coordinator failure

If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. However, once a participant has received the prepare request and replied “yes,” it can no longer unilaterally abort; it must wait for the coordinator to tell it whether the transaction was committed or aborted.

Note: If the coordinator crashes or the network fails at this point, there is nothing the participant can do but wait. A participant transaction in this state is said to be in doubt or uncertain.

The only way 2PC can complete in that case is to wait for the coordinator to recover. This is why the coordinator must write its commit or abort decision to the transaction log on disk before sending commit or abort requests to participants: after recovering, the coordinator reads its transaction log to determine the status of all in-doubt transactions.

4.1.5 Three-phase commit

Two-phase commit is called a blocking atomic commit protocol because 2PC can get stuck waiting for the coordinator to recover. As an alternative, an algorithm called three-phase commit (3PC) has been proposed. However, 3PC assumes bounded network delay and bounded node response times; in most practical systems, with unbounded network delay and process pauses, it cannot guarantee atomicity.

4.2 Distributed transactions in practice

Distributed transactions, especially those implemented with two-phase commit, have a mixed reputation. On the one hand, they are seen as providing an important safety guarantee that would otherwise be hard to achieve; on the other hand, they tend to cause operational problems, hurt performance, and are criticized for promising more than they can deliver.

There are two types of distributed transactions that are easily confused:

Distributed transactions within a database

Some distributed databases support internal transactions between database nodes. VoltDB and MySQL Cluster’s NDB storage engines, for example, have such internal transaction support. In this case, all participating nodes run the same database software.

Heterogeneous distributed transaction

In heterogeneous transactions, the participants are composed of two or more different technologies: for example, two databases from different vendors, or even non-database systems (such as message brokers). Distributed transactions across systems must ensure atomic commits, although the systems may be completely different.

Exactly-once message processing

Heterogeneous distributed transactions allow different systems to be integrated in powerful ways. For example, a message from a message queue can be acknowledged as processed if and only if the database transaction that processed it committed successfully. This is achieved by atomically committing the message acknowledgement and the database writes in the same transaction. With distributed transaction support, this is possible even if the message broker and the database are two unrelated technologies running on different machines.

XA transactions

XA is a standard for implementing two-phase commit across heterogeneous technologies. It was introduced in 1991 and is widely implemented: many traditional relational databases (including PostgreSQL, MySQL, DB2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ) support XA.

XA is not a network protocol – it is just a C API for connecting with transaction coordinators. The transaction coordinator needs to implement the XA API. The standard does not specify how this should be done, but in practice the coordinator is usually just a library that is loaded into the same process as the application that initiated the transaction. It tracks all participants in a transaction, collects participants’ responses after they are asked to prepare, and records the decision (commit/abort) of each transaction using a log on the local disk.

Holding locks while in doubt

The problem is locks. As discussed in “Read Committed,” database transactions usually take row-level exclusive locks on rows they are going to modify, to prevent dirty writes. In addition, if serializable isolation is wanted, a database using two-phase locking must also take shared locks on any rows read by the transaction.

The database cannot release these locks until the transaction commits or aborts. Consequently, with two-phase commit, a transaction must hold its locks for the entire time it is in doubt. If the coordinator has crashed and takes 20 minutes to restart, the locks are held for 20 minutes. If the coordinator's log is lost entirely for some reason, the locks are held forever, or at least until an administrator manually resolves the situation.

Recovery from coordinator failure

In theory, if the coordinator crashes and restarts, it should cleanly recover its state from the log and resolve any in-doubt transactions. In practice, however, orphaned in-doubt transactions do occur: transactions whose outcome the coordinator cannot determine for whatever reason. The only way out is for an administrator to manually decide whether to commit or roll back each one. The administrator must examine the participants of each in-doubt transaction, determine whether any of them has already committed or aborted, and then apply the same outcome to the other participants. Resolving such problems is labor-intensive and can cause a serious production outage.

Limitations of distributed transactions

XA transactions address the practical and important problem of keeping multiple participants consistent with each other, but as we have seen, they also introduce serious operational problems. In particular, the core realization here is that the transaction coordinator is itself a kind of database (storing the results of transactions) and therefore needs to be handled with the same care as any other important database.

4.3 Fault-tolerant consensus

The consensus problem is usually formalized as follows: one or more nodes may propose values, and the consensus algorithm decides on one of those values. For example, in the seat-booking scenario, when several customers try to buy the last seat at the same time, each node handling a customer request can propose the ID of the customer it is serving, and the decision determines which customer gets the seat.

A consensus algorithm must satisfy the following properties:

  • Uniform agreement: no two nodes decide differently
  • Integrity: no node decides twice
  • Validity: if a node decides value v, then v was proposed by some node
  • Termination: every node that does not crash eventually decides some value

Each property is explained as follows:

The agreement and integrity properties define the core idea of consensus: everyone decides on the same outcome, and once you have decided, you cannot change your mind.

The validity property exists mainly to rule out trivial solutions; for example, an algorithm that always decides null regardless of what was proposed would satisfy agreement and integrity but would be useless.

The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit idle forever; in other words, it must make progress.

4.3.1 Consensus algorithm and full order broadcast

The best-known fault-tolerant consensus algorithms are Viewstamped Replication (VSR), Paxos, Raft, and Zab. There are quite a few similarities between these algorithms, but they are not the same. This section does not go into all the details; knowing some of the high-level ideas they share is usually enough.

Most of these algorithms do not directly use the formal model described above. Instead, they decide on a sequence of values, which makes them full-order broadcast algorithms. Full-order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. If you think about it, this is equivalent to performing several rounds of consensus: in each round, nodes propose the message they want to send next, and then decide on the next message to be delivered in the full order.

Therefore, full sequence broadcasting is equivalent to repeating multiple rounds of consensus:

  • Because of the agreement property, all nodes decide to deliver the same messages in the same order
  • Because of the integrity attribute, the message is not repeated
  • Because of the validity property, messages cannot be corrupted and cannot be fabricated
  • Messages are not lost due to the termination property

4.3.2 Single leader replication and consensus

In Chapter 5 we discussed single-leader replication, which keeps replicas up to date by sending all writes to the leader and applying them on the followers in the same order. This is essentially full-order broadcast.

How to choose a leader?

If the leader is chosen and configured manually by the operations team, you effectively have a consensus algorithm of the dictatorial variety: only one node is allowed to accept writes, and if that node fails, the system cannot accept writes until the operators manually configure another node as the leader.

Note: The system described above can perform well in practice, but it does not satisfy the termination property of consensus, and it requires human intervention to make progress.

Some databases automate leader election and failover, promoting a follower to be the new leader if the old leader fails. This brings us one step closer to fault-tolerant full-order broadcast, and thus to consensus.

4.3.3 Epoch numbering and quorums

All of the consensus protocols discussed so far use a leader internally in some form, but they do not guarantee that the leader is unique. Instead, they make a weaker guarantee: the protocol defines an epoch number (called the ballot number in Paxos, the view number in Viewstamped Replication, and the term number in Raft) and guarantees that within each epoch the leader is unique.

Every time the current leader is thought to be dead, a vote is started among the nodes to elect a new leader. Each election is given a higher epoch number, so epoch numbers are totally ordered and monotonically increasing. If there is a conflict between leaders from two different epochs, the leader with the higher epoch number prevails. Before a leader is allowed to decide anything, it must first check that there is no other leader with a higher epoch number.

Note: A leader must collect votes from a quorum of nodes. For every decision the leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of them to respond in favor of the proposal. A quorum usually, but not always, consists of a majority of nodes, and a node votes in favor of a proposal only if it is not aware of any other leader with a higher epoch number.

On the surface this voting process looks a lot like two-phase commit. The biggest differences are that in 2PC the coordinator is not elected, and 2PC requires a “yes” vote from every participant, whereas fault-tolerant consensus algorithms require votes only from a majority of nodes.

4.3.4 Limitations of consensus

  • Consensus systems always require a strict majority to operate. If a network fault cuts some nodes off from the rest, only the side of the network containing the majority of nodes can continue to make progress; the rest are blocked.
  • Most consensus algorithms assume a fixed set of nodes participating in voting, which means you cannot simply add or remove nodes from a cluster.
  • Consensus systems typically rely on timeouts to detect failed nodes. In environments where network latency is highly variable, especially in geographically distributed systems, it often occurs that a node mistakenly believes that the leader has failed due to temporary network problems.

4.4 Membership and Coordination services

ZooKeeper and etcd are designed to hold small amounts of data that fit entirely in memory, so you would not store all of your application's data there. This small amount of data is replicated to all nodes using a fault-tolerant full-order broadcast algorithm. As discussed earlier, full-order broadcast is exactly what database replication needs: if each message represents a write to the database, applying the same writes in the same order keeps the replicas consistent.

Linearly consistent atomic operations

A lock can be implemented using an atomic compare-and-set (CAS) operation: if several nodes try to perform the same operation concurrently, only one of them succeeds. The consensus protocol guarantees that the operation is atomic and linearly consistent even if nodes fail or the network is interrupted at any point. Distributed locks are usually implemented as leases with an expiry time, so that they are eventually released even if the client fails (a sketch follows).
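A sketch of a lease-style lock built on an atomic compare-and-set, using an illustrative register with get() and compare_and_set() methods; real coordination services (ZooKeeper, etcd) handle lease expiry on the server side rather than with client-side clock checks.

```python
import time

def try_acquire_lease(register, owner_id, ttl_seconds=10):
    """Acquire the lock only if it is free or its previous lease has expired.
    `register` is an illustrative linearly consistent register holding either
    None or an (owner, expiry) pair, with get() and compare_and_set()."""
    now = time.time()
    current = register.get()
    if current is not None and current[1] > now:
        return False  # someone else still holds a live lease
    # The CAS ensures only one contender replaces the old (or empty) lease.
    return register.compare_and_set(current, (owner_id, now + ttl_seconds))
```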

Full ordering of operations

When a resource is protected by a lock or lease, a fencing token is needed to prevent clients from conflicting with each other after a process pause. The fencing token is a number that increases monotonically every time the lock is acquired. ZooKeeper provides this by totally ordering all operations and giving each operation a monotonically increasing transaction ID (zxid) and version number.
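Here is a sketch of how a storage service might check fencing tokens: it remembers the highest token seen so far and rejects writes carrying an older one (for example, from a paused client whose lease has already expired). The class and method names are invented for illustration.

```python
class FencedStorage:
    """Rejects requests whose fencing token is older than one already seen."""
    def __init__(self):
        self.max_token_seen = -1
        self.data = {}

    def write(self, key, value, fencing_token):
        if fencing_token < self.max_token_seen:
            raise PermissionError(f"stale fencing token {fencing_token}; "
                                  f"already saw {self.max_token_seen}")
        self.max_token_seen = fencing_token
        self.data[key] = value

storage = FencedStorage()
storage.write("file", "v1", fencing_token=33)  # client 1 holds the lease (token 33)
storage.write("file", "v2", fencing_token=34)  # client 2 acquired the lock later
# Client 1 wakes up from a long pause and retries with its old token:
# storage.write("file", "v1-late", fencing_token=33)  -> PermissionError
```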

Failure detection

Clients maintain a long-lived session on the ZooKeeper servers, and the client and server periodically exchange heartbeats to check that the other is still alive. Even if the connection is temporarily interrupted or a ZooKeeper node fails, the session remains active. However, if heartbeats stop for longer than the session timeout, ZooKeeper declares the session dead.

Change notification

Clients can not only read locks and values created by other clients, but also watch them for changes. This lets a client find out when another client joins the cluster or fails. By subscribing to notifications, clients avoid having to poll frequently to detect changes.

4.4.1 Assigning work to nodes

An example of the Zookeeper/Chubby model working well

  • When you have several instances of a process or service and need to choose one of them as the leader or primary. If the leader fails, one of the other nodes should take over.

  • When you have some partitioned resources (databases, message flows, file storage, distributed Actor systems, etc.) and need to decide which partition is allocated to which node. And when new nodes join the cluster, certain partitions need to be moved from existing nodes to new nodes in order to rebalance the load. When a node is removed or fails, other nodes need to take over the work of the failed node.

Tasks like these can be accomplished by judicious use of atomic operations, ephemeral nodes, and notifications in ZooKeeper.

4.4.2 Service discovery

ZooKeeper, etcd, and Consul are also often used for service discovery, i.e. finding out which IP address to connect to in order to reach a particular service. In cloud data center environments, where virtual machines come and go frequently, the IP address of a service is usually not known in advance. Instead, services can be configured to register their network endpoint with a service registry at startup, where other services can then find them.
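A sketch of startup-time registration with a TTL entry that the service keeps refreshing; the registry object and its put_with_ttl/refresh calls are hypothetical placeholders for a ZooKeeper/etcd/Consul client.

```python
import socket
import threading
import time

def register_service(registry, service_name, port, ttl_seconds=15):
    """Register this instance under the service name and keep the entry alive."""
    endpoint = f"{socket.gethostname()}:{port}"
    key = f"/services/{service_name}/{endpoint}"
    lease = registry.put_with_ttl(key, endpoint, ttl=ttl_seconds)  # hypothetical call

    def heartbeat():
        # If this process dies, the refreshes stop and the entry expires by itself.
        while True:
            time.sleep(ttl_seconds / 3)
            registry.refresh(lease)                                # hypothetical call

    threading.Thread(target=heartbeat, daemon=True).start()
    return key
```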

Note: Tools like ZooKeeper provide “outsourced” consensus, failure detection, and membership services for applications. They play an important role, and although they are not easy to use, it is still better than trying to develop your own algorithm that can withstand all the problems discussed in Chapter 8.

4.4.3 Membership services

A membership service determines which nodes are currently active, live members of the cluster. As we saw in Chapter 8, because of unbounded network delays it is not possible to reliably detect whether another node has failed. However, if failure detection is combined with consensus, the nodes can agree on which nodes should be considered alive and which not.

A node may still be incorrectly declared dead by consensus even though it is actually alive. Nevertheless, it is very useful for a system to agree on which nodes make up the current membership: for example, electing a leader could simply mean choosing the lowest-numbered member among the current members.

5. Ramblings

I rushed to tidy up these notes before the Mid-Autumn Festival holiday; wishing everyone a happy holiday.

  • Find something that makes you glad you were born. Something that needs to be done. Something that you’re willing to do with all your might.

  • Once people are in a group, their IQ is seriously reduced. In order to gain recognition, individuals are willing to abandon right and wrong and exchange their IQ for a sense of belonging that makes people feel safe.

  • When we are young, we always feel that the time ahead is long, long enough for everything to come around again; we do not realize that the river of time only flows forward and never returns.

Please make sure you keep going.