Definition of distributed system

A distributed system is one whose hardware or software components are spread across networked computers and communicate and coordinate with one another only by passing messages.

Problems addressed (shortcomings of the monolithic architecture)

  • Limited processing capacity when user traffic is large.
  • The more complex the application becomes, the lower the development efficiency.
  • A serious bug in the production environment can bring down the entire service.
  • As the codebase grows, compilation and build become slower.
  • The whole system is tied to a single technology stack.

Terminology (distributed / cluster / network partition)

Distributed: multiple nodes work together, each doing a different part of the job. Cluster: multiple nodes work together, each doing the same job. Network partition (split brain): network failures split a distributed system into several small groups of nodes; the network between the groups is broken while the network inside each group is still normal.

Evolution of architecture

Monolithic architecture -> separate application server and database server -> master-slave database replication -> application server cluster plus database read/write splitting -> add a search engine to relieve database query pressure -> add caches to relieve read pressure -> split the database (horizontal/vertical sharding) -> split the application server vertically (by business) -> split the application server horizontally.

Consistency classification

1. Strong consistency: whatever is written can be read back immediately; this gives the best guarantee but hurts performance. 2. Weak consistency: no promise about how long it takes for the data to become consistent, but the system tries to reach a consistent state within some time window (seconds/minutes/hours). 3. Read-your-writes consistency: a user is guaranteed to see their own updates immediately, with no such guarantee for other users. Two common ways to provide it:

  • Always read certain content from the master library, which puts heavy pressure on the master.
  • Read newly updated content from the master library, and only after some time has passed read it from the slave library.

4. Eventual consistency: only guarantees that the data of all replicas in the system eventually converges to a consistent, correct state.

The CAP theorem

1. Consistency: all replicas hold the same data, and a read from any node returns the latest write. 2. Availability: the externally provided service stays normal, without response timeouts or response errors. 3. Partition tolerance: the system keeps providing service even when network partitions occur. A distributed system can satisfy at most two of the three. For example, suppose a user asks N1 to change a value from V0 to V1 while the network between N1 and N2 is broken, and another user then asks N2 for that value. There are three ways to handle it:

  • Return V0 (sacrifice consistency, AP mode)
  • Wait for the network to recover and then return V1 (sacrifice availability, CP mode)
  • Merge N1 and N2 into one node (give up distribution, CA mode)

BASE theory

BASE is a trade-off derived from the CAP theorem: if strong consistency cannot be achieved, each business can reach eventual consistency in a way appropriate to its own characteristics. 1. Basically Available: when the distributed system fails, part of the availability may be lost (for example, a response that normally takes 0.5 seconds may be allowed to take 1-2 seconds). 2. Soft State: intermediate states are allowed to exist (some data not yet synchronized) without affecting the overall availability of the system. 3. Eventually Consistent: after a period of synchronization, the data of all replicas reaches a consistent state.

Consistency protocols (for handling distributed transactions)

2PC (two-phase commit)

process

1. Prepare phase: the coordinator sends a prepare message to every participant; each participant executes the transaction locally but does not commit. 2. Commit phase: if any participant fails or times out, the coordinator sends a rollback message to all participants; otherwise it sends a commit message.
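
A minimal sketch of the coordinator's decision logic, assuming hypothetical Participant and TwoPhaseCommitCoordinator types (illustrative, not a specific framework):

import java.util.List;

// Hypothetical participant interface used to illustrate the two phases.
interface Participant {
    boolean prepare();   // run the local transaction, hold locks, but do not commit
    void commit();
    void rollback();
}

class TwoPhaseCommitCoordinator {
    // Phase 1: ask everyone to prepare; Phase 2: commit only if all agreed, otherwise roll back.
    boolean execute(List<Participant> participants) {
        boolean allPrepared = true;
        for (Participant p : participants) {
            try {
                if (!p.prepare()) { allPrepared = false; break; }
            } catch (Exception failureOrTimeout) {
                allPrepared = false; break;
            }
        }
        for (Participant p : participants) {
            if (allPrepared) p.commit(); else p.rollback();
        }
        return allPrepared;
    }
}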

disadvantages

1. Synchronous blocking: participants hold their transactions (and locks) from phase 1 until phase 2 completes. 2. Single point of failure: if the coordinator crashes during phase 2, participants' transactions stay locked. 3. Data inconsistency: if the coordinator crashes after sending the commit message to only some participants, the data becomes inconsistent. 4. Too conservative: the failure of any single node fails the whole transaction.

3PC (three-phase commit)

process

1. CanCommit: the coordinator sends each participant a request containing the transaction and asks whether it can be executed. 2. PreCommit: the coordinator asks the participants to execute the transaction (without committing). 3. DoCommit: the coordinator asks the participants to commit the transaction; in this phase, if a participant does not receive a message from the coordinator within the timeout, it commits by default. 3PC reduces the blocking range of 2PC but still does not completely solve the data-inconsistency problem.

Consensus algorithms (agreeing on a final value or electing a Leader)

Paxos algorithm

role

1. Client: sends requests to the distributed system. 2. Proposer: proposes proposals. 3. Acceptor: the decision makers, who accept (approve) proposals. 4. Learner: learns the final chosen value.

specification

1. An Acceptor must accept the first proposal it receives. 2. Once a proposal with some value has been chosen, every later proposal that is accepted must carry that same value. 3. A proposal is chosen only after it has been accepted by more than half of the Acceptors.

Process 1

1. The Proposer sends a prepare request numbered N (carrying no value) to more than half of the Acceptors. 2. If an Acceptor has not yet accepted any proposal, it returns null. 3. The Proposer then decides a value V on its own and sends an accept request with number N and value V to the Acceptors. 4. The Acceptors accept the proposal numbered N with value V.

Process 2

1. Another Proposer sends a prepare request numbered N+1 (carrying no value) to more than half of the Acceptors. 2. If an Acceptor has already accepted the proposal numbered N, it returns the value of proposal N. 3. The Proposer sends an accept request numbered N+1 carrying that value. 4. The Acceptors accept the proposal numbered N+1 with that value. Extreme case: two Proposers keep submitting proposals with ever-increasing numbers, producing a livelock. Solution: designate a single Proposer as the only one allowed to submit proposals.

Raft algorithm

role

1. Leader: interacts with clients; there is only one at a time. 2. Candidate: nominates itself during an election and becomes the Leader if it wins. 3. Follower: a voter that waits for the election result to be announced.

process

1. When the system starts, all nodes are Followers. 2. A node remains a Follower as long as it keeps receiving RequestVote/AppendEntries requests. 3. If a node receives no request within a randomized timeout (150-300 ms), it becomes a Candidate and runs for Leader; if it obtains more than half of the votes, it becomes the Leader. 4. If no Leader is elected, a new round of election begins.

Distributed system design strategies

Heartbeat detection

Heartbeat messages usually carry node status and metadata to make management easier; a small code sketch follows the list below.

  • Periodic heartbeat detection: if a node's response times out, the node is judged dead.
  • Cumulative failure detection: perform a limited number of retries against a node suspected to be dying before declaring it dead.
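
A minimal sketch of periodic heartbeat checking with a consecutive-miss counter; the ping probe and class names are illustrative placeholders, not from the original text:

import java.util.Map;
import java.util.concurrent.*;

class HeartbeatMonitor {
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // Probe every node periodically; after several consecutive misses, consider it dead.
    void start(Iterable<String> nodes, int maxMisses) {
        timer.scheduleAtFixedRate(() -> {
            for (String node : nodes) {
                if (ping(node)) {
                    failures.put(node, 0);
                } else {
                    int misses = failures.merge(node, 1, Integer::sum);
                    if (misses >= maxMisses) {
                        System.out.println(node + " is considered dead");
                    }
                }
            }
        }, 0, 3, TimeUnit.SECONDS);
    }

    // Placeholder for a real network probe (e.g., an RPC call or TCP connect with a timeout).
    private boolean ping(String node) { return true; }
}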

High availability

  • Active/standby mode (common): when the active host fails, the standby host takes over all of its work; after the active host recovers, service is switched back to it, either automatically (hot standby) or manually (cold standby).
  • Mutual backup mode (active-active): two hosts run simultaneously and monitor each other.
  • Cluster mode: multiple nodes run at the same time and share the requests, usually distributed through a primary node, so the high availability of the primary node itself must also be addressed.

Load balancing

Solutions

  • Hardware: F5
  • Software: LVS, HAProxy, Nginx.

Strategies: random, round-robin, weighted, least connections, hash.
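
A minimal sketch of the round-robin strategy (weighted and least-connections variants follow the same shape); the class name and server addresses are illustrative:

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinBalancer {
    private final List<String> servers;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinBalancer(List<String> servers) { this.servers = servers; }

    // Pick servers in turn; the atomic counter keeps the selection safe under concurrency.
    String next() {
        int idx = Math.abs(counter.getAndIncrement() % servers.size());
        return servers.get(idx);
    }
}

// Usage: new RoundRobinBalancer(List.of("10.0.0.1", "10.0.0.2", "10.0.0.3")).next();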

Network communication

RPC

RPC stands for Remote Procedure Call.

role

1. Client: the service caller. 2. Client Stub: packages the client's request into a network message and sends it to the Server Stub over the network. 3. Server Stub: receives the message from the Client Stub, unpacks it, and invokes the local method. 4. Server: the service provider.

RMI

RMI stands for Remote Method Invocation; it is Java's built-in support for communication between different Java virtual machines.

role

Client side: 1. Stub: the client-side proxy. 2. Remote Reference Layer: parses and executes the remote reference protocol. 3. Transport layer: sends the call to the remote JVM and receives the result of the invocation. Server side: 1. Transport layer: receives the client request and forwards it to the Remote Reference Layer. 2. Remote Reference Layer: forwards the call to the Skeleton. 3. Skeleton: invokes the actual method. Registry: registers remote objects under a URL and returns references to them to clients.

Distributed ID generation scheme

A traditional monolithic architecture is basically a single database with single tables, where the ID of each business table simply auto-increments from 1. In a distributed architecture with sharded databases and tables, however, the same business data is stored across multiple databases or tables; relying on each database's auto-increment ID would then produce duplicate IDs, and the uniqueness of the primary key could no longer be guaranteed.

Several aspects need to be considered when designing distributed IDs:

  • Globally unique: no duplicate IDs; this is the most basic requirement.
  • Increasing trend: if IDs are ordered, fewer extra indexes and query conditions are needed; moreover, MySQL's InnoDB storage engine uses a clustered index on the primary key, so write performance is higher when primary keys are ordered.
  • High availability: an ID uniquely identifies a piece of data, and if ID generation fails the business cannot proceed, so a good ID generation scheme must be highly available.
  • Information security: although IDs should trend upward, the pattern should not follow an obvious rule that could be exploited to crawl information.

UUID

UUID is short for Universally Unique Identifier. A UUID is a 128-bit value written as 32 hexadecimal digits, so the theoretical total number of UUIDs is 16^32 = 2^128. A generated UUID is laid out in the 8-4-4-4-12 format: 32 characters plus 4 hyphens, and the hyphens are usually removed. There are currently five UUID versions, each with a different algorithm and scope of application. Advantages: easy to generate locally, with no network cost. Disadvantages: not compact to store, not information-secure, and unfriendly to MySQL indexes because the values are unordered.
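
A minimal sketch using the JDK's built-in generator (version 4, random-based):

import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // Version-4 (random) UUID; strip the hyphens as described above.
        String id = UUID.randomUUID().toString().replace("-", "");
        System.out.println(id);
    }
}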

Database generation

Since the sharded databases all start auto-incrementing from the same initial value, conflicts will occur. So we give the auto-increment ID of the same business table a different starting value in each database, and set a fixed step equal to the number of databases (or tables). If there are three machines, the order table in DB1 starts at 1, the order table in DB2 starts at 2, and the order table in DB3 starts at 3, each with a step of 3, so their ID ranges are as follows:
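
DB1: 1, 4, 7, 10, ...
DB2: 2, 5, 8, 11, ...
DB3: 3, 6, 9, 12, ...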

Advantages: relies only on the database itself with no extra components, and IDs increase monotonically.

Disadvantages: strong dependence on the DB; when the DB is down, the whole system is unavailable. Configuring master and slave improves availability, but data consistency is hard to guarantee in special cases, and inconsistency during a master/slave switchover may produce duplicate numbers.

Implemented using Redis

Redis can generate distributed unique IDs with atomic commands such as INCR and INCRBY. Because Redis executes commands on a single thread, the generated IDs are guaranteed to be unique and ordered. A single instance is a performance bottleneck, so cluster mode is used; like the database cluster approach, each node then needs its own starting value and step. Redis-based IDs perform well and are ordered, but the scheme depends on Redis, and introducing that component increases the configuration complexity of the system.
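
A minimal sketch assuming the Jedis client (the key name is illustrative):

import redis.clients.jedis.Jedis;

public class RedisIdGenerator {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            // INCR is atomic, so concurrent callers never receive the same value.
            long orderId = jedis.incr("id:order");
            System.out.println("generated id = " + orderId);
        }
    }
}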

Snowflake algorithm (Snowflake)

The Snowflake algorithm is a distributed ID generation algorithm developed by Twitter. It divides a 64-bit number into several parts, each with a different meaning. In Java, 64-bit integers are of type long, so IDs generated by the Snowflake algorithm in Java are stored as long.

  • The first bit occupies 1 bit and is always 0; it can be regarded as the sign bit.
  • The next 41 bits are the timestamp in milliseconds; 41 bits can represent 2^41 values, so the usable lifetime of the Snowflake algorithm is (1L << 41) / (1000L * 60 * 60 * 24 * 365) ≈ 69 years.
  • The middle 10 bits are the machine ID, allowing 2^10 = 1024 machines.
  • The last 12 bits are an increasing sequence number, which can represent 2^12 = 4096 numbers.

This means one machine in one data center can produce 4096 ordered, non-repeating IDs within a single millisecond.
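
A minimal sketch of the bit layout described above; the custom epoch and the error handling are simplified assumptions, not Twitter's original implementation:

public class Snowflake {
    private static final long EPOCH = 1609459200000L;    // assumed custom start time: 2021-01-01 UTC
    private static final long MACHINE_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;   // 4095

    private final long machineId;     // 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public Snowflake(long machineId) { this.machineId = machineId; }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            throw new IllegalStateException("clock moved backwards");   // the clock-rollback problem
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                  // 4096 ids used up in this millisecond
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { /* spin to next ms */ }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        // 41-bit timestamp | 10-bit machine id | 12-bit sequence
        return ((now - EPOCH) << (MACHINE_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}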

IDs generated by the Snowflake algorithm increase in trend and do not rely on a third-party system such as a database; deployed as a service, it offers good stability and high generation performance. However, it relies heavily on the machine clock: if the clock is rolled back, IDs may repeat or the service may become unavailable.

conclusion

Distributed ID generation schemes can be roughly divided into two categories:

  • The first type is DB-based: IDs increase in trend according to a starting value and a step, and the fault tolerance and availability of the service need to be considered.
  • The other is the Snowflake type, which splits the 64 bits into segments with different meanings, basically a timestamp, a machine ID, and a sequence number. This kind of scheme has to deal with clock rollback, and some buffering design may be needed to improve performance. The bits allocated to the three segments can be varied to trade lifetime against concurrency.

Distributed locks

A distributed lock protects shared resources in a distributed environment and serializes the handling of requests, effectively acting as a mutex across machines. Distributed locks can also solve idempotency problems in business flows: for example, if a user is placing and paying for an order while the merchant changes the price at the same time, concurrency problems may occur; serializing the operations prevents such business errors.

Problems a distributed lock must solve

  • Mutual exclusion: Only one process (one server) can access resources at a time.
  • Security: A lock can only be deleted or released by the client holding the lock.
  • Fault tolerance: when the server holding the lock goes down, the lock can still be released, or the lock can still be granted on other servers.
  • Avoid deadlocks.

Design goals for distributed locks

  • Strong consistency
  • High availability of services, robust system
  • Automatic lock renewal and release
  • The code is highly abstract and business access is simple
  • Monitorable management

Implement distributed lock based on database

  • Scenario 1: Optimistic locking based on database tables for distributed locking (Version)
  • Option 2: Pessimistic locking based on database tables (InnoDB for Update)
  • Solution 3: Make unique constraints based on database table data records (record method names in the table)

Make unique constraints based on database table data records (record method names in the table)

The easiest way to implement a distributed lock is to create a table and manipulate the rows in it: to lock a method or resource, insert a record; to release the lock, delete the record.

When we want to lock a method, execute the following SQL

insert into methodLock(method_name, `desc`) values('method_name', 'desc');

Because there is a unique constraint on method_name, if multiple requests hit the database at the same time, the database guarantees that only one insert succeeds, so the thread whose insert succeeded is considered to have acquired the lock. When the method finishes, we release the lock by deleting the record:

delete from methodLock where method_name='method_name';

This implementation has the following problems: 1. The lock strongly depends on the database's availability; once the database goes down, the business system becomes unavailable. 2. The lock has no expiration time; if unlocking fails, the lock record stays in the database forever and no other thread can acquire the lock. 3. The lock is non-blocking: a failed insert simply reports an error, the waiting thread is not queued, and it must trigger the lock operation again to retry. 4. The lock is non-reentrant: the same thread cannot acquire the lock again before releasing it, because the lock record already exists.

Pessimistic locking based on database tables (InnoDB for Update)

Append for update to the query, and the database will place an exclusive lock on the matching rows during the query. InnoDB uses row-level locks only when an index is hit; otherwise it falls back to a table-level lock, so method_name should have an index, preferably a unique one (note that overloaded methods share a name and would then block one another, so it is suggested to include the parameter types in the recorded method name). Once a record holds the exclusive lock, no other thread can place an exclusive lock on it, so the thread that obtained the exclusive lock can be considered the holder of the distributed lock; it executes the business logic of the method and, when done, releases the lock with connection.commit(). This approach solves the problems of being unable to release the lock and of non-blocking behavior described above, but it still does not solve the database-availability and non-reentrancy problems. Another potential issue: even though method_name has a unique index and for update is used explicitly, MySQL chooses the execution plan by comparing the costs of different plans; if it decides a full table scan is cheaper than using the index, it will not use the index, and in that case MySQL takes a table lock instead of a row lock.
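
A minimal sketch of the locking query, reusing the methodLock table from the earlier example:

select * from methodLock where method_name = 'method_name' for update;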

Optimistic locking based on database tables (version field)

The idea of optimistic locking: most implementations are based on a data version field. Add a version column to the data; when reading the data, read the version as well, and when updating, increment the version by one and require that the version in the database still matches the one that was read. Disadvantages: it increases the number of database operations, turning one update into select + update; and when a business flow must keep several resources consistent, every resource would need its own table with a version column if this scheme were used for all of them, which real scenarios often cannot satisfy.
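
A minimal sketch of the version-based pattern, with illustrative table and column names; the update takes effect (affects one row) only if no one changed the version in between:

select id, value, version from resource where id = 1;
-- suppose the select returned version = 3
update resource set value = 'new_value', version = version + 1 where id = 1 and version = 3;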

Distributed lock based on Redis

Use the Redis SETNX command (which is atomic) to implement the distributed lock. SETNX sets the key to the given value if and only if the key does not exist; if the key already exists, SETNX does nothing.

  • If SETNX returns 1, the lock is acquired; the value stored under the key lock.id is the lock's expiry time, i.e. the current time plus the lock's validity period.
  • If it returns 0, another process already holds the lock and the current process cannot enter the critical section; the process can keep retrying SETNX in a loop until it acquires the lock.

A lock implemented with SETNX alone can deadlock. Compared with a single-machine lock, in a distributed environment we must consider not only visibility between processes but also the network between the process and the lock: if a process acquires the lock and then loses its connection to Redis without releasing it in time, the other processes competing for the lock block, producing a deadlock. The solution is to set a timeout (an expire) on the acquired lock, so the lock is released automatically when it expires.
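
A minimal sketch assuming the Jedis client; a common pattern (not described in the original text) is to set the value and the expiry atomically with SET ... NX PX, and to release only if the stored value still matches, so one client cannot delete another client's lock:

import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisLockDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            String key = "lock:order";                    // illustrative key name
            String token = UUID.randomUUID().toString();  // identifies this lock holder

            // Acquire: set only if absent (NX) with a 30s expiry (PX), in one atomic command.
            boolean locked = "OK".equals(jedis.set(key, token, SetParams.setParams().nx().px(30_000)));

            if (locked) {
                try {
                    // ... critical section ...
                } finally {
                    // Release: delete only if we still own the lock (checked atomically in Lua).
                    String script = "if redis.call('get', KEYS[1]) == ARGV[1] "
                            + "then return redis.call('del', KEYS[1]) else return 0 end";
                    jedis.eval(script, Collections.singletonList(key), Collections.singletonList(token));
                }
            }
        }
    }
}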

Distributed lock based on Zookeeper

There are two schemes for implementing a distributed lock with Zookeeper: one uses temporary (ephemeral) nodes, the other uses temporary ordered (ephemeral sequential) nodes.

Temporary node

The principle of the temporary-node scheme: several processes race to create the same temporary node, and only one can succeed. Say process 1 creates it successfully and thus holds the distributed lock; the other processes register a listener on parent_node to watch for changes to its children and suspend themselves. When a child of parent_node changes, all processes that registered listeners are notified; each checks whether the deleted node is the lock node it cares about, and if so the suspended processes try to acquire the lock again.

The temporary node scheme is simple to implement, but has disadvantages:

  • Disadvantage 1: processes 2, 3 and 4 are notified whenever any other lock node under parent_node changes or is deleted, even though they do not care about the release of those other locks, which causes unnecessary network overhead.
  • Disadvantage 2: Locks created using the temporary node scheme are unfair.

Temporary ordered node

  • Each process attempts to create a temporary ordered node under parent_node.
  • Each process then obtains the list of all temporary nodes under parent_node and checks whether its own node is the smallest; if so, it holds the lock. If not, it suspends itself and registers a listener on the node immediately before its own.
  • When a process releases the lock, the corresponding node is deleted. When the previous node is deleted, the next node is notified, and the next node’s process can try to acquire the lock.

Each temporary ordered node only needs to watch the node immediately before it, not any other nodes or events, and the lock is fair (requests acquire the lock in arrival order). Another advantage of temporary ordered nodes is that shared locks can be built on them, such as the read lock of a read/write lock (a Curator-based code sketch follows the list below). As shown in the following figure, the temporary ordered nodes are divided into read-lock nodes and write-lock nodes:

  • For the read lock node, only the release of the previous write lock node needs to be concerned. If the previous write lock is released, multiple threads corresponding to the read lock node can concurrently read data.
  • For a write lock node, only the release of the previous node is concerned, whether it is a read lock node or a write lock node.
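
In practice the temporary-ordered-node scheme is usually not hand-written; a minimal sketch assuming Apache Curator's InterProcessMutex recipe (the connection string and lock path are illustrative):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkLockDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Internally creates an ephemeral sequential node and watches its predecessor.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/order");
        lock.acquire();
        try {
            // ... critical section ...
        } finally {
            lock.release();
            client.close();
        }
    }
}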

Distributed transaction solutions

What is a transaction

A transaction provides a mechanism that brings all operations involved in an activity into one indivisible execution unit: the transaction commits only if every operation in it executes successfully, and the failure of any operation rolls the whole transaction back. In short, a transaction is an "all or nothing" mechanism.

Database transaction ACID property

  • Atomicity: All operations in the entire transaction are either complete or not complete, and cannot be stopped at some intermediate stage.
  • Consistency: Consistency constraints on database data are not broken before and after a transaction.
  • Isolation: The database allows multiple transactions to execute concurrently. If the data to be accessed by one transaction is being modified by another transaction, the data accessed by it is not affected by the uncommitted transaction as long as the other transaction is not committed.
  • Durability: after a transaction completes, its changes to the data are permanent and will not be lost even if the system fails.

In short, atomicity describes the integrity of a transaction, consistency and isolation describe the data correctness of a transaction, and durability describes the reliability of the transaction's changes to the data.

What are distributed transactions

concept

With the rapid development of the Internet and the adoption of SOA, microservices and similar architectures, distributed systems are now generally composed of multiple independent subsystems that cooperate over network communication to complete each function. For example, the order-and-pay flow in e-commerce involves at least the trading system and the payment system, and the flow must keep the data of the two systems consistent; such a cross-system transaction is what we call a distributed transaction. In simple terms, a distributed transaction is one in which the participants, the servers supporting the transaction, the resource servers, and the transaction manager are located on different machine nodes.

Difficulties

  • Atomicity of transactions: Transactions operate across different nodes. When an operation on one node fails, the atomicity of multi-node operations must be guaranteed.
  • Transaction consistency: When a network transmission failure or node failure occurs, the data replication channel between nodes is interrupted. Data consistency must be ensured during transaction operations to ensure that data does not violate the constraints and trigger rules defined by the database.
  • Transaction isolation: distributed transaction control may involve asynchronously committed transactions, which can produce "partially committed" states; if concurrent applications access the data without control at that point, "dirty read" problems may occur.

Consistency in distributed systems

CAP theory

CAP theory, also known as Brewer's theorem, states that a distributed system (a collection of interconnected nodes that share data) can guarantee only two of consistency, availability, and partition tolerance when read and write operations are involved; the third must be sacrificed.

  • Consistency: a read operation is guaranteed to return the latest write result for a given client. The emphasis is on the client's view: while a transaction is executing, the client cannot read uncommitted data; only after the transaction commits can the client read the data written by it; and if the transaction fails and rolls back, the client never reads the data written in between.
  • Availability: Non-failure nodes return reasonable responses (not error and timeout responses) in a reasonable amount of time. The emphasis here is on reasonable responses, without timeouts and errors.
  • Fault tolerance of partitions: The system can continue to perform its duties after network partitions occur.

Although CAP theory says we can only choose two of the three, in a distributed environment we must choose P (partition tolerance), because the network is never 100% reliable. Suppose we chose CA and gave up P: when a partition occurs, the system would have to forbid writes to preserve C, and returning an error on a write request violates A, which requires no errors and no timeouts. Therefore, a distributed system can in theory only choose between CP (consistency + partition tolerance) and AP (availability + partition tolerance).

BASE theory

BASE theory refers to Basically Available, Soft State, and Eventual Consistency.

  • BA: Basic availability. When a distributed system fails, it allows the loss of partial availability, that is, the core is guaranteed to be available.
  • S: Soft state, which allows the existence of intermediate state in the system, but does not affect the overall availability of the system. The intermediate state here is data inconsistency in CAP theory.
  • E: Final consistency. All data copies in the system reach a consistent state after a certain period of time.

CAP theory ignores latency, but in practice latency is unavoidable, which means a perfect CP system does not exist: a CP scheme in CAP actually delivers eventual consistency, just with a "certain time" of only a few milliseconds. And an AP scheme sacrifices consistency only while a partition exists, not forever. This is where BASE extends CAP: consistency is sacrificed during the partition, but once the partition heals the system should reach eventual consistency.

Data consistency

  • Strong consistency: Requires that all subsequent reads obtain the latest data, regardless of which data copy the update is performed on.
  • Weak consistency: under this consistency model, it takes some time after an update before users can read the new value of the data; this period is called the "inconsistency window".
  • Eventual consistency: a special case of weak consistency in which the system guarantees that users will eventually read the updated value of a piece of data, provided no further updates change it in the meantime; the size of the inconsistency window depends on interaction latency, system load, the number of data replicas, and so on.

Flexible transaction

Flexible transactions are designed on the ideas of BASE theory: without hurting overall system availability, the system is allowed to hold data in an intermediate, inconsistent state (soft state), and after a period of data synchronization the data eventually becomes consistent. ACID is not abandoned completely; instead, consistency requirements are relaxed and local transactions are used to achieve the eventual consistency of the distributed transaction while preserving the throughput of the system.

XA and 2PC, 3PC

The full name of XA is eXtended Architecture. It is a specification for distributed transaction processing (DTP) published by X/Open in 1991, a distributed transaction protocol that guarantees strong consistency through the two-phase commit protocol. The following figure shows the DTP model from the Open Group standard document.

The DTP model abstracts the roles of AP (application program), TM (transaction manager) and RM (resource manager) to guarantee the strong consistency of distributed transactions; RM and TM communicate bidirectionally using the XA protocol.

Compared with a traditional local transaction, an XA transaction adds a prepare phase: besides passively accepting commit instructions, the database can also tell the caller in advance whether its branch can be committed. The TM collects the results of all branch transactions and performs an atomic commit at the end, guaranteeing the strong consistency of the transaction.

Java implements the XA model by defining the JTA interfaces. The ResourceManager in JTA must be implemented by the database vendor's XA driver, and the TransactionManager must be implemented by a transaction-manager vendor. Traditional transaction managers are bound to an application server, so the cost is high; embedded transaction managers can instead provide the service as a JAR package.

2PC (two-phase commit)

The idea of two-phase commit is that the participants report the success or failure of their operations to the coordinator, and the coordinator then decides, based on the feedback from all participants, whether to commit or abort. The two phases are:

  • 1. Preparation stage (voting stage)
  • 2. Submission Stage (Implementation stage)

The two-phase protocol has advantages and is relatively simple, but the disadvantages are also obvious:

  • 1. Synchronization block: All transaction participants are in synchronization block state while waiting for the response of other transaction participants and cannot perform other operations.
  • 2. Single point of problem: The coordinator plays a big role in 2PC, and failure can have a big impact, especially in phase 2, where all participants are waiting.
  • 3. Inconsistent data: In phase 2, if the coordinator only sends part of the Commit message, an exception occurs on the network and only part of the participants Commit the transaction, resulting in inconsistent data.
  • 4. Too conservative: failure of any node will lead to failure of the whole transaction, without a perfect fault tolerance mechanism.

3PC (three-phase commit)

Three-phase commit is designed to address the shortcomings of two-phase commit and is a "non-blocking" protocol. Compared with 2PC, it makes two changes:

  • Timeouts are introduced, both in the coordinator and in the participant.
  • A preparation phase is inserted between the first and second phases to ensure that the states of the participating nodes are consistent until the final commit phase.

In other words, 3PC splits 2PC's preparation phase in two, giving three phases: CanCommit, PreCommit, and DoCommit.

Compared with 2PC, 3PC has the following advantages: it solves the single point of failure and reduces blocking.

Problems with 3PC: data inconsistency is still possible. If the abort response sent by the coordinator fails to reach a participant in time because of a network problem, that participant commits by default after its timeout, ending up inconsistent with the other participants that received the abort and rolled back.

Distributed transaction solutions

TCC

The TCC transaction mechanism has three phases:

  • In the Try phase, all services are checked (consistency) and necessary services are reserved (quasi-isolation).
  • In the Confirm phase, services are executed without any service check and only service resources reserved in the Try phase are used. The Confirm operation is idempotent and requires idempotent design. If the Confirm fails, retry is required.
  • Cancel phase: The execution is cancelled and service resources reserved during the Try phase are released. The Cancel operation is idempotent.

Compared with XA, the TCC mechanism removes the single-point coordinator: the transaction is initiated and completed by the main business service, and the business activity manager becomes multi-node by introducing a cluster. It also reduces synchronous blocking: timeouts are introduced and whole resources are no longer locked; instead, locking is turned into business logic of much finer granularity. The cost is complexity: logic that could originally be implemented with one interface must be split into Try, Confirm and Cancel interfaces, so the code implementation is relatively complex.
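
A minimal sketch of the three interfaces, using hypothetical names (illustrative, not a specific TCC framework's API):

import java.util.List;

// Each participant of a TCC transaction exposes three idempotent operations.
interface TccParticipant {
    boolean tryReserve(String txId);   // Try: check and reserve business resources (e.g., freeze stock)
    void confirm(String txId);         // Confirm: use only the reserved resources; retried until it succeeds
    void cancel(String txId);          // Cancel: release the reserved resources; must tolerate an "empty rollback"
}

class TccCoordinatorSketch {
    // Called by the main business service that initiates the distributed transaction.
    static void runTransaction(String txId, List<TccParticipant> participants) {
        boolean allReserved = participants.stream().allMatch(p -> p.tryReserve(txId));
        for (TccParticipant p : participants) {
            if (allReserved) p.confirm(txId); else p.cancel(txId);
        }
    }
}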

Local message table

The local message table scheme has two roles, message producer and message consumer. Suppose system A is the producer and system B is the consumer; the general flow is as follows:

  • 1. When system A is called by another system to perform a data update, it first updates the business table and, within the same database and the same local transaction, inserts a record into a message table.
  • 2. A scheduled script in system A polls the local message table and writes the messages to the MQ server; if sending fails, the script retries.
  • 3. System B consumes the messages from MQ and executes its business logic; if its local transaction fails, it keeps consuming (retrying) the message from MQ, and if the business itself fails, it can notify system A to roll back.

Local message table implementation conditions:

  • Both consumer and producer interfaces support idempotent.
  • Producers need to create additional message tables.
  • Compensation logic needs to be provided and the producer needs to support rollback operations if the consumer business fails.

Fault tolerance mechanism:

  • If step 1 fails, the transaction is rolled back directly.
  • Steps 2 and 3: if writing to MQ or consuming from MQ fails, retry.
  • Step 3: if the business fails, system B notifies system A to roll back the transaction.

The core of this solution is to execute the tasks that need distributed processing asynchronously through message logs: the message log can be stored in a local file, a database, or a message queue, and retries are then triggered automatically or manually according to business rules. The idea is similar to MySQL master-slave replication.
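
A minimal sketch of the producer-side message table under the assumptions above (the column names are illustrative):

create table local_message (
  id           bigint primary key,
  biz_type     varchar(64)   not null,           -- which business event this message belongs to
  payload      varchar(2048) not null,           -- message body to be sent to MQ
  status       tinyint       not null default 0, -- 0 = pending, 1 = sent, 2 = consumed
  retry_count  int           not null default 0,
  create_time  datetime      not null,
  update_time  datetime      not null
);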

Reliable message eventual consistency

The general process is as follows

1. System A sends a prepare message (half message) to the MQ server; if sending fails, the whole operation is cancelled. 2. If the prepare message is sent successfully, system A executes its local transaction. 3. If the local transaction succeeds, A sends a confirm message to MQ; if it fails, the local transaction is rolled back and no confirm message is sent. 4. System B consumes the confirm messages from MQ, executes its local transaction, and sends an ACK. If B's local transaction fails, it keeps retrying; if the business itself fails, B sends a rollback request to system A. 5. MQ periodically polls all prepare messages and calls an interface provided by system A to query the status of the corresponding local transaction; if the local transaction succeeded, the confirm message is sent again.

Best-effort notification

Best-effort notification is the simplest kind of flexible transaction. It suits business that is not very sensitive to how quickly eventual consistency is reached, and where the passive party's processing result does not affect the active party's result. The general idea of this scheme is:

  • System A sends this message to MQ after the local transaction is completed.
  • There will be a service consuming MQ, which will consume MQ messages and invoke the interface of system B.
  • If calling system B succeeds, nothing more is needed; if it fails, the notification service retries the call to system B periodically, making its best effort.

Consistent hash algorithm

Hash algorithm

A hash algorithm transforms an input of arbitrary length (also called the pre-image) into an output of fixed length, the hash value. This transformation is a compressing mapping: the space of hash values is usually much smaller than the space of inputs, so different inputs may hash to the same output. Simply put, it is a function that compresses a message of arbitrary length into a fixed-length message digest, "compressing" an input into a smaller value that in practice usually serves as a unique and compact identifier.

Most business systems today use microservice architectures, and a plain hash does not fit this distributed setting naturally. For example, to store data on specific nodes in a distributed system, we might route it with an ordinary hash such as key % N, where key is the data's key and N is the number of machine nodes. As soon as a machine joins or leaves the cluster, the mapping becomes invalid, a large amount of data has to be rehashed, and the normal running of the service is affected. This is where a consistent hash algorithm is needed.

Consistent Hash algorithm

The consistent hash algorithm is a special hash algorithm devised to solve the distributed caching problem: when a server is removed or added, the existing mapping between requests and the servers that handle them should change as little as possible. Consistent hashing solves the dynamic scaling problem that simple (modulo) hashing has in distributed hash tables.

balance

Equilibrium means that the result of hashing can be distributed to all the buffer nodes as much as possible, so that all the buffer space can be utilized.

monotonicity

Monotonicity is the ability of consistent hashing to protect allocated content from being remapped to a new buffer when the buffer changes, reducing the amount of rehashing and improving performance.

dispersion

In a distributed environment, it is possible for an endpoint to not see all caches, but only some of them. When the terminal wants to map the content to the cache through the hash process, different terminals may see different buffer scope, resulting in the hash result is inconsistent, the final result is the same content is mapped to different buffers by different terminals. This situation should obviously be avoided because it results in the same content being stored in different caches, reducing the efficiency of system storage.

load

The load problem is really the dispersion problem seen from the other side: since different terminals may map the same content to different cache instances, a particular cache instance may also have different content mapped onto it by different users. Like dispersion, this should be avoided, so a good hash algorithm should minimize the load on each buffer.

What is a consistent hash

  • 1. First compute the hash value of each Redis server node and place it on a ring covering 0 to 2^32.
  • 2. Then hash the key of the data to be stored with the same method and map it onto the same ring.
  • 3. From the data's position, search clockwise and save the data to the first server found; if no server is found before passing 2^32, save it to the first Redis server.

Starting from the state above, if a Redis server is added, the modulo-based algorithm would change the cache instance of many keys and hurt the cache hit rate; with the consistent hashing algorithm, only the keys on the small arc counter-clockwise of the newly added node (node5) are affected.

This nicely solves cache hit rate, fault tolerance and scalability, but when there are only a few service nodes it brings another problem, "data skew": a large number of keys are assigned to the same server, which is very risky, because if that node suddenly goes down it can cause a cache avalanche. How is this solved? With virtual nodes.

Suppose we only have two servers, the distribution is as follows

This inevitably concentrates a large amount of data on Redis2. To solve the data-skew problem, virtual nodes are introduced: compute multiple hashes for each service node and place each result on the ring as a position of that node; these positions are the "virtual nodes". For example, computing two virtual nodes per server gives Redis1#1, Redis1#2, Redis2#1 and Redis2#2, four virtual nodes in total.
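
A minimal sketch of a consistent hash ring with virtual nodes, using a TreeMap as the ring and a simple FNV-style hash (the hash function and node names are illustrative assumptions):

import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(List<String> nodes, int virtualNodes) {
        this.virtualNodes = virtualNodes;
        for (String node : nodes) {
            addNode(node);
        }
    }

    void addNode(String node) {
        // Each physical node is placed on the ring several times, e.g. "Redis1#1", "Redis1#2".
        for (int i = 1; i <= virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 1; i <= virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // Walk clockwise: the first virtual node at or after the key's hash owns the key;
    // wrap around to the first entry on the ring if nothing is found.
    String locate(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // Simple 32-bit FNV-1a hash, masked to stay within the 0..2^32 ring described above.
    private long hash(String s) {
        final int p = 16777619;
        int h = (int) 2166136261L;
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * p;
        }
        return h & 0xFFFFFFFFL;
    }
}

// Usage: new ConsistentHashRing(List.of("Redis1", "Redis2"), 2).locate("user:42");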