I recently read chapter 23 of Site Reliability Engineering: How Google Runs Production Systems (the Google SRE book) and want to record some thoughts on it here.
In daily work, we often need services to run in a distributed way. Deploying and operating distributed systems across regions, such as cities and continents, is often easy; keeping state consistent between those systems is hard. A service is only highly reliable and available if the data it serves is accurate, and the key to that lies in how certain pieces of state are propagated. This calls for a distributed consensus system that maintains the relevant state and ensures that the state everyone reads is ultimately consistent.
To build a distributed consensus system, we need to rely on theoretically proven foundations, the most basic of which is the CAP theorem.
The CAP theorem
The CAP principle says that a distributed system cannot satisfy Consistency, Availability, and Partition tolerance all at the same time:
1. Consistency: the data seen on each node is consistent
2. Availability: each node can access the data
3. Partition tolerance: the system tolerates network partitions
The CAP theorem states that a distributed system can achieve at most two of these three properties. In a real environment, network partitions will occur sooner or later (for example a fiber cut, high latency, or packet loss), so anyone building a distributed system has to choose between high availability and consistency.
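To make the trade-off concrete, here is a minimal sketch of two replicas that lose their link. Everything in it (the `Replica` class, the `mode` switch, the `Unavailable` exception) is a name I made up for illustration, not something from the book. A "CP" replica refuses requests during the partition, while an "AP" replica keeps answering and lets the two copies diverge.

```python
class Unavailable(Exception):
    """Raised when a replica refuses a request to protect consistency."""


class Replica:
    """Toy replica holding one value; `mode` is a hypothetical CP/AP switch."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.peer = None
        self.partitioned = False  # True when the link to the peer is broken

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                raise Unavailable("cannot reach peer, refusing write")
            self.value = value    # AP: accept locally, the copies now diverge
            return
        self.value = value
        self.peer.value = value   # synchronous replication while healthy

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm latest value, refusing read")
        return self.value


def simulate(mode):
    """Return (write succeeded?, value on a, value read from b) after a partition."""
    a, b = Replica(mode), Replica(mode)
    a.peer, b.peer = b, a
    a.write("v1")                          # replicated to both replicas
    a.partitioned = b.partitioned = True   # the network link breaks
    try:
        a.write("v2")
        write_ok = True
    except Unavailable:
        write_ok = False
    try:
        read_b = b.read()
    except Unavailable:
        read_b = None
    return write_ok, a.value, read_b


print(simulate("AP"))  # the write succeeds, but b still serves the old value
print(simulate("CP"))  # both requests are refused, the data stays consistent
```

In the AP run the client gets an answer from both sides, but the answers disagree; in the CP run nothing disagrees because nothing is served.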
BASE theory
We are familiar with the traditional ACID data store semantics (atomicity, consistency, isolation, durability), which give us strong consistency of data. Distributed systems, however, often provide a different set of semantics: BASE (Basically Available, Soft state, Eventual consistency). BASE is the result of trading off availability against consistency in the CAP theorem. Its core idea is that even when strong consistency cannot be achieved, each application can adopt an approach suited to its own characteristics to reach eventual consistency.
Basically available: when an unexpected failure occurs, the distributed system is allowed to lose part of its availability. Note that this is by no means the same as the system being unavailable.
Soft state: as opposed to hard state, data in the system is allowed to exist in an intermediate state, and this intermediate state is considered not to affect the overall availability of the system. In other words, the system may delay data synchronization between copies on different nodes.
Eventual consistency: all copies of the data in the system reach a consistent state after a period of synchronization.
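Soft state and eventual consistency can be illustrated with a small sketch. It assumes a last-write-wins rule with unique, caller-supplied timestamps and a hypothetical `anti_entropy` function that plays the role of background synchronization; real systems use more careful schemes, so treat this only as an illustration.

```python
class Replica:
    """Toy replica storing key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        # Last write wins: keep the entry with the larger timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Push every replica's entries to all replicas (assumes unique timestamps)."""
    for src in replicas:
        for key, (ts, value) in list(src.data.items()):
            for dst in replicas:
                dst.write(key, value, ts)


r1, r2, r3 = Replica(), Replica(), Replica()
r1.write("cart", "apple", ts=1)    # accepted by r1 only
r2.write("cart", "banana", ts=2)   # a later write lands on r2 only

# Soft state: the replicas temporarily disagree.
before = [r.read("cart") for r in (r1, r2, r3)]

anti_entropy([r1, r2, r3])

# Eventual consistency: after synchronization every replica agrees.
after = [r.read("cart") for r in (r1, r2, r3)]
print(before, after)
```

Each replica stays available for reads and writes the whole time; the price is the window in which `before` shows three different answers.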
Distributed system problem scenarios
Distributed system problems are mainly caused by network problems.
Split brain
If a heartbeat mechanism is used to provide a highly available service, then when network problems occur both nodes may act as the primary and accept writes at the same time, or both may shut down at the same time. This shows that leader election cannot be solved with simple heartbeats.
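The failure mode can be shown in a few lines. The first function below models a two-node heartbeat pair, and the second shows the majority rule that consensus-based systems rely on instead. Both functions, and the "lowest name wins" tie-break, are hypothetical simplifications of my own.

```python
def heartbeat_failover(link_up):
    """Two-node pair: the standby promotes itself when heartbeats stop arriving."""
    roles = {"a": "primary", "b": "standby"}
    if not link_up:
        # b cannot tell "a is dead" from "the link is down", so it promotes
        # itself while a is in fact still serving writes.
        roles["b"] = "primary"
    return [node for node, role in roles.items() if role == "primary"]


def quorum_primaries(partitions, cluster_size):
    """Only a side holding a strict majority of the cluster may have a primary."""
    majority = cluster_size // 2 + 1
    # Hypothetical tie-break: the lowest node name on the majority side leads.
    return [sorted(side)[0] for side in partitions if len(side) >= majority]


print(heartbeat_failover(link_up=False))          # two primaries: split brain
print(quorum_primaries([{"a"}, {"b", "c"}], 3))   # only the majority side leads
print(quorum_primaries([{"a"}, {"b"}, {"c"}], 3)) # no majority, no primary
```

Because two disjoint groups can never both hold a majority, the quorum rule yields at most one primary, at the cost of having none when no side has a majority.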
Disaster recovery switchover that requires manual intervention
In MySQL master-slave replication, an external program monitors the master instance and decides, based on the master's status, whether to promote a slave node to master. In CAP terms, this solution provides consistency and partition tolerance (CP) but does not guarantee availability (A).
Faulty group-membership algorithms
Suppose a gossip algorithm is used to select the leader. When network problems occur, such a system cannot satisfy consistency (C); it can only satisfy availability and partition tolerance (AP).
Solving distributed consensus: the Paxos protocol
The book does not introduce the Paxos protocol in detail, and its description is not very intuitive. For explanations that describe the algorithm through real-world analogies, see the articles "Paxos in Daily Life, It Turns Out We All Use It: An Everyday Interpretation of Paxos" and "Paxos Algorithm in Detail (Part 1): Describing the Algorithm Through the Real World".
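As a rough companion to those articles, here is a minimal single-value Paxos sketch with proposer and acceptor roles. It is my own simplification, not code from the book: everything runs synchronously in one process, there is no message loss, no learner role, and proposal numbers are chosen by hand.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (number, value) of the last accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise has been made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposer round; return the chosen value, or None if it fails."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises from the acceptors.
    replies = [a.prepare(n) for a in acceptors]
    granted = [prev for ok, prev in replies if ok]
    if len(granted) < majority:
        return None

    # If some acceptor already accepted a value, the proposer must adopt the
    # one with the highest proposal number instead of its own value.
    previous = [p for p in granted if p is not None]
    if previous:
        value = max(previous, key=lambda p: p[0])[1]

    # Phase 2: ask the acceptors to accept the (possibly adopted) value.
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "A")    # nothing chosen yet, so "A" is chosen
second = propose(acceptors, 2, "B")   # must adopt the already chosen "A"
stale = propose(acceptors, 1, "C")    # number too low, the round is rejected
print(first, second, stale)
```

The second call is the interesting one: a later proposer with a different value still ends up confirming the value that was already chosen, which is what keeps all participants in agreement.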
The following figure summarizes several roles and relationships in the protocol.
Distributed consensus algorithms are very low-level primitives. What is really useful are the system components built on top of them: data storage, configuration storage, queues, locking mechanisms, and leader election services.
The book then describes architecture patterns for distributed consensus systems, covering several components: reliable replicated state machines, leader election mechanisms, distributed coordination and locking services, distributed queues, and messaging.
Reliable replicated state machines
A replicated state machine (RSM) is a system that executes the same set of operations, in the same order, on multiple processes.
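A minimal sketch of the idea, assuming the ordered log has already been agreed on (in a real system that is the job of the consensus algorithm). The `KeyValueStateMachine` class and its operation names are hypothetical.

```python
class KeyValueStateMachine:
    """Deterministic state machine: the same log always yields the same state."""

    def __init__(self):
        self.state = {}
        self.applied = 0   # number of log entries applied so far

    def apply(self, op):
        kind, key, *args = op
        if kind == "set":
            self.state[key] = args[0]
        elif kind == "incr":
            self.state[key] = self.state.get(key, 0) + args[0]
        elif kind == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {kind}")
        self.applied += 1

    def catch_up(self, log):
        """Apply only the entries this replica has not seen yet, in log order."""
        for op in log[self.applied:]:
            self.apply(op)


# The agreed, ordered log of operations.
log = [("set", "x", 1), ("incr", "x", 4), ("set", "y", "a"), ("delete", "y")]

replicas = [KeyValueStateMachine() for _ in range(3)]
replicas[2].catch_up(log[:2])   # one replica lags behind for a while
for replica in replicas:
    replica.catch_up(log)       # every replica replays the same log

states = [replica.state for replica in replicas]

# Applying the same operations in a different order gives a different state.
reordered = KeyValueStateMachine()
reordered.catch_up([("incr", "x", 4), ("set", "x", 1)])
print(states, reordered.state)
```

The last lines show why agreeing on the order matters as much as agreeing on the operations themselves.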
Reliable data storage
Leader election mechanism
Distributed coordination and locking
Reliable distributed queuing and messaging
Performance issues
The book introduces several ways to improve the performance of distributed consensus systems. There is no single optimal scheme; the choice depends on the factors and scenarios involved.
1. Site Reliability Engineering: How Google Runs Production Systems
2. Paxos in Daily Life, It Turns Out We All Use It: An Everyday Interpretation of Paxos
3. Paxos Algorithm in Detail (Part 1): Describing the Algorithm Through the Real World
4. The CAP principle
5. The CAP principle (CAP theorem) and BASE theory