Designing a highly available software architecture is a goal of every Internet software developer. For online trading and online advertising systems in particular, service availability directly affects revenue. As Internet technology has spread through industry, Internet companies have accumulated mature high-availability architectures in their own business areas. So what exactly is high availability, and how do you design a highly available architecture?
measure
The time during which a service is unavailable is called failure time. Availability is usually quoted as a number of nines (99.9% is three nines), and the allowed failure time drops by an order of magnitude with every additional nine. Going from days of downtime to minutes is relatively easy to achieve. Every step beyond that requires hundreds of times more effort. Large web services usually reach at least four nines, and five or more are difficult. There are not only technical challenges to solve but also enormous cost pressures. For a website's core services, five nines is pursued as far as possible; for non-core services, four nines or even three nines is acceptable. The economics must be considered when making technical decisions.
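To make the "order of magnitude per nine" point concrete, here is a minimal sketch that converts a number of nines into allowed downtime per year. The function names and the 365-day year are assumptions made for illustration, not something the article specifies.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def yearly_downtime_seconds(nines: int) -> float:
    """Allowed downtime per year at an availability of `nines` nines (3 -> 99.9%)."""
    return SECONDS_PER_YEAR / 10 ** nines


def human(seconds: float) -> str:
    """Render a duration in the largest unit that fits."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.2f} seconds"


for n in range(1, 7):
    print(f"{n} nines: {human(yearly_downtime_seconds(n))} of downtime per year")
```

Under these assumptions, three nines allows roughly 8.76 hours of downtime per year, while four nines allows under an hour.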
methods
So how do you make a service highly available? The solution is simple: redundancy. In plain terms, it is a double-insurance mechanism, and the theory behind it is probability theory. Suppose one service is 99% available (failure rate 1%). If two copies fail independently, their combined availability is 1-0.01*0.01=99.99%. As you can see, redundancy improves availability exponentially. Add one more redundant copy and you have six nines. Wow, improving availability seems so easy!
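The arithmetic above generalizes to any number of copies. The sketch below assumes that copies fail independently and that the system is up as long as at least one copy is up; the function name is made up for illustration.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant services.

    Assumes failures are independent and that one healthy copy is enough.
    """
    failure_rate = 1 - single
    return 1 - failure_rate ** copies


for copies in (1, 2, 3):
    print(f"{copies} copies: {combined_availability(0.99, copies):.4%}")
```

Real failures are often correlated (same rack, same bad deployment, same network), so the numbers this produces are best read as an upper bound.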
limit
Is it really that simple? Of course not. Don’t forget the CAP theorem!
Redundancy addresses A (availability) and P (partition tolerance). The more redundancy there is, the higher the cost of achieving C (consistency). The biggest of these costs is time, and time is exactly what the technology cannot afford to spend. When programmers talk about high-availability systems, they often add the words "high concurrency." High concurrency reflects technology's pursuit of speed. These conflicting goals force architects to make different technology trade-offs for different business scenarios.
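One way to see the time cost of consistency: if a write counts as done only after every copy has confirmed it, the write is as slow as the slowest copy. The sketch below illustrates this with made-up round-trip times; the function name and the latency figures are assumptions, not measurements.

```python
from typing import Sequence


def sync_write_latency_ms(replica_latencies_ms: Sequence[float]) -> float:
    """Latency of a write that waits for every replica to acknowledge it."""
    return max(replica_latencies_ms)


# Hypothetical round-trip times; the third replica sits in a remote data center.
latencies_ms = [2.0, 3.5, 40.0]

for n in range(1, len(latencies_ms) + 1):
    latency = sync_write_latency_ms(latencies_ms[:n])
    print(f"{n} replica(s): write acknowledged after {latency} ms")
```

Adding a replica can only keep this latency the same or make it worse, which is why strongly consistent redundancy and low latency pull in opposite directions.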
design
There are three modes of redundant architecture design: Master & Master (dual-master), Master & Standby (active/standby), and Master & Slave.
In dual-master (dual-active) mode, the two services are peers and both serve reads and writes; a client can use either one. Dual-master mode offers the highest availability, but consistency is hard to achieve because data must be synchronized in both directions between the two services. If communication between them breaks, a network partition forms, which can cause a split-brain problem that the system cannot resolve on its own and that requires manual intervention. For this reason, dual-master mode is rarely chosen in architecture design.
In active/standby mode, the two services are no longer equal. The primary service handles all read and write requests, and the standby service takes over only when the primary is unavailable. Data is still synchronized in both directions, but unlike in dual-master mode, never in both directions at once. Normally, data is synchronized only from the primary to the standby. While the primary is unavailable, data is written to the standby. Only after the primary recovers does data need to be synchronized from the standby back to the primary. During this catch-up period, writes go to both nodes (double write), so that primary-to-standby synchronization does not have to run at the same time. The active/standby architecture is relatively easy to implement; its drawback is that the standby sits idle, wasting resources, most of the time. It is commonly considered when deploying database systems.
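The failover and recovery flow can be sketched as a toy in-memory model. The class and method names here are hypothetical, replication is modeled as an immediate copy, and the catch-up on recovery happens in one step, so the double-write window described above is not modeled.

```python
class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self.data: dict = {}


class ActiveStandbyPair:
    def __init__(self, primary: Node, standby: Node) -> None:
        self.primary = primary
        self.standby = standby

    def _active(self) -> Node:
        """The primary serves all traffic; the standby only when the primary is down."""
        if self.primary.healthy:
            return self.primary
        if self.standby.healthy:
            return self.standby
        raise RuntimeError("both nodes are down")

    def write(self, key, value) -> str:
        node = self._active()
        node.data[key] = value
        # Normal operation: synchronize primary -> standby only.
        if node is self.primary and self.standby.healthy:
            self.standby.data[key] = value
        return node.name

    def read(self, key):
        return self._active().data.get(key)

    def recover_primary(self) -> None:
        # Catch up with what the standby accepted during the outage,
        # then hand traffic back to the primary.
        self.primary.data.update(self.standby.data)
        self.primary.healthy = True


pair = ActiveStandbyPair(Node("primary"), Node("standby"))
pair.write("a", 1)             # served by the primary, copied to the standby
pair.primary.healthy = False   # simulate an outage
pair.write("b", 2)             # served by the standby
pair.recover_primary()
print(pair.read("a"), pair.read("b"))
```

Copying the standby's data back before marking the primary healthy is what keeps the two directions of synchronization from overlapping in this sketch.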
In fact, master-slave mode is not mainly about high availability; it is more about separating reads from writes in order to handle high concurrency. In practice there is usually not one master and one slave but one master and many slaves, because most applications read far more than they write. The master handles write requests and the slaves handle read requests. With multiple slaves, the read service is much more available than the write service, and the write service remains a single point of failure. This problem can be solved by dynamic master election within the cluster: when the master becomes unavailable, the cluster automatically elects a new one. With ZooKeeper, dynamic master election is easy to implement (see Theory & Application & Practice of Zookeeper). However, dynamic master election is not used in the MySQL cluster architecture, because data synchronization between the master and the slaves must be configured at deployment time. If the master is switched, operations staff have to modify the configuration and restart the service.
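Read/write separation with dynamic master election can be sketched as a toy in-memory model. Everything here is hypothetical: the class names are invented, replication is an immediate copy, and the "election" simply promotes the first healthy slave. It stands in for a real coordination service and does not use ZooKeeper's API.

```python
import itertools


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self.data: dict = {}


class MasterSlaveCluster:
    def __init__(self, master: Replica, slaves) -> None:
        self.master = master
        self.slaves = list(slaves)
        self._counter = itertools.count()  # drives round-robin reads

    def elect_new_master(self) -> None:
        """Toy election: promote the first healthy slave."""
        candidates = [s for s in self.slaves if s.healthy]
        if not candidates:
            raise RuntimeError("no healthy node left to promote")
        self.master = candidates[0]
        self.slaves.remove(self.master)

    def write(self, key, value) -> None:
        # Writes go to the master only; elect a new one if it is down.
        if not self.master.healthy:
            self.elect_new_master()
        self.master.data[key] = value
        for slave in self.slaves:
            if slave.healthy:
                slave.data[key] = value  # simplified synchronous replication

    def read(self, key):
        # Reads are spread across healthy slaves, falling back to the master.
        pool = [s for s in self.slaves if s.healthy]
        if not pool:
            if not self.master.healthy:
                raise RuntimeError("no healthy node left to read from")
            pool = [self.master]
        node = pool[next(self._counter) % len(pool)]
        return node.data.get(key)


cluster = MasterSlaveCluster(Replica("m"), [Replica("s1"), Replica("s2")])
cluster.write("x", 1)
cluster.master.healthy = False  # simulate the master failing
cluster.write("y", 2)           # triggers promotion of s1
print(cluster.master.name, cluster.read("y"))
```

The sketch also shows why reads are more available than writes: any healthy slave can serve a read, while a write always depends on a single master being available or replaceable.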
case
The above is an abstract architecture design. Real-world architectures are more complex and combine the active/standby and master/slave modes. The following figure is a simplified version of the MySQL cluster architecture used by a domestic Internet company.
read
- Origin and theory of distributed systems
- Theory & Application & Practice of Zookeeper