It's results day again. Countless exam candidates, full of anticipation, open the admissions website and enter their details, only to find there is no way to look anything up: too many people are querying at once, and the system has simply crashed. At a moment like this, someone is bound to ask a heart-piercing question: how many 9s does this system have? To answer it, we need some background knowledge.
Availability & reliability
The two terms are very similar, and I couldn't find a good definition to distinguish them until I read about distributed systems and came across this explanation:
Availability is defined as the property that a system is ready to be used immediately. In other words, a highly available system is most likely working at any given moment. Reliability refers to the property that a system can run continuously without failure. In contrast to availability, reliability is defined in terms of a time interval, not an instant.
For example, suppose you want to evaluate a devoted admirer (a "lick 🐶" in Chinese internet slang). Availability is whether you can reach him whenever you look for him; reliability is how generous he is when it is time to pay. An admirer who is always on call but stingy with money is highly available but unreliable; one who is often hard to find but spends big when he does show up has low availability but high reliability. Applied to systems: if a system crashes for 1 ms every hour, its availability is above 99.9999%, yet it is still highly unreliable. Conversely, a system that never crashes but is shut down for two weeks each year is highly reliable but only about 96% available.
Baidu Encyclopedia explains system reliability as follows: system reliability generally refers to the ability, or probability, of a system to complete its specified functions within a specified time and under specified working conditions. That is, it is the probability that the system runs without failure. When we evaluate the availability and reliability of a system, we talk about three nines, four nines, and so on. A **Service Level Agreement (SLA)** typically uses the number of 9s to express how much downtime the system may have in a year. I will explain these nines in detail in section 3. In everyday conversation, however, most people treat the two words as roughly the same. Splitting hairs over terminology is not the subject of this article, so let's look at how availability is calculated.
Availability calculation
Usually, A denotes the availability of a system, and the following indicators are used to calculate it.
Relevant indicators
MTBF
MTBF stands for Mean Time Between Failures. It is an index for measuring the reliability of a product (especially electrical products), expressed in hours. Specifically, it is the average working time between two adjacent failures, also known as the mean failure interval.
MTTR
MTTR stands for Mean Time To Repair. It is the average repair time of a repairable product, that is, the time from the occurrence of a failure to the completion of the repair. The shorter the MTTR, the better the recoverability.
From these two indicators, availability is computed as $A = \frac{MTBF}{MTBF + MTTR}$. After calculating the availability of individual components with this formula, we can calculate the availability of the entire system by modeling it as a combination of series and parallel components. The following rules determine whether two components are in series or in parallel:
- If the failure of either component makes the combination inoperable, the two components are considered to operate in series
- If the failure of one component causes the other component to take over its work, the two components are considered to operate in parallel
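As a quick illustration of the per-component calculation, here is a minimal Python sketch. It assumes the standard steady-state relation A = MTBF / (MTBF + MTTR); the function name and the sample figures (a failure every 1,000 hours on average, one hour to repair) are hypothetical and chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails every 1,000 hours on average, repaired in 1 hour.
a = availability(mtbf_hours=1000, mttr_hours=1)
print(f"Availability: {a:.4%}")  # Availability: 99.9001%
```

Shortening the repair time raises availability just as lengthening the time between failures does, which is why MTTR matters as much as MTBF.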
Serial availability
As shown in the figure above, two components X and Y are considered to be in series if the failure of either one makes the whole combination unavailable. The combination is available only when both X and Y are available, so its availability is the product of the two component availabilities:

$A = A_x \times A_y$

It follows from this equation that, in a series system, the availability of the whole combination is never higher than that of any individual component, and is strictly lower whenever the components are less than 100% available. For the two series components X and Y above, the availability is as follows:
From the table above, we can see that even though component Y is highly available, the combined system is still dragged down by component X, whose availability is much lower. This is consistent with the "barrel principle": a barrel holds only as much water as its shortest stave allows.
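The series rule can be checked with a short Python sketch. The availabilities used here (99% for X and 99.99% for Y) are assumed values chosen for illustration, not figures taken from the article's table, and the function name is hypothetical. The sketch also assumes the components fail independently.

```python
from math import prod


def series_availability(*components: float) -> float:
    """Availability of components in series: the product of their availabilities."""
    return prod(components)


# Assumed example values, not taken from the article.
a_x, a_y = 0.99, 0.9999
combined = series_availability(a_x, a_y)
downtime_days = (1 - combined) * 365

print(f"Combined availability: {combined:.4%}")        # Combined availability: 98.9901%
print(f"Downtime per year: {downtime_days:.2f} days")  # Downtime per year: 3.69 days
```

With these numbers the combination is slightly worse than X alone, which is the "shortest stave" effect in action.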
Parallel availability
As shown above, two components are considered to be in parallel if the whole system fails only when both of them fail. As long as either component is available, the entire system is available. The overall availability is 1 minus the probability that both components are unavailable, which for two identical components X gives:

$A = 1 - (1 - A_x)^2$

As this shows, the overall availability of a system with two components in parallel is higher than that of either single component. For two X components in parallel, as shown above, the availability is as follows:
We see that even components X with low availability can be combined in parallel to produce a high-availability system.
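The parallel rule generalizes to any number of redundant components. The Python sketch below is a minimal illustration under two assumptions that the text does not state explicitly: failures are independent, and failover is instantaneous and perfect. The 99% figure for X and the function name are hypothetical.

```python
from math import prod


def parallel_availability(*components: float) -> float:
    """Availability of components in parallel.

    The system is down only when every component is down at the same time,
    so A = 1 - product of the individual unavailabilities (1 - A_i).
    """
    return 1 - prod(1 - a for a in components)


# Assumed example value for component X, not taken from the article.
a_x = 0.99
for n in (1, 2, 3):
    combined = parallel_availability(*[a_x] * n)
    print(f"{n} x X in parallel: {combined:.6%}")
```

Under these assumptions, each extra copy of a 99% component adds roughly two more nines; real systems gain less because failures are often correlated.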
X nines
Having covered how availability is calculated, we finally return to the focus of this article: "X nines", a measure of reliability. "X nines" expresses the ratio of the time the system can be used normally to the total time, over one year of operation. The following calculations show how the different levels compare.
- Three nines: (1 - 99.9%) × 365 × 24 = 8.76 hours, meaning that over one year of continuous operation, service may be interrupted for at most 8.76 hours.
- Four nines: (1 - 99.99%) × 365 × 24 = 0.876 hours = 52.6 minutes, meaning that over one year of continuous operation, service may be interrupted for at most 52.6 minutes.
- Five nines: (1 - 99.999%) × 365 × 24 × 60 = 5.26 minutes, meaning that over one year of continuous operation, service may be interrupted for at most 5.26 minutes.
So the X in "X nines" usually only takes values from 3 to 5. Why not 1, 2, or 6 and above? Let's continue the calculation:
- One nine: (1 - 90%) × 365 = 36.5 days
- Two nines: (1 - 99%) × 365 = 3.65 days
- Six nines: (1 - 99.9999%) × 365 × 24 × 60 × 60 = 31 seconds
As we can see, one nine and two nines mean that service may be interrupted for 36.5 days and 3.65 days per year, respectively. That level hardly deserves the word "reliability". Six nines, on the other hand, mean at most about 31 seconds of interruption per year. This level is not impossible to achieve, but going from "five nines" to "six nines" costs several times as much.
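The arithmetic above can be reproduced for every level with a short Python sketch. It uses the same 365-day year as the calculations in this section; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # same 365-day year used in the text


def yearly_downtime_seconds(nines: int) -> float:
    """Maximum yearly downtime for an availability of `nines` nines.

    The unavailable fraction is 10**-nines (for example 0.001 for three nines).
    It is computed directly rather than as 1 - availability to avoid
    floating-point rounding noise.
    """
    return SECONDS_PER_YEAR * 10 ** -nines


for n in range(1, 7):
    secs = yearly_downtime_seconds(n)
    print(f"{n} nines: {secs / 86400:.4f} days"
          f" = {secs / 3600:.3f} hours"
          f" = {secs / 60:.2f} minutes"
          f" = {secs:.1f} seconds")
```

Each additional nine divides the allowed downtime by ten, which is why the jump from five to six nines leaves only about half a minute of slack per year.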
Availability (A) | Number of nines | Downtime per year (minutes, approx.) | Typical product |
---|---|---|---|
0.999 | 3 nines | 500 | Computer or server |
0.9999 | 4 nines | 50 | Enterprise equipment |
0.99999 | 5 nines | 5 | General carrier-grade equipment |
0.999999 | 6 nines | 0.5 | Carrier-grade equipment with higher requirements |
How do we get more nines?
The required number of nines varies from company to company; many Internet companies require 99.99%. Some institutional and service websites that frequently fail and become unavailable probably reach 99.9% at best. The levels we most often talk about are four nines and five nines, 99.99% and 99.999%. The difference between the two is only 0.009%, less than 0.01%, yet it is precisely this difference that puts the two systems in completely different classes.

The reliability of a system is not determined by hardware alone, but by software and hardware together. On the software side, we need to monitor our own services, recover promptly when a service behaves abnormally or goes down, and add redundancy to guard against problems. On the hardware side, which includes the network, servers, and storage devices, the network can be connected through multiple carriers; storage can use RAID and snapshots, improving data safety through backups; and for servers, we can use clustering to ensure high availability.
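To get a rough feel for how much redundancy a target number of nines demands, the parallel formula can be turned around: given the availability of a single server, how many identical servers in a cluster are needed to reach the target? The Python sketch below is only a back-of-the-envelope estimate. It assumes independent failures and perfect failover, which real clusters do not have, and the function name, the 95% single-server figure, and the `max_replicas` cap are all hypothetical.

```python
def replicas_needed(single_availability: float, target: float,
                    max_replicas: int = 100) -> int:
    """Smallest n such that 1 - (1 - a)**n >= target, assuming independent failures."""
    if not (0 < single_availability < 1 and 0 < target < 1):
        raise ValueError("availabilities must be strictly between 0 and 1")
    unavailable = 1.0
    for n in range(1, max_replicas + 1):
        # Probability that all n replicas are down at the same time.
        unavailable *= 1 - single_availability
        if 1 - unavailable >= target:
            return n
    raise ValueError("target not reachable within max_replicas")


# Hypothetical example: servers that are each 95% available, target of four nines.
print(replicas_needed(0.95, 0.9999))  # 4
```

In practice, shared dependencies such as power, network, and software bugs cause correlated failures, so the real number of nines is lower than this estimate suggests.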