


In general, we use multi-replica techniques to improve the reliability of storage systems, whether they are relational databases (typically MySQL), document-oriented NoSQL databases (MongoDB), or large-scale blob storage systems (GFS, Hadoop HDFS). Because data can almost be regarded as the lifeblood of an enterprise, ensuring the reliability of the data storage system is no small matter for any company.

Data Loss and copysets

“In a three-copy storage system consisting of 999 disks, what is the probability of data loss when three disks fail simultaneously?” The answer is closely tied to the design of the storage system. Let’s first consider two extreme designs.

Design 1: Group the 999 disks into 333 disjoint three-disk groups.

In this design, data loss occurs only if all three failed disks fall within the same group. The probability of data loss is therefore 333/C(999,3) ≈ 2.01 × 10⁻⁶.
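As a quick sanity check, this figure can be reproduced with Python’s standard library (a minimal sketch, not from the original article):

```python
from math import comb

# Design 1: 333 disjoint groups of three disks out of 999.
# Data is lost only when all three failed disks land in the same group.
p_loss = 333 / comb(999, 3)
print(p_loss)  # ≈ 2.01e-06
```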

Design 2: Data is randomly scattered across all 999 disks.

In the extreme case, the replicas of the logical data on any given disk are scattered across all of the other 998 disks in the cluster. In this design, the probability of data loss is C(999,3)/C(999,3) = 1; that is, any three simultaneous disk failures are certain to lose some data.

From these two extremes we can see that the probability of data loss is closely related to how widely data is scattered. To go further, we introduce a new concept, the copyset: a group of devices that holds all copies of some piece of data. For example, if a piece of data is written to disks 1, 2, and 3, then {1,2,3} is a copyset (replication group).

In a 9-disk cluster with three replicas, the minimum number of copysets is 3, e.g. copysets = {1,2,3}, {4,5,6}, {7,8,9}; a piece of data may only be written to one of these groups. Data loss occurs only if all disks of {1,2,3}, of {4,5,6}, or of {7,8,9} fail together. In general, the minimum number of copysets is N/R, where R is the number of replicas and N is the number of disks. The maximum number of copysets is C(N,R), which is reached when replicas are placed fully at random: pick any R disks, and all R copies of some piece of data will be on exactly those disks. So for a storage system with N disks, R replicas, and S copysets, the bounds below hold (see the sketch after this paragraph):

N/R ≤ S ≤ C(N,R)
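Both bounds are easy to compute; here is a minimal Python sketch (the helper name copyset_bounds is mine, for illustration):

```python
from math import comb

def copyset_bounds(n_disks, n_replicas):
    # Minimum (disjoint groups) and maximum (fully random placement)
    # possible number of copysets for N disks and R replicas.
    return n_disks // n_replicas, comb(n_disks, n_replicas)

print(copyset_bounds(9, 3))    # (3, 84)
print(copyset_bounds(999, 3))  # (333, 165668499)
```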

Disk faults and storage system reliability estimation

1. Disk faults and the Poisson distribution


Before formally estimating the relevant probabilities, we need to introduce a basic probability distribution: the Poisson distribution. The Poisson distribution describes the probability that a given number of random events occur in a system within a fixed interval, for example the number of customers waiting at a taxi stand, or the probability that N babies are born in a hospital within one hour. For a more intuitive introduction, see Ruan Yifeng’s article “Poisson and Exponential Distributions: a 10-minute tutorial”.

The Poisson distribution is given by:

P(N(t) = n) = (λt)^n · e^(−λt) / n!

where P(N(t) = n) is the probability that exactly n events occur within time t, n is the event count, and λ is the average rate (frequency) of events per unit time.



For example, the probability that exactly 10 of 1000 disks fail within a year is P(N(365) = 10) [note: t is measured in days]. Here λ is the expected number of failures per day among 1000 disks. According to Google’s statistics, the annual disk failure rate is about 8%, so λ = 1000 × 8% / 365 ≈ 0.22.
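Plugging these numbers into the formula gives a concrete calculation; below is a minimal Python sketch (the helper poisson_pmf is my own, built on the standard library):

```python
import math

def poisson_pmf(n, lam, t):
    # P(N(t) = n): probability that exactly n events occur within
    # time t for a Poisson process with rate lam (events per day).
    mu = lam * t
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

# 1000 disks, 8% annual failure rate -> lam ≈ 0.22 failures per day
lam = 1000 * 0.08 / 365
print(poisson_pmf(10, lam, 365))  # P(exactly 10 failures in a year)
```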



The above only gives the probability that a certain number of disks fail. How can we use this formula to estimate data reliability (i.e., the probability of data loss) in a distributed system?

2. Estimating the loss rate of a distributed storage system

2.1 Failure rate within time T

To estimate the annual failure rate of a distributed storage system, assume that T is 1 year, the system is full of data, and failed disks are never repaired or replaced. Under these assumptions, we calculate the annual probability of data loss. First, some definitions:

N: number of disks
T: statistical period
K: number of failed disks
S: number of copysets (replication groups) in the system
R: number of replicas

How do we calculate the probability of data loss within T (1 year)? From the perspective of probability and statistics, we account for every possible data-loss event within T. In a system of N disks with R replicas, data loss becomes possible once the number of failed disks K reaches R; that is, K may be R, R+1, R+2, … up to N (all events with K ∈ [R, N]).

When K disks fail, under what circumstances is data actually lost? Exactly when the failed disks cover a copyset. If K disks fail (i.e., K disks are chosen at random), the probability of covering at least one copyset is:

P = X / C(N,K)

where X is the number of K-disk combinations that cover at least one copyset. The probability of data loss due to exactly K disk failures is then:

Pa(T,K) = P × P(N(T) = K)

and the probability of data loss within time T is the sum of the probabilities of all possible data-loss events:

Pb(T) = Σ Pa(T,K), K ∈ [R, N]
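The count X depends on the replica-placement scheme, which the text above does not fix. As an illustration, here is a minimal Python sketch assuming the simple disjoint-copyset layout (S = N/R), where X/C(N,K) follows from inclusion-exclusion; all function names are mine, not from the original article:

```python
import math

def poisson_pmf(n, lam, t):
    # P(N(t) = n) for a Poisson process with rate lam (events per day)
    mu = lam * t
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def p_hit_disjoint(N, R, K):
    # P = X / C(N, K) for the disjoint layout with S = N // R copysets:
    # the probability that K random failed disks cover at least one
    # whole copyset, computed by inclusion-exclusion.
    S = N // R
    total = math.comb(N, K)
    p = 0.0
    for i in range(1, min(S, K // R) + 1):
        term = math.comb(S, i) * math.comb(N - i * R, K - i * R) / total
        p += term if i % 2 == 1 else -term
    return p

def loss_prob(N=999, R=3, afr=0.08, t_days=365):
    # Pb(T) = sum over K in [R, N] of P(N(T) = K) * X / C(N, K)
    lam = N * afr / 365  # expected disk failures per day in the cluster
    return sum(poisson_pmf(K, lam, t_days) * p_hit_disjoint(N, R, K)
               for K in range(R, N + 1))

print(loss_prob())  # annual data-loss probability, no repairs
```

For a random-placement design, p_hit_disjoint would be replaced by a count of failure combinations covering any of the C(N,R) possible copysets; the overall structure of the sum stays the same.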

2.2 Annual failure rate of a distributed system with recovery

Above, we assumed that no recovery measures are taken for any hardware failure during the year, and calculated the annual failure rate for the system in that state. In a real large-scale storage system, however, a recovery procedure is started as soon as data becomes under-replicated, and recovery in theory returns the system to its initial state. Adding this factor makes the reliability calculation considerably more complicated. In theory, disk failures and recoveries in a large-scale storage system form an extremely complex continuous-time process. Here we simplify the probability model into discrete, independent events within unit periods of length T. As long as the probability of a failure sequence spanning two adjacent periods is very small, and most disk failures can be recovered within one period, each period effectively starts from a fresh state, and this estimate remains approximately correct. We take T in hours, so a year is divided into 365×24/T periods, and the annual failure rate of the system can be read as 100% minus the probability that no data loss occurs in any of those periods.

Then the probability of data loss for the whole system over a year is:

Pc = 1 − (1 − Pb(T))^(365×24/T)
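Expressed directly in code (a minimal sketch; pb_T is assumed to come from a per-period version of the Pb(T) calculation above):

```python
def annual_failure_rate(pb_T, T_hours):
    # Pc = 1 - (1 - Pb(T))^(365*24/T): the chance of at least one
    # data-loss event across all independent periods of length T.
    periods = 365 * 24 / T_hours
    return 1 - (1 - pb_T) ** periods

# e.g. with a per-day loss probability of 1e-9 and T = 24 hours:
print(annual_failure_rate(1e-9, 24))  # ≈ 3.65e-07
```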

Netease Cloud object storage service

Netease Cloud Object Storage (NOS) is a cloud storage service with high performance, high availability, and high reliability. NOS supports standard RESTful APIs and provides rich online data-processing services to address unstructured data management in the Internet era. Netease Cloud uses a multi-replica mechanism to keep multiple backups of user files; if any server or disk fails, data is recovered immediately to keep it safe. Users are welcome to try it out.

Finally, if you would like to study the reliability estimation of distributed storage systems in more depth, see the author’s follow-up article: https://work-jlsun.github.io/2017/02/18/storage-durablity-2.html

References:
- Poisson and Exponential Distributions: a 10-minute tutorial
- Probability Theory: Binomial and Poisson Distributions
- Disk Failures and Annual Failure Rate Estimation for Storage Systems



