Abstract: The industry expects to use machine learning technology to build disk fault prediction models that detect disk faults in advance more accurately, reduce O&M costs, and improve the service experience. This case uses the random forest algorithm to train a hard disk fault prediction model.
This article is shared from the Huawei Cloud community post “Hard Disk Fault Prediction Based on the Random Forest Algorithm”, originally written by Shanhai Zhi Guang.
Experimental goals
1. Master the basic process of training a model with machine learning methods;
2. Master the basic methods of data analysis with pandas;
3. Master the scikit-learn methods for building, training, saving, and loading a random forest model, making predictions, computing accuracy metrics, and viewing the confusion matrix.
Case Content Introduction
With the development of the Internet and cloud computing, the demand for data storage is increasing day by day, and large-scale data storage centers are essential infrastructure. Although new storage media such as SSDs offer better performance than mechanical disks in many respects, their high cost still makes them unaffordable for most data centers. Therefore, large data centers still use traditional mechanical hard disks as the storage medium.
The life cycle of a mechanical disk is usually three to five years. After two to three years, the failure rate increases significantly, resulting in a sharp increase in disk replacement. According to statistics, hard disk faults account for 48% or more of server hardware faults and are an important factor affecting server operating reliability. As early as the 1990s, people realized that data was more valuable than the hard drives themselves, and there was a desire for technology that could predict hard drive failures and provide relatively safe data protection, so S.M.A.R.T. technology was born.
Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is an automatic hard disk status detection and warning system and specification. Using built-in hardware test instructions, it monitors the running status of disk components such as the head, platters, motor, and circuitry, records the results, and compares them against safe thresholds preset by the manufacturer. If a monitored value falls outside the preset safe range, the monitoring hardware or software on the host can automatically warn the user and perform minor automatic repairs, helping to protect disk data in advance. Most hard drives now ship with this technology, except for some very early models. For more information about the technology, see the S.M.A.R.T. entry on Baidu Encyclopedia.
Although disk manufacturers use S.M.A.R.T. to monitor disk health, most of them rely on rule-based fault prediction methods, whose prediction performance is poor and cannot meet the increasingly strict requirements for predicting disk faults in advance. Therefore, the industry expects to use machine learning technology to build disk fault prediction models that detect disk faults earlier and more accurately, reduce O&M costs, and improve the service experience.
This case study trains and tests a hard drive failure prediction model using an open-source S.M.A.R.T. dataset and the random forest algorithm from machine learning.
For a theoretical explanation of the random forest algorithm, see this video.
Notes
1. If you are using JupyterLab for the first time, please read the “ModelArts JupyterLab Instructions” to learn how to use it;
2. If you encounter errors while using JupyterLab, please refer to “ModelArts JupyterLab Common Problems and Solutions” to try to resolve them.
Experimental steps
1. Data set introduction
The dataset used in this case is an open-source dataset from Backblaze Inc., a computer backup and cloud storage service provider. Every year since 2013, Backblaze has publicly released the S.M.A.R.T. log data of the hard drives used in its data centers, which has effectively promoted the use of machine learning for hard disk fault prediction.
Because the volume of S.M.A.R.T. log data published by Backblaze is large, and this case only aims to quickly demonstrate the process of building a hard disk failure prediction model with machine learning, only the data published by Backblaze for 2020 is used. The relevant data has been prepared and placed in OBS.
Note: The code to download the data in this step needs to run on Huawei Cloud ModelArts Codelab
import os
import moxing as mox

if not os.path.exists('./dataset_2020.zip'):
    mox.file.copy('obs://modelarts-labs-bj4/course/ai_in_action/2021/machine_learning/hard_drive_disk_fail_prediction/dataset_2020.zip', './dataset_2020.zip')
    os.system('unzip dataset_2020.zip')

if not os.path.exists('./dataset_2020'):
    raise Exception('Error! Data does not exist!')

!ls -lh ./dataset_2020

INFO:root:Using MoXing-v1.17.3
INFO:root:Using OBS-Python-SDK-3.20.7
total 102M
-rw-r--r-- 1 ma-user ma-group  51M Mar 21 11:56 2020-12-08.csv
-rw-r--r-- 1 ma-user ma-group  51M Mar 21 11:56 2020-12-09.csv
-rw-r--r-- 1 ma-user ma-group 1.2M Mar 21 11:55 dataset_2020.csv
-rw-r--r-- 1 ma-user ma-group 3.5K Mar 22 15:59 prepare_data.py
Data interpretation:
2020-12-08.csv: S.M.A.R.T. log data for 2020-12-08, extracted from the 2020 Q4 dataset published by Backblaze
2020-12-09.csv: S.M.A.R.T. log data for 2020-12-09, extracted from the 2020 Q4 dataset published by Backblaze
dataset_2020.csv: the processed S.M.A.R.T. log data for all of 2020; section 2.6 Class Balance Analysis below explains how this data was obtained
prepare_data.py: a script that downloads the S.M.A.R.T. log data for all of 2020 and processes it into dataset_2020.csv. Running the script requires 20 GB of local storage
2. Data analysis
Before building any model with machine learning, you need to analyze the dataset to understand its size, attribute names, attribute values, statistical indicators, and null values. Only by knowing the data well can you use it well.
2.1 Reading a CSV File
Pandas is a commonly used Python data analysis module; we use it to load the CSV files in the dataset. Taking 2020-12-08.csv as an example, we first load this file and analyze its S.M.A.R.T. log data.
import pandas as pd
df_data = pd.read_csv("./dataset_2020/2020-12-08.csv")
type(df_data)
pandas.core.frame.DataFrame
2.2 Viewing the Size of a Single CSV File
print('Size of CSV file data, rows: %d, columns: %d' % (df_data.shape[0], df_data.shape[1]))

Size of CSV file data, rows: 162008, columns: 149
2.3 Viewing the First Five Rows
To view the first 5 rows of the table, you can call the head() function of the DataFrame object
df_data.head()
5 rows × 149 columns
The first 5 rows of the table are shown above. The header row contains the attribute names, and the rows below contain the attribute values. Backblaze provides an explanation of what each attribute means.
2.4 Viewing Statistical Indicators of the Data
After viewing the first five rows of the table, we call the describe() function of the DataFrame object to calculate statistical metrics for the table data
df_data.describe()
8 rows × 146 columns
By default, the describe() function computes statistics only for numeric columns. Since the first three columns of the table, ‘date’, ‘serial_number’, and ‘model’, are strings, no statistical indicators are produced for them.
count: number of non-null values in the column
mean: mean of the column values
std: standard deviation of the column values
min: minimum of the column values
25%: 25th percentile of the column values
50%: 50th percentile (median) of the column values
75%: 75th percentile of the column values
max: maximum of the column values
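If you also want a summary of the non-numeric columns (for example, how many distinct disk models appear), pandas can describe those as well. This is an optional sketch, not part of the original workflow, assuming df_data has been loaded as above:

import pandas as pd

df_data = pd.read_csv("./dataset_2020/2020-12-08.csv")
# describe() skips string columns by default; include='object' summarizes them instead,
# reporting count, the number of unique values, the most frequent value (top) and its frequency (freq)
df_data.describe(include='object')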
2.5 Checking for Null Values in the Data
As you can see from the output above, the count indicator for some attributes is small. For example, the count for smart_2_raw is much smaller than the total number of rows in df_data. Therefore, we need to take a closer look at the null values in each column.
df_data.isnull().sum()

date                         0
serial_number                0
model                        0
capacity_bytes               0
failure                      0
smart_1_normalized         179
smart_1_raw                179
smart_2_normalized      103169
smart_2_raw             103169
smart_3_normalized        1261
smart_3_raw               1261
smart_4_normalized        1261
smart_4_raw               1261
smart_5_normalized        1221
smart_5_raw               1221
smart_7_normalized        1261
smart_7_raw               1261
smart_8_normalized      103169
smart_8_raw             103169
smart_9_normalized         179
smart_9_raw                179
smart_10_normalized       1261
smart_10_raw              1261
smart_11_normalized     161290
smart_11_raw            161290
smart_12_normalized        179
smart_12_raw               179
smart_13_normalized     161968
smart_13_raw            161968
smart_15_normalized     162008
                         ...
smart_232_normalized    160966
smart_232_raw           160966
smart_233_normalized    160926
smart_233_raw           160926
smart_234_normalized    162008
smart_234_raw           162008
smart_235_normalized    160964
smart_235_raw           160964
smart_240_normalized     38968
smart_240_raw            38968
smart_241_normalized     56030
smart_241_raw            56030
smart_242_normalized     56032
smart_242_raw            56032
smart_245_normalized    161968
smart_245_raw           161968
smart_247_normalized    162006
smart_247_raw           162006
smart_248_normalized    162006
smart_248_raw           162006
smart_250_normalized    162008
smart_250_raw           162008
smart_251_normalized    162008
smart_251_raw           162008
smart_252_normalized    162008
smart_252_raw           162008
smart_254_normalized    161725
smart_254_raw           161725
smart_255_normalized    162008
smart_255_raw           162008
Length: 149, dtype: int64
This output is not easy to read, so it is more intuitive to plot the number of null values as a graph.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df_data_null_num = df_data.isnull().sum()
x = list(range(len(df_data_null_num)))
y = df_data_null_num.values
plt.plot(x, y)
plt.show()
As you can see from the above results, some of the attributes in the table have a large number of null values.
In the field of machine learning, null values in datasets are a common phenomenon, and there are many reasons for them. For example, a user profile may have many attributes, but not all users have values for every attribute. Or some data may fail to be collected because of transmission timeouts, which also produces null values.
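One common way to deal with columns that are entirely empty is simply to drop them before feature engineering. This is an optional aside rather than a step in the original workflow; a minimal pandas sketch, assuming df_data has been loaded as above:

import pandas as pd

df_data = pd.read_csv("./dataset_2020/2020-12-08.csv")
# Drop columns in which every value is null; the remaining nulls can be filled later
df_cleaned = df_data.dropna(axis=1, how='all')
print("columns before: %d, after dropping all-null columns: %d" % (df_data.shape[1], df_cleaned.shape[1]))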
2.6 Class Balance Analysis
The task here is disk fault prediction: predicting whether a disk will be healthy or faulty at a given time. This is a fault prediction or anomaly detection problem, and such problems have a typical characteristic: there are far more normal samples than faulty samples, so the two classes differ greatly in size.
For example, if you run the following code, you can see that df_data contains more than 160,000 normal disk samples but only 8 faulty disk samples.
valid = df_data[df_data['failure'] == 0]
failed = df_data[df_data['failure'] == 1]
print("valid hdds:",len(valid))
print("failed hdds:",len(failed))
valid hdds: 162000
failed hdds: 8
Because most machine learning methods learn from statistical patterns in the data, training directly on such imbalanced data would bias the model toward the majority class, and the minority class would be “swamped” and contribute little during learning. Therefore, we need to balance the different classes of data.
To obtain more failure samples, we can look at Backblaze's S.M.A.R.T. log data for all of 2020, select all the failure samples in it, and randomly select the same number of healthy samples. This can be achieved with the following code.
This code has been commented out and requires 20 GB of local storage to run. You do not need to run it, because dataset_2020.zip was downloaded at the beginning of this case and the zip package already contains dataset_2020.csv, which is the file that running this code would produce.
# if not os.path.exists('./dataset_2020/dataset_2020.csv'):
#     os.system('python ./dataset_2020/prepare_data.py')

import gc

del df_data   # delete the df_data object
gc.collect()  # the code below will load new log data into df_data; JupyterLab does not reclaim memory automatically, so reclaim it manually here to avoid the risk of memory overflow
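The prepare_data.py script itself is not reproduced here, but conceptually its class-balancing step could look roughly like the sketch below. This is an assumption about the approach (keep every failure row, randomly sample an equal number of healthy rows), not the actual script:

import pandas as pd

def balance_classes(df, seed=0):
    # Keep every failure sample and an equal number of randomly chosen healthy samples
    failed = df[df['failure'] == 1]
    healthy = df[df['failure'] == 0].sample(n=len(failed), random_state=seed)
    # Shuffle the combined result so healthy and failed rows are interleaved
    return pd.concat([failed, healthy]).sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical usage: df_full would be the concatenation of all 2020 daily CSV files
# df_balanced = balance_classes(df_full)
# df_balanced.to_csv('./dataset_2020/dataset_2020.csv', index=False)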
2.7 Loading the Class-Balanced Dataset
dataset_2020.csv contains the class-balanced S.M.A.R.T. disk log data. Now let's load this file and confirm the class balance.
df_data = pd.read_csv("./dataset_2020/dataset_2020.csv")
valid = df_data[df_data['failure'] == 0]
failed = df_data[df_data['failure'] == 1]
print("valid hdds:", len(valid))
print("failed hdds:", len(failed))
valid hdds: 1497
failed hdds: 1497
As you can see, there are 1,497 normal samples and 1,497 fault samples.
3. Feature engineering
With a usable training set in place, it's time to do feature engineering, which in plain terms means choosing which attributes in the table to use to build the machine learning model. The quality of hand-crafted features largely determines the quality of the resulting model, so machine learning practitioners spend a great deal of effort on feature design, which is time-consuming, labor-intensive, and requires expert experience.
3.1 Studies on SMART Attributes and Disk Faults
(1) BackBlaze analyzed the correlation between HDD faults and SMART attributes, and found that SMART 5, 187, 188, 197 and 198 had the highest correlation with HDD faults. These SMART attributes were also related to scan errors, reallocation counts and trial counts [1].
(2) El-Shimi et al. found that in addition to the above five features, SMART 9, 193, 194, 241 and 242 had the maximum weight [2].
(3) Pitakrat et al. evaluated 21 machine learning algorithms for predicting hard disk failures, and found that among the 21 machine learning algorithms tested, the random forest algorithm had the largest area under the ROC curve, while KNN classifier had the highest F1 value [3].
(4) Hughes et al. also studied machine learning methods for predicting disk faults. They analyzed the performance of SVM and Naive Bayes, and SVM achieved the highest performance, with a detection rate of 50.6% and a false positive rate of 0% [4].
[1] Klein, Andy. “What SMART Hard Disk Errors Actually Tell Us.” Backblaze Blog: Cloud Storage & Cloud Backup, 6 Oct. 2016, www.backblaze.com/blog/what-s…
[2] El-Shimi, Ahmed. Vault: Linux Storage and File Systems Conference, 22 Mar. 2017, Cambridge.
[3] Pitakrat, Teerat, André van Hoorn, and Lars Grunske. “A Comparison of Machine Learning Algorithms for Proactive Hard Disk Drive Failure Detection.” Proceedings of the 4th International ACM SIGSOFT Symposium on Architecting Critical Systems. ACM, 2013.
[4] Hughes, Gordon F., et al. “Improved Disk-Drive Failure Warnings.” IEEE Transactions on Reliability 51.3 (2002): 350-357.
The above are some prior research results. This case adopts the random forest model, so based on finding (2) above we select SMART 5, 9, 187, 188, 193, 194, 197, 198, 241, and 242 as features. Their meanings are as follows:
SMART 5: reallocated sector count
SMART 9: total power-on time
SMART 187: uncorrectable errors
SMART 188: command timeout count
SMART 193: head load/unload count
SMART 194: temperature
SMART 197: count of sectors pending reallocation
SMART 198: uncorrectable errors reported to the operating system that the hardware ECC could not fix
SMART 241: total writes in logical block addressing (LBA) mode
SMART 242: total reads in logical block addressing (LBA) mode
In addition, different disk models from different vendors may record SMART log data to different standards. Therefore, it is best to train a model on data from a single disk model and use it to predict failures only for that model. If you need to predict failures for several different disk models, you may need to train several models, as sketched below.
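As an illustration (not part of the original case), per-model training might look roughly like the following sketch; the min_samples threshold and the use of fillna(0) are illustrative assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_per_model(df, feature_cols, min_samples=500):
    # Train one random forest per disk model that has enough rows; returns {model name: classifier}
    classifiers = {}
    for model_name, df_model in df.groupby('model'):
        if len(df_model) < min_samples:  # skip models with too little data to learn from
            continue
        X = df_model[feature_cols].fillna(0)
        y = df_model['failure']
        classifiers[model_name] = RandomForestClassifier(random_state=0).fit(X, y)
    return classifiers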
3.2 Hard disk Model Selection
Execute the following code to see how much data each model of hard drive has
df_data.model.value_counts()
ST12000NM0007 664
ST4000DM000 491
ST8000NM0055 320
ST12000NM0008 293
TOSHIBA MG07ACA14TA 212
ST8000DM002 195
HGST HMS5C4040BLE640 193
HGST HUH721212ALN604 153
TOSHIBA MQ01ABF050 99
ST12000NM001G 53
HGST HMS5C4040ALE640 50
ST500LM012 HN 40
TOSHIBA MQ01ABF050M 35
HGST HUH721212ALE600 34
ST10000NM0086 29
ST14000NM001G 23
HGST HUH721212ALE604 21
ST500LM030 15
HGST HUH728080ALE600 14
Seagate BarraCuda SSD ZA250CM10002 12
WDC WD5000LPVX 11
WDC WUH721414ALE6L4 10
ST6000DX000 9
TOSHIBA MD04ABA400V 3
ST8000DM004 2
ST18000NM000J 2
Seagate SSD 2
ST4000DM005 2
ST8000DM005 1
ST16000NM001G 1
DELLBOSS VD 1
TOSHIBA HDWF180 1
HGST HDS5C4040ALE630 1
HGST HUS726040ALE610 1
WDC WD5000LPCX 1
Name: model, dtype: int64
As you can see, the disk model ST12000NM0007 has the most data, so we select the data for this model.
df_data_model = df_data[df_data['model'] == 'ST12000NM0007']
3.3 Feature Selection
Select the 10 attributes mentioned above as features
features_specified = []
features = [5, 9, 187, 188, 193, 194, 197, 198, 241, 242]
for feature in features:
    features_specified += ["smart_{0}_raw".format(feature)]
X_data = df_data_model[features_specified]
Y_data = df_data_model['failure']
X_data.isnull().sum()

smart_5_raw      1
smart_9_raw      1
smart_187_raw    1
smart_188_raw    1
smart_193_raw    1
smart_194_raw    1
smart_197_raw    1
smart_198_raw    1
smart_241_raw    1
smart_242_raw    1
dtype: int64
Null values exist, so we fill them first.
X_data = X_data.fillna(0)
print("valid hdds:", len(Y_data) - np.sum(Y_data.values))
print("failed hdds:", np.sum(Y_data.values))
valid hdds: 325
failed hdds: 339
3.4 Dividing the Training Set and Test Set
sklearn's train_test_split function can be used to split the data into a training set and a test set. test_size indicates the proportion of the test set, usually 0.3, 0.2, or 0.1.
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=0)
4. Start training
4.1 Model Building
Once the training set and test set are ready, you can start building the model by simply calling the RandomForestClassifier in the machine learning framework scikit-learn.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
The random forest algorithm has many hyperparameters, and building the model with different parameter values produces different training results. Beginners can simply use the default values provided by the library; once you have some understanding of how the random forest algorithm works, you can try modifying the model parameters to tune the training results, as in the sketch below.
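For example, a sketch of explicitly setting a few common hyperparameters; the values below are illustrative, not tuned for this dataset:

from sklearn.ensemble import RandomForestClassifier

# n_estimators: number of trees; max_depth: maximum depth of each tree;
# class_weight='balanced' reweights the classes, which can help when the data is imbalanced
rfc_tuned = RandomForestClassifier(n_estimators=100, max_depth=10, class_weight='balanced', random_state=0)
# rfc_tuned.fit(X_train, Y_train) could then be used in place of rfc below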
4.2 Data fitting
Model training, that is, fitting the training data, is also very simple to implement: just call the fit function to start training.
rfc.fit(X_train, Y_train)

/home/ma-user/anaconda3/envs/XGBoost-Sklearn/lib/python3.6/site-packages/sklearn/ensemble/forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0, warm_start=False)
5. Start prediction
Call the predict function to begin the prediction
Y_pred = rfc.predict(X_test)
5.1 Computing Prediction Accuracy Metrics
In machine learning, four performance metrics are commonly used for classification problems: accuracy, precision, recall, and F1-score. The closer each metric is to 1, the better the result. The sklearn library provides functions for all four metrics, which can be called directly.
For the theoretical explanation of the four indicators, please refer to this article
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print("Model used is: Random Forest classifier")
acc = accuracy_score(Y_test, Y_pred)
print("The accuracy is {}".format(acc))
prec = precision_score(Y_test, Y_pred)
print("The precision is {}".format(prec))
rec = recall_score(Y_test, Y_pred)
print("The recall is {}".format(rec))
f1 = f1_score(Y_test, Y_pred)
print("The F1-Score is {}".format(f1))
Model used is: Random Forest classifier
The accuracy is 0.849624060150376
The precision is 0.9122807017543859
The recall is 0.7761194029850746
The F1-Score is 0.8387096774193548
Each time we train the random forest model, the test metrics will differ somewhat. This is because the training process of the random forest algorithm involves some randomness, which is normal. However, the predictions of a given trained model on a given sample are deterministic. If you want reproducible metrics, you can fix the random seed, as sketched below.
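A minimal sketch of fixing the randomness (an optional addition, not required by the case): pass random_state when constructing the classifier so that repeated training runs on the same data give the same metrics.

from sklearn.ensemble import RandomForestClassifier

# With a fixed random_state, repeated runs of fit() on the same data produce the same model and metrics
rfc_reproducible = RandomForestClassifier(random_state=42)
rfc_reproducible.fit(X_train, Y_train)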
5.2 Model saving, loading and re-prediction
Save the model
import pickle
with open('hdd_failure_pred.pkl', 'wb') as fw:
    pickle.dump(rfc, fw)
Load the model
with open('hdd_failure_pred.pkl', 'rb') as fr:
    new_rfc = pickle.load(fr)
Predict again with the loaded model
new_Y_pred = new_rfc.predict(X_test)
new_prec = precision_score(Y_test, new_Y_pred)
print("The precision is {}".format(new_prec))

The precision is 0.9122807017543859
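As an extra illustration (not in the original case), a loaded model can also score a single new S.M.A.R.T. snapshot. The attribute values below are made up for demonstration, and the sketch assumes the features_specified list and new_rfc from the previous steps:

import pandas as pd

# One hypothetical disk's raw SMART readings, reordered to match the training feature order
sample = pd.DataFrame([{
    'smart_5_raw': 0, 'smart_9_raw': 32000, 'smart_187_raw': 0, 'smart_188_raw': 0,
    'smart_193_raw': 1200, 'smart_194_raw': 31, 'smart_197_raw': 0, 'smart_198_raw': 0,
    'smart_241_raw': 5.0e10, 'smart_242_raw': 9.0e10,
}])[features_specified]
print("predicted failure label:", new_rfc.predict(sample)[0])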
5.3 Viewing the Confusion Matrix
To analyze the performance of a classification model, you can also use the confusion matrix. The horizontal axis of the confusion matrix represents the predicted classes, the vertical axis represents the true labels, and the value in each cell is the number of test samples with the corresponding predicted and true classes. The section 5.1 metrics can also be read off this matrix, as sketched after the plotting code below.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
LABELS = ['Healthy', 'Failed']
conf_matrix = confusion_matrix(Y_test, Y_pred)
plt.figure(figsize =(6, 6))
sns.heatmap(conf_matrix, xticklabels = LABELS,
yticklabels = LABELS, annot = True, fmt ="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
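A small optional sketch of recovering the section 5.1 metrics from this confusion matrix, assuming conf_matrix was computed as above with the label order [0, 1]:

# With labels ordered [healthy(0), failed(1)], sklearn's binary confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = conf_matrix.ravel()
print("precision from confusion matrix:", tp / (tp + fp))
print("recall from confusion matrix:", tp / (tp + fn))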
6. Ideas for improving the model
The above demonstrates the process of building a hard disk fault prediction model with the random forest algorithm. The model's accuracy is not very high; here are several ideas for improving it:
(1) This case only uses Backblaze's data for 2020; you can try using more training data.
(2) This case only uses 10 SMART attributes as features; you can try other ways of constructing features.
(3) This case uses the random forest algorithm to train the model; you can try other machine learning algorithms (a sketch of swapping in a different classifier follows this list).
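For idea (3), a minimal sketch of swapping in a different scikit-learn classifier with the same fit/predict interface; GradientBoostingClassifier is just one possible choice, and the sketch assumes the X_train/X_test/Y_train/Y_test splits from section 3.4:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Same workflow as the random forest; only the estimator changes
gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, Y_train)
Y_pred_gbc = gbc.predict(X_test)
print("Gradient boosting F1-Score: {}".format(f1_score(Y_test, Y_pred_gbc)))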
Click Huawei Cloud ModelArts Codelab to run the code for this case directly.
Click Follow to be the first to learn about Huawei Cloud's latest technologies~