Skip to content

Latest commit

 

History

History

diskdata

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 

Introduction & Motivation

In the large-scale data centers, the number of hard disk drive(HDD) and solid-state drive (SSD) has reached millions. According to statistics, disk failures account for the largest proportion of all failures. The frequent occurrence of disk failures will affect the stability and reliability of the server and even the entire IT infrastructure, which have a negative impact on business SLAs (Service-Level Agreement). Thus, prediction of disk failures has been an important topic for IT or big data company.

However, the topic has several challenging data characteristics, such as high data noise, extremely imbalanced classification, and time-varying features. And, since stability overwhelms everything, the effectiveness and stability of prediction model is very crucial. Therefore, our team will publish the dataset of over 200 thousands hard disk drives in Alibaba Cloud’s data centers. We hope that more researchers can join us and study how to solve these problems together.

Description of datasets

The dataset has two files:

  1. Table 1: smartlog_data_*.csv is the daily SMART data of disks that has 514 columns. The columns are defined as follows:
Field Type Description
serial_number string disk serial number code
manufacturer string disk manufacturer code
model string disk model code
smart_n_normalized integer normalized SMART data of SMART ID=n
smart_nraw integer raw SMART data of SMART ID=n
dt string sampling time
  1. Table 2: fault_tag_data.csv is data of fault disk labels that has 5 columns. The columns are defined as follows:
Field Type Description
manufacturer string disk manufacturer code
model string disk model code
serial_number string disk serial number code
fault_time string fault time of disk
tag integer ranging in [0,6], IDs of fault subtype

The dataset ranges from 2017-07-31 to 2018-12-31.And, more details about S.M.A.R.T information can be seen in https://en.wikipedia.org/wiki/S.M.A.R.T.

Researchers can download data from https://tianchi.aliyun.com/dataset/dataDetail?dataId=70251 .

Note that we have utilized several common strategies to remove the sensitive information from our published dataset.

Evaluation Metrics

According to our purpose of failure prediction that predicting whether each disk will fail or not within the next 30 days, we redefine the precision, recall, and F-score metrics. The complete definition of metrics is as follows:

Precision for P-window. We define the precision as the fraction of actually failed disks being predicted overall (correctly and falsely) predicted failed disks. As our objective is to evaluate whether a failed disk being predicted is an actual failure within 30 days, we define the P-window as a fixed-size sliding window starting from the first time that a disk is predicted as a failure, and set the length of the P-window as 30 days. Let T denote the start date and T + k - 1 denote the end date of the testing period (k as 30 days in our competition). Note that the P-window may slide out of the testing period. Figure 1 illustrates how we count true positive and false-positive results. If the actual failure happens within the P-window (e.g., the 1st and 4th rows), we regard the failed disk as a correctly predicted one; otherwise (e.g., the 2nd and 3rd rows), we regard the disk as a falsely predicted one.

Recall for R-window. We next define the recall as the fraction of actual failed disks being predicted overall actual failed disks. We define the R-window as a fixed-size window (not sliding window) from the starting date to the end date of the testing period with the length of 30 days in our case (i.e., from T to T + k - 1, where k is 30 days). Figure 2 shows how we count false positive, false negative, and true positive results. If a failed disk being predicted is not failed within the R-window (the 1st and 2nd rows), we regard the disk as a falsely predicted one; otherwise, we regard the failed disk as a correctly predicted one (the 4th and 5th rows). If an actual failed disk within the R-window is not predicted, we regard the failed disk as a missed one (i.e., false negative in the 3rd row).

F1-score. We follow the classical definition of F1-score as

Future work

  1. Publish the SMART data of solid-state drive in Alibaba Cloud’s data centers;
  2. Introduce our team work about disk failure predictions;
  3. Share and List the paper work based on our published data;

Contribution

  1. Large-Scale Disk Failure Prediction(book). Cheng He, Mengling Feng, Patrick P. C. Lee, Pinghui Wang, Shujie Han, Yi Liu

    PAKDD 2020 Competition and Workshop, AI Ops 2020, February 7 – May 15, 2020, Revised Selected Papers

    Editors (view affiliations)

    https://link.springer.com/book/10.1007/978-981-15-7749-9

    @article{he2020large,
      title={Large-Scale Disk Failure Prediction},
      author={He, Cheng and Feng, Mengling and Lee, Patrick PC and Wang, Pinghui and Han, Shujie and Liu, Yi},
      year={2020},
      publisher={Springer}
    }