Monday, February 19, 2007

SMART not so smart in predicting disk drive failure

Continuing from the last blog post: the Google report [PDF] also shares an analysis based on disk self-monitoring (SMART) data and identifies the SMART parameters most closely related to failure.
  1. Drives with scan errors are 10x more likely to fail than drives with no scan errors. 30% of drives fail within 8 months of their first scan error. In newer drives, the failure probability is higher in the first month after the first scan error and then plateaus. In older drives, the failure probability keeps rising with time.

  2. Drives with one or more reallocations fail 3–6x more often than those with none. 15% of drives fail within 8 months of their first reallocation.

  3. There is no definite correlation between failure rates and seek errors.

  4. CRC errors are more indicative of problems with cables and connectors than of drive failures.

  5. There is no significant correlation between failures and high power cycle counts for drives less than two years old. For drives 3 years and older, higher power cycle counts can increase the absolute failure rate by over 2%.
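
Figures like "10x more likely to fail" are relative failure rates: the failure rate of drives that show a signal divided by the failure rate of drives that don't. Here is a minimal Python sketch of that arithmetic. The function names and the drive counts are my own, invented purely for illustration; they are not taken from the report.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of a drive group that failed during the observation window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def relative_risk(failed_with: int, total_with: int,
                  failed_without: int, total_without: int) -> float:
    """Failure rate of drives showing a signal divided by the rate of drives without it."""
    baseline = failure_rate(failed_without, total_without)
    if baseline == 0:
        raise ValueError("baseline group has no failures; ratio is undefined")
    return failure_rate(failed_with, total_with) / baseline


# Hypothetical counts, invented for illustration only (not from the report):
# 200 of 1,000 drives with scan errors failed; 1,800 of 90,000 without did.
rr = relative_risk(failed_with=200, total_with=1_000,
                   failed_without=1_800, total_without=90_000)
print(f"relative risk: {rr:.1f}x")
```

Note that a large relative risk says nothing about how many failed drives showed the signal in the first place, which is why a strong signal can still make a weak predictor.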


Predictive models based on scan errors, reallocation count, offline reallocation count and probational count couldn’t predict more than half of the failed drives. In the report's own words:
“We conclude that it is unlikely that SMART data alone can be effectively used to build models that predict failures of individual drives. SMART parameters still appear to be useful in reasoning about the aggregate reliability of large disk populations, which is still very important for logistics and supply-chain planning.”
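
To make "couldn't predict more than half of the failed drives" concrete, here is a rough Python sketch of the simplest rule I can think of: flag a drive if any of the four signals is nonzero, then measure what share of the failed drives were flagged. This rule is my own assumption, not the model used in the report, and the `DriveRecord` type and the sample records are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DriveRecord:
    scan_errors: int
    reallocations: int
    offline_reallocations: int
    probational_count: int
    failed: bool


def flagged(d: DriveRecord) -> bool:
    """Assumed rule: flag a drive if any of the four signals is nonzero."""
    return any((d.scan_errors, d.reallocations,
                d.offline_reallocations, d.probational_count))


def recall(drives: list) -> float:
    """Share of failed drives that the rule flagged before they failed."""
    failed = [d for d in drives if d.failed]
    if not failed:
        raise ValueError("no failed drives in the sample")
    return sum(flagged(d) for d in failed) / len(failed)


# Made-up records for illustration only (not data from the report).
sample = [
    DriveRecord(3, 0, 0, 0, failed=True),
    DriveRecord(0, 12, 4, 0, failed=True),
    DriveRecord(0, 0, 0, 0, failed=True),   # failed with no warning at all
    DriveRecord(0, 0, 0, 0, failed=True),
    DriveRecord(0, 0, 0, 0, failed=True),
    DriveRecord(0, 1, 0, 0, failed=False),  # flagged but still healthy
    DriveRecord(0, 0, 0, 0, failed=False),
]
print(f"share of failed drives flagged: {recall(sample):.0%}")
```

Any drive that fails with all four counters at zero is invisible to a rule like this, no matter how it is tuned.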

Glossary of Terms

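Scan Errors – Drives typically scan the disk surface in the background and report errors as they discover them. Large scan error counts can be indicative of surface defects, and are therefore believed to be indicative of lower reliability.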

Reallocation Count – When the drive logic believes that a sector is damaged it can remap the faulty sector number to a new physical sector drawn from a pool of spares. Reallocation count reflects the number of times this has happened, and is seen as an indication of drive surface wear.

Offline Reallocation – Offline reallocations are defined as a subset of the reallocation counts in which only reallocated sectors found during background scrubbing are counted. In other words, they should exclude sectors that are reallocated as a result of errors found during actual I/O operations.

Probational Count – Disk drives put suspect bad sectors “on probation” until they either fail permanently and are reallocated or continue to work without problems. Probational counts can be seen as a softer error indication.

Seek Errors – Seek errors occur when a disk drive fails to properly track a sector and needs to wait for another revolution to read from or write to that sector.

CRC Errors – CRC errors are detected during data transmission between the physical media and the interface.
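
For anyone who wants to watch these counters on their own machine, below is a small Python sketch that pulls raw values out of the attribute table printed by `smartctl -A`. The mapping from glossary terms to attribute names is my own assumption: the report does not name smartctl fields, attribute names vary by drive vendor, and I know of no standard attribute that corresponds to "scan errors", so none is listed. The sample table is illustrative text, not output from a real drive.

```python
# Assumed mapping from smartctl attribute names to the glossary terms above.
# The report does not define this mapping; treat it as a guess.
ASSUMED_ATTRIBUTES = {
    "Reallocated_Sector_Ct": "reallocation count",
    "Current_Pending_Sector": "probational count",
    "Seek_Error_Rate": "seek errors",
    "UDMA_CRC_Error_Count": "CRC errors",
    "Power_Cycle_Count": "power cycles",
}


def parse_smart_attributes(text: str) -> dict:
    """Return {attribute_name: raw_value} from a `smartctl -A` attribute table."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this also skips the "ID# ATTRIBUTE_NAME ..." header row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        # Keep only plain integer raw values; vendor-specific formats are skipped.
        if raw.isdigit():
            values[name] = int(raw)
    return values


# Illustrative sample in the usual smartctl table layout (not from a real drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       412
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

attrs = parse_smart_attributes(SAMPLE)
for name, meaning in ASSUMED_ATTRIBUTES.items():
    if name in attrs:
        print(f"{meaning} ({name}): {attrs[name]}")
```

In practice you would feed the function the text captured from running `smartctl -A` against your own device instead of `SAMPLE`.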

3 comments:

  1. Nice article. I once tried smartmontools too and wrote a small blog entry about it. I revived it now, and linked your site for people to find out more. Have a look at jstorage.com. Looks like we have some common interests in storage tech; however, we are still beginners :)

  2. Mikko,

    Thanks for the info on your article. Here is the correct link to your article: SMART to predict disk failure. For some reason, the link in your post isn't working.

    Well, I am always a beginner. ;-)

    Anil

  3. Do you know which smartmontools field is what the Google report refers to as a "scan error"?

    When I run smartctl on my pc, I see the fields for reallocation counts (Reallocated_Sector_Ct and/or Reallocated_Event_Count). Maybe I'm not getting it, but which fields equal the "big 4" that the Google report refers to? What field should I be watching for "scan errors?"

    I'd sure appreciate hearing your thoughts.
