Sunday, February 18, 2007

Google Findings of Disk Failure Rates and Implications

A few months ago, I learned that Google would be publishing a detailed report on disk drive failure rates in its environment. Before this report, to the best of my knowledge, very little information on this topic was available from the user's perspective, beyond limited work at Microsoft and the Internet Archive.

Earlier, I had thought of complementing this report with a perspective from storage subsystem vendors. I was surprised to learn that subsystem vendors either don’t capture disk failure data effectively and in a usable format, or are unwilling to share it. There is also very little published information on this topic from subsystem vendors.

In one case, a subsystem vendor recommended that I contact the disk drive manufacturers. Prior studies indicate that the actual drive replacement rate is 10 to 100 times higher than the failure rates published by disk drive manufacturers. Also, disk drive manufacturers most likely don’t have visibility into the deployment scenarios of failed drives, which limits how useful their data is to data centers and subsystem vendors. In the end, I decided to wait for the Google report to come out before starting this discussion.

Google Findings

I consider the Google study unique because it looked at a very large sample of 100,000 disk drives. A summary of the interesting results on age, manufacturer, read/write load and temperature is listed below:
  1. The failure rate varied from 1.7% for drives in their first year of operation to over 8.6% for drives in their third year of operation.

  2. Confirmation of the finding from prior studies that failure rates are highly correlated with drive models, manufacturers and vintages.

  3. A complex correlation between high utilization (i.e. read/write load) and higher failure rate, instead of the strong direct relationship that is widely assumed. In the paper's words: "First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR [annualized failure rate] of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones."

  4. A surprising finding that lower temperatures are associated with higher failure rates, and that failures do not increase with average temperature except at the high end of the range. The trend toward higher failures at higher temperatures is more pronounced for older drives. In the paper's words: "Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates."
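
For readers who want to compare these figures with their own environment, an annualized failure rate is commonly computed as the number of failures divided by the drive-years of operation observed. The snippet below is a minimal sketch under that assumption; the function name and the fleet numbers are hypothetical, and the Google paper's exact methodology may differ.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return AFR as a fraction: failures per drive-year of operation.

    Assumes the common definition (failures / drive-years); this is an
    illustration, not the method used in the Google study.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return failures / drive_years


# Hypothetical fleet: 1,000 drives that each ran for a full year,
# with 17 failures observed over that period.
afr = annualized_failure_rate(failures=17, drive_days=1000 * 365)
print(f"AFR: {afr:.1%}")
```

Counting drive-days rather than drives keeps the rate meaningful when drives enter or leave service partway through the observation period.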

Implications

In my opinion, the second finding reaffirms the subsystem vendors’ stance against mixing and matching drives of different models and manufacturers.

The third finding has wider implications. The first is a higher possibility that a new drive fails during a RAID rebuild, although the study did not explicitly measure very young drives in their first hours of operation. The second is that an old drive set may need to be kept intact and in operation for longer after a new drive set has been phased in. The third is a higher possibility of losing disks in long-term archive storage, where the read/write load may be too low, along with the challenge of keeping that load high enough.
