Wednesday, March 22, 2006

Double Disk Failure - Follow up

Yesterday, Dave wrote about double disk failure in his blog (see Expect Double Disk Failures with ATA Drives). I agree with Dave's comment: "As drives grow, I suspect that it (double protecting RAID) will become a requirement even for Fibre Channel drives." In my opinion, once the rebuild time goes beyond 12 hours, it will become a requirement for all drives.

A couple of weeks ago, I also wrote about a double disk failure event I encountered (see Data loss risk during RAID rebuild), and about my concerns over the increased exposure to data loss as RAID rebuild times get longer with higher disk capacities in RAID groups (see Happy New Year & Food for Your Brain). I prefer to have a Plan B if the RAID5 rebuild time is going to exceed 6 to 8 hours.
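To get a feel for when a rebuild crosses those 6 to 8 hour and 12 hour thresholds, here is a minimal sketch that estimates rebuild time as drive capacity divided by sustained rebuild rate. The capacities and rebuild rates below are assumed figures for illustration only, not measurements from any particular array, and the model ignores controller overhead and competing host I/O.

```python
def rebuild_hours(capacity_gb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite one replacement drive at a sustained rate.

    Uses decimal units (1 GB = 1000 MB), as drive vendors do.
    """
    if rebuild_mb_per_s <= 0:
        raise ValueError("rebuild rate must be positive")
    seconds = capacity_gb * 1000 / rebuild_mb_per_s
    return seconds / 3600


# Hypothetical drive sizes and rebuild rates, chosen only for illustration.
for capacity_gb in (146, 300, 500):
    for rate in (5, 10, 20):
        hours = rebuild_hours(capacity_gb, rate)
        flag = "  <- beyond 12 hours" if hours > 12 else ""
        print(f"{capacity_gb:>4} GB at {rate:>2} MB/s: {hours:5.1f} h{flag}")
```

The point of the sketch is only that rebuild time grows linearly with capacity when the rebuild rate stays flat, which is why larger drives push the exposure window past the thresholds above.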

In the last couple of weeks, my readers and colleagues have referred me to several good articles on RAID protection. Here are two that I found really interesting and useful.

Daniel Feenberg, Things we wish we'd known about NAS devices and Linux Raid
… the majority of single drive failures are followed by a second drive failure before redundancy is established.

The fault lies with the Linux md driver, which stops rebuilding parity after a drive failure at the first point it encounters a uncorrectable read error on the remaining "good" drives.

A single unreadable sector isn't unusual among the tens of millions of sectors on a modern drive. If the sector has never been written to, there is no occasion for the drive electronics or the OS to even know it is bad. If the OS tried to write to it, the drive would automatically remap the sector and no damage would be done - not even a log entry. But that one bad sector will render the entire array unrecoverable no matter where on the disk it is if one other drive has already been failed.

We don't know what the reconstruction policy is for other raid controllers, drivers or NAS devices. None of the boxes we bought acknowledged this "gotcha" but none promised to avoid it either.
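The exposure Feenberg describes can be put into rough numbers. The sketch below estimates the chance that a rebuild hits at least one unreadable sector while reading every surviving drive end to end. It assumes read errors are independent and occur at a fixed rate per bit read; the rates used (one error per 10^14 and per 10^15 bits) are typical of vendor spec sheets and are my assumption here, not figures from his article.

```python
import math


def p_rebuild_hits_ure(surviving_drives: int, capacity_gb: float,
                       bits_per_error: float) -> float:
    """Probability of at least one unrecoverable read error during a rebuild.

    Assumes every bit on every surviving drive must be read and that errors
    are independent with rate 1/bits_per_error per bit (an idealization).
    """
    bits_read = surviving_drives * capacity_gb * 1e9 * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


# Hypothetical 8-drive RAID5 group with one failed drive (7 survivors).
for capacity_gb in (250, 500):
    for bits_per_error in (1e14, 1e15):
        p = p_rebuild_hits_ure(7, capacity_gb, bits_per_error)
        print(f"{capacity_gb} GB drives, 1 error per {bits_per_error:.0e} bits: "
              f"{p:.1%} chance the rebuild reads a bad sector")
```

Under these assumptions the risk scales with the total amount of data that must be read, so both larger drives and wider RAID groups make the "second failure" during rebuild more likely.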

K.K. Rao, James L. Hafner and Richard A. Golding, Reliability for Networked Storage Nodes (pdf)
By using drive bandwidth more efficiently through the use of larger rebuild block sizes, we see significant improvements in reliability. In fact, the rebuild block size is a controllable parameter with the most significant impact on reliability.
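One simple way to see why rebuild block size matters is a seek-plus-transfer model: if each rebuild I/O pays a fixed positioning cost (because the drive is also serving host I/O in between), larger blocks amortize that cost over more data. This is my own back-of-the-envelope sketch, not the model used in the paper, and the 8 ms positioning time and 50 MB/s media rate are assumed values.

```python
def effective_mb_per_s(block_kb: float, positioning_ms: float = 8.0,
                       media_mb_per_s: float = 50.0) -> float:
    """Effective rebuild throughput when every block costs one positioning delay.

    positioning_ms (seek + rotational latency) and media_mb_per_s are
    illustrative assumptions, not measured drive characteristics.
    """
    block_mb = block_kb / 1000
    transfer_s = block_mb / media_mb_per_s
    return block_mb / (positioning_ms / 1000 + transfer_s)


# Larger rebuild blocks spend a smaller fraction of each I/O on positioning.
for block_kb in (64, 256, 1024, 4096):
    print(f"{block_kb:>5} KB blocks: {effective_mb_per_s(block_kb):5.1f} MB/s")
```

A higher effective rebuild rate shortens the rebuild window, which is the period during which a second failure or an unreadable sector causes data loss.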

4 comments:

  1. Very interesting! I had no clue the chances of double disk failure were this high. Thanks for opening my eyes!

  2. What about RAID 6-1 which allows for double drive failure with mirroring? This seems like it might be a viable solution.

  3. Allen,

    RAID6-1 may give you better protection, but only from disk failure, and at what cost? At this point we also need to take into consideration the failure of other components leading up to the disk drive. At what point does a clustered node design based on RAID5/6 become feasible?

    Anil
