Wednesday, March 08, 2006

Data loss risk during RAID rebuild

Have you ever lost data during a RAID rebuild? This week it happened to us again: my third time in just over a year that a second disk hit uncorrectable errors or failed outright during RAID reconstruction. A couple of months ago, I raised the same concern in my post Happy New Year & Food for Your Brain.
Will the risk of data loss during RAID rebuilds become a major concern as disk capacities increase?
Are there any studies that have looked at the probability of a second disk hitting uncorrectable errors during RAID reconstruction? If you know of any studies or reliability models, send me a message through the comments or by email.
How do you find my email address? View my complete profile > My Web Page > Contact Us.
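As a rough starting point (not a validated reliability model), the chance of hitting at least one unrecoverable read error (URE) while reading the surviving disks during a rebuild can be estimated by assuming errors occur independently at a fixed rate per bit read. The rate used here, one URE per 10^14 bits, is an assumed illustrative figure of the kind printed on drive datasheets, and the five-disk RAID5 of 500GB disks is hypothetical; substitute your own drives' numbers.

```python
import math

def p_ure_during_rebuild(bytes_read: float, bits_per_ure: float = 1e14) -> float:
    """Probability of at least one unrecoverable read error while reading
    bytes_read bytes, assuming independent errors at an (assumed) rate of
    one per bits_per_ure bits read."""
    bits = bytes_read * 8
    p_bit = 1.0 / bits_per_ure
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-p_bit))

# Hypothetical 5-disk RAID5 of 500 GB (decimal) disks: rebuilding one failed
# disk means reading the four surviving disks end to end.
surviving_bytes = 4 * 500e9
print(f"{p_ure_during_rebuild(surviving_bytes):.3f}")  # prints 0.148
```

Under these assumptions, roughly one rebuild in seven would hit at least one URE, and the figure grows with every increase in disk capacity.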
As disk capacities increase, rebuilding a RAID group takes longer. During reconstruction, the stored data has no protection against total loss other than the last good backup. At a typical RAID5 rebuild rate of 10 to 15 GB/hr, reconstructing a RAID group built from high-capacity disks, such as 500GB disks, can take even longer than a 24-hour backup rotation: roughly 33 to 50 hours for a single 500GB disk.
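The arithmetic behind the 24-hour comparison can be sketched as follows, assuming the rebuild rate stays constant for the whole reconstruction of one 500GB replacement disk. The function name `rebuild_hours` is hypothetical.

```python
def rebuild_hours(disk_gb: float, rate_gb_per_hr: float) -> float:
    """Time to reconstruct one replacement disk, assuming a constant rebuild rate."""
    return disk_gb / rate_gb_per_hr

BACKUP_WINDOW_HR = 24  # daily backup rotation

for rate in (10, 15):  # typical RAID5 rebuild rates, GB/hr
    hours = rebuild_hours(500, rate)
    print(f"{rate} GB/hr: {hours:.1f} h, exceeds backup window: {hours > BACKUP_WINDOW_HR}")
```

Both cases print `True`: at 10 GB/hr the rebuild takes 50.0 hours and at 15 GB/hr about 33.3 hours, so the exposure window spans more than one backup cycle.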

How aware of, and how vulnerable to, data loss during RAID rebuilds are organizations? What are they doing to protect themselves against a second disk failure during RAID reconstruction?

I have previously considered several alternatives, but I am still looking for an elegant way to mitigate this risk:
  • RAID10 instead of RAID5 as the default RAID group.
  • Dual parity RAID techniques.
  • Initiating a snapshot and backup as soon as the first disk failure is detected.
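To compare the first two options, here is a simplified sketch for a hypothetical six-disk group of 500GB disks. It assumes RAID10 rebuilds a failed disk from its mirror partner only, RAID5 must read every surviving disk with no redundancy left, and RAID6 still has one parity after a single failure, so its unprotected read volume is treated as zero. Real controllers differ, so treat the output as an illustration rather than a measurement; `layout_summary` is a hypothetical helper.

```python
def layout_summary(level: str, n: int, disk_gb: float) -> dict:
    """Usable capacity, worst-case disk failures survived, and GB read with no
    redundancy left while rebuilding one failed disk (simplified model)."""
    if level == "RAID5":
        # Every surviving disk is read; any read error is unrecoverable.
        return {"usable_gb": (n - 1) * disk_gb, "survives": 1,
                "exposed_gb": (n - 1) * disk_gb}
    if level == "RAID6":
        # Second parity can still correct a lone error (treated as 0 exposure here).
        return {"usable_gb": (n - 2) * disk_gb, "survives": 2, "exposed_gb": 0.0}
    if level == "RAID10":
        if n % 2:
            raise ValueError("RAID10 needs an even number of disks")
        # Only the failed disk's mirror partner is read.
        return {"usable_gb": (n // 2) * disk_gb, "survives": 1, "exposed_gb": disk_gb}
    raise ValueError(f"unsupported level: {level}")

for level in ("RAID5", "RAID6", "RAID10"):
    print(level, layout_summary(level, 6, 500))
```

In this model RAID10 reads far less unprotected data during a rebuild but gives up half the raw capacity, while RAID6 keeps more capacity and still survives a second failure.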

3 comments:

  1. For us, this risk is mitigated by mirroring the data to a second server that will soon be located offsite. This particular issue wasn't a consideration in the past; perhaps it will be the primary motivator the next time the accounting department objects to a redundancy purchase. :-)

  2. Drew,

    When are you buying the extra WAN bandwidth for synchronous offsite replication? IMO, no other kind of offsite replication is an alternative for addressing RAID 5 rebuild risks.

    Anil

  3. The best way out could be a RAID-6 configuration (HP's Advanced Data Guarding). It can handle multiple drive failures (generally two) per array.

    -SAT
