Wednesday, March 22, 2006

Double Disk Failure - Follow up

Yesterday, Dave wrote about double disk failure in his blog (see Expect Double Disk Failures with ATA Drives). I agree with Dave's comment: "As drives grow, I suspect that it (double protecting RAID) will become a requirement even for Fibre Channel drives." In my opinion, once the rebuild time goes beyond 12 hours, it will become a requirement for all drives.

A couple of weeks ago, I also wrote about a double disk failure event I encountered (see Data loss risk during RAID rebuild), and about my concerns over the increased exposure to data loss as RAID rebuild times get longer with higher disk capacities in RAID groups (see Happy New Year & Food for Your Brain). I prefer to have a Plan B if the RAID5 rebuild time is going to exceed 6 to 8 hours.

In the last couple of weeks, my readers and colleagues have referred me to several good articles on RAID protection. Here are two that I found really interesting and useful.

Daniel Feenberg, Things we wish we'd known about NAS devices and Linux Raid
… the majority of single drive failures are followed by a second drive failure before redundancy is established.

The fault lies with the Linux md driver, which stops rebuilding parity after a drive failure at the first point it encounters a uncorrectable read error on the remaining "good" drives.

A single unreadable sector isn't unusual among the tens of millions of sectors on a modern drive. If the sector has never been written to, there is no occasion for the drive electronics or the OS to even know it is bad. If the OS tried to write to it, the drive would automatically remap the sector and no damage would be done - not even a log entry. But that one bad sector will render the entire array unrecoverable no matter where on the disk it is if one other drive has already been failed.

We don't know what the reconstruction policy is for other raid controllers, drivers or NAS devices. None of the boxes we bought acknowledged this "gotcha" but none promised to avoid it either.

K.K. Rao, James L. Hafner and Richard A. Golding, Reliability for Networked Storage Nodes (pdf)
By using drive bandwidth more efficiently through the use of larger rebuild block sizes, we see significant improvements in reliability. In fact, the rebuild block size is a controllable parameter with the most significant impact on reliability.
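To put a rough number on Feenberg's point about a single unreadable sector, here is a minimal Python sketch. It estimates the chance of hitting at least one unrecoverable read error while reading every surviving disk in full during a rebuild. The error rate of one per 10^14 bits read, the four surviving 500 GB disks, and the function name are my illustrative assumptions, not figures from either article, and the model treats bit errors as independent.

```python
import math


def p_unreadable_during_rebuild(bytes_read: float, ure_per_bit: float) -> float:
    """Chance of at least one unrecoverable read error over bytes_read bytes.

    Assumes independent bit errors at a constant rate (a simplification).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# Assumed values for illustration only
URE_PER_BIT = 1e-14      # hypothetical: one unreadable bit per 1e14 bits read
SURVIVING_DISKS = 4      # hypothetical: a 5-disk RAID5 group with one disk failed
DISK_BYTES = 500e9       # 500 GB per disk

p = p_unreadable_during_rebuild(SURVIVING_DISKS * DISK_BYTES, URE_PER_BIT)
print(f"Chance of an unreadable sector during rebuild: {p:.1%}")
```

Under these assumed numbers the result is roughly 15 percent, which would be enough to stop a rebuild that gives up at the first read error.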

Monday, March 20, 2006

No right or wrong answers to EQ tests

Do you score high on EQ (Entrepreneurial Quotient) tests like the one posted by Guy Kawasaki? A high score doesn't necessarily mean that you will make a better entrepreneur. But you may have a future as a successful writer of business books.

I took Kawasaki's test but with a twist. I answered eleven questions based on what popular entrepreneurship culture preaches and eleven questions based on my actual experiences. I got three times as many wrong answers when I answered based on my experiences instead of what entrepreneurship books preach.

I used to be a typical engineering guy with no background in business. So I read a lot of books on entrepreneurship, joined associations and attended events. The only thing I can say is that the preachings of popular entrepreneurship culture will not make you successful. The key to success is flexibility and considering all available alternatives.

In my opinion, there is no right or wrong answer to EQ tests; all answers are valid alternatives until one works for you.

Tuesday, March 14, 2006

Only if this could be an expense-paid trip to Cancun!

Today, I received this message about a Symantec event in Cancun, Mexico.
Hi Anil,

Whoever said you can't mix business with pleasure must not have heard of Symantec's new disaster recovery strategies seminar in Cancun, Mexico. The seminar will held May 14th – 19th at the Playacar Palace near Cancun, Mexico and will cover various topics including Business Continuity Plan vs. Disaster Recovery Plan, Replication vs. Backup, Data Protection Strategies, and Storage Virtualization.

This could go a long way towards easing the burden of IT professionals.:) Instead of sitting behind a cubicle listening to a Webinar or being herded like cattle into a stuffy conference room, why not get out of the office, make a trip south of the border and get the best of both worlds--training from the best on strategic planning for disaster recovery and then the reward for your hard work and efforts in IT.

Designed for IT managers, IT staff, and self-employed IT professionals, the seminar will present material with a focus on concepts and solutions rather than product-specific features.

If you or the readers of your blog are interested in more information let me know, or visit


On one hand, it reminds me of the free heavenly trips to Hawaii that doctors get, paid for by pharmaceutical companies, for listening to recent developments in medical fields! On the other hand, it reminds me of a free hellish week at a timeshare property in return for listening to the timeshare company's sales pitch.

Now, if only we could get someone to foot the bill for this trip without acting like a timeshare company, and to treat IT professionals like doctors!

Wednesday, March 08, 2006

Data loss risk during RAID rebuild

Have you ever lost data during a RAID rebuild? Well, this week it happened to us again. It is actually my third time in just over a year that a second disk had uncorrectable errors or failed during RAID reconstruction. A couple of months ago, I mentioned the same concern in my post Happy New Year & Food for Your Brain.
Will the risk of data loss during RAID rebuild become a major concern as disk capacity increases?
Are there any studies that have looked at the probability of a second disk having uncorrectable errors during RAID reconstruction? If you know of any studies or reliability models, send me a message through comments or via email.
How to find my email address? View my complete profile > My Web Page > Contact Us.
As disk capacity increases, it takes longer to rebuild a RAID group. During this reconstruction time, the only protection against total loss of the stored data is the last good backup. With a typical RAID5 rebuild rate of 10 to 15 GB/hr, reconstructing a RAID group built from high-capacity disks, such as 500 GB disks, takes roughly 33 to 50 hours, which is longer than the 24-hour backup rotation.
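The arithmetic behind that comparison is simple enough to sketch in a few lines of Python. The function name and the idea of rebuilding the whole disk at a constant rate are my simplifying assumptions; the 500 GB capacity, the 10 to 15 GB/hr rates and the 24-hour backup rotation come from the paragraph above.

```python
def rebuild_hours(disk_gb: float, rate_gb_per_hr: float) -> float:
    """Hours to reconstruct one disk, assuming a constant rebuild rate."""
    if rate_gb_per_hr <= 0:
        raise ValueError("rebuild rate must be positive")
    return disk_gb / rate_gb_per_hr


BACKUP_ROTATION_HOURS = 24  # daily backup rotation
DISK_GB = 500

for rate in (10, 15):  # typical RAID5 rebuild rates in GB/hr
    hours = rebuild_hours(DISK_GB, rate)
    exposed = hours > BACKUP_ROTATION_HOURS
    print(f"{rate} GB/hr: {hours:.1f} h rebuild, exceeds backup rotation: {exposed}")
```

Plugging in a smaller disk or a faster rate shows how quickly the rebuild drops back under the backup window.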

How vulnerable are organizations to data loss during RAID rebuild, and how aware of it are they? What are they doing to protect themselves against a second disk failure during RAID reconstruction?

Previously, I considered several alternatives, but I am still looking for ways to mitigate this risk elegantly.
  • RAID10 instead of RAID5 as default RAID group.
  • Dual parity RAID techniques.
  • Initiating snapshot and backup upon detection of first disk failure.
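
How much protection these alternatives need to buy depends on how likely a second disk is to fail outright before the rebuild finishes. Here is a rough Python sketch of one common back-of-the-envelope model, which assumes independent disks with a constant failure rate. The 500,000-hour MTTF, the four surviving disks and the function name are hypothetical values I picked for illustration, not measured data. The model also ignores unreadable sectors and correlated failures, so treat its output as a lower bound at best.

```python
import math


def p_second_disk_failure(surviving_disks: int,
                          rebuild_hours: float,
                          mttf_hours: float) -> float:
    """Chance that any surviving disk fails during the rebuild window.

    Assumes independent disks with exponentially distributed lifetimes.
    """
    combined_rate = surviving_disks / mttf_hours  # failures per hour
    # 1 - exp(-rate * t), written with expm1 for accuracy near zero
    return -math.expm1(-combined_rate * rebuild_hours)


# Assumed values for illustration only
MTTF_HOURS = 500_000   # hypothetical per-disk mean time to failure
SURVIVING = 4          # hypothetical: 5-disk RAID5 group with one disk failed

for hours in (8, 24, 50):
    p = p_second_disk_failure(SURVIVING, hours, MTTF_HOURS)
    print(f"{hours:>2} h rebuild: {p:.4%} chance of a second disk failure")
```

In this model the risk grows almost linearly with the rebuild time, which is why a longer rebuild window matters so much.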

Tuesday, March 07, 2006

Who will be next?

There has been a lot of M&A activity in the last month. I was particularly excited about Microsoft taking over the WinTarget product from String Bean. I really liked their approach to the iSCSI target product. I hope Microsoft decides to incorporate iSCSI target functionality into its server product line. It would be a great tool to enable SMBs to implement high availability with ease and confidence.

String Bean Software was the only product vendor we ever signed a reseller agreement with. This relationship finally ended today, when we received the reseller agreement termination letter from Mickey McIntire, CEO of String Bean Software.

It looks like storage companies are adopting the GYM approach of buying early and cheap. Who will be next to get acquired? I was going to bet on Data Domain, but I am not sure there are many storage players willing to pay the $250+ million it deserves.

M&A headlines from B&S:

03/07/2006 CA Completes Wily Buy

03/06/2006 Brocade Bags NuView

03/06/2006 EMC Acquires Authentica

03/06/2006 Atempo Swallows Storactive

03/03/2006 Microsoft Munches String Bean

03/01/2006 McData & Riverbed: A Rumored Pair

02/27/2006 Hitachi Picks Archiving Partner

02/23/2006 Sun Shines on Aduva Grid

02/16/2006 Qlogic Bets on InfiniBand

02/07/2006 StoneFly Officially Lands in DNF

02/07/2006 HP Hops on OuterBay