Monday, February 26, 2007

Improving ROI of IT management

Most IT departments have difficulty addressing IT infrastructure monitoring and management requirements due to:
  • Limited resources
  • Increasing complexity of infrastructure
  • Growing number and type of devices in an environment, and
  • Demands for better usage, performance and uptime reporting
While the industry has attempted to address these challenges through unified management platforms, a few years ago an IT management vendor told me that over half of such products sit on the customer's shelf due to integration complexity, customization requirements, and wide variations in the capabilities of device modules provided by device vendors.

Incorporating device configuration and control functions into unified management applications is considered the major hurdle in the adoption of such applications. But IT administrators typically prefer to work with feature-rich, device-specific configuration and control applications instead of a generic unified management product. Most device vendors also provide limited functionality beyond monitoring and basic reporting to unified management platforms, limiting the value delivered through them. Rightly so; it plays into the lock-in strategy and defends device vendors against an out-of-sight, out-of-mind perception.

As more and more devices are installed in customer environments with email capabilities, real-time monitoring and vendor auto-support, these management functions are becoming quite burdensome for IT administrators, if not a potential security risk. Just think about several dozen devices in a data center, each with frequent outbound emails, monitoring pings and SNMP alerts, and the management headache these auto-support functions can create.
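A back-of-the-envelope calculation makes the scale of that headache concrete; every per-device rate below is a hypothetical illustration, not a figure from any vendor:

```python
# Back-of-the-envelope: event volume from device auto-support functions.
# All per-device rates here are hypothetical illustrations.
devices = 50             # a few dozen devices in one data center
emails_per_day = 4       # outbound auto-support emails per device
pings_per_hour = 60      # one monitoring ping per minute
traps_per_day = 10       # SNMP traps per device

daily_events = devices * (emails_per_day + pings_per_hour * 24 + traps_per_day)
print(daily_events)  # tens of thousands of events per day to account for
```

Even with modest assumptions, an administrator is left accounting for tens of thousands of events per day from auto-support traffic alone.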

So, how can IT management vendors increase the value delivered to customers with monitoring, reporting and analysis tools, without device configuration and control abilities, and without creating another layer of management headache? As mentioned in my last post, Understanding the Web 2.0 Trends, this has been a topic of my conversations with Scot French, VP of Marketing at Klir Technologies, a local startup, over the last few weeks.

I particularly liked Klir's approach to the IT monitoring, reporting and analysis platform market, combining the scalability of Software-as-a-Service (SaaS) with the collective intelligence of “2.0”. I believe their approach brings ease of use, perpetual upgrades, contextual content and a community approach to IT management that is not available from enterprise IT management software packages.

Your opinions are welcome on the IT monitoring, reporting and analysis market, the SaaS approach, and leveraging “2.0” to enhance the ROI of IT management tools.

Sunday, February 25, 2007

Understanding the Web 2.0 Trends

Last week was quite active for me, with a lot of good discussions and brainstorming sessions in addition to an evening spent at the Google office. The discussions covered:
  • Challenges of delivering news through search
  • Continued obsolescence of current backup and recovery strategies
  • Improving ROI from infrastructure and storage monitoring initiatives
  • Web 2.0 and Enterprise 2.0 extending beyond applications to infrastructure
Note: Your thoughts are welcome on these topics.

The underlying idea loosely connecting all these activities appears to be “2.0” and its influence. And these observations led me back to reviewing What is Web 2.0, a 2005 article by Tim O’Reilly on understanding the Web 2.0 trends.
… the core competencies of Web 2.0 companies:
  • Services, not packaged software, with cost-effective scalability
  • Control over unique, hard-to-recreate data sources that get richer as more people use them
  • Trusting users as co-developers
  • Harnessing collective intelligence
  • Leveraging the long tail through customer self-service
  • Software above the level of a single device
  • Lightweight user interfaces, development models, and business models
Operations must become a core competency. So fundamental is the shift from software as artifact to software as service that the software will cease to perform unless it is maintained on a daily basis. It’s no accident that Google’s system administration, networking, and load balancing techniques are perhaps even more closely guarded secrets than their search algorithms. Google’s success at automating these processes is a key part of their cost advantage over competitors.

Lightweight business models are a natural concomitant of lightweight programming and lightweight connections. When commodity components are abundant, you can create value simply by assembling them in novel or effective ways.

Tuesday, February 20, 2007

How are you being pitched data de-duplication?

Recently, I had an interesting discussion with an IT professional about data de-duplication. It started with a very simple query on in-band and out-of-band data de-duplication and became a full-blown discussion worth sharing on this blog.

Initially, I was taken aback by the terms “in-band” and “out-of-band” being used to segment data de-duplication. I hadn’t heard these terms since the days of storage virtualization. They were marketecture (a nicer way to say marketing FUD) used by vendors trying to trip each other up by pigeon-holing different solutions.

These terms leverage the long-held implicit connotation of in-band: that anything inserted in the data path is bad for the customer environment. And, true to form, this IT professional assumed the same without really trying to understand the various data de-duplication methods and their pros and cons. If in-band were so universally unacceptable, network switches would never have been introduced between server and storage. They are in the data path, aren’t they?

Moot Debate: In-band vs. out-of-band

With storage virtualization, in-band referred to being in the data path and out-of-band to being outside the data path, typically in the control path. With data de-duplication, the boundaries between in-band and out-of-band are fuzzier than in storage virtualization. In data de-duplication, all methods touch the data; the only difference is where and when that touch happens.
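The point that every de-duplication method must touch the data can be seen in a minimal content-hashing sketch, written here in Python purely for illustration. Whether this function runs in the write path (“in-band”) or later over data already landed on disk (“out-of-band”), the same hashing work happens either way:

```python
import hashlib

def dedupe(chunks, store):
    """Store only chunks whose content hash is unseen; return references."""
    refs = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk  # new unique chunk, keep the bytes
        refs.append(digest)        # duplicate or not, keep a reference
    return refs

store = {}
data = [b"hello", b"world", b"hello"]  # third chunk is a duplicate
refs = dedupe(data, store)
# "in-band" runs dedupe() as data is written; "out-of-band" runs it
# afterwards over data at rest -- either way the bytes get hashed.
```

Only two unique chunks end up stored; the third write becomes a reference to an existing hash. The where-and-when of that hashing, not whether it happens, is the real difference between the two camps.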

My suggestion: the next time someone tells you their solution is better because it is out-of-band, ask them how. And don’t accept an answer that doesn’t go deeper than “because it doesn’t sit in the data path.”

What are your pain points?

My next suggestion was not to lose sight of the pain points you are trying to solve. If you are trying to decide how to go from A to B, your decision should start with the mode of transportation, not whether a car comes with a V8 engine.

What are your pain points? Prioritize! There is no silver bullet.
  • Remote office / remote clients

  • Backup time window

  • Backup size footprint

  • Offsite backup

Monday, February 19, 2007

SMART not so smart in predicting disk drive failure

Continuing from the last blog post, the Google report [PDF] also shares analysis based on disk self-monitoring data and identifies important failure-related SMART parameters.
  1. The drives with scan errors are 10x more likely to fail than the drives with no scan errors. 30% of the drives fail within 8 months after the first scan error. For newer drives, the failure probability is highest within the first month after the first scan error and then plateaus. With older drives, failure probability rises with time.

  2. The drives with a reallocation count fail 3 – 6x more often than those with none. 15% of the drives fail within 8 months after the first reallocation.

  3. There is no definite correlation between failure rates and seek errors.

  4. CRC errors are more indicative of cable and connector problems than of drive failures.

  5. There is no significant correlation between failures and high power cycle counts for drives less than two years old. For drives 3 years and older, higher power cycle counts can increase the absolute failure rate by over 2%.


Predictive models based on scan errors, reallocation count, offline reallocation count and probational count couldn’t predict more than half of the failed drives.
We conclude that it is unlikely that SMART data alone can be effectively used to build models that predict failures of individual drives. SMART parameters still appear to be useful in reasoning about the aggregate reliability of large disk populations, which is still very important for logistics and supply-chain planning.
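The failure signals above could be combined into a simple rule-based flagger. The multipliers in the messages echo the numbers quoted in this post, but the code itself is only an illustrative sketch, not a model from the Google paper:

```python
def flag_drive(smart):
    """Collect failure warnings from a dict of SMART counters.

    Treating any nonzero count as a warning is an illustrative
    simplification; the multipliers echo the findings above.
    """
    reasons = []
    if smart.get("scan_errors", 0) > 0:
        reasons.append("scan errors: ~10x more likely to fail")
    if smart.get("reallocation_count", 0) > 0:
        reasons.append("reallocations: 3-6x more likely to fail")
    if smart.get("probational_count", 0) > 0:
        reasons.append("probational sectors: softer error indication")
    return reasons

warnings = flag_drive({"scan_errors": 2, "reallocation_count": 0})
```

As the paper’s conclusion warns, even a flagger like this would miss more than half of failing drives, since many fail with no SMART warning at all.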
Glossary of Terms

Scan Errors – Large scan error counts can be indicative of surface defects.

Reallocation Count – When the drive logic believes that a sector is damaged it can remap the faulty sector number to a new physical sector drawn from a pool of spares. Reallocation count reflects the number of times this has happened, and is seen as an indication of drive surface wear.

Offline Reallocation – Offline reallocations are defined as the subset of reallocation counts in which only reallocated sectors found during background scrubbing are counted. In other words, it should exclude sectors that are reallocated as a result of errors found during actual I/O operations.

Probational Count – Disk drives put suspect bad sectors “on probation” until they either fail permanently and are reallocated or continue to work without problems. Probational counts can be seen as a softer error indication.

Seek Errors – Seek errors occur when a disk drive fails to properly track a sector and needs to wait for another revolution to read or write from or to a sector.

CRC Errors – CRC errors are detected during data transmission between the physical media and the interface.

Sunday, February 18, 2007

Google Findings of Disk Failure Rates and Implications

A few months ago, I came to know that Google would be publishing a detailed report on disk drive failure rates in their environment. Prior to this report, to the best of my knowledge, there was very little information available from a user perspective on this topic beyond limited work at Microsoft and the Internet Archive.

Earlier, I thought of complementing this report with a perspective from storage subsystem vendors. I was surprised to learn that subsystem vendors either don’t capture disk failure data effectively and in a usable format or are unwilling to share such data. And there is very little published information on this topic from subsystem vendors.

In one case, a subsystem vendor recommended contacting disk drive manufacturers. Prior studies indicate that actual drive replacement rates are 10 – 100x higher than the failure rates published by disk drive manufacturers. Also, disk drive manufacturers most likely don’t have visibility into the deployment scenarios of failed drives for their data to be useful from the perspective of data centers and subsystem vendors. In the end, I decided to wait for the Google report to come out to start this discussion.

Google Findings

I considered the Google study unique because it looked at a very large sample of 100,000 disk drives. A summary of interesting results on age, manufacturer, read/write load and temperature from this study is listed below:
  1. The failure rate varied from 1.7% for drives in their first year of operation to over 8.6% in their third year of operation.

  2. Confirmation of the finding from prior studies that failure rates are highly correlated with drive models, manufacturers and vintages.

  3. A complex correlation between high utilization (i.e., read/write load) and higher failure rate, instead of the strong direct relationship widely assumed.
  4. First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR (annualized failure rate) of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.
  5. Surprising finding that lower temperatures are associated with higher failure rates and failures do not increase when the average temperature increases. The trend for higher failures with higher temperature is more pronounced for older drives.
  6. Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates.
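For readers unfamiliar with the AFR (annualized failure rate) used in the findings above, it is simply failures normalized by aggregate drive-years of operation; a minimal sketch:

```python
def annualized_failure_rate(failures, total_drive_hours):
    """AFR: failures per aggregate drive-year of operation, as a percent."""
    drive_years = total_drive_hours / (24 * 365)
    return 100.0 * failures / drive_years

# e.g. 86 failures across 1,000 drives each running a full year
# corresponds to an 8.6% AFR, the third-year figure reported above
afr = annualized_failure_rate(86, 1000 * 24 * 365)
```

Normalizing by drive-hours rather than drive count lets populations with staggered deployment dates be compared on equal footing.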
Implications

In my opinion, the second finding reaffirms the subsystem vendors’ stance against mixing and matching drives of different models and manufacturers.

The third finding has wider implications. The first is the higher possibility of a new drive failing during a RAID rebuild, even though this study explicitly didn’t measure drives in their very first hours of operation. The second is extending the period an old drive set is kept intact and in operation after a new drive set has been phased in. The third is the higher possibility of losing disks in long-term archive storage, where read/write load may not be high enough, and the challenge of keeping it high enough.

Saturday, February 17, 2007

P2P powered Devices … coming soon?

Last night, I came across PBS's Robert Cringely post Appearances Can Be Deceiving: What's that 40-gig hard drive doing inside my Apple TV? on P2P technology incorporated in future generation of Apple TV.
If the Apple TV is plugged in it is turned on. Did you notice that? That means the hard drive will have at least the capability of running 24/7. Now envision a BitTorrent-like file distribution system that is controlled primarily by iTunes, rather than by you or me. A centrally controlled P2P system is VERY powerful because it allows for the pre-positioning of content.
His discussion of P2P powered devices is very interesting to me, as a few months ago I presented similar thoughts in a series of posts on GridNetworks and what they can do to stand out in the overcrowded P2P online video streaming market.
The second method, more likely to work for GridNetworks, is to pre-install or embed the player into as many devices as possible, preferably the type of devices that are almost always on, almost always connected and publicly available to participate in content distribution without compromising owner-user experience.

They need to focus on embedding their player into any device that has storage capacity and a network port. This may be the differentiation GN needs to stand out in the overcrowded P2P based online video streaming market.
Cringely's post echoes the same thoughts I mentioned before about the success factors for GridNetworks, a startup in the P2P space. Actually, he also gave a nice prop to GridNetworks in his post.
There are products like this already in operation, such as GridNetworks from Seattle or Mike Homer's Kontiki network, now part of VeriSign. It isn't rocket science, but to succeed, networks of this sort need lots of nodes, especially nodes that remain on 24/7.
Steve Rubel finds the idea interesting but doesn't think that Apple will have P2P based IPTV because of its focus on consumer products.

He may be looking at IPTV with a business model similar to Cable TV. Instead, Apple IPTV will most likely be a combination of Apple iTunes and Apple TV, with delivery between the two managed by someone like Akamai or GridNetworks.

Some may consider P2P for consumer applications only. In my opinion, P2P and Grid have wider applications in the enterprise, with open source projects like Cleversafe for dispersed storage.

Also see,

Challenges of High Quality Video Delivery
Success Factors for GridNetworks
Success Factors for GridNetworks … contd
Success Factors for GridNetworks … closure

Monday, February 12, 2007

Free Backup School, Seattle March 8, 2007

Check out a free Backup School in Seattle, March 8, 2007, organized by Storage Decisions and Storage Magazine. I received this notification through the Puget Sound SNUG Mailing List.
For more details and the complete agenda, call Brian Digeronimo 508-621-5532 or register online at: http://searchStorage.com/r/0,,63031,00.htm?

"Backup School Hits the Road"

Where: Seattle, Washington
When: Thursday, March 8, 2007 (8:00 am - 4:00 pm)
Cost: FREE!
Who: This is a free seminar for IT professionals who are interested in improving backup efficiency and making restores more reliable.

Produced by: Storage Decisions and Storage Magazine

Your registration includes: Free breakfast and lunch, plus a chance to win one of two American Express gift certificates!

TOPICS INCLUDED:
• The top 10 ways people misconfigure their backup systems
• How to avoid those mistakes
• The pros and cons of the different disk-based backup targets
• An overview of CDP, data reduction backup and replication-based backup systems

Sunday, February 11, 2007

Isilon Journey as New Entrant

Last Friday, I had the opportunity to listen to Sujal Patel, founder of Isilon, during the Northwest Entrepreneur Network (NWEN) breakfast at the Bellevue Harbor Club. He talked about the trials and tribulations of the last six years he went through in bringing a new idea to a mature market with Isilon.

It was an interesting presentation and hopefully NWEN will post slides, notes or audio of his presentation online as they do with other breakfast meetings. Before discussing the four key factors of Isilon’s success, he talked about his background and the Isilon timeline. This was an amazing story on its own.
Isilon Timeline
2001 - Series A funding $8.5 million
2003 - First customer, Kodak
2004 - NBC and Sports Illustrated using Isilon product for Athens Olympic
2005 - 100th customer
2006 - Initial Public Offering (IPO)
Establishing a storage startup during the height of the dotcom crash. Isilon received Series A funding in 2001. He didn’t talk about the journey from the inception of the idea to Series A funding, most probably another interesting talk on its own. But it was clear from his mention of the 150+ meetings needed to raise Series A that it was not an easy path.

Founders with no background in the storage industry. He did have a background in digital media delivery infrastructure, and he started out solving a digital media delivery problem instead of setting out to create a storage solution.

Surviving when dozens of storage startups started by seasoned storage industry professionals failed. I think there were some inaccuracies in his list of storage startups RIP, but the message was quite clear: industry experience is not a corollary of a disruptive technology, solution or successful startup. It may be more of a hindrance than a tool in seeing a real customer problem and producing a solution customers want.

His clustered storage product focused on solving a very small problem in the larger realm of the storage ecosystem, but an important problem to a niche digital media segment. Contrast that with technologies like iSCSI and CDP, solutions still searching for a problem to solve that is important enough for someone to pay for.

Isilon is a case study in the lessons of Clayton Christensen's The Innovator's Dilemma and Geoffrey Moore's Crossing the Chasm.

More in future posts. Update: John Cook posted a nice summary of Sujal's talk. As the comments on his blog reflect, I also thought Sujal's presentation was excellent, well-prepared, and had the right content for the right audience.

BTW, I asked him the last question, about the organizational changes taking place at Isilon as it grows. The follow-up question I didn't get to ask would have been how these changes will impact future innovation at Isilon.

One day, I would like to meet him as well as his product manager, Sam for whom he had very nice things to say.

Tuesday, February 06, 2007

Sights of Tokyo


View of Peace Bell from cab on our way to hotel from Tokyo station.


View from the hotel in the morning.


View from Tokyo TV Tower


View from Tokyo TV Tower, looking down through plexiglass floor tile.

Just a bout of writer's block. :-(

Thursday, February 01, 2007

Data De-duplication for Primary Storage

Last night, while reading Steve Duplessie's long travelogue, I noticed a very interesting note referring to data de-duplication for primary storage.
We landed on time, and a car was nicely waiting to pick me up to bring me to my first meeting, with Data Domain … I chatted with a bunch of their smart folks theorizing about where other implementations of this technology could really affect change in the world, and found quite a few. What if you could get the performance attributes required by a high percentage of today's applications on a primary store that happened to get 40 to 1 compression rates? Imagine the economic advantages and the consolidation potential.
Again tonight, I came across another note referring to data de-duplication for primary storage, in Jon Toigo's praise-the-PR-agency post.
What de-duplication doesn’t address is the primary storage issue. The device used for primary storage is not inexpensive and has limited capacity. When you run out of space, you run out of space. You can manually delete and/or move to tape, but this is somewhat time intensive. Or, you can always make a storage vendor’s day by buying more and more hardware. What I believe will be a hot and extremely important technology in 2007 is data compression. The business case for data compression on primary storage is the same as the one used for de-duplication for backup. Compression can cut primary server data to a minimum of one-third its original size. It makes good business sense: by compressing data on primary storage devices you need less hardware, fewer resources to manage, lower power consumption (and power consumption is a big deal), etc., all without a performance hit.
Hmm … it may be just my imagination but seems like Data Domain may be planning to enter primary storage market with a data de-duplication product and JPR may be doing guerrilla PR for them.

Anyway, I am glad to see that I wasn't the only one theorizing about extending data de-duplication beyond backup to other storage sub-segments last year. See Where are you being de-duped?.
Some of the near-term applications are going to be in the area of backup, archive, wide area data transfer, data caching, primary storage and enterprise data storage grids. In the end, data de-duplication can be applied anywhere where cost of resources freed by eliminating repeating patterns exceed the cost of resources required to remove repeating patterns.
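That closing cost test, de-dupe wherever the freed resources are worth more than the work of freeing them, can be written down as simple arithmetic. Every figure below is hypothetical, chosen only to show the shape of the trade-off:

```python
def dedupe_worthwhile(raw_bytes, dedupe_ratio, cost_per_gb, overhead_cost):
    """De-duplication pays off when the storage it frees is worth more
    than the cost of finding and removing the repeating patterns."""
    gb = raw_bytes / 2**30
    freed_gb = gb * (1 - 1 / dedupe_ratio)
    return freed_gb * cost_per_gb > overhead_cost

# 10 TB of backup data at a 20:1 ratio, $5/GB storage, $10k of overhead
worth_it = dedupe_worthwhile(10 * 2**40, 20, 5.0, 10_000)
```

The same arithmetic explains why backup was the first beachhead: highly repetitive data makes the ratio, and therefore the freed capacity, large enough to dwarf the overhead.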