Monday, April 26, 2010

Why does CORE fail? Part 1

Recently, David Vellante at Wikibon wrote in his blog post "Dedupe Rates Matter ... Just Not as Much as You Think" about his Capacity Optimization Ratio Effectiveness (CORE) value for ranking dedupe/compression/capacity optimization solutions. He also applied CORE to a few dedupe solutions for primary storage.

As I commented on his blog, I noticed right away that the CORE formula is missing an important parameter: the time to uncompress/reconstitute deduped data (hereafter referred to as time to uncompress). This parameter affects the rate at which applications/users can read data from a dedupe solution. Because uncompression has to happen inline for both inline and post-processing solutions, time to uncompress and the rate of reading data from a dedupe solution can be used interchangeably without any major discrepancy.

Is time to compress/dedupe also a proxy for the rate of data written to a dedupe solution?

Another important parameter is the rate of writing data to a dedupe solution, as applications/users have certain expectations of how quickly data must be written to a storage system. David includes time to compress (tc) in his CORE calculation, which I assume serves as a proxy for the rate of data written to the dedupe solution. I may be wrong, as I didn't see an explicit statement about why time to compress/dedupe is important.

In my opinion, he incorrectly assumes that the impact of time to compress/dedupe (hereafter referred to as time to compress) is the same across dedupe solutions, whether inline or post-processing. Time to compress affects the rate of writing data mainly for a dedupe solution that uses inline processing. For post-processing solutions, it has no impact on the rate of writing data. So, to have an apples-to-apples comparison, David needs to either use the rate of writing data across all solutions or include time to compress as a penalty only for inline solutions, since they are the ones that slow down the rate of writing data.

A low time to write data is a requirement of applications/users. Inline solutions meet it by reducing the time to compress as much as possible (possibly at the expense of a lower dedupe ratio). Post-processing solutions meet the same requirement by deferring compression/deduplication until later (possibly at the expense of additional capacity required for storing pre-deduped data).

Including time to compress in CORE calculations without discrimination inaccurately biases CORE toward inline solutions. Just because a solution has a sub-ms time to compress in-band doesn't mean it should be rewarded over a solution with a time to compress of a few ms out-of-band.

Assuming in the CORE calculation that time to compress in inline mode and in post-processing mode are equivalent is flat-out incorrect.

Why is time to compress defined as the time to compress the smallest unit compressed by the solution (e.g. a file, multiple files, or a block)?

Is a dedupe solution that compresses a 16KB block in 0.001ms better than a solution that compresses a 64KB block in 0.003ms? CORE fails right here.

With all other factors being equal, a solution that claims 0.001ms for compressing 16KB (the smallest unit for the first solution) will produce a higher CORE value than a solution that claims 0.003ms for compressing 64KB (the smallest unit for the second solution). As currently specified, the time to compress, and in turn CORE, doesn't take into consideration the different unit sizes used by different solutions. Is the CORE formula assuming that compressing/deduping in smaller units is better than in larger units?

The smallest unit compressed varies widely across solutions, even by a factor of more than 1000x. For CORE to be of any value, the time to compress should be the amount of time it takes to compress a specified storage capacity, normalized across all solutions. Comparing the time to compress 16KB units with the time to compress 64KB units is like comparing apples to oranges. For 1MB of data, 64 units need to be compressed in the first case (0.064ms) versus 16 units in the latter case (0.048ms). A CORE that uses time to compress/dedupe without taking the unit size into consideration will incorrectly penalize the second solution.
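
The normalization argued for above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this post, not part of the CORE formula: the function name and parameters are hypothetical, and it assumes that time to compress scales linearly with the number of units and that units are processed one after another.

```python
def time_to_compress_ms(capacity_kb, unit_kb, ms_per_unit):
    """Time to compress a fixed capacity, given a per-unit compress time.

    Assumption (not from CORE): units are compressed serially and each
    unit takes the same amount of time.
    """
    units = capacity_kb / unit_kb
    return units * ms_per_unit


ONE_MB_KB = 1024  # normalize both solutions to the same 1MB capacity

# First solution: 16KB units at 0.001ms each.
solution_a = time_to_compress_ms(ONE_MB_KB, unit_kb=16, ms_per_unit=0.001)
# Second solution: 64KB units at 0.003ms each.
solution_b = time_to_compress_ms(ONE_MB_KB, unit_kb=64, ms_per_unit=0.003)

print(f"16KB units: {solution_a:.3f} ms per MB")
print(f"64KB units: {solution_b:.3f} ms per MB")
```

Once both solutions are measured over the same capacity, the second one comes out faster per MB, even though its per-unit time is three times larger.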

In the next post, I will look further into CORE and take apart the CORE formula ...


  1. Since we have started a number of these conversations around CORE with the phrase ‘completely flawed’, I believe that is the right terminology to use here as well. First, I am not so sure that time to ‘uncompress’ (which should really be decompress as it is the verb, but I digress) is a valid parameter IF all solutions are being compared identically, so let’s assume you are in a 100% write environment. (I realize that this would hardly ever be the case but we can assume for a moment that with this assumption all things are still equal.)
    Now let’s discuss two components. First, I think we can all agree that decompression or rehydration is faster than optimization (compression, deduplication). Next we need to look at the I/O rate. Then, if you want to look at the performance, you need to assume an I/O load (take a standard 80:20 read:write scenario), take the time to ‘compress’ (I prefer optimize), cut that time in half and call it the time to rehydrate. Now apply the formula. I would assume that the new CORE values would come out very close to what they are now.
    Also, without understanding how the solution works it is very difficult to debate the merits of the value of performance on that solution. For example, when traditional optimization technologies do a rehydrate, they need to rehydrate the entire file. This generates a lot of I/O. Storwize is a random-access technology: when you need to decompress data, for modification for example, you only need to read the segment of the data being modified. Additionally, since the solution sits in front of the storage and maintains the envelope of the file (permissions, owner, ACLs etc…), the effective storage cache grows by the same factor as your compression ratio. A 5:1 compression ratio gives you a 5x increase in storage cache. Additionally, Storwize has a read cache (no write cache), so in some instances you’re not even reading from the storage array. I have yet to hear another vendor speak about their solution as having no performance degradation or even increasing performance. Isn’t this good for the customer?
    You also mention that time to compress should really be zero in a post-process solution. This really hits the main point I keep suggesting everyone is overlooking: CORE is a measurement of EFFECTIVENESS. This means what is the best overall solution for the customer. As with EVERY IT answer – IT DEPENDS! But that said, customers often ask for ‘best practices’. What CORE allows you to do is provide end users with characteristics to think about when building a primary storage optimization solution. How can you possibly say that a post process solution that has users:
    1) Having to buy full storage capacity (vs. less capacity with an inline solution)
    2) Having to change the application in order to read optimized data
    3) Having to have different downstream processes for things such as backup (especially with deduplicated solutions)
    is a good solution? Step out of the vendor shoes for a moment and put yourself in the shoes of the customer. Which would you want?
    Again, I would very much like to get a bunch of industry folks together to sponsor a survey to ask end users about their requirements for primary storage optimization. Then have the Wikibon team get the actual blog post into the wiki so we as a community can update the formula, as appropriate, and help end users develop smart questions for building their primary storage optimization solution.

  2. Hello Anil. Thanks for taking the time to weigh in on CORE and share your perspectives.

    We agree with your point about time to optimize and de-optimize (e.g. read back). CORE doesn't currently account for that and needs to. In fact, in my blog comments I ran a scenario with an 80% read:write ratio, and the CORE of, for example, ASIS improves dramatically.

    With regard to discriminating time to compress across solutions, CORE absolutely does that. Your statement saying that we don't is untrue. Further, our tc is a proxy for ELAPSED TIME, which is the key point here. To say post process solutions have no elapsed time impact because they batch the job in off hours would be incorrect.

    However, to your point: there are clearly use cases where elapsed time is not important, and CORE as currently constituted unfairly penalizes such solutions. That is wrong. So CORE needs to evolve to be able to account for these situations.

    CORE was first published on Wikibon in February and we're getting some good feedback. So please keep it coming and when you pick apart the calculations try to help evolve the calculations so that they are more widely applicable to all use cases. Having a common metric that accurately reflects business value in a variety of use cases would be beneficial to users.

  3. Steve and Dave,

    Thank you both for the detailed responses. My intention is not to disregard CORE altogether but to show its limitations and how it could possibly be improved upon. I am under the weather today. I will try to respond in detail to your comments soon.