As I commented on his blog, I noticed right away that the CORE formula was missing an important parameter: the time to uncompress/reconstitute deduped data (hereafter referred to as time to uncompress). It is an important parameter because it impacts the rate at which applications/users can read data from a dedupe solution. Since decompression must happen inline for both inline and post-processing solutions, time to uncompress and the rate of reading data from a dedupe solution can reasonably be used interchangeably.
Is time to compress/dedupe also a proxy for the rate of data written to a dedupe solution?
Another important parameter is the rate of writing data to a dedupe solution, as applications/users have certain expectations about how quickly data must be written to a storage system. David includes time to compress (tc) in his CORE calculation, which I assume serves as a proxy for the rate of data written to a dedupe solution. I may be wrong, as I didn't see an explicit statement about why time to compress/dedupe is important.
In my opinion, he incorrectly assumes that the impact of time to compress/dedupe (hereafter referred to as time to compress) is the same across dedupe solutions, whether inline or post-processing. Time to compress impacts the rate of writing data for a dedupe solution that uses inline processing; for post-processing solutions, it has no impact on the rate of writing data at all. So, for an apples-to-apples comparison, David needs either to compare the rate of writing data across all solutions or to count time to compress as a penalty only for inline solutions, since only there does it slow the rate of writing data.
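To make the asymmetry concrete, here is a minimal sketch (with made-up latency figures, not measurements from any real product) of how time to compress lands on the write path for inline solutions but not for post-processing ones:

```python
# Hypothetical model: effective write latency per block, in microseconds.
# The base write latency and compression time are invented for illustration.

def effective_write_us(base_write_us, compress_us, inline):
    # Inline solutions pay the compression cost on the write path;
    # post-processing solutions defer it, so writes see only base latency.
    return base_write_us + compress_us if inline else base_write_us

BASE_US = 500      # hypothetical raw write latency per block
COMPRESS_US = 100  # hypothetical time to compress one block

print(effective_write_us(BASE_US, COMPRESS_US, inline=True))   # 600: writes slowed
print(effective_write_us(BASE_US, COMPRESS_US, inline=False))  # 500: no write impact
```

The same compression cost exists in both architectures; the difference is only whether the application waiting on the write has to absorb it.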
Low time to write data is a requirement of applications/users. Inline solutions meet it by reducing the time to compress as much as possible (possibly at the expense of a lower dedupe ratio). Post-processing solutions meet the same requirement by deferring compression/deduplication until later (possibly at the expense of the additional capacity required to store pre-deduped data).
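The capacity side of that trade-off is easy to put a rough number on. A hypothetical sketch (ingest rate and delay are made-up figures):

```python
# Hypothetical illustration: a post-processing solution must stage raw,
# pre-deduped data until the dedupe job runs. All numbers are invented.

def staging_capacity_gb(ingest_gb_per_hour, hours_until_dedupe):
    # Raw data accumulates at the ingest rate until post-processing kicks in.
    return ingest_gb_per_hour * hours_until_dedupe

print(staging_capacity_gb(100, 8))  # 800 GB of extra landing space needed
```

The extra landing space is the price a post-processing solution pays for keeping time to compress off the write path.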
Including time to compress in CORE calculations without discrimination inaccurately biases the CORE toward inline solutions. Just because a solution has a sub-ms time to compress in-band doesn't mean it should be rewarded over a solution with a few-ms time to compress out-of-band.
Assuming, in the CORE calculation, that time to compress in inline mode and post-processing mode are equivalent is flat-out incorrect.
Why is time to compress defined as the time to compress the smallest unit the solution compresses (e.g., a file, multiple files, or a block)?
Is a dedupe solution that compresses a 16KB block in 0.001ms better than one that compresses a 64KB block in 0.003ms? The CORE fails right here.
All other factors being equal, a solution that claims 0.001ms to compress 16KB (the smallest unit for the first solution) will produce a higher CORE value than a solution that claims 0.003ms to compress 64KB (the smallest unit for the second solution). As currently specified, time to compress, and in turn CORE, doesn't account for the variation in unit size across solutions. Is the CORE formula assuming that compressing/deduping in smaller units is better than in larger units?
The smallest unit compressed varies across solutions by a wide range, even by a factor of more than 1000x. Time to compress should be the time it takes to compress a specified storage capacity, normalized across all solutions, for CORE to be of any value. Comparing the time to compress 16KB units versus 64KB units is comparing apples to oranges. For 1MB of data, the first solution must compress 64 units (64 x 0.001ms = 0.064ms) versus 16 units for the second (16 x 0.003ms = 0.048ms). Using time to compress/dedupe without taking unit size into consideration, CORE penalizes the second solution incorrectly.
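The normalization argument above can be sketched as follows, using the per-unit times from the example (the helper function is mine, not part of the CORE formula):

```python
# Normalize time to compress by unit size: time to process a fixed capacity
# (here 1MB = 1024KB) rather than the time for one arbitrarily sized unit.

def time_per_mb_ms(unit_kb, time_per_unit_ms, data_kb=1024):
    # Number of units needed to cover the data, times the per-unit time.
    return (data_kb / unit_kb) * time_per_unit_ms

print(time_per_mb_ms(16, 0.001))  # 64 units x 0.001ms = 0.064ms
print(time_per_mb_ms(64, 0.003))  # 16 units x 0.003ms = 0.048ms
```

On a per-unit basis the first solution looks 3x faster; per MB of actual data, the second solution is the faster one, which is exactly the reversal CORE misses.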
In the next post, I will look further into CORE and take apart the CORE formula ...