Monday, May 03, 2010

Why does CORE fail? Part 2

... Continuation of my previous post on CORE deficiencies and how it could be improved upon.

What is CORE?

Let's look at the originally defined CORE equation.
CORE = (S x R x V)/(C x tc)

where,

S = The capacity being reduced in TB. Dave in his post fixes the S value at 100TB to compare all solutions.

R = The percent reduction achieved. Dave shows the R value as a decimal for the different solutions; though R is described as a percent reduction, we can assume the decimal form of R is used in calculating CORE.

V = The value of capacity being saved. Though Dave doesn't list the V values used for the different solutions, it is not difficult to reverse-calculate them from the other parameters listed in his table.

C = The cost of solution doing the reducing.

tc = The elapsed time to compress the capacity. As covered in my last post, I consider this parameter to be stated incorrectly, incorporated inappropriately, and irrelevant to CORE. In its place, a better parameter would have been the elapsed time to write.
Three things stand out in this CORE equation:
  1. The CORE equation assumes a first-order relationship with its variables. It may seem that, for a specified value of S, a high CORE score can be achieved by increasing the data reduction (R) and the value of capacity being saved (V), and by reducing the cost of the solution (C) and the time to compress (tc).

  2. The CORE equation has variables (S, R, V) in the numerator that are normalized for solutions without data reduction, but no such adjustment is made for the variables (C, tc) in the denominator.

  3. The CORE equation is composed of dependent variables instead of independent variables.
Isn't V dependent on S and R?

V is defined as the cost per TB (Ct) times the amount of data reduced (Sr), according to the description of the math for CORE. The amount of data reduced (Sr) is the capacity being reduced (S) times the percent reduction achieved (R).
V = Sr x Ct = S x R x Ct
Substituting V in the original CORE equation,
CORE = (S^2) x (R^2) x Ct / (C x tc)
To a large extent, this modified CORE equation is composed of more independent variables than the original one. Obviously, it no longer has a first-order relationship with S and R.

What is interesting about the CORE equation is that the amount of data reduced has been included twice: once directly as the amount of data reduced, and again as part of the cost of the amount of data reduced.
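The substitution above can be checked numerically. A minimal sketch, with all dollar and time figures invented for illustration (none are taken from Dave's table):

```python
# Hypothetical illustration of the substitution V = S x R x Ct: the original
# and substituted CORE forms agree, and the substituted form is second-order
# in S and R. All parameter values below are assumptions, not real data.

def core_original(S, R, V, C, tc):
    """CORE as originally defined: (S x R x V) / (C x tc)."""
    return (S * R * V) / (C * tc)

def core_substituted(S, R, Ct, C, tc):
    """CORE after substituting V: (S^2 x R^2 x Ct) / (C x tc)."""
    return (S**2 * R**2 * Ct) / (C * tc)

S, R, Ct = 100, 0.5, 1000   # 100 TB, 50% reduction, $1000/TB (assumed)
C, tc = 50000, 2            # assumed cost and compress time

V = S * R * Ct              # value of capacity saved = amount reduced x cost/TB
assert core_original(S, R, V, C, tc) == core_substituted(S, R, Ct, C, tc)

# Doubling R quadruples CORE -- not a first-order relationship:
print(core_substituted(S, 0.25, Ct, C, tc))  # 6.25
print(core_substituted(S, 0.50, Ct, C, tc))  # 25.0, i.e. 4x the previous value
```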

What is the CORE value for a solution with no data reduction technology?

For,
S = 100 TB,
R = 0% as there is no data reduction,
V = $0 as there is no capacity being saved,
C = $0 as there is no data reduction technology in play, so there is no cost of a data reduction solution, and
tc = 0 ms as there is no compression of data taking place,

CORE = (S x R x V) / (C x tc) = (100 x 0 x 0) / (0 x 0) = 0/0

CORE = 0/0 is indeterminate; this expression has no meaning.
You may agree that a reasonable CORE value for a solution with no data reduction technology should be 0 or 1. It also makes sense, when calculating the value of a data reduction solution, to use a solution with no data reduction as the baseline.
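The indeterminate baseline can be demonstrated directly. A minimal sketch using the no-reduction values above:

```python
# Sketch: plugging the "no data reduction" baseline into the original CORE
# formula divides by zero, so the baseline has no defined CORE value.
# Values match those listed above; C and tc are zero because nothing is
# doing any reducing.

def core(S, R, V, C, tc):
    return (S * R * V) / (C * tc)

try:
    core(S=100, R=0.0, V=0.0, C=0.0, tc=0.0)
except ZeroDivisionError:
    print("CORE is undefined for the no-reduction baseline")
```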

How can we avoid division by zero?

1. Replace tc with tw or (tw + tc)

An equation that takes into account the time to write (tw), instead of or in addition to the time to compress (tc), would avoid division by zero when no compression/deduplication is used, since even a baseline solution with no data reduction has a non-zero time to write. Either tw or (tw + tc) would be a better choice than tc in the original CORE equation.

2. Redefine tc and tw

Of course, as originally defined in Dave's post, tc is the time to compress the smallest unit compressed by the solution (e.g. a file, multiple files, or blocks), which ignores the variation in tc due to the varying size of that smallest unit across solutions. I recommend redefining tw and tc, respectively, as the time to write and the time to compress S (or a fixed percentage of S), with the value of S kept the same across all solutions. This removes the dependency on the smallest unit compressed and normalizes the parameter over the same amount of S.

3. Redefine C

As originally defined, C is the cost of the data reduction solution. As Dave's post indicates ("NetApp doesn't charge for ASIS – we took a percentage of the array's cost"), we can safely assume that C is only the cost of the data reduction part of the solution, not the whole solution. In this scenario C = 0 for a solution with no data reduction, making the CORE value indeterminate again.

An equation that takes into account the total cost of the solution, i.e. the cost of the solution without data reduction plus the cost of the data reduction component, would avoid division by zero. Of course, for a data reduction solution that uses existing storage, the total cost would be the net present value (NPV) of the existing storage plus the cost of the data reduction solution. Even better, subtracting the cost of capacity saved (V) from this total, instead of using V in the numerator, yields the net cost of the solution.

A better CORE equation, maybe?
CORE = (S x R) / (C x tw x tr)

where,

S, R, and V are the same as originally defined.

C = Net Cost of Solution = Cost of data reduction solution + Cost of capacity used after reduction

Cost of capacity used after reduction = S x (1 - R) x Ct = (S x Ct) - (S x R x Ct) = (S x Ct) - V

tw = time to write a pre-defined storage capacity or fraction of S

tr = time to read a pre-defined storage capacity or fraction of S
Of course, some may object to not including the read/write ratio; there is no reason why the read/write ratio shouldn't be included.

In the end, a CORE equation that is a function of storage capacity (S), percent data reduction (R), net cost of solution (C), read/write ratio, time to write (tw), and time to read (tr) will be more valuable than the originally defined CORE equation. Of course, a lot more work is required to determine the interdependencies of these variables.
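The proposed equation can be sketched in code. A minimal sketch, with all parameter values invented for illustration; note that the no-reduction baseline now evaluates cleanly instead of hitting 0/0:

```python
# Sketch of the revised CORE proposed above:
#   CORE = (S x R) / (C x tw x tr)
# where C is the net cost of the solution. All figures are assumptions.

def net_cost(S, R, Ct, C_dr):
    """Net cost = cost of data reduction solution (C_dr) plus cost of the
    capacity actually used after reduction, i.e. C_dr + S x (1 - R) x Ct."""
    return C_dr + S * (1 - R) * Ct

def core_revised(S, R, Ct, C_dr, tw, tr):
    """Revised CORE: (S x R) / (net cost x tw x tr)."""
    return (S * R) / (net_cost(S, R, Ct, C_dr) * tw * tr)

# Baseline with no data reduction: R = 0 and C_dr = 0, but the net cost,
# tw, and tr stay non-zero, so CORE evaluates to 0 rather than 0/0.
print(core_revised(S=100, R=0.0, Ct=1000, C_dr=0, tw=5.0, tr=4.0))  # 0.0

# A hypothetical solution with 50% reduction:
print(core_revised(S=100, R=0.5, Ct=1000, C_dr=20000, tw=6.0, tr=5.0))
```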

Thursday, April 29, 2010

Why does CORE fail? Part 1 - Response

Steve Kenniston of Storwize made a detailed comment in response to my last post, Why does CORE fail? Part 1. I thought my response to his comment deserved a separate blog post. Frankly, I haven't kept up with developments at Storwize since May 2007, when I last wrote a series of blog posts on Storewiz, so I don't claim any knowledge of the current Storwize solution.
First, I am not so sure that time to 'uncompress' ... is a valid parameter IF all solutions are being compared identically,....
The time to decompress/reconstitute is as important, if not more so, than the time to compress/dedupe. Compression/deduplication can be managed 'internally' to keep up with the write expectations of applications and users, whether by delaying writes just enough to allow in-band data reduction, by reducing data after writes complete, or by some hybrid approach. But read expectations must be met in-band, so any decompression/reconstitution needs to take place correctly and completely in the expected time. A solution with a lower time to decompress should be rewarded in the same fashion that CORE rewards a solution with a lower time to compress.
... First I think we can all agree that decompression or rehydration is faster than optimization (compression, deduplication). ... the performance of time to 'compress' (I prefer optimize) and then cut the time in half and call this time to rehydrate. Now apply the formula. I would assume that the new CORE value would come out very close as they are now.
I am not so sure that the time to decompress/reconstitute is faster than the time to compress/dedupe, or is 50% of it, as I haven't heard of a solution or seen data yet that supports such a claim. Actually, the relationship may be the reverse, especially for solutions with a large amount of compressed/deduped data and a high data reduction ratio. The only related published data I am aware of shows read speed as a direct function of the smallest unit used for decompression/reconstitution: the larger the unit size, the higher the read speed.

As I questioned in my last post, are the times to decompress and compress proxies for the times to read from and write to a data reduction solution? If so, CORE could be improved by including the actual time to read and write (instead of the time to decompress or compress), or by including the time to decompress/compress as a penalty over the normal read/write of a solution with no data reduction technology: in essence, an additional cost in the form of lower read/write performance in exchange for higher storage efficiency.
Also, without understanding how the solution works it is very difficult to debate the merits of the value of performance on that solution. ...
If CORE sticks to parameters that can be judged externally, it will be more relevant and valuable than trying to incorporate parameters internal to a solution, like time to compress (tc). A CORE based on externally measured parameters such as reduction ratio, read and write performance, and cost of solution over a range of storage capacity and time may produce a better value indicator. Any attempt to include internal mechanisms weakens CORE, due to the lack of complete information and understanding of every solution and the rapid changes in the technologies and techniques incorporated in such solutions.
How can you possibly say that a post process solution that has users: 1) Buy full storage capacity (vs. less capacity with an inline solution) ...... is a good solution? ...
Please read my post again. I never claim any one solution is better than another. CORE includes the cost of solution as a parameter, which supposedly should penalize a solution that includes more storage than other solutions require.
Step out of the vendor shoes for a moment and put yourself in the shoes of the customer. Which would you want?
As a customer, I want a solution that provides additional storage efficiency at reasonable cost while meeting my expectations for read and write performance, safeguarding my data, and requiring no additional management overhead. Anything beyond that is the vendor coloring customer expectations to fit its solution.

Monday, April 26, 2010

Why does CORE fail? Part 1

Recently, David Vellante at Wikibon wrote in his blog post Dedupe Rates Matter ... Just Not as Much as You Think about his Capacity Optimization Ratio Effectiveness (CORE) value for ranking dedupe/compression/capacity optimization solutions. He also applied CORE to a few dedupe solutions for primary storage.

As I commented on his blog, I noticed right away that the CORE formula was missing an important parameter: the time to uncompress/reconstitute (hereafter referred to as time to uncompress) deduped data. It is an important parameter that impacts the rate at which applications/users read data from a dedupe solution. As time to uncompress needs to happen inline for both inline and post-processing solutions, there is logically no major discrepancy in using time to uncompress and the rate of reading data from a dedupe solution interchangeably.

Is time to compress/dedupe also proxy to rate of data written to dedupe solution?

Another important parameter is the rate of writing data to a dedupe solution, as applications/users have certain expectations of how quickly data must be written to a storage system. David includes time to compress (tc) in his CORE calculation as, I assume, a proxy for the rate of data written to a dedupe solution. I may be wrong, as I didn't see an explicit statement about why time to compress/dedupe is important.

In my opinion, he incorrectly assumes the impact of time to compress/dedupe (hereafter referred to as time to compress) to be the same across dedupe solutions, whether inline or post-processing. The time to compress impacts the rate of writing data more for a dedupe solution that uses inline processing; there is no impact on the rate of writing data for post-processing solutions. So, for an apples-to-apples comparison, David needs to either use the rate of writing data across all solutions or count time to compress as a penalty for inline solutions, since it slows the rate of writing data.

A low time to write data is a requirement of applications/users, which inline solutions meet by reducing the time to compress as much as possible (possibly at the expense of a lower dedupe ratio). Post-processing solutions meet the same requirement by delaying compression/deduplication until later (possibly at the expense of the additional capacity required to store pre-deduped data).

Including time to compress in CORE calculations without discrimination inaccurately biases CORE toward inline solutions. Just because a solution has a sub-ms time to compress in-band doesn't mean it should be rewarded over a solution with a few-ms time to compress out-of-band.

Assuming, in the CORE calculation, that the time to compress in inline mode and in post-processing mode are equivalent is flat out incorrect.
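The penalty argument above can be sketched simply. A minimal illustration with hypothetical timings, not measurements of any real solution:

```python
# Sketch: an inline solution pays the compress time (tc) in the write path,
# while a post-processing solution defers it out of the write path. Treating
# tc identically for both, as CORE does, ignores this difference.
# All timings below are made-up example values.

def effective_write_time(tw, tc, inline):
    """Time an application waits per write: inline pays tc up front;
    post-process defers compression until after the write completes."""
    return tw + tc if inline else tw

tw = 1.0  # assumed base write time (ms)
print(effective_write_time(tw, tc=0.2, inline=True))   # inline: 1.2 ms
print(effective_write_time(tw, tc=0.2, inline=False))  # post-process: 1.0 ms
```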

Why is Time to Compress being used as Time to compress the smallest unit compressed in the solution (e.g. file or multiple files or block)?

Is a dedupe solution that compresses a 16KB block in 0.001ms better than a solution that compresses a 64KB block in 0.003ms? CORE fails right here.

All other factors being equal, a solution that claims 0.001ms to compress 16KB (the smallest unit for the first solution) will produce a higher CORE value than a solution that claims 0.003ms to compress 64KB (the smallest unit for the second solution). As currently specified, the time to compress, and in turn CORE, doesn't take into account the variation in unit size across solutions. Is the CORE formula assuming that compressing/deduping in smaller units is better than in larger units?

The smallest unit compressed varies across solutions by a wide range, even by a factor of more than 1000x. The time to compress should be the time it takes to compress a specified storage capacity, normalized across all solutions, for CORE to be of any value. Comparing the time to compress 16KB units versus 64KB units is comparing apples to oranges. For 1MB of data, 64 units need to be compressed in the first case (0.064ms) versus 16 units in the latter case (0.048ms). Using time to compress/dedupe without taking unit size into consideration, CORE penalizes the second solution incorrectly.
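The normalization argued for above can be sketched as follows, using the 16KB/64KB figures from the example:

```python
# Sketch: compare the time to compress the SAME amount of data (here 1 MB)
# rather than one unit of each solution's own block size. Figures match the
# 16KB/64KB example in the text.

def time_to_compress(total_kb, unit_kb, tc_per_unit_ms):
    """Total time to compress total_kb of data, given the per-unit time."""
    units = total_kb / unit_kb
    return units * tc_per_unit_ms

one_mb = 1024  # KB
t_16kb = time_to_compress(one_mb, unit_kb=16, tc_per_unit_ms=0.001)
t_64kb = time_to_compress(one_mb, unit_kb=64, tc_per_unit_ms=0.003)

print(t_16kb)  # 64 units x 0.001 ms = 0.064 ms
print(t_64kb)  # 16 units x 0.003 ms = 0.048 ms

# Per-unit tc favors the 16KB solution, but normalized over the same 1MB
# the 64KB solution is actually faster.
assert t_64kb < t_16kb
```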

In the next post, I will look further into CORE and take the CORE formula apart ...