Monday, May 03, 2010

Why does CORE fail? Part 2

... A continuation of my previous post on the deficiencies of CORE and how it could be improved.

What is CORE?

Let's look at the CORE equation as originally defined.
CORE = (S x R x V)/(C x tc)

where,

S = The capacity being reduced, in TB. In his post, Dave fixes S at 100 TB to compare all solutions.

R = The percent reduction achieved. Dave shows R as a decimal for the different solutions, so we can assume that, although R is described as a percent reduction, its decimal form is used in calculating CORE.

V = The value of the capacity being saved. Although Dave doesn't list the V values used for the different solutions, it is not difficult to back-calculate them from the other parameters listed in his table.

C = The cost of the solution doing the reducing.

tc = The elapsed time to compress the capacity. As covered in my last post, I consider this parameter to be stated incorrectly, incorporated inappropriately, and irrelevant to CORE. In its place, a better parameter would have been the elapsed time to write.
Three things stand out in this CORE equation:
1. The CORE equation assumes a first-order relationship with each of its variables. It may seem that, for a specified value of S, a high CORE score can be achieved by increasing the data reduction (R) and the value of capacity being saved (V) while reducing the cost of the solution (C) and the time to compress (tc).

2. The CORE equation has variables (S, R, V) in the numerator that are normalized for solutions without data reduction, but no such adjustment is made for the variables (C, tc) in the denominator.

3. The CORE equation is composed of dependent variables instead of independent variables.
Isn't V dependent on S and R?

V is defined as the cost per TB (Ct) times the amount of data reduced (Sr), according to the description of the math for CORE. The amount of data reduced (Sr) is the capacity being reduced (S) times the percent reduction achieved (R).
V = Sr x Ct = S x R x Ct
Substituting V into the original CORE equation,
CORE = (S^2) x (R^2) x Ct / (C x tc)
To a large extent, this modified CORE equation is composed of more independent variables than the original one. It is also no longer a first-order relationship in S and R: CORE now grows with the square of each.
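To make the second-order dependence concrete, here is a minimal Python sketch of both forms of the equation. The function names and every input value (cost per TB, solution cost, time to compress) are hypothetical and chosen only for illustration; none of them come from Dave's table.

```python
def core_original(s, r, v, c, tc):
    """Original CORE = (S x R x V) / (C x tc)."""
    return (s * r * v) / (c * tc)


def core_substituted(s, r, ct, c, tc):
    """CORE after substituting V = S x R x Ct."""
    return (s ** 2) * (r ** 2) * ct / (c * tc)


# Hypothetical inputs, for illustration only.
s = 100.0      # capacity being reduced, in TB
ct = 1000.0    # assumed cost per TB
c = 50000.0    # assumed cost of the data reduction solution
tc = 2.0       # assumed time to compress

low = core_substituted(s, 0.25, ct, c, tc)
high = core_substituted(s, 0.50, ct, c, tc)

# Doubling R multiplies CORE by four, not by two.
ratio = high / low

# Both forms agree once V is computed as S x R x Ct.
same = core_original(s, 0.50, s * 0.50 * ct, c, tc)
```

With these made-up numbers, going from R = 0.25 to R = 0.50 quadruples the score, which is the double counting of data reduction in action.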

What is interesting about the CORE equation is that the amount of data reduced has been included twice: once as the amount of data reduced, and then again as part of the cost of the amount of data reduced.

What is the CORE value for a solution with no data reduction technology?

For,
S = 100 TB,
R = 0% as there is no data reduction,
V = $0 as there is no capacity being saved,
C = 0 as there is no data reduction technology in play so there is no cost of data reduction solution, and
tc = 0 ms as there is no compression of data taking place,

CORE = (S x R x V) / (C x tc) = (100 x 0 x 0) / (0 x 0) = 0/0

CORE = 0/0 is indeterminate; the expression has no meaning.
You may agree that a meaningful CORE value for a solution with no data reduction technology should be 0 or 1. When calculating the value of a data reduction solution, it also makes sense to use a solution with no data reduction as the baseline.
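The 0/0 case is easy to reproduce. The sketch below plugs the baseline values listed above into the original equation; the helper names are mine, and the second call uses made-up values purely as a contrast. In Python, floating-point division by zero raises ZeroDivisionError, which the helper turns into None to signal "no defined score".

```python
def core_original(s, r, v, c, tc):
    """Original CORE = (S x R x V) / (C x tc)."""
    return (s * r * v) / (c * tc)


def try_core(s, r, v, c, tc):
    """Return the CORE score, or None when the denominator is zero."""
    try:
        return core_original(s, r, v, c, tc)
    except ZeroDivisionError:
        return None


# Baseline with no data reduction: S = 100 TB, R = 0, V = 0, C = 0, tc = 0.
baseline = try_core(100.0, 0.0, 0.0, 0.0, 0.0)

# A hypothetical solution with data reduction, for contrast.
reduced = try_core(100.0, 0.5, 50000.0, 50000.0, 2.0)
```

The baseline comes back as None rather than 0 or 1, so there is nothing to compare the other solutions against.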

How can we avoid division by zero?

1. Replace tc with tw or (tw + tc)

An equation that takes into account the time to write (tw), instead of or in addition to the time to compress (tc), could help avoid division by zero when no compression or deduplication is in use, since even a baseline solution with no data reduction has a non-zero time to write. Either tw or (tw + tc) would be a better choice than tc in the original CORE equation.

2. Redefine tc and tw

Of course, as originally defined in Dave's post, tc is the time to compress the smallest unit compressed in the solution (e.g. a file, multiple files or blocks), which ignores the variation in tc caused by differences in the size of that smallest unit across solutions. I recommend redefining tw and tc as, respectively, the time to write and the time to compress S, or a fixed percentage of S, with the value of S kept the same across all solutions. This removes the dependency on the smallest unit compressed and normalizes both parameters over the same amount of data.

3. Redefine C

As originally defined, C is the cost of the data reduction solution. Since Dave's post indicates that "NetApp doesn’t charge for ASIS – we took a percentage of the array’s cost", we can safely assume that C is only the cost of the data reduction part of the solution, and not of the whole solution. In this scenario, C = 0 for a solution with no data reduction, which makes the CORE value indeterminate again.

An equation that takes into account the total cost of the solution, i.e. the cost of the solution with no data reduction plus the cost of the data reduction solution, will help avoid division by zero. Of course, for a data reduction solution that uses existing storage, the total cost will be the net present value (NPV) of the existing storage plus the cost of the data reduction solution. Even better, subtracting the cost of capacity saved (V) from this total, instead of using V in the numerator, gives the net cost of the solution.

A better CORE equation, maybe?
CORE = (S x R) / (C x tw x tr)

where,

S, R and V are the same as originally defined.

C = Net Cost of Solution = Cost of data reduction solution + Cost of capacity used after reduction

Cost of capacity used after reduction = S (1 - R) x Ct = (S x Ct) - (S x R x Ct) = (S x Ct) - V

tw = time to write a pre-defined storage capacity or fraction of S

tr = time to read a pre-defined storage capacity or fraction of S
Of course, some may object that the read/write ratio is not included; there is no reason why it shouldn't be.
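As a rough illustration of the proposed equation, here is a minimal Python sketch. The function names, the cost per TB, the solution cost and the timing values are all hypothetical; the read/write ratio is left out because the equation above does not include it yet.

```python
def net_cost(s, r, ct, reduction_cost):
    """C = cost of data reduction solution + cost of capacity used after reduction.

    Cost of capacity used after reduction = S x (1 - R) x Ct.
    """
    return reduction_cost + s * (1.0 - r) * ct


def core_proposed(s, r, ct, reduction_cost, tw, tr):
    """Proposed CORE = (S x R) / (C x tw x tr)."""
    return (s * r) / (net_cost(s, r, ct, reduction_cost) * tw * tr)


# Hypothetical inputs, for illustration only.
s = 100.0     # capacity, in TB
ct = 1000.0   # assumed cost per TB

# Baseline: no data reduction, so R = 0 and no data reduction cost.
# tw and tr are still non-zero because the data must be written and read.
baseline = core_proposed(s, 0.0, ct, 0.0, tw=10.0, tr=8.0)

# A made-up solution: 50% reduction, slightly slower writes and reads.
reduced = core_proposed(s, 0.5, ct, 20000.0, tw=12.0, tr=9.0)
```

With these values the baseline scores exactly 0 instead of 0/0, because its net cost is the full S x Ct and its times are non-zero. The denominator can still reach zero in the corner case R = 1 with a free data reduction solution, so a real implementation would need to guard against that.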

In the end, a CORE equation that is a function of storage capacity (S), percent data reduction (R), net cost of solution (C), read/write ratio, time to write (tw), and time to read (tr) will be more valuable than the CORE equation as originally defined. Of course, a lot more work is required to determine the interdependency of these variables.