Clifton Royston <cliftonr(_at_)lava(_dot_)net> wrote:
<much cut>
The tentative definition for "dSpam" is:
10 * ( -log10(FP) - log10(FN) + log10(1/4) )
where the log10(1/4) addition is a normalizing factor. (This is
equivalent to -10*log10(FP*FN*4), etc.)
<more cut>
Here are some problems with this metric, before everyone else points
them out:
1) The trivial systems which classify all mail as spam or all mail as
non-spam get arbitrarily high scores: such a system has FN = 0 or FP = 0
exactly, so the corresponding -log10 term (and hence dSpam) diverges.
2) Generally, a system which drives either false positives or false
negatives to an extreme gets rated better than it "should", since the
metric depends only on the product FP*FN; e.g. dSpam for 0.05% false
negatives and 50% false positives equals dSpam for a system with 0.5%
false positives and 5% false negatives (both products are 0.00025, so
both score 30), though the latter would probably be much more
desirable. (A numeric check follows after this list.)
3) The metric isn't biased against false positives, and should not be
used as the sole metric for a system.
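To make issues 1 and 2 concrete, here is a minimal Python sketch of the
original product-based metric (the function name dspam_original is
mine, purely illustrative):

    import math

    def dspam_original(fp, fn):
        """Original tentative metric: -10 * log10(FP * FN * 4)."""
        return -10 * math.log10(fp * fn * 4)

    # Issue 2: these two systems tie (their FP*FN products are equal),
    # though the second is probably far preferable in practice.
    print(dspam_original(fp=0.50, fn=0.0005))   # 30.0
    print(dspam_original(fp=0.005, fn=0.05))    # 30.0

    # Issue 1: a filter flagging everything as spam has FN = 0, so the
    # score is unbounded (math.log10(0.0) raises ValueError; the limit
    # of the metric as FN -> 0 is +infinity).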
There's a fairly straightforward modification which improves the metric in
these respects:
change the definition of dSpam to
dSpam = -10 * log10(FP + FN)
This fixes issues 1 and 2, and still seems to retain desirable properties.
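A quick sketch along the same lines (again, the name is just
illustrative), showing that the trivial "all spam" filter now scores
zero and the two systems from issue 2 are no longer tied:

    import math

    def dspam_mod(fp, fn):
        """Modified metric: -10 * log10(FP + FN)."""
        return -10 * math.log10(fp + fn)

    print(dspam_mod(fp=0.50, fn=0.0005))   # ~3.0  - 50% FP now hurts badly
    print(dspam_mod(fp=0.005, fn=0.05))    # ~12.6 - the balanced system wins
    print(dspam_mod(fp=1.0, fn=0.0))       # 0.0   - trivial "all spam" filter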
To fix issue 3, it's easy enough to introduce a bias:
dSpam(b) = -10 * log10( 2*(b*FP + FN)/(b+1) )
Then you have a family of measures indexed by a bias value (positive real
number - limits as bias parameter approaches 0 or infinity are measures
based solely on FN or solely on FP respectively).
I think this still retains the desirable properties of the original
measure - for example, dSpam for a coin-toss filter is zero for all
biases, and for any fixed bias dSpam increases whenever either FN or FP
decreases with the other held constant, ...
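A quick numeric check of these claims, as a minimal Python sketch (the
function name dspam_b is just illustrative):

    import math

    def dspam_b(fp, fn, b):
        """Biased family: -10 * log10( 2*(b*FP + FN)/(b+1) )."""
        return -10 * math.log10(2 * (b * fp + fn) / (b + 1))

    # Coin-toss filter (FP = FN = 0.5): zero for every bias.
    for b in (0.1, 1.0, 10.0):
        print(dspam_b(0.5, 0.5, b))        # 0.0 each time

    # b = 1 recovers the unbiased metric exactly:
    print(dspam_b(0.005, 0.05, b=1))       # ~12.6, same as -10*log10(FP+FN)

    # Large b approaches the FP-only measure -10*log10(2*FP), i.e. 20 here:
    print(dspam_b(0.005, 0.05, b=100))     # ~19.6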
Tom