
[Asrg] 2.0 Metrics (Was Re: [Asrg] spamstones)

2003-04-02 13:38:47
On Tue, Apr 01, 2003 at 09:06:40AM -0800, Dave Crocker wrote:
MS> Anti-Virus error rate is 0 FN's and 1 in 1 million FPs.
MS> Anti-Spam error rate is about 1 in 1000 FPs based on detecting about
MS> 95% of spam.

thanks!

hmmm. it occurs to me that a technical research group on spam might want
to consider agreeing on a standard methodology for deriving false
negatives and false positives. this would allow everyone to compare
mechanisms in an equivalent way.

given the nature of spam, and the nature of most technologies for
detecting it, the determination of FNs and FPs is not automatically
obvious.  that makes it a fertile opportunity for standardization.

  I spent a while this past weekend looking at appropriate metrics for
comparing FP and FN rates across a range of spam blocking or filtering
solutions with a single number.  I came up with one that seems promising:
a single, fairly easy to grasp number for the overall discrimination
accuracy of an email classification system.

  This measure should be applicable to any machine classification
system for email, whether mail is rejected at an SMTP gateway or flagged
in an email header.  For now I'm calling it "dSpam" because it is
modelled on a decibel-like scale.

  This does not take into account the differential cost of false
positives and false negatives; I do want to look at the Androutsopoulos
paper for its TCR measure.  However, I think it is potentially useful.

  Begin by assuming "spam" to be identifiable on inspection by a human
being, and define ham = ! spam, because it is shorter than writing
not-spam all the time.  (For definitional purposes, we assume no middle
ground.) Define "flagged" as a classification of an email as spam by
the system under measurement.

  I define FP and FN with the provision that they are not allowed to
be = 0, but otherwise in the standard way:  

  measure     spam category       ham category 
  -------     ---------------     --------------
  flagged     N(flagged-spam)     N(flagged-ham)
  unflagged   N(unflagged-spam)   N(unflagged-ham)
  -------     ---------------     --------------
  total       N(spam)             N(ham)

  Then FP = max(N(flagged-ham), 0.5) / N(ham)
       FN = max(N(unflagged-spam), 0.5) / N(spam)

  Using a minimum of 0.5 for the numerator avoids undefined values in
the log computation, and also deliberately penalizes results claiming
"zero false positives" or "zero false negatives" if they use a small
sample size.
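
  As a rough illustration (a throwaway Python sketch; the function and
argument names are just mine for this example), the floored rates could
be computed from the four cell counts above like so:

  def floored_rates(flagged_spam, unflagged_spam, flagged_ham, unflagged_ham):
      """Return (FP, FN) per the definitions above; arguments are counts."""
      n_spam = flagged_spam + unflagged_spam
      n_ham = flagged_ham + unflagged_ham
      fp = max(flagged_ham, 0.5) / n_ham        # false positive rate
      fn = max(unflagged_spam, 0.5) / n_spam    # false negative rate
      return fp, fn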

  The tentative definition for "dSpam" is:
 10 * ( -log10(FP) - log10(FN) + log10(1/4) )
  where the log10(1/4) term is a normalizing factor.  (This is
equivalent to -10*log10(FP*FN*4).)
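
  In code form (again just an illustrative Python sketch), with a quick
check that the normalizing factor puts a random coin-flip classifier at
zero:

  import math

  def dspam(fp, fn):
      """dSpam as defined above: -10 * log10(FP * FN * 4)."""
      return -10.0 * math.log10(fp * fn * 4.0)

  # a coin flip gives FP = FN = 0.5, so FP*FN*4 = 1 and dSpam = 0
  assert abs(dspam(0.5, 0.5)) < 1e-12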

  I would recommend that this measure always be qualified with a note
on the corpus it was based on, and a note on how the determination of
actual spam/non-spam was made.

  This measure has the following desirable properties:

1) dSpam for a random (coinflip) classification of mail as spam/nonspam
   is 0.

2) dSpam is independent of the total spam/ham ratio in an input set.

3) Increasing dSpam is "good"

4) Reducing either the FP rate or the FN rate, while holding the other
   constant, produces a monotonic increase in dSpam.

5) It has reasonable values over the range of rates we need to consider
   (one doesn't have to deal with exponential notations and so on)

6) The relationship between two values is fairly clear; if dSpam(A) is
   10 greater than dSpam(B), then we can say system A is roughly 10
   times "better" than system B in terms of its discrimination, in that
   FP(A)*FN(A) = (1/10) * FP(B)*FN(B).

Here are some values from the published STATISTICS.txt file for
SpamAssassin 2.50 on various thresholds, on its own testing corpus:

SpamAssassin threshold  Data source     FP      FN              dSpam
---------------------------------------------------------------------
SA threshold 7.0        STATISTICS.txt  0.02%   16.28%          38.9
SA threshold 6.5        STATISTICS.txt  0.03%   14.42%          37.6
SA threshold 6.0        STATISTICS.txt  0.05%   10.83%          36.6
SA threshold 5.5        STATISTICS.txt  0.09%   8.88%           35.0
SA threshold 5.0        STATISTICS.txt  0.16%   6.95%           33.5
SA threshold 4.5        STATISTICS.txt  0.55%   5.52%           29.2
SA threshold 4.0        STATISTICS.txt  0.81%   4.51%           28.4
SA threshold 3.0        STATISTICS.txt  1.86%   2.91%           26.6
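
  As a sanity check of the table values (a throwaway Python calculation,
feeding in the published rates as decimal fractions):

  import math
  print(round(-10 * math.log10(0.0002 * 0.1628 * 4), 1))  # threshold 7.0 -> 38.9
  print(round(-10 * math.log10(0.0186 * 0.0291 * 4), 1))  # threshold 3.0 -> 26.6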
  
  This allows us to easily say, for instance, that SpamAssassin
discriminates more precisely at higher threshold settings.

  For another example, assume a hypothetical RBL (or perhaps a
sender-verification scheme) which rejects 98% of offered spam at the
SMTP level but also rejects mail from 3% of valid sites.  That would
stack up like this:

           Spam   Ham
 ---------------------
 flagged    98%    3%
 unflagged   2%   97%
 ---------------------
 total     100%  100%

  dSpam = -10*log10(0.02*0.03*4) = 26.2, or around one order of
magnitude worse than SpamAssassin.

  Personal stats from a system I have been running here:

Filter                   Source         FP      FN              dSpam
---------------------------------------------------------------------
Paleopala-Gold (Jan-Feb) personal       0.19%   14.70%          29.5

  The magnitude of spam reduction that some people have talked about as
the breakthrough needed to damage the economic model of spam is a
thousand-fold reduction in delivered spam, with less than 0.1% false
positives (a 0.1% effect on normal mail).  That's easily calculated as
 dSpam = -10*log10(0.001*0.001*4) = 54.0,
or about another 15 dSpam over existing solutions.

  Here are some problems with this metric, before everyone else points
them out:

1) The trivial systems which classify all mail as spam, or all mail as
non-spam, get arbitrarily high scores: one error rate is pinned at 100%,
but the other is floored at 0.5/N and so tends to 0 as the sample size
increases, which drives the product FP*FN toward 0.

2) Generally, a system which drives either false positives or false
negatives to an extreme gets rated better than it "should"; e.g. dSpam
for 0.05% false negatives and 50% false positives equals dSpam for a
system with 0.5% false positives and 5% false negatives, even though the
latter would probably be much more desirable.  (See the quick check
after this list.)

3) The metric isn't biased against false positives, and should not be
used as the sole metric for a system.
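
  A quick numeric check of point (2) above (throwaway Python again, with
the rates as decimal fractions): both systems have FP*FN = 2.5e-4, so
both land at the same dSpam.

  import math
  print(round(-10 * math.log10(0.50 * 0.0005 * 4), 1))   # 50% FP, 0.05% FN -> 30.0
  print(round(-10 * math.log10(0.005 * 0.05 * 4), 1))    # 0.5% FP, 5% FN   -> 30.0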

  Its main virtue is that it allows us to start comparing different
systems with a single number which bears some relationship to the
desired attributes.  Comments on ways to improve this metric, or on a
better one to replace it, would be welcomed.

  -- Clifton

-- 
     Clifton Royston  --  LavaNet Systems Architect --  
cliftonr(_at_)lava(_dot_)net

  "If you ride fast enough, the Specialist can't catch you."
  "What's the Specialist?" Samantha says. 
  "The Specialist wears a hat," says the babysitter. "The hat makes noises."
  She doesn't say anything else.  
                      Kelly Link, _The Specialist's Hat_
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


