Dave CROCKER wrote:
Chris Lewis wrote:
Do we block an IP on one TIS hit? No. We compute good/bad ratios and
have heuristics on when it's high enough to do something about.
The "bad" number is affirmative. People hit TIS. As a measure, the bad
therefore has a 100% confidence level of accuracy (as long as we are careful
about defining badness.)
But where do you get the 'good' number from and is it really equally forceful?
So, how do we factor in differential confidence levels in the final
Ironically, I'm in the process of rebuilding this code at the moment ;-)
When you read this keep in mind that this is in _addition_ to all the
other filtering (including DNSBLs, both 3rd party and local) that we use.
Basically what we do is generate a score based on the number of
non-blocked emails, contentblocked emails (not IP-blocked), trap volumes
and complaints, and pick a threshold score. Each of the numbers is
scaled differently in a computation something like this:
    if ((complaints * cf + contentblocked * bf + trap * tf) / non-blocked > threshold)
        go block the IP
[Notice that we're not factoring in blocked IP. Specifically to avoid
the thresholder locking up thru positive feedback ;-). They're blocked
anyway, so it doesn't matter.]
Where cf, bf, and tf are chosen thru experience and experimentation.
There's also some gunk in there to deal with when the numbers are too
small to be significant (especially non-blocked == 0 ;-). When the
non-blocked numbers are low, it doesn't matter very much whether you
block it or not anyway.
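
A minimal Python sketch of the computation described above, for illustration only. The names IpMetrics, ip_score, should_block, threshold and min_volume are hypothetical, and the threshold and minimum-volume values are made-up placeholders. The default weights are the historical "something like" figures quoted later in this message.

```python
from dataclasses import dataclass


@dataclass
class IpMetrics:
    """Per-IP counts over the metrics window (field names are hypothetical)."""
    non_blocked: int      # emails that were not blocked
    content_blocked: int  # emails blocked on content, not on IP
    trap: int             # trap volume
    complaints: int       # user complaints (TIS hits)


def ip_score(m, cf=50.0, bf=2.0, tf=0.01):
    """Weighted sum of the bad signals divided by non-blocked volume."""
    bad = m.complaints * cf + m.content_blocked * bf + m.trap * tf
    return bad / m.non_blocked


def should_block(m, threshold=1.0, min_volume=20):
    """Decide whether to block, skipping IPs with too little mail to judge.

    threshold and min_volume are illustrative assumptions, not values
    from the post. IP-blocked mail is deliberately not an input, so a
    block cannot feed back into the score.
    """
    # The max() keeps the non_blocked == 0 case away from the division.
    if m.non_blocked < max(min_volume, 1):
        return False
    return ip_score(m) > threshold
```

The low-volume guard here simply declines to block; the real "gunk" may well do something different.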
Note that there's also an implicit factor: the length of the period the
metrics cover. In the past it was 2 days. Now it's probably going to 7 days,
potentially with abrupt volume increases factored in.
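
One way to picture the metrics window, as a hedged Python sketch: keep one count per day and let the oldest day fall off. The WindowedCount class and its is_spike rule (latest day compared against the mean of the earlier days) are assumptions made up for illustration; the post does not say how abrupt volume increases would be factored in.

```python
from collections import deque


class WindowedCount:
    """Daily counts for one metric of one IP over a fixed number of days."""

    def __init__(self, days=7):
        # With maxlen set, appending an eighth day discards the oldest one.
        self.daily = deque(maxlen=days)

    def add_day(self, count):
        self.daily.append(count)

    def total(self):
        """Sum over the window; this is what would feed the score."""
        return sum(self.daily)

    def is_spike(self, factor=5.0):
        """Hypothetical rule: latest day exceeds factor x mean of earlier days."""
        if len(self.daily) < 2:
            return False
        *earlier, latest = self.daily
        baseline = sum(earlier) / len(earlier)
        # Floor the baseline at 1 so a near-silent IP needs real volume to spike.
        return latest > factor * max(baseline, 1.0)
```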
All of the metrics numbers have "100% confidence". The scaling factors
are a confidence factor for each number.
They're somewhat predictable.
Eg: Our "TIS hit per spam" compliance factor is (currently) about 1 in
50. Ignoring other factors, assuming smooth distribution, a cf of 25
will cause the IP to block when the output is 50% spam. We put lots of
headroom in to allow for uneven distribution.
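
The arithmetic behind that example can be sketched as follows. This assumes complaints arrive at exactly 1 per 50 spams, evenly spread, and that the non-blocked volume is all of the IP's delivered mail; the function names are hypothetical. The post does not state the threshold, so the value that falls out (0.25 for cf = 25 at 50% spam) is an inference under those assumptions, not a quoted figure.

```python
def complaint_score(spam_fraction, complaint_rate=1 / 50, cf=25.0):
    """Complaints-only score for an IP whose output is spam_fraction spam.

    complaints = complaint_rate * spam volume, so after dividing by the
    non-blocked volume the score is cf * complaint_rate * spam_fraction.
    """
    return cf * complaint_rate * spam_fraction


def spam_fraction_at_block(threshold, complaint_rate=1 / 50, cf=25.0):
    """Spam fraction at which the complaints-only score reaches threshold."""
    return threshold / (cf * complaint_rate)
```

For a fixed threshold, doubling cf halves the spam fraction at which the block triggers, which is where the headroom for uneven distribution comes from.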
In the past we were using something like 50, 2 and .01 respectively for
cf, bf and tf.
Asrg mailing list