On Mon, 10 Mar 2003 18:39:57 EST, Kee Hinckley said:
I currently have a sample database of 22,000 confirmed spam messages
sent to roughly 200 real email accounts.
40% blocked by the country restriction.
4% blocked due to obvious viruses.
14% blocked due to system blacklist.
<1% blocked by user blacklists.
There's less than three percent overlap between those factors. The
Actually, there's a hidden assumption here that means that there's
a lot MORE than 3% overlap. Your 14% system blacklist refers to a
blacklist that was tailored thinking "and this list doesn't include
anything from .XY because we country-restrict them already".
What's *really* there is a system blacklist that accounts for 54%
of catches, where 70% of the rules are country-based and the other 30%
are rules to catch stuff the country rules don't...
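To make the hidden-overlap point concrete, here is a minimal sketch. The message counts and filter names are invented for illustration and are not taken from the sample database. It assumes the filters run in a fixed order and only the first match is logged, and compares that with what each filter would catch if it ran on its own.

```python
from collections import Counter

# Hypothetical data: each entry is the set of filters that WOULD match a
# message if every filter were built and run independently of the others.
messages = (
    [{"country", "blacklist"}] * 30   # spamhaus hosts inside a restricted country
    + [{"country"}] * 10              # caught only by the country rule
    + [{"blacklist"}] * 14            # caught only by the system blacklist
    + [set()] * 46                    # caught by neither
)

def first_match_counts(msgs, order):
    """Credit each message to the first filter in `order` that matches it."""
    counts = Counter()
    for matched in msgs:
        for name in order:
            if name in matched:
                counts[name] += 1
                break
    return counts

def independent_counts(msgs):
    """Credit each message to every filter that matches it."""
    counts = Counter()
    for matched in msgs:
        counts.update(matched)
    return counts

logged = first_match_counts(messages, ["country", "blacklist"])
actual = independent_counts(messages)
overlap = sum(1 for m in messages if {"country", "blacklist"} <= m)

print("logged:", dict(logged))
print("independent:", dict(actual))
print("overlap:", overlap)
```

With these made-up numbers the log shows 40 country hits and 14 blacklist hits with no apparent overlap, while an independently built blacklist would have matched 44 messages, 30 of them also covered by the country rule.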
Pick a country .XY and analyze it carefully - it's fairly likely that
if you didn't filter the country, you'd blacklist 3-4 spamhauses that
are 95% of the problem in that country.
The important question of course becomes whether or not the *rest* of
that country's population will start using e-mail enough to increase the
risk of false positives and skew your stats... ;)
rest are blocked solely on problems we saw with the headers. There's
certainly overlap between that and the other factors, but we don't
currently log it specifically, so I don't know how much.
It would be interesting and informative to have some other numbers.
What percent of mail was tagged with the country restriction but *NOT*
tagged as spam by users? (For instance, it would be quite easy to flag
all mail from .CN as spam - and although my users would probably tag back
100% of the spam from .CN, they'd not tag 100% of the mail from .CN, as
many have relatives there.) The fact that 40% of spam fails the country
test is not at all a reliable predictor unless there is a near-zero rate
of non-spam that fails the country test.
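A rough sketch of why the 40% figure alone says little: the function below applies Bayes' rule to estimate the chance that a message failing the country test is actually spam. Only the 0.40 hit rate comes from the numbers above. The spam share of all mail (0.5) and the false-positive rates are assumptions picked purely for illustration.

```python
def p_spam_given_flag(hit_rate, false_positive_rate, spam_share):
    """Bayes' rule: P(spam | message fails the country test).

    hit_rate            -- P(fails test | spam), 0.40 in the figures above
    false_positive_rate -- P(fails test | not spam), unknown, assumed here
    spam_share          -- P(spam) among all mail, unknown, assumed here
    """
    flagged_spam = hit_rate * spam_share
    flagged_ham = false_positive_rate * (1.0 - spam_share)
    return flagged_spam / (flagged_spam + flagged_ham)

# Sweep over assumed false-positive rates with an assumed 50% spam share.
for fp in (0.0, 0.01, 0.10, 0.40):
    print(f"false positive rate {fp:.2f}: "
          f"P(spam | flagged) = {p_spam_given_flag(0.40, fp, 0.5):.3f}")
```

Under these assumptions the test is a perfect predictor only when no legitimate mail fails it, and once legitimate mail fails the test as often as spam does (0.40), a flagged message is no more likely to be spam than any other message.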
Is the "user blacklist" number the percentage caught by pre-established
user filters, or is that saying that your other checks were 99% effective
in identifying spam and only 1% got through to users for them to report?
Do you have any guesstimates of how much *unreported* spam got through
to the 200 accounts?
Or to turn up the satire, and point out the problem with the analysis:
40% of spammers drank milk at breakfast the day they spammed
I saw an amusing statistic once that 99.97% of all felonies are committed
while breathing air.... ;)