Terry Sullivan wrote:
On Thu, 02 Oct 2003 00:07:24 -0400, Yakov Shafranovich wrote:
Terry Sullivan wrote:
If not, then the ASRG research agenda may need to be trimmed back.
As for archival data, SpamArchive and others provide lots of it.
The FTC also maintains a spam archive and they might be open to
the idea of running something against it if we ask them.
Hey, I'm all for analysis of passive data. But many of the
questions/proposals, etc. that have been floated on this list, from
greylisting to 550/CR, can't be answered by analyzing archival data.
A comment, since I've communicated some with Terry about this.
Not only do we need more active data, we really need to run
different methods on the same accounts/sites at the same time;
that might help in analyzing some situations.
I have been running my proof-of-concept for the Earnest method
(filtering only on included URLs/call center numbers, for those
of you who missed it) on an old spam-ridden e-mail account, but
when Blaster hit in mid-Aug. my ISP gave up RBL and started up
SpamAssassin in Bayesian mode. So I had both on the same account.
It gave some really interesting effects and comparisons. My
data shows that both methods have a hit rate of ca 85% each,
Earnest just above and SpamAssassin just below, but combined
they are incredibly effective: they stop well over 99% of the
spams with a low false positive rate, 2 out of 180 real mails
or ca 2 in 900 of the total amount of traffic. One because the
sender was from a domain that the Earnest data file extraction
filter had pulled out of a spam as an unrelated URL domain,
like a Hotmail link. The other because HTML in the mail got
SpamAssassin to label it as spam.
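
As a rough check on the figures above, the rates can be worked out
directly. This is a minimal sketch that uses only the counts quoted
in this message (ca 900 mails, 180 real, 2 false positives, ca 85%
hit rate per method). The independence baseline is an assumption
added for comparison, not something that was measured.

```python
# Counts quoted in the message (all approximate, "ca" figures).
total_mails = 900
real_mails = 180
false_positives = 2
hit_rate_each = 0.85  # both methods, roughly

spam_mails = total_mails - real_mails

# False positive rate measured against real mail and against all traffic.
fp_rate_real = false_positives / real_mails
fp_rate_total = false_positives / total_mails

# Assumption: if the two methods missed spam independently of each other,
# the combined hit rate would be 1 - (miss rate)^2.
independent_combined = 1 - (1 - hit_rate_each) ** 2

print(f"spam mails:            {spam_mails}")
print(f"FP rate (real mail):   {fp_rate_real:.2%}")
print(f"FP rate (all traffic): {fp_rate_total:.2%}")
print(f"independent baseline:  {independent_combined:.2%}")
```

The independence baseline comes out a little under 98%, so a
combined rate of well over 99% would suggest that the two methods
tend to miss different spams. With a single mailbox that is only
a hint.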
The limitation of Earnest, as pointed out here before, is that
the data file with URLs and numbers needs to be updated often
and correctly, while a method like SpamAssassin with Bayesian
seems to be unable to recognize some types of spam (I haven't
yet had time to check that).
A method like SpamAssassin also needs more resources to handle
its analysis. Earnest has ca 19,000 URLs, ca 400K, and works
with a simple grep, while SpamAssassin uses a number of criteria
and typically has ca 4-6 MB of data.
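
To illustrate how light a grep-style lookup is, here is a minimal
sketch. The data file format (one URL domain or phone number per
line), the function names and the sample entries are all
assumptions made for illustration; the actual Earnest data file
is not described in this message.

```python
def load_patterns(lines):
    """Return a list of lowercased, non-empty patterns (hypothetical format:
    one URL domain or phone number per line)."""
    return [line.strip().lower() for line in lines if line.strip()]


def earnest_match(body, patterns):
    """Return the first pattern found in the mail body, or None.

    Plain case-insensitive substring search, roughly what
    `grep -i -F -f datafile` does.
    """
    text = body.lower()
    for pattern in patterns:
        if pattern in text:
            return pattern
    return None


# Made-up example entries, not taken from any real data file.
patterns = load_patterns(["cheap-pills.example", "1-800-555-0100", ""])

spam = "Order now at http://www.Cheap-Pills.example/buy or call us!"
ham = "Minutes from Tuesday's meeting are attached."

print(earnest_match(spam, patterns))  # cheap-pills.example
print(earnest_match(ham, patterns))   # None
```

A plain substring match like this also shows how the Hotmail-style
false positive can happen: once a shared domain lands in the data
file, any mail mentioning that domain is flagged.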
But the longer I run this combination, the more I lean toward
a multi-stage process: two or three methods that go through the
mails, the most lightweight first, leaving the trickier cases
to the more competent (and most resource-hungry) ones.
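
The staged idea above could look something like the following
sketch. Both checks are hypothetical stand-ins (a list lookup and
a crude keyword score), not Earnest or SpamAssassin themselves;
the point is only the ordering, cheapest first, with later stages
seeing just the mails that earlier stages let through.

```python
def classify(mail, stages):
    """Run the stages in order and stop at the first one that flags spam.

    `stages` is a list of (name, check) pairs, cheapest first; each check
    takes the mail text and returns True if it considers the mail spam.
    Returns the name of the stage that flagged the mail, or None.
    """
    for name, check in stages:
        if check(mail):
            return name
    return None


# Stand-in checks, both hypothetical: a cheap list lookup and a crude
# keyword score in place of a real content analyzer.
def cheap_list_check(mail):
    return "cheap-pills.example" in mail.lower()


def costly_content_check(mail):
    words = ("free", "winner", "guarantee")
    return sum(word in mail.lower() for word in words) >= 2


stages = [("list", cheap_list_check), ("content", costly_content_check)]

print(classify("Visit cheap-pills.example today", stages))      # list
print(classify("You are a WINNER, free prize inside", stages))  # content
print(classify("Lunch on Friday?", stages))                     # None
```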
Unfortunately the figures above are based on a rather small
amount of data: one mailbox with ca 900 mails, 180 real and the
rest spam, from Aug. 18 to Sept. 23. SpamAssassin was trained on
ca 1500 active mailboxes, where the average spam level is ca 4
spams/day/box, while the Earnest data was manually extracted
from 130,000 spams and some hundreds more "live" through an
automatic honeytrap.
Therefore it would be interesting to get other sites with more
volume and a number of mailboxes that get about the same spams
to also run different methods in parallel, as well as some
combinations, in order to get more input.
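
For sites that do run two methods side by side, a tally along the
following lines would make the results comparable. This is only a
sketch: the record layout (hand-labelled spam flag plus one
verdict per method) and the sample verdicts are made up for
illustration.

```python
from collections import Counter


def compare(records):
    """Tally how two methods agree on a set of hand-labelled mails.

    Each record is (is_spam, flagged_by_a, flagged_by_b), all booleans.
    """
    tally = Counter()
    for is_spam, a, b in records:
        kind = "spam" if is_spam else "ham"
        tally[kind] += 1
        if a:
            tally[kind + "_a"] += 1
        if b:
            tally[kind + "_b"] += 1
        if a or b:
            tally[kind + "_either"] += 1
    return tally


# Made-up verdicts for illustration only.
records = [
    (True, True, False),
    (True, False, True),
    (True, True, True),
    (True, False, False),
    (False, False, False),
    (False, False, True),
]

t = compare(records)
print(f"A hit rate:        {t['spam_a'] / t['spam']:.0%}")
print(f"B hit rate:        {t['spam_b'] / t['spam']:.0%}")
print(f"combined hit rate: {t['spam_either'] / t['spam']:.0%}")
print(f"combined FP rate:  {t['ham_either'] / t['ham']:.0%}")
```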
Asrg mailing list