All,
I have been wondering whether any research has been done on the differences
between (kinds of) spam corpora*; I believe this is the right place
to ask. (Oh, and hello, I am kind of new here too; a lurker for quite some
time, but not sure if I've posted before.)
* throughout this email, by corpus I mean all emails in a live mail stream,
used in real time.
To test a spam filter or an anti-spam method, or to do research about spam, one
inevitably needs a spam corpus. As the spam sent to one email address, or
even one corporation, is unlikely to be representative of all the spam sent
globally during that period, most people add the spam sent to one or more spam
traps to their test. There is nothing wrong with this approach, but, at least in
theory, a lot of spam will not end up in such traps: mailings sent by dodgy
ESPs; spam sent to addresses harvested from Outlook address books; spam sent to
addresses obtained by hacking a company's customer database (or, perhaps more
likely here in the UK, spam sent to addresses from a CD-ROM found on a train).
I am not sure how big a proportion of spam is of this latter kind, but I think
it would be interesting to find out. Over the past months I have sent both our
corporate mail stream and the spam from a distributed spam trap through a
number of spam filters and the difference in performance was striking, with
many products letting through ten or more times as much corporate spam as spam
trap spam. Now, ease of filtering is just one way of quantifying a difference
between spam corpora, but these results have led me to believe that spam traps,
much as they are extremely useful, don't show the full picture.
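To make the comparison concrete: what I mean by "ten or more times as much" is
the ratio of a filter's miss rate on the corporate stream to its miss rate on
the trap stream. A minimal sketch of that calculation follows; the function
names and the counts are made up purely for illustration and are not my
measured results.

```python
import math


def miss_rate(missed, total):
    """Fraction of spam messages that a filter let through."""
    if total <= 0:
        raise ValueError("total must be positive")
    return missed / total


def miss_ratio(corp_missed, corp_total, trap_missed, trap_total):
    """How many times more often corporate spam is missed than trap spam."""
    corp_rate = miss_rate(corp_missed, corp_total)
    trap_rate = miss_rate(trap_missed, trap_total)
    if trap_rate == 0:
        # The filter caught all trap spam, so the ratio is undefined (nan)
        # or unbounded (inf), depending on the corporate miss rate.
        return math.inf if corp_rate > 0 else math.nan
    return corp_rate / trap_rate


# Hypothetical counts: 50 of 1000 corporate spam messages missed,
# 5 of 1000 spam trap messages missed.
ratio = miss_ratio(50, 1000, 5, 1000)
print(f"corporate spam is missed {ratio:.1f} times as often as trap spam")
```

A ratio well above 1 for most filters would suggest that the two corpora
differ in ways that matter for filtering.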
Martijn.
Virus Bulletin Ltd, The Pentagon, Abingdon, OX14 3YP, England.
Company Reg No: 2388295. VAT Reg No: GB 532 5598 33.