[Asrg] The difference between spam corpora

All,

I have been wondering if any research has been done about the difference 
between different (kinds of) spam corpora*; I believe this is the right place 
to ask. (Oh, and hello, I am kind of new here too; a lurker for quite some 
time, but not sure if I've posted before.)

* throughout this email, by corpus I mean all emails in a live mail stream, 
used in real time.

To test a spam filter or an anti-spam method, or to do research about spam, one 
inevitably needs a spam corpus. As the spam sent to one email address, or 
even one corporation, is unlikely to be representative of all the spam sent 
globally during that period, most people add the spam sent to one or more spam 
traps to their test. There is nothing wrong with this approach, but, at least in 
theory, a lot of spam will not end up in such traps: mailings sent by dodgy 
ESPs; spam sent to addresses harvested from Outlook address books; spam sent to 
addresses obtained by hacking a company's customer database (or, perhaps more 
likely here in the UK, spam sent to addresses from a CD-ROM found on a train).

I am not sure how big a proportion of spam is of this latter kind, but I think 
it would be interesting to find out. Over the past months I have sent both our 
corporate mail stream and the spam from a distributed spam trap through a 
number of spam filters and the difference in performance was striking, with 
many products letting through ten or more times as much corporate spam as spam 
trap spam. Now, ease of filtering is just one way of quantifying a difference 
between spam corpora, but these results have led me to believe that spam traps, 
much as they are extremely useful, don't show the full picture.
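
To make the comparison concrete, here is a rough sketch of how one might
express "ten or more times as much" as a ratio of miss rates for a single
filter on two corpora. The function names and the counts are placeholders
made up for illustration; they are not my measurements, and this is only one
possible way of quantifying the difference.

```python
def miss_rate(missed, total):
    """Fraction of the spam in a corpus that the filter let through."""
    if total <= 0:
        raise ValueError("corpus must contain at least one spam message")
    return missed / total


def miss_ratio(corp_missed, corp_total, trap_missed, trap_total):
    """Corporate miss rate divided by spam trap miss rate for one filter.

    Returns infinity when the filter missed nothing in the trap corpus,
    since the ratio is then undefined in the usual sense.
    """
    trap = miss_rate(trap_missed, trap_total)
    if trap == 0:
        return float("inf")
    return miss_rate(corp_missed, corp_total) / trap


# Placeholder counts, purely illustrative (not real results).
example = miss_ratio(corp_missed=50, corp_total=1000,
                     trap_missed=5, trap_total=1000)
print(f"corporate spam is missed about {example:.1f} times as often")
```

Normalising by corpus size matters here, because a spam trap and a corporate
stream will rarely deliver the same volume of spam over the same period.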

Martijn.

Virus Bulletin Ltd, The Pentagon, Abingdon, OX14 3YP, England.
Company Reg No: 2388295. VAT Reg No: GB 532 5598 33.
_______________________________________________
Asrg mailing list
Asrg(_at_)irtf(_dot_)org
http://www.irtf.org/mailman/listinfo/asrg