Does anyone besides SpamAssassin have a decent corpus of spam for training
and testing filters? It could come with or without non-spam messages.
A search turned up the following:
[1] The spam corpora from Ion Androutsopoulos' papers, linked from here:
http://www.aueb.gr/users/ion/publications.html
[2] The Spambase collection:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/spambase
[3] SpamAssassin's public corpus:
http://www.spamassassin.org/publiccorpus/
[4] Another corpus that turned up, spam only:
http://clg.wlv.ac.uk/projects/junk-email/
[5] Grant Taylor's collection of spam:
http://www2.picante.com:81/~gtaylor/download/spam.tar.gz
As best I can tell, [1] is the material used for the original Bayesian
filtering papers. Unfortunately, it has already been preprocessed and
contains only the Subject: line and body of each message. [2] seems to
contain just results rather than the messages themselves, and [4] is also
preprocessed and missing most of the headers.
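For anyone wanting to check how much of the headers a given corpus retains,
here is a minimal sketch using Python's standard email module. The list of
expected headers, the function name missing_headers and the two sample
messages are illustrative assumptions, not anything taken from the corpora
above.

```python
from email import message_from_string

# Headers an unprocessed message would normally carry (illustrative list).
EXPECTED_HEADERS = ["Received", "From", "To", "Date", "Message-ID", "Subject"]


def missing_headers(raw_message, expected=EXPECTED_HEADERS):
    """Return the expected header names that are absent from raw_message."""
    msg = message_from_string(raw_message)
    # Header lookup in the email module is case-insensitive.
    return [name for name in expected if msg.get(name) is None]


# A made-up message reduced to Subject: and body, as described for [1].
processed = "Subject: cheap loans\n\nAct now!\n"

# A made-up message that still has its transport headers.
full = (
    "Received: from mail.example.com by mx.example.org; "
    "Wed, 1 Jan 2003 00:00:00 +0000\n"
    "From: sender@example.com\n"
    "To: victim@example.org\n"
    "Date: Wed, 1 Jan 2003 00:00:00 +0000\n"
    "Message-ID: <1@example.com>\n"
    "Subject: cheap loans\n"
    "\n"
    "Act now!\n"
)

print(missing_headers(processed))
print(missing_headers(full))
```

Running this over a sample of files from each corpus should give a rough idea
of whether it is usable for filters that rely on header features.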
Does anyone else have favourite links for spam collections?
Terri
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg