Need HAM:
After we settle on definitions, the main missing ingredient is a good
HAM corpus attached to similarly-sampled SPAM. Multi-language ham is
especially needed (I know the SpamAssassin team has issued a call for
it, but I don't know whether it will arrive).
Need SNAPSHOTS:
In addition, time-indexed snapshots of external sources of information
(RBLs, DCC, Razor, etc.) would be helpful as well. Does anyone know
whether the operators of those retain any historical data?
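
For the RBL case, a time-indexed snapshot could be as simple as logging
each lookup result with a UTC timestamp. The following is only a rough
sketch under my own assumptions: the zone name "dnsbl.example" is a
placeholder rather than a real list, the function names are made up for
illustration, and only IPv4 is handled. DCC and Razor are checksum-based
and would need their own clients, so they are not covered here.

```python
import socket
from datetime import datetime, timezone


def dnsbl_query_name(ip, zone):
    """Build a DNSBL query name: reversed IPv4 octets followed by the zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone


def lookup(name):
    """Return the A-record answer, or None if the name does not resolve."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


def snapshot(ip, zone, resolver=lookup, now=None):
    """Return one time-indexed record of what the list said about ip."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "time": now.isoformat(),
        "ip": ip,
        "zone": zone,                      # hypothetical zone name
        "answer": resolver(dnsbl_query_name(ip, zone)),
    }
```

Appending one such record per lookup to a log would let a later
experiment replay what the list said at the time each message arrived,
instead of what it says today.
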
I have been accumulating some of both. I'm not sure yet whether I'll
publish it, or in what form (blinded or clear). Unfortunately I have very
little non-English ham, so my learning classifiers always lump all
Chinese, Portuguese, and Turkish text into the spam category.
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg