procmail

Re: Keep getting stubborn spam with random words

2004-03-10 09:57:17

    >> to each spam message they send to skip(_at_)pobox(_dot_)com I will
    >> quickly train on the few false negatives that slip through and shift
    >> the spamprob of each of the hammy words in the paragraph in the
    >> direction of spam.  That paragraph will cease to be effective and the
    >> spammer will have to find a new one.

    Sean> It still seems then that apparently _random_ words would slip by,
    Sean> _necessitating_ you to submit them for learning.

As I pointed out in an earlier message, by default Spambayes ignores all
tokens with a spam probability between 0.4 and 0.6.  Tokens which it has
never seen get a spamprob of 0.5 and are thus ignored.  The decision to
ignore such unsure tokens was based on a lot of testing.  Other statistical
filters may have arrived at a different decision about how to treat
unrecognized tokens.  I don't think Graham's initial code threw them out, so
people naively implementing the scheme he outlined in "A Plan for Spam"
would probably see their classifiers overwhelmed by nonsense words.
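
To make the point above concrete, here is a minimal Python sketch of the
idea.  It is not the Spambayes code; the names UNKNOWN_PROB, UNSURE_LOW,
UNSURE_HIGH and significant_tokens are made up for illustration, and
treating the band's endpoints as "kept" is an assumption.  Tokens missing
from the training data get 0.5, and anything strictly between 0.4 and 0.6
is dropped before scoring.

```python
# Illustrative sketch only; names and boundary handling are assumptions,
# not taken from the Spambayes source.
UNKNOWN_PROB = 0.5   # spamprob assigned to tokens never seen in training
UNSURE_LOW = 0.4     # lower edge of the ignored band
UNSURE_HIGH = 0.6    # upper edge of the ignored band


def significant_tokens(tokens, spamprobs):
    """Return sorted (token, spamprob) pairs that fall outside the unsure band.

    tokens    -- iterable of token strings from one message
    spamprobs -- dict mapping trained tokens to their spam probability
    """
    kept = []
    for tok in set(tokens):
        prob = spamprobs.get(tok, UNKNOWN_PROB)
        # Skip tokens whose probability says little either way.
        if UNSURE_LOW < prob < UNSURE_HIGH:
            continue
        kept.append((tok, prob))
    return sorted(kept)


if __name__ == "__main__":
    trained = {"viagra": 0.99, "meeting": 0.02, "the": 0.45}
    message = ["viagra", "xqzvb", "plonk", "meeting", "the"]
    print(significant_tokens(message, trained))
```

The nonsense words "xqzvb" and "plonk" default to 0.5 and never reach the
scoring step, so padding a message with random words neither helps nor
hurts it under this scheme.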

A huge amount of testing went into getting Spambayes to the state it is in
today.  The classifier and tokenizer differ substantially from what Graham
originally proposed.  (I suspect whatever tool he uses now does as well.  A
lot of water has passed under the bridge since he wrote his original essay.)
I won't try to summarize all the testing here.  Interested people are
encouraged to look at the archives of spambayes-dev(_at_)python(_dot_)org
and the very earliest archives of spambayes(_at_)python(_dot_)org (before
spambayes-dev existed).
In addition, reading the code for the tokenizer.py and classifier.py modules
in Spambayes is quite instructive even if you have no interest in filtering
spam or learning to program in Python.  They are perhaps the best-commented
bits of source code I've ever seen.  Essentially all credit for the rigorous
testing and the well-written and well-commented code goes to Tim Peters.

Links:

    spambayes archives: http://mail.python.org/pipermail/spambayes/
    spambayes-dev archives: http://mail.python.org/pipermail/spambayes-dev/
    classifier/tokenizer source: http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/submit.html
Got spam? http://spambayes.sf.net/
skip(_at_)pobox(_dot_)com

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail