procmail

Re: Recipe for poorly spelled emails or blacklisted keywords?

2004-02-11 18:35:18

    >> Hi all.  Just wondering two things.  Is there a recipe that would
    >> look at an email and grade its spamyness based on how many misspelled
    >> words are in it?  I've seen some emails lately that have been getting
    >> through solely on the basis that the spammer is good at using
    >> garbage, random or intentionally misspelled things to defeat the
    >> filters.  So now I'm looking for a way to watch for those emails and
    >> just can them if they're too badly misspelled.  Is there a way to do
    >> that or not yet?

I'm new to the procmail list, so I don't know the history of spam filtering
here, but my experience with ad hoc spam filters is that they don't really
work (and I've tried them myself a few times in the past).  I've been
part of the Spambayes developer community just about since its inception and
use its sb_filter application from my .procmailrc file to score all my
email.  If you look at the clues that a (quasi) Bayesian spam filter uses,
you'll be amazed at how many there are that you'd never have thought to use.

As for random misspellings, this topic comes up periodically on the
Spambayes lists -- both user and developer.  Spelling analysis hasn't been
required so far.  The other tokenizing techniques already in use are more
than sufficient to properly classify incoming email with modest amounts of
training.  Reliance on spelling analysis also poses the problem of deciding
what to do about mail written in languages other than the recipient's
primary language.  Various Python-related mailing lists I subscribe to
periodically receive emails written in German or French (and other languages
to a lesser degree).  Someone can always read and respond to them.  It would
be a shame if they were ignored simply because nobody had a German or French
dictionary stitched into their spam filter.
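
To make the point about misspellings concrete, here is a toy sketch in
Python of token-based scoring.  It is not Spambayes's actual tokenizer or
combining scheme; the class name TinyBayes, the training messages and the
add-one smoothing are all made up for illustration.  The idea it shows is
that a deliberately mangled word such as "v1agra" is just another token,
and once it has turned up in trained spam it counts as a clue without any
dictionary being involved.  German ham is handled the same way.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word-like tokens; misspellings are kept as-is."""
    return set(re.findall(r"[a-z0-9$!'-]+", text.lower()))


class TinyBayes:
    """Hypothetical toy classifier, for illustration only."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

    def token_prob(self, token):
        # Add-one smoothing keeps every probability strictly between 0 and 1,
        # and gives an unseen token a neutral 0.5 when training is balanced.
        p_spam = (self.spam_counts[token] + 1) / (self.nspam + 2)
        p_ham = (self.ham_counts[token] + 1) / (self.nham + 2)
        return p_spam / (p_spam + p_ham)

    def score(self, text):
        # Sum per-token log-odds (naive Bayes, equal priors), then squash
        # the total back into a 0..1 "spamminess" score.
        log_odds = 0.0
        for tok in tokenize(text):
            p = self.token_prob(tok)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-log_odds))


clf = TinyBayes()
clf.train("Buy ch3ap v1agra now", is_spam=True)
clf.train("ch3ap m0rtgage rates act now", is_spam=True)
clf.train("Meeting notes for the procmail recipe", is_spam=False)
clf.train("Bitte schicken Sie mir das Python Skript", is_spam=False)

print(clf.score("v1agra ch3ap offer"))                # leans toward spam
print(clf.score("Bitte das Python Skript schicken"))  # leans toward ham
```

With only four training messages the numbers mean little, but the
direction is the point: the misspelled tokens push the first message
toward spam, while the German message scores as ham because its words
were seen in ham, not because anything knows how to spell them.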

At any rate, I suggest you investigate any of a number of (quasi) Bayesian
filters.  Besides Spambayes, there is SpamAssassin with its Bayesian
component, Bogofilter, CRM114 and I'm sure others I'm forgetting.  The
Spambayes URL is in my .sig and the site contains links to a number of
related systems.

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/submit.html
Got spam? http://spambayes.sf.net/
skip(_at_)pobox(_dot_)com

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
