Re: Help on simple spam filter
2001-03-06 11:57:54
At 10:55 2001-03-06 -0500, Homer Wilson Smith wrote:
> The method we are playing with is a huge data base of specific
> subject lines, 100,000 or more created by our users themselves as they [...]
I have a file with ~ 202,000 domains in it which I refuse mail from
(spamhavens and their ilk). The search criteria for these involve more
than just one header field, and the performance bog can be severe (times
thousands of messages a day) if it isn't tackled with some eye to optimization.
> A straight forward procmail recipe with 5000 recipe lines in it
> stopped the mail server cold.
Well, uh, don't do that - there's a LOT of processing going on
there. Depending on precisely how you were processing, you might have been
invoking a shell on each recipe, or parsing for a header over and over and
over...
Put the desired subject lines into a file, and grep it against the subject
of the message being checked. This is even easier to manage, since the
subjects file is a plain text file which you can easily append new content
to. For the size of what you're talking about though, having something
that actually refers to a _real_ database might offer a significant
performance boost, since the match table file isn't being constantly loaded
into memory for processing. In that case, you wouldn't be using fgrep, but
instead, something written specifically to your needs.
Note that the below recipe will match substrings. If the Subject is "Re:
Fwd: FREE SOFTWARE BLOWOUT", and your database includes "FREE SOFTWARE",
it'd match - a feature which may significantly reduce the necessary size of
your database and the number of comparisons you must perform (= faster
handling, fewer system resources). It might be a good idea to re-process
your file (with an external tool) when updating it, to ensure that you
aren't adding entries which are redundant to something already in the file,
and to weed out existing entries which a newly added, shorter string would
already cover (a small sketch of that follows the recipe below).
SPAMSUBJ=$SPAMDIR/subj.dat # spam subject phrase database

# the matching is done with fgrep (fixed strings) -- hence the token
# separation (substring match) behaviour noted above.
:0
* $? formail -xSubject: | fgrep -q -i -f $SPAMSUBJ
{
  LOG="SPAM: Subject Phrase match$SPAMVER"

  :0:
  |gzip -9fc>>$MAILDIR/spam.gz
}
(modify delivery to suit your tastes -- I don't discard, I compress and file)
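On the re-processing idea mentioned above: here's a rough, untested sketch
of an offline pass over the phrase file (the filenames are just examples)
which drops any entry that some other, shorter entry already matches as a
substring:

#!/bin/sh
# untested sketch - prune phrase-file entries which some other entry
# already matches as a substring (filenames are illustrative).
# It's O(n^2), so it's meant as an occasional offline maintenance pass,
# not something to run per message.
sort -u -f subj.dat > subj.sorted
: > subj.pruned
while read -r line; do
  # every entry EXCEPT the current one
  grep -i -v -x -F -e "$line" subj.sorted > others.tmp
  # keep this entry only if no other entry is a substring of it
  if ! printf '%s\n' "$line" | fgrep -q -i -f others.tmp; then
    printf '%s\n' "$line" >> subj.pruned
  fi
done < subj.sorted
rm -f others.tmp

What's left in subj.pruned is just the shortest strings which actually do
the matching.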
For the really beefy searches (for me, those are domains in the headers), I
use a custom grep I wrote called megagrep, which loads the match strings
into a tree (because it expects to run through the list multiple times in
the course of checking a single message, and a tree makes the process very
efficient, esp when the program also takes the incoming text and forces
case on it ONCE, making string matches that much quicker). It is NOT a
full grep implementation (in fact, it isn't a true grep at all) - it makes
very specific assumptions about the format of the data to be matched, which
permits it to optimize for searches of that type of data.
For subject checking, you don't need that sort of handling, but you might
want to look at the problem from a parsing perspective and figure out where
you're obviously wasting cycles.
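As one concrete example of that (nothing below comes from the recipe above -
the variable name and the sample phrase are just placeholders), you can
extract the Subject once per message and then test the variable in as many
recipes as you like, instead of invoking formail in every condition:

# extract the Subject once - procmail feeds the message to the backquotes
SUBJ=`formail -zxSubject:`

:0:
* SUBJ ?? FREE SOFTWARE
|gzip -9fc>>$MAILDIR/spam.gz

Every later recipe which wants the subject then gets it for free.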
> All we want is a simple recipe that will direct all e-mail
> to the program of our choice,
See Philip's suggestion, which is nice and concise.
Alternately - and I haven't tried this - as long as the delivery is
under a lockfile, I'd think that you could pipe the complete message to
your program, which could emit a procmail recipe. Since the recipe
wouldn't have been invoked unless the lock was absent, you shouldn't have
to worry about stepping on toes -- though if your filter program is
obscenely slow, you could have backlog problems.
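In (equally untested) procmail terms, with a hypothetical filter program
and rc path, that might look something like:

# sketch only - "yourfilter" and $PMDIR are hypothetical names.
# The filter reads the whole message on stdin and writes a procmail
# recipe (or nothing) to stdout; we then include whatever it wrote.
:0 wc: generated.lock
| yourfilter > $PMDIR/generated.rc

INCLUDERC=$PMDIR/generated.rc

If the filter has nothing to say about the message, it emits nothing and
the generated rc file is a no-op.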
> is being designed around many different needs and may itself
> become quite complex depending upon what users want.
Then the supporting procmail script may need to become equally complex
(though if having the invoked program create a procmail delivery recipe is
workable, this means you'd isolate the changes to the support program).
Depending on the complexity, I would suggest breaking it out into several
different programs - one for addressees, another for subjects, etc. You
don't want to pipe the entire message body unless you have to -- so you'd
filter the "cheaper" (header only) stuff first, hoping that the message
matches some criteria there without necessitating the overhead of the body
match. For instance, in my own spam filters, I check for header
inconsistencies BEFORE I perform megagrep operations - if a simple
malformed header (or the presence of any of a small list of X-Mailers)
will peg it as spam, why waste the CPU cycles checking for spam via other
methods?
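Schematically (the X-Mailer names here are made up - substitute whatever
your own logs tell you is a reliable marker), that ordering looks like:

# cheap, header-only test first (the mailer names are invented examples)
:0:
* ^X-Mailer:.*(SpamBlaster|BulkMailPro)
|gzip -9fc>>$MAILDIR/spam.gz

# only messages which survive the cheap tests fall through to the
# expensive fgrep/megagrep recipes further down the rcfile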
> What I don't want to have to do is actually *DELIVER* mail.
See above, which uses the return code from the grep invocation to determine
whether there was a match (= spam), and takes action accordingly. Same
goes for the solution Philip already offered, though his solution allows
for parsing a text return (i.e. multiple results, versus just pass/fail).
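If you go the text-return route, one way to handle it (sketched with a
hypothetical "spamcheck" program) is to capture the program's one-line
verdict in a variable and branch on that:

# hypothetical classifier which prints e.g. "spam" or "ok" on stdout;
# procmail feeds the message to the backquoted command on stdin
VERDICT=`spamcheck`

:0:
* VERDICT ?? ^spam
|gzip -9fc>>$MAILDIR/spam.gz

That gives you multiple possible results from one invocation, with nothing
but stock procmail doing the dispatch.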
> If procmail can do all this without a secondary piped program
> I am all ears.
Well, as shown above, it can certainly do it using stock software. You
should think of Procmail as a facilitator - often, it can do things
entirely by itself, but when it can't, it resorts to managing contractors
(spawned programs) to get the job done. Either way, it can get the job
done if you want to tackle it in an organized fashion.
[snip - please trim posts]
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail@lists.RWTH-Aachen.DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail