
Re: Help on simple spam filter

2001-03-06 11:57:54
At 10:55 2001-03-06 -0500, Homer Wilson Smith wrote:
    The method we are playing with is a huge data base of specific
subject lines, 100,000 or more created by our users themselves as they

I have a file with ~ 202,000 domains in it which I refuse mail from (spamhavens and their ilk). The search criteria for these involves more than just one header field, and the performance bog can be severe (times thousands of messages a day) if it isn't tackled with some eye to optimization.

    A straight forward procmail recipe with 5000 recipe lines in it
stopped the mail server cold.

Well, uh, don't do that - there's a LOT of processing going on there. Depending on precisely how you were processing, you might have been invoking a shell on each recipe, or parsing for a header over and over and over...

Put the desired subject lines into a file, and grep it against the subject of the message being checked. This is even easier to manage, since the subjects file is a plain text file which you can easily append new content to. For the size of what you're talking about though, having something that actually refers to a _real_ database might offer a significant performance boost, since the match table file isn't being constantly loaded into memory for processing. In that case, you wouldn't be using fgrep, but instead, something written specifically to your needs.
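The shape of that check can be sketched in plain shell (is_spam_subject and the sample phrases here are illustrative, not from any actual setup):

```shell
# Sketch of the subject-vs-phrase-file check: fgrep -q prints nothing
# and exits 0 on a match, 1 on no match, so the exit status alone
# carries the spam/ham verdict.
is_spam_subject() {
    # $1 = subject text, $2 = file of match phrases, one per line
    printf '%s\n' "$1" | fgrep -q -i -f "$2"
}

phrases=$(mktemp)
printf 'FREE SOFTWARE\nMAKE MONEY FAST\n' > "$phrases"

if is_spam_subject 'Re: Fwd: FREE SOFTWARE BLOWOUT' "$phrases"; then
    echo "spam"
else
    echo "ham"
fi

rm -f "$phrases"
```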

Note that the recipe below matches substrings. If the Subject is "Re: Fwd: FREE SOFTWARE BLOWOUT" and your database includes "FREE SOFTWARE", it'd match - a feature which may significantly reduce the necessary size of your database and the number of comparisons you must perform (= faster handling, fewer system resources). It might be a good idea to re-process your file (with an external tool) when updating it, to ensure that you aren't adding elements which are redundant to something that would already match - and to catch cases where a newly added short phrase makes existing longer entries redundant.
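An external clean-up pass of that sort could be as simple as the following sketch (prune_subjects is a hypothetical helper, not part of any existing tool; it assumes one phrase per line and does a quadratic scan, which may get slow on very large files):

```shell
# Drop any phrase that contains a shorter phrase from the same list as
# a substring -- the shorter entry would have matched anyway.  Case is
# folded up front so comparisons behave like fgrep -i.
prune_subjects() {
    tr '[:upper:]' '[:lower:]' | sort -u | awk '
    { lines[NR] = $0 }
    END {
        for (i = 1; i <= NR; i++) {
            keep = 1
            for (j = 1; j <= NR; j++)
                if (j != i && length(lines[j]) < length(lines[i]) &&
                    index(lines[i], lines[j]) > 0) { keep = 0; break }
            if (keep) print lines[i]
        }
    }'
}

# e.g.:  prune_subjects < subj.dat > subj.new && mv subj.new subj.dat
```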

SPAMSUBJ=$SPAMDIR/subj.dat # spam subject phrase database

# this is done with fgrep -- it matches substrings, so beware token separation issues.
:0
* $? formail -xSubject: | fgrep -q -i -f $SPAMSUBJ
{
        LOG="SPAM: Subject Phrase match$SPAMVER"

        :0:
        |gzip -9fc>>$MAILDIR/spam.gz
}

(modify delivery to suit your tastes -- I don't discard, I compress and file)

For the really beefy searches (for me, those are domains in the headers), I use a custom grep I wrote called megagrep, which loads the match strings into a tree (because it expects to run through the list multiple times in the course of checking a single message, and a tree makes the process very efficient, esp when the program also takes the incoming text and forces case on it ONCE, making string matches that much quicker). It is NOT a full grep implementation (in fact, it isn't a true grep at all) - it makes very specific assumptions about the format of the data to be matched, which permits it to optimize for searches of that type of data.
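The "force case once" trick isn't limited to a custom tool. A rough shell equivalent (assuming the phrase file is stored pre-lowercased) would be the following - fgrep -i already handles case on its own, so the real payoff is for tools like megagrep that rescan the same text many times:

```shell
# Fold the message to lowercase a single time up front; with the
# phrase file pre-lowercased, every subsequent match can be a plain
# fixed-string search (no -i, no per-comparison case folding).
fold_and_match() {
    # $1 = pre-lowercased phrase file; message text on stdin
    tr '[:upper:]' '[:lower:]' | fgrep -q -f "$1"
}
```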

For subject checking, you don't need that sort of handling, but you might want to look at the problem from a parsing perspective and figure out where you're obviously wasting cycles.

    All we want is a simple recipe that will direct all e-mail
to the program of our choice,

See Philip's suggestion, which is nice and concise.

Alternately - and I haven't tried this - as long as the delivery is under a lockfile, I'd think you could pipe the complete message to your program, which could emit a procmail recipe. Since the recipe wouldn't have been invoked unless the lock was absent, you shouldn't have to worry about stepping on toes -- though if your filter program is obscenely slow, you could have backlog problems.
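Untested, but a rough sketch of that idea might look like this (classify-mail is a hypothetical program that reads the full message on stdin and writes a procmail recipe on stdout):

```
LOCKFILE=$MAILDIR/dynrc.lock

# 'c' keeps the message moving through the rcfile, 'w' waits for the
# classifier to finish before proceeding
:0 wc
| classify-mail > $MAILDIR/dynamic.rc

# execute whatever recipe the classifier just emitted
INCLUDERC=$MAILDIR/dynamic.rc

# reassigning LOCKFILE (here, to nothing) releases the previous lock
LOCKFILE=
```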

    is being designed around many different needs and may itself
    become quite complex depending upon what users want.

Then the supporting procmail script may need to become equally complex (though if having the invoked program create a procmail delivery recipe is workable, this means you'd isolate the changes to the support program).

Depending on the complexity, I would suggest breaking it out into several different programs - one for addressees, another for subjects, etc. You don't want to pipe the entire message body unless you have to -- so filter the "cheaper" (header-only) stuff first, hoping the message matches some criteria there without incurring the overhead of a body match. For instance, in my own spam filters, I check for header inconsistencies BEFORE I perform megagrep operations - if a simple malformed header (or the presence of any of a small list of X-Mailers) will peg it as spam, why waste the CPU cycles checking for spam via other methods?
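In an rcfile, that ordering looks something like the following (mailers.dat is a hypothetical small list of known-spamware X-Mailer strings; the expensive body/domain scans would be placed after it):

```
# cheap, header-only test first
:0
* ? formail -xX-Mailer: | fgrep -q -i -f $SPAMDIR/mailers.dat
{
        LOG="SPAM: X-Mailer match$SPAMVER"

        :0:
        |gzip -9fc>>$MAILDIR/spam.gz
}

# only messages that survive the cheap tests fall through to the
# expensive body scans below this point
```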

   What I don't want to have to do is actually *DELIVER* mail.

See above, which uses the return code from the grep invocation to determine whether there was a match (= spam), and takes action accordingly. Same goes for the solution Philip already offered, though his solution allows for parsing a text return (i.e. multiple results, versus just pass/fail).

   If procmail can do all this without a secondary piped program
I am all ears.

Well, as shown above, it can certainly do it using stock software. You should think of Procmail as a facilitator - often, it can do things entirely by itself, but when it can't, it resorts to managing contractors (spawned programs) to get the job done. Either way, it can get the job done if you want to tackle it in an organized fashion.

[snip - please trim posts]

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
