Re: Help on simple spam filter
2001-03-06 11:57:54
At 10:55 2001-03-06 -0500, Homer Wilson Smith wrote:
> The method we are playing with is a huge data base of specific
> subject lines, 100,000 or more created by our users themselves as they [...]
I have a file with ~ 202,000 domains in it which I refuse mail from
(spamhavens and their ilk). The search criteria for these involve more
than just one header field, and the performance bog can be severe (times
thousands of messages a day) if it isn't tackled with some eye to optimization.
> A straight forward procmail recipe with 5000 recipe lines in it
> stopped the mail server cold.
Well, uh, don't do that - there's a LOT of processing going on
there. Depending on precisely how you were processing, you might have been
invoking a shell on each recipe, or parsing for a header over and over and
over...
Put the desired subject lines into a file, and grep it against the subject
of the message being checked. This is even easier to manage, since the
subjects file is a plain text file which you can easily append new content
to. For the size of what you're talking about though, having something
that actually refers to a _real_ database might offer a significant
performance boost, since the match table file isn't being constantly loaded
into memory for processing. In that case, you wouldn't be using fgrep, but
instead, something written specifically to your needs.
Note that the below recipe will match substrings. If the Subject is "Re:
Fwd: FREE SOFTWARE BLOWOUT", and your database includes "FREE SOFTWARE",
it'd match - a feature which may significantly reduce the necessary size of
your database and the number of comparisons you must perform (= faster
handling, fewer system resources). It might be a good idea to re-process
your file (with an external tool) when updating it, to ensure that you
aren't adding entries which are redundant to something already in the file,
and to weed out existing entries which a newly added, shorter string would
already cover (a small sketch of that follows the recipe below).
SPAMSUBJ=$SPAMDIR/subj.dat # spam subject phrase database

# the matching is done with fgrep (fixed strings) -- hence the token
# separation (substring match) behaviour noted above.
:0
* $? formail -xSubject: | fgrep -q -i -f $SPAMSUBJ
{
  LOG="SPAM: Subject Phrase match$SPAMVER"

  :0:
  |gzip -9fc>>$MAILDIR/spam.gz
}
(modify delivery to suit your tastes -- I don't discard, I compress and file)
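On the re-processing idea mentioned above: here's a rough, untested sketch
of an offline pass over the phrase file (the filenames are just examples)
which drops any entry that some other, shorter entry already matches as a
substring:

#!/bin/sh
# untested sketch - prune phrase-file entries which some other entry
# already matches as a substring (filenames are illustrative).
# It's O(n^2), so it's meant as an occasional offline maintenance pass,
# not something to run per message.
sort -u -f subj.dat > subj.sorted
: > subj.pruned
while read -r line; do
  # every entry EXCEPT the current one
  grep -i -v -x -F -e "$line" subj.sorted > others.tmp
  # keep this entry only if no other entry is a substring of it
  if ! printf '%s\n' "$line" | fgrep -q -i -f others.tmp; then
    printf '%s\n' "$line" >> subj.pruned
  fi
done < subj.sorted
rm -f others.tmp

What's left in subj.pruned is just the shortest strings which actually do
the matching.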
For the really beefy searches (for me, those are domains in the headers), I
use a custom grep I wrote called megagrep, which loads the match strings
into a tree (because it expects to run through the list multiple times in
the course of checking a single message, and a tree makes the process very
efficient, esp when the program also takes the incoming text and forces
case on it ONCE, making string matches that much quicker). It is NOT a
full grep implementation (in fact, it isn't a true grep at all) - it makes
very specific assumptions about the format of the data to be matched, which
permits it to optimize for searches of that type of data.
For subject checking, you don't need that sort of handling, but you might
want to look at the problem from a parsing perspective and figure out where
you're obviously wasting cycles.
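As one concrete example of that (nothing below comes from the recipe above -
the variable name and the sample phrase are just placeholders), you can
extract the Subject once per message and then test the variable in as many
recipes as you like, instead of invoking formail in every condition:

# extract the Subject once - procmail feeds the message to the backquotes
SUBJ=`formail -zxSubject:`

:0:
* SUBJ ?? FREE SOFTWARE
|gzip -9fc>>$MAILDIR/spam.gz

Every later recipe which wants the subject then gets it for free.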
> All we want is a simple recipe that will direct all e-mail
> to the program of our choice,
See Philip's suggestion, which is nice and concise.
Alternately - and I haven't tried this - as long as the delivery is
under a lockfile, I'd think that you could pipe the complete message to
your program, which could emit a procmail recipe. Since the recipe
wouldn't have been invoked unless the lock was absent, you shouldn't have
to worry about stepping on toes -- though if your filter program is
obscenely slow, you could have backlog problems.
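In (equally untested) procmail terms, with a hypothetical filter program
and rc path, that might look something like:

# sketch only - "yourfilter" and $PMDIR are hypothetical names.
# The filter reads the whole message on stdin and writes a procmail
# recipe (or nothing) to stdout; we then include whatever it wrote.
:0 wc: generated.lock
| yourfilter > $PMDIR/generated.rc

INCLUDERC=$PMDIR/generated.rc

If the filter has nothing to say about the message, it emits nothing and
the generated rc file is a no-op.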
> is being designed around many different needs and may itself
> become quite complex depending upon what users want.
Then the supporting procmail script may need to become equally complex
(though if having the invoked program create a procmail delivery recipe is
workable, this means you'd isolate the changes to the support program).
Depending on the complexity, I would suggest breaking it out into several
different programs - one for addressees, another for subjects, etc. You
don't want to pipe the entire message body unless you have to -- so you'd
filter the "cheaper" (header only) stuff first, hoping that the message
matches some criteria there without necessitating the overhead of the body
match. For instance, in my own spam filters, I check for header
inconsistencies BEFORE I perform megagrep operations - if a simple
malformed header (or the presence of any of a small list of X-Mailers)
will peg it as spam, why waste the CPU cycles checking for spam via other
methods?
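Schematically (the X-Mailer names here are made up - substitute whatever
your own logs tell you is a reliable marker), that ordering looks like:

# cheap, header-only test first (the mailer names are invented examples)
:0:
* ^X-Mailer:.*(SpamBlaster|BulkMailPro)
|gzip -9fc>>$MAILDIR/spam.gz

# only messages which survive the cheap tests fall through to the
# expensive fgrep/megagrep recipes further down the rcfile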
> What I don't want to have to do is actually *DELIVER* mail.
See above, which uses the return code from the grep invocation to determine
whether there was a match (= spam), and takes action accordingly. Same
goes for the solution Philip already offered, though his solution allows
for parsing a text return (i.e. multiple results, versus just pass/fail).
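If you go the text-return route, one way to handle it (sketched with a
hypothetical "spamcheck" program) is to capture the program's one-line
verdict in a variable and branch on that:

# hypothetical classifier which prints e.g. "spam" or "ok" on stdout;
# procmail feeds the message to the backquoted command on stdin
VERDICT=`spamcheck`

:0:
* VERDICT ?? ^spam
|gzip -9fc>>$MAILDIR/spam.gz

That gives you multiple possible results from one invocation, with nothing
but stock procmail doing the dispatch.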
> If procmail can do all this without a secondary piped program
> I am all ears.
Well, as shown above, it can certainly do it using stock software. You
should think of Procmail as a facilitator - often, it can do things
entirely by itself, but when it can't, it resorts to managing contractors
(spawned programs) to get the job done. Either way, it can get the job
done if you want to tackle it in an organized fashion.
[snip - please trim posts]
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail@lists.RWTH-Aachen.DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail