procmail

Re: A tool for refining regex

2002-01-30 17:27:08
At 14:25 2002-01-30 -0800, Harry Putnam wrote:
I've been refining a spam catching string of recipes for a while now,
probably not the smartest ones out there and certainly could have more
sophistication in the regex department.   The way I go at it is to have a
series of recipes with different screening regex, and NOT (!)
operators, that write to groups like spam_suspect[1-8].

You could set variables based on each set of tests which a message tripped on, then

Harken back to the discussion just this past week about continuing scoring. Similar principles would apply. Over a period, you might determine which rules are more effective at positively identifying spam more often, and then move them to the head of your rules and return to filing them away as suspect rather than continuing to test others. For all I know, you applied this technique to arrive at where you are now...

a similar process with false hits.  Find the offending filter regex
and fix it.

Some false hits are most easily avoided by greenlisting some lists or posters. Depends on the nature of your email. For instance, I use scoring to accumulate a weighted count of things such as exclamation points (which appear frequently in spam, because spammers are just so enthusiastic). Programming lists (including this one), as well as digests, shouldn't be subjected to such logic because they either make extensive use of bangs, or (in the case of digests) represent a large number of messages. I've thought about allowing X bangs per KB of message or something, but haven't sat down to figure out how to apply such logic.
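For what it's worth, the bang counting plus greenlisting could look roughly like the sketch below. Treat it as an illustration only: the header test used as a greenlist, the threshold of 10, and the folder name spam_suspect_bangs.in are all assumptions of mine and would need adjusting to your own mail.

```
# Sketch only.  The header test is a hypothetical greenlist: mail that
# carries List-Id: or Precedence: list/bulk skips the bang count entirely.
# -10^0 is a starting handicap, then every "!" in the body adds 1, so the
# recipe only fires on the 11th bang.  Tune the 10 to taste.
:0:
* ! ^(List-Id:|Precedence:.*(list|bulk))
* -10^0
* 1^1 B ?? [!]
spam_suspect_bangs.in
```

The comments sit above the recipe on purpose: anything trailing on a condition line would be taken as part of the regex.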

The rub is that my .log file doesn't show me a very detailed report of
what was hit.  I have full logging on

Ouch. If you define your spam rules in a separate file, which you INCLUDERC into your main procmail.rc, then you can also include them from a standalone testing .rc file, which defines a different LOGFILE and working directory (MAILDIR), sets DEFAULT to /dev/null, and turns on VERBOSE. Then, if you have a message (or a whole mailbox) you want to see more information about, you can pump it into the test script, which will report verbosely (to the separate testing logfile). Meanwhile, messages coming into your mailbox aren't each generating hundreds of lines of logfile content unnecessarily.
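A minimal sketch of such a testing rcfile follows. The file name testspam.rc, the procmail-test directory, and the location of the shared spam.rc are hypothetical names picked for illustration; only the variables themselves (MAILDIR, LOGFILE, DEFAULT, VERBOSE, INCLUDERC) are procmail's own.

```
# testspam.rc -- sketch of a standalone test harness (all paths hypothetical)
#
# One message:     procmail ./testspam.rc < one.msg
# Whole mailbox:   formail -s procmail ./testspam.rc < suspects.mbox

# Scratch working directory; it must already exist.  Folders such as
# spam_suspect2.in get created here, not in your real mail directory.
MAILDIR=$HOME/procmail-test

# Separate log, so the production log stays short.
LOGFILE=$MAILDIR/test.log

# Anything the spam recipes do not file is simply discarded.
DEFAULT=/dev/null

# Report every condition as it is tested.
VERBOSE=on

# The same recipe file that the main rcfile includes.
INCLUDERC=$HOME/.procmail/spam.rc
```

Since the spam recipes live in one file, whatever you fix after a test run is picked up by the production rcfile as well.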

32133 procmail: Match on
"^To:.*@pop.newsguy.com|^Received:.*\
(\.tw |\.kr|[^0-9.]202\.|[^0-9]211\.|\
[^0-9.]6[1-6]\.|bogota\.supernet\.com\.co)"

Yea, not very useful.

:0 D
* ^To:.*@pop.newsguy.com|^Received:.*\
   (\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.|\
   [^0-9.]6[1-5]\.|bogota\.supernet\.com\.co)

:0D
* Other_conditions_just_like_you_have_them_but_for_efficiency_do_them_first
* 1^1 ^\/To:.*@pop\.newsguy\.com
* 1^1 ^\/Received:.*\/(\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.|\
        [^0-9.]6[1-5]\.|bogota\.supernet\.com\.co)
spam_suspect2.in

(note that in your recipe, you're not LOCKING, and you're also not escaping the dots in your mail host)

With the scoring, I'm busting your combined expression out into separate ones - that allows you to see the individual lines which would have matched when you have verbose logging on (perhaps BOTH the To: and a Received: would have matched - but with a combined expression, you wouldn't know). With the scoring number, you'll also have a feel for how many times an individual expression was matched - and EACH one will be emitted in the verbose log, although only ONE will remain in $MATCH to be used by your script. Also, unless you tack a '.*' onto the ends of those lines, the expression will terminate at the part of the line which actually matches - it will not show the FULL line to the end. I think it's more useful this way, since it gives you more feedback about which subexpression (all those ORs) actually matched.
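To make that trade-off concrete, here is a minimal sketch of a shortened Received: test with '.*' tacked on, so that $MATCH holds the line through to its end, plus an explicit LOG assignment so the hit is recorded even when VERBOSE is off. The LOG text and the nested block layout are my own illustration and not part of your setup.

```
# Sketch only.  The trailing .* stretches $MATCH to the end of the header
# line; drop it to see just the part up to the subexpression that hit.
# The LOG assignment appends one line to $LOGFILE, VERBOSE or not.
:0 D
* 1^1 ^\/Received:.*(\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.).*
{
  LOG="spam_suspect2 hit: $MATCH
"

  :0:
  spam_suspect2.in
}
```

The closing quote sits on its own line deliberately: the embedded newline ends the log entry.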

I expect this is the glimmer in the darkness which you were looking for.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
