Re: A tool for refining regex

At 14:25 2002-01-30 -0800, Harry Putnam wrote:

I've been refining a spam catching string of recipes for a while now,
probably not the smartest ones out there and certainly could have more
sophistication in the regex department.   The way I go at it is to have a
series of recipes with different screening regex, and NOT (!)
operators, that write to groups like spam_suspect[1-8].

You could set variables based on each set of tests which a message tripped on, then

Harken back to the discussion just this past week about continuing scoring. Similar principles would apply. Over a period, you might determine which rules are more effective at positively identifying spam more often, and then move them to the head of your rules and return to filing them away as suspect rather than continuing to test others. For all I know, you applied this technique to arrive at where you are now...
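
As a rough sketch of the variable idea (the HIT_* and TRIPPED names are invented for illustration, and the two conditions are just pieces lifted from the expression quoted further down), each test can leave a marker behind, and a final recipe can log which ones tripped before filing:

```procmail
# Hypothetical variable names; each test only leaves a marker behind.
:0
* ^Received:.*(\.tw |\.kr)
{ HIT_TLD=yes }

:0
* ^Received:.*bogota\.supernet\.com\.co
{ HIT_BOGOTA=yes }

# Unset markers expand to nothing, so TRIPPED is empty when no test hit.
TRIPPED="$HIT_TLD$HIT_BOGOTA"

# Record which tests tripped, then file the message once (with a lockfile).
:0
* TRIPPED ?? yes
{
  LOG="tripped: tld=$HIT_TLD bogota=$HIT_BOGOTA
"
  :0:
  spam_suspect2.in
}
```

Grepping the logfile for the "tripped:" lines over a few weeks should give a feel for which tests earn a place at the head of the rules.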

a similar process with false hits.  Find the offending filter regex
and fix it.

Some false hits are most easily avoided by greenlisting some lists or posters. Depends on the nature of your email. For instance, I use scoring to accumulate a weighted count of things such as exclamation points (which appear frequently in spam, because spammers are just so enthusiastic). Programming lists (including this one), as well as digests, shouldn't be subjected to such logic because they either make extensive use of bangs, or (in the case of digests) represent a large number of messages. I've thought about allowing X bangs per KB of message or something, but haven't sat down to figure out how to apply such logic.
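
A minimal sketch of the greenlist-then-count arrangement, assuming a List-Id header identifies the lists to exempt (that pattern and the BANGS variable name are placeholders, not my actual rules):

```procmail
# Skip the bang count entirely for greenlisted lists.
# The List-Id pattern is a placeholder for whatever identifies your lists.
:0
* ! ^List-Id:.*procmail
{
  # Each "!" found in the body adds 1 to the score. The empty block does
  # nothing; it is only there so the recipe has an action.
  :0 B
  * 1^1 [!]
  { }

  # $= expands to the score of the most recent recipe.
  BANGS = $=
}
```

The bang is wrapped in a character class because a bare ! at the start of a condition means "invert". What to do with BANGS afterwards (a flat threshold, or something scaled to message size) is left open here.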

The rub is that my .log file doesn't show me a very detailed report of
what was hit.  I have full logging on

Ouch. If you define your spam rules within a separate file, which you includerc into your main procmail.rc, then you can also include them from a standalone testing .rc file, which defines a different logfile and working directory, delivers default to /dev/null, and which enables verbose. Then, if you have a message (or a whole mailbox) you want to see more information about, you can pump it into the test script, which will report with verbosity (to the separate, testing logfile). Meanwhile each message coming into your mailbox isn't generating hundreds of lines of logfile content unnecessarily.
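
Such a testing file might look roughly like this. Every path and filename here is an assumption for the sake of the example (including spam.rc as the name of the shared recipe file):

```procmail
# testing.rc: standalone harness for the spam recipes.
# All paths below are hypothetical; adjust to your own layout.

# Separate working directory (it must already exist).
MAILDIR=$HOME/procmail-test

# Separate logfile, so the verbose output stays out of the real log.
LOGFILE=$MAILDIR/testing.log

# Anything the recipes don't file gets thrown away.
DEFAULT=/dev/null

# Full detail on every condition tested.
VERBOSE=on

# The same recipe file the main procmail.rc includes.
INCLUDERC=$HOME/.procmail/spam.rc
```

Something like "procmail ./testing.rc < some_message" (or "formail -s procmail ./testing.rc < some_mailbox" for a whole mailbox) should then leave the detail in the testing logfile. Bear in mind that recipes which file to relative names such as spam_suspect2.in will write into the test directory rather than your real mail directory.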

32133 procmail: Match on 
"^To:(_dot_)*(_at_)pop(_dot_)newsguy(_dot_)com|^Received:.*\
(\.tw |\.kr|[^0-9.]202\.|[^0-9]211\.|\
[^0-9.]6[1-6]\.|bogota\.supernet\.com\.co)"


Yea, not very useful.

        :0 D
    * ^To:(_dot_)*(_at_)pop(_dot_)newsguy(_dot_)com|^Received:.*\
       (\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.|\
       [^0-9.]6[1-5]\.|bogota\.supernet\.com\.co)


:0D:
* Other_conditions_just_like_you_have_them_but_for_efficiency_do_them_first
* 1^1 ^\/To:(_dot_)*(_at_)pop\(_dot_)newsguy\(_dot_)com
* 1^1 ^\/Received:.*\/(\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.|\
        [^0-9.]6[1-5]\.|bogota\.supernet\.com\.co)
spam_suspect2.in

(note that in your recipe, you're not LOCKING, and you're also not escaping the dots in your mail host)

With the scoring, I'm busting your combined expression out into separate ones - that allows you to see the individual lines which would have matched when you have verbose logging on (perhaps BOTH the To: and a Received: would have matched - but with a combined expression, you wouldn't know). With the scoring number, you'll also have a feel for how many times an individual expression was matched - and EACH one will be emitted in the verbose log, although only ONE will remain in $MATCH to be used by your script. Also, unless you tack a '.*' onto the ends of those lines, the expression will terminate at the part of the line which actually matches - it will not show the FULL line to the end. I think it's more useful this way, since it gives you more feedback about which subexpression (all those ORs) actually matched.
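
If you'd rather have that feedback without turning verbose on, a sketch along these lines writes the matched text to the regular logfile. It is illustrative only: the condition is a cut-down piece of your expression, and the log wording is made up.

```procmail
# Illustrative only: the nested block just logs, it does not deliver.
:0
* ^Received:.*\/(\.tw |\.kr|bogota\.supernet\.com\.co)
{
  # MATCH holds whatever matched to the right of the \/ token above.
  LOG="spam_suspect2 tripped on: $MATCH
"
}
```

The closing quote sits on its own line so that each log entry ends with a newline.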


I expect this is the glimmer in the darkness which you were looking for.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail