procmail

A tool for refining regex

2002-01-30 15:42:57
[WARNING: Long-winded discussion of grabbing lines with filter regex]

I suspect this has come up before but who knows what strings to search
on to find something like this?

I want some help from procmail to get my regexes better refined.  That
is, I want procmail to show me exactly what is being hit.  Here's what
I mean:

I've been refining a spam-catching string of recipes for a while now.
They are probably not the smartest ones out there and certainly could
use more sophistication in the regex department.  The way I go at it is
to have a series of recipes with different screening regexes, and NOT
(!) operators, that write to groups like spam_suspect[1-8].

I currently have 8 in use, and am catching most spam without
catching an overabundance of false hits.

I have a helper file where I send spam that was missed, and every so
often I take those messages and pass them through various screens until
I get some ideas on how to catch them too.

I follow a similar process with false hits: find the offending filter
regex and fix it.

The rub is that my .log file doesn't show me a very detailed report of
what was hit.  I have full logging on and have written a little search
script that takes a message-id and finds the section of the log where
that message was processed.  Right away you get to see what recipe hit
and what parts of it did the job.  What you don't see is the line or
lines that tripped it.

I've often tried to take the recipe regexp and run it against a false
hit with egrep or grep to isolate the exact line, but that is often
very time consuming and takes repeated tries to work out what the
regexp in procmail needs to look like to work in egrep or grep.
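
One way to avoid translating the regexp for egrep at all is to replay
the saved message through procmail itself, using a throwaway rcfile
that holds a cut-down copy of the recipe.  What follows is only a
sketch under assumptions: the file name sandbox.rc, the scratch
directory /tmp/pm-sandbox and the sandbox_hits folder are made-up
names, and the condition is just the first two alternatives of the
Received: screen.  Since procmail does the matching, the D flag and
header continuation lines (which, as far as I know, procmail joins
before matching and plain grep on the raw file does not) are handled
as in a real delivery.

```procmail
# sandbox.rc: hypothetical throwaway rcfile for replaying one saved message
MAILDIR=/tmp/pm-sandbox     # hypothetical scratch directory; create it first
DEFAULT=/dev/null           # discard the replayed message if nothing delivers it
LOGFILE=$MAILDIR/sandbox.log
VERBOSE=on

# Cut-down copy of the screen under test: keep only a couple of
# alternatives at a time to see which one actually fires.
:0 D
* ^Received:.*(\.tw |\.kr)
sandbox_hits
```

It could be run with something like

    procmail /tmp/pm-sandbox/sandbox.rc < saved-message

and then checked in sandbox.log.  Swapping alternatives in and out of
the condition narrows down which one is responsible.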

I'll try to cut the verbosity level here a bit and show the lines that
I see and the recipe that wrote them.

I got quite a few false hits on this one over a few weeks' time:

First the log report of one such false hit:
(NOTE the end of line chars (\) added to show how line was wrapped for
mail)

[...]
32133 procmail: Match on 
"^To:.*@pop.newsguy.com|^Received:.*\
(\.tw |\.kr|[^0-9.]202\.|[^0-9]211\.|\
[^0-9.]6[1-6]\.|bogota\.supernet\.com\.co)"
32134 procmail: Match on ! "^Return-Path:.*redhat\.com|owner-"
32135 procmail: Match on ! "^Sender:.*list"
32136 procmail: Match on ! "^List-Id:"
32137 procmail: Match on ! "^Delivered-To:.*lula@yahoogroups"
32138 procmail: Match on ! "^From:.*Putnam"
32139 procmail: Match on ! "^Mailing-List:"
32140 procmail: Match on ! "^To:.*reader@newsguy.com"
32141 procmail: Match on ! "^Received:.*smtp10"
[...]


The recipe:
(NOTE long poorly constructed regex wrapped at (\) chars for
mail)
    :0 D
    * ^To:.*@pop.newsguy.com|^Received:.*\
       (\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.|\
       [^0-9.]6[1-5]\.|bogota\.supernet\.com\.co)
    * ! ^Return-Path:.*redhat\.com|owner-
    * ! ^Sender:.*list
    * ! ^List-Id:
    * ! ^X-Loop:
    * ! ^Delivered-To:.*lula@yahoogroups
    * ! ^From:.*Putnam
    * ! ^Mailing-List:
    #[HP  11/30/01 16:29  Added ]
    * ! ^To:.*reader@newsguy.com
    * ! ^Received:.*smtp10
    spam_suspect2.in

The log leaves no doubt as to what the hit was, but I screwed around
for 30 minutes trying to run that regexp against the headers from that
message with egrep/grep, and never did see the line it matched.  I
tried all of it stuck between single quotes, parts of it snipped out,
etc.

What I want here is for procmail to tell me exactly what matched, by
grabbing the line (or the piece of it) that the regex hits and putting
that into the log too.

I solved this one by adding another NOT (!) operator that skips
messages with an X-Loop: header, thereby stopping most of the false
hits.  Most of them were from a Debian mail group.

I've used the MATCH operator a little and suspect it could be brought
to bear here to grab the line somehow, maybe by having the regex twice:
once in a match and a second time as a screen like the one above.
But I probably don't want to grab the line hit by a NOT (!) operator,
since that wouldn't help clear it up much.

Not really sure how to do that or if that would be the way to go.
I'm hoping some of the sharp-shooters here will have done this long ago
and can explain how to do it.
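
For illustration, the "regex twice" idea might look something like the
sketch below.  It is untested and rests on assumptions: it relies on
the \/ token, which assigns whatever matches to its right to the MATCH
variable, and on the trailing .* pulling the rest of the header line
into MATCH.  The "suspect2 hit:" log prefix is a made-up label.  The
two blocks only write to the log and deliver nothing, so they would go
just ahead of the real spam_suspect2.in recipe, which stays as it is.
The NOT (!) conditions are left out on purpose.

```procmail
# Logging only: record the Received: text that trips the screen.
:0 D
* ^Received:\/.*(\.tw |\.kr|[^0-9.]202\.|[^0-9.]211\.|[^0-9.]6[1-5]\.|bogota\.supernet\.com\.co).*
{
  LOG="suspect2 hit: Received:$MATCH
"
}

# Logging only: same idea for the To: alternative of the screen.
:0 D
* ^To:\/.*@pop.newsguy.com.*
{
  LOG="suspect2 hit: To:$MATCH
"
}
```

The closing quote of each LOG value sits on its own line so that every
entry ends with a newline in the log file.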
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
