procmail

Re: Regexp fails in scoring recipe

2003-05-05 20:30:24


Dallman Ross wrote:

I am inclined to believe that some aspect of your
server environment acts differently when you are logged in from
how it acts when you are not.  Maybe in one case procmail runs
under your uid and shell, but otherwise runs under suid root and
root's (postulated to be different) shell?  Do you have a shell
definition line in your .procmailrc?  I recommend "SHELL = /bin/sh".

Yes, I use the following, which has the same effect (at least on Solaris):

SHELL=/usr/bin/sh

In any case, I have now seen some sample traffic reports and received two directly myself after subscribing. I found that a recursive SWITCHRC can work for this and give you abbreviated reports that show just what you want and not the cruft you don't want.

Sometimes, I like to look at the other cruft to remind myself why I pay more to live close to work :-)

The HTML highlighting helps me do a rapid scan.

First, though, some precursor stuff.  You had in your recipe,

 LOCATIONS="(dumbarton|(east )?palo alto|stanford|menlo park|\\
           redwood city|mountain view)"

As I mentioned in a previous reply, you could have a problem with
the word breaks.  I see in the actual reports that once in a while
the spacing between words is inconsistent, which only corroborates
my earlier concern.  I believe it would be easy to have, e.g.,
"PALO  ALTO" (with two spaces between) show up instead of what
you were expecting, and you'd miss it.  Also, a line end could
happen in the middle of the phrase.  I recommended using only
one word (and had said that you don't, in any case, need the EAST
for EAST PALO ALTO, since you are accepting the second two words
anyway).  If you don't want potential false hits with "REDWOOD"
or "MOUNTAIN", however, then here's another way.  Each of the two
pairs of square brackets below contains a tab and a space:

WRDBRK    = ($[         ]*|[    ]+)
X         = $WRDBRK
LOCATIONS = "(Dumbarton|(East${X})?Palo${X}Alto|Stanford|Menlo${X}Park|\\
            Redwood${X}City|Mountain${X}View)"
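The effect of the word-break variable can be checked outside procmail with a rough Python approximation. This is only a sketch under stated assumptions: procmail's `$` (which matches a newline) is written here as `\n`, `\b` stands in loosely for the `\<` and `\>` used around $LOCATIONS in the recipes, and the helper name `mentions_location` is made up for illustration.

```python
import re

# Approximation of WRDBRK: a newline plus optional indentation,
# or one or more spaces/tabs.
WRDBRK = r"(?:\n[ \t]*|[ \t]+)"

# Approximation of LOCATIONS; procmail matches case-insensitively
# unless the D flag is set, hence re.IGNORECASE.
LOCATIONS = re.compile(
    r"\b(?:Dumbarton"
    r"|(?:East" + WRDBRK + r")?Palo" + WRDBRK + r"Alto"
    r"|Stanford"
    r"|Menlo" + WRDBRK + r"Park"
    r"|Redwood" + WRDBRK + r"City"
    r"|Mountain" + WRDBRK + r"View)\b",
    re.IGNORECASE,
)


def mentions_location(text):
    """Return True if the text contains one of the locations of interest."""
    return LOCATIONS.search(text) is not None
```

With this pattern, "PALO  ALTO" (two spaces) and a phrase split across a line end both count as hits, while "Redwood Highway, San Rafael" does not.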

On Friday, I took some of your advice and arrived at two recipes that perform the filtering correctly (more on that in a bit). Unfortunately, I didn't get an answer to my original question.

For the locations variable, I now use this:

LOCATIONS="(palo( |^)alto|stanford|menlo( |^)park|\\
         redwood( |^)city|mountain( |^)view|dumbarton)"

I don't remember seeing two consecutive spaces in these reports, as in "PALO  ALTO", but I'll use your idea above when I get around to it. This change can't hurt, but it doesn't explain the inconsistency in behavior. For example, one road work incident was DUMBARTON, which is unaffected by this change, and yet it was not matched in production mode.

I think two words are necessary to avoid false hits on, say, "Redwood Highway, San Rafael".

This is the first recipe that works using scoring (at least it works in procmail 3.15.2):

 :0 B
 *  1^0
 *  1^1 $ (\<)road work(^.*($NSPC).*)?(^.*($NSPC).*)?(^.*($NSPC).*)?\
          .*(\<)$LOCATIONS\>
 * -1^1 $ (\<)$LOCATIONS\>
 /dev/null

Here NSPC = "[^ ]" requires each intervening line to contain at least one non-space character, because I don't want an empty line between the road work line and a line with a location of interest.

The idea of the scoring recipe is that score = 1 + (number of road work events in locations) - (number of all events in locations) is positive if and only if the number of non-road-work events in locations is zero. In that case the action is executed and the report is discarded, since I don't want to see it.

For reports like the one in the original posting with two road work events in Menlo Park and one road-work event on Dumbarton and no non-road-work events in locations of interest, this filter works with procmail 3.15.2. However, it doesn't work on procmail 3.22 because in the last condition, two occurrences of Dumbarton are counted even though the report has only one occurrence. This is yet more weird behavior, albeit in a different version of procmail.
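The arithmetic behind the scoring recipe, and why one extra counted location defeats it, can be sketched in Python. This models only the weights (1^0, 1^1 and -1^1 with a positive final score meaning the recipe matches), not procmail's regexp matching; the function names are invented for illustration.

```python
def score(road_work_in_locations, all_events_in_locations):
    # "* 1^0" contributes 1 once, "* 1^1" adds 1 per road-work match,
    # and "* -1^1" subtracts 1 per location match.
    return 1 + road_work_in_locations - all_events_in_locations


def discarded(road_work_in_locations, all_events_in_locations):
    # The recipe's action (/dev/null) runs only if the final score is positive.
    return score(road_work_in_locations, all_events_in_locations) > 0


# Report from the original posting: three road work events in locations
# of interest and nothing else there.
assert score(3, 3) == 1 and discarded(3, 3)

# If one location is counted twice, as described for 3.22, the score
# drops to 0 and the report is no longer discarded.
assert score(3, 4) == 0 and not discarded(3, 4)
```

A single genuine non-road-work event in a location of interest produces the same score of 0, which is why a double-counted location is indistinguishable from an event I do want to see.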

The second recipe that works doesn't use scoring:

 :0 B
 * $ (\<)((problem|accident|slowdown|stall)(s)?|advisor(y|ies))\
     (^.*($NSPC).*)?(^.*($NSPC).*)?(^.*($NSPC).*)?.*(\<)$LOCATIONS\>
 {
   KEEP=1
 }

 :0 E
 /dev/null

This approach cheats in that it attempts to list all the complementary events to road work (i.e. these are the events I want to see as opposed to the ones I don't want to see). What I don't like about this recipe is that some new classification could appear in the traffic reports (e.g. "disaster" or "flood"), and this recipe would delete the report even though I would want to see it.

All right, I used the above in my test harness, and it worked fine.
Here is the main recipe I put below that (goes in .procmailrc):

#-------------------------------------------------------------
:0
* ^From: KPIX\.Traffic\.Router@kpix\.com
* ^Precedence: bulk
* $ B ?? ^\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
{ SWITCHRC = traffic }
#-------------------------------------------------------------

(I added the "Precedence:" check because you are /dev/nulling the
reports that don't have a city of interest in them, and I imagine that
the list administrator might write you some time with an announcement
that you'd otherwise miss.  In my confirmation mail from the list
for signing up, for example, there was no Precedence: header.)

Okay, I'll add the Precedence test.

Now I made a separate rc-file called "traffic".  That gets
run recursively.  It's important to have a breaking occurrence
in a recursive rc; otherwise, it will iterate until your server
goes kablooey, or something.  :-)  I tested this one on two of
the actual 8-a.m. traffic reports from KPIX:

#-------------------------------------------------------------
:0 Dich:
* ! MATCH ?? ^^(.* )?ROAD +WORK$
| echo "$MATCH" >> somefile

:0
* $ B ?? ^$\MATCH(.*$)*\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
{ SWITCHRC = $_ }
#-------------------------------------------------------------

(Heh.  Note that there's no scoring.  Not that I have anything against
scoring, but . . . I didn't need it.)  That long condition might wrap
before it gets to the list, so I'll put a version here with a line
break:

* $ B ?? ^$\MATCH(.*$)*\/\[ ()[0-9]:.*$(.+$)*(.*\<)?\
          $LOCATIONS\>.*$(.+$)*.*
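The control flow of the recursive rc-file, including its breaking occurrence, can be modeled roughly in Python. Everything here is an assumption made for illustration: the sample entries are invented and do not reproduce the real report format, the location pattern is a shortened stand-in for $LOCATIONS, and `collect` is a hypothetical name. The sketch only shows that each pass must find an entry strictly after the previous one, so the recursion ends when none is left.

```python
import re

# Shortened, hypothetical stand-in for $LOCATIONS.
LOCATION = re.compile(r"\b(dumbarton|palo\s+alto|menlo\s+park)\b", re.IGNORECASE)
# Case-sensitive, like the D flag on the first recipe.
ROAD_WORK = re.compile(r"^(.* )?ROAD +WORK$")


def collect(entries, start=0, kept=None):
    """Visit entries that mention a location; keep those that are not road work."""
    kept = [] if kept is None else kept
    for i in range(start, len(entries)):
        if LOCATION.search(entries[i]):
            first_line = entries[i].split("\n", 1)[0]
            if not ROAD_WORK.match(first_line):
                kept.append(entries[i])
            # Like SWITCHRC = $_ : run the same logic again, but only
            # on what follows the entry just handled.
            return collect(entries, i + 1, kept)
    # Breaking occurrence: no later entry mentions a location.
    return kept


# Invented sample entries, for illustration only.
SAMPLE = [
    "[ 8:01 AM ] ROAD WORK\nMENLO PARK: WILLOW RD",
    "[ 8:05 AM ] ACCIDENT\nPALO  ALTO: UNIVERSITY AVE",
    "[ 8:10 AM ] STALL\nSAN RAFAEL: REDWOOD HIGHWAY",
]
```

On the invented sample, only the second entry is kept: the first is road work and the third is outside the locations of interest.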


I'll keep this one in mind for when I give up on scoring. Thanks for your help.

Since I don't care to dive into the internals of procmail to find the answer to my original question, I'll put it on the back burner for now.

Kevin



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail