procmail

RE: Regexp fails in scoring recipe

2003-05-06 16:29:03
Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

Dallman Ross wrote:

recursive SWITCHRC can work for this and give you 
abbreviated reports that show you just what you want 
and not the cruft you don't want.

 
Sometimes, I like to look at the other cruft to remind myself 
why I pay more to live close to work :-)

You can, uh, always save the original report and look at it
at your leisure.  However, it's trivial to alter what I
showed very slightly to do what you were doing before:

Main recipe (unchanged from yesterday):

 :0
 * $ B ?? ^\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
 { SWITCHRC = traffic }

In the file called "traffic":


 :0 D:
 * ! MATCH ?? ^^(.* )?ROAD +WORK$
 DELIVER_ME

 :0  # we're here only if we didn't already deliver per above
 * $ B ?? ^$\MATCH(.*$)*\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
 { SWITCHRC = $_ }

But see all the way down to the bottom, because we don't need the
recursive SWITCHRC after all.


WRDBRK    = "($[     ]*|[    ]+)"
X         = $WRDBRK
LOCATIONS = "(Dumbarton|(East${X})?Palo${X}Alto|Stanford|Menlo${X}Park|\\
            Redwood${X}City|Mountain${X}View)"

For the locations variable, I now use this:

LOCATIONS="(palo( |^)alto|stanford|menlo( |^)park|\\
          redwood( |^)city|mountain( |^)view|dumbarton)"

I don't remember seeing two consecutive spaces in these reports as in 
"PALO  ALTO", but I'll use your idea above when I get around 

Okay, take what you have and just put a plus sign after the space.
However, from the look of the reports, it would not surprise me to
find a newline followed by a space and then "ALTO".
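
As a minimal sketch of both suggestions (not part of the original
exchange, and untested against real reports), the separator can be
factored into a variable that accepts one or more spaces, or a newline
followed by optional spaces.  GAP is a name made up here for
illustration:

 # hypothetical helper: one or more spaces, or newline + optional spaces
 GAP       = "( +|^ *)"
 LOCATIONS = "palo${GAP}alto|stanford|menlo${GAP}park"
 LOCATIONS = "($LOCATIONS|redwood${GAP}city|mountain${GAP}view|dumbarton)"

Building the alternation in two assignments keeps each line short
enough that no continuation line is needed.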


This is the first recipe that works using scoring (at least 
it works in procmail 3.15.2):

  :0 B
  *  1^0
  *  1^1 $ (\<)road work(^.*($NSPC).*)?(^.*($NSPC).*)?(^.*($NSPC).*)?\
           .*(\<)$LOCATIONS\>
  * -1^1 $ (\<)$LOCATIONS\>
  /dev/null

where NSPC = "[^     ]" because I don't want an empty line 
between the road work line and a line with a location of interest.

The idea of the scoring recipe is that score = 1 + (number of
road work events in locations) - (all events in locations) is positive
if and only if the number of non-road-work events is zero, then the
action is executed as I don't want to see this report.
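
To make that arithmetic concrete, assume each event contributes exactly
one match to the condition that counts it.  For a report with three
road-work events in the locations of interest and nothing else:

    score = 1 + 3 - 3 = 1    (positive: recipe fires, report discarded)

With one additional accident in a location of interest, only the last
condition gains a match:

    score = 1 + 3 - 4 = 0    (not positive: recipe skipped, report kept)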

Okay, that's not bad.  I'll note that a dot followed by a plus suffices
in place of your "not whitespace" syntax, however.  That's in fact what
I meant with the plus signs in my line:

* $ B ?? ^\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
............................^...........................^

"One or more chars on this line" is essentially what's in the parens
there, in pseudo-code language.

We can slightly optimize the scoring.  We don't need the 1^0 push at
the top if we reverse things.  We can count each instance of $LOCATIONS
once, and subtract each instance of ROAD WORK followed by $LOCATIONS.
I'm afraid the mailman package is going to wrap my lines again where
I don't want them wrapped, so I will break up the second condition with
a continuation backslash.  This is similar to the above of mine, but with
weighting added and with the `\/' match token removed:

 :0:
 * $  1^1  B ?? ^\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>
 * $ -1^1  B ?? ^\[ ()[0-9]:.* ROAD +WORK$(.+$)*(.*\<)?\
                $LOCATIONS\>
 KEEP_ME

 :0 E  # else
 {
   HOST = byebye  # slightly more efficient than saving to /dev/null
 }


That works fine.  I just tested it under 3.22 and NetBSD 1.6.  It has
an advantage over what you show, because sometimes the locations appear
more than once in an entry.  You'd get bogus negative scores then.
My scoring anchors each instance of $LOCATIONS from the start of the 
entry to avoid that.
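
As a worked example of the reversed scoring, assume each entry
contributes at most one match per condition.  For a report with three
road-work entries in the locations of interest and nothing else:

    score = 3 - 3 = 0    (not positive: KEEP_ME is skipped, E recipe runs)

With one additional non-road-work entry in a location of interest, only
the first condition gains a match:

    score = 4 - 3 = 1    (positive: report is delivered to KEEP_ME)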

I'd be very curious to see if that scoring fails in any way on your
system.  I have a new theory about what's going wrong on your system,
anyway.


For reports like the one in the original posting with two road work 
events in Menlo Park and one road-work event on Dumbarton and no 
non-road-work events in locations of interest, this filter works with 
procmail 3.15.2. However, it doesn't work on procmail 3.22 because in 
the last condition, two occurrences of Dumbarton are counted 

There is a known bug in procmail concerning putting B on the initial
recipe line, where it then can't be turned off in subsequent recipes.
Or is it putting H on the initial recipe lines?  Or both?  I don't
remember, for two reasons: one, I hardly ever use body greps (my
extensive spam traps are almost all header-only checks); and two,
I always use the alternative syntax I show above, specifically
because of this bug.  So I don't run into it personally.  However,
could it be that your rc is invoking H or B flags on recipes and
then running into this bug?  Instead of

        :0 B
        * body condition

Try

        :0
        * B ?? body condition
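
Applied to the first scoring recipe quoted above, that change might
look like the following sketch.  It is untested; the condition order
($, then the weight, then B ??) simply follows the weighted recipe
shown earlier in this message, and NSPC and LOCATIONS are assumed to be
defined as before:

  :0
  *     1^0
  * $   1^1  B ?? (\<)road work(^.*($NSPC).*)?(^.*($NSPC).*)?\
                  (^.*($NSPC).*)?.*(\<)$LOCATIONS\>
  * $  -1^1  B ?? (\<)$LOCATIONS\>
  /dev/null

Because no B flag is set on the recipe line itself, nothing should
carry over into later recipes.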



The second recipe that works doesn't use scoring:

  :0 B
  * $ (\<)((problem|accident|slowdown|stall)(s)?|advisor(y|ies))\
      (^.*($NSPC).*)?(^.*($NSPC).*)?(^.*($NSPC).*)?.*(\<)$LOCATIONS\>
  {
    KEEP=1
  }

  :0 E
  /dev/null

Okay.  (We don't need the E, because if we're here, the one above
will not have delivered.  But, yeah, good.)


This approach cheats in that it attempts to list all the 
complementary events to road work (i.e. these are the events I 
want to see as opposed to the ones I don't want to see). What 
I don't like about this recipe is that some new classification 
could appear in the traffic reports (e.g. "disaster" or "flood"), 

All right, here is a way around that.  We define "not road work"
and use it.  Here it is.  If you plug it into your recipe, it
should work just fine.

  SPACE  = " "
  TAB    = "    "
  WS     = "$SPACE$TAB"

  NOT_RW =          "[^R]|R[^o]|Ro[^a]|Roa[^d]|Road[^$WS]"
  NOT_RW =  "$NOT_RW|Road[$WS][^W]|Road[$WS]W[^o]"
  NOT_RW = "($NOT_RW|Road[$WS]Wo[^r]|Road[$WS]Wor[^k])"

I use something similar for "$NOT_RCVD", which I use to check for
split Received: headers in mail.  (It's one of the spammers' tricks.)

-- 
Dallman Ross


"If you find a path with no obstacles, it probably does not lead to
anywhere."
        Thoughts of Rev. Sunnan Kubose, from _Zen in the Markets_ 


