Re: Regexp fails in scoring recipe

2003-04-30 06:16:40
On Tue, Apr 29, 2003 at 04:45:01PM -0700, Kevin Wu wrote:
I have a recipe that filters traffic reports delivered to me in
the afternoon at one-hour intervals. I only want to see reports with
accidents, slowdowns, and traffic advisories in locations along my
commute, and not those with only scheduled road work. [. . .]

Does anyone know why procmail behaves inconsistently between
production mode and test mode (see the diagnostic output below)? The
IMAP server may have recently had a sendmail patch applied, but I
don't know if that is related in any way. The platform is procmail
3.15.2 on Solaris 9.

Not sure what inconsistency you are seeing.  But your scoring recipe
is, at best, odd.  I would think a clean-up would help, even if it
doesn't solve your problem altogether.


The recipe:
----------
LOCATIONS="(dumbarton|(east )?palo alto|stanford|menlo park|\\
         redwood city|mountain view)"

Okay.  I actually hadn't known that a double backslash could continue an
assignment line, but I just tested this, and it works.


:0
* ^From:.*(\<)KPIX\.Traffic\.Router
{
 # Delete the traffic report if it has no incidents in our
 # commute region.
 #
 :0 B
 * $ ! (\<)$LOCATIONS\>
 /dev/null

Okay so far, I think.


 # Debug traffic report filtering.
 #
 :0
 * DEBUG ?? 0

Btw, the above would also match on 10, 20, 30, . . . 100, 101, . . .
109, 110, 120, . . .

So you want

        * DEBUG ?? ^^0^^
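
To see the difference, here is a minimal sketch.  It assumes DEBUG was
set to 10 earlier in the rcfile; that value and the LOG messages are
made up purely for illustration.

```procmail
DEBUG=10

# Unanchored: the regex "0" is satisfied by any value containing a 0,
# so this block runs even though DEBUG is 10.
:0
* DEBUG ?? 0
{
  LOG="unanchored test matched DEBUG=$DEBUG
"
}

# Anchored with ^^ at both ends: only the exact value "0" matches,
# so this block is skipped for DEBUG=10.
:0
* DEBUG ?? ^^0^^
{
  LOG="anchored test matched DEBUG=$DEBUG
"
}
```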


 {
   VERBOSE=YES
   LOGFILE=$MYLOGFILE
 }

 # Delete the traffic report if the only references to nearby
 # locations are for recurring events.
 #
 :0 B
 *  1^0
 *  1^1 $ (\<)(road work((^.*)$LOCATIONS))\>
 *  1^1 $ (\<)(road work((^.*)[a-z]+.*)((^.*)$LOCATIONS))\>
 *  1^1 $ (\<)(road work((^.*)[a-z]+.*)((^.*)[a-z]+.*)((^.*)$LOCATIONS))\>
 * -1^1 $ (^.*)$LOCATIONS\>
 /dev/null

Let's clarify.  You are adding 1 unconditionally, for what reason I'm
not sure, but for what I presume to be a fall-through; then you are
adding 1 for each count of "road work" where the next line contains one
of your locations; then 1 for each count where "road work" is followed by
*a line with at least one letter in it* and where the *next* line
contains one of your locations; then where the phrase is followed
by a line containing at least one letter which, in turn, is followed
by a line containing at least one letter, which line is followed by
a line with one of your locations.  Ugh!  Then you remove a point for
each instance of one of your locations.
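
Spelled out as arithmetic, and assuming I am reading procmail's w^x
scoring correctly (with x = 1 every match adds w, and the recipe fires
only when the final total is positive), the tally looks roughly like
this.  R and N are just my own names for the hit counts, and the two
sample cases are hypothetical.

```procmail
# Score tally for the scoring recipe quoted above.
# R and N are illustrative names, not procmail variables.
#
#   +1   from "* 1^0"          (unconditional)
#   +R   from the three 1^1    conditions ("road work" followed by a location)
#   -N   from the -1^1         condition  (every location hit)
#
#   total = 1 + R - N
#
#   R = 2, N = 2  ->  1 + 2 - 2 = 1  -> positive, mail goes to /dev/null
#   R = 1, N = 2  ->  1 + 1 - 2 = 0  -> not positive, mail is kept
```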

First, a few points about the regexes:  One, you are not anchoring or
limiting the left side of $LOCATIONS here.  So while we'd find an
instance of "Mountain View", we'd also find an instance of "it's
a wonderful intermountain view up in the Rockies."  :-)  I don't
see that as a reason for your odd results, but it's something to
take note of.  (Use a `\<' delimiter at the front of $LOCATIONS.)

Next, you don't need most of those parens around ^.*, and they
seem to me only to add a level of visual complexity.

Next, you don't need

        (^.*)$LOCATIONS\>
or
        ^.*$LOCATIONS\>

to match $LOCATIONS.  You only need

        ()\<$LOCATIONS\>

(The empty parentheses before the first backslash are needed only when
the backslash would otherwise be the first element of the condition.)
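
As a complete condition, that might look like the sketch below.  The
shortened LOCATIONS value and the LOG message are placeholders of mine,
and the `$' flag is there on the assumption that you want procmail to
expand $LOCATIONS before it compiles the regex.

```procmail
# Cut-down location list, for illustration only.
LOCATIONS="(stanford|mountain view)"

# As I understand it, procmail treats a backslash at the very start of
# a condition as quoting the next character, so the empty group ()
# keeps \< from being misread.  \< and \> then limit the match to
# whole words, so "intermountain view" no longer counts.
:0 B
* $ ()\<$LOCATIONS\>
{
  LOG="body mentions one of the locations as a whole word
"
}
```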

Next, to match "any letter" you don't need

        ((^.*)[a-z]+.*)
or even
        ^.*[a-z]+.*
(which would be a visual improvement, anyway), but only
        [a-z]

But anyway, what if the line has only numbers?  Such as

        CALL CALTRANS TO REPORT ANY PROBLEMS AT
        1-415-555-1212.

So you really probably want just this:

        .

That one dot was on purpose, yes.

Basically, if the body contains a location we want, and that location
does not follow "road work" on the same line, on the next line, or on
the line after that (each intervening line having at least one
character in it), you want to delete the mail?

And what if the lines are these?

        CALTRANS SAYS AVOID ROAD
        WORK AT DUMBARTON BRIDGE


So I think you want this (untested, however):

        :0 B  # inside brackets are a space and a tab
        * $ ()\<road[     ]+^?work\>.*(^.+)?(^.+)?^(.*\<)?$LOCATIONS\>


Now let's talk about the algorithm.  I don't think you need to
count up every instance that matches and then subtract every
one that doesn't in order to see if you want the mail.  You've
already deleted items that had no $LOCATIONS at all.  (You could
have done that all in the one scoring recipe, but that's not
really important.)

What you could do instead is forget scoring, and look for $LOCATIONS
that *don't* follow "road work", and simply accept them.

Here is a recipe that combines all of these ideas in one.  That is,
you don't need the first recipe that dev-nulled things that didn't
contain $LOCATIONS:

        :0 B  # inside brackets are a space and a tab
        * $   ()\<$LOCATIONS\>
        * $ ! ()\<road[   ]+^?work\>.*(^.+)?(^.+)?^(.*\<)?$LOCATIONS\>
        deliver_me

-- 
dman
