procmail

RE: Regexp fails in scoring recipe

2003-05-15 08:42:40
Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

Dallman Ross wrote:

I still don't like padding $NOT_AB.  [. . .]

DeMorgan's Law: !(X == 'a' && Y == 'b') equals (X != 'a' || Y != 'b')

Yup.
[good background stuff deleted]

When we implement this logic with a regexp, we need to 
consider strings of length other than two between a word 
boundary character on the left and a newline on the right. 
So even NOT_AB = "(.|[^a].|a[^b])" is not good enough. 
The right answer is

NOT_AB = "(.?|[^a].|a[^b]|(.*)(${NWB}..|${WB}[^a].|${WB}a[^b]))"

where $NWB is not a word boundary character and $WB is a word 
boundary character.

This covers all the cases: strings of length 0, 1, 2, 3, ..., between 
word boundary and newline. After all, we are trying to match 
all strings that are not "ab", and all strings of length other than 
2 match that description. I think this is exhaustive. It already looks
ugly for two characters. Perhaps there is some way to combine the
regexps for length 2 and lengths 3, 4, ..., but that's beyond me right now.
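
For readers following along, the same definition can be split into one
alternative per string length. This is only a sketch: LEN01, LEN2 and
LEN3UP are hypothetical names that do not appear in Kevin's rcfile, and
WB/NWB are assumed here to be the space-or-tab classes he later calls
SPC and NSPC.

```procmail
WSPC=" 	"                       # space + literal tab (assumed definition)
WB="[$WSPC]"                    # assumed: a word boundary character
NWB="[^$WSPC]"                  # assumed: any other character

LEN01="(.?)"                    # length 0 or 1: can never be "ab"
LEN2="([^a].|a[^b])"            # length 2: first is not a, or second is not b

# Length 3 and up: only the last three characters matter. Either the
# final two follow a non-boundary character (so they are the tail of a
# longer word), or they follow a boundary and are not "ab".
LEN3UP="((.*)(${NWB}..|${WB}[^a].|${WB}a[^b]))"

NOT_AB="($LEN01|$LEN2|$LEN3UP)" # same alternatives as the one-line form
```

Splitting it this way does not shorten the expanded regexp; it only
makes each length case easier to check by eye.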

That's a very good recital of the problem and a decent proposal,
imho.  However, I *still* think it would work just as well (if not
better) this way:

  NOT_AB = "(.?|[^a]|a[^b])"

and so on.  Although I thought I understood the idiosyncrasies of
what you stated by way of explanation (elided here), I confess I
still don't get the reasons for two chars after your $NWB above.

I had already tried, in my days of playing before finally answering
again last week,

  ^.*\<?($NOT_AB)*$
  .....^.........^
        \       \
         and I want especially to point out the `?' and `*'.

I was disappointed that I couldn't get that to work, even where
$NOT_AB is simply "([^a]|a[^b])", though I also tried with the 
leading ".?|" thing.  In any event, my thought is that $NOT_AB
should stay a clean definition, and the regex can be built around
it to accommodate length of 0-infinity ${NWB} chars.


Here's the full solution:

WSPC    = "     "               # whitespace: space + tab
SPC     = "[$WSPC]"             # Regexp: space + tab
NSPC    = "[^$WSPC]"            # negation
X         = "($[        ]*|[    ]+)"   # Optional word break


Good, so far . . .


LOCATIONS = "(Palo${X}Alto|Stanford|Menlo${X}Park|\\
             Redwood${X}City|Mountain${X}View|Dumbarton)"

NEL       = "(.*($NSPC).*\$)"   # Non-empty line

NOT_WORK4 = "([^w]...|w[^o]..|wo[^r].|wor[^k])"
NOT_ROAD_WORK9 = "(\\
              [^r]........|r[^o].......|ro[^a]......|roa[^d].....|\\
              road${NSPC}....|road${SPC}[^w]...|road${SPC}w[^o]..|\\
              road${SPC}wo[^r].|road${SPC}wor[^k])"
NOT_ROAD_WORK10 = "(\\
              ${NSPC}.........|${SPC}[^r]........|${SPC}r[^o].......|\\
              ${SPC}ro[^a]......|${SPC}roa[^d].....|\\
              ${SPC}road${NSPC}....|${SPC}road${SPC}[^w]...|\\
              ${SPC}road${SPC}w[^o]..|${SPC}road${SPC}wo[^r].|\\
              ${SPC}road${SPC}wor[^k])"
NOT_ROAD_WORK = "(.?.?.?|$NOT_WORK4|.?.?.?.....|$NOT_ROAD_WORK9|\\
              (.*)$NOT_ROAD_WORK10)"

:0 
* ^From:.*(\<)KPIX\.Traffic\.Router
* Precedence: bulk
{
  :0         
  * $ B ?? (\$\$)\[ ?[0-2]?[0-9]:(.*)(\<)($NOT_ROAD_WORK)(\$)\
           ($NEL)*((.*)(\<))?$LOCATIONS\>
  {
    KEEP_IT = 1
  }

  :0 E
  /dev/null
}



I started using this solution on Friday. As of today, it has 
worked for eight traffic reports in production mode.

Cool beans.  I hope you have a decent-size LINEBUF setting.  :)
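
For context on the LINEBUF remark: according to the procmailrc man
page, no rcfile line may exceed $LINEBUF characters before or after
variable expansion, and the default is 2048. A condition that expands
$NOT_ROAD_WORK, $NEL and $LOCATIONS can get long, so one might raise
the limit near the top of the rcfile. The value below is an arbitrary
illustration, not a figure from this thread.

```procmail
# Enlarge procmail's internal line buffer so that long conditions
# still fit after all variables have been expanded.
# 8192 is an illustrative value; the default is 2048.
LINEBUF=8192
```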

Fortunately, this whole exchange has been enlightening, and 
it yielded a non-scoring solution, which I previously thought 
was too difficult to even consider. I'll use the non-scoring 
solution for a while to see if any bugs pop up. But I'll revert 
to scoring eventually, as long as scoring works in some form 
on my mail server.

Thanks, Dallman.

I've enjoyed the exchange too, at least when I wasn't cursing it.  :-)

In a more recent post, Kevin added:

I found the answers to my questions:
    Q1. Why did my original recipe stop working?
    Q2. Why does the recipe fail in production mode while succeeding in
        test mode?

The reason became evident while I was testing the new 
non-scoring recipe in production mode: It was also failing when 
the traffic report had only road work events in the locations 
of interest. In other words, the new recipe was matching road 
work events when the regexp was designed to match everything 
except road work events. To debug this, I used the \/ 
token to determine the matching text and put it into the log file.

Good trick (have done it myself).  :)

This is what I found:

A1: The traffic report body was in DOS format! 
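
The \/ logging trick, and one possible way to deal with DOS line
endings, might look roughly like the sketch below. Both recipes are
assumptions for illustration: the thread does not show Kevin's actual
debugging recipe or say how he handled the carriage returns. The
timestamp pattern is borrowed from his condition above, and NL is a
helper variable introduced here.

```procmail
NL="
"

# 1. Debugging: on a body line that starts with a "[hh:" timestamp,
#    capture the rest of the line in $MATCH and write it to the log.
#    The square brackets in the log entry delimit the captured text.
:0
* B ?? ^\[ ?[0-2]?[0-9]:\/.*
{
  LOG="matched: [$MATCH]$NL"
}

# 2. Possible workaround: delete carriage returns from the body
#    before later recipes test it.
#    f = filter, b = feed the body only, w = wait for tr to finish.
:0 fbw
* ^From:.*(\<)KPIX\.Traffic\.Router
| tr -d '\015'
```

The filter recipe would need to come before the recipe that tests
$NOT_ROAD_WORK, since that condition relies on counting characters up
to the end of the line.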

Okay, that's great that you found that; but why do the recipes
work for me with mail from the kpix traffic list?  I did test
it for a few days, after all.  And I also fired up vi (well,
vim) more than a few times on the traffic reports themselves,
and I never saw ^Ms!

Dallman


"If you find a path with no obstacles, it probably does not lead to
anywhere."
        Thoughts of Rev. Sunnan Kubose, from _Zen in the Markets_  



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail