Re: Regexp fails in scoring recipe



Dallman Ross wrote:

I still don't like padding $NOT_AB.  The application of it
simply does not work right when "$NOT" is sandwiched between
two "$IS"es, I think is the final result of what Kevin uncovered.
I have played with it for too long over the last few days, and

it kind of presses on the brain after a while. :)


Dallman,

I have to agree that the regexp is difficult to design (and when you seeit, you'll probably think its ugly too). I'll present the solution forNOT_ROAD_WORK below after a bit of background.

The $NOT_AB expression is an application of DeMorgan's law (see somediagrams at http://www.math.ubc.ca/~feldman/m302/deMorgan.pdf). For ourproblem of finding NOT_AB, let X be the first character and Y be thesecond character, then in C notation,


DeMorgan's Law: !(X == 'a' && Y == 'b') equals (X != 'a' || Y != 'b')

If we draw this as a 2D grid with X horizontal and Y vertical, we seethat the two regions (expressed on the right side) overlap. To avoid theoverlap, it may be desirable to apply an identity:


1 equals (X == 'a' || X != 'a')

We can apply this to the right side above using the AND operator to get

!(X == 'a' && Y == 'b') equals (X != 'a' || (X == 'a' && Y != 'b'))

When we look at our 2D grid, the final expression makes good intuitivesense. For the "X != 'a'" part, what is Y? It can be any valid value. Wecan express this with a wild card in the regexp.

When we implement this logic with a regexp, we need to consider stringsof length other that two between a word boundary character on the leftand a newline on the right. So even NOT_AB = "(.|[^a].|a[^b])" is notgood enough. The right answer is


NOT_AB = "(.?|[^a].|a[^b]|(.*)(${NWB}..|${WB}[^a].|${WB}a[^b]))"

where $NWB is not a word boundary character and $WB is a word boundarycharacter.

This covers all the cases: strings of length 0, 1, 2, 3, ..., betweenword boundary and newline. After all, we are trying to match all stringsthat are not "ab", and all strings of length other than 2 match thatdescription. I think this is exhaustive. It already looks ugly for twocharacters. Perhaps there is some way to combine the regexps for length2 and lengths 3, 4, ..., but that's beyond me right now.

One more thing before the solution for NOT_ROAD_WORK. Here are someexamples of strings that we want to match because they are not " road work":


xyz
xyz a
xyz ab
xyz abc
xyz ork
xyz abcd
xyz work
xyz abcde
xyz abcdef
xyz abcdefg
xyz artwork
xyz abcdefgh
xyz homework
xyz abcdefghi
xyz housework
xyz road worm
xyz abcdefghij
xyz go to work
xyz broad work
xyz abcdefghijk
xyz abroad work

This list is not exhaustive, but any solution must match these strings.

Here's the full solution:

WSPC    = "     "               # whitespace: space + tab
SPC     = "[$WSPC]"             # Regexp: space + tab
NSPC    = "[^$WSPC]"            # negation
X         = "($[        ]*|[    ]+)"   # Optional word break

LOCATIONS = "(Palo${X}Alto|Stanford|Menlo${X}Park|\\
            Redwood${X}City|Mountain${X}View|Dumbarton)"

NEL       = "(.*($NSPC).*\$)"   # Non-empty line

NOT_WORK4 = "([^w]...|w[^o]..|wo[^r].|wor[^k])"
NOT_ROAD_WORK9 = "(\\
             [^r]........|r[^o].......|ro[^a]......|roa[^d].....|\\
             road${NSPC}....|road${SPC}[^w]...|road${SPC}w[^o]..|\\
             road${SPC}wo[^r].|road${SPC}wor[^k])"
NOT_ROAD_WORK10 = "(\\
             ${NSPC}.........|${SPC}[^r]........|${SPC}r[^o].......|\\
             ${SPC}ro[^a]......|${SPC}roa[^d].....|\\
             ${SPC}road${NSPC}....|${SPC}road${SPC}[^w]...|\\
             ${SPC}road${SPC}w[^o]..|${SPC}road${SPC}wo[^r].|\\
             ${SPC}road${SPC}wor[^k])"
NOT_ROAD_WORK = "(.?.?.?|$NOT_WORK4|.?.?.?.....|$NOT_ROAD_WORK9|\\
             (.*)$NOT_ROAD_WORK10)"

:0* ^From:.*(\<)KPIX\.Traffic\.Router

* Precedence: bulk
{

:0* $ B ?? (\$\$)\[ ?[0-2]?[0-9]:(.*)(\<)($NOT_ROAD_WORK)(\$)\

          ($NEL)*((.*)(\<))?$LOCATIONS\>
 {
   KEEP_IT = 1
 }

 :0 E
 /dev/null
}

I started using this solution on Friday. As of today, it has worked foreight traffic reports in production mode.

My inclination
would be to revert to what works and is elegant.  For me, that's
this one:

:0:
* $  1^1  B ?? ^\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>
* $ -1^1  B ?? ^\[ ()[0-9]:.* ROAD +WORK$(.+$)*(.*\<)?\
               $LOCATIONS\>
KEEP_ME

:0 E  # else
{ HOST = byebye }

I agree that this solution is more elegant. Your scoring solution hasseveral improvements over my original recipe that I posted in the rootof this thread, but it follows the same principle of taking thedifference between the number of occurrences of any events in locationsand that of road work events in locations.

My original recipe stopped working, which is why this thread started. Imay never find the answer to the question of why the original scoringrecipe behaves differently in production mode vs. test mode.Fortunately, this whole exchange has been enlightening, and it yielded anon-scoring solution, which I previously thought was too difficult toeven consider. I'll use the non-scoring solution for a while to see ifany bugs pop up. But I'll revert to scoring eventually, as long as itscoring works in some form on my mail server.


Thanks, Dallman.

Kevin



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail