procmail
[Top] [All Lists]

Re: Regexp fails in scoring recipe

2003-05-12 19:48:45


Dallman Ross wrote:

I still don't like padding $NOT_AB.  The application of it
simply does not work right when "$NOT" is sandwiched between
two "$IS"es, I think is the final result of what Kevin uncovered.
I have played with it for too long over the last few days, and
it kind of presses on the brain after a while. :)

Dallman,

I have to agree that the regexp is difficult to design (and when you see it, you'll probably think its ugly too). I'll present the solution for NOT_ROAD_WORK below after a bit of background.

The $NOT_AB expression is an application of DeMorgan's law (see some diagrams at http://www.math.ubc.ca/~feldman/m302/deMorgan.pdf). For our problem of finding NOT_AB, let X be the first character and Y be the second character, then in C notation,

DeMorgan's Law: !(X == 'a' && Y == 'b') equals (X != 'a' || Y != 'b')

If we draw this as a 2D grid with X horizontal and Y vertical, we see that the two regions (expressed on the right side) overlap. To avoid the overlap, it may be desirable to apply an identity:

1 equals (X == 'a' || X != 'a')

We can apply this to the right side above using the AND operator to get

!(X == 'a' && Y == 'b') equals (X != 'a' || (X == 'a' && Y != 'b'))

When we look at our 2D grid, the final expression makes good intuitive sense. For the "X != 'a'" part, what is Y? It can be any valid value. We can express this with a wild card in the regexp.

When we implement this logic with a regexp, we need to consider strings of length other that two between a word boundary character on the left and a newline on the right. So even NOT_AB = "(.|[^a].|a[^b])" is not good enough. The right answer is

NOT_AB = "(.?|[^a].|a[^b]|(.*)(${NWB}..|${WB}[^a].|${WB}a[^b]))"

where $NWB is not a word boundary character and $WB is a word boundary character.

This covers all the cases: strings of length 0, 1, 2, 3, ..., between word boundary and newline. After all, we are trying to match all strings that are not "ab", and all strings of length other than 2 match that description. I think this is exhaustive. It already looks ugly for two characters. Perhaps there is some way to combine the regexps for length 2 and lengths 3, 4, ..., but that's beyond me right now.

One more thing before the solution for NOT_ROAD_WORK. Here are some examples of strings that we want to match because they are not " road work":

xyz
xyz a
xyz ab
xyz abc
xyz ork
xyz abcd
xyz work
xyz abcde
xyz abcdef
xyz abcdefg
xyz artwork
xyz abcdefgh
xyz homework
xyz abcdefghi
xyz housework
xyz road worm
xyz abcdefghij
xyz go to work
xyz broad work
xyz abcdefghijk
xyz abroad work

This list is not exhaustive, but any solution must match these strings.

Here's the full solution:

WSPC    = "     "               # whitespace: space + tab
SPC     = "[$WSPC]"             # Regexp: space + tab
NSPC    = "[^$WSPC]"            # negation
X         = "($[        ]*|[    ]+)"   # Optional word break

LOCATIONS = "(Palo${X}Alto|Stanford|Menlo${X}Park|\\
            Redwood${X}City|Mountain${X}View|Dumbarton)"

NEL       = "(.*($NSPC).*\$)"   # Non-empty line

NOT_WORK4 = "([^w]...|w[^o]..|wo[^r].|wor[^k])"
NOT_ROAD_WORK9 = "(\\
             [^r]........|r[^o].......|ro[^a]......|roa[^d].....|\\
             road${NSPC}....|road${SPC}[^w]...|road${SPC}w[^o]..|\\
             road${SPC}wo[^r].|road${SPC}wor[^k])"
NOT_ROAD_WORK10 = "(\\
             ${NSPC}.........|${SPC}[^r]........|${SPC}r[^o].......|\\
             ${SPC}ro[^a]......|${SPC}roa[^d].....|\\
             ${SPC}road${NSPC}....|${SPC}road${SPC}[^w]...|\\
             ${SPC}road${SPC}w[^o]..|${SPC}road${SPC}wo[^r].|\\
             ${SPC}road${SPC}wor[^k])"
NOT_ROAD_WORK = "(.?.?.?|$NOT_WORK4|.?.?.?.....|$NOT_ROAD_WORK9|\\
             (.*)$NOT_ROAD_WORK10)"

:0 * ^From:.*(\<)KPIX\.Traffic\.Router
* Precedence: bulk
{
:0 * $ B ?? (\$\$)\[ ?[0-2]?[0-9]:(.*)(\<)($NOT_ROAD_WORK)(\$)\
          ($NEL)*((.*)(\<))?$LOCATIONS\>
 {
   KEEP_IT = 1
 }

 :0 E
 /dev/null
}

I started using this solution on Friday. As of today, it has worked for eight traffic reports in production mode.

My inclination
would be to revert to what works and is elegant.  For me, that's
this one:

:0:
* $  1^1  B ?? ^\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>
* $ -1^1  B ?? ^\[ ()[0-9]:.* ROAD +WORK$(.+$)*(.*\<)?\
               $LOCATIONS\>
KEEP_ME

:0 E  # else
{ HOST = byebye }

I agree that this solution is more elegant. Your scoring solution has several improvements over my original recipe that I posted in the root of this thread, but it follows the same principle of taking the difference between the number of occurrences of any events in locations and that of road work events in locations.

My original recipe stopped working, which is why this thread started. I may never find the answer to the question of why the original scoring recipe behaves differently in production mode vs. test mode. Fortunately, this whole exchange has been enlightening, and it yielded a non-scoring solution, which I previously thought was too difficult to even consider. I'll use the non-scoring solution for a while to see if any bugs pop up. But I'll revert to scoring eventually, as long as it scoring works in some form on my mail server.

Thanks, Dallman.

Kevin



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail