procmail

Re: Regexp fails in scoring recipe

2003-05-07 11:34:23


Dallman Ross wrote:

Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

Dallman Ross wrote:
        Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

                Dallman Ross wrote:

        All right, here is a way around that.  We define "not road work"
        and use it.  Here it is.  If you plug it in to your recipe, it
        should work just fine.
        
          SPACE  = " "
          TAB    = "       "
          WS     = "$SPACE$TAB"
        
          NOT_RW =          "[^R]|R[^o]|Ro[^a]|Roa[^d]|Road[^$WS]"
          NOT_RW =  "$NOT_RW|Road[$WS][^W]|Road[$WS]W[^o]"
          NOT_RW = "($NOT_RW|Road[$WS]Wo[^r]|Road[$WS]Wor[^k])"
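To see what the three NOT_RW assignments expand to, here is a rough sketch in Python rather than procmail. It assumes Python's re module is a close enough stand-in for procmail's regexp engine for this pattern, uses re.IGNORECASE to mirror procmail's default case-insensitive matching, and anchors the match at the start of the string. The helper name starts_with_not_rw is made up for illustration.

```python
import re

# Literal space and tab, as in the SPACE/TAB/WS assignments.
SPACE = " "
TAB = "\t"
WS = SPACE + TAB

# Same three-step build-up as the procmail assignments.
NOT_RW = "[^R]|R[^o]|Ro[^a]|Roa[^d]|Road[^" + WS + "]"
NOT_RW = NOT_RW + "|Road[" + WS + "][^W]|Road[" + WS + "]W[^o]"
NOT_RW = "(" + NOT_RW + "|Road[" + WS + "]Wo[^r]|Road[" + WS + "]Wor[^k])"


def starts_with_not_rw(text):
    """Return True if NOT_RW matches at the very start of text (hypothetical helper)."""
    # IGNORECASE is an assumption that mirrors procmail's default behaviour.
    return re.match(NOT_RW, text, re.IGNORECASE) is not None


if __name__ == "__main__":
    for sample in ("Road Work", "Road\tWork", "Road Closed", "Lane closed"):
        print(repr(sample), starts_with_not_rw(sample))
```

Each alternative stops at the first character that deviates from "Road Work", so nothing after the deviation is examined.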

I think that idea needs a tweak, assuming there are word anchors around NOT_RW:

In 2D: NOT_AB = "[^a].|a[^b]"

In 3D: NOT_ABC = "[^a]..|a[^b].|ab[^c]"

In 4D: NOT_ABCD = "[^a]...|a[^b]..|ab[^c].|abc[^d]"

and so on. I'll use this idea in a non-scoring recipe.
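The "and so on" can be spelled out mechanically. The following is a minimal sketch in Python (not procmail) that generates the fixed-width alternation for any word. It assumes the word consists of plain letters, so no character needs special treatment inside a bracket expression. The function name not_word_fixed_width is hypothetical.

```python
import re


def not_word_fixed_width(word):
    """Build the fixed-width "not this word" alternation for a plain-letter word."""
    alternatives = []
    for i, ch in enumerate(word):
        prefix = re.escape(word[:i])            # characters that still agree with the word
        padding = "." * (len(word) - i - 1)     # pad so every branch has the same length
        alternatives.append(f"{prefix}[^{ch}]{padding}")
    return "|".join(alternatives)


if __name__ == "__main__":
    for w in ("ab", "abc", "abcd"):
        print(w, "->", not_word_fixed_width(w))
```

Because every branch consumes exactly len(word) characters, the result can be placed between anchors without the branches matching strings of different lengths.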

No, I don't see it that way.  For NOT_AB, we don't care if
there is a second char at all if the first is not A.  Why
parse for the second char?  It just uses up cycles.
Here, we see that it's not A, and we stop.

As for anchors, I realize that "road work" is not to be
confused with

    "she was driving an overbroad working rig along I-80"
                             ^^^^^^^^^

But I purposely didn't code word boundaries in, because that
does not, imho, belong in the definition of "NOT_whatever",
but rather in the surrounding recipe's code.

For example, with NOT_AB defined as "([^a]|a[^b])", if we
know it's two letters and want to code it that way, we could
code

        * ()\<$NOT_AB\>

and that's that.  If you look at my searches for ROAD WORK
in the conditions I coded previously, you'll notice I always put
a $ at the end of WORK, because, without exception, every entry
I see in those traffic reports ends that way.  One day they could
slip up and put a space or a tab after it, but then I'll get
a false positive and see a report that I otherwise might
not have.  That is a small price to pay for a clean,
known word boundary.

If there's some specific reason to have a char count, then,
sure, go with "([^a].|a[^b])".

It appears that we are both wrong in at least one case. Suppose we use this recipe:

:0
* $ ()\<$NOT_AB\$
DID_NOT_FIND_AB

:0 E
DID_FIND_AB

and the sample text is

xyz a

If we use your definition

NOT_AB = "([^a]|a[^b])"

then \<[^a]$ is false and \<a[^b]$ is false so the condition is false and the mail is delivered to DID_FIND_AB, which is wrong because we did not find AB.

If we use my definition (with parentheses added)

NOT_AB = "([^a].|a[^b])"

then \<[^a].$ is false and \<a[^b]$ is false (as before) so the condition is false and the mail is delivered to DID_FIND_AB, which is wrong because we did not find AB. I can fix this case by changing the definition to

NOT_AB = "(.|[^a].|a[^b])"

If we change the sample text to

xyz bc

your definition still delivers to DID_FIND_AB (wrong) while mine delivers to DID_NOT_FIND_AB (right).
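The cases above can be checked mechanically. This is only an approximation in Python, not procmail itself: it assumes procmail's \< can be modelled as "start of line or one non-word character", that $ behaves like Python's $ under re.MULTILINE, and that matching is case-insensitive as in procmail's default. The names DEFINITIONS, not_ab_matches and results are made up for this sketch. A True result corresponds to delivery to DID_NOT_FIND_AB, a False result to DID_FIND_AB.

```python
import re

# The three candidate definitions discussed in the thread.
DEFINITIONS = {
    "dallman": r"([^a]|a[^b])",
    "kevin": r"([^a].|a[^b])",
    "kevin_fixed": r"(.|[^a].|a[^b])",
}


def not_ab_matches(definition, text):
    """Approximate the condition  * $ ()\\<$NOT_AB\\$  for one definition."""
    # Assumption: \< is modelled as line start or a single non-word character.
    pattern = r"(?:^|\W)" + definition + r"$"
    return re.search(pattern, text, re.IGNORECASE | re.MULTILINE) is not None


def results(text):
    """Evaluate every definition against one sample text."""
    return {name: not_ab_matches(d, text) for name, d in DEFINITIONS.items()}


if __name__ == "__main__":
    for sample in ("xyz a", "xyz bc", "xyz ab"):
        print(repr(sample), results(sample))
```

The third sample, "xyz ab", is not from the messages above; it is included as a sanity check that none of the definitions matches when AB really is the last word.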

Kevin

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail