Re: Regexp fails in scoring recipe



Dallman Ross wrote:

Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

Dallman Ross wrote:

        Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

                Dallman Ross wrote:

        All right, here is a way around that.  We define "not road work"
        and use it.  Here it is.  If you plug it in to your recipe, it
        should work just fine.
        
          SPACE  = " "
          TAB    = "       "
          WS     = "$SPACE$TAB"
        
          NOT_RW =          "[^R]|R[^o]|Ro[^a]|Roa[^d]|Road[^$WS]"
          NOT_RW =  "$NOT_RW|Road[$WS][^W]|Road[$WS]W[^o]"
          NOT_RW = "($NOT_RW|Road[$WS]Wo[^r]|Road[$WS]Wor[^k])"
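
A rough illustration of how the NOT_RW alternation above is put together, written in Python rather than procmail. The helper name `not_phrase` and its `ws` parameter are made up for this sketch, and it assumes the phrase contains only plain letters and single spaces. Each branch is "the phrase matched so far, followed by anything except the next expected character":

```python
def not_phrase(phrase, ws=" \t"):
    """Build an '([^R]|R[^o]|Ro[^a]|...)' alternation for `phrase`.

    Hypothetical helper: a space in `phrase` stands for "space or tab",
    mirroring the $WS variable in the recipe above.
    """
    branches = []
    prefix = ""
    for ch in phrase:
        if ch == " ":
            # Anything that is not whitespace breaks the phrase here.
            branches.append(f"{prefix}[^{ws}]")
            prefix += f"[{ws}]"
        else:
            # Anything other than the expected letter breaks it here.
            branches.append(f"{prefix}[^{ch}]")
            prefix += ch
    return "(" + "|".join(branches) + ")"


if __name__ == "__main__":
    print(not_phrase("Road Work"))
```

For "Road Work" this yields the same nine branches as the three NOT_RW assignments, with $WS expanded to a literal space and tab.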

I think that idea needs a tweak, assuming some word anchors around NOT_RW:


In 2D: NOT_AB = "[^a].|a[^b]"

In 3D: NOT_ABC = "[^a]..|a[^b].|ab[^c]"

In 4D: NOT_ABCD = "[^a]...|a[^b]..|ab[^c].|abc[^d]"

and so on. I'll use this idea in a non-scoring recipe.
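
The fixed-width patterns above follow a mechanical rule: branch i keeps the first i letters, negates letter i+1, and pads with dots so every branch has the same length as the word. A minimal Python sketch of that rule, where the function name `not_word_fixed_width` is invented for illustration and the word is assumed to contain only plain letters:

```python
def not_word_fixed_width(word):
    """Build a '[^a]..|a[^b].|ab[^c]' style alternation for `word`.

    Hypothetical helper; every branch matches exactly len(word) characters.
    """
    n = len(word)
    branches = []
    for i, ch in enumerate(word):
        prefix = word[:i]            # letters that did match so far
        padding = "." * (n - i - 1)  # filler up to the full word length
        branches.append(f"{prefix}[^{ch}]{padding}")
    return "|".join(branches)


if __name__ == "__main__":
    for w in ("ab", "abc", "abcd"):
        print(w, "->", not_word_fixed_width(w))
```

Calling it with "ab", "abc" and "abcd" reproduces the 2D, 3D and 4D definitions as written.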


No, I don't see it that way.  For NOT_AB, we don't care if
there is a second char at all if the first is not A.  Why
parse for the second char?  It just uses up cycles.
Here, we see that it's not A, and we stop.

As for anchors, I realize that "road work" is not to be
confused with, "she was driving an overbroad working rig along I-80"
........................................^^^^^^^^^
But I purposely didn't code word boundaries in, because
that does not, imho, belong in the definition of "NOT_whatever";
but rather in the surrounding recipe's code.


For example, with NOT_AB defined as "([^a]|a[^b])", if we
know it's two letters and want to code it that way, we could
code

        * $ ()\<$NOT_AB\>

and that's that.  If you'll notice on my search for ROAD WORK
in previous conditions I coded, I always put a $ at the end
of WORK, because, without exception, every entry I see in those
traffic reports happens that way.  One day they could slip
up and put a space or a tab thereafter, but then I'll get
a false positive and see a report that I otherwise might
not have -- not a huge detriment to the trade-off of a clean,
known word boundary.

If there's some specific reason to have a char count, then,
sure, go with "([^a].|a[^b])".

It appears that we are both wrong in at least one case. Suppose we
use this recipe:


:0
* $ ()\<$NOT_AB\$
DID_NOT_FIND_AB

:0 E
DID_FIND_AB

and the sample text is

xyz a

If we use your definition

NOT_AB = "([^a]|a[^b])"

then \<[^a]$ is false and \<a[^b]$ is false, so the condition is
false and the mail is delivered to DID_FIND_AB, which is wrong
because we did not find AB.


If we use my definition (with parentheses added)

NOT_AB = "([^a].|a[^b])"

then \<[^a].$ is false and \<a[^b]$ is false (as before), so the
condition is false and the mail is delivered to DID_FIND_AB, which
is wrong because we did not find AB. I can fix this case by changing
the definition to


NOT_AB = "(.|[^a].|a[^b])"

If we change the sample text to

xyz bc

your definition still delivers to DID_FIND_AB (wrong) while mine
delivers to DID_NOT_FIND_AB (right).
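
The case analysis above can be checked mechanically. The following is only an approximation in Python, not procmail itself: it assumes that procmail's \< consumes one non-word character (or sits at the start of the text) rather than being a zero-width boundary, and models it as `(?:^|\W)`. The names `DEFINITIONS`, `condition_matches` and `folder` are invented for this sketch.

```python
import re

# The three NOT_AB definitions discussed in this thread.
DEFINITIONS = {
    "dallman": "([^a]|a[^b])",
    "kevin": "([^a].|a[^b])",
    "kevin_fixed": "(.|[^a].|a[^b])",
}


def condition_matches(not_ab, text):
    """Approximate the condition '\\<$NOT_AB$' against `text`.

    Assumption: (?:^|\\W) stands in for procmail's \\<, which is taken
    here to consume a non-word character instead of matching zero width.
    """
    pattern = r"(?:^|\W)" + not_ab + r"$"
    return re.search(pattern, text) is not None


def folder(not_ab, text):
    """Return the folder the two-recipe example would deliver to."""
    if condition_matches(not_ab, text):
        return "DID_NOT_FIND_AB"
    return "DID_FIND_AB"


if __name__ == "__main__":
    for sample in ("xyz a", "xyz bc", "xyz ab"):
        for name, definition in DEFINITIONS.items():
            print(f"{sample!r:10} {name:12} {folder(definition, sample)}")
```

Under these assumptions, "xyz a" reaches DID_NOT_FIND_AB only with the corrected definition, "xyz bc" reaches it with both fixed-width definitions but not with "([^a]|a[^b])", and the extra control sample "xyz ab" (not from the thread) lands in DID_FIND_AB in all three cases.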


Kevin

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail