procmail

RE: Regexp fails in scoring recipe

2003-05-07 04:35:38
Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

Dallman Ross wrote:

      Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

              Dallman Ross wrote:


what I meant with the plus signs in my line:
     
* $ B ?? ^\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
............................^...........................^
     
"One or more chars on this line" is essentially what's 
in the parens there, in pseudo-code language.

.+ matches whitespace. I had wanted some optional lines that 
are "visibly non-empty" if they are present; i.e. there must 
be some visible character on the line and not just white space.

Okay.  No objection from this end.

There is a known bug in procmail concerning putting [some
iteration of B and H flags] . . . on the initial
recipe line, [such that] . . . it then can't be turned off in 
subsequent recipes.

[. . . .]
could it be that your rc is invoking H or B flags on recipes and
then running into this bug?  Instead of
     
             :0 B
             * body condition
      Try
             :0
             * B ?? body condition
     

I didn't know about that bug. I'll have to revisit all my recipes.

Okay.  If any of your odd results change, do check back in here.
I'm both puzzled by and curious about what you've experienced.


     All right, here is a way around that.  We define "not road work"
     and use it.  Here it is.  If you plug it in to your recipe, it
     should work just fine.
     
       SPACE  = " "
       # TAB must hold one literal tab character between the quotes
       TAB    = "	"
       WS     = "$SPACE$TAB"
     
       NOT_RW =          "[^R]|R[^o]|Ro[^a]|Roa[^d]|Road[^$WS]"
       NOT_RW =  "$NOT_RW|Road[$WS][^W]|Road[$WS]W[^o]"
       NOT_RW = "($NOT_RW|Road[$WS]Wo[^r]|Road[$WS]Wor[^k])"
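To see how the branch-per-position alternation above behaves, here is a minimal sketch in Python's re module. It is a stand-in for illustration, not procmail's own regexp engine, and the two differ in details. The names WS and diverges_from_road_work are hypothetical, and `[ \t]` plays the role of [$WS].

```python
import re

WS = " \t"  # assumption: stand-in for procmail's $WS (one space, one tab)

# Same alternation as NOT_RW above: one branch per position at which the
# text can first differ from "Road Work".
NOT_RW = (
    "(?:[^R]|R[^o]|Ro[^a]|Roa[^d]|Road[^" + WS + "]"
    "|Road[" + WS + "][^W]|Road[" + WS + "]W[^o]"
    "|Road[" + WS + "]Wo[^r]|Road[" + WS + "]Wor[^k])"
)

def diverges_from_road_work(text):
    """True if some character at the start of text differs from 'Road Work'.

    Every branch needs an actual differing character, so an empty or
    truncated string (for example 'Road Wor') does not match either.
    """
    return re.match(NOT_RW, text) is not None

if __name__ == "__main__":
    for sample in ["Road Work", "Road\tWork", "Road Closed", "Roadwork", "Accident"]:
        print(repr(sample), diverges_from_road_work(sample))
```

Only the two "Road Work" samples fail to match. Every other sample hits the branch for the first position where it differs.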


I think that idea needs a tweak, assuming there are word anchors 
around NOT_RW:

In 2D: NOT_AB = "[^a].|a[^b]"

In 3D: NOT_ABC = "[^a]..|a[^b].|ab[^c]"

In 4D: NOT_ABCD = "[^a]...|a[^b]..|ab[^c].|abc[^d]"

and so on. I'll use this idea in a non-scoring recipe.

No, I don't see it that way.  For NOT_AB, we don't care if
there is a second char at all if the first is not A.  Why
parse for the second char?  It just uses up cycles.
Here, we see that it's not A, and we stop.

As for anchors, I realize that "road work" is not to be
confused with, "she was driving an overbroad working rig 
along I-80".............................^^^^^^^^^.
But I purposely didn't code word boundaries in, because 
that does not, imho, belong in the definition of "NOT_whatever"; 
but rather in the surrounding recipe's code.

For example, with NOT_AB defined as "([^a]|a[^b])", if we
know it's two letters and want to code it that way, we could
code

        * ()\<$NOT_AB\>

and that's that.  As you may have noticed, in the conditions I
coded earlier to search for ROAD WORK, I always put a $ at the
end of WORK, because, without exception, every entry I see in
those traffic reports ends that way.  One day they could slip
up and put a space or a tab thereafter, but then I'll get
a false positive and see a report that I otherwise might
not have, which is a small price to pay for a clean,
known word boundary.

If there's some specific reason to have a char count, then,
sure, go with "([^a].|a[^b])".
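The difference between the two forms of NOT_AB can be checked with a short sketch, again in Python's re module as a stand-in rather than procmail itself. Here re.fullmatch approximates a condition anchored at both ends and re.match one anchored only at the start. procmail's \< and \> do not behave exactly like this, so treat the result as an illustration of the logic only. The names SHORT, FIXED, whole_word and as_prefix are hypothetical.

```python
import re

SHORT = "(?:[^a]|a[^b])"   # stops as soon as a difference is seen
FIXED = "(?:[^a].|a[^b])"  # always consumes exactly two characters

def whole_word(pattern, word):
    """True if pattern matches the entire word (anchored at both ends)."""
    return re.fullmatch(pattern, word) is not None

def as_prefix(pattern, word):
    """True if pattern matches at the start of word (no anchor at the end)."""
    return re.match(pattern, word) is not None

if __name__ == "__main__":
    for word in ["ab", "ac", "cd"]:
        print(word,
              "short/prefix:", as_prefix(SHORT, word),
              "short/whole:", whole_word(SHORT, word),
              "fixed/whole:", whole_word(FIXED, word))
```

Used as a prefix test, the short form rejects "ab" and accepts the other two words, which is all that "not AB" needs. Anchored at both ends, the short form fails on "cd" because its first branch consumes only one character. In that case the fixed-width form is the one to use.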

-- 
        "Weltbedenkend, ortlich lenkend!"
                -- Original von W. Dallman Ross

