procmail

RE: Regexp fails in scoring recipe

2003-05-16 06:31:34
Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

Dallman Ross wrote:
Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:
 
So even NOT_AB = "(.|[^a].|a[^b])" is not good enough. 
The right answer is

NOT_AB = "(.?|[^a].|a[^b]|(.*)(${NWB}..|${WB}[^a].|${WB}a[^b]))"

where $NWB is not a word boundary character and $WB is a word 
boundary character.


That's a very good recital of the problem and a decent proposal,
imho.  However, I *still* think it would work just as well (if not
better) this way:

 NOT_AB = "(.?|[^a]|a[^b])"

and so on.  I thought I understood the idiosyncrasies you laid out
in your explanation (elided here), but I confess I still don't see
why two characters must follow your $NWB above.

 

By way of example, we want to avoid matches on

    xyz ab

but we want to match each of the following

    xyz tab
    xyz crab
    xyz Schwab

The regexp "\<(.*)(${NWB}..)$" succeeds on all these 
examples. It's true that my definition of NOT_AB has an 
implicit assumption about the boundaries around it, and 
that's not a desirable characteristic. But it works, 
and that's a desirable thing. Here's a way to make the 
definition more compact via logical algebra:

NOT_AB = "(.?|(.*)${NWB}..|((.*)${WB})?([^a].|a[^b]))"

In any event, my thought is that $NOT_AB should stay a clean
definition, and the regex can be built around it to accommodate
any run of ${NWB} characters, from zero on up.


That would be great if it can be done.


Don't look now, but I think I may have solved it in a way
that leaves me satisfied.

It occurred to me while trying to sleep (often when I get good
ideas, but the computer has been turned off by then) :-p
that, since we are focusing on a rightward anchor, the line
end, we should develop our regex NOT_AB from the right, not
the left.  Woo-hoo, but that seemed like the key!  And I think
it is.

In the recipe below, $WS holds a space and a tab, and $NL holds a
newline.  I tested against a header instead of the body, and
created a header called X-AB-Check: for the purpose.

--------------------------------------------------
 NOT_AB = "(.|[^$WS]*([^b]|[^a]b|[^$WS]ab))"

 :0
 * $ ^X-AB-Check:((.*\<)?$NOT_AB)?$
 { LOG = "$NL NOT_AB $NL" }

 :0 E
 { LOG = "$NL AB $NL" }
--------------------------------------------------

So far, this works on everything I have tested it on, from an
empty header to just whitespace to one letter on up, including
the case where AB directly abuts the header's colon.
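
For anyone who wants to replay those cases without a procmail setup,
here is a rough Python sketch of the same test.  It is an emulation
under assumptions, not procmail itself: procmail's \< is taken to
consume one non-word character and is written as \W, $WS is expanded
to a literal space and tab, and re.IGNORECASE stands in for
procmail's default case-insensitive matching.  The function name
classify is made up for the illustration.

```python
import re

# Assumption: $WS is a space and a tab, as in the recipe.
WS = " \t"

# Same alternatives as the recipe's NOT_AB, built from the right-hand end.
NOT_AB = rf"(.|[^{WS}]*([^b]|[^a]b|[^{WS}]ab))"

# Assumption: procmail's \< consumes a non-word character, so \W is used here.
PATTERN = re.compile(rf"^X-AB-Check:((.*\W)?{NOT_AB})?$", re.IGNORECASE)


def classify(header_line: str) -> str:
    """Return "NOT_AB" when the pattern matches, otherwise "AB"."""
    return "NOT_AB" if PATTERN.search(header_line) else "AB"


if __name__ == "__main__":
    # Header values taken from the cases discussed in this thread.
    for value in ["", " ", " a", "ab", " xyz ab",
                  " xyz tab", " xyz crab", " xyz Schwab"]:
        print(repr(value), classify("X-AB-Check:" + value))
```

Only the values "ab" and " xyz ab" should come out as AB here; if
procmail's real \< behaves differently from the \W stand-in, the
results in procmail may differ.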

If the header itself is missing, then, yes, the condition fails and
the E recipe logs AB, a false result, but that seems to be beyond
the scope of the question.  (Maybe we can even solve that part,
though.)
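
One way to picture the missing-header case is to treat it as a third
outcome, checked before the regex is applied at all.  The Python
sketch below only illustrates that idea and is not a procmail
solution: the function name classify and the "MISSING" label are
invented here, \W again stands in for procmail's \< as an
assumption, and each header is examined on its own line so that the
negated classes cannot run across a newline.

```python
import re

# Assumption: $WS is a space and a tab, as in the recipe.
WS = " \t"
NOT_AB = rf"(.|[^{WS}]*([^b]|[^a]b|[^{WS}]ab))"

# Matches the part of the header after the colon; \W emulates procmail's \<.
VALUE_OK = re.compile(rf"((.*\W)?{NOT_AB})?", re.IGNORECASE)

PREFIX = "x-ab-check:"


def classify(header_lines):
    """Return "MISSING", "NOT_AB" or "AB" for a list of header lines."""
    for line in header_lines:
        if line.lower().startswith(PREFIX):
            value = line[len(PREFIX):]
            return "NOT_AB" if VALUE_OK.fullmatch(value) else "AB"
    # No X-AB-Check: header at all, reported separately instead of as AB.
    return "MISSING"


if __name__ == "__main__":
    print(classify(["Subject: hi"]))
    print(classify(["Subject: hi", "X-AB-Check: xyz ab"]))
    print(classify(["X-AB-Check: xyz tab"]))
```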

Dallman


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail