procmail

RE: Regexp fails in scoring recipe

2003-05-11 15:21:55
David W. Tamkin wrote:


parv asked,

I am confused, why would regex "a[^b]$" not match text 
"xyz a", thus turning condition to be true?

There's nothing there to match [^b].  If it had been [^b]* or [^b]? 
there would have been a match, because null can match either of those 
expressions, but null can't match [^b]; that requires exactly one 
character, which cannot be newline or b (or B if the matching is case-
insensitive).

[^b] needs a character to match to.
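
The difference between a mandatory class and an optional one can be checked in almost any regex engine. Below is a minimal Python sketch; Python's `re` is only a stand-in here (its `$` is a true anchor, unlike procmail's), but the point about `[^b]` needing exactly one character carries over.

```python
import re

text = "xyz a"

# [^b] must consume exactly one character, and nothing follows the "a".
mandatory = re.search(r"a[^b]$", text)

# [^b]* and [^b]? are allowed to match the empty string, so both succeed.
star = re.search(r"a[^b]*$", text)
optional = re.search(r"a[^b]?$", text)

print(bool(mandatory), bool(star), bool(optional))  # prints: False True True
```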

Thanks, David, for stating it so succinctly.  I was at first puzzled,
and then later intrigued, by Kevin's observation.  I'd wanted to
look further at the circumstance and formulate a reply, but had
to postpone it until this evening.

In procmail, \<, $, \>, and ^ are not anchors; 
they require actual characters to match to.  Even 
^^ matches on the putative newline, not on the transition.
Usually the result is identical, but sometimes 
thinking of those expressions as anchors to transition points 
(as they are in perl or egrep, but not in procmail) will get you 
into trouble.
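
One way to picture the "no anchors" rule is to rewrite each procmail token as something that consumes a real character. The Python sketch below is only a rough model under stated assumptions: the helper names `procmail_like` and `procmail_search` are hypothetical, the word-character set is assumed to be letters, digits and underscore, and the text is padded with newlines to stand in for the putative newlines procmail sees around the search area.

```python
import re

WORD = "A-Za-z0-9_"  # assumed set of procmail "word" characters

def procmail_like(pattern):
    """Rewrite a few procmail tokens so each one consumes a real character.

    Rough model only: handles \\<, \\> and an unescaped $, nothing else.
    """
    return (pattern
            .replace(r"\<", "[^" + WORD + "]")   # character before a word
            .replace(r"\>", "[^" + WORD + "]")   # character after a word
            .replace("$", "\n"))                 # an actual newline

def procmail_search(pattern, text):
    # Pad with newlines to imitate the newlines around the search area.
    return re.search(procmail_like(pattern), "\n" + text + "\n") is not None

# Perl-style \b is a zero-width transition, so the space is still available.
print(re.search(r"foo\b bar", "foo bar") is not None)

# In the model, \> already consumed the space, so a second one is required.
print(procmail_search(r"foo\> bar", "foo bar"))
print(procmail_search(r"foo\>bar", "foo bar"))
```

Under this model `foo\> bar` fails on "foo bar" while `foo\>bar` succeeds, which is the kind of trouble the anchor mindset leads to.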

Good to have repeated this old lesson, as well.  You enlightened
me with it about a year ago.  One tiny quibble I might have with
your wording is that I'd consider that an actual character can
act as an "anchor," as well, in my mental sense of what an anchor
is.  But, yeah, as you say, not a "word anchor" in the sense
that perl uses them.

Now to Kevin's (and my) problem.  Replying to my reply to him,
he wrote:

I think that idea needs a tweak assuming some word anchors 
around NOT_RW:

In 2D: NOT_AB = "[^a].|a[^b]"

In 3D: NOT_ABC = "[^a]..|a[^b].|ab[^c]"

In 4D: NOT_ABCD = "[^a]...|a[^b]..|ab[^c].|abc[^d]"

and so on. I'll use this idea in a non-scoring recipe.
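
The "and so on" can be spelled out mechanically. The Python sketch below builds the padded alternation for any word; the function name `not_word` is hypothetical, and it assumes the word contains only plain characters (no regex metacharacters).

```python
def not_word(word):
    """Build the padded alternation: every branch spans len(word) characters."""
    branches = []
    for i, ch in enumerate(word):
        prefix = word[:i]                       # characters that did match
        padding = "." * (len(word) - i - 1)     # fill out the remaining width
        branches.append(prefix + "[^" + ch + "]" + padding)
    return "|".join(branches)

print(not_word("ab"))    # [^a].|a[^b]
print(not_word("abc"))   # [^a]..|a[^b].|ab[^c]
print(not_word("abcd"))  # [^a]...|a[^b]..|ab[^c].|abc[^d]
```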

No, I don't see it that way.  For NOT_AB, we don't care if
there is a second char at all if the first is not A.  Why
parse for the second char?  It just uses up cycles.
Here, we see that it's not A, and we stop.

[. . . .]
             If you'll notice on my search for ROAD WORK
in previous conditions I coded, I always put a $ at the end
of WORK, because, without exception, every entry I see in those
traffic reports happens that way.  One day they could slip
up and put a space or a tab thereafter, but then I'll get
a false positive and see a report that I otherwise might
not have -- not a huge detriment to the trade-off of a clean,
known word boundary.


It appears that we are both wrong in at least one case. Suppose we use
this recipe:

:0
* $ ()\<$NOT_AB\$
DID_NOT_FIND_AB

:0 E
DID_FIND_AB

and the sample text is

xyz a

If we use your definition

NOT_AB = "([^a]|a[^b])"

then \<[^a]$ is false and \<a[^b]$ is false so the condition is false 
and the mail is delivered to DID_FIND_AB, which is wrong because we 
did not find AB.

If we use my definition (with parentheses added)

NOT_AB = "([^a].|a[^b])"

then \<[^a].$ is false and \<a[^b]$ is false (as before) so the 
condition is false and the mail is delivered to DID_FIND_AB, which is 
wrong because we did not find AB. I can fix this case by changing the 
definition to

NOT_AB = "(.|[^a].|a[^b])"

If we change the sample text to

xyz bc

your definition still delivers to DID_FIND_AB (wrong) while mine 
delivers to DID_NOT_FIND_AB (right).
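
The three definitions can be compared side by side with a rough Python model of the condition. This is a sketch under assumptions, not procmail itself: the helpers `to_python` and `condition` are hypothetical, `\<` and `$` are rewritten to consume a real character, negated classes are kept from matching a newline, and the sample text is padded with newlines.

```python
import re

WORD = "A-Za-z0-9_"  # assumed set of procmail "word" characters

def to_python(pattern):
    """Approximate a procmail regex with a Python one (rough model only)."""
    # A negated class such as [^b] must not match a newline.
    pattern = re.sub(r"\[\^([^\]]+)\]",
                     lambda m: "[^" + m.group(1) + "\n]", pattern)
    # The word-start token consumes the non-word character before the word.
    pattern = pattern.replace(r"\<", "[^" + WORD + "]")
    # The dollar sign consumes an actual newline.
    return pattern.replace("$", "\n")

def condition(not_ab, text):
    """Model the recipe condition: word start, NOT_AB, then end of line."""
    regex = to_python(r"\<" + not_ab + "$")
    return re.search(regex, "\n" + text + "\n") is not None

definitions = {
    "unpadded":        "([^a]|a[^b])",
    "padded":          "([^a].|a[^b])",
    "padded + single": "(.|[^a].|a[^b])",
}

for text in ("xyz a", "xyz bc", "xyz ab"):
    for name, not_ab in definitions.items():
        folder = "DID_NOT_FIND_AB" if condition(not_ab, text) else "DID_FIND_AB"
        print(f"{text!r:10} {name:16} -> {folder}")
```

In this model only the last definition sends all three sample lines to the expected folder.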


This caused me to burn up some brain cycles, I admit.  Wow.
Good catch.

I still don't like padding $NOT_AB.  I think the upshot of what
Kevin uncovered is that the technique simply does not work right
when a "$NOT" is sandwiched between two "$IS"es.
I have played with it for too long over the last few days, and
it kind of presses on the brain after a while.  :)  My inclination
would be to revert to what works and is elegant.  For me, that's
this one:

 :0:
 * $  1^1  B ?? ^\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>
 * $ -1^1  B ?? ^\[ ()[0-9]:.* ROAD +WORK$(.+$)*(.*\<)?\
                $LOCATIONS\>
 KEEP_ME

 :0 E  # else
 { HOST = byebye }




Later, Kevin added in reply to parv:

BTW, there is a mistake in the recipe. We have been 
discussing searching the message body instead of the 
headers in this thread.

For the real thing, yes.  For the tests we've been doing with
$NOT_AB, I've been content to run it in headers, and I manually
edited a sample email's headers in order to accommodate my tests.
(I created an "X-AB-Test:" header, sticking, e.g., "xyz a" at the
end of it.)

-- 
        "Weltbedenkend, ortlich lenkend!"
                -- Original von W. Dallman Ross



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail