Re: Matching Word Boundaries with \< and \>, and Extended Character Set

At 02:58 PM 4/2/99 +0200, Ralph SOBEK wrote:

      I think that I have found a deficiency in the Word Boundary
match operators, \< and \>, of procmail v3.11pre7.


They are documented as not being quite the same as egrep's; for example,
procmail's must actually match a newline or a non-word character, rather
than an empty position between characters.

What to do with, for example, ISO-Latin-1 characters at word
boundaries?  They are not considered as being in the same class as
normal alphabetic letters.

For example,

:0 B
* \<lance\>

will match against the French word "élance".


How does procmail know that's a French word and not a Martian delimiter
0xE9 before "lance"?

Can the match algorithm be corrected to allow extended character sets?


It's not a bug, and I doubt it would be desirable to break existing
.procmailrc files throughout the solar system.

Of course, "\<" could be replaced by a big "[...]" enumerating the
special non-alphabetical characters, but this may be more costly in
terms of execution-speed.


Actually, they already ARE just a synonym for a `[^a-zA-Z0-9_]' that
also matches newlines (probably something like "($|[^a-zA-Z0-9_])"
but forgive me if I've mistyped it) so rolling your own is probably
equally efficient.  I haven't looked, but I suspect *any* [...] is just
a table lookup.  Remember, you can put your big ugly into a variable,
and then use the variable in your conditions for readability.
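
To illustrate the table-lookup guess: the following is a minimal sketch in
C, and it is NOT procmail's actual source.  The names word_tab,
init_word_tab and is_delim are made up for illustration.  It shows why
testing a byte against a [...] class should cost the same however many
characters the class lists, and how adding 0xE9 to the class changes the
verdict on the "é" in "élance".

```c
#include <string.h>

/* Hypothetical lookup table, not taken from procmail:
   word_tab[b] is 1 when byte b counts as a "word" character. */
static unsigned char word_tab[256];

/* Fill the table with a-z, A-Z, 0-9 and '_', plus any extra bytes
   the caller wants treated as letters (e.g. ISO-Latin-1 accents). */
static void init_word_tab(const char *extra)
{
    const char *base = "abcdefghijklmnopqrstuvwxyz"
                       "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                       "0123456789_";
    const char *p;

    memset(word_tab, 0, sizeof word_tab);
    for (p = base; *p; p++)
        word_tab[(unsigned char)*p] = 1;
    for (p = extra; *p; p++)
        word_tab[(unsigned char)*p] = 1;
}

/* A byte delimits a word when it is not in the table.  Newline is
   never in the table, so it always counts as a delimiter.  The cost
   is one array index regardless of how large the class is. */
static int is_delim(int c)
{
    return !word_tab[(unsigned char)c];
}
```

With init_word_tab("") the byte 0xE9 is a delimiter, which is why "lance"
is found inside "élance"; with init_word_tab("\xE9") it is treated as a
letter and is no longer a delimiter.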


<speculation>
Now, if procmail were taking enhancements (I don't think it is), the
way to do it might be to allow setting a variable LOCALE, and then
have \< and \> match any character c for which
   c != "_" && !isalnum(c)
although procmail support for locales could get messy (it should, for
example, impact the D flag among other things), and the list of supported
locales might vary from installation to installation, so exchanging
or moving recipes around could become tricky.
</speculation>
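
A rough sketch in C of the test speculated about above, assuming the
standard setlocale() and isalnum() from the C library.  The function names
use_locale and is_word_delim are hypothetical, and nothing here reflects
how procmail is actually written.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical: apply a LOCALE-style setting to character
   classification.  Returns 1 on success, 0 if the named locale is
   not available on this installation. */
static int use_locale(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}

/* The proposed rule: a character delimits a word when it is neither
   '_' nor alphanumeric in the current locale.  The value is reduced
   to unsigned char first, because passing a negative char value to
   isalnum() is undefined behaviour. */
static int is_word_delim(int c)
{
    unsigned char uc = (unsigned char)c;
    return uc != '_' && !isalnum(uc);
}
```

In the default "C" locale, is_word_delim(0xE9) is true, which matches
current behaviour.  Under an ISO-8859-1 locale the same byte would be
expected to count as a letter, but the name of such a locale, and whether
it is installed at all, differs between systems; that is the portability
problem described above.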

Hope that helps,
Stan
