At 02:58 PM 4/2/99 +0200, Ralph SOBEK wrote:
> I think that I have found a deficiency in the Word Boundary
> match operators, \< and \>, of procmail v3.11pre7.

They are documented as not being quite the same as egrep's; for example,
procmail's must match an actual character (or a newline) rather than an
empty position.

> What to do with,
> for example ISO-Latin-1 characters at word boundaries? They are not
> considered as being in the same class as normal alphabetic letters.
> For example,
>
> :0 B
> * \<lance\>
>
> will match against the French word "élance".

How does procmail know that's a French word and not a Martian delimiter
0xE9 before "lance"?

> Can the match algorithm be corrected to allow extended character sets?

It's not a bug, and I doubt it would be desirable to break existing
.procmailrc files throughout the solar system.

> Of course, "\<" could be replaced by a big "[...]" enumerating the
> special non-alphabetical characters, but this may be more costly in
> terms of execution-speed.

Actually, they already ARE just a synonym for a `[^a-zA-Z0-9_]' that
also matches newlines (probably something like "($|[^a-zA-Z0-9_])"
but forgive me if I've mistyped it) so rolling your own is probably
equally efficient. I haven't looked, but I suspect *any* [...] is just
a table lookup. Remember, you can put your big ugly into a variable,
and then use the variable in your conditions for readability.
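
To illustrate what I mean by a table lookup, here is a minimal C sketch.
It is not taken from the procmail source (as I said, I haven't looked);
the names char_class, build_nonword_class and in_class are made up for
illustration, and it assumes an ASCII-compatible character set.

```c
#include <string.h>

/* Hypothetical type: one membership flag per possible byte value. */
typedef struct {
    unsigned char member[256];
} char_class;

/* Fill the table for [^a-zA-Z0-9_], assuming an ASCII-compatible
   character set: start with every byte in the class, then take the
   word characters back out.  Newline handling is not modelled here. */
void build_nonword_class(char_class *cc)
{
    int c;

    memset(cc->member, 1, sizeof cc->member);
    for (c = 'a'; c <= 'z'; c++)
        cc->member[c] = 0;
    for (c = 'A'; c <= 'Z'; c++)
        cc->member[c] = 0;
    for (c = '0'; c <= '9'; c++)
        cc->member[c] = 0;
    cc->member['_'] = 0;
}

/* Matching one character against the class is a single array index. */
int in_class(const char_class *cc, unsigned char c)
{
    return cc->member[c];
}
```

If a matcher works this way, a longer bracket list costs more only when
the table is built, not on each character compared.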
<speculation>
Now, if procmail were taking enhancements (I don't think it is), the
way to do it might be to allow setting a variable LOCALE, and then
have \< and \> match any character c for which
    c != '_' && !isalnum(c)
although procmail support for locales could get messy (it should, for
example, impact the D flag among other things), and the list of supported
locales might vary from installation to installation, so exchanging
or moving recipes around could become tricky.
</speculation>
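
To make the speculation concrete, here is a minimal C sketch of that
test. Again, this is not procmail code: the helper names
is_word_delimiter and use_locale are hypothetical, and which locale
names setlocale() accepts depends on the installation.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: nonzero if c would count as a word delimiter
   under the character classification of the current LC_CTYPE locale. */
int is_word_delimiter(unsigned char c)
{
    return c != '_' && !isalnum(c);
}

/* Hypothetical helper: try to switch character classification to the
   locale named by a LOCALE-style setting.  Returns 0 if this system
   does not provide that locale, in which case nothing changes. */
int use_locale(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}
```

In the default "C" locale, is_word_delimiter(0xE9) is nonzero, which is
the behavior Ralph ran into. After a successful use_locale() call
naming an ISO-Latin-1 locale, the same call would be expected to return
0, but only on machines that have such a locale installed.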
Hope that helps,
Stan