Re: Matching Word Boundaries with \< and \>, and Extended Character Sets

Ralph SOBEK <sobek(_at_)irit(_dot_)fr> writes:

      I think that I have found a deficiency in the Word Boundary
match operators, \< and \>, of procmail v3.11pre7.  What should be done
with, for example, ISO-Latin-1 characters at word boundaries?  They are not
considered as being in the same class as normal alphabetic letters.

For example,

:0 B
* \<lance\>

will match against the French word "élance".

Can the match algorithm be corrected to allow extended character sets?
Of course, "\<" could be replaced by a big "[...]" enumerating the
special non-alphabetical characters, but this may be more costly in
terms of execution-speed.


Actually, \< and \> are internally treated as shortcuts for the
character class [^a-zA-Z0-9_], so rolling your own negated character
class would match no slower.
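
To make the effect of that ASCII-only class concrete, here is a minimal
sketch in Python (not procmail's own engine, whose details may differ).
The names ASCII_NONWORD, LATIN1_NONWORD and bounded() are hypothetical,
and the extended class is only one assumed way of counting ISO-Latin-1
letters as word characters:

```python
import re

# ASCII-only notion of a "non-word" character, as described above.
ASCII_NONWORD = r"[^a-zA-Z0-9_]"

# Hypothetical extension: also count ISO-Latin-1 letters (U+00C0-U+00FF,
# minus the multiplication and division signs) as word characters.
LATIN1_NONWORD = r"[^a-zA-Z0-9_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]"


def bounded(word, nonword):
    """Compile a pattern for `word` delimited by `nonword` or the string edges."""
    return re.compile(
        r"(?:^|%s)%s(?:$|%s)" % (nonword, re.escape(word), nonword)
    )


ascii_lance = bounded("lance", ASCII_NONWORD)
latin1_lance = bounded("lance", LATIN1_NONWORD)

text = "il s'\u00e9lance"  # "il s'élance"

print(bool(ascii_lance.search(text)))           # True: the accented e counts as a boundary
print(bool(latin1_lance.search(text)))          # False: the accented e is now a word character
print(bool(latin1_lance.search("une lance")))   # True: a space is still a boundary
```

The extended class is the "big [...]" from the original question; the
only change is which characters are excluded from the negated set.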

As for internationalizing the procmail regexp engine, this is not as
simple as it may look.  Different parts of the message are effectively
in different locales -- the header is in the 'C' locale, while the body
may be in several locales if it's a multipart message.  Should procmail
try to automatically switch locales or should it be completely manual?
Wouldn't the latter just lead to a lot of 'mostly working (but not
really)' pseudo-solutions?

If proper locale processing is critical for a given class of message, I
would recommend putting enough intelligence in the procmailrc to
recognize those messages and feed them to a Perl script, where you have
a full-blown programming language with proper hooks for locale
handling.
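
As a rough illustration of what such an external filter might do, here
is a sketch that uses Python in place of Perl, purely so that the
example is self-contained.  It decodes each text part with the charset
that part declares and then applies a Unicode-aware word-boundary
match.  The function name body_matches() and the sample messages are
hypothetical:

```python
import re
from email import message_from_bytes, policy

# For str patterns in Python 3, \b uses Unicode word characters,
# so an accented letter next to "lance" is not a boundary.
WORD = re.compile(r"\blance\b")


def body_matches(raw_message, pattern=WORD):
    """Return True if any text part, decoded with its own charset, matches."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        try:
            text = part.get_content()  # decoded per the part's declared charset
        except LookupError:
            continue  # unknown charset: skip this part rather than guess
        if pattern.search(text):
            return True
    return False


# Hypothetical sample messages, both declared as ISO-8859-1.
HEADERS = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
)
SAMPLE_ACCENTED = HEADERS + b"il s'\xe9lance\r\n"
SAMPLE_PLAIN = HEADERS + b"une lance\r\n"
```

A script built around a function like this could report its result
through its exit status, which a recipe can test with an exit-code
condition ("* ? command").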


Philip Guenther