perl-unicode

Re: possible regexp feature for 5.6: "ignore diacritics"

1999-10-18 02:18:14

Jarkko Hietaniemi wrote:

The notation is [=c=], where c is a character (the context in which
equivalence classes have previously been mentioned is the POSIX regexp
extensions in general, such as the recently implemented [:class:]
extension).

[snip]

In addition to the POSIX 1003.2 notation, [=c=], I think we could allow
for a new regexp flag to turn on "ignore diacritics", just like we
have "ignore case".  Maybe /d?  (/e would have been nice but s///e
preëmpted us.)

I am not emotionally *that* deeply attached to the feature, mostly
because I'm really low on tuits, and will be for some time.  But I
know it's a useful concept, and have a fair idea of how it could be
done, and I wanted the idea to be thrown to the table.

Perhaps I can save you some tuits by pointing out that I don't even
see Perl having a need for the suggested //d flag, at least in simple
cases.  Is it not already the case that if you want to match only a
non-diacritic e in a text that may contain \xEB, you just match
with /e/ or /\145/, not with /\xEB/ and certainly not with the
rather verbose /[=e=]/d (or somesuch)?
Hand-coded diacritic variant "classes" can still be individually
enclosed in /[]/ for the case where selective diacritic matching needs
to be done (e.g. /[e\xEB]/).
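
A minimal sketch of the simple cases described above, assuming the text
is ISO 8859-1 so that \xEB is e with a diaeresis.  The sample strings and
the hit() helper are made up for illustration only.

```perl
use strict;
use warnings;

# Sample strings; \xEB is e-with-diaeresis in ISO 8859-1 (an assumption).
my $plain    = "noel";
my $accented = "no\xEBl";

# Illustrative helper: returns 1 if the pattern matches, 0 otherwise.
sub hit {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

# /e/ and its octal spelling /\145/ match only the unadorned letter.
printf "%d %d\n", hit($plain, qr/e/),    hit($accented, qr/e/);     # prints "1 0"
printf "%d %d\n", hit($plain, qr/\145/), hit($accented, qr/\145/);  # prints "1 0"

# /\xEB/ matches only the accented letter.
printf "%d %d\n", hit($plain, qr/\xEB/), hit($accented, qr/\xEB/);  # prints "0 1"

# A hand-coded class matches either one.
printf "%d %d\n", hit($plain, qr/[e\xEB]/), hit($accented, qr/[e\xEB]/);  # prints "1 1"
```

The last pattern is the selective case: the class lists exactly the
variants that should be treated as equivalent, and nothing else.
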
In more complex patterns perhaps (?d:[=c=]) would be useful, hence a /d
flag might need to be implemented eventually.  But I think the utility
of [=c=] is pretty high even without a //d-like "ignore diacritics" flag.
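
To give a feel for what [=c=] would buy, here is a rough approximation
built by hand.  It is only a sketch under stated assumptions: the data is
ISO 8859-1, only lowercase vowels are covered, and both the %EQUIV table
and the equiv_class() function are hypothetical names, not part of Perl
or of the proposal.

```perl
use strict;
use warnings;

# Hypothetical lookup table (not part of Perl): base letter => range of
# ISO 8859-1 code points holding its lowercase accented variants.
my %EQUIV = (
    a => "\xE0-\xE5",   # grave, acute, circumflex, tilde, diaeresis, ring
    e => "\xE8-\xEB",   # grave, acute, circumflex, diaeresis
    i => "\xEC-\xEF",   # grave, acute, circumflex, diaeresis
    o => "\xF2-\xF6",   # grave, acute, circumflex, tilde, diaeresis
    u => "\xF9-\xFC",   # grave, acute, circumflex, diaeresis
);

# Hypothetical stand-in for [=c=]: returns a compiled character class
# matching $c and, if the table knows the letter, its accented variants.
sub equiv_class {
    my ($c) = @_;
    my $variants = exists $EQUIV{$c} ? $EQUIV{$c} : '';
    return qr/[\Q$c\E${variants}]/;
}

my $e = equiv_class('e');
print "accented matched\n" if "no\xEBl" =~ /^no${e}l$/;
print "plain matched\n"    if "noel"    =~ /^no${e}l$/;
```

Because the class is interpolated only where it is wanted, the rest of
the pattern stays diacritic-sensitive, which is the selective behaviour
a pattern-wide flag could not give on its own.
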

Peter Prymmer