perl-unicode

Re: possible regexp feature for 5.6: "ignore diacritics"

1999-10-18 02:23:06
On Sun, Oct 17, 1999 at 12:29:28AM +0300, Jarkko Hietaniemi wrote:

In POSIX (1003.2) regexps there is a feature called "equivalence
classes".  What this means is that certain characters belong to such
classes, and _any_member_of_a_class_stands_for_any_member_of_the_class_.

This concept is handy when matching diacritic-laden variants in
non-ASCII encodings.  For example, finding "bär" when matching with
"bar" would often be most convenient.  The concept is not limited to
Western alphabets; it also works on Cyrillic/Greek/Hebrew/Arabic/...
alphabets.
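
As an illustration of the "bär"/"bar" case above, here is a minimal
sketch in Python (not Perl, and not the proposed regexp feature
itself). It assumes that "ignoring diacritics" can be approximated by
Unicode canonical decomposition followed by dropping the combining
marks. The function names strip_diacritics and
matches_ignoring_diacritics are hypothetical, chosen only for this
example.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def matches_ignoring_diacritics(needle, haystack):
    """Substring match after removing diacritics from both sides."""
    return strip_diacritics(needle) in strip_diacritics(haystack)

print(matches_ignoring_diacritics("bar", "ein bär"))
```

Note that this approach treats every decomposable letter the same way,
whatever the language of the text.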


This sounds like a tempting idea, but it should -only- be done if there
is a clear definition of which characters are diacritic variants of
other characters and which are separate letters.

E.g. in Danish, the character 'å' is often described as "a-with-ring",
e.g. &aring; in HTML-speak. In just about every useful case that I can
think of, we would -not- want 'å' to match 'a', as we consider them two
distinct letters, not just variants of an 'a'. In other words, 'å' is
really not an a-with-ring but a separate letter in the alphabet, whose
glyph happens to look like an a-with-ring and which sounds quite
different.

Secondly, in German I personally would never expect 'a' to match 'ä',
but Germans may want just that; I don't really know.

My conclusion is that this feature would be useful if done right, but
I don't know what 'right' is here :-)

It might be done by making the character classes locale-dependent, but
I'm still not convinced that this is feasible.
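
To make the locale-dependent idea concrete, here is a rough sketch in
Python (not Perl). Everything in it is an assumption for illustration:
the table DISTINCT_LETTERS, its language key "da", and the function
fold are hypothetical, and the Danish entry only encodes the point made
above that 'å' is a letter in its own right. It says nothing about what
the right table would be for German or any other language.

```python
import unicodedata

# Hypothetical table: letters that a language treats as distinct
# letters of its alphabet and that must therefore not be folded.
DISTINCT_LETTERS = {
    "da": set("åæøÅÆØ"),
}

def fold(text, lang):
    """Strip diacritics except on letters protected for this language."""
    keep = DISTINCT_LETTERS.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)  # a separate letter, leave it alone
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

print(fold("små", "da"))
print(fold("små", "en"))
```

With the Danish table, 'å' survives folding and so cannot match 'a';
for a language without an entry, every diacritic is stripped. The hard
part, deciding what goes into the table for each language, is exactly
the open question.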

best regards
Erik Bertelsen, UNI-C