On Sun, Oct 17, 1999 at 12:29:28AM +0300, Jarkko Hietaniemi wrote:
In POSIX (1003.2) regexps there is a feature called "equivalence
classes". What this means is that certain characters belong to such
classes, and _any_member_of_a_class_stands_for_any_member_of_the_class_.
This concept is handy when matching diacritic-laden variants in
non-ASCII encodings. For example, finding "bär" when matching with
"bar" would often be most convenient. The concept is not limited to
Western alphabets; it also works on Cyrillic/Greek/Hebrew/Arabic/...
alphabets.
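
To make the idea concrete, here is a minimal sketch in Python. It does not
use POSIX equivalence classes (Python's re module does not support them);
instead it approximates them by stripping combining marks after Unicode
canonical decomposition. The function names are made up for illustration.

```python
import unicodedata

def fold(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    This is only an approximation of an equivalence class: every
    character that decomposes to base letter + marks is mapped to
    its base letter.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def equivalent_contains(needle: str, haystack: str) -> bool:
    """True if needle occurs in haystack once both are folded."""
    return fold(needle) in fold(haystack)

if __name__ == "__main__":
    print(equivalent_contains("bar", "ein bär"))  # "bär" is found via "bar"
    print(fold("å"))  # a blanket rule also folds the Danish letter
```

Note that such a blanket rule treats every decomposable character the same
way, regardless of language.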
This sounds like a tempting idea, but it should -only- be done if there
is a clear definition of which characters are diacritic variants of
other characters and which are separate letters.
For example, in Danish the character 'å' is often described as "a-with-ring"
(cf. &aring; in HTML-speak). In just about every useful case that I can think
of, we would -not- want 'å' to match 'a', as we consider them two
distinct letters, not just variants of an 'a'. In other words, 'å' is
really not a-with-ring, but a separate letter in the alphabet whose glyph
happens to look like an a-with-ring, but which sounds quite different.
Secondly, in German I personally would never expect 'a' to match 'ä',
but Germans may just want that, I don't really know.
My conclusion is that this feature would be useful if done right, but
I don't know what 'right' is here :-)
It might be done by making the character classes locale dependent, but
I'm still not convinced that this is feasible.
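
A rough sketch of what "locale dependent" could look like, in Python. The
table DISTINCT_LETTERS, its contents, and the function names are purely
hypothetical; in particular the empty German entry is an assumption for
illustration, not a claim about what Germans want.

```python
import unicodedata

# Hypothetical per-locale tables of letters that must never be folded
# onto a base letter. The contents are illustrative assumptions only.
DISTINCT_LETTERS = {
    "da": set("åæøÅÆØ"),  # Danish: separate letters of the alphabet
    "de": set(),          # assumption: fold umlauts; may well be wrong
}

def fold_for_locale(text: str, locale: str) -> str:
    """Strip diacritics, except on letters the locale treats as distinct."""
    keep = DISTINCT_LETTERS.get(locale, set())
    out = []
    # Compose first so that e.g. 'a' + combining ring is seen as one 'å'.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

def matches(pattern: str, text: str, locale: str) -> bool:
    """True if pattern occurs in text under the locale's folding rules."""
    return fold_for_locale(pattern, locale) in fold_for_locale(text, locale)

if __name__ == "__main__":
    print(matches("ga", "gå", "da"))    # 'å' is kept distinct in Danish
    print(matches("bar", "bär", "de"))  # folded under the assumed table
```

Even this small sketch shows the problem: somebody has to decide what goes
into each table, and that is exactly the definition I do not have.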
best regards
Erik Bertelsen, UNI-C