Erik Bertelsen writes:
On Sun, Oct 17, 1999 at 12:29:28AM +0300, Jarkko Hietaniemi wrote:
In POSIX (1003.2) regexps there is a feature called "equivalence
classes". What this means is that certain characters belong to such
classes, and _any_member_of_a_class_stands_for_any_member_of_the_class_.
This concept is handy when matching for diacritic-laden variants of
non-ASCII encodings. For example, finding "bär" when matching with
"bar" would often be most convenient. The concept is not limited to
Western alphabets; it also works on Cyrillic/Greek/Hebrew/Arabic/...
alphabets.
This sounds like a tempting idea, but it should -only- be done if there
is a clear definition of which characters are diacritic variants of
other characters and which are separate.
Ha! I was just expecting some Danes (or maybe Norwegians?) to jump up
here and wave their hands frantically, and look how right I was :-)
E.g. in Danish, the character 'å' is often described as "a-with-ring",
e.g. &aring; in HTML-speak. In just about every useful case that I can think
of, we would -not- want 'å' to match 'a' as we consider them two
distinct letters, not just variants of an 'a'. In other words, 'å' is
really not a-with-ring, but a separate letter in the alphabet, whose glyph
happens to look like an a-with-ring, but it sounds quite different.
Yes, this is a classical example.
Secondly, in German I personally would never expect 'a' to match 'ä',
but Germans may just want that, I don't really know.
That's exactly why I was noting that a way to do customizations would be needed.
Let's take a look back at why the feature would be useful: for
searching, especially when one cannot be certain which diacritics have
been used (or whether they are the correct ones), or one is unable/too
lazy/too busy to input all the correct diacritics for the search
expression, or one cannot expect all the searchable material to be in
language X (the last one is a killer).
I am not reaching for a definition that would be 100% correct for all
the possible languages simultaneously; I'm looking for a good enough
definition that would work for most of the cases, and I think the Unicode
database gives us just that.
Moreover, I'm not qualified to make language-(locale)-dependent
definitions for all the languages of the world.
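As a rough illustration of the Unicode-database approach, here is a
minimal Python sketch. It is not an existing Perl or POSIX feature. It
assumes that "equivalence" can be approximated by canonically
decomposing each character and dropping the combining marks. The names
fold and equivalent and the keep parameter are hypothetical, invented
for this example; keep stands in for the customization hook, listing
characters that a given language treats as separate letters.

```python
import unicodedata


def fold(text, keep=""):
    """Drop combining marks, except on characters listed in `keep`.

    `keep` is a hypothetical per-language exception list: characters
    that must survive folding because the language treats them as
    separate letters rather than as diacritic variants.
    """
    out = []
    # Compose first so that each accented letter is a single character.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        # Canonical decomposition splits a base letter from its marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)


def equivalent(a, b, keep=""):
    """True if `a` and `b` are equal once both have been folded."""
    return fold(a, keep) == fold(b, keep)


# "b\u00e4r" is "bär"; with no exceptions it matches "bar".
print(equivalent("b\u00e4r", "bar"))
# "bl\u00e5" is Danish "blå"; keeping U+00E5 stops it matching "bla".
print(equivalent("bl\u00e5", "bla", keep="\u00e5"))
```

With no exceptions "bär" folds to "bar", while passing å in keep makes
"blå" and "bla" compare as different words. Which characters belong in
keep for a given language is exactly the part that the Unicode database
does not decide for us.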
My conclusion is that this feature would be useful if done right, but
I don't know what 'right' is here :-)
It might be done by making the character classes locale dependent, but
I'm still not convinced that this is feasible.
I would say it is for all practical purposes impossible: there still
is no commonly agreed upon set of locale definitions, enforced by some
international standards body. Maybe the open source projects will
come up with something in time.
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen