perl-unicode

Re: possible regexp feature for 5.6: "ignore diacritics"

1999-10-18 02:23:02

Erik Bertelsen writes:
On Sun, Oct 17, 1999 at 12:29:28AM +0300, Jarkko Hietaniemi wrote:

In POSIX (1003.2) regexps there is a feature called "equivalence
classes".  What this means is that certain characters belong to such
classes, and _any_member_of_a_class_stands_for_any_member_of_the_class_.

This concept is handy when matching for diacritic-laden variants of
non-ASCII encodings.  For example, finding "bär" when matching with
"bar" would often be most convenient.  The concept is not limited to
Western alphabets, it works also on Cyrillic/Greek/Hebrew/Arabic/...
alphabets.
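
As a rough illustration of the idea (a sketch in Python rather than Perl;
the class table below is hand-made and purely hypothetical, not taken from
POSIX or from any locale), a search word can be expanded so that every
letter matches any member of its class:

```python
import re

# Hypothetical, hand-made equivalence classes. A real implementation
# would take these from a locale definition or from the Unicode database.
CLASSES = ["aàáâãäå", "oòóôõö", "uùúûü"]

# Map every member to its whole class, so that any member of a class
# stands for any other member of the same class.
MEMBER_TO_CLASS = {ch: cls for cls in CLASSES for ch in cls}

def expand(word):
    """Rewrite each letter as a bracket expression covering its class."""
    parts = []
    for ch in word:
        cls = MEMBER_TO_CLASS.get(ch)
        # Letters without a class are matched literally.
        parts.append("[" + cls + "]" if cls else re.escape(ch))
    return "".join(parts)

print(expand("bar"))
print(bool(re.search(expand("bar"), "ein bär")))
print(bool(re.search(expand("bär"), "a bar")))
```

Because the table maps every member back to its whole class, the match
works in both directions: "bar" finds "bär" and "bär" finds "bar".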


This sounds like a tempting idea, but it should -only- be done, if there
is a clear definition of which characters are diacritic variants of
other characters and which are separate.

Ha!  I was just expecting some Danes (or maybe Norwegians?) to jump up
here and frantically wave their hands, and look how right I was :-)

E.g. in Danish, the character 'å' is often described as "a-with-ring",
e.g. &aring; in HTML-speak. In just about every useful case that I can think
of, we would -not- want 'å' to match 'a', as we consider them two
distinct letters, not just variants of an 'a'. In other words, 'å' is
really not a-with-ring, but a separate letter in the alphabet, whose glyph
happens to look like an a-with-ring, but it sounds quite different.

Yes, this is a classical example.

Secondly, in German I personally would never expect 'a' to match 'ä',
but Germans may just want that, I don't really know.

That's exactly why I was noting that a way to do customizations would be needed.

Let's look back at why the feature would be useful: for searching,
especially when one cannot be certain which diacritics have been used
(or whether the text has the correct ones), when one is unable, too
lazy, or too busy to type all the correct diacritics into the search
expression, or when one cannot expect all the searchable material to
be in language X (the last one is a killer).

I am not reaching for a definition that would be 100% correct for all
the possible languages simultaneously; I'm looking for a good enough
definition that would work for most of the cases, and I think the Unicode
database gives us just that.
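
To make the "good enough" definition concrete, here is a minimal sketch
(Python, using the standard unicodedata module) of one way the Unicode
database could be used: decompose each character canonically and drop the
combining marks.  The function name and the `keep` parameter are
assumptions made for this illustration only; `keep` is one conceivable
form of the customization mentioned above, not a proposed interface.

```python
import unicodedata

def fold_diacritics(text, keep=""):
    """Drop combining marks after canonical decomposition (NFD).

    Characters listed in `keep` (hypothetical parameter) are passed
    through untouched, so a user can declare letters that must stay
    distinct, such as the Danish 'å'.
    """
    # Compose first so that `keep` can be given as precomposed letters.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # combining() returns 0 for base characters, non-zero for marks.
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

print(fold_diacritics("bär"))
print(fold_diacritics("blåbær"))
print(fold_diacritics("blåbær", keep="åÅ"))
```

Comparing the folded forms of the pattern and of the text then gives the
"ignore diacritics" behaviour, while `keep` lets 'å' stay a letter of its
own.  Letters with no canonical decomposition, such as 'ø' or 'æ', pass
through unchanged under this approach.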

Moreover, I'm not qualified to make language-(locale)-dependent
definitions for all the languages of the world.

My conclusion is that this feature would be useful if done right, but
I don't know what 'right' is here :-)

It might be done by making the character classes locale dependent, but
I'm still not convinced that this is feasible.

I would say it is for all practical purposes impossible: there still
is no commonly agreed upon set of locale definitions, enforced by some
international standards body.  Maybe the open source projects will
come up with something in time.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen