perl-unicode

Re: possible regexp feature for 5.6: "ignore diacritics"

1999-10-18 02:23:01

Ilya Zakharevich writes:
Jarkko Hietaniemi writes:
This concept is handy when matching diacritic-laden variants of
letters in non-ASCII encodings.  For example, finding "bär" when
matching with "bar" would often be most convenient.  The concept is not
limited to Western alphabets; it also works on
Cyrillic/Greek/Hebrew/Arabic/... alphabets.
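
As a rough illustration of the idea (in Python rather than Perl, and not
a proposal for the actual regexp syntax): one way to get this behaviour
is to decompose the text canonically and drop the nonspacing marks
before comparing.  The helper names strip_marks and loose_find are made
up for this sketch.

```python
import unicodedata

def strip_marks(text):
    """Decompose canonically (NFD) and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def loose_find(pattern, text):
    """True if `pattern` occurs in `text` once diacritics in `text` are ignored.

    Only `text` is stripped, so a plain pattern finds accented text,
    but an accented pattern does not find plain text.
    """
    return pattern in strip_marks(text)
```

With this, loose_find("bar", "ein bär") is true while
loose_find("bär", "ein bar") is not, which is the one-directional
matching described above.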

What do you mean by this?  Is [=a=] going to stand for \N{cyrillic:a}
and \N{arabic:alef}?

*I* do not mean anything; I will just use the classification done by the
Unicode folk.

For your particular question, no.  "LATIN SMALL LETTER A" is not
"CYRILLIC SMALL LETTER A" is not "ARABIC LETTER ALEF".

Or do you mean [=\N{arabic:alef}=] to stand for
\N{ARABIC LETTER ALEF WITH HAMZA ABOVE}?

This one.

Note that the latter concept is not that good for Cyrillic.  Well,
there are *some* languages where there are tiny changes in chars, but
at least for Russian it is very hard to justify.
Say, \N{cyrillic:i} is a vowel, but \N{cyrillic:short i} is a

So?  "CYRILLIC SMALL LETTER I" and "CYRILLIC SMALL LETTER SHORT I"
are already different, my suggestion does not make them equal.

The first one has diacritic variants:

0418;CYRILLIC CAPITAL LETTER I;Lu;0;L;;;;;N;CYRILLIC CAPITAL LETTER II;;;0438;
0438;CYRILLIC SMALL LETTER I;Ll;0;L;;;;;N;CYRILLIC SMALL LETTER II;;0418;;0418
0439;CYRILLIC SMALL LETTER SHORT I;Ll;0;L;0438 0306;;;;N;CYRILLIC SMALL LETTER SHORT II;;0419;;0419
045D;CYRILLIC SMALL LETTER I WITH GRAVE;Ll;0;L;0438 0300;;;;N;;;040D;;040D
04E3;CYRILLIC SMALL LETTER I WITH MACRON;Ll;0;L;0438 0304;;;;N;;;04E2;;04E2
04E5;CYRILLIC SMALL LETTER I WITH DIAERESIS;Ll;0;L;0438 0308;;;;N;;;04E4;;04E4

The second one has no variants:

0439;CYRILLIC SMALL LETTER SHORT I;Ll;0;L;0438 0306;;;;N;CYRILLIC SMALL LETTER SHORT II;;0419;;0419
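
The two listings above can be read mechanically: the sixth
semicolon-separated field of a UnicodeData.txt record is the
decomposition, and its first code point is the base letter.  Here is a
small sketch (Python, using only the small-letter records quoted in this
message; variants_by_base is a hypothetical helper name) that groups the
records by base letter:

```python
# Small-letter UnicodeData.txt records quoted in this message.
RECORDS = """\
0438;CYRILLIC SMALL LETTER I;Ll;0;L;;;;;N;CYRILLIC SMALL LETTER II;;0418;;0418
0439;CYRILLIC SMALL LETTER SHORT I;Ll;0;L;0438 0306;;;;N;CYRILLIC SMALL LETTER SHORT II;;0419;;0419
045D;CYRILLIC SMALL LETTER I WITH GRAVE;Ll;0;L;0438 0300;;;;N;;;040D;;040D
04E3;CYRILLIC SMALL LETTER I WITH MACRON;Ll;0;L;0438 0304;;;;N;;;04E2;;04E2
04E5;CYRILLIC SMALL LETTER I WITH DIAERESIS;Ll;0;L;0438 0308;;;;N;;;04E4;;04E4
"""

def variants_by_base(records):
    """Map each base code point to the code points that decompose to it."""
    table = {}
    for line in records.splitlines():
        fields = line.split(";")
        code, decomposition = fields[0], fields[5]
        parts = decomposition.split()
        # Skip empty decompositions and compatibility ones tagged like "<compat>".
        if parts and not parts[0].startswith("<"):
            table.setdefault(parts[0], []).append(code)
    return table

VARIANTS = variants_by_base(RECORDS)
```

On these records VARIANTS maps 0438 to 0439, 045D, 04E3 and 04E5, and
has no entry for 0439, which matches the two listings.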

semiconsonant (though in writing one looks like the other with a
"checkish" mark).  There is no direct relationship between them.

Unicode agrees with you so I don't see the problem.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen