Re: Unicode::Normalize surprise with dotless i


On Thu, 05 Sep 2002 13:06:49 +0200
andreas(_dot_)koenig(_at_)anima(_dot_)de (Andreas J. Koenig) wrote:

Hi, Tomoyuki,

is it a bug in Unicode::Normalize or in my code: I expected that for
combining a circumflex with a small letter i, I'd have to use the
dotless i, but to my surprise, NFC refuses to combine with the dotless
i. Here's a demo program:

% perl -le '
use Unicode::Normalize;
use Encode;
use charnames ":full";
for my $e (qw(ascii)){
  print Encode::encode($e,
    NFKC("combining with i: i\N{COMBINING CIRCUMFLEX ACCENT}
combining with dotless i: \N{LATIN SMALL LETTER DOTLESS I}\N{COMBINING CIRCUMFLEX ACCENT}"),
    Encode::FB_PERLQQ); 
}
'
combining with i: \x{00ee}
combining with dotless i: \x{0131}\x{0302}


What do you think?


Hello.
I have a short answer and a long answer.

(1) 

<LATIN SMALL LETTER I WITH CIRCUMFLEX> is not
<LATIN SMALL LETTER DOTLESS I WITH CIRCUMFLEX>.

(2)
OK, suppose that the NFC of <dotless-i, circumflex> were <i-circumflex>.
If the NFC of one string is equal to the NFC of another string,
the two strings are called canonically equivalent.
Similarly, if the NFKC forms of two strings are equal to each other,
they are called compatibility equivalent.
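
The two definitions above can be sketched in Perl as follows. This is
only a minimal illustration: canon_equiv and compat_equiv are
hypothetical helper names, not functions provided by Unicode::Normalize,
and the ligature U+FB01 is used merely as a familiar example of a
character that is compatibility equivalent, but not canonically
equivalent, to another string.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helpers (not part of Unicode::Normalize):
# two strings are canonically equivalent if their NFC forms match,
# and compatibility equivalent if their NFKC forms match.
sub canon_equiv  { NFC($_[0])  eq NFC($_[1])  }
sub compat_equiv { NFKC($_[0]) eq NFKC($_[1]) }

# <i, circumflex> and the precomposed <i-circumflex> (U+00EE).
print canon_equiv("i\x{0302}", "\x{00EE}") ? "equivalent\n" : "different\n";

# LATIN SMALL LIGATURE FI (U+FB01) versus the two letters "fi".
print canon_equiv("\x{FB01}", "fi")  ? "equivalent\n" : "different\n";
print compat_equiv("\x{FB01}", "fi") ? "equivalent\n" : "different\n";
```

This should print "equivalent", "different", "equivalent", in that order.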

Then <dotless-i> would have to be either canonically or compatibility
equivalent to <i>, since <i-circumflex> would be the NFC (or NFKC) of
<dotless-i, circumflex> as well as that of <i, circumflex>.
In that case, users of Turkish and some other languages would no
longer be able to use the two letters with distinct meanings.
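
To see what actually happens, one can look at the decompositions
directly. The following is a rough sketch, assuming the same
Unicode::Normalize module as in the demo program; hexes is a
hypothetical helper that only prints code points, so that no encoding
layer is needed on output.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKD);

my $i_circ     = "\x{00EE}";   # LATIN SMALL LETTER I WITH CIRCUMFLEX
my $dotless    = "\x{0131}";   # LATIN SMALL LETTER DOTLESS I
my $circumflex = "\x{0302}";   # COMBINING CIRCUMFLEX ACCENT

# Hypothetical helper: render a string as a list of code points.
sub hexes { join " ", map { sprintf("U+%04X", ord $_) } split //, $_[0] }

# <i-circumflex> decomposes to <i, circumflex>, with the dotted i.
print hexes(NFD($i_circ)), "\n";                  # U+0069 U+0302

# <dotless-i> has no decomposition, not even a compatibility one.
print hexes(NFKD($dotless)), "\n";                # U+0131

# So <dotless-i, circumflex> has nothing to compose into.
print hexes(NFC($dotless . $circumflex)), "\n";   # U+0131 U+0302
```

The last line matches the \x{0131}\x{0302} seen in the output of the
demo program.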

Japanese people also use <i-circumflex> in ROMAJI, the Latin
transliteration of Japanese, for a long "i". (A long "i" is usually
written as "ii" or <i-macron>, though.)
If <i-circumflex> were <dotless-i> with <circumflex>
rather than <i> with <circumflex>, then <i-circumflex> would be
a long <dotless-i> sound, not a long "i".
That would also be surprising.  :)

Regards,
SADAHIRO Tomoyuki