perl-unicode

Re: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 08:38:12
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
On 2002.02.01, at 19:24, Nick Ing-Simmons wrote:
As part of the mystery of CJK encodings I notice that IBM's ICU's uconv
and SuSE 6.4 Linux iconv differ as to the UTF-8 representation of
table.euc

Both converters will round-trip with themselves and give a byte-exact
copy of table.euc.

Weirdly they differ in how they map '\' and '~' in ASCII space as
well as some spots in higher characters.

  Oh, yes.  This is the problem with the original Unicode 2.x map: it is
not ASCII-preserving.  I posted this problem to
perl-unicode(_at_)perl(_dot_)org when I first released Jcode.  Several
discussions later, I made Jcode preserve ASCII by default and added
$Jcode::Unicode::PEDANTIC to change the behavior.

Ah. I take your point. If we used ICU's pedantic form,
both UNIX ~/foo and MS C:\Foo would get mangled.
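
To make the mangling concrete, here is a minimal Python sketch. It assumes the pedantic form follows JIS X 0201 Roman, where byte 0x5C is YEN SIGN (U+00A5) and byte 0x7E is OVERLINE (U+203E). The table and function names are hypothetical and hand-written for illustration; they are not taken from ICU, iconv, or Jcode.

```python
# Hypothetical two-entry table: the JIS X 0201 Roman reading of the two
# bytes that differ from ASCII. Not ICU's or iconv's actual data.
STRICT_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}  # YEN SIGN, OVERLINE
STRICT_REVERSE = {ch: b for b, ch in STRICT_OVERRIDES.items()}


def decode_low_bytes(data: bytes, pedantic: bool = False) -> str:
    """Decode bytes below 0x80, ASCII-preserving or pedantic."""
    out = []
    for b in data:
        if b >= 0x80:
            raise ValueError("this sketch only covers the ASCII range")
        if pedantic and b in STRICT_OVERRIDES:
            out.append(STRICT_OVERRIDES[b])
        else:
            out.append(chr(b))
    return "".join(out)


def encode_low_chars(text: str, pedantic: bool = False) -> bytes:
    """Inverse of decode_low_bytes for the same mode."""
    table = STRICT_REVERSE if pedantic else {}
    return bytes(table[ch] if ch in table else ord(ch) for ch in text)


for path in (b"~/foo", b"C:\\Foo"):
    plain = decode_low_bytes(path)
    strict = decode_low_bytes(path, pedantic=True)
    # Each mode round-trips with itself, yet the decoded text differs.
    assert encode_low_chars(plain) == path
    assert encode_low_chars(strict, pedantic=True) == path
    print(repr(plain), "vs", repr(strict))
```

Under these assumptions each mode is self-consistent at the byte level, but in pedantic mode the decoded string no longer contains a real tilde or backslash, so anything that treats it as a path sees different characters.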

The other differences (having looked at the diff in yudit) seem to be
mapping \xA2 (cent), \xA3 (pound), \xAC (not) and one of the longer dashes to
different width variants (full width for ICU).
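
A rough way to list such differences without eyeballing a diff is sketched below in Python. The two strings are hand-typed stand-ins for the two converters' output (the narrow and full-width forms of cent, pound and not), not real converter output, and the helper names are hypothetical. The long dash is left out because the message does not say which code points are involved.

```python
import unicodedata


def mapping_differences(a: str, b: str):
    """Return (index, char_a, char_b) for each position where a and b differ."""
    if len(a) != len(b):
        raise ValueError("expected one output character per input character")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def is_width_variant(narrow: str, wide: str) -> bool:
    """True if 'wide' folds to 'narrow' under NFKC compatibility normalization."""
    return narrow != wide and unicodedata.normalize("NFKC", wide) == narrow


# Stand-in data: cent, pound, not in narrow and full-width forms.
narrow_out = "\u00a2\u00a3\u00ac"
wide_out = "\uffe0\uffe1\uffe2"

for i, x, y in mapping_differences(narrow_out, wide_out):
    print(i, describe(x), "vs", describe(y),
          "(width variant)" if is_width_variant(x, y) else "")
```

The NFKC check is only a convenience for spotting width variants in a report; it says nothing about which mapping a converter ought to use.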

I am going off ICU ...


  So far as I can see, Linux iconv is ASCII-preserving while ICU's is
Unicode-strict.
  From Perl's point of view, ASCII-preserving should be the default.
  FYI, I have reported this brain-dead mapping problem to the Unicode
Consortium but never got an answer.  Well, they are not a public society,
in that they charge for membership before you can say anything.   One of
the reasons so many Japanese love to hate Unicode...

Our current euc-jp.ucm is compatible with Linux iconv.

  Right choice.

Dan the Man with So Many Charsets to Deal With
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/