perl-unicode

Re: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 03:59:37
On 2002.02.01, at 19:24, Nick Ing-Simmons wrote:
As part of the mystery of CJK encodings I notice that IBM's ICU's uconv
and SuSE 6.4 Linux iconv differ as to the UTF-8 representation of table.euc.

Both converters will round-trip with themselves and give byte exact
copy of table.euc

Weirdly they differ in how they map '\' and '~' in ASCII space as
well as some spots in higher characters.

Oh, yes. This is a problem with the original Unicode 2.x map: it is not ASCII-preserving. I posted this problem to perl-unicode(_at_)perl(_dot_)org when I first released Jcode. Several discussions later, I made Jcode preserve ASCII by default and added $Jcode::Unicode::PEDANTIC to change that behavior.
  Here is the excerpt from Jcode::Unicode:

VARIABLES
       $Jcode::Unicode::PEDANTIC
           When set to non-zero, x-to-unicode conversion becomes
           pedantic.  That is, '\' (chr(0x5c)) is converted to
           zenkaku backslash and '~' (chr(0x7e)) to JIS-x0212
           tilde.

           By default, Jcode::Unicode leaves ascii ([0x00-0x7f])
           as it is.


Linux iconv will not take ICU's UTF-8.
ICU's uconv will read the iconv output but does produce the same as the original
table.euc.

As far as I can see, Linux iconv is ASCII-preserving while ICU's is Unicode-strict.
  From Perl's point of view, ASCII-preserving should be the default.
FYI, I have reported this brain-dead mapping problem to the Unicode Consortium but never got an answer. Well, they are not exactly a public body, in that they charge for membership before you can have any say. One of the reasons so many Japanese love to hate Unicode...

Our current euc-jp.ucm is compatible with Linux iconv.

  Right choice.
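
For anyone who wants to check which camp a given table falls into, here is a minimal sketch using Encode. It is only an illustration under assumptions: the helper names code_points and is_ascii_preserving are made up for this example, and it assumes an Encode installation that provides an 'euc-jp' encoding. It decodes '\' (0x5c) and '~' (0x7e), prints the code points they land on, and then walks the whole 0x00-0x7f range to see whether every byte keeps its own value.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the code points that $bytes decodes to
# under the named encoding.
sub code_points {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    return map { ord } split //, $chars;
}

# Hypothetical helper: true when every byte in 0x00-0x7f decodes to
# exactly one character with the same numeric value as the byte.
sub is_ascii_preserving {
    my ($encoding) = @_;
    for my $byte (0x00 .. 0x7f) {
        my @cp = code_points($encoding, chr $byte);
        return 0 unless @cp == 1 && $cp[0] == $byte;
    }
    return 1;
}

# Show where backslash and tilde end up, then test the full ASCII range.
printf "'\\' -> U+%04X, '~' -> U+%04X\n", code_points('euc-jp', "\x5c\x7e");
print is_ascii_preserving('euc-jp')
    ? "euc-jp: ASCII preserved\n"
    : "euc-jp: ASCII remapped\n";
```

With an ASCII-preserving table, '\' and '~' should come out as U+005C and U+007E. A pedantic table would show some other code point for at least one of them, and the second check would report a remap.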

Dan the Man with So Many Charsets to Deal With