Re: ICU's uconv vs Linux iconv and UTF-8

On 2002.02.01, at 19:24, Nick Ing-Simmons wrote:

As part of the mystery of CJK encodings I notice that IBM's ICU's uconv
and SuSE 6.4 Linux iconv differ as to the UTF-8 representation of
table.euc


Both converters will round-trip with themselves and give a byte-exact
copy of table.euc

Weirdly they differ in how they map '\' and '~' in ASCII space as
well as some spots in higher characters.
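
A rough way to reproduce both observations above (each converter
round-trips with itself, yet the two UTF-8 outputs disagree at specific
characters) is sketched below in Python. This is only an illustration:
the function names round_trips and mapping_differences are made up for
this sketch, and Python's own "euc_jp" codec stands in for a converter;
it is neither uconv nor iconv.

```python
def round_trips(raw: bytes, codec: str = "euc_jp") -> bool:
    """Return True if decode followed by encode reproduces the bytes exactly."""
    return raw.decode(codec).encode(codec) == raw


def mapping_differences(utf8_a: bytes, utf8_b: bytes):
    """List (position, code point in A, code point in B) where two UTF-8 texts disagree.

    Both inputs are assumed to come from converting the same source file,
    so they should contain the same number of characters.
    """
    text_a = utf8_a.decode("utf-8")
    text_b = utf8_b.decode("utf-8")
    if len(text_a) != len(text_b):
        raise ValueError("outputs have different character counts")
    return [
        (i, f"U+{ord(a):04X}", f"U+{ord(b):04X}")
        for i, (a, b) in enumerate(zip(text_a, text_b))
        if a != b
    ]


if __name__ == "__main__":
    # Three kanji plus '\' and '~', written with escapes to keep the source ASCII.
    sample = "\u65e5\u672c\u8a9e\\~".encode("euc_jp")
    print(round_trips(sample))

    # Hypothetical outputs of two converters that disagree on the backslash.
    out_a = "a\\b".encode("utf-8")
    out_b = "a\uff3cb".encode("utf-8")
    print(mapping_differences(out_a, out_b))
```

Fed the real outputs of the two tools, mapping_differences would list
exactly the code points on which they disagree.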

Oh, yes. This is the problem of the original Unicode 2.x map; it is
not ASCII-preservative. I posted this problem to
perl-unicode(_at_)perl(_dot_)org when I first released Jcode. Several
discussions later, I made Jcode preserve ASCII by default and added
$Jcode::Unicode::PEDANTIC to change the behavior.

  Here is the excerpt from Jcode::Unicode:

VARIABLES
       $Jcode::Unicode::PEDANTIC
           When set to non-zero, x-to-unicode conversion becomes
           pedantic.  That is, '\' (chr(0x5c)) is converted to
           zenkaku backslash and '~' (chr(0x7e)) to JIS-x0212
           tilde.

           By Default, Jcode::Unicode leaves ascii ([0x00-0x7f])
           as it is.
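
To make the two policies concrete, here is a minimal sketch in Python
(not Jcode itself, and not its implementation). The function name
ascii_to_unicode and the PEDANTIC_MAP table are hypothetical. The
excerpt names the target characters but not their code points, so the
values below are assumptions: U+FF3C for the zenkaku backslash and
U+FF5E as a stand-in for the JIS X 0212 tilde.

```python
# Assumed pedantic targets; the excerpt names the characters, not the code points.
PEDANTIC_MAP = {
    0x5C: "\uff3c",  # FULLWIDTH REVERSE SOLIDUS ("zenkaku backslash")
    0x7E: "\uff5e",  # FULLWIDTH TILDE (assumed stand-in for the JIS X 0212 tilde)
}


def ascii_to_unicode(raw: bytes, pedantic: bool = False) -> str:
    """Map ASCII-range bytes to Unicode under either policy.

    pedantic=False leaves 0x00-0x7f untouched (the ASCII-preservative default);
    pedantic=True remaps '\\' and '~' through PEDANTIC_MAP.
    """
    out = []
    for byte in raw:
        if byte > 0x7F:
            raise ValueError("this sketch only handles the ASCII range")
        if pedantic and byte in PEDANTIC_MAP:
            out.append(PEDANTIC_MAP[byte])
        else:
            out.append(chr(byte))
    return "".join(out)


if __name__ == "__main__":
    print(ascii_to_unicode(b"C:\\tmp~"))                 # unchanged ASCII
    print(ascii_to_unicode(b"C:\\tmp~", pedantic=True))  # '\' and '~' remapped
```

With the default, a path like C:\tmp~ survives intact, which is what
most programs expect; the pedantic mode changes it.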

Linux iconv will not take ICU's UTF-8.
ICU's uconv will read the iconv output but does produce same as original
table.euc.

So far as I see, Linux iconv is ASCII-preservative while ICU's is
Unicode-strict.

  From Perl's point of view, ASCII-preservative should be the default.

FYI, I have reported this brain-dead mapping problem to the Unicode
Consortium but never got an answer. Well, they are not really a public
body, in the sense that they charge for membership before you can say
anything. One of the reasons so many Japanese love to hate Unicode...

Our current euc-jp.ucm is compatible with Linux iconv.


  Right choice.

Dan the Man with So Many Charsets to Deal With