On 2002.02.01, at 19:24, Nick Ing-Simmons wrote:
> As part of the mystery of CJK encodings I notice that IBM's ICU's uconv
> and SuSE 6.4 Linux iconv differ as to the UTF-8 representation of
> table.euc.
> Both converters will round-trip with themselves and give a byte-exact
> copy of table.euc.
> Weirdly they differ in how they map '\' and '~' in ASCII space as
> well as some spots in higher characters.
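The self round-trip and the byte-level comparison described above can be checked with a few lines of code. This is only a minimal sketch in Python, assuming Python's built-in `euc_jp` codec as a stand-in converter; it is neither iconv nor uconv, and `round_trips` and `first_difference` are hypothetical helper names.

```python
def round_trips(data: bytes, codec: str = "euc_jp") -> bool:
    """Return True if decoding then re-encoding reproduces the bytes exactly."""
    return data.decode(codec).encode(codec) == data


def first_difference(a: bytes, b: bytes) -> int:
    """Return the offset of the first differing byte, or -1 if a == b."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # No mismatch in the common prefix: either equal, or one is longer.
    return -1 if len(a) == len(b) else min(len(a), len(b))


# '\' and '~' followed by three kanji (U+65E5 U+672C U+8A9E), as EUC-JP bytes.
sample = b"\\~" + "\u65e5\u672c\u8a9e".encode("euc_jp")

# UTF-8 form produced by this particular (stand-in) converter.
utf8 = sample.decode("euc_jp").encode("utf-8")
```

Feeding the UTF-8 output of two different converters to `first_difference` would locate the first spot where they disagree.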
Oh, yes. This is the problem with the original Unicode 2.x map: it is
not ASCII-preserving. I posted this problem to
perl-unicode(_at_)perl(_dot_)org when I first released Jcode. Several
discussions later, I made Jcode preserve ASCII by default and added
$Jcode::Unicode::PEDANTIC to change the behavior.
Here is the excerpt from Jcode::Unicode:
VARIABLES
    $Jcode::Unicode::PEDANTIC
        When set to non-zero, x-to-unicode conversion becomes
        pedantic. That is, '\' (chr(0x5c)) is converted to
        zenkaku backslash and '~' (chr(0x7e)) to JIS-x0212
        tilde.
        By default, Jcode::Unicode leaves ascii ([0x00-0x7f])
        as it is.
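The two behaviors in the excerpt can be sketched roughly as follows. This is not Jcode's implementation: it is a Python illustration that assumes Python's built-in `euc_jp` codec (which passes ASCII through unchanged) as the default path. `euc_to_unicode` and `PEDANTIC_REMAP` are hypothetical names, and the target code points are illustrative assumptions rather than values taken from Jcode's tables.

```python
# Hypothetical remapping table. The code points are illustrative assumptions,
# not values taken from Jcode's own tables.
PEDANTIC_REMAP = {
    0x5C: 0xFF3C,  # '\' -> FULLWIDTH REVERSE SOLIDUS ("zenkaku backslash")
    0x7E: 0xFF5E,  # '~' -> FULLWIDTH TILDE (assumed stand-in for the JIS X 0212 tilde)
}


def euc_to_unicode(data: bytes, pedantic: bool = False) -> str:
    """Decode EUC-JP bytes, optionally remapping '\\' and '~' after decoding."""
    text = data.decode("euc_jp")  # leaves 0x00-0x7f as it is
    if pedantic:
        # str.translate accepts a dict of code point -> code point.
        text = text.translate(PEDANTIC_REMAP)
    return text


default_result = euc_to_unicode(b"a\\b~")
pedantic_result = euc_to_unicode(b"a\\b~", pedantic=True)
```

With the default, a path or regex containing '\' survives the conversion unchanged, which is the point of keeping ASCII as it is.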
> Linux iconv will not take ICU's UTF-8.
> ICU's uconv will read the iconv output but does produce same as original
> table.euc.
> As far as I can see, Linux iconv is ASCII-preserving while ICU's is
> Unicode-strict.
From Perl's point of view, ASCII-preserving should be the default.
FYI, I have reported this brain-dead mapping problem to the Unicode
Consortium but never got an answer. Well, they are not a public society,
in that they charge for membership before you can say anything. One of
the reasons so many Japanese love to hate Unicode...
> Our current euc-jp.ucm is compatible with Linux iconv.
Right choice.
Dan the Man with So Many Charsets to Deal With