perl-unicode

Re: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 08:35:31
On 2002.02.01, at 23:57, Mark Leisher wrote:
    Dan> FYI I have reported this brain-dead mapping problem to Unicode
    Dan> Consortium but never got an answer.  Well, they are not public
    Dan> society in a way they charge for the membership to say anything. One
    Dan> of the reasons so many Japanese love to hate Unicode...

This kind of false information is why many Japanese continue to love to hate Unicode. If you were actually on the Unicode mailing list, you wouldn't be
repeating garbage like this.

Sign up and send a message about the mapping tables. You will get an answer.

I signed up to unicode(_at_)unicode(_dot_)org long ago, or at least I thought I did, since I am still getting invitations to conferences and such. But when I checked with lister(_at_)unicode(_dot_)org, it subscribed my address again instead of returning an error message saying I was already subscribed. Hmm.... Anyway, I have resubscribed, so here I go. Let me begin with the original message. Sorry for the repetition, folks on perl-unicode(_at_)perl(_dot_)org.

On 2002.02.01, at 19:24, Nick Ing-Simmons wrote:
As part of the mystery of CJK encodings I notice that IBM's ICU's uconv
and SuSE6.4 linux iconv differ as to the UTF-8 representation of table.euc

Both converters will round-trip with themselves and give byte exact
copy of table.euc

Weirdly they differ in how they map '\' and '~' in ASCII space as
well as some spots in higher characters.

Oh, yes. This is the problem with the original Unicode 2.x map: it is not ASCII-preserving. I posted this problem to perl-unicode(_at_)perl(_dot_)org when I first released Jcode. Several discussions later, I made Jcode preserve ASCII by default and added $Jcode::Unicode::PEDANTIC to change the behavior.
Here is the excerpt from Jcode::Unicode:

VARIABLES
       $Jcode::Unicode::PEDANTIC
           When set to non-zero, x-to-unicode conversion becomes
           pedantic.  That is, '\' (chr(0x5c)) is converted to
           zenkaku backslash and '~' (chr(0x7e)) to JIS-x0212
           tilde.

           By Default, Jcode::Unicode leaves ascii ([0x00-0x7f])
           as it is.
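
To make the two behaviors concrete, here is a minimal sketch in Python (not Perl, and not the Jcode code). It covers only the range 0x00-0x7f. The function names and the PEDANTIC_OVERRIDES table are hypothetical, and the code points in that table are assumptions chosen for illustration, not values taken from Jcode::Unicode, ICU, or the Unicode mapping files.

```python
# Toy model of the two behaviours discussed above, restricted to 0x00-0x7f.
# ASSUMPTION: the code points in PEDANTIC_OVERRIDES are illustrative only;
# they are not copied from Jcode::Unicode, ICU, or any Unicode mapping file.
PEDANTIC_OVERRIDES = {
    0x5C: "\uFF3C",  # FULLWIDTH REVERSE SOLIDUS, i.e. a "zenkaku backslash"
    0x7E: "\uFF5E",  # FULLWIDTH TILDE, assumed stand-in for the pedantic tilde
}
REVERSE_OVERRIDES = {char: byte for byte, char in PEDANTIC_OVERRIDES.items()}


def decode_ascii_range(data: bytes, pedantic: bool = False) -> str:
    """Decode bytes in 0x00-0x7f, optionally remapping backslash and tilde."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError(f"byte 0x{byte:02x} is outside this sketch's range")
        if pedantic and byte in PEDANTIC_OVERRIDES:
            chars.append(PEDANTIC_OVERRIDES[byte])
        else:
            chars.append(chr(byte))  # ASCII is left as it is
    return "".join(chars)


def encode_ascii_range(text: str, pedantic: bool = False) -> bytes:
    """Inverse of decode_ascii_range for the same mode."""
    out = bytearray()
    for char in text:
        if pedantic and char in REVERSE_OVERRIDES:
            out.append(REVERSE_OVERRIDES[char])
        elif ord(char) <= 0x7F:
            out.append(ord(char))
        else:
            raise ValueError(f"U+{ord(char):04X} has no mapping in this mode")
    return bytes(out)


sample = b"a\\b~c"
preserved = decode_ascii_range(sample)               # ASCII stays ASCII
strict = decode_ascii_range(sample, pedantic=True)   # '\' and '~' are remapped
```

In this toy model each mode round-trips with itself, but the default (preserving) encoder raises ValueError on the pedantic output. That is roughly the shape of the mismatch between the two converters, though the real tables cover far more than these two characters.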

Linux iconv will not take ICU's UTF-8.
ICU's uconv will read the iconv output but does produce same as original
table.euc.

As far as I can see, Linux iconv is ASCII-preserving while ICU's is Unicode-strict.
From Perl's point of view, ASCII-preserving should be the default.
FYI I have reported this brain-dead mapping problem to Unicode Consortium but never got an answer. Well, they are not public society in a way they charge for the membership to say anything. One of the reasons so many Japanese love to hate Unicode...

Our current euc-jp.ucm is compatible with Linux iconv.

  Right choice.

Dan the Man with So Many Charsets to Deal With

Now let me repeat the question I asked long ago. Why does the Unicode - JISX2xxx map still fail to preserve the ASCII part, despite the fact that most converters ignore the original map and leave the ASCII part as is? One more question: where have the contents of ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/ gone?

_____  Dan Kogai
  __/ ____   CEO, DAN co. ltd.
 /__ /-+-/  2-8-14-418 Shiomi Koto-ku Tokyo 135-0052 Japan
   /--/--- mailto: dankogai(_at_)dan(_dot_)co(_dot_)jp / http://www.dan.co.jp/ 
---------
__/  /    Tel:+81 3-5665-6131   Fax:+81 3-5665-6132
         GPG Key: http://www.dan.co.jp/~dankogai/dankogai.gpg.asc