perl-unicode

Re: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 16:22:34
It is definitely a problem to try to interpret what any given label is
supposed to mean. The problem is that MIME labels and others are
ambiguous, and are interpreted in different ways on different systems.

MIME/IANA is the best registry we have, but there are a number of
significant problems:

- because for most entries the registry publishes no mapping to and
from Unicode/10646, it is not clear, and certainly not easy, to figure
out exactly what the "unambiguous decoding" is.

- in practice, the industry does NOT interpret the same bytes the same
way; for example, you will get different decodings from "SJIS" on
different platforms.
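
The second point can be seen with a small sketch. It uses Python's
bundled codecs rather than ICU or iconv, and the choice of "shift_jis"
and "cp932" as two tables that both get labelled "SJIS" in practice is
an assumption made for illustration only.

```python
# The same two bytes, decoded with two tables that are both commonly
# labelled "SJIS" / "Shift_JIS" in the wild.
raw = b"\x81\x60"

as_shift_jis = raw.decode("shift_jis")  # JIS X 0208-based table
as_cp932 = raw.decode("cp932")          # Microsoft's Windows variant

for name, text in (("shift_jis", as_shift_jis), ("cp932", as_cp932)):
    print(f"{name:10} U+{ord(text):04X}")

print("same decoding?", as_shift_jis == as_cp932)
```

With CPython's codecs this should report U+301C (WAVE DASH) for
shift_jis and U+FF5E (FULLWIDTH TILDE) for cp932, so one label covers
two different decodings of identical input.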

One of the current projects under development for an upcoming release
of ICU is to have a more precise API, where you can pass in a label
AND a platform (AND version), and get what the platform interprets
that label to mean. That way you can ask for "EUC-JP" as interpreted
on, say, Solaris.
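
To make the idea concrete, here is a rough sketch of a lookup keyed on
label AND platform. It is not the ICU API, which is not described here.
The table, the platform names "windows" and "generic", and the helper
names resolve() and decode_as() are all hypothetical, and the version
argument is left out.

```python
import codecs

# Hypothetical table: (label, platform) -> concrete Python codec name.
# The platform keys and the codec choices are illustrative assumptions,
# not ICU data.
PLATFORM_TABLE = {
    ("sjis", "windows"): "cp932",
    ("sjis", "generic"): "shift_jis",
    ("euc-jp", "generic"): "euc_jp",
}


def resolve(label, platform="generic"):
    """Return the codec that (label, platform) maps to in this sketch."""
    key = (label.lower(), platform.lower())
    if key not in PLATFORM_TABLE:
        raise LookupError(
            f"no mapping for {label!r} as interpreted on {platform!r}"
        )
    name = PLATFORM_TABLE[key]
    codecs.lookup(name)  # raises LookupError if this build lacks the codec
    return name


def decode_as(data, label, platform="generic"):
    """Decode bytes the way the given platform interprets the label."""
    return data.decode(resolve(label, platform))
```

A caller would then write decode_as(data, "SJIS", "windows") instead of
trusting the bare label, and an unknown pair fails loudly rather than
silently picking some default table.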

Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Nick Ing-Simmons" <nick(_at_)ing-simmons(_dot_)net>
To: <mark(_at_)macchiato(_dot_)com>
Cc: <dankogai(_at_)dan(_dot_)co(_dot_)jp>; <nick(_at_)ing-simmons(_dot_)net>;
<perl-unicode(_at_)perl(_dot_)org>; "SADAHIRO Tomoyuki" 
<bqw10602(_at_)nifty(_dot_)com>
Sent: Friday, February 01, 2002 10:21
Subject: Re: ICU's uconv vs Linux iconv and UTF-8


Mark Davis <mark(_at_)macchiato(_dot_)com> writes:
>> ICU's pedantic form
>
> The goal for ICU is to be charset neutral, and support all of the
> conversions that are in modern use. There are a large number of
> variants of character sets;


Fair enough - but as shipped (I downloaded it earlier this week)
it comes with a convrtrs.txt which maps MIME's EUC-JP onto
something it calls ibm-33722, which has the behaviour I reported at
the start of this thread.

> you can use the one you want.

It is not a question of which _I_ want - it is a question of which
one(s) CJK perl users want/expect/need.

In so far as _I_ want any particular one it is the one which is going
to match the X11 font encoding so I can in my naive westerner's way
see what it looks like - and I have not a clue which one that is ...

See:

http://oss.software.ibm.com/icu/charset/index.html

A huge list and I don't see how to "grep" it for the provenance of
the table (not that many seem to have any).

So can the experts - ideally native reading experts not theorists -
tell me which ICU (or other open source) table(s) they want/expect/need,
or failing that which ones have proven troublesome.

There seem to be at least 4 EUC-JP mappings in that list:
AIX, Solaris, glibc and Java.

If we cannot get any answers "quickly" then I think Dan is correct -
we should un-bundle the whole CJK encoding stuff from the "core" into
a family of CPAN modules.

Which gives me a design choice:

A. Bundle a "pragmatic" set of CJK encodings which is fast and causes
   the least build pain for non-CJK users (i.e. compact precompiled form)

B. Make it as easy as possible for an end user to drop in a new encoding
   from (say) a .ucm file.

I can obviously try for both - but they seem to be pulling in opposite
directions at present.
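
For option B, a minimal sketch of the "drop in a .ucm file" side could
look like the following. It is written in Python purely for
illustration and is not the Encode implementation. It assumes a
simplified subset of the .ucm format (a CHARMAP section with
<UXXXX> \xHH... |n lines), and parse_ucm() and the sample table are
invented for the example, not taken from any real ICU table.

```python
import re

# One CHARMAP entry: <UXXXX> \xHH[\xHH...] |flag
ENTRY = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|(\d))?"
)


def parse_ucm(text):
    """Build a bytes -> character table from the CHARMAP section."""
    decode_map = {}
    in_charmap = False
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap:
            continue  # header lines such as <code_set_name> are ignored
        match = ENTRY.match(line)
        if not match:
            continue
        code_point, byte_spec, flag = match.groups()
        if flag not in (None, "0"):
            continue  # this sketch keeps only round-trip (|0) entries
        raw = bytes(
            int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", byte_spec)
        )
        decode_map[raw] = chr(int(code_point, 16))
    return decode_map


# Invented sample data, for illustration only.
SAMPLE = r"""
<code_set_name> "x-demo"
<mb_cur_max> 2
CHARMAP
<U0041> \x41 |0
<U00A5> \x5C |0
<U301C> \xA1\xC1 |0
<UFF5E> \xA1\xC1 |1
END CHARMAP
"""

table = parse_ucm(SAMPLE)
```

A table parsed at run time like this is easy to replace but slow to
load, which is the tension with the compact precompiled form of
option A.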

Meanwhile I will go fix the bugs in the core's :encoding logic ...

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/