perl-unicode

Re: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 11:22:44
Mark Davis <mark(_at_)macchiato(_dot_)com> writes:
ICU's pedantic form

The goal for ICU is to be charset neutral, and support all of the
conversions that are in modern use. There are a large number of
variants of character sets; 


Fair enough - but as shipped (I downloaded it earlier this week)
it comes with a convrtrs.txt which maps MIME's EUC-JP onto 
something it calls ibm-33722, which has the behaviour I reported at 
the start of this thread. 

you can use the one you want. 

It is not a question of which _I_ want - it is a question of which one(s)
CJK perl users want/expect/need.

In so far as _I_ want any particular one it is the one which is going 
to match the X11 font encoding so I can in my naive westerner's way 
see what it looks like - and I have not a clue which one that is ...

See:

http://oss.software.ibm.com/icu/charset/index.html

A huge list, and I don't see how to "grep" it for the provenance of 
the tables (not that many seem to have any).

So can the experts - ideally native reading experts not theorists - tell 
me which ICU (or other open source) table(s) they want/expect/need,
or failing that which ones have proven troublesome.

There seem to be at least 4 EUC-JP mappings in that list: 
AIX, Solaris, glibc and Java.
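
One way to find out which code points are "troublesome" is to diff two
tables mechanically. The following is only a rough sketch in Python: the
function name diff_tables and both toy tables are made up for illustration
and are not taken from any of the four mappings above. The wave dash versus
fullwidth tilde entry is used because it is a commonly cited point of
disagreement between vendor tables, not because I have checked it against
these particular ones.

```python
def diff_tables(a, b):
    """Yield (bytes, char_in_a, char_in_b) wherever two decode tables disagree.

    Each table is a dict mapping an encoded byte sequence to one character.
    A byte sequence missing from a table is reported as None.
    """
    for data in sorted(a.keys() | b.keys()):
        ca, cb = a.get(data), b.get(data)
        if ca != cb:
            yield data, ca, cb


def label(ch):
    """Format a character as U+XXXX, or '(unmapped)' for None."""
    return "(unmapped)" if ch is None else "U+%04X" % ord(ch)


# Hypothetical toy tables, NOT real vendor data.
table_a = {b"\xa1\xa1": "\u3000", b"\xa1\xc1": "\u301c"}
table_b = {b"\xa1\xa1": "\u3000", b"\xa1\xc1": "\uff5e"}

for data, ca, cb in diff_tables(table_a, table_b):
    print(data.hex(), label(ca), label(cb))
```

Run over real tables, the output would be a candidate list to put in front
of native readers rather than an answer in itself.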

If we cannot get any answers "quickly" then I think Dan is correct - 
we should un-bundle the whole CJK encoding stuff from the "core" into 
a family of CPAN modules.

Which gives me a design choice:

A. Bundle a "pragmatic" set of CJK encodings which are fast and cause the 
   least build pain for non-CJK users (i.e. compact precompiled form)

B. Make it as easy as possible for an end user to drop in a new encoding
   from (say) a .ucm file.

I can obviously try for both - but they seem to be pulling in opposite 
directions at present. 
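
To give a feel for what option B involves, here is a minimal sketch (in
Python, purely for illustration) of reading the CHARMAP section of a .ucm
file into lookup tables at run time. The name parse_ucm and the sample data
are hypothetical. It assumes the usual single-code-point line shape
"<UXXXX> \xNN... |flag" and my understanding of the flags: 0 is a round
trip, 1 is a Unicode-to-bytes fallback, 3 is a bytes-to-Unicode fallback.
Everything else (header settings, state tables, multi-code-point mappings)
is ignored.

```python
import re

# One mapping line inside CHARMAP, e.g.   <U3000> \xA1\xA1 |0
_MAPPING = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|(\d))?"
)


def parse_ucm(text):
    """Return (decode, encode) dicts built from the CHARMAP section."""
    decode, encode = {}, {}
    in_charmap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap:
            continue  # header lines such as <code_set_name> are skipped
        m = _MAPPING.match(line)
        if m is None:
            continue  # line shapes this sketch does not understand
        char = chr(int(m.group(1), 16))
        data = bytes.fromhex(m.group(2).replace("\\x", ""))
        flag = int(m.group(3) or 0)
        if flag == 0:  # round trip: wins in both directions
            encode[char] = data
            decode[data] = char
        elif flag == 1:  # fallback, Unicode -> bytes only
            encode.setdefault(char, data)
        elif flag == 3:  # reverse fallback, bytes -> Unicode only
            decode.setdefault(data, char)
    return decode, encode


# Hypothetical sample, not a real table.
SAMPLE = r"""
<code_set_name> "toy"
CHARMAP
<U0041> \x41 |0
<U3000> \xA1\xA1 |0
<U00A5> \x5C |1
END CHARMAP
"""

decode, encode = parse_ucm(SAMPLE)
print(len(decode), len(encode))
```

Parsing a text table like this on every start-up is exactly the cost that
option A's compact precompiled form avoids, which is why the two options
pull in opposite directions.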

Meanwhile I will go fix the bugs in the core's :encoding logic ...

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/