Mark Davis <mark(_at_)macchiato(_dot_)com> writes:
ICU's pedantic form
The goal for ICU is to be charset neutral, and support all of the
conversions that are in modern use. There are a large number of
variants of character sets;
Fair enough - but as shipped (I downloaded it earlier this week)
it comes with a convrtrs.txt which maps MIME's EUC-JP onto
something it calls ibm-33722, which has the behaviour I reported at
the start of this thread.
you can use the one you want.
It is not a question of which _I_ want - it is a question of which one(s)
CJK perl users want/expect/need.
In so far as _I_ want any particular one it is the one which is going
to match the X11 font encoding so I can in my naive westerner's way
see what it looks like - and I have not a clue which one that is ...
See:
http://oss.software.ibm.com/icu/charset/index.html
A huge list, and I don't see how to "grep" it for the provenance of
the tables (not that many seem to have any).
So can the experts - ideally native reading experts not theorists - tell
me which ICU (or other open source) table(s) they want/expect/need,
or failing that which ones have proven troublesome.
There seem to be at least 4 EUC-JP mappings in that list:
AIX, Solaris, glibc and Java.
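
To see where two such tables actually disagree, something like the
following might do. It is only a sketch: diff_mappings and the two
sample tables are hypothetical names of mine, and the sample entries
are made up to illustrate the kind of difference involved (wave dash
versus fullwidth tilde), not copied from any of the ICU tables.

```python
def diff_mappings(table_a, table_b):
    """Return {byte_sequence: (cp_a, cp_b)} for each disagreement.

    Each table maps an encoded byte sequence to a Unicode code point.
    A sequence missing from one table is reported with None.
    """
    diffs = {}
    for seq in sorted(set(table_a) | set(table_b)):
        a = table_a.get(seq)
        b = table_b.get(seq)
        if a != b:
            diffs[seq] = (a, b)
    return diffs


# Hypothetical sample data, for illustration only.
table_a = {b"\xa1\xa1": 0x3000, b"\xa1\xc1": 0x301C}
table_b = {b"\xa1\xa1": 0x3000, b"\xa1\xc1": 0xFF5E}

for seq, (a, b) in diff_mappings(table_a, table_b).items():
    fmt = lambda cp: "none" if cp is None else "U+%04X" % cp
    print(seq.hex(), fmt(a), fmt(b))
```

A short list of the disagreeing byte sequences is probably easier for
native readers to comment on than four complete tables.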
If we cannot get any answers "quickly" then I think Dan is correct -
we should un-bundle the whole CJK encoding stuff from the "core" into
a family of CPAN modules.
Which gives me a design choice:
A. Bundle a "pragmatic" set of CJK encodings which is fast and causes
the least build pain for non-CJK users (i.e. compact precompiled form).
B. Make it as easy as possible for an end-user to drop in a new encoding
from (say) a .ucm file.
I can obviously try for both - but they seem to be pulling in opposite
directions at present.
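
For a feel of what option B involves, here is a rough sketch of reading
the CHARMAP section of a .ucm file into lookup tables at run time. It
is not how ICU or the perl core does it. The name parse_ucm and the
sample text are hypothetical, and I am assuming the usual line shape
"<Uxxxx> \xNN... |f", with flag 0 meaning round trip, 1 a fallback
from Unicode only, and 3 a fallback to Unicode only.

```python
import re

# One CHARMAP entry: <Uxxxx>, one or more \xNN bytes, optional |flag.
_ENTRY = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*(?:\|(\d))?"
)


def parse_ucm(text):
    """Return (to_unicode, from_unicode) dicts built from .ucm text."""
    to_unicode = {}    # bytes -> code point
    from_unicode = {}  # code point -> bytes
    in_charmap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap:
            continue  # header lines such as <code_set_name>
        m = _ENTRY.match(line)
        if not m:
            continue
        cp = int(m.group(1), 16)
        hex_bytes = re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2))
        seq = bytes(int(h, 16) for h in hex_bytes)
        flag = int(m.group(3) or 0)
        if flag in (0, 1):  # usable when encoding from Unicode
            from_unicode[cp] = seq
        if flag in (0, 3):  # usable when decoding to Unicode
            to_unicode[seq] = cp
    return to_unicode, from_unicode


# Hypothetical sample, not taken from any shipped table.
SAMPLE = r"""
<code_set_name> "hypothetical-demo"
CHARMAP
<U0041> \x41 |0
<U3000> \xA1\xA1 |0
<U301C> \xA1\xC1 |0
<UFF5E> \xA1\xC1 |1
END CHARMAP
"""

to_unicode, from_unicode = parse_ucm(SAMPLE)
```

Plain dictionaries like these are easy to drop in but slow and bulky
next to a compact precompiled form, which is the tension between A
and B.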
Meanwhile I will go fix the bugs in the core's :encoding logic ...
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/