perl-unicode

Re: Encode and Unicode 3.2 and CJK

2002-03-29 13:12:41
On Saturday, March 30, 2002, at 04:44 , Jarkko Hietaniemi wrote:
> Gentlemen, you may want to read Unicode 3.2
> ( http://www.unicode.org/unicode/reports/tr28/ ) It does say something
> about Han, Katakana, and Hangul (sections 10.1, 10.3, and 10.4). (No,
> I don't know what happened to 10.2).  What I'm after is whether the
> said CJK changes affect Encode?

For Japanese, I very much doubt it, at least for the time being. JIS X 0213:2000, as you can see, is only two years old, and the encodings that support it are not popular yet. The support will take the form of ADDITION, not MODIFICATION, at least as far as JIS X 0213 is concerned.

But let me post a summary of the (proposed) encodings for JIS X 0213, for the record.

(See also http://www.asahi-net.or.jp/~wq6k-yn/code/enc-x0213.html if your browser supports Japanese)

JIS X 0213
==========

It is roughly a tidied-up (JIS X 0208 + JIS X 0212). It consists of two 94x94 planes; planes 1 and 2 correspond to 0208 and 0212, respectively. But some of the code points are rearranged, so 0213-1 != 0208 and 0213-2 != 0212.

EUC-JISX0213
============

The encoding scheme is the same as EUC-JP.  Here is the diagram:

        G0      US-ASCII
        G1      JISX0213-1
        (G2     JISX0201-kana (deprecated))
        G3      JISX0213-2
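
For readers who want to see the G0 to G3 layout in action, here is a minimal sketch (in Python rather than Perl, purely for illustration). It assumes the usual EUC conventions: G0 is a single 7-bit byte, G1 is two bytes in 0xA1..0xFE, G2 is introduced by SS2 (0x8E), and G3 is introduced by SS3 (0x8F). The function name split_euc is made up for this example, and the sketch only classifies bytes; it does not validate trail bytes or map anything to Unicode.

```python
def split_euc(data: bytes):
    """Split an EUC byte string into (code_set, raw_bytes) pairs.

    Hypothetical helper: it looks only at the lead byte to decide
    which code set (G0..G3) a character belongs to.
    """
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            name, width = "G0", 1      # US-ASCII, one byte
        elif lead == 0x8E:
            name, width = "G2", 2      # SS2 + one JIS X 0201 kana byte
        elif lead == 0x8F:
            name, width = "G3", 3      # SS3 + two bytes (plane 2)
        elif 0xA1 <= lead <= 0xFE:
            name, width = "G1", 2      # two bytes (plane 1)
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + width]
        if len(chunk) < width:
            raise ValueError(f"truncated character at offset {i}")
        result.append((name, chunk))
        i += width
    return result


sample = b"A" + b"\xc6\xfc" + b"\x8e\xb1" + b"\x8f\xa1\xa1"
segments = split_euc(sample)
```

Note that nothing in the byte stream says whether G1 and G3 mean JIS X 0208/0212 or JIS X 0213; the segmentation is identical either way.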

The technical difficulty is minimal. All I need is a table. I may make a UCM out of the Unihan DB and post it as something like Encode::JPExtra.

When in use, this encoding supersedes EUC-JP, because you can't tell the two apart by looking at a given string. You must explicitly set your encoding to either this or "classical" EUC-JP.
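
To illustrate why the two cannot be told apart from the bytes alone, here is a small sketch using Python's bundled codecs (euc_jp and euc_jisx0213), not Encode. The helper name decode_both is hypothetical. It decodes the same byte string under both labels, so the caller's choice of label is the only thing that differs.

```python
def decode_both(raw: bytes):
    """Decode the same bytes as classical EUC-JP and as EUC-JISX0213.

    errors="replace" keeps the sketch from raising when a byte pair
    is unassigned in one of the two tables.
    """
    return (
        raw.decode("euc_jp", errors="replace"),
        raw.decode("euc_jisx0213", errors="replace"),
    )


# Text that exists in JIS X 0208: both labels give the same answer.
shared = "日本語".encode("euc_jp")
as_jp, as_0213 = decode_both(shared)

# A character that JIS X 0213 adds (circled digit one): the bytes look
# like ordinary two-byte EUC, but only the 0213 table has an entry.
extended = "①".encode("euc_jisx0213")
ext_as_jp, ext_as_0213 = decode_both(extended)
```

In both cases the byte stream carries no marker saying which table was intended, which is why the encoding has to be named explicitly.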

ISO-2022-JP-3
=============

Basically, this one is ISO-2022-JP with new escape sequences.

Esc. Seq.               Charset
------------------------
ESC $(O                 JISX0213-1
ESC $(P                 JISX0213-2

This one is easy, too.

Unlike EUC-JISX0213, this one EXTENDS ISO-2022-JP, so the old 0208/0212 and the new 0213 can coexist, thanks to the escape sequences.
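
As a rough sketch of how the escape sequences let the character sets coexist, the following Python snippet splits a 7-bit byte stream into runs tagged with the currently designated set. The two JISX0213 sequences are the ones listed above; the ASCII, JISX0208 and JISX0212 designations are the usual ISO-2022-JP family ones and are included here as an assumption for illustration. The function name split_by_charset is made up, and no character mapping is attempted.

```python
# Escape sequence -> character set label.
DESIGNATIONS = {
    b"\x1b(B": "US-ASCII",
    b"\x1b$B": "JISX0208",
    b"\x1b$(D": "JISX0212",
    b"\x1b$(O": "JISX0213-1",
    b"\x1b$(P": "JISX0213-2",
}


def split_by_charset(data: bytes):
    """Return (charset, raw_bytes) runs; the stream starts in US-ASCII."""
    current = "US-ASCII"
    runs = []
    start = i = 0
    while i < len(data):
        if data[i] != 0x1B:
            i += 1
            continue
        for seq, name in DESIGNATIONS.items():
            if data.startswith(seq, i):
                if i > start:
                    runs.append((current, data[start:i]))
                current = name
                i += len(seq)
                start = i
                break
        else:
            raise ValueError(f"unknown escape sequence at offset {i}")
    if start < len(data):
        runs.append((current, data[start:]))
    return runs


stream = (b"abc" + b"\x1b$B" + b"\x46\x7c"      # a JIS X 0208 character
          + b"\x1b$(O" + b"\x2d\x21"            # a JIS X 0213 plane 1 character
          + b"\x1b(B" + b"!")
runs = split_by_charset(stream)
```

Because every run is labelled by its own escape sequence, 0208 and 0213 text can sit in the same stream without ambiguity.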

Shift_JISX0213
==============

And this is the most controversial one. It squeezes the new characters into the code space that Shift_JIS left unused. Shift_JIS was already acrobatic, and this one is a nightmare. However, it also takes 2 bytes at most, so supporting it is not that hard. But unlike the cases above, I need a UTF-8 => Shift_JISX0213 mapping rather than one for vanilla JISX0213, and I am not sure whether one is available. I'll look into it.
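
To give a feel for the "acrobatics", here is a sketch of the classical Shift_JIS arithmetic that folds a 94x94 row/cell pair into two bytes. It covers only the well-known formula for JIS X 0208 positions (which, as far as I know, plane 1 of Shift_JISX0213 reuses); the irregular packing of plane 2 into the remaining lead bytes is not shown. The function name kuten_to_sjis is hypothetical.

```python
def kuten_to_sjis(row: int, cell: int) -> bytes:
    """Fold a 1-based (row, cell) pair, each 1..94, into Shift_JIS bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    # Two rows share one lead byte; lead bytes skip 0xA0..0xDF,
    # which Shift_JIS reserves for single-byte kana.
    if row <= 62:
        lead = (row + 0x101) // 2      # 0x81..0x9F
    else:
        lead = (row + 0x181) // 2      # 0xE0..0xEF
    if row % 2 == 1:
        # Odd rows use the lower half of the trail range, skipping 0x7F.
        trail = cell + 0x3F if cell <= 63 else cell + 0x40
    else:
        # Even rows use the upper half, 0x9F..0xFC.
        trail = cell + 0x9E
    return bytes([lead, trail])


# U+65E5 sits at row 38, cell 92 in JIS X 0208.
sun = kuten_to_sjis(38, 92)
```

Lead bytes 0xF0..0xFC are left over by this formula for JIS X 0208, which is the kind of unused space the paragraph above refers to.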

As for Hangul, I'll let experts like Jungshik review the impact.

Dan the Man with Even more Encodings

  • Re: Encode and Unicode 3.2 and CJK, Dan Kogai <=