perl-unicode

Re: Encode::CJKguide (> 500 lines Long!)

2002-03-27 05:53:18
Hello, Dan!

1) That's been a great job!

Especially the way you have explained
the 0x21-0x7E and 0xA1-0xFE ranges via the tables, I like it! :-)


2.1) And.. maybe just strip off the Unicode part and we'll get
a good guide on CJK? It's a great thing to have the CJK explanation
bundled, isn't it?

2.2) We do not need to cover Unicode here; other pieces of documentation
do this! Jarkko?

2.3) Let us just limit the Unicode section to:

L<Unicode|perlunicode> is designed to be a technical
superset of many of the existing coded character sets.
The best currently supported subset of Unicode, the BMP
(Basic Multilingual Plane, U+0000-U+FFFF),
allows for lossless conversion from encodings that cover most
popular coded character sets, including the basic CJK ones:
JIS X 0208, JIS X 0212, GB 2312-80, KS C 5601.
(Other coded character sets that allow lossless
conversion to Unicode may be added later, on demand.)

As Perl's internal international coded character set is Unicode,
you should have no difficulty in importing and exporting data
from/to EUC-JP, EUC-TW, EUC-CN (aka GB2312), ISO-2022-JP-1 and
other popular encodings that deal with a single CJK script.
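
To illustrate the import/export path described above (legacy bytes in,
Unicode inside, legacy bytes out), here is a minimal sketch. It is written
in Python purely for illustration and relies on Python's built-in `euc_jp`
and `iso2022_jp` codecs; the helper name `transcode` is made up for this
example and is not part of Encode.

```python
# Sketch only: Unicode as the pivot between two single-script CJK encodings.
# `transcode` is a hypothetical helper, not an Encode API.

def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode `data` from `src` into Unicode, then encode it as `dst`."""
    text = data.decode(src)   # import: legacy bytes -> Unicode string
    return text.encode(dst)   # export: Unicode string -> legacy bytes

# Three kanji (U+65E5 U+672C U+8A9E) as EUC-JP bytes
euc_jp_bytes = b"\xc6\xfc\xcb\xdc\xb8\xec"

unicode_text = euc_jp_bytes.decode("euc_jp")
jis_bytes = transcode(euc_jp_bytes, "euc_jp", "iso2022_jp")
round_trip = transcode(jis_bytes, "iso2022_jp", "euc_jp")

print([f"U+{ord(ch):04X}" for ch in unicode_text])
print("round trip identical:", round_trip == euc_jp_bytes)
```

Both encodings here carry the same JIS X 0208 repertoire, which is why
the round trip is expected to be lossless.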

An important fact to consider is that
Japanese script (JIS X 0208/0212),
Simplified Chinese script (GB2312),
Traditional Chinese script (Big5 or CNS)
and
Korean script (KS C 5601)
map to the same area in Unicode.
This is called Han Unification. The script differences
should be conveyed by upper-level protocol details, like the
lang attribute in HTML and the xml:lang attribute in XML.
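
A small sketch of what "map to the same area" means in practice, again in
Python and again only as an illustration: the same ideograph is encoded
with four national encodings and decoded back. The character chosen
(U+4E2D) and the codec names are this example's assumptions; any
ideograph present in all four sets would do.

```python
# Sketch only: one ideograph, four national encodings, one Unicode code point.
HAN = "\u4e2d"  # U+4E2D; present in all four character sets below

CODECS = {
    "Japanese (EUC-JP, JIS X 0208)": "euc_jp",
    "Simplified Chinese (GB 2312)": "gb2312",
    "Traditional Chinese (Big5)": "big5",
    "Korean (EUC-KR, KS C 5601)": "euc_kr",
}

# Encode the character in each legacy encoding ...
legacy = {label: HAN.encode(codec) for label, codec in CODECS.items()}

# ... and decode each byte sequence back to Unicode code points.
code_points = {
    label: [f"U+{ord(ch):04X}" for ch in legacy[label].decode(codec)]
    for label, codec in CODECS.items()
}

for label in CODECS:
    print(f"{label:32} bytes={legacy[label].hex()} -> {code_points[label]}")
```

After decoding, nothing in the Unicode string records which script the
bytes came from, which is why that information has to travel in a
higher-level protocol.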

While it saves considerable Unicode code space, Han Unification makes
unambiguous conversion from Unicode to multi-script encodings
like ISO-2022-JP-2 quite problematic.

-- Full stop. No more on Unicode.

The idea is to turn the question from a political one into a technical one :-)

Present Unicode just as a _technical_ superset
of the existing (and only _basic_!) CJK encodings.

This will probably satisfy both deep Unicode adherents and those
who feel reserved about it.

Here's what Ken Lunde says in cjk.inf:

I feel that it is best to think of ISO 10646-1:1993 as
"just another character set."

Okay, let us present it this way :-)

--

Here's a slightly more emotional variant of my wording above, though
I'm not sure that even that little emotion is allowed :-)

  As Perl's internal international coded character set is Unicode,
  you should have no difficulty in importing and exporting data
  from/to EUC-JP, EUC-TW, EUC-CN (aka GB2312), ISO-2022-JP-1 and
  other popular encodings that deal with a basic character repertoire
  of a single CJK script.


  Welcomed by a large body of experts and users, Han Unification
  has met a reserved reaction from others.
  As a direct result, Han Unification makes, for example, unambiguous
  conversion from Unicode to ISO-2022-JP-2 quite problematic.
  Another issue still to be solved is the unification, between major
  software vendors, of the JIS X * (and possibly GB *, KS C *)
  to/from Unicode conversion tables.
  Regrettably, they are still reported to differ.
  (provide links to articles?)
  You should not run into trouble, however, unless you do something
  like this:
  - translate data from, for example, EUC-JP to Unicode in one application,
  - send it as Unicode to another,
  - convert it back to EUC-JP there
  and get text different from the original. Avoid this scenario
  unless you know that the conversion tables used by both applications
  are identical.

Feel free to use it as raw material to patch in anywhere you like :-)

I have found the following point in some of the articles on the topic
(this is my retelling :-)

  The Yen sign and the backslash tend to be the most troublesome
  characters, because a single code point in 8-bit encodings tends
  to be used for both.
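
For what it is worth, the overlap can be shown with a tiny hand-written
table (Python, illustration only). JIS X 0201 Roman departs from ASCII at
two positions, 0x5C and 0x7E; the function name `read_byte` and the
variant labels are invented for this sketch.

```python
# Sketch only: the same byte read under two 7-bit national variants.
# The table lists just the positions where JIS X 0201 Roman departs from ASCII.
JIS_ROMAN_OVERRIDES = {
    0x5C: "\u00a5",  # YEN SIGN instead of backslash
    0x7E: "\u203e",  # OVERLINE instead of tilde
}


def read_byte(byte: int, variant: str) -> str:
    """Interpret one 7-bit byte as 'ascii' or 'jis-roman' (hypothetical labels)."""
    if not 0x00 <= byte <= 0x7F:
        raise ValueError("expected a 7-bit value")
    if variant == "jis-roman" and byte in JIS_ROMAN_OVERRIDES:
        return JIS_ROMAN_OVERRIDES[byte]
    return chr(byte)


path = b"C:\\temp"  # the bytes 43 3A 5C 74 65 6D 70
as_ascii = "".join(read_byte(b, "ascii") for b in path)
as_jis = "".join(read_byte(b, "jis-roman") for b in path)

print([f"U+{ord(ch):04X}" for ch in as_ascii])
print([f"U+{ord(ch):04X}" for ch in as_jis])
```

Whichever reading a converter picks for 0x5C, data written under the
other assumption changes meaning, so the choice is worth documenting.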

Is this true? Stick it somewhere, with a statement that we treat
this codepoint as a backslash, not Yen?
A KNOWN ISSUES section?

3)
I have made several corrections to the document. I believe they will
be useful in either case: whether the doc ends up as a pod or as a
web page.

(Sorry, I'm not as neat as
Autrijus, so this is not a real diff -u :-( )

I tried not to duplicate the corrections by Autrijus.

3.1)
Maybe

  But there were prices to be paid.  It is harder to port applications
  than EUC because the second byte may look like ASCII when the second
  byte is in 0x40-0xFE.

3.1.1)
  0x40-0xFE -> 0x40-0x7E

3.1.2)
  But there were prices to be paid.  It is harder to port applications
- than EUC because ...
+ from ASCII to this encoding than to EUC because ...

3.1.3.1)

- ... the second byte may look like ASCII when the second
- byte is in 0x40-0xFE.

+ ... the second byte may look like ASCII.

3.1.3.2)

- ... the second byte may look like ASCII when the second
- byte is in 0x40-0xFE.

+ ... the second byte may look like ASCII if it falls into
+ the 0x40-0x7E range.
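
A quick check of the corrected range, sketched in Python. The assumption
here is that Shift_JIS is a fair stand-in for the encoding the quoted
paragraph describes; the character U+8868 is a well-known case whose
second byte is 0x5C.

```python
# Sketch only: a trail byte of a double-byte character that equals an ASCII byte.
# Assumption: Shift_JIS stands in for the encoding discussed in the paragraph.
text = "\u8868"                  # one kanji, two bytes in either encoding
sjis = text.encode("shift_jis")  # 0x95 0x5C
eucjp = text.encode("euc_jp")    # both bytes in 0xA1-0xFE

# A naive, byte-oriented program searching for a backslash:
naive_hit_sjis = b"\\" in sjis   # trail byte 0x5C is mistaken for a backslash
naive_hit_euc = b"\\" in eucjp   # EUC never uses 0x21-0x7E inside a character

trail_in_ascii_range = 0x40 <= sjis[1] <= 0x7E

print("Shift_JIS bytes:", sjis.hex(), "false backslash:", naive_hit_sjis)
print("EUC-JP bytes:   ", eucjp.hex(), "false backslash:", naive_hit_euc)
```

This is the porting cost the paragraph refers to: code that scans bytes
for ASCII delimiters keeps working on EUC data but can split a
double-byte character in the other encoding.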


3.2)

DK> There are two cases to consider.  Those they look different but means
DK> the same (Case 1) and vise varsa (Case 2).  The Han Unification of
DK> Unicode decided to unify based upon Case 1;  Let's unify the ones with
DK> the same shape!

Certainly agree with Autrijus's correction reversing the definition of
Cases 1 and 2 :-)

3.3)

To make it look more like the previous paragraph, introduce 'special':

DK> When you receive an escape sequence, swap GL with the character

- When you receive an escape sequence, swap GL with the character
+ When you receive a special escape sequence, swap GL with the character
  set the escape sequence specifies.  This is called Character Set

3.4)

Just look at it ;-)
  
DK>           JIS X 0208-1978                  94 ** 2   \e $ @
DK>           JIS X 0208-1983                  94 ** 2   \e $ @

-            JIS X 0208-1983                  94 ** 2   \e $ @
+            JIS X 0208-1983                  94 ** 2   \e $ B

I haven't checked the rest, only these two, because they were strikingly
the same.
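
For anyone who wants to check the two sequences by machine, here is a
small Python sketch. The table of escape sequences follows the
ISO-2022-JP definition (written here without the spaces the document's
table uses for readability); the function name `designations` is
invented for this example, and the scan is deliberately naive.

```python
# Sketch only: find the designation escapes in an ISO-2022-JP byte string.
ESC = b"\x1b"
KNOWN_DESIGNATIONS = {
    ESC + b"(B": "ASCII",
    ESC + b"(J": "JIS X 0201-1976 Roman",
    ESC + b"$@": "JIS X 0208-1978",
    ESC + b"$B": "JIS X 0208-1983",
}


def designations(data: bytes) -> list:
    """Return the names of the known escape sequences in `data`, in order."""
    found = []
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in KNOWN_DESIGNATIONS:
            found.append(KNOWN_DESIGNATIONS[seq])
            i += 3  # skip the whole three-byte escape sequence
        else:
            i += 1
    return found


# What Python's own iso2022_jp codec emits for two kanji (U+65E5 U+672C)
encoded = "\u65e5\u672c".encode("iso2022_jp")

print(encoded)
print(designations(encoded))
print(designations(b"\x1b$@F|\x1b(B"))
```

The encoder output uses the `$ B` form, and the two final bytes `@` and
`B` are the only difference between the 1978 and 1983 designations.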

My best regards, Anton (after some sleep :-)