perl-unicode

Re: 5.8 roadmap and Encode

2002-02-28 09:05:19
On Thu, Feb 28, 2002 at 11:02:43PM +0900, SADAHIRO Tomoyuki wrote:
Gb2312.enc is used for HZ (type H) and ISO-2022-CN (type E)
  (they use 7-bit encoding)
as one of their sub-charsets, isn't it?

No. GB2312 isn't really one encoding specification; instead it's
a charset that could be encoded in one of three ways:

- 'euc-cn', the preferred encoding; it's available both as 'euc-cn'
  and 'gb2312' in gnu libiconv.
- 'hz', a 7-bit escaped encoding. The "raw" doublebyte representation
  is escaped with ~{...~} sequences.
- 'iso-2022-cn', similar to 'hz', but using ISO 2022 escape sequences
  and shift codes instead of ~{...~}.
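
To make the difference between these forms concrete, here is a minimal
sketch in Python. Python is used only as a convenient reference; it is an
assumption that its 'gb2312' codec corresponds to euc-cn and its 'hz' codec
to the hz encoding described above. The sample string and the `raw` variable
are illustrative, and iso-2022-cn is left out because Python ships no codec
for it.

```python
# Sample text: two characters that are part of the GB2312 charset.
text = "中文"

# EUC-CN: every GB2312 byte has its high bit set.
# (Python names this codec 'gb2312'.)
euc_cn = text.encode("gb2312")

# "Raw" double-byte form: the same row/column bytes with the high bit
# cleared. This is derived by hand for illustration; it is not a codec.
raw = bytes(b & 0x7F for b in euc_cn)

# HZ: the raw bytes wrapped in ~{ ... ~} so they survive 7-bit channels.
hz = text.encode("hz")

print(euc_cn.hex(" "))
print(raw)
print(hz)

# The raw form collides with plain ASCII: decoded as euc-cn it is just
# four ASCII letters, not the original Chinese text.
print(raw.decode("gb2312") == text)
```

The euc-cn bytes are d6 d0 ce c4; clearing the high bit gives bytes in the
printable ASCII range, which is why the raw form cannot be mixed with
7-bit ASCII text without some kind of escaping.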

The current gb2312.enc seems to map to the "raw" doublebyte representation
instead of any of the above; I tested it with gnu libiconv 1.7, and
gb2312.enc can't parse any one of these encodings. Similarly, the text
generated by '>:encoding(gb2312)' seems to be a doublebyte charset that is
not valid euc-cn, hz or iso-2022-cn.

(To further complicate the matter, what Windows means by 'GB2312' is
really GBK (the 'extended' GB2312, including Traditional Chinese
characters), which is not yet supported in the .enc files.)
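
A quick way to see the GB2312/GBK gap is to try a Traditional character
against both. This is a sketch using Python's standard codecs as a stand-in
(an assumption; it says nothing about what the .enc files do), and
`can_encode` is a hypothetical helper written for this example.

```python
# '国' is the Simplified form (in GB2312); '國' is the Traditional form,
# which GBK adds but strict GB2312 lacks.
simplified, traditional = "国", "國"


def can_encode(ch: str, codec: str) -> bool:
    """Return True if `ch` is representable in `codec`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for codec in ("gb2312", "gbk"):
    print(codec, can_encode(simplified, codec), can_encode(traditional, codec))
```

Text that Windows labels 'GB2312' can therefore contain characters that a
strict GB2312 decoder has no mapping for.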

Executive summary:

* Simplified Chinese in Encode.pm may be considered 'working' for
  what most people use (gb2312's euc-cn).

* Traditional Chinese in Encode.pm may also be considered 'working'
  for the basic big-5 range; its punctuation mappings were fixed and
  patched according to the big5p spec.

* The gb2312.enc is very broken. AFAIK nobody uses the raw/unencoded
  GB2312, since it's not interoperable with 7-bit ASCII. We should
  either make it synonymous with euc-cn, or remove it.
  
For Chinese usage, the following 7 encodings are not here yet, but we can
also add them if desired:

  - 'hz' and 'iso-2022-cn', two different encoding tables for gb2312
    described above.

  - 'gb18030', used in glibc2.2, is a superset of gbk, which is a
    superset of gb2312; we should use that instead of 'gbk' if we want gbk
    support.

  - 'iso-ir-165', a different extension to gb2312, adding gb6345 and
    gb8565 support. Not in wide use.

  - 'iso-2022-cn-ext', the iso-2022'ized version of all characters in
    gb(2312|12345|7589|7590), iso-ir-165, and cns-11643-*. It's a sort
    of 'unified chinese code'.

  - 'big5p', the Big5+ Traditional Chinese encoding, is similarly a
    superset of 'big5'; it provides a more complete unicode mapping
    that covers most of Taiwan's uses.

  - 'big5-hkscs', a different extension to big5, adding characters used
    in Hong Kong; it is incompatible with big5p.
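
The gb2312 / gbk / gb18030 superset chain from the list above can be
checked with a short sketch. Again this leans on Python's standard codecs
as a reference (an assumption, not a statement about libiconv or the .enc
tables); `encodes_as` and the sample characters are made up for the example.

```python
from typing import Optional


def encodes_as(ch: str, codec: str) -> Optional[bytes]:
    """Return the encoded bytes, or None if `codec` cannot represent `ch`."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


# A GB2312 character keeps the same two bytes in every superset.
common = "中"
same_bytes = (
    encodes_as(common, "gb2312")
    == encodes_as(common, "gbk")
    == encodes_as(common, "gb18030")
)
print("same bytes in all three:", same_bytes)

# U+20000 lies outside the BMP: GBK has no mapping for it, while
# GB18030 covers it with a four-byte sequence.
rare = "\U00020000"
print("gbk:", encodes_as(rare, "gbk"))
print("gb18030 length:", len(encodes_as(rare, "gb18030")))
```

Because the shared characters keep identical bytes, a gb18030 table can
stand in for gbk and euc-cn when decoding, which is the reason for
preferring it over a separate 'gbk' table.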

Gnu libiconv has most of the above mappings other than big5p; I'm willing
to supply their maps if it's ok with the list.

/Autrijus/
