On Thu, Feb 28, 2002 at 11:02:43PM +0900, SADAHIRO Tomoyuki wrote:
> Gb2312.enc is used for HZ (type H) and ISO-2022-CN (type E)
> (they use 7-bit encoding)
> as one of their sub-charsets, isn't it?
No. GB2312 isn't really one encoding specification; instead it's
a charset that could be encoded in one of three ways:
- 'euc-cn', the preferred encoding; it's available both as 'euc-cn'
and 'gb2312' in gnu libiconv.
- 'hz', a 7-bit escaped encoding. The "raw" doublebyte representation
is escaped with ~{...~} sequences.
- 'iso-2022-cn', similar to 'hz', but it switches charsets with
ISO 2022 escape sequences (ESC, SO/SI) instead of ~{...~}.
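
To make the relationship between the forms concrete, here is a minimal sketch. It assumes Python's bundled codecs purely for illustration (Python's 'gb2312' codec emits euc-cn); it is not how Encode.pm or the .enc files work, and the helper names are made up for this example.

```python
# Illustration only: Python's codecs stand in for the encodings discussed.
# Helper names (euc_cn_to_raw, raw_to_hz) are hypothetical.

def euc_cn_to_raw(data: bytes) -> bytes:
    """Clear the high bit of each byte; valid only for pure double-byte text."""
    return bytes(b & 0x7F for b in data)

def raw_to_hz(raw: bytes) -> bytes:
    """Wrap raw GB2312 double bytes in HZ's ~{ ... ~} shift sequences."""
    return b"~{" + raw + b"~}"

text = "\u4e2d\u6587"          # two GB2312 characters (U+4E2D, U+6587)
euc = text.encode("gb2312")    # euc-cn: raw bytes with the high bit set
raw = euc_cn_to_raw(euc)       # "raw" double-byte form, all bytes < 0x80
hz = raw_to_hz(raw)            # 7-bit HZ form

print(euc.hex(), raw, hz)
print(hz.decode("hz") == text)
```

The point is only that euc-cn and hz are two wrappings of the same double-byte codes; iso-2022-cn is left out because Python ships no codec for it.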
The current gb2312.enc seems to map to the "raw" doublebyte representation
instead of any of the above; I tested it with GNU libiconv 1.7, and it
can't parse any of these encodings. Similarly, the text generated
by '>:encoding(gb2312)' seems to be a doublebyte stream that cannot be
read as euc-cn, hz or iso-2022-cn.
(To further complicate the matter, what Windows means by 'GB2312' is
really GBK (the 'extended' GB2312, including Traditional Chinese
characters), which is not yet supported in the .enc files.)
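
A quick way to see the GB2312/GBK difference, sketched with Python's codecs as a stand-in (an assumption for illustration; its 'gb2312' codec is strict GB2312 and 'gbk' is the extended set). The helper name can_encode is hypothetical.

```python
# Illustration only: checks which charset can represent a given character.

def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the codec's charset."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

simplified = "\u4e1c"    # U+4E1C, Simplified form of "east"
traditional = "\u6771"   # U+6771, Traditional form of the same word

for codec in ("gb2312", "gbk"):
    print(codec, can_encode(simplified, codec), can_encode(traditional, codec))
```

Text labelled 'GB2312' by Windows may therefore contain characters like the second one, which a strict GB2312 table has no slot for.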
Executive summary:
* Simplified Chinese in Encode.pm may be considered 'working' for
what most people use (gb2312's euc-cn).
* Traditional Chinese in Encode.pm may also be considered 'working'
for the basic big-5 range; its punctuation mappings were fixed and
patched according to the big5p spec.
* The gb2312.enc is very broken. AFAIK nobody uses the raw/unencoded
GB2312, since it's not interoperable with 7-bit ascii. We should
either make it synonymous with euc-cn, or remove it.
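
The interoperability problem with the raw form can be sketched as follows, again assuming Python's codecs only as a stand-in and a hypothetical helper name: the raw double bytes are indistinguishable from printable ASCII, while euc-cn keeps them apart by setting the high bit.

```python
# Illustration only: why raw GB2312 cannot be mixed with 7-bit ASCII.

def euc_cn_to_raw(data: bytes) -> bytes:
    """Clear the high bit of each byte; valid only for pure double-byte text."""
    return bytes(b & 0x7F for b in data)

word = "\u4e2d\u6587"                  # U+4E2D, U+6587
euc = word.encode("gb2312")            # Python's 'gb2312' codec is euc-cn
raw = euc_cn_to_raw(euc)

# Every raw byte lies in 0x21..0x7E, so the stream is also valid ASCII
# and a reader cannot tell Chinese text from Latin letters.
as_ascii = raw.decode("ascii")
print(as_ascii)

# In euc-cn every byte of a Chinese character is >= 0x80, so ASCII and
# Chinese can share one stream and still round-trip.
mixed = "OK " + word
round_trip = mixed.encode("gb2312").decode("gb2312")
print(round_trip == mixed)
```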
For Chinese usage, the following 7 encodings are not here yet, but we can
also add them if desired:
- 'hz' and 'iso-2022-cn', two different encoding tables for gb2312
described above.
- 'gb18030', used in glibc2.2, is a superset of gbk, which is a
superset of gb2312; we should use that instead of 'gbk' if we want gbk
support.
- 'iso-ir-165', a different extension to gb2312, adding gb6345 and
gb8565 support. Not in wide use.
- 'iso-2022-cn-ext', the iso-2022'ized version of all characters in
gb(2312|12345|7589|7590), iso-ir-165, and cns-11643-*. It's a sort
of 'unified chinese code'.
- 'big5p', the Big5+ Traditional Chinese encoding, is similarly a
superset of 'big5'; it provides a more complete unicode mapping
that covers most of Taiwan's uses.
- 'big5-hkscs', a different extension to big5, adding characters used
in Hong Kong, incompatible with big5p.
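
The gb2312 / gbk / gb18030 superset chain mentioned in the list can be checked with a small sketch. It assumes Python's codecs of the same names as a stand-in (not the libiconv tables), and first_codec_covering is a hypothetical helper.

```python
# Illustration only: find the smallest GB-family charset covering a string.

GB_FAMILY = ("gb2312", "gbk", "gb18030")   # ordered from smallest to largest

def first_codec_covering(text: str):
    """Return the first codec in GB_FAMILY that can encode text, else None."""
    for codec in GB_FAMILY:
        try:
            text.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue
    return None

samples = ["\u4e2d", "\u6771", "\U00020000"]   # GB2312, GBK-only, non-BMP
for ch in samples:
    print("U+%04X" % ord(ch), first_codec_covering(ch))

# Characters shared with the smaller set keep the same bytes in the larger one.
same_bytes = "\u4e2d".encode("gb2312") == "\u4e2d".encode("gb18030")
print(same_bytes)
```

Because the shared characters keep their byte values, adding gb18030 would also cover text that is really gbk or euc-cn.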
GNU libiconv has most of the above mappings other than big5p; I'm willing
to supply their maps if it's ok with the list.
/Autrijus/