perl-unicode

Re[2]: [tagunov(_at_)motor(_dot_)ru: Better names for JIS X 0201/0208/0212? (was: ISO-8859-1 vs ISO 8859-1 (typo + UTF8 case too :)]

2002-03-19 15:24:05
Hello, Dan!
Hello, Jarkko!
Hello, Nick!

I'm a bit confused about perl-unicode(_at_)perl(_dot_)org. Is that a better place
for our conversation?  Is it alive? Does it have any traffic?

First of all I'm terribly glad to have somebody with a .jp address
to step into the discussion! :-)))

1) Yes, I have read a lot lately, and I have grasped the ideas behind
   ISO 2022 and EUC.

2) Indeed, I have sent plenty of mail to perl5-porters(_at_)perl(_dot_)org
   about JIS 0201, JIS 0208, JIS 0212, GB 1988, GB 2312 in Encode.pm.

   My first reaction to seeing them in Encode.pm was the same
   as Dan's: "what are they doing there?"

  (using terminology from the
  http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
  page)
  "they are Coded Character Sets, not Character Encoding Schemes!"

  "Wipe 'em out of Encode.pm!"

3) But I did not meet much approval for this.
   Here are some of the comments from
   NIS == Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com>:
   
   AT>  GB 2312 ...
   AT>  I would advise to exclude it from Encode.pm
   NIS> We have it due to fact we inherited stuff from Tcl/Tk - which needs
   NIS> it for fonts e.g.:
   NIS>
   NIS> nick(_at_)bactrian 1051$ xlsfonts | grep 2312
   NIS> -cc-song-medium-r-normal-jiantizi-40-400-75-75-c-400-gb2312.1980-0
   AT>  GB 12345 (GB/T 12345-9)
   NIS> Used for fonts I assume.
   AT>  CNS 11643-1992
   NIS> I see a trend here - perhaps a separate bundle for font encodings?
   AT>  GB 18030
   AT>  BIG5PLUS
   NIS> They are mostly inherited from Tcl

   AT>  is GB 2312 valid as a parameter to Encode::encode?
   NIS> We have a gb2312.enc
   NIS>
   NIS> FWIW (and worth has to be -ve) I get a daily pile of SPAM with Subjects
   NIS> like:
   NIS>
   NIS> Subject: =?GB2312?B?0LvQu8Tjo6E=?=
   NIS>
   NIS> So something thinks it is an encoding.
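   Nick's Subject line can be unpacked to see what the sender meant by
   "GB2312". A minimal sketch in Python (my choice for illustration, not
   anything Encode.pm does), assuming Python's "gb2312" codec, which
   implements the EUC-CN byte form of the GB 2312 table:

```python
from email.header import decode_header

SUBJECT = "=?GB2312?B?0LvQu8Tjo6E=?="

# decode_header yields (payload, charset) pairs; the payload is bytes
# whenever the encoded-word carries a charset label.
raw, charset = decode_header(SUBJECT)[0]

# Every byte has its most significant bit set, i.e. this is the
# EUC-style byte form rather than the raw 7-bit 94 x 94 table codes.
all_high = all(b >= 0xA1 for b in raw)

text = raw.decode(charset)

print(charset, raw.hex(), all_high)
print([hex(ord(ch)) for ch in text])  # code points, safe on any terminal
```

   So the mailer uses the label for a byte-level encoding, not for the
   abstract table.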

   Since I did not get an immediate 100% approval to wipe 'em out of
   Encode.pm, I suspected that I was wrong and continued my
   search for new knowledge.
   And it seems to me today that I have understood why
   JIS 0201, JIS 0208, JIS 0212, GB 2312 are in Encode.pm.

4) We have a terminology mess here. What is a "coded character set"?
   rfc1345 (http://www.ietf.org/rfc/rfc1345.txt):

 From RFC1345:
 The ISO definition of the term "coded character set" is as follows:
 "A set of unambiguous rules that establishes a character set and the
 one-to-one relationship between the characters of the set and their
 coded representation." and this definition may be subject to
 different interpretations.  This memo does not put further
 restrictions on the term of "coded character set" than the following:
 "A coded character set is a set of rules that unambiguously and
 completely determines which sequence of characters, if any, is
 represented by each possible sequence of n-bit bytes for a certain
 value of n." This implies that e.g. a coded character set extended
 with one or more other coded character sets by means of the extension
 techniques of ISO 2022 constitutes a coded character set in its own
 right.  In this memo the term "charset" is used to refer to the above
 interpretation of the ISO term "coded character set".

   On the other hand,
   http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
   uses Coded Character Set and Character Encoding Scheme
   separately (though I haven't found where the terms used in
   that page are defined)

   What meaning of a "coded character set" should we stick to?
   
   Let me use
   - "CCS" (acronym of Coded Character Set)
     when I mean abstract enumeration of characters in general
     (e.g. the JIS X 0208 94 x 94 table enumerates chars
      via the KUTEN (row-cell) code, so this 94 x 94
      table specifies a CCS)
   - "CES" (acronym of Character Encoding Scheme)
     when I mean mapping from sequences of 8-bit (or 7-bit)
     bytes into character sequences and back
   This terminology comes from
   http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
   

   On the other hand let me use "coded character set"
   in the meaning that the cited fragment of rfc1345 has
   attached to it.

   So
     "CES" === "coded character set"
     "CCS"  ne "coded character set"
     "CES"  ne "CCS"

   http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
   says "ISO 8859 is both a series of CCS and a series of encodings"
   That is the crux: a CCS and a CES at the same time.
   
   The question:

   are JIS X 0201, JIS X 0208, JIS X 0212 only "CCS"-es or
   "CCS"-es and "CES"-es at the same time?
   Dan's mail said they are only "CCS"-es.
   Encode.pm operates on "CES"-es.
   That was my original opinion too.

   But:
   http://www.iana.org/assignments/character-sets
   is a "Character Sets Registry"
   
   What meaning does this registry attach to the term "coded
   character set"? I believe that it is "CES" after all. It contains
   things like Shift_JIS, and no one will doubt that Shift_JIS
   is a "CES".

   But it (the IANA registry) also contains things like

   Name: JIS_C6220-1969-jp                       [RFC1345,KXS2]
   Source: ECMA registry
   Alias: JIS_C6220-1969
   Alias: iso-ir-13
   Alias: katakana

   Name: JIS_C6220-1969-ro                                 [RFC1345,KXS2]
   Source: ECMA registry
   Alias: iso-ir-14
   Alias: jp
   Alias: ISO646-JP
   Alias: csISO14JISC6220ro

   How do we understand this? Is JIS_C6220-1969-jp a "CES"?
   For the answer we should look at iso-ir-13. I have no
   access to the ISO documents, but RFC1345 seems to
   retell the ISO documents. RFC1345 explains to us what
   iso-ir-13 is.



 From RFC1345:
 It is the intention of this
 memo to document precisely the mapping between all characters and
 their corresponding coded representations in various coded character
 sets, and give names to these coded character sets, so they can be
 referenced unambiguously in Internet standards.
 ...
 The coded character sets covered include all parts of ISO 8859, ISO
 6937-2 and all ISO 646 conforming coded character sets in the ISO
 character set registry managed by ECMA according to ISO 2375.  Almost
 all graphic coded character sets in the ECMA registry (1) are
 covered. ... The
 East-Asian 16-bit character sets from the ECMA registry is also
 included in this memo.
 
   So rfc1345 actually specifies a number of "coded character sets".
   This is what it has for iso-ir-13:

  &charset JIS_C6220-1969-jp
  &rem source: ECMA registry
  &alias JIS_C6220-1969
  &alias iso-ir-13
  &alias katakana
  &alias x0201-7
  &g0esc x2849 &g1esc x2949 &g2esc x2a49 &g3esc x2b49
  &code 0
  NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
  DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
  SP ._ <' >' ,_ .6 Wo a6 i6 u6 e6 o6 YA YU YO TU
  -6 A6 I6 U6 E6 O6 Ka Ki Ku Ke Ko Sa Si Su Se So
  Ta Ti Tu Te To Na Ni Nu Ne No Ha Hi Hu He Ho Ma
  Mi Mu Me Mo Ya Yu Yo Ra Ri Ru Re Ro Wa N6 "5 05
  ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
  ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? DT

  It has taken me quite a bit of time to understand this,
  but I have done that :-)
  - &g0esc... line is not very interesting to us, it
    only lists the ESC sequences that are used in ISO 2022
    to plug in JIS_C6220-1969-jp as G0/G1/G2/G3
  - &code 0   is more interesting. This is how &code is
    defined:

 From RFC1345:
 "&code" has one parameter indicating the byte number allocated to the
 following character mnemonic. After the "&code" specification the
 characters are listed with their mnemonic in ascending order.  A
 character mnemonic of "??" indicates that the position is unused.

  This means that RFC1345 defines "CES"-es too! Look, &code 0 means
  that the byte values, numbered starting with 0, are allocated to the
  following character mnemonics!! That is a relation between sequences
  of bytes and characters!!

  The mnemonics NU SH SX EX ET.. stand for C0 control chars,
  Ta Ti Tu Te To Na.. stand for Katakana characters. (RFC1345
  explains the mnemonics in section 3, but it is so boring to read :-)
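  To make the "&code" rule concrete, here is a rough Python sketch that
  turns the iso-ir-13 table quoted above into a byte -> mnemonic map.
  The function name parse_code_table is my own invention, and the parser
  handles only the "&code" directive (it skips any other line starting
  with "&", which is an assumption that holds for this excerpt):

```python
ISO_IR_13 = r'''
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ._ <' >' ,_ .6 Wo a6 i6 u6 e6 o6 YA YU YO TU
-6 A6 I6 U6 E6 O6 Ka Ki Ku Ke Ko Sa Si Su Se So
Ta Ti Tu Te To Na Ni Nu Ne No Ha Hi Hu He Ho Ma
Mi Mu Me Mo Ya Yu Yo Ra Ri Ru Re Ro Wa N6 "5 05
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? DT
'''


def parse_code_table(spec):
    """Map byte values to RFC 1345 mnemonics; '??' positions stay unmapped."""
    table = {}
    position = None
    for line in spec.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "&code":
            position = int(tokens[1])  # byte number of the next mnemonic
            continue
        if tokens[0].startswith("&") or position is None:
            continue  # other directives are ignored in this sketch
        for mnemonic in tokens:
            if mnemonic != "??":
                table[position] = mnemonic
            position += 1  # mnemonics are listed in ascending byte order
    return table


table = parse_code_table(ISO_IR_13)
print(hex(0x31), table[0x31])
```

  Each byte value maps to at most one character, which is exactly the
  "sequence of bytes <-> sequence of characters" relation.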

 From RFC1345:
 "&code2" has 2 parameters specifying the row and column in certain
 16-bit character sets.  The value 32 must be added to obtain the
 first and second byte respectively.

  This is how RFC1345 deals with JIS X 0208:

  &charset JIS_C6226-1983
  &rem source: ECMA registry
  &alias iso-ir-87
  &bits 16
  &g0esc x2442 &g1esc x242942 &g2esc x242a42 &g3esc x242b42
  &alias x0208
  &alias JIS_X0208-1983
  &code2 1 1
  SP ,_ ._ , . .6 : ; ! "5 05 '' '! ': '> '- _ *6 +6 *5 +5 +"
  
  So we see that SP (space) in JIS_C6226-1983 gets a two-byte
  code. We get it this way: take &code2's "1 1", conclude
  it is row 1, cell 1, add 0x20 (32) to each of them, get
  0x21 0x21, and take this to be the two-byte code for SP
  in JIS_C6226-1983.
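  The arithmetic can be checked with a few lines of Python. This is only
  a sketch: kuten_to_jis and jis_to_euc are names I made up, and the
  cross-check relies on Python's own "euc_jp" and "iso2022_jp" codecs
  (not on Encode.pm) agreeing with the JIS X 0208 table:

```python
def kuten_to_jis(row, cell):
    """Row/cell (1..94 each) -> the two 7-bit bytes, per the &code2 rule."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def jis_to_euc(jis):
    """EUC form of the same code: set the most significant bit of each byte."""
    return bytes(b | 0x80 for b in jis)


sp = kuten_to_jis(1, 1)          # row 1, cell 1
euc_sp = jis_to_euc(sp)

# Cross-check against Python's codecs: both byte forms should decode
# to the same character (the ideographic space, U+3000).
from_euc = euc_sp.decode("euc_jp")
from_iso2022 = (b"\x1b$B" + sp + b"\x1b(B").decode("iso2022_jp")

print(sp.hex(), euc_sp.hex(), hex(ord(from_euc)), hex(ord(from_iso2022)))
```

  The same row/cell number thus yields different byte sequences
  depending on which scheme carries it.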

  
5) So, my conclusion: everything listed in
   http://www.iana.org/assignments/character-sets
   is a "CES".

   All coded character sets mentioned in RFC 1345 are CES-es too.

   This is what both RFC 1345 and IANA registry have:

   &charset JIS_X0201
   &alias X0201

   Name: JIS_X0201                       [RFC1345,KXS2]
   Alias: X0201

   This _must_ be a CES. What is it? If we look into RFC, we'll see
   a table with 256 mnemonics. First half is JIS X 0201 Roman,
   the second half is JIS X 0201 Katakana.
   The IANA registry usually just names the encoding. It does
   not explain what it is. But in some cases there's an explanation.
   JIS_X0201 is such a case:
   
     Name: JIS_X0201                                [RFC1345,KXS2]
     Source: JIS X 0201-1976.   One byte only, this is equivalent to
     JIS/Roman (similar to ASCII) plus eight-bit half-width
     Katakana

   Just what we could conclude from RFC1345. Hurray! :-)))
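   Read this way, JIS_X0201 is a plain one-byte encoding and can be
   decoded byte by byte. A small Python sketch, assuming the usual
   Unicode mapping (0x5C is the yen sign, 0x7E the overline, 0xA1-0xDF
   the half-width Katakana block starting at U+FF61); the function name
   is hypothetical and this is not the code in Encode.pm:

```python
def decode_jis_x0201(data):
    """Decode 8-bit JIS X 0201: Roman in the low half, Katakana in the high half."""
    out = []
    for b in data:
        if b == 0x5C:
            out.append("\u00a5")            # YEN SIGN where ASCII has backslash
        elif b == 0x7E:
            out.append("\u203e")            # OVERLINE where ASCII has tilde
        elif b < 0x80:
            out.append(chr(b))              # the rest of the low half matches ASCII
        elif 0xA1 <= b <= 0xDF:
            out.append(chr(0xFF61 + b - 0xA1))  # half-width Katakana
        else:
            raise ValueError("byte 0x%02X is not assigned" % b)
    return "".join(out)


decoded = decode_jis_x0201(b"a\x5c\xb1")
print([hex(ord(ch)) for ch in decoded])
```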

   Then the only thing we should do about this is to come up
   with a better name: JIS 0201 is bad. Would it be better as
   JIS_X0201? X0201? 'JIS X 0201'? 'JIS-X-0201'?

   Same thing with JIS 0208, JIS 0212.
   They are registered both in RFC1345 and in the IANA registry.

   Why should we wipe them from Encode.pm if these are CES-es
   and they are doing something reasonable?
   
6) The situation is worse with GB 1988.
   Unlike JIS X 0201 it defines only one set of characters:
   ASCII with $ replaced by the Yuan sign.

   But our 1988.enc seems to use the codes 0xA1-0xFE the same
   way as JIS X 0201 does, for Katakana. Is this reasonable?
   Do the Chinese use Katakana?
   (But this was in another mail of mine.)

7) GB_2312-80 is in just the same situation: at first glance
   it is only a CCS, but at second glance it is a CES too!
   It is registered both in RFC1345 and in the IANA registry.
   The only thing we have to do with it is think of a better
   name. (See my other mail to see why GB 2312 is ambiguous.)

8) The situation with
    CNS 11643
    GB 12345
   is worse because they are registered neither in RFC1345 nor
   in the IANA registry. They also look like CCS-es at
   first glance. But their being in Encode.pm seems
   to indicate that they are full-fledged CES-es in their own right,
   just like all the others mentioned so far.


Uff.. It hasn't been a piece of cake to understand this,
but once I have done this I'm more or less content :-))

The only things left to do are
- fix the comments in Encode.pm:
  JIS 0201,JIS 0208,JIS 0212,GB 2312 are not good names
  
- GB 1988 may be just broken (should there be katakana in GB 1988?)

- put at least _some_ of _this_ knowledge into Encode.pm's
  comments (this is what my patch is partially about)

- submit a bug report saying that the aliasing mechanisms are broken
  in Encode.pm

And have some sleep :-)))

My best regards, Anton

DK> Anton,

DK>    You may not know this but JISX2\d\d does not determine encoding at
DK> all.  They are tables that actual encodings are based upon.  In other words,
DK> RAW JISX2\d\d are NEVER used in actual encodings.  Here is how EUC-JP 
DK> uses ALL of them.

DK> EUC-JP consists of....
DK> ASCII:                  0x00-0x7f
DK> JIS X 0201-1978:        0x8E + (X0201 Code beyond 0x80)
DK> JIS X 0208-1990:        (X0208 Code) + 0x8080
DK> JIS X 0212-1990:        0x8F . (X0212 Code)

DK>    In other words, virtually all multibyte encodings are not only 
DK> multibyte, but also multitabled.
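The "multitabled" layout Dan describes can be sketched as a lead-byte
dispatch. This is an illustrative Python sketch only: split_euc_jp and
the table labels are hypothetical names, and it follows the common
EUC-JP convention in which the two bytes after 0x8F also have their high
bit set:

```python
def split_euc_jp(data):
    """Split EUC-JP bytes into (table, unit) pairs by looking at the lead byte."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            table, size = "ascii", 1
        elif lead == 0x8E:
            table, size = "jisx0201-kana", 2   # 0x8E + one Katakana byte
        elif lead == 0x8F:
            table, size = "jisx0212", 3        # 0x8F + two bytes
        elif 0xA1 <= lead <= 0xFE:
            table, size = "jisx0208", 2        # two bytes, both with MSB set
        else:
            raise ValueError("unexpected lead byte 0x%02X" % lead)
        unit = data[i:i + size]
        if len(unit) < size:
            raise ValueError("truncated sequence at offset %d" % i)
        units.append((table, unit))
        i += size
    return units


# "a", half-width Katakana A (U+FF71), and the kanji U+4E9C
sample = "a\uff71\u4e9c".encode("euc_jp")
units = split_euc_jp(sample)
print(sample.hex(), [(t, u.hex()) for t, u in units])
```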
DK>    Then how come there are Encode/jis02\d\d ?  Good question.  They are 
DK> used by Encode::Tcl to implement 7-bit jis.  Encode/7bit-jis.enc looks 
DK> like this.

DK> # Encoding file: 7bit-jis, escape-driven
DK> E
DK> name            7bit-jis
DK> init            {}
DK> final           {}
DK> ascii           \x1b(B
DK> ascii           \x1b(J
DK> 7bit-kana       \x1b(I
DK> jis0208         \x1b$B
DK> jis0208         \x1b$@
DK> jis0208         \x1b&@\x1b$B
DK> jis0212         \x1b$(D

DK>    This is how ISO-2022 implements a given encoding.  It switches the 
DK> "current" encoding by escape sequence.  Since escape sequence is used, 
DK> "raw" encodings are directly applied, while EUC turns Most significant 
DK> bit (MSB) on.
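The escape-driven switching can be sketched in Python using the escape
table from the 7bit-jis.enc listing above. split_by_designation is a
made-up name, the initial state is assumed to be "ascii", and this is a
sketch of the idea rather than how Encode::Tcl actually does it:

```python
# Escape sequence -> table name, copied from the 7bit-jis.enc listing.
DESIGNATIONS = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "ascii",
    b"\x1b(I": "7bit-kana",
    b"\x1b$B": "jis0208",
    b"\x1b$@": "jis0208",
    b"\x1b&@\x1b$B": "jis0208",
    b"\x1b$(D": "jis0212",
}
# Try longer sequences first so a short one never shadows a long one.
_ESCAPES = sorted(DESIGNATIONS, key=len, reverse=True)


def split_by_designation(data, initial="ascii"):
    """Cut a 7-bit JIS stream into (table, raw bytes) runs at each escape."""
    segments = []
    current = initial
    run = bytearray()
    i = 0
    while i < len(data):
        for esc in _ESCAPES:
            if data.startswith(esc, i):
                if run:
                    segments.append((current, bytes(run)))
                    run = bytearray()
                current = DESIGNATIONS[esc]   # switch the "current" table
                i += len(esc)
                break
        else:
            run.append(data[i])               # raw table code, MSB off
            i += 1
    if run:
        segments.append((current, bytes(run)))
    return segments


stream = "ab\u4e9cc".encode("iso2022_jp")
segments = split_by_designation(stream)
print(stream, segments)
```

Note that the bytes inside the jis0208 run are ordinary 7-bit values;
only the preceding escape sequence says how to read them.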
DK>    As a transfer encoding,  ISO-2022 is great because in theory it can 
DK> swallow any number of subcodings.  However, it is a pain in the neck to
DK> use it as internal encoding because you can't tell what subcoding we
DK> are in now just by looking at a given byte.  In case of EUC you tell if it
DK> is single byte or multibyte just by looking at MSB.  And Unicode was 
DK> born out of even more ambition by making all character double-byte (but 
DK> this has failed (or compromised if you don't like the word) when 
DK> surrogate pair was introduced).

DK> Dan the Man with Too Many Codings

DK> On Wednesday, March 20, 2002, at 04:28 , Jarkko Hietaniemi wrote:
More from Anton...

----- Forwarded message from Anton Tagunov <tagunov(_at_)motor(_dot_)ru> -----

Subject: Better names for JIS X 0201/0208/0212? (was:  ISO-8859-1 vs ISO 8859-1 (typo + UTF8 case too :)
From: Anton Tagunov <tagunov(_at_)motor(_dot_)ru>
Date: Tue, 19 Mar 2002 20:32:26 +0300
Message-ID: <17271715972(_dot_)20020319203226(_at_)motor(_dot_)ru>
To: Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com>
Cc: nick(_at_)unfortu(_dot_)net, perl5-porters(_at_)perl(_dot_)org
In-reply-To: <20020319085528(_dot_)1397(_dot_)4(_at_)bactrian(_dot_)elixent(_dot_)com>

Hello, Nick! Hello, all!

I certainly think that the names like 'JIS 0201'
are embarrassing.

Here's rfc1345

  &charset JIS_X0201
  &alias X0201
  ...8-bit, JIS-Roman (0x21-0x7E) + JIS-Katakana (0xA1-0xFE)

  &charset JIS_C6226-1983
  &alias iso-ir-87
  &bits 16
  &alias x0208
  &alias JIS_X0208-1983

  &charset JIS_X0212-1990
  &alias x0212
  &alias iso-ir-159
  &bits 16

here's IANA registry

  Name: JIS_X0201                                  [RFC1345,KXS2]
  MIBenum: 15
  Source: JIS X 0201-1976.   One byte only, this is equivalent to
          JIS/Roman (similar to ASCII) plus eight-bit half-width
          Katakana
  Alias: X0201

  Name: JIS_C6226-1983                             [RFC1345,KXS2]
  Alias: iso-ir-87
  Alias: x0208
  Alias: JIS_X0208-1983

  Name: JIS_X0212-1990                             [RFC1345,KXS2]
  MIBenum: 98
  Alias: x0212
  Alias: iso-ir-159


Are

  JIS_X0201      / X0201 /
  JIS_C6226-1983 / X0208 /
  JIS_X0212      / X0212

better candidates?
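
Whichever canonical names win, the other spellings could be kept as
aliases. A rough Python sketch of such a lookup follows; the ALIASES
table, the normalisation rule (lowercase, drop everything but letters
and digits) and the resolve function are all my assumptions for
illustration, not how Encode.pm resolves names:

```python
import re

# Canonical name -> spellings seen in this thread, RFC 1345 and the IANA registry.
ALIASES = {
    "JIS_X0201": ["X0201", "JIS X 0201", "JIS 0201", "JISX0201"],
    "JIS_C6226-1983": ["X0208", "JIS_X0208-1983", "iso-ir-87", "JIS 0208"],
    "JIS_X0212-1990": ["X0212", "iso-ir-159", "JIS 0212"],
}


def _key(name):
    """Normalise a name: case-insensitive, punctuation and spaces ignored."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


LOOKUP = {
    _key(spelling): canonical
    for canonical, aliases in ALIASES.items()
    for spelling in [canonical, *aliases]
}


def resolve(name):
    """Return the canonical name, or raise LookupError for unknown spellings."""
    try:
        return LOOKUP[_key(name)]
    except KeyError:
        raise LookupError("Unknown encoding %r" % name) from None


print(resolve("jis x 0201"), resolve("ISO-IR-87"), resolve("x0212"))
```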

- Anton

P.S. I have also seen JIS_X0201 referred to as

  JIS X 0201-1976
  JIS X 0201 Katakana/JIS X 0201 Roman
  JISX0201.

P.P.S. Currently JIS 0201/JIS 0208/JIS 0212 do not seem to work for me:

perl15173 -MEncode -MEncode::JP -we "print Encode::decode('JIS 0210','aaa')"

gives me Unknown encoding 'JIS 210' at -e line 1; only
Encode::encode('JIS0201','aaa') behaves okay.

And if they are not working, we're free to change the names to anything
we like ;-)


----- End forwarded message -----

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen



-- 
Best regards,
 Anton                            mailto:tagunov(_at_)motor(_dot_)ru

