perl-unicode

Re: [tagunov(_at_)motor(_dot_)ru: Better names for JIS X 0201/0208/0212? (was: ISO-8859-1 vs ISO 8859-1 (typo + UTF8 case too :)]

2002-03-19 12:52:15
Anton,

You may not know this, but JISX02\d\d does not determine an encoding at all. They are tables on which actual encodings are based. In other words, RAW JISX02\d\d are NEVER used as actual encodings. Here is how EUC-JP uses ALL of them.

EUC-JP consists of.... ("." means byte concatenation, "+" means addition)
ASCII:                  0x00-0x7f
JIS X 0201-1976:        0x8E . (X0201 Code beyond 0x80)
JIS X 0208-1990:        (X0208 Code) + 0x8080
JIS X 0212-1990:        0x8F . ((X0212 Code) + 0x8080)
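
The composition above can be sketched in a few lines of Python. This is only an illustration, not how Encode implements it: the function name euc_jp_from_jis and the table labels are made up for the example, and it assumes the usual EUC-JP layout in which the two bytes following 0x8F also have their MSB set.

```python
def euc_jp_from_jis(table: str, code: int) -> bytes:
    """Build the EUC-JP bytes for one character from its raw table code.

    `table` is a hypothetical label used only in this sketch.
    """
    if table == "ascii":
        # 0x00-0x7F passes through unchanged.
        return bytes([code])
    if table == "jisx0201-kana":
        # 0x8E, followed by the 8-bit kana byte (the part beyond 0x80).
        return bytes([0x8E, code])
    if table == "jisx0208":
        # Two 7-bit bytes, each with the MSB turned on.
        return (code + 0x8080).to_bytes(2, "big")
    if table == "jisx0212":
        # 0x8F, followed by two bytes with the MSB turned on.
        return b"\x8f" + (code + 0x8080).to_bytes(2, "big")
    raise ValueError(f"unknown table: {table}")


# Cross-check two cases against Python's built-in codec.
# HIRAGANA LETTER A is 0x2422 in JIS X 0208.
assert euc_jp_from_jis("jisx0208", 0x2422) == "\u3042".encode("euc_jp")
# HALFWIDTH KATAKANA LETTER A is 0xB1 in JIS X 0201.
assert euc_jp_from_jis("jisx0201-kana", 0xB1) == "\uff71".encode("euc_jp")
```

The point of the sketch is that one encoding draws on several tables at once, and the lead byte (or its absence) says which table a character came from.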

In other words, virtually all multibyte encodings are not only multibyte but also multi-table. Then how come there are Encode/jis02\d\d files? Good question. They are used by Encode::Tcl to implement 7-bit JIS. Encode/7bit-jis.enc looks like this.

# Encoding file: 7bit-jis, escape-driven
E
name            7bit-jis
init            {}
final           {}
ascii           \x1b(B
ascii           \x1b(J
7bit-kana       \x1b(I
jis0208         \x1b$B
jis0208         \x1b$@
jis0208         \x1b&@\x1b$B
jis0212         \x1b$(D
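
To make the escape-driven idea concrete, here is a rough Python sketch that splits a 7-bit JIS byte stream into (table, raw bytes) segments using the escape sequences listed in the file above. It is not the Encode::Tcl implementation; the names ESCAPES and split_segments are invented for the example, and it assumes the stream starts in ascii.

```python
# Escape sequence -> table name, copied from the 7bit-jis listing.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "ascii",
    b"\x1b(I": "7bit-kana",
    b"\x1b$B": "jis0208",
    b"\x1b$@": "jis0208",
    b"\x1b&@\x1b$B": "jis0208",
    b"\x1b$(D": "jis0212",
}

# Try the longest sequences first so the compound one is matched whole.
_ORDERED = sorted(ESCAPES, key=len, reverse=True)


def split_segments(data: bytes, initial: str = "ascii"):
    """Return a list of (table, payload) pairs; payload bytes are raw table codes."""
    segments = []
    current = initial
    buf = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            for esc in _ORDERED:
                if data.startswith(esc, i):
                    if buf:
                        # Close the segment that used the previous table.
                        segments.append((current, bytes(buf)))
                        buf.clear()
                    current = ESCAPES[esc]
                    i += len(esc)
                    break
            else:
                raise ValueError(f"unknown escape sequence at offset {i}")
            continue
        buf.append(data[i])
        i += 1
    if buf:
        segments.append((current, bytes(buf)))
    return segments


# b'$"' is 0x24 0x22, the raw JIS X 0208 code for HIRAGANA LETTER A.
print(split_segments(b'abc\x1b$B$"\x1b(Bxyz'))
```

Note that the same bytes 0x24 0x22 would mean '$"' in an ascii segment, so a decoder has to carry the current table as state.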

This is how ISO-2022 implements a given encoding: it switches the "current" table by escape sequence. Since escape sequences do the switching, the "raw" table codes are used directly, while EUC turns the most significant bit (MSB) on. As a transfer encoding, ISO-2022 is great because in theory it can swallow any number of subcodings. However, it is a pain in the neck to use as an internal encoding, because you can't tell which subcoding you are in just by looking at a given byte. In the case of EUC, you can tell whether a byte is single-byte or part of a multibyte character just by looking at its MSB. And Unicode was born out of an even greater ambition, making every character double-byte (but this failed, or was compromised if you don't like the word, when surrogate pairs were introduced).
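
As a small illustration of why EUC is easier to handle internally, the following Python sketch walks an EUC-JP byte string and finds character boundaries from the lead byte alone, with no state carried from earlier bytes. The function name euc_jp_char_lengths is made up for this example, and the sketch assumes well-formed input (it does no validation of trail bytes).

```python
def euc_jp_char_lengths(data: bytes):
    """Return the byte length of each character in a well-formed EUC-JP string."""
    lengths = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n = 1   # MSB off: single-byte ASCII
        elif lead == 0x8E:
            n = 2   # 0x8E, then one JIS X 0201 kana byte
        elif lead == 0x8F:
            n = 3   # 0x8F, then two JIS X 0212 bytes
        else:
            n = 2   # MSB on: two-byte JIS X 0208 character
        lengths.append(n)
        i += n
    return lengths


# 'a', a JIS X 0208 character, a half-width kana, a JIS X 0212 character.
print(euc_jp_char_lengths(b"a\xa4\xa2\x8e\xb1\x8f\xb0\xa1"))
```

With ISO-2022 no such byte-local test exists: a byte like 0x24 is either '$' or the first half of a two-byte character, depending on which escape sequence was seen last.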

Dan the Man with Too Many Codings

On Wednesday, March 20, 2002, at 04:28 , Jarkko Hietaniemi wrote:
More from Anton...

----- Forwarded message from Anton Tagunov <tagunov(_at_)motor(_dot_)ru> -----

Subject: Better names for JIS X 0201/0208/0212? (was: ISO-8859-1 vs ISO 8859-1 (typo + UTF8 case too :)
From: Anton Tagunov <tagunov(_at_)motor(_dot_)ru>
Date: Tue, 19 Mar 2002 20:32:26 +0300
Message-ID: <17271715972(_dot_)20020319203226(_at_)motor(_dot_)ru>
To: Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com>
Cc: nick(_at_)unfortu(_dot_)net, perl5-porters(_at_)perl(_dot_)org
In-reply-To: <20020319085528(_dot_)1397(_dot_)4(_at_)bactrian(_dot_)elixent(_dot_)com>

Hello, Nick! Hello, all!

I certainly think that the names like 'JIS 0201'
are embarrassing.

Here's RFC 1345

  &charset JIS_X0201
  &alias X0201
  ...8-bit, JIS-Roman (0x21-0x7E) + JIS-Katakana (0xA1-0xFE)

  &charset JIS_C6226-1983
  &alias iso-ir-87
  &bits 16
  &alias x0208
  &alias JIS_X0208-1983

  &charset JIS_X0212-1990
  &alias x0212
  &alias iso-ir-159
  &bits 16

Here's the IANA registry

  Name: JIS_X0201                                  [RFC1345,KXS2]
  MIBenum: 15
  Source: JIS X 0201-1976.   One byte only, this is equivalent to
          JIS/Roman (similar to ASCII) plus eight-bit half-width
          Katakana
  Alias: X0201

  Name: JIS_C6226-1983                             [RFC1345,KXS2]
  Alias: iso-ir-87
  Alias: x0208
  Alias: JIS_X0208-1983

  Name: JIS_X0212-1990                             [RFC1345,KXS2]
  MIBenum: 98
  Alias: x0212
  Alias: iso-ir-159


Are

  JIS_X0201      / X0201 /
  JIS_C6226-1983 / X0208 /
  JIS_X0212      / X0212

better candidates?

- Anton

P.S. I have also seen JIS_X0201 referred to as

  JIS X 0201-1976
  JIS X 0201 Katakana/JIS X 0201 Roman
  JISX0201.
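
One way to cope with this many spellings is to normalize a name before looking it up. The Python sketch below is purely hypothetical: the CANONICAL table, the normalize_name function, and the choice of canonical names (taken from the registry entries quoted above) are illustrations, not something Encode does.

```python
import re

# Hypothetical alias table: normalized key -> registry name quoted above.
CANONICAL = {
    "jisx0201": "JIS_X0201",
    "x0201": "JIS_X0201",
    "jisc6226": "JIS_C6226-1983",
    "jisx0208": "JIS_C6226-1983",
    "x0208": "JIS_C6226-1983",
    "isoir87": "JIS_C6226-1983",
    "jisx0212": "JIS_X0212-1990",
    "x0212": "JIS_X0212-1990",
    "isoir159": "JIS_X0212-1990",
}


def normalize_name(name: str):
    """Map a loosely spelled charset name to a canonical one, or None if unknown."""
    key = name.lower()
    # Drop a trailing four-digit year such as "-1976" or "-1990".
    key = re.sub(r"-(19|20)\d\d$", "", key)
    # Ignore spaces, underscores, hyphens and other punctuation.
    key = re.sub(r"[^a-z0-9]", "", key)
    return CANONICAL.get(key)


for spelling in ("JIS X 0201-1976", "JISX0201", "x0208", "iso-ir-159", "JIS 0201"):
    print(spelling, "->", normalize_name(spelling))
```

In this sketch 'JIS 0201' (without the X) deliberately stays unknown, since whether to accept it is exactly the question under discussion.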

P.P.S. Currently
JIS 0201/JIS 0208/JIS 0212 do not seem to work for me:

perl15173 -MEncode -MEncode::JP -we "print Encode::decode('JIS 0210','aaa')"

gives me "Unknown encoding 'JIS 210' at -e line 1"; only
Encode::encode('JIS0201','aaa') behaves okay.

And if they are not working, we're free to change the names to anything
we like ;-)


----- End forwarded message -----

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
