Anton,
You may not know this, but JISX02\d\d do not determine an encoding at
all. They are character tables on which actual encodings are based. In
other words, RAW JISX02\d\d are NEVER used as encodings on their own.
Here is how EUC-JP uses ALL of them.
EUC-JP consists of:
ASCII: 0x00-0x7F
JIS X 0201-1976: 0x8E . (X0201 code at or above 0x80, i.e. the katakana half)
JIS X 0208-1990: (X0208 code) + 0x8080
JIS X 0212-1990: 0x8F . ((X0212 code) + 0x8080)
In other words, virtually all multibyte encodings are not only
multibyte, but also multitabled.
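
The EUC-JP layout above can be sketched in a few lines of Python. This is only an illustration, not how Encode does it: the helper names are made up, and Python's built-in euc_jp codec is assumed as the reference to check against. Note that the two bytes after 0x8F also carry the MSB.

```python
def euc_from_x0201_kana(code: int) -> bytes:
    """Half-width katakana (X0201 code 0xA1-0xDF): prefix with SS2 (0x8E)."""
    return bytes([0x8E, code])

def euc_from_x0208(code: int) -> bytes:
    """Two-byte X0208 code: set the MSB of both bytes, i.e. add 0x8080."""
    return (code + 0x8080).to_bytes(2, "big")

def euc_from_x0212(code: int) -> bytes:
    """Two-byte X0212 code: prefix with SS3 (0x8F), then add 0x8080."""
    return b"\x8f" + (code + 0x8080).to_bytes(2, "big")

# JIS X 0208 0x2422 is HIRAGANA LETTER A (U+3042);
# JIS X 0201 0xB1 is the half-width KATAKANA LETTER A (U+FF71).
assert euc_from_x0208(0x2422) == "\u3042".encode("euc_jp")
assert euc_from_x0201_kana(0xB1) == "\uff71".encode("euc_jp")

# Print the EUC-JP bytes built from one X0208 code and one X0212 code.
print(euc_from_x0208(0x2422).hex(), euc_from_x0212(0x3021).hex())
```

The same table code therefore shows up unchanged inside every encoding built on it; only the wrapping differs.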
Then how come there are Encode/jis02\d\d? Good question. They are
used by Encode::Tcl to implement 7-bit JIS. Encode/7bit-jis.enc looks
like this:
# Encoding file: 7bit-jis, escape-driven
E
name 7bit-jis
init {}
final {}
ascii \x1b(B
ascii \x1b(J
7bit-kana \x1b(I
jis0208 \x1b$B
jis0208 \x1b$@
jis0208 \x1b&@\x1b$B
jis0212 \x1b$(D
This is how ISO-2022 implements a given encoding: it switches the
"current" table with an escape sequence. Because escape sequences do
the switching, the "raw" table codes are used as they are, whereas EUC
turns the most significant bit (MSB) on.
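
As a rough sketch of that escape-driven switching, the Python below splits a 7-bit JIS byte stream into runs, one per table. The escape map is copied from the 7bit-jis.enc listing; split_by_table is a hypothetical helper and not what Encode::Tcl actually runs, and Python's iso2022_jp codec is assumed only to produce sample input.

```python
# Escape sequences copied from the 7bit-jis.enc listing.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "ascii",
    b"\x1b(I": "7bit-kana",
    b"\x1b$B": "jis0208",
    b"\x1b$@": "jis0208",
    b"\x1b&@\x1b$B": "jis0208",
    b"\x1b$(D": "jis0212",
}

def split_by_table(data: bytes, initial: str = "ascii"):
    """Return (table name, raw bytes) runs; hypothetical helper for illustration."""
    runs, current, buf, i = [], initial, bytearray(), 0
    # Try the longest escapes first so a long sequence is matched as a whole.
    ordered = sorted(ESCAPES, key=len, reverse=True)
    while i < len(data):
        for esc in ordered:
            if data.startswith(esc, i):
                if buf:  # close the run that belonged to the previous table
                    runs.append((current, bytes(buf)))
                    buf = bytearray()
                current = ESCAPES[esc]
                i += len(esc)
                break
        else:
            buf.append(data[i])  # ordinary byte in the current table
            i += 1
    if buf:
        runs.append((current, bytes(buf)))
    return runs

sample = "a\u3042b".encode("iso2022_jp")  # 'a', HIRAGANA LETTER A, 'b'
print(split_by_table(sample))
```

The jis0208 run holds the bytes 0x24 0x22, which is the raw table code 0x2422 with no MSB set.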
As a transfer encoding, ISO-2022 is great because in theory it can
swallow any number of subcodings. However, it is a pain in the neck to
use as an internal encoding because you can't tell which subcoding you
are in just by looking at a given byte. With EUC you can tell whether a
byte is single-byte or part of a multibyte character just by looking at
its MSB. Unicode was born out of an even greater ambition: making every
character double-byte (but this failed, or was compromised if you don't
like the word, when surrogate pairs were introduced).
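
A small Python sketch of the difference, again leaning on Python's own euc_jp, iso2022_jp and utf-16-be codecs as an assumption; euc_char_starts is a made-up helper name.

```python
def euc_char_starts(data: bytes):
    """Offsets where characters start in EUC-JP, decided by each lead byte."""
    starts, i = [], 0
    while i < len(data):
        starts.append(i)
        b = data[i]
        if b < 0x80:
            i += 1  # MSB off: single-byte ASCII
        elif b == 0x8F:
            i += 3  # SS3 plus two bytes (JIS X 0212)
        else:
            i += 2  # SS2 plus a kana byte, or two JIS X 0208 bytes
    return starts

text = "$\u3042"                  # '$' followed by HIRAGANA LETTER A
euc = text.encode("euc_jp")       # bytes 24 a4 a2
jis = text.encode("iso2022_jp")   # bytes 24 1b 24 42 24 22 1b 28 42

print(euc_char_starts(euc))
# In the ISO-2022 form the byte 0x24 appears as '$', inside an escape
# sequence, and as the first half of a JIS X 0208 code; only the shift
# state tells them apart.
print(jis.count(b"$"))
# A character outside the BMP needs two 16-bit units (a surrogate pair).
print(len("\U00020000".encode("utf-16-be")) // 2)
```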
Dan the Man with Too Many Codings
On Wednesday, March 20, 2002, at 04:28 , Jarkko Hietaniemi wrote:
More from Anton...
----- Forwarded message from Anton Tagunov <tagunov(_at_)motor(_dot_)ru> -----
Subject: Better names for JIS X 0201/0208/0212? (was: ISO-8859-1 vs
ISO 8859-1 (typo + UTF8 case too :)
From: Anton Tagunov <tagunov(_at_)motor(_dot_)ru>
Date: Tue, 19 Mar 2002 20:32:26 +0300
Message-ID: <17271715972(_dot_)20020319203226(_at_)motor(_dot_)ru>
To: Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com>
Cc: nick(_at_)unfortu(_dot_)net, perl5-porters(_at_)perl(_dot_)org
In-reply-To:
<20020319085528(_dot_)1397(_dot_)4(_at_)bactrian(_dot_)elixent(_dot_)com>
Hello, Nick! Hello, all!
I certainly think that the names like 'JIS 0201'
are embarrassing.
Here's rfc1345
&charset JIS_X0201
&alias X0201
...8-bit, JIS-Roman (0x21-0x7E) + JIS-Katakana (0xA1-0xFE)
&charset JIS_C6226-1983
&alias iso-ir-87
&bits 16
&alias x0208
&alias JIS_X0208-1983
&charset JIS_X0212-1990
&alias x0212
&alias iso-ir-159
&bits 16
here's IANA registry
Name: JIS_X0201 [RFC1345,KXS2]
MIBenum: 15
Source: JIS X 0201-1976. One byte only, this is equivalent to
JIS/Roman (similar to ASCII) plus eight-bit half-width
Katakana
Alias: X0201
Name: JIS_C6226-1983 [RFC1345,KXS2]
Alias: iso-ir-87
Alias: x0208
Alias: JIS_X0208-1983
Name: JIS_X0212-1990 [RFC1345,KXS2]
MIBenum: 98
Alias: x0212
Alias: iso-ir-159
Are
JIS_X0201 / X0201 /
JIS_C6226-1983 / X0208 /
JIS_X0212 / X0212
better candidates?
- Anton
P.S. I have also seen JIS_X0201 referred to as
JIS X 0201-1976
JIS X 0201 Katakana/JIS X 0201 Roman
JISX0201.
P.P.S. (currently
JIS 0201/JIS 0208/JIS 0212 do not seem to work for me:
perl15173 -MEncode -MEncode::JP -we "print Encode::decode('JIS 0210','aaa')"
gives me Unknown encoding 'JIS 210' at -e line 1, only
Encode::encode('JIS0201','aaa') behaves okay..
and if they are not working, we're free to change the names for anything
we like ;-)
----- End forwarded message -----
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen