perl-unicode

Re: Jcode->new(q(~Greetings!~), 'utf8')->sjis eq '~Greetings~' ?

1999-07-17 16:31:34
On Sun, 18 Jul 1999, Dan Kogai wrote:

Hello folks,

  Hi.  My name is Dan Kogai.  I recently uploaded the Jcode Perl module to 
CPAN.  This module is designed as a successor to jcode.pl (if you are a 
Perl coder in Japan, you gotta know that script).  You can find out more 
about it at http://openlab.ring.gr.jp/Jcode/ .  One of the major 
enhancements of Jcode.pm over jcode.pl is the ability to handle Unicode 
(UCS2 and UTF8, so far).
  Now here is the question.  My first implementation of UCS2 <-> EUC-JP 
conversion was very simple: just faithfully obey the rules that Unicode Inc. 
lays down.  It seemed okay until one of my friends gave me the following 
complaint.

    Jcode->new('~k16', 'utf8')->sjis doesn't return '~k16' !!

  And here is why.

* Jcode->new stores the string in EUC-JP (That's the only code perl can 
swallow in the script.

I don't understand this. Yes, you can't inline it in *displayable* form in
the Perl code, but you can write it in \x notation: 

# Index = JIS X 0201 byte value, entry = UCS2 as a big-endian byte pair.
# A bare \x escape reads at most two hex digits, so each code unit is
# written as two escapes ("\x00\x20"); "\x0020" would be chr(0) . "20".
@JIS0201ToUCS2Map=(
    "","","","","","","","",
    "","","","","","","","",
    "","","","","","","","",
    "","","","","","","","",
    "\x00\x20","\x00\x21","\x00\x22","\x00\x23","\x00\x24","\x00\x25","\x00\x26","\x00\x27",
    "\x00\x28","\x00\x29","\x00\x2A","\x00\x2B","\x00\x2C","\x00\x2D","\x00\x2E","\x00\x2F",
    "\x00\x30","\x00\x31","\x00\x32","\x00\x33","\x00\x34","\x00\x35","\x00\x36","\x00\x37",
    "\x00\x38","\x00\x39","\x00\x3A","\x00\x3B","\x00\x3C","\x00\x3D","\x00\x3E","\x00\x3F",
    "\x00\x40","\x00\x41","\x00\x42","\x00\x43","\x00\x44","\x00\x45","\x00\x46","\x00\x47",
    "\x00\x48","\x00\x49","\x00\x4A","\x00\x4B","\x00\x4C","\x00\x4D","\x00\x4E","\x00\x4F",
    "\x00\x50","\x00\x51","\x00\x52","\x00\x53","\x00\x54","\x00\x55","\x00\x56","\x00\x57",
    "\x00\x58","\x00\x59","\x00\x5A","\x00\x5B","\x00\xA5","\x00\x5D","\x00\x5E","\x00\x5F",
    "\x00\x60","\x00\x61","\x00\x62","\x00\x63","\x00\x64","\x00\x65","\x00\x66","\x00\x67",
    "\x00\x68","\x00\x69","\x00\x6A","\x00\x6B","\x00\x6C","\x00\x6D","\x00\x6E","\x00\x6F",
    "\x00\x70","\x00\x71","\x00\x72","\x00\x73","\x00\x74","\x00\x75","\x00\x76","\x00\x77",
    "\x00\x78","\x00\x79","\x00\x7A","\x00\x7B","\x00\x7C","\x00\x7D","\x20\x3E","",
    "","","","","","","","",
    "","","","","","","","",
    "","","","","","","","",
    "","","","","","","","",
    "","\xFF\x61","\xFF\x62","\xFF\x63","\xFF\x64","\xFF\x65","\xFF\x66","\xFF\x67",
    "\xFF\x68","\xFF\x69","\xFF\x6A","\xFF\x6B","\xFF\x6C","\xFF\x6D","\xFF\x6E","\xFF\x6F",
    "\xFF\x70","\xFF\x71","\xFF\x72","\xFF\x73","\xFF\x74","\xFF\x75","\xFF\x76","\xFF\x77",
    "\xFF\x78","\xFF\x79","\xFF\x7A","\xFF\x7B","\xFF\x7C","\xFF\x7D","\xFF\x7E","\xFF\x7F",
    "\xFF\x80","\xFF\x81","\xFF\x82","\xFF\x83","\xFF\x84","\xFF\x85","\xFF\x86","\xFF\x87",
    "\xFF\x88","\xFF\x89","\xFF\x8A","\xFF\x8B","\xFF\x8C","\xFF\x8D","\xFF\x8E","\xFF\x8F",
    "\xFF\x90","\xFF\x91","\xFF\x92","\xFF\x93","\xFF\x94","\xFF\x95","\xFF\x96","\xFF\x97",
    "\xFF\x98","\xFF\x99","\xFF\x9A","\xFF\x9B","\xFF\x9C","\xFF\x9D","\xFF\x9E","\xFF\x9F",
    "","","","","","","","",
    "","","","","","","","",
    "","","","","","","","",
    "","","","","","","",""
);

(I will manfully refrain from posting my 268K hash based JIS208<->UCS2
table ;-)  ) 

There are more compact ways to do this, and perhaps external maps
(loaded via Storable?) would even be better.
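
One of those more compact ways might look like the sketch below. It builds the JIS X 0201 table from its three regular ranges instead of listing every entry, and stores code points as integers rather than strings. The name build_jis0201_map is made up for illustration; nothing here comes from Jcode itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a 256-entry list mapping a JIS X 0201 byte
# value to a Unicode code point (undef where the table has no mapping).
sub build_jis0201_map {
    my @map = (undef) x 256;
    $map[$_] = $_ for 0x20 .. 0x7D;                     # same as ASCII ...
    $map[0x5C] = 0x00A5;                                # ... except YEN SIGN
    $map[0x7E] = 0x203E;                                # ... and OVERLINE
    $map[$_] = 0xFF61 + ($_ - 0xA1) for 0xA1 .. 0xDF;   # halfwidth katakana
    return @map;
}

my @jis0201_to_ucs2 = build_jis0201_map();
printf "0x%02X -> U+%04X\n", $_, $jis0201_to_ucs2[$_] for 0x5C, 0x7E, 0xA1, 0xDF;
```

The trade-off is that the exceptions (0x5C and 0x7E) are easy to see at a glance, at the cost of no longer being a literal copy of the published table.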

 jperl can swallow SHIFT_JIS as well but that's 
another issue)  Conversion is performed if necessary.
* Here we have explicitly stated that '~k16' is a UTF8 string so Jcode->new 
tries to convert it.
* UTF8 is first converted to UCS2 then EUC.
* Since UTF8 leaves ASCII as it is, '~k16' is just '~k16'.  In UCS2, that's 
"\x00~\x00k\x001\x006".  So far so good.

To reduce confusion, let's write that:

 \x007e\x006b\x0031\x0036

in UCS2. 

;)
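
For anyone who wants to check that UTF8 -> UCS2 step, here is a small sketch using the Encode module bundled with later Perls (5.8 and up, so not something that existed when this thread was written). It decodes the UTF8 bytes and re-encodes them as big-endian UCS2. The helper name utf8_to_ucs2_hex is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 bytes in, hex dump of the UCS-2BE bytes out.
sub utf8_to_ucs2_hex {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters
    my $ucs2  = encode('UCS-2BE', $chars);      # characters -> 2 bytes each
    return unpack('H*', $ucs2);
}

print utf8_to_ucs2_hex('~k16'), "\n";   # prints 007e006b00310036
```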

* AND HERE IS THE PROBLEM.  The conversion table

   ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0201.TXT

states that 2 character codes in the ASCII area [\x00-\x7f] are mapped to 
oblivion.  They are '\' (chr(0x5c)) and '~' (chr(0x7e)).  In this case, '~' 
is mapped to "\x8f\xa2\xb7" in EUC-JP.

In JIS201, yes. Along with a bunch of other characters. JIS201 doesn't
cover much. What does this have to do with SJIS, EUC or UTF8? I'm missing
something here. 
 
* However.  The "\x8f\xa2\xb7" belongs to JIS0212, which is UNSUPPORTED in 
SHIFT_JIS, even though SHIFT_JIS (the most widely-used Japanese Charset so 
far) is SUPPOSED TO BE compatible with ASCII.

UCS2   -> SJIS
\x007e -> \x?
\x006b -> \x6b
\x0031 -> \x31
\x0036 -> \x36

Hmmm...Ok. I think I see.

The Shift-JIS to Unicode 1.1 table maps \x7e to \x203E. So the problem
is that you are starting in UTF8 with

\x7e

mapping to UCS2

\x7e -> \x007e

and then trying to map UCS2->SJIS - which doesn't map \x007e to
*ANYTHING*. Reading over
'ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT', I would
say that someone mis-read '~' as 'overline' rather than 'tilde' on their
screen when they made it back in 94. I would just change your mapping to
use

\x7e -> \x007e

instead of

\x7e -> \x203e

It looks like an error in
<URL:ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT>

* So you end up with "[unknown]k16", instead of "~k16".

  So I tweaked the code a little bit.  After ver. 0.40,  Jcode.pm leaves 
ASCII as it is unless $Jcode::Unicode::PEDANTIC is set to non-zero.
  My question is: will the Unicode-savvy Perl behave pedantically or not?

I don't think Unicode-savvy Perl has an impact either way. Your new module
is about *conversions* between Unicode and JP-encodings. If you write your
code in UTF8 (I *assume* that is the direction Perl Unicode is heading) -
you would never tell the difference unless you wrote your 'Unicode->SJIS'
converter to care. Anyhow, I'm not sure that it counts as pedantic. Has
the Unicode Consortium 'officially' blessed *any* CJK <-> Unicode
conversions? Their tables (1994 1.1 I note) I think are simply their 'best
guesses' on how the two map together. 
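
To make the pedantic versus non-pedantic choice concrete, here is a rough sketch of a single-byte Unicode->SJIS step. This is not Jcode's actual code: the function name, the $pedantic argument, and the use of '?' for unmappable characters are all made up for illustration, and only the JIS X 0201 range is handled.

```perl
use strict;
use warnings;

# Hypothetical sketch: map one Unicode code point to a single Shift_JIS byte.
# Returns undef when there is no single-byte mapping.
sub ucs2_to_sjis_byte {
    my ($cp, $pedantic) = @_;
    if ($pedantic) {
        # Follow the JIS X 0201 table to the letter.
        return 0x5C if $cp == 0x00A5;                 # YEN SIGN
        return 0x7E if $cp == 0x203E;                 # OVERLINE
        return undef if $cp == 0x5C || $cp == 0x7E;   # no slot left for \ and ~
    }
    return $cp if $cp >= 0x20 && $cp <= 0x7E;         # leave ASCII as it is
    return $cp - 0xFF61 + 0xA1
        if $cp >= 0xFF61 && $cp <= 0xFF9F;            # halfwidth katakana
    return undef;
}

for my $pedantic (0, 1) {
    my $out = join '', map {
        my $byte = ucs2_to_sjis_byte(ord($_), $pedantic);
        defined $byte ? chr($byte) : '?';
    } split //, '~k16';
    print "pedantic=$pedantic: $out\n";
}
# prints "pedantic=0: ~k16" and then "pedantic=1: ?k16"
```

The difference only shows up for the two code points where the table and ASCII disagree; everything else converts the same way in both modes.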

  
If so, all the tildes, used so often in Perl, will be nothing but line 
noise on most platforms used in Japan.

That doesn't have anything to do with Perl - that has to do with what
*converters* do. If a program is written in UTF8 and displayed in a
UTF8-aware program - tildes will be fine. You only get into trouble if you
write in SJIS, save in UTF8 and your SJIS->UTF8 converter converts \x7e to
the UTF8 point matching UCS2 \x203e since it can't be reversed (which is
part of why I think it is a bug in
<URL:ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT>). 

Anyhow, assuming that Microsoft cp932 (Microsoft's 'official' SJIS codeset
aka MS Kanji) 
<URL:ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT>
is definitive as to what Microsoft thinks about \x7e, then
<URL:ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT> is
just plain in error: \x7e SJIS should map to \x007e UCS2.
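
The cp932 side of that claim can be checked with a short sketch, assuming a Perl new enough to bundle the Encode module and assuming its 'cp932' encoding follows Microsoft's CP932.TXT. The helper name is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a single cp932 byte and return its code point.
sub cp932_byte_to_codepoint {
    my ($byte) = @_;
    return ord decode('cp932', chr $byte);
}

printf "cp932 0x7E -> U+%04X\n", cp932_byte_to_codepoint(0x7E);
```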

It looks like cp936 might be the expanded codeset they introduced to
support jis212-1990.... Does anyone know for sure? 

Have you looked at the book 'CJKV Information Processing' by Ken Lunde,
published by O'Reilly (ISBN 1-56592-224-7)? It includes Perl code
for the various encoding interconversions (and tons of information in
general and specific). 

-- 
Benjamin Franz

"Perl is the programmer's swiss army tactical nuke - people using it 
are expected to know what they are doing and that is why Perl has 
power."               -- Tuomas Lukka

PGP: 1024/77579DB1 FP=67 B0 18 4A 28 70 BE 78  DE 5A 28 1E 10 A5 8E 31 
