Re: Parsing JIS X 0208 & Shift JIS with 5.8.0 +++++Success

I'm cross-posting this to the perl-unicode list because the pods say they might be interested in my dopey luser feedback (well, not in those words they don't :-).

The process by which I arrived at the solution might seem painful to some, but I'm listing it here in case anyone else is facing, or will face, the same problem, which is:

On Tuesday, October 1, 2002, at 11:50 PM, Robin wrote:

*parse a collection of ASCII docs mixed in with docs in iso-2022-jp, shiftjis and possibly 7bit-jis (by which I mean each doc could be in 1 of the three encodings, not 1 doc in a mixture of all three).
*parse for tokens (Kanji characters, i.e. neither Hiragana nor Katakana)
*do regex substitutions accordingly
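
For the token step, here is a minimal sketch of picking kanji out of text, assuming the text has already been decoded into Perl characters (the sample string is hypothetical and written with code point escapes so the script's own encoding doesn't matter). It relies on the Unicode script property \p{Han}, which matches kanji but not Hiragana or Katakana:

```perl
use strict;
use warnings;

# Hypothetical sample: "2002<toshi>10<tsuki>1<nichi>" followed by one
# Hiragana (U+306E) and one Katakana (U+30AB) character.
my $text = "2002\x{5e74}10\x{6708}1\x{65e5}\x{306e}\x{30ab}";

# \p{Han} matches characters of the Han script only, so the kana and
# the ASCII digits are skipped.
my @kanji = $text =~ /(\p{Han})/g;

printf "U+%04X\n", ord($_) for @kanji;   # U+5E74, U+6708, U+65E5
```

The same property can be used inside s/// once the text is decoded.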

On Wednesday, October 2, 2002, at 02:14 PM, Joel Rees wrote:

You probably want these:
toshi (year) is 0x472f (JIS) and 0x944e (shift).
tsuki (month) is 0x376e (JIS) and 0x8c8e (shift).
nichi (day) is 0x467c (JIS) and 0x93fa (shift).
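
As a cross-check on the Shift JIS values above, this sketch decodes each byte pair with the Encode module and prints the Unicode code point it maps to. The hash names are only for illustration, and 'shiftjis' is the Encode name for the encoding:

```perl
use strict;
use warnings;
use Encode qw/decode/;

# Shift JIS byte pairs quoted above, written as escapes.
my %sjis = (
    toshi => "\x94\x4e",
    tsuki => "\x8c\x8e",
    nichi => "\x93\xfa",
);

my %codepoint;
for my $name (sort keys %sjis) {
    my $char = decode('shiftjis', $sjis{$name});   # bytes -> one character
    $codepoint{$name} = ord($char);
    printf "%-5s U+%04X\n", $name, $codepoint{$name};
}
# nichi U+65E5, toshi U+5E74, tsuki U+6708
```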

Thanks Joel (for all your input), that is exactly what I needed: the character codes of the kanji I'm testing for.

===from the perluniintro pod===

See "Further Resources" for how to find all these numeric codes.

===from the perluniintro pod===

The "Further Resources" section mentioned in the pod lists the vast Unicode website. Once there I can't find any kanji-related documents, which is logical given the characters' Chinese origins but doesn't facilitate my search, as I have no idea which Chinese characters belong together as a group. After wading through various PDFs, the only listing I can find that features any (one, actually) of the kanji I intend to use as tokens is in U3200.pdf, at character codes 32C1 - 32CB (namely the characters [NUMBER + tsuki]). All I want from this group is the character code for tsuki. Of course it's on the site somewhere, just not where I expect it to be (i.e. in a section called Kanji, but that's my problem, not the Unicode Consortium's).

Time for hubris and laziness: I know which kanji I want to test for, so why not get these codes programmatically using ord(), the way I would with ASCII:

===from the perluniintro pod===
At run-time you can use "chr()":

my $hebrew_alef = chr(0x05d0);

Naturally, "ord()" will do the reverse: it turns a character into a code
point.
===from the perluniintro pod===

use Encode::JP;

print ord('月'); #tsuki

which outputs: 140 (aka \x8C)

OK, obviously ord() is treating the string as plain bytes and returning the value of the first byte of a multibyte character (0x8C is the lead byte of the Shift JIS pair 0x8C 0x8E that Joel listed for tsuki), ergo I'm missing something vital about how encoding works.
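
The 140 can be reproduced without any Japanese in the source file. A small sketch, assuming the Shift JIS bytes for tsuki are written as escapes (the variable names are mine):

```perl
use strict;
use warnings;

my $tsuki_bytes = "\x8c\x8e";             # Shift JIS bytes for tsuki, undecoded

my @bytes = unpack 'C*', $tsuki_bytes;    # every byte as a number

print length($tsuki_bytes), "\n";         # 2: perl sees two bytes, not one character
print ord($tsuki_bytes), "\n";            # 140: ord() looks at the first one only
printf "%02X %02X\n", @bytes;             # 8C 8E
```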

A while back, when I was researching how to approach dealing with Japanese text, Sadahiro Tomoyuki (owner of http://homepage1.nifty.com/nomenclator/perl/indexE.htm) (arigatou gozaimasu, Sadahiro-san) kindly wrote and told me how it was effectively done in the past by Japanese perlers:

(1) conversion of input in Shift_JIS to EUC_JP
(2) processing (in EUC_JP)
(3) conversion of the result (in EUC_JP) to Shift_JIS
(4) output

The same method is used with perl 5.8.0, except that Unicode (UTF-8) is perl's internal processing form instead of EUC_JP.
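
The four steps, done the 5.8.0 way, might look roughly like this. It is only a sketch: the input is a hypothetical Shift JIS byte string, and the substitution (wrapping each kanji in square brackets) is a stand-in for whatever processing is actually wanted:

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

my $input = "10\x8c\x8e";                     # (1) Shift JIS bytes: "10" + tsuki

my $text = decode('shiftjis', $input);        #     bytes -> Perl characters

(my $edited = $text) =~ s/(\p{Han})/[$1]/g;   # (2) process as characters

my $output = encode('shiftjis', $edited);     # (3) characters -> Shift JIS bytes

print unpack('H*', $output), "\n";            # (4) output, shown as hex: 31305b8c8e5d
```

Printing the hex form rather than the raw bytes keeps the result readable on any terminal.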

Dan Kogai wrote:

use Encode qw/encode decode/;
#...
my $utf8 = decode('shift-jis', $string);

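use strict;
use diagnostics -verbose;               # the switch is a separate argument
use Encode qw/encode decode/;

my $data = '月';                        # tsuki, as Shift JIS bytes (script saved as Shift JIS)
my $utf8 = decode('shift-jis', $data);  # now a one-character Perl string
print ord($utf8);                       # code point of that character
print chr($utf8);                       # slip: chr() wants a number, not the string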

yielded:
26376
Argument "\x{6708}" isn't numeric in chr at /Users/robin/Desktop/test jp.pl line 13 (#1)
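
A minimal sketch of the full round trip, with chr() given the number from ord() rather than the string. The Shift JIS bytes are written as escapes, so it doesn't depend on how the script itself is saved; the variable names are mine:

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

my $data = "\x8c\x8e";                  # Shift JIS bytes for tsuki
my $char = decode('shiftjis', $data);   # one Perl character

my $cp    = ord($char);                 # character -> code point
my $again = chr($cp);                   # code point -> character

printf "%d U+%04X\n", $cp, $cp;         # 26376 U+6708
print $again eq $char ? "same character\n" : "different\n";

my $back = encode('shiftjis', $again);  # back to the original two bytes
print unpack('H*', $back), "\n";        # 8c8e
```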

OK, so it was a partial success (26376 is 0x6708, the Unicode code point for tsuki), but I grasp what I'm doing now. Thanks to everyone who took the time to reply.