perl-unicode
Re: Parsing JIS X 0208 & Shift JIS with 5.8.0 +++++Success
2002-10-02 11:30:04
I'm cross-posting this to the perl-unicode list because the pods say they might be interested in my dopey luser feedback (well, actually, not in those words they don't :-).
The process by which I arrived at the solution might seem painful to some, but I'm listing it here in case anyone else is or will be facing the same problem, which is:

On Tuesday, October 1, 2002, at 11:50 PM, Robin wrote:
*parse a collection of ASCII docs mixed in with docs in iso-2022-jp, shiftjis and possibly 7bit-jis (by which I mean each doc could be in 1 of three encodings, not 1 doc in a mixture of all three).

On Wednesday, October 2, 2002, at 02:14 PM, Joel Rees wrote:
You probably want these:

Thanks Joel (for all your input), that is exactly what I needed: the character codes of the kanji I'm testing for.

===from the perluniintro pod===
See "Further Resources" for how to find all these numeric codes.
===from the perluniintro pod===

The "Further Resources" section mentioned in the pod lists the vast Unicode website. Once there I can't find kanji-related documents, which is logical given their Chinese origins but doesn't facilitate my search, as I have no idea which Chinese characters belong together as a group. After wading through various PDFs, the only listing I can find which features any (one, actually) of the kanji I'm intending to use as a token is U3200.pdf, which has character codes 32C1 - 32CB (namely the characters [NUMBER + tsuki]). All I want from this group is the character code for tsuki. Of course it's on the site somewhere, but not where I expect it to be (i.e. in a section called Kanji, but that's my problem, not the Unicode Consortium's).

Time for hubris and laziness: I know which kanji I want to test for, so why not get these codes programmatically using ord(), the way I would with ASCII?

===from the perluniintro pod===
At run-time you can use "chr()":

    my $hebrew_alef = chr(0x05d0);

Naturally, "ord()" will do the reverse: it turns a character into a code point.
===from the perluniintro pod===

    use Encode::JP;
    print ord('月'); #tsuki

which outputs:

    140 (aka \x8C)

OK, obviously ord() is treating the string as raw bytes and returning the value of the first byte of a multi-byte character, ergo I'm missing something vital about how encoding works.

A while back, when I was researching how to approach dealing with Japanese text, Sadahiro Tomoyuki (owner of http://homepage1.nifty.com/nomenclator/perl/indexE.htm) (arigatou gozaimasu, Sadahiro-san) kindly wrote and told me how it was effectively done in the past by Japanese perlers:

(1) conversion of input in Shift_JIS to EUC_JP

The same method is used by perl 5.8.0, except that Unicode (UTF-8) is used as the internal perl processing form instead of EUC_JP.

Dan Kogai wrote:
use Encode qw/encode decode/;

    use strict;
    use diagnostics -verbose;
    use Encode qw/encode decode/;
    my ($data)='月';
    my $utf8 = decode('shift-jis', $data);
    print ord($utf8);
    print chr($utf8);

yielded:

    26376
    Argument "\x{6708}" isn't numeric in chr at /Users/robin/Desktop/test jp.pl line 13 (#1)

OK, so it was a partial success, but I grasp what I'm doing now. Thanks to everyone that took the time to reply.
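For anyone following along, here is a minimal sketch of the decode-then-ord() approach described above. It is not the exact script from this thread: the variable names are made up for illustration, and the kanji is written as the byte escapes "\x8C\x8E" on the assumption that those are the Shift_JIS bytes for tsuki (the first byte matches the 140 reported above). The one change in logic is that chr() is given the numeric code point rather than the decoded string, which is what seems to have triggered the "isn't numeric" warning.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed Shift_JIS bytes for the kanji tsuki, written as escapes so the
# script does not depend on the encoding of the source file itself.
my $sjis_bytes = "\x8C\x8E";

# On the raw bytes, ord() only sees the first byte.
my $first_byte = ord($sjis_bytes);            # 140 (0x8C)

# Decode the bytes into a Perl character string first.
my $chars = decode('shiftjis', $sjis_bytes);

# Now ord() returns the Unicode code point of the character.
my $code_point = ord($chars);                 # 26376 (0x6708)

# chr() wants a number, so pass it the code point, not the string.
my $round_trip = chr($code_point);

# Encode back to Shift_JIS bytes before writing to a Shift_JIS file.
my $back = encode('shiftjis', $round_trip);

printf "first byte: %d (0x%X)\n", $first_byte, $first_byte;
printf "code point: %d (U+%04X)\n", $code_point, $code_point;
printf "round trip ok: %s\n", ($round_trip eq $chars ? 'yes' : 'no');
printf "bytes restored: %s\n", ($back eq $sjis_bytes ? 'yes' : 'no');
```

Only numbers and yes/no flags are printed, so no output layer is needed on STDOUT; printing the decoded character itself would need an explicit encode() or a binmode() call first.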