perl-unicode

Re: Japanese text search problem

2001-08-07 07:32:57
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
 
   Japanese is notorious for the number of character encodings in use: JIS,
Shift JIS, EUC, and now Unicode.  JIS (ISO-2022-JP, to be more exact) is the
de facto standard for e-mail.  Shift JIS is the de facto standard for Win/Mac
files.  EUC is the de facto standard for Unixen.  Unicode is the de facto
standard for internal representation but is not so popular as a data exchange
format.  When you handle Japanese strings, you must not assume incoming data
is in the character set you are using.
  The easiest solution is as follows:

Why should Unicode be the "de facto standard for internal
representation"? Internal to whom, or to what? Within Perl that may
be the case, but as a general statement I cannot agree. Anyway, I
would like to hear your reasoning.
E.g. if I were going to write a database for one of the bigger Kanwa-Jiten
(Chinese/Japanese Character Dictionary), I would rather
use TAD (the TRON encoding) than its competitor Unicode.
For much other stuff I am quite happy with euc-jp.
  
* Use perl 5.6.0 or above
Or, if you can't use 5.6.0, learn the basics of Japanese information
processing with byte-oriented Perl.
A good starting point is Ken Lunde's PDF files at:
  http://examples.oreilly.com/cjkvinfo/perl/
but if you want to get serious about Japanese information processing
with byte-oriented Perl you should get the whole book:
   http://www.oreilly.com/catalog/cjkvinfo/

* convert any string to utf8 using Jcode or other modules
* convert to other character set when you need to output

Maybe I am old-fashioned, but I still use euc-jp or sjis for
most of the processing and output I do. And I am quite happy with
them.
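
The quoted recipe (convert everything to one internal form on input, convert
again on output) can be sketched roughly as below. This is only a minimal
sketch under assumptions: it uses the Encode module, which ships with Perl 5.8
and later, rather than Jcode or the 5.6.0 mentioned above, and the sample
bytes and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input: the EUC-JP bytes for KATAKANA LETTER I (A5 A4)
# followed by HIRAGANA LETTER A (A4 A2). In real code these bytes would
# come from a file, a socket, or a form, and the encoding name would
# have to be known or guessed.
my $incoming_euc = "\xA5\xA4\xA4\xA2";

# Step 1: decode the incoming bytes into a Perl character string.
my $text = decode('euc-jp', $incoming_euc);

# Step 2: do all processing on characters, not bytes.
my $char_count = length $text;    # 2 characters, although the input had 4 bytes

# Step 3: encode only at the point of output, per destination.
my $for_mail    = encode('iso-2022-jp', $text);   # JIS, e.g. for e-mail
my $for_windows = encode('shiftjis',    $text);   # Shift JIS, e.g. for Win/Mac files
my $for_utf8    = encode('UTF-8',       $text);
```

The same shape works with EUC as the working form: skip step 1 for EUC input
and convert only the data that arrives in some other encoding.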


  Perl 5.0.x and below can handle EUC fairly well, but regexes may fail.  If you
don't use regexes, just replace utf8 with EUC in the recipe above.

Ken Lunde's PDFs and book will help with the regex problem.
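
To make the regex problem concrete: in EUC-JP a two-byte character can appear
to match in the middle of two neighbouring characters. Below is a minimal
sketch of the false hit and of two ways around it. The pattern and the helper
name euc_contains are my own illustration, not taken from the PDFs or the
book, and the decode() variant assumes the Encode module (Perl 5.8 and later).

```perl
use strict;
use warnings;
use Encode qw(decode);

# EUC-JP bytes for KATAKANA LETTER I (A5 A4) + HIRAGANA LETTER A (A4 A2).
my $haystack = "\xA5\xA4\xA4\xA2";
# EUC-JP bytes for HIRAGANA LETTER I (A4 A4), which is NOT in the text.
my $needle = "\xA4\xA4";

# Naive byte search: reports a hit at byte offset 1, straddling the
# boundary between the two characters. This is the false match.
my $byte_hit = index($haystack, $needle);

# One EUC-JP character: a two-byte JIS X 0208 character, a half-width
# katakana (8E + 1 byte), a JIS X 0212 character (8F + 2 bytes), or ASCII.
my $EUC_CHAR = qr/[\xA1-\xFE][\xA1-\xFE]
                 |\x8E[\xA1-\xDF]
                 |\x8F[\xA1-\xFE][\xA1-\xFE]
                 |[\x00-\x7F]/x;

# Hypothetical helper: byte-oriented, but the needle is only tested at
# character boundaries, because everything before it must be whole
# EUC-JP characters.
sub euc_contains {
    my ($hay, $want) = @_;
    return $hay =~ /\A(?:$EUC_CHAR)*?\Q$want\E/ ? 1 : 0;
}

my $aligned_hit = euc_contains($haystack, $needle);    # 0: no false match

# Character-oriented alternative: decode both sides first, then search.
my $char_hit = index(decode('euc-jp', $haystack), decode('euc-jp', $needle));
```

The anchored pattern is the byte-oriented route for older Perls; decoding first
is simpler where it is available.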

Dan the Developer of Jcode
Andreas Marcel the happy and thankful user of Jcode










