Re: Japanese text search problem

on 01.8.7 9:34 PM, Jarkko Hietaniemi at jhi(_at_)iki(_dot_)fi wrote:

On Tue, Aug 07, 2001 at 05:37:00PM +0530, Ashutosh Salgarkar wrote:

Hi all,

We are trying to search for a Japanese keyword in a search string (in Perl, using
pattern matching).
We are facing a problem while searching for a particular keyword, as given below:
$searchStr =~ m/$key1/i


  First, let me remark that /i is useless when you deal with Japanese, regardless
of character set.  Japanese lacks the very notion of case.

when $key1 contains シリーズ
We get an error as follows

/シリーズ/: unmatched [] in regexp
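
  A possible explanation for this error, assuming the keyword arrived as
Shift JIS (the post does not say): in Shift JIS the katakana long-vowel mark
is the byte pair 0x81 0x5B, and 0x5B is also ASCII '['.  The byte string
below is my assumption of what $key1 held, and metachar_bytes() and
compiles() are hypothetical helpers written for this sketch.  It uses core
Perl only.

```perl
use strict;
use warnings;

# ASSUMPTION: the keyword as Shift JIS bytes (four katakana characters).
my $key = "\x83\x56\x83\x8A\x81\x5B\x83\x59";

# Return the bytes of a string that are also ASCII regex metacharacters.
sub metachar_bytes {
    my ($bytes) = @_;
    return grep { index('[]\\^$.|?*+(){}', chr $_) >= 0 } unpack 'C*', $bytes;
}

# Return 1 if the byte string compiles as a regex when interpolated as-is.
sub compiles {
    my ($pattern) = @_;
    return eval { my $re = qr/$pattern/; 1 } ? 1 : 0;
}

printf "metacharacter byte: 0x%02X\n", $_ for metachar_bytes($key);
print compiles($key)           ? "raw key compiles\n"    : "raw key fails: $@";
print compiles(quotemeta $key) ? "quoted key compiles\n" : "quoted key fails: $@";
```

  Quoting only avoids the compile error.  A byte-level match on Shift JIS data
can still hit across character boundaries, so it is not a substitute for
converting to UTF-8.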


What version of Perl are you using?  (perl -v)


  Perl 5.6.x can handle Japanese in regexes, but that alone is not enough.  You
have to convert the strings to UTF-8 first.  Here is some sample code (untested).

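#
require 5.6.0;
use strict;
use Jcode;  # a module I develop; available via CPAN
use utf8;

# Keyword from the command line; Jcode guesses its encoding and converts to UTF-8
my $key = shift;
my $key_utf = jcode($key)->utf8;

# Convert each input line to UTF-8 too, so both sides of the match agree
while (<>) {
    my $line = jcode($_)->utf8;
    ($line =~ /$key_utf/) and print;   # prints the line in its original encoding
    # or print $line; if you want the utf8 string printed out
}
__END__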

Also, (since I know very little about Japanese) in what Japanese
encoding that is, and exactly what code points (0xAA + 0xBB + ...)
are you using?  (I see some bytes in my email reader but since this is
a plain ISO Latin 8-bit terminal program, I have no idea whether those
bytes are okay.)


   Japanese is notorious for the number of character encodings in use:  JIS,
Shift JIS, EUC, and now Unicode.  JIS (ISO-2022-JP to be more exact) is the de
facto standard for e-mail.  Shift JIS is the de facto standard for Win/Mac
files.  EUC is the de facto standard on Unixen.  Unicode is the de facto standard
for internal representation but is not so popular as a data exchange format.
When you handle Japanese strings, you must not assume incoming data is in
the character set you are using.
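
  To illustrate how different these encodings are on the wire, here is a small
sketch that dumps one katakana word in each of them.  Note the assumptions:
it uses the Encode module that ships with Perl 5.8 and later rather than
Jcode, the encoding names are Encode's, and hex_bytes() is a hypothetical
helper made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Four katakana characters (shi, ri, long-vowel mark, zu) as Unicode code points.
my $word = "\x{30B7}\x{30EA}\x{30FC}\x{30BA}";

# Encode a character string and return its bytes as space-separated hex.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# Same word, four different byte sequences.
for my $encoding (qw(iso-2022-jp shiftjis euc-jp UTF-8)) {
    printf "%-12s %s\n", $encoding, hex_bytes($encoding, $word);
}
```

  None of the four byte sequences is a substring of another, which is why a
keyword in one encoding never matches text stored in a different one.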
  The easiest solution is as follows:

* Use Perl 5.6.0 or above
* Convert every incoming string to UTF-8 using Jcode or another module
* Convert to another character set only when you need to output
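
  The three steps above could be sketched roughly as follows.  This is only an
illustration under assumptions: it swaps in the core Encode module (Perl 5.8
and later) for Jcode, it expects the caller to know the input encoding
instead of guessing it, and grep_japanese() is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode the keyword and the lines from $in_enc, match on characters,
# and return the matching lines re-encoded as $out_enc.
sub grep_japanese {
    my ($key_bytes, $in_enc, $out_enc, @lines) = @_;
    my $key = decode($in_enc, $key_bytes);      # step 2: convert on the way in
    my @hits;
    for my $raw (@lines) {
        my $line = decode($in_enc, $raw);
        next unless $line =~ /\Q$key\E/;        # keyword is taken literally
        push @hits, encode($out_enc, $line);    # step 3: convert on the way out
    }
    return @hits;
}

# Usage: Shift JIS in, EUC-JP out.  The byte strings are sample data.
my $sjis_key = "\x83\x56\x83\x8A\x81\x5B\x83\x59";
my @input    = ("$sjis_key\n", "plain ascii\n");
my @found    = grep_japanese($sjis_key, 'shiftjis', 'euc-jp', @input);
print scalar(@found), " matching line(s)\n";
```

  Because the match runs on decoded characters, the 0x5B trail byte of Shift
JIS never reaches the regex engine.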

  Perl 5.0.x and below can handle EUC fairly well, but regexes may fail.  If you
don't use regexes, just replace utf8 with EUC in the recipe above.

Dan the Developer of Jcode