perl-unicode

Re: Japanese text search problem

2001-08-08 12:18:13
On Wed, 8 Aug 2001, Dan Kogai wrote:

on 01.8.8 1:14 AM, Benjamin Franz at snowhare(_at_)nihongo(_dot_)org wrote:
On Tue, 7 Aug 2001, Ashutosh Salgarkar wrote:

my $safe_key = quotemeta($key1);
$searchStr =~ m/$safe_key/;

is probably what you want. I am presuming you are trying to use m// to
search for exact string matches rather than exploiting the full regex
facilities.
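
A minimal sketch of what quotemeta changes. The values of $key1 and $searchStr below are hypothetical, since the original question does not show them:

```perl
use strict;
use warnings;

# Hypothetical values; the original post does not show what these hold.
my $key1      = 'a.c (1+1)';
my $searchStr = 'abc and a.c (1+1) appear here';

# Escape every regex metacharacter so the key is matched literally.
my $safe_key    = quotemeta($key1);
my $literal_hit = $searchStr =~ m/$safe_key/ ? 1 : 0;

# Without escaping, '.' matches any character, so 'abc' matches /a.c/.
my $loose_hit  = 'abc' =~ m/a.c/     ? 1 : 0;
# \Q...\E applies the same escaping inline.
my $strict_hit = 'abc' =~ m/\Qa.c\E/ ? 1 : 0;

print "literal=$literal_hit loose=$loose_hit strict=$strict_hit\n";
```

Writing m/\Q$key1\E/ should behave the same as the two-step quotemeta form.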

  No.  quotemeta would not cut it.  It depends on what character set is fed
to the regexes, but in most (virtually all) cases you convert strings to
either EUC-JP or UTF-8.  In neither encoding do the bytes of a Japanese (or
Korean or Chinese) character collide with ASCII metacharacters, since every
byte of such a character has the high bit set.  The problem is a bit deeper.
  The problem is that before perl 5.6.x, characters and bytes are
interchangeable, and a Japanese character (Kanji in what follows) takes 2
bytes in EUC (and 3 bytes in UTF-8).

  For example,

  /\xd1\xf1/ and print; # I want to find a line that contains 'to bore'

  not only matches the desired character but also 'camel', which is
represented by two Kanji (4 bytes).

\xb4\xc1 \xbb\xfa
-------- --------
<RAKU>   <DA>     = a camel
    ---------
    <TEKI>        = to bore
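
The false match can be reproduced in a few lines. This is only a sketch: it reuses the four bytes from the diagram, and the search key \xc1\xbb is an assumption chosen so that it straddles the two characters.

```perl
use strict;
use warnings;

# The four bytes from the diagram: two 2-byte EUC-JP characters.
my $line = "\xb4\xc1\xbb\xfa";

# Assumed search key: the last byte of the first character followed by
# the first byte of the second. It is not one of the characters in $line.
my $key = "\xc1\xbb";

# A byte-oriented match knows nothing about character boundaries,
# so it succeeds anyway.
my $offset = -1;
if ($line =~ /\Q$key\E/) {
    $offset = $-[0];    # byte offset where the match started
}

# Every character in $line is 2 bytes wide, so an odd offset means the
# match began in the middle of a character.
my $mid_character = ($offset >= 0 && $offset % 2 == 1) ? 1 : 0;
print "offset=$offset mid_character=$mid_character\n";
```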

  There are ways to overcome this character boundary problem with EUC, like
inserting a delimiter character (such as a beep or a tab) between Kanji, but
that's way too counter-intuitive, not to mention slow.
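
One alternative to inserting delimiters (a different technique from the one described above, sketched here under assumptions) is to anchor the match at the start of the string and consume whole EUC-JP characters before the key. The helper name euc_contains is hypothetical, and the sketch assumes well-formed EUC-JP input.

```perl
use strict;
use warnings;

# One EUC-JP character: an ASCII byte, a 2-byte character (including
# SS2 0x8e + kana), or a 3-byte character introduced by SS3 (0x8f).
my $euc_char = qr/[\x00-\x7f]|[\x8e\xa1-\xfe][\xa1-\xfe]|\x8f[\xa1-\xfe][\xa1-\xfe]/;

# Hypothetical helper: returns 1 only if $key occurs in $text starting
# on a character boundary, otherwise 0.
sub euc_contains {
    my ($text, $key) = @_;
    # Walk forward one whole character at a time, then try the key.
    return $text =~ /\A(?:$euc_char)*?\Q$key\E/ ? 1 : 0;
}

my $line      = "\xb4\xc1\xbb\xfa";                  # two 2-byte characters
my $straddles = euc_contains($line, "\xc1\xbb");      # spans the boundary
my $aligned   = euc_contains($line, "\xbb\xfa");      # the second character

print "straddles=$straddles aligned=$aligned\n";
```

This still rescans from the start of the string on every match, so it does not address the speed concern.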

Oh, yeah. I forgot about that since I don't normally keep stuff in
JIS/SJIS/EUC-JP once I've acquired it. I always make my working store
UTF8. In UTF8 the 'frame' problem doesn't exist because the lead byte of a
multi-byte character _ALWAYS_ starts with the bits 11, continuation bytes
_ALWAYS_ start with the bits 10, and plain ASCII bytes have the high bit
set to 0. A match for a complete character therefore can never begin in the
middle of another one. 'quotemeta' works fine if you use UTF8 as your
working encoding.
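
A short sketch of that property, assuming a perl that ships the Encode module (it postdates the perls discussed in this thread). The two code points are arbitrary CJK characters written as escapes so the source stays ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two arbitrary CJK code points, encoded to a UTF-8 byte string.
my $bytes = encode('UTF-8', "\x{6f22}\x{5b57}");

# Classify each byte by its two high bits.
my (@lead, @cont);
for my $b (unpack 'C*', $bytes) {
    if    (($b & 0xC0) == 0x80) { push @cont, $b }    # 10xxxxxx
    elsif (($b & 0xC0) == 0xC0) { push @lead, $b }    # 11xxxxxx
}
printf "%d lead bytes, %d continuation bytes\n", scalar @lead, scalar @cont;

# A key that is itself complete UTF-8 begins with a lead byte, so a
# byte-level match can only start where a character starts.
my $key    = encode('UTF-8', "\x{5b57}");
my $offset = -1;
if ($bytes =~ /\Q$key\E/) {
    $offset = $-[0];    # byte offset of the match
}
print "match at byte offset $offset\n";
```

The sketch relies on both the text and the key being well-formed UTF-8; a key cut in the middle of a character would lose this guarantee.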

-- 
Benjamin Franz

  Programs must be written for people to read, and only
  incidentally for machines to execute.
                             ---Abelson and Sussman

