perl-unicode

Re: Japanese text search problem

2001-08-07 11:02:29
on 01.8.8 1:14 AM, Benjamin Franz at snowhare(_at_)nihongo(_dot_)org wrote:
On Tue, 7 Aug 2001, Ashutosh Salgarkar wrote:

my $safe_key = quotemeta($key1);
$searchStr =~ m/$safe_key/;

is probably what you want. I am presuming you are trying to use m// to
search for exact string matches rather than exploiting the full regex
facilities.

  No, quotemeta will not cut it.  It depends on which character set is fed
to the regex, but in most (virtually all) cases you convert the strings to
either EUC-JP or UTF-8 first.  In neither encoding do the bytes of a Japanese
(or Korean or Chinese) character collide with regex metacharacters, because
every byte of a multibyte character has its high bit set.  The problem is a
bit deeper.
  The problem is that before perl 5.6.x, perl treats characters and bytes as
interchangeable, while a Japanese character (Kanji in what follows) takes
2 bytes in EUC (and 3 bytes in UTF-8).
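
  To make the byte counts concrete, here is a minimal sketch.  It assumes a
perl recent enough to ship the Encode module (which is newer than the perls
discussed in this thread), and the sample character U+6F22 is my own pick,
not something from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumed sample: U+6F22, a single Kanji, held as one character.
my $char = "\x{6f22}";

# encode() returns byte strings, so length() counts bytes here.
my $euc  = encode('euc-jp', $char);
my $utf8 = encode('UTF-8',  $char);

printf "characters:   %d\n", length $char;
printf "EUC-JP bytes: %d\n", length $euc;
printf "UTF-8 bytes:  %d\n", length $utf8;
```

  A regex that works on bytes sees two or three separate units where the
reader sees one character.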

  For example,

  /\xd1\xf1/ and print; # I want to find a line that contains 'to bore'

  not only matches the desired character but also 'camel', which is
written with two Kanji (4 bytes): the pattern also fits the second byte of
the first Kanji followed by the first byte of the second.

\xb4\xc1 \xbb\xfa
-------- --------
<RAKU>   <DA>     = a camel
    ---------
    <TEKI>        = to bore
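
  The false hit can be reproduced with the byte values drawn in the diagram
above.  This is only a sketch: the straddling pair in that diagram is
\xc1\xbb, and the variable names are mine.  Note that quotemeta (\Q...\E)
is applied and makes no difference.

```perl
use strict;
use warnings;

# Two 2-byte EUC characters, byte values copied from the diagram.
my $line = "\xb4\xc1\xbb\xfa";

# Second byte of the first character plus first byte of the second one.
my $straddle = "\xc1\xbb";

# The match works on bytes, so it succeeds in the middle of the pair.
if ($line =~ /\Q$straddle\E/) {
    printf "false hit at byte offset %d\n", $-[0];
}
```

  The match starts at byte offset 1, which is inside the first Kanji rather
than on a character boundary.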

  There are ways to overcome this character boundary problem with EUC, such
as inserting a delimiter character (a beep or a tab, say) between Kanji, but
that is far too counter-intuitive, not to mention slow.
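
  A different workaround, which is not the delimiter trick mentioned above,
is to anchor the match at the start of the string and step over whole EUC
characters before trying the search key.  The sketch below assumes
well-formed EUC-JP input; $euc_char and euc_contains are hypothetical names
made up for illustration.

```perl
use strict;
use warnings;

# One EUC-JP character: ASCII, a 3-byte sequence introduced by \x8f,
# or a 2-byte sequence (lead byte \x8e or \xa1-\xfe).
my $euc_char = qr/[\x00-\x7f]|\x8f[\xa1-\xfe][\xa1-\xfe]|[\x8e\xa1-\xfe][\xa1-\xfe]/;

# Hypothetical helper: true only when $needle begins on a character boundary.
sub euc_contains {
    my ($haystack, $needle) = @_;
    return $haystack =~ /\A(?:$euc_char)*?\Q$needle\E/ ? 1 : 0;
}

my $line = "\xb4\xc1\xbb\xfa";    # two 2-byte characters

print euc_contains($line, "\xbb\xfa") ? "hit\n" : "miss\n";  # aligned key
print euc_contains($line, "\xc1\xbb") ? "hit\n" : "miss\n";  # straddling key
```

  The aligned key is found and the straddling one is not.  The price is that
every match rescans from the start of the string, and malformed input before
the key makes the match fail outright.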

Dan the Man with Too Many Character Sets to Fiddle
