namazu-users-en

[Namazu-users-en] Re: mknmz not working for Japanese language documents?

2006-06-29 20:01:07
Darren Cook wrote:

> I've not used the perl modules, but I can tell you what I do on a site
> that isn't native EUC.
>
> For indexing an English UTF8 site I use:
>   mknmz --indexing-lang=en.UTF-8 -e ...

That is a mistake.
Namazu does not support UTF-8.

> For indexing a Japanese UTF8 site I use (the -k means use kakasi):
>   mknmz --indexing-lang=ja.UTF-8 -k -e ...

That is also a mistake.
Namazu does not support UTF-8.
(It can, however, handle documents whose encoding is ja_JP.UTF-8.)

You need to use the following instead:

$ mknmz --indexing-lang=ja_JP.eucjp -k -e ...

Documents in ISO-2022-JP, Shift_JIS, and EUC-JP can all be handled
even though ja_JP.eucjp is specified.

The --indexing-lang option does not specify the encoding of the
documents being indexed.

> For searching (I'm using PHP module by the way) I convert the search
> keywords to EUC:
>   $kw_euc=mb_convert_encoding($kw,"EUC-JP","UTF8");

Search keywords are supported only in ISO-2022-JP, Shift_JIS,
and EUC-JP.
(UTF-8 is not supported, so converting the keywords to EUC-JP,
as in this example, is recommended.)

> Then do the search, then for each search hit I convert the result back
> from EUC to UTF8 ready for display, e.g.:

On UNIX, the search results are always returned in EUC-JP.
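
From a shell script, the same round trip (keyword to EUC-JP, result back
to UTF-8) might look roughly like the sketch below. It assumes iconv(1)
is installed and that the namazu command is on the PATH; INDEX_DIR and
the function names to_euc, from_euc, and search_utf8 are hypothetical
and only there for illustration.

```shell
#!/bin/sh
# Sketch: bridge a UTF-8 front end to a Namazu index that expects EUC-JP.
# INDEX_DIR is a hypothetical location; point it at your own index.
INDEX_DIR="${INDEX_DIR:-/var/lib/namazu/index}"

# Convert UTF-8 text on stdin to EUC-JP on stdout.
to_euc() {
    iconv -f UTF-8 -t EUC-JP
}

# Convert EUC-JP text on stdin back to UTF-8 on stdout.
from_euc() {
    iconv -f EUC-JP -t UTF-8
}

# Search with a UTF-8 keyword and print the results as UTF-8.
search_utf8() {
    # Fail early if the keyword cannot be represented in EUC-JP.
    kw_euc=$(printf '%s' "$1" | to_euc) || return 1
    namazu "$kw_euc" "$INDEX_DIR" | from_euc
}
```

After sourcing the file, something like `search_utf8 "検索"` would run the
query; only the two conversion helpers can be tried without an index.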
-- 
=====================================================================
TADAMASA TERANISHI  yw3t-trns(_at_)asahi-net(_dot_)or(_dot_)jp
http://www.asahi-net.or.jp/~yw3t-trns/index.htm
Key fingerprint =  474E 4D93 8E97 11F6 662D  8A42 17F5 52F4 10E7 D14E
