On Thu, 31 Jan 2002 12:31:58 +0000
Jean-Michel Hiver <jhiver(_at_)mkdoc(_dot_)com> wrote:
> However I'm having quite a lot of trouble with Japanese because of
> Kanjis (Chinese ideograms). ICU does provide Hiragana <=> Latin and
> Katakana <=> Latin, but doesn't do anything about kanji. Which does
> not surprise me too much given the fact that in Japanese a kanji very
> often has more than one pronunciation depending on how and where it's
> used.
> Another problem with Japanese is that it seems to me that words are
> not separated by spaces. Therefore even if the transliteration worked
> for Kanjis I'd end up with lots of endless strings, which is not good
> for indexing when you try to split text into keywords.
> Any ideas?
Try kakasi or Chasen. Both can be accessed from Perl through XS
wrappers.
kakasi
http://kakasi.namazu.org/
Chasen
http://chasen.aist-nara.ac.jp/
namazu
http://namazu.org/
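As a rough illustration of the keyword-splitting step, here is a minimal
sketch that shells out to the kakasi command-line tool rather than going
through the Perl XS wrapper. It is written in Python only for brevity.
Assumptions: a kakasi binary is on the PATH and was built with UTF-8
support (the -iutf8/-outf8 options), and the helper names
kakasi_command, split_keywords and segment are made up for this example.

```python
import shutil
import subprocess
import unicodedata


def kakasi_command(romanize=False):
    """Build an argv list for the kakasi CLI (assumes a UTF-8 capable build)."""
    cmd = ["kakasi", "-iutf8", "-outf8"]
    if romanize:
        # -Ja / -Ha / -Ka: kanji, hiragana, katakana -> romaji; -s: add spaces.
        cmd += ["-Ja", "-Ha", "-Ka", "-s"]
    else:
        # -w: "wakati-gaki", i.e. rewrite the text with spaces between words.
        cmd.append("-w")
    return cmd


def split_keywords(segmented):
    """Split space-separated output into tokens, dropping pure punctuation."""
    return [
        token
        for token in segmented.split()
        if not all(unicodedata.category(ch).startswith("P") for ch in token)
    ]


def segment(text):
    """Run kakasi on `text` and return a list of keyword candidates."""
    result = subprocess.run(
        kakasi_command(),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return split_keywords(result.stdout)


if __name__ == "__main__":
    # Only attempt the external call when kakasi is actually installed.
    if shutil.which("kakasi"):
        print(segment("私は学生です。"))
    else:
        print("kakasi not found on PATH")
```

Passing romanize=True to kakasi_command builds the argument list for a
romaji conversion instead. kakasi picks a reading from its dictionary,
so the result for kanji with several pronunciations is a best guess.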
--
Tatsuhiko Miyagawa <miyagawa(_at_)bulknews(_dot_)net>