Tatsuhiko Miyagawa <miyagawa(_at_)edge(_dot_)co(_dot_)jp> writes:
On Thu, 31 Jan 2002 12:31:58 +0000
Jean-Michel Hiver <jhiver(_at_)mkdoc(_dot_)com> wrote:
Another problem with Japanese is that, as far as I can tell, words are
not separated by spaces. So even if the transliteration worked for
kanji, I'd end up with long unbroken strings, which is no good for
indexing when you try to split text into keywords.
Any ideas?
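To make the problem concrete, here is a minimal Python sketch (the sample
sentences are made up for illustration, not taken from the thread). A plain
whitespace split is enough to get keywords from English, but it hands back a
Japanese sentence as one single token:

```python
# Whitespace splitting works for English but not for Japanese,
# because Japanese text normally contains no spaces between words.
english = "words are separated by spaces"
japanese = "日本語の文章には空白がありません"  # illustrative sample sentence

english_tokens = english.split()
japanese_tokens = japanese.split()

print(len(english_tokens))   # one token per word
print(len(japanese_tokens))  # the whole sentence comes back as one token
```

This is why a morphological analyzer has to run before indexing.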
Try kakasi or Chasen. Both can be used from Perl through XS
wrappers.
kakasi
http://kakasi.namazu.org/
Chasen
http://chasen.aist-nara.ac.jp/
Yes, "kakasi" and "Chasen" are currently the best tools publicly
available, and the best way to transliterate seems to be a
combination of both. Most people from the Western hemisphere will
be happy with:
[shell]$ chasen -F'%a ' | kakasi -Ka
Here chasen handles tokenization and transliteration to kana, and
kakasi handles romanization. But be aware that kakasi's transliteration
is far from perfect. I have sent in a few patches to kakasi and
one was applied. Others, which would make kakasi more compatible
with Hepburn transliteration, were ignored.
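As a rough illustration, the chasen/kakasi pipeline could be driven from a
script along these lines. This is only a sketch under assumptions: it presumes
chasen and kakasi are installed and on the PATH, and that both read and write
EUC-JP (adjust `encoding` to match your installation). The function names
`romanize` and `keywords` are made up for this example.

```python
import subprocess


def romanize(text, encoding="euc_jp"):
    """Equivalent of: chasen -F'%a ' | kakasi -Ka

    Assumes both tools are on the PATH and use the given encoding.
    """
    # Step 1: chasen tokenizes and prints each token's kana
    # pronunciation followed by a space.
    kana = subprocess.run(
        ["chasen", "-F%a "],
        input=text.encode(encoding),
        stdout=subprocess.PIPE,
        check=True,
    ).stdout
    # Step 2: kakasi converts the katakana to ASCII (romaji).
    romaji = subprocess.run(
        ["kakasi", "-Ka"],
        input=kana,
        stdout=subprocess.PIPE,
        check=True,
    ).stdout
    return romaji.decode(encoding)


def keywords(romanized):
    """Split the space-separated pipeline output into lowercase keywords."""
    return [word.lower() for word in romanized.split()]
```

With the tokens separated by spaces, `keywords(romanize(text))` yields a list
that an indexer written for Western languages can consume directly.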
I use chasen -F'%a ' and chasen -F'%A ' (for the base form) together
with my own transliteration engine, which I unfortunately cannot make
public at the moment.
Andreas