Re: Unicode / Japanese and Transliteration problem

Tatsuhiko Miyagawa <miyagawa(_at_)edge(_dot_)co(_dot_)jp> writes:

On Thu, 31 Jan 2002 12:31:58 +0000
Jean-Michel Hiver <jhiver(_at_)mkdoc(_dot_)com> wrote:


  Another problem with Japanese is that, as far as I can tell, words are
  not separated by spaces.  Therefore even if the transliteration worked
  for kanji I'd end up with very long unbroken strings, which is not good
  for indexing when you try to split text into keywords.

  Any ideas?


Try kakasi or Chasen. Both can be accessed from Perl via XS
wrappers.

kakasi
http://kakasi.namazu.org/
Chasen
http://chasen.aist-nara.ac.jp/


Yes, "kakasi" and "Chasen" are currently the best tools publicly
available, and the best way to transliterate seems to be a
combination of both. Most people from the Western hemisphere will
be happy with:

     [shell]$ chasen -F'%a '  | kakasi -Ka

Here chasen is used for tokenization and transliteration to kana;
kakasi is used for romanization. But be aware that kakasi's
transliteration is far from perfect. I have sent in a few patches to
kakasi and one was applied. Others, which would make kakasi more
compatible with Hepburn transliteration, were ignored.
I use chasen -F'%a ' and chasen -F'%A ' (for the base form) together
with my own transliteration engine, which I unfortunately cannot make
public at the moment.
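
To illustrate the romanization step of the pipeline above, here is a
minimal Python sketch of a katakana-to-romaji converter in a
Hepburn-like style. It is neither kakasi's algorithm nor the private
engine mentioned above. The table, the names romanize and
romanize_tokens, and the treatment of the long-vowel mark (repeating the
previous vowel instead of writing a macron) are all assumptions made for
illustration. The input is assumed to be space-separated katakana
readings of the kind chasen -F'%a ' is described as producing.

```python
# Illustrative sketch only: a tiny katakana -> romaji converter.
# The table and rules are assumptions, not taken from kakasi or ChaSen.
VOWELS = "aiueo"
CONSONANTS = "bcdfghjkmnprstwyz"

# Regular rows: consonant prefix -> the five kana of that row (a i u e o).
ROWS = {
    "": "アイウエオ", "k": "カキクケコ", "s": "サシスセソ", "t": "タチツテト",
    "n": "ナニヌネノ", "h": "ハヒフヘホ", "m": "マミムメモ", "r": "ラリルレロ",
    "g": "ガギグゲゴ", "z": "ザジズゼゾ", "d": "ダヂヅデド", "b": "バビブベボ",
    "p": "パピプペポ",
}
BASE = {k: c + v for c, row in ROWS.items() for k, v in zip(row, VOWELS)}

# Hepburn-style exceptions and the kana outside the regular rows.
BASE.update({
    "シ": "shi", "チ": "chi", "ツ": "tsu", "フ": "fu", "ジ": "ji",
    "ヂ": "ji", "ヅ": "zu", "ヤ": "ya", "ユ": "yu", "ヨ": "yo",
    "ワ": "wa", "ヲ": "o", "ン": "n",
})

# Small ya/yu/yo form digraphs such as キャ (kya) or シュ (shu).
SMALL_Y = {"ャ": "a", "ュ": "u", "ョ": "o"}


def romanize(kana):
    """Romanize one katakana word; unknown characters pass through."""
    out = []
    geminate = False  # set when a small tsu (ッ) was just seen
    i = 0
    while i < len(kana):
        ch = kana[i]
        nxt = kana[i + 1] if i + 1 < len(kana) else ""
        i += 1
        if ch == "ッ":
            geminate = True
            continue
        if ch == "ー":
            # Long-vowel mark: repeat the previous vowel (assumption).
            if out and out[-1][-1] in VOWELS:
                out.append(out[-1][-1])
            continue
        syl = BASE.get(ch, ch)
        if nxt in SMALL_Y and len(syl) > 1 and syl.endswith("i"):
            stem = syl[:-1]
            glide = "" if stem in ("sh", "ch", "j") else "y"
            syl = stem + glide + SMALL_Y[nxt]
            i += 1  # consume the small kana as well
        if geminate:
            # Double the following consonant; "ch" becomes "tch".
            if syl[0] in CONSONANTS:
                syl = ("t" if syl.startswith("ch") else syl[0]) + syl
            geminate = False
        out.append(syl)
    return "".join(out)


def romanize_tokens(line):
    """Romanize a line of space-separated katakana tokens."""
    return " ".join(romanize(tok) for tok in line.split())
```

For example, romanize_tokens("ニホンゴ ノ ブンショー") returns
"nihongo no bunshoo". A real engine would need many more rules
(extended katakana, n before vowels, macrons, and so on).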
 

Andreas
