perl-unicode

Japanese tokenization problem

2002-01-31 06:28:20
On 2002.01.31, at 21:44, Tatsuhiko Miyagawa wrote:
On Thu, 31 Jan 2002 12:31:58 +0000
Jean-Michel Hiver <jhiver(_at_)mkdoc(_dot_)com> wrote:
Any ideas?

Try kakasi or Chasen. They can be accessed from Perl via XS wrappers.

One caveat: neither Kakasi nor Chasen groks Unicode. You have to convert the original string from/to EUC to use them. You can use Jcode.pm or iconv for that purpose.
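
For instance, the glue between the two might look roughly like this. This is just a rough sketch from me, assuming Text::Kakasi's function-style getopt_argv() / do_kakasi() interface and kakasi's -w (wakati-gaki) option; adjust it for whatever wrapper you actually install:

  #!/usr/bin/perl -w
  use strict;
  use Jcode;          # charset conversion (UTF-8 <-> EUC-JP)
  use Text::Kakasi;   # XS wrapper around kakasi

  my $utf8_text = shift;   # some Japanese text in UTF-8

  # kakasi wants EUC-JP, so convert first with Jcode.
  my $euc_text = Jcode->new($utf8_text, 'utf8')->euc;

  # -w = wakati-gaki (insert spaces between words); -ieuc/-oeuc = EUC in/out.
  Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-oeuc', '-w');
  my $euc_wakati = Text::Kakasi::do_kakasi($euc_text);

  # Back to UTF-8, then split on the inserted whitespace to get "tokens".
  my @tokens = split /\s+/, Jcode->new($euc_wakati, 'euc')->utf8;
  print join("\n", @tokens), "\n";

The same kind of glue works with Chasen (Text::ChaSen) if you want morphological analysis instead of plain wakati-gaki.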

  Another problem with Japanese is that it seems to me that words are
  not separated by spaces. Therefore even if the transliteration worked
  for Kanjis I'd end up with lots of endless strings, which is not good
  for indexing when you try to split text into keywords.

Tokenizing Japanese is among the hardest problems. Even the very notion of a token differs among linguists. For instance...

WATASHI-wa Perl-de CGI-wo KAkimasu

The uppercase parts are in Kanji and the lowercase parts are in hiragana.

See the first part, WATASHI-wa. This corresponds to English 'I', but it comes in two parts: WATASHI, which means "first party", and "wa" (spelled "ha" but pronounced "wa"), which makes the previous word nominative. Now the question is whether "WATASHI-wa" is a single token or two.

So please note there is no silver bullet for Japanese tokenization. Kakasi / Chasen is good enough for search engines like Namazu, but that does not mean the tokens they spit out are canonical. As you see, there is no "Canonical Japanese" in the sense that there is a "Canonical French" defined by the Académie française :)

There is an even more radical approach when it comes to search engines. You can now search an arbitrary byte stream WITHOUT tokenization at all, using an algorithm called a suffix array. The concept is deceptively simple, but for some reason it was not discovered until the 1990s. To get an idea of what a suffix array is, search for 'suffix array' on Google. Interestingly, the first hit goes to sary.namazu.org. But once again, even this is not a silver bullet. One problem with suffix arrays is that your 'INDEX' often gets larger than the original text.
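
To make that concrete, here is a toy illustration in Perl -- naive construction and naive lookup, with the example string and the find() helper invented by me, not taken from sary or any real suffix-array library:

  #!/usr/bin/perl -w
  use strict;

  my $text = "WATASHI-wa Perl-de CGI-wo KAkimasu";

  # Build: each suffix is represented by its start offset; sort the offsets
  # by the suffix they point at.  Naive O(n^2 log n), fine for a demo.
  my @sa = sort { substr($text, $a) cmp substr($text, $b) }
                0 .. length($text) - 1;

  # Lookup: binary search for the first suffix that starts with $query.
  sub find {
      my $query = shift;
      my ($lo, $hi) = (0, scalar @sa);
      while ($lo < $hi) {
          my $mid = int(($lo + $hi) / 2);
          if (substr($text, $sa[$mid], length $query) lt $query) {
              $lo = $mid + 1;
          } else {
              $hi = $mid;
          }
      }
      return ($lo < @sa && substr($text, $sa[$lo], length $query) eq $query)
          ? $sa[$lo] : -1;
  }

  printf "'Perl' found at byte offset %d\n", find('Perl');   # 11
  printf "'CGI' found at byte offset %d\n",  find('CGI');    # 19

Note the index holds one integer per byte of text, which is exactly why it tends to outweigh the original.
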
  Well, I've said enough.  Happy coding!

Dan the Man with Too Many Words to Tokenize
