perl-unicode

Japanese tokenization problem

2002-01-31 06:28:20
On 2002.01.31, at 21:44, Tatsuhiko Miyagawa wrote:
On Thu, 31 Jan 2002 12:31:58 +0000
Jean-Michel Hiver <jhiver(_at_)mkdoc(_dot_)com> wrote:
Any ideas?

Try kakasi or Chasen. They can be accessed from Perl via XS wrappers.

One caveat: neither Kakasi nor Chasen groks Unicode. You have to convert the original string from/to EUC to use them. You can use Jcode.pm or iconv for that purpose.
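
For instance, the glue between the two might look roughly like this. This is just a rough sketch from me, assuming Text::Kakasi's function-style getopt_argv() / do_kakasi() interface and kakasi's -w (wakati-gaki) option; adjust it for whatever wrapper you actually install:

  #!/usr/bin/perl -w
  use strict;
  use Jcode;          # charset conversion (UTF-8 <-> EUC-JP)
  use Text::Kakasi;   # XS wrapper around kakasi

  my $utf8_text = shift;   # some Japanese text in UTF-8

  # kakasi wants EUC-JP, so convert first with Jcode.
  my $euc_text = Jcode->new($utf8_text, 'utf8')->euc;

  # -w = wakati-gaki (insert spaces between words); -ieuc/-oeuc = EUC in/out.
  Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-oeuc', '-w');
  my $euc_wakati = Text::Kakasi::do_kakasi($euc_text);

  # Back to UTF-8, then split on the inserted whitespace to get "tokens".
  my @tokens = split /\s+/, Jcode->new($euc_wakati, 'euc')->utf8;
  print join("\n", @tokens), "\n";

The same kind of glue works with Chasen (Text::ChaSen) if you want morphological analysis instead of plain wakati-gaki.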

  Another problem with Japanese is that it seems to me that words are
  not separated by spaces. Therefore even if the transliteration worked
  for Kanjis I'd end up with lots of endless strings, which is not good
  for indexing when you try to split text into keywords.

Tokenizing Japanese is among the hardest problems. Even the very notion of a token differs among linguists. For instance...

WATASHI-wa Perl-de CGI-wo KAkimasu

The uppercase parts are in Kanji and the lowercase parts are in hiragana.

See the first part, WATASHI-wa. This corresponds to English 'I', but it comes in two parts: WATASHI, which means "first party", and "wa" (spelled "ha" but pronounced "wa"), which makes the previous word nominative. Now the question is whether "WATASHI-wa" is a single token or two.

So please note there is no silver bullet for Japanese tokenization. Kakasi / Chasen is good enough for search engines like Namazu, but that does not mean the tokens they spit out are canonical. As you see, there is no "Canonical Japanese" in the sense that there is a "Canonical French" defined by the Académie française :)

There is an even more radical approach when it comes to search engines. You can now search an arbitrary byte stream WITHOUT tokenization at all, using an algorithm called a suffix array. The concept is deceptively simple, but for some reason it was not discovered until the 1990s. To get an idea of what a suffix array is, search for 'suffix array' on Google. Interestingly, the first hit goes to sary.namazu.org. But once again, even this is not a silver bullet. One problem with suffix arrays is that your 'INDEX' often gets larger than the original text.
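
To make that concrete, here is a toy illustration in Perl -- naive construction and naive lookup, with the example string and the find() helper invented by me, not taken from sary or any real suffix-array library:

  #!/usr/bin/perl -w
  use strict;

  my $text = "WATASHI-wa Perl-de CGI-wo KAkimasu";

  # Build: each suffix is represented by its start offset; sort the offsets
  # by the suffix they point at.  Naive O(n^2 log n), fine for a demo.
  my @sa = sort { substr($text, $a) cmp substr($text, $b) }
                0 .. length($text) - 1;

  # Lookup: binary search for the first suffix that starts with $query.
  sub find {
      my $query = shift;
      my ($lo, $hi) = (0, scalar @sa);
      while ($lo < $hi) {
          my $mid = int(($lo + $hi) / 2);
          if (substr($text, $sa[$mid], length $query) lt $query) {
              $lo = $mid + 1;
          } else {
              $hi = $mid;
          }
      }
      return ($lo < @sa && substr($text, $sa[$lo], length $query) eq $query)
          ? $sa[$lo] : -1;
  }

  printf "'Perl' found at byte offset %d\n", find('Perl');   # 11
  printf "'CGI' found at byte offset %d\n",  find('CGI');    # 19

Note the index holds one integer per byte of text, which is exactly why it tends to outweigh the original.
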
  Well, I've said enough.  Happy coding!

Dan the Man with Too Many Words to Tokenize
