On 2002.01.31, at 21:44, Tatsuhiko Miyagawa wrote:
> On Thu, 31 Jan 2002 12:31:58 +0000
> Jean-Michel Hiver <jhiver(_at_)mkdoc(_dot_)com> wrote:
>> Any ideas?
> Try kakasi or Chasen. They can be accessed via Perl with XS
> wrapper.
One caveat: Kakasi / Chasen doesn't grok Unicode. You have to convert
the original string from/to EUC to use them. You can use Jcode.pm or
iconv for that purpose.
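For instance, a minimal sketch of the round-trip (untested here, and
assuming Jcode.pm together with the Text::Kakasi 1.x functional
interface, getopt_argv / do_kakasi) looks like this:

    use strict;
    use Jcode;
    use Text::Kakasi;

    my $utf8_text = shift @ARGV;                    # UTF-8 in

    # kakasi speaks EUC-JP, not Unicode, so convert first
    my $euc = Jcode->new($utf8_text, 'utf8')->euc;

    # -ieuc/-oeuc set the encodings, -w asks for wakati (word-split) output
    Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-oeuc', '-w');
    my $segmented = Text::Kakasi::do_kakasi($euc);

    # and back to UTF-8 for the rest of the pipeline
    print Jcode->new($segmented, 'euc')->utf8, "\n";

iconv would of course do the two encoding legs just as well.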
> Another problem with Japanese is that it seems to me that words are
> not separated by spaces. Therefore even if the transliteration worked
> for Kanjis I'd end up with lots of endless strings, which is not good
> for indexing when you try to split text into keywords.
Tokenizing Japanese is among the hardest. Even the very notion of a
token differs among linguists. For instance:
WATASHI-wa Perl-de CGI-wo KAkimasu
The upper-case parts are in kanji and the lower-case parts are in
hiragana. Look at the first part, WATASHI-wa. It corresponds to English
'I', but it comes in two parts: WATASHI, which means the first person,
and "wa" (spelled "ha" but pronounced "wa"), which marks the preceding
word as the topic, roughly the subject. Now the question is whether
"WATASHI-wa" is a single token or two.
So please note there is no silver bullet for Japanese tokenization.
Kakasi / Chasen are good enough for search engines like Namazu, but that
does not mean the tokens they spit out are canonical. As you can see,
there is no "Canonical Japanese" in the sense that there is a "Canonical
French" defined by the Académie française :)
There is an even more radical approach when it comes to search engines:
you can search an arbitrary byte stream WITHOUT any tokenization at all,
using a data structure called a suffix array. The concept is deceptively
simple, but for some reason it was not discovered until the 1990s. To
get an idea of what a suffix array is, search for 'suffix array' on
Google. Interestingly, the first hit goes to sary.namazu.org.
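To make the idea concrete, here is a toy sketch in Perl (a naive
O(n^2 log n) construction, nothing you would use on a real corpus):
sort every byte offset of the text by the suffix that starts there,
then binary-search the sorted offsets for the query.

    use strict;
    use warnings;

    my $text = "WATASHI-wa Perl-de CGI-wo KAkimasu";

    # The "index": every byte offset, sorted by the suffix starting there.
    my @sa = sort { substr($text, $a) cmp substr($text, $b) }
                  0 .. length($text) - 1;

    # Binary search for any suffix that begins with the query.
    sub sa_find {
        my ($t, $sa, $q) = @_;
        my ($lo, $hi) = (0, $#$sa);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $cmp = substr($t, $sa->[$mid], length $q) cmp $q;
            if    ($cmp < 0) { $lo = $mid + 1 }
            elsif ($cmp > 0) { $hi = $mid - 1 }
            else             { return $sa->[$mid] }   # byte offset of a hit
        }
        return -1;                                    # not found
    }

    printf "'CGI' found at byte offset %d\n", sa_find($text, \@sa, 'CGI');

Note that the search itself never needs to know where one word ends and
the next begins.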
But once again, even this is not a silver bullet. One problem with
suffix arrays is that your 'INDEX' often gets larger than the original
text.
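To put a rough number on it: if you index every byte position of an
n-byte text, the suffix array alone holds n offsets, and with 32-bit
offsets that is 4n bytes, roughly four times the size of the text before
you count the text itself or any auxiliary tables.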
Well, I've said enough. Happy coding!
Dan the Man with Too Many Words to Tokenize