Dan Kogai <dankogai@dan.co.jp> writes:
On 2002.01.31, at 21:44, Tatsuhiko Miyagawa wrote:
On Thu, 31 Jan 2002 12:31:58 +0000
Jean-Michel Hiver <jhiver@mkdoc.com> wrote:
Any ideas?
Try kakasi or ChaSen. Both can be accessed from Perl via XS
wrappers.
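For example, with the Text::Kakasi wrapper you could segment EUC-JP
text roughly like this (a minimal sketch assuming the 1.x
function-style interface and kakasi's -w wakachigaki flag; check the
module's docs for the exact API):

    #!/usr/bin/perl -w
    use strict;
    use Text::Kakasi;

    # Assumed flags: '-w' asks kakasi for wakachigaki, i.e. to put
    # spaces between the words it recognizes; '-ieuc' declares the
    # input encoding as EUC-JP.
    Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');

    while (my $line = <STDIN>) {
        print Text::Kakasi::do_kakasi($line);  # space-separated tokens
    }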
Tokenizing Japanese is among the hardest problems. Even the very
notion of a token differs among linguists. For instance:
WATASHI-wa Perl-de CGI-wo KAkimasu
The upper-case parts are in kanji and the lower-case parts are in hiragana.
Look at the first part, WATASHI-wa. It corresponds to the English
'I', but it comes in two parts: WATASHI, which denotes the first
person, and "wa" (spelled "ha" but pronounced "wa"), which marks
the previous word as nominative. Now the question is whether
"WATASHI-wa" is a single token or two.
One token if you do bunsetsu (phrase) tokenization; two tokens if
you do lemma (dictionary entry) tokenization.
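Roughly, for the whole sentence (my own segmentation, so the exact
boundaries are debatable):

    bunsetsu: [WATASHI-wa] [Perl-de] [CGI-wo] [KAkimasu]
    lemma:    [WATASHI] [wa] [Perl] [de] [CGI] [wo] [KAki] [masu]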
So please note there is no silver bullet for Japanese tokenization.
kakasi and ChaSen are good enough for search engines like Namazu, but
that does not mean the tokens they emit are canonical. As you can
see, there is no "canonical Japanese" in the sense that the Académie
française gives us a "canonical French" :)
Unfortunately, Dan is right here. Merely asking for the meaning of
the word "word" can get you into week-long discussions with Japanese
linguists.
There is an even more radical approach when it comes to search
engines. You can now search an arbitrary byte stream WITHOUT any
tokenization at all, using an algorithm called a suffix array. The
concept is deceptively simple, but for some reason it was not
discovered until the 1990s. To get an idea of what a suffix array
is, search for 'suffix array' on Google. Interestingly, the first
hit goes to sary.namazu.org.
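To make the idea concrete, here is a toy illustration in Perl of the
textbook technique (my own sketch, not how sary is actually
implemented, and naively O(n^2 log n) where real tools are far more
efficient):

    #!/usr/bin/perl -w
    use strict;

    # Build the suffix array: every suffix is represented by its
    # starting offset, and the offsets are sorted by comparing the
    # suffixes they point to.
    my $text = "abracadabra";
    my @sa = sort { substr($text, $a) cmp substr($text, $b) }
             0 .. length($text) - 1;

    # Any substring of $text is a prefix of some suffix, so a
    # binary search over @sa finds an occurrence without any
    # tokenization of the underlying bytes.
    sub find {
        my $pat = shift;
        my ($lo, $hi) = (0, $#sa);
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            if (substr($text, $sa[$mid], length($pat)) lt $pat) {
                $lo = $mid + 1;
            } else {
                $hi = $mid;
            }
        }
        return substr($text, $sa[$lo], length($pat)) eq $pat
             ? $sa[$lo] : -1;    # offset of one match, or -1
    }

    print find("cad"), "\n";     # prints 4
    print find("xyz"), "\n";     # prints -1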
If you want to try suffix arrays, "sufary"
(http://cl.aist-nara.ac.jp/lab/nlt/ss/) is worth a try. An XS module
for Perl is included in the distribution.
Andreas