Hi Perl Unicode geeks,
I'm currently making our web application (MKDoc) support more than
Western European languages. Being a French lad working in England
who's currently learning Japanese (mainly because of anime movies, I
admit it :)), I thought I had to do it the "Right Way", i.e. going
all the way with Unicode.
One of the few problems I've been running into with Unicode is
building human-readable URIs from Unicode strings. It's not much of
a deal when constructing URIs from English titles, but it becomes a
bit less obvious when "URLizing" from languages such as Punjabi or
Gujarati.
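
To make the "URLizing" step concrete, here is a minimal sketch in Python using only the standard library. The names `to_ascii` and `urlize` are hypothetical, and `to_ascii` is only a stand-in for a real transliterator: it folds accented Latin letters but knows nothing about Indic scripts, which is exactly the gap a transliteration service has to fill.

```python
import re
import unicodedata


def to_ascii(text):
    """Fold accented Latin letters to ASCII (stand-in, NOT a real transliterator)."""
    # NFKD splits a letter such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, then discard anything still outside ASCII.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")


def urlize(title, transliterate=to_ascii):
    """Build a lowercase, hyphen-separated URI segment from a title."""
    ascii_text = transliterate(title).lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text).strip("-")
```

With this fallback a French title comes out as a readable segment, while a Punjabi title collapses to an empty string, so a script-aware function would have to be passed in as `transliterate` for such languages.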
In order to solve this I wrote an XS wrapper (attached) around IBM's
ICU 2.0 libraries, which I'm in the process of putting on CPAN. It
neatly wraps the ICU transliteration services, which cover plenty of
languages / character sets, etc.
Another cool thing about it is that it eases document indexing. It is
actually possible to transliterate documents first and then store a
bunch of plain old ASCII keywords, which makes all databases very
happy. Besides, it makes it possible to perform searches based on
transliterated ASCII strings, which is nice when you don't have a
Punjabi keyboard to input search keywords, for instance.
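
The indexing idea can be sketched roughly as follows, again in Python with the standard library only. Everything here is an assumption made for illustration: the function names, the in-memory dictionary standing in for a database, and the accent-folding `to_ascii` standing in for a real transliterator. The point is only that documents and queries go through the same transliteration before keywords are compared.

```python
import re
import unicodedata
from collections import defaultdict


def to_ascii(text):
    """Fold accented Latin letters to ASCII (stand-in, NOT a real transliterator)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")


def keywords(text, transliterate=to_ascii):
    """Transliterate, lowercase, and split text into plain ASCII keywords."""
    return re.findall(r"[a-z0-9]+", transliterate(text).lower())


def build_index(documents, transliterate=to_ascii):
    """Map each ASCII keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in keywords(text, transliterate):
            index[word].add(doc_id)
    return index


def search(index, query, transliterate=to_ascii):
    """Return ids of documents containing every keyword of the query."""
    words = keywords(query, transliterate)
    if not words:
        return set()
    # The query goes through the same transliteration as the documents.
    return set.intersection(*(index.get(word, set()) for word in words))
```

A query typed on a plain ASCII keyboard, such as "creme", then matches documents whose original text contains "Crème".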
However, I'm having quite a lot of trouble with Japanese because of
kanji (Chinese ideograms). ICU does provide Hiragana <=> Latin and
Katakana <=> Latin, but doesn't do anything about kanji. That does
not surprise me too much, given that in Japanese a kanji very often
has more than one pronunciation depending on how and where it's
used.
Another problem with Japanese is that, as far as I can tell, words are
not separated by spaces. Therefore, even if the transliteration worked
for kanji, I'd end up with long unbroken strings, which is not good
for indexing when you try to split text into keywords.
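
The splitting problem is easy to reproduce. The sketch below (Python, hypothetical function names) shows whitespace splitting returning a whole Japanese phrase as a single "keyword". It also shows overlapping character bigrams, one commonly used workaround for indexing unsegmented text. That workaround is an assumption added for illustration, not something the wrapper described above does, and it is not real word segmentation.

```python
def whitespace_keywords(text):
    """Split on whitespace, which is enough for space-delimited languages."""
    return text.split()


def bigrams(text):
    """Return overlapping two-character chunks, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]
```

Indexing bigrams lets a substring query match without knowing where words begin or end, at the cost of a larger index and some false matches.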
Any ideas? I'm quite worried about the fact that I have a webapp that
works perfectly for Punjabi but that kind of screws Japanese up when
creating new documents and performing searches :-(
Cheers,
--
IT'S TIME FOR A DIFFERENT KIND OF WEB
================================================================
Jean-Michel Hiver - Software Director
jhiver(_at_)mkdoc(_dot_)com
+44 (0)114 221 4968
================================================================
VISIT HTTP://WWW.MKDOC.COM
Attachment: Unicode-Transliterate-0.2.tgz (application/tar-gz)