perl-unicode

Unicode / Japanese and Transliteration problem

2002-01-31 05:32:13
Hi Perl Unicode geeks,

  I'm currently making our web application (MKDoc) support more than
  just Western European languages. Being a French lad working in England
  who's currently learning Japanese (mainly because of anime movies, I
  admit it :)), I thought I had to do it the "Right Way", i.e. going
  all the way with Unicode.

  One of the few problems I've run into with Unicode is building
  human-readable URIs from Unicode strings. It's not much of a
  problem when constructing URIs from English titles, but it
  becomes less obvious when "URLizing" titles in languages such as
  Punjabi or Gujarati.
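
  To make the problem concrete, here is a minimal Python sketch (not
  MKDoc code; the function name slugify is made up for illustration).
  It builds a URI fragment by decomposing accented letters and dropping
  whatever is not ASCII. That is fine for French, but a Punjabi title
  comes out empty, which is why a real transliteration step is needed.

```python
import re
import unicodedata

def slugify(title):
    # Decompose accented letters, e.g. "é" becomes "e" + a combining accent.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop every character that is not plain ASCII (accents, other scripts).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

print(slugify("Café à Paris"))   # cafe-a-paris
print(repr(slugify("ਪੰਜਾਬੀ")))     # '' (nothing survives without transliteration)
```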

  To solve this I wrote an XS wrapper around IBM's ICU 2.0
  libraries (attached), which I'm in the process of putting on CPAN. It
  neatly wraps the ICU transliteration services, which cover plenty of
  languages, character sets, etc.

  Another cool thing about it is that it eases document indexing. You
  can transliterate documents first and then store a bunch
  of plain old ASCII keywords, which makes all databases very happy.
  Besides, it makes it possible to search on
  transliterated ASCII strings, which is nice when you don't have a
  Punjabi keyboard to type search keywords, for instance.
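
  The indexing idea looks roughly like the Python sketch below. The
  names build_index and toy_transliterate are hypothetical, and the
  tiny Greek-to-Latin table is only a stand-in for a real
  transliterator such as the ICU services; it covers just the letters
  used in the example.

```python
from collections import defaultdict

# Toy stand-in for a real transliterator: maps only a handful of Greek letters.
TOY_TABLE = {"κ": "k", "α": "a", "λ": "l", "ο": "o", "ν": "n", "ε": "e", "ρ": "r"}

def toy_transliterate(text):
    # Characters missing from the table are passed through unchanged.
    return "".join(TOY_TABLE.get(ch, ch) for ch in text)

def build_index(documents, transliterate):
    # Map each transliterated, lower-cased keyword to the ids of the
    # documents that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in transliterate(text).lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "καλο νερο", 2: "νερο"}
index = build_index(docs, toy_transliterate)
print(sorted(index["nero"]))  # [1, 2]
```

  A search typed on an ASCII keyboard ("nero") then finds both
  documents, even though neither was written in Latin script.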

  However, I'm having quite a lot of trouble with Japanese because of
  kanji (Chinese ideograms). ICU does provide Hiragana <=> Latin and
  Katakana <=> Latin, but doesn't do anything about kanji. That does
  not surprise me too much, given that in Japanese a kanji very
  often has more than one pronunciation depending on how and where it's
  used.

  Another problem with Japanese is that, as far as I can tell, words are
  not separated by spaces. So even if transliteration worked
  for kanji I'd end up with long unbroken strings, which is no good
  for indexing when you try to split text into keywords.
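
  A small Python sketch of what goes wrong (the helper name keywords is
  made up, and it assumes the indexer splits on whitespace, as a naive
  one would):

```python
def keywords(text):
    # Naive keyword extraction: split on whitespace only.
    return text.split()

english = "Unicode makes web applications happy"
japanese = "私は日本語を勉強しています"  # "I am studying Japanese"

print(len(keywords(english)))   # 5
print(len(keywords(japanese)))  # 1 (the whole sentence is a single "keyword")
```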

Any ideas? I'm quite worried that I have a webapp that
works perfectly for Punjabi but screws up Japanese when
creating new documents and performing searches :-(

Cheers,
-- 
IT'S TIME FOR A DIFFERENT KIND OF WEB
================================================================
  Jean-Michel Hiver - Software Director
  jhiver(_at_)mkdoc(_dot_)com
  +44 (0)114 221 4968
================================================================
                                      VISIT HTTP://WWW.MKDOC.COM

Attachment: Unicode-Transliterate-0.2.tgz
Description: application/tar-gz
