perl-i18n

Questions About Encoding

2003-05-11 11:21:49
Dear all,

    Hi.  I'm imacat, the author of Locale::Maketext::Gettext.  The
perl-i18n list doesn't seem to be a busy list. ^^;  I have some
questions about encoding here.

    The Locale::Maketext functions are not multi-byte safe, as some of
you may have noticed.  In particular, they are not safe for the
Traditional Chinese Big5 encoding.  Big5 has some characters that
contain "]" or "[" (0x5D or 0x5B) as their second byte, and that byte
cannot (and should not) be escaped in the middle of a full character.
(Otherwise the character would break in half.)
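
    To make the Big5 issue concrete, here is a minimal sketch in Python
(used only for illustration; it is not part of Locale::Maketext, and the
helper name big5_chars_ending_in is hypothetical).  It assumes Python's
built-in "big5" codec and simply lists two-byte characters whose second
byte is "[" or "]".

```python
def big5_chars_ending_in(trail_byte):
    """Return (character, raw_octets) pairs for two-byte Big5 characters
    whose second byte equals trail_byte."""
    found = []
    # 0xA4-0xC6 are the lead bytes of the frequently used Big5 characters.
    for lead in range(0xA4, 0xC7):
        raw = bytes([lead, trail_byte])
        try:
            ch = raw.decode("big5")
        except UnicodeDecodeError:
            continue  # not an assigned code point in this codec
        if len(ch) == 1:
            found.append((ch, raw))
    return found


open_bracket = big5_chars_ending_in(0x5B)   # second byte is "["
close_bracket = big5_chars_ending_in(0x5D)  # second byte is "]"

for ch, raw in open_bracket[:3]:
    # The Big5 octets contain a literal "[", the UTF-8 octets never do.
    print(ch, raw, b"[" in raw, b"[" in ch.encode("utf-8"))
```

    Any parser that scans the raw Big5 octets for brackets will see a
"[" inside these characters; the same characters in UTF-8 contain only
octets of 0x80 and above.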

    I tried to solve this problem in my Locale::Maketext::Gettext.
By obtaining %Lexicon from a foreign source that carries its encoding
information (like gettext MO files), it is possible to decode %Lexicon
into perl's internal encoding (currently UTF-8) before maketext.  And
UTF-8 is, as far as I can tell, multi-byte safe for maketext.  After
maketext, I can turn the result back into whatever encoding the
application needs.  Here Locale::Maketext::Gettext becomes a wrapper
around Locale::Maketext for this encoding issue.
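
    The decode, maketext, encode sequence described above can be
sketched roughly as follows.  This is Python for illustration only, not
the actual Locale::Maketext::Gettext code: toy_maketext is a
hypothetical stand-in that handles nothing but [_1], [_2], ..., and
wrapped_maketext is an assumed name for the wrapper step.

```python
import re


def toy_maketext(template, *params):
    """Hypothetical stand-in for maketext: replaces [_1], [_2], ... only."""
    return re.sub(r"\[_(\d+)\]",
                  lambda m: str(params[int(m.group(1)) - 1]),
                  template)


def wrapped_maketext(raw_template, encoding, *params):
    text = raw_template.decode(encoding)   # 1. decode the lexicon entry
    result = toy_maketext(text, *params)   # 2. brackets are parsed on characters
    return result.encode(encoding)         # 3. encode for the application


# A Big5 character whose second byte is 0x5B ("["), followed by a placeholder.
raw = bytes([0xA4, 0x5B]) + b" [_1]"
out = wrapped_maketext(raw, "big5", 3)

print(raw.count(b"["))                 # brackets seen in the raw octets
print(raw.decode("big5").count("["))   # brackets seen after decoding
```

    After decoding, the only "[" left is the one that really opens the
placeholder, so the character is never split.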

    However, there are still some problems:

 1. The case where the _AUTO lexicon is used with a multi-byte key.
That is, a non-English multi-byte string is used as the key to search
the lexicon, and when it is not found, the key itself is compiled into
the language function and used to return the result.  In such a case,
the multi-byte safety of the key itself becomes a problem.

    Of course, this may not be a problem for those of you in the
English-speaking world, or for those of you working on public-domain
software.  The keys you are using should always be in English
US-ASCII.  Even for GNU gettext, non-English message IDs are
discouraged in the documentation.

    But, in the real world, this may not be the case.  Several clients
of my company require a new multilingual website, starting from their
original unilingual website in their native language and encoding.
Half of my unilingual websites utilize the same message transmission
subroutine used in the other, multilingual half.  It doesn't make
sense to translate their messages into English and maketext them back,
just to fit the maketext multi-byte safety.

    To fix this (multi-byte safety in the message key) I tried to
implement a new "key_encoding".  It is used to declare the encoding used
in the message keys.  When maketext is looking for messages, the message
key is first decoded from its key_encoding into perl's internal
encoding.  This approach fixed the multi-byte safety problem with the
_AUTO lexicon, at the cost of complexity.
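
    As a rough sketch of the key_encoding idea (again Python for
illustration; the lookup function and its auto flag are hypothetical and
only mimic the _AUTO fallback, they are not the module's API): the key
is decoded first, so whatever ends up being compiled is already
character data.

```python
def lookup(lexicon, key_octets, key_encoding, auto=True):
    """Find the template for a key given as octets in key_encoding."""
    key = key_octets.decode(key_encoding)  # decode before any lookup
    if key in lexicon:
        return lexicon[key]
    if auto:
        # _AUTO-style fallback: the decoded key itself becomes the template.
        return key
    raise KeyError(key)


# A key in Big5 whose first character has "[" (0x5B) as its second byte.
key_octets = bytes([0xA4, 0x5B]) + b" [_1]"
template = lookup({}, key_octets, "big5")

print(key_octets.count(b"["))  # brackets in the undecoded key
print(template.count("["))     # brackets in the template to be compiled
```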

 2. Besides the key encoding, there is still another piece of text that
may pose an encoding problem: the message parameters.  This is
different from the above.  The message parameters do not pose a
multi-byte safety problem to the maketext language function, since
they are applied after the text is compiled.  But they may not come in
perl's internal encoding.  If the maketext result is going to be
encoded into the application's preferred encoding, as my first
approach does, encoding confusion occurs.  That is, if I pass a piece
of Big5 octets as the parameter, it will be inserted into a piece of
UTF-8 octets, and the whole will be encoded from UTF-8 to Big5, even
though part of the octets is already Big5, not UTF-8.
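
    The confusion can be reproduced in a few lines.  This is only a
Python sketch of the octet mix-up, with made-up variable names; Python
happens to refuse the mixed string loudly, while other environments may
produce garbled output silently instead.  The second half assumes the
caller knows the parameter is Big5, which is exactly what a wrapper
cannot know in general.

```python
template_utf8 = "value: [_1]".encode("utf-8")
param_big5 = bytes([0xA4, 0x5B])  # one Big5 character, as raw octets

# Big5 octets dropped into UTF-8 octets, then treated as UTF-8 as a whole.
mixed = template_utf8.replace(b"[_1]", param_big5)
try:
    mixed.decode("utf-8").encode("big5")
    mixed_ok = True
except UnicodeDecodeError:
    mixed_ok = False  # 0xA4 is not a valid start byte in UTF-8

# Assumption: the parameter's encoding is known, so it is decoded first
# and the substitution happens on characters.
text = template_utf8.decode("utf-8").replace("[_1]", param_big5.decode("big5"))
fixed = text.encode("big5")

print(mixed_ok, fixed)
```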

    So I'm confused.  I would have to substitute after encoding, so that
[_1] always behaves exactly like printf, which keeps its hands off the
encoding of the parameters.  This is far beyond what a wrapper of
Locale::Maketext can do.  But is it possible with the current
Locale::Maketext design?

    Of course we can go back to the world without the encoding problem,
as the original Locale::Maketext does.  But that isn't reasonable in my
multi-byte world.  I am starting to miss the world of simplicity where
the wonky GNU gettext lives.

--
Best regards,
imacat ^_*' <imacat(_at_)mail(_dot_)imacat(_dot_)idv(_dot_)tw>
PGP Key: http://www.imacat.idv.tw/me/pgpkey.txt

<<Woman's Voice>> News: http://www.wov.idv.tw/
Tavern IMACAT's: http://www.imacat.idv.tw/
TLUG List Manager: http://www.linux.org.tw/mailman/listinfo/tlug
