RFC: Japanese Text Conversion and other language issues

2002-11-30 21:23:12
If you've been following the CVS commits to mhonarc, you will
have noticed that MHonArc::CharEnt has been updated to support the
conversion of a variety of new character sets.

The most recent additions (2002/11/30) are iso-2022-jp and euc-jp.
Hence, I ask the following question:

    Should MHonArc::CharEnt replace the current default
    CHARSETCONVERTER for iso-2022-jp text?

MHonArc::CharEnt and the current default converter approach the
Japanese conversion problem differently.  The current converter does
very little "conversion" since it assumes that clients will be
configured for iso-2022-jp.  I.e., it is very locale specific, but it
has served the Japanese user base well.

MHonArc::CharEnt tries to map everything to HTML entity references,
which allows multiple languages to co-exist.  For example, with the
use of Unicode character entity references, I can view Japanese,
Chinese, Russian, etc. messages in the same archive *without* having
to set the document character set in the HTML (via the MSGPGBEGIN
resource) and *without* manually switching the encoding in my browser.
Think of it as an end-around the charset soup problem that avoids
converting everything to utf-8, which requires Perl v5.6.1, or later,
and the Unicode::* modules (side note: MHonArc::CharEnt converts utf-8
into Unicode character entity references).
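
To make the entity-reference idea concrete, here is a rough sketch in
Python.  It is NOT MHonArc::CharEnt's code (that is Perl and uses its
own mapping tables); the function name to_entity_refs is made up for
illustration, and Python's built-in codecs stand in for the charset
maps.  It decodes bytes in a declared charset and writes every
non-ASCII character as a numeric character reference, so text from
different charsets can share one ASCII-only page.

```python
import html


def to_entity_refs(raw: bytes, charset: str) -> str:
    """Decode `raw` from `charset` and return ASCII-only HTML text.

    Hypothetical helper for illustration; not part of MHonArc.
    """
    text = raw.decode(charset)
    # Escape the HTML-special characters (&, <, >) first.
    escaped = html.escape(text, quote=False)
    # Replace every non-ASCII character with a decimal reference (&#NNNN;).
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # Messages in three different charsets end up in one ASCII page.
    messages = [
        ("日本語".encode("iso2022_jp"), "iso2022_jp"),
        ("日本語".encode("euc_jp"), "euc_jp"),
        ("Привет".encode("koi8_r"), "koi8_r"),
    ]
    for raw, charset in messages:
        print(charset, to_entity_refs(raw, charset))
```

The iso-2022-jp and euc-jp inputs produce the same references because
both encode the same Unicode characters; the browser only needs a font
with the glyphs, not a matching charset setting.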

MHonArc::CharEnt tries to keep all raw characters in the final
HTML message page in the ASCII domain.  The Unicode character
entity reference approach does rely on the browser to support it,
which modern browsers do.  I use Galeon (which uses the
Mozilla Gecko rendering engine), and it properly loads the various
font glyphs for non-English characters.

I definitely welcome feedback from fellow Japanese users.  I know
there has been some past (sometimes heated) discussion about Japanese
text conversion, so I leave it to those experienced in dealing with
Japanese data to decide what the best answer to the question is.
I do ask that you consider users in non-Japanese locales who deal
with multiple languages, including Japanese.

A related question that impacts mharc: Does Namazu handle Unicode
character entity references?


P.S. Snapshot builds dated 2002/12/01 and later will contain the
Japanese conversion support in MHonArc::CharEnt.  If you are reading
this message before 2002/12/01 (U.S. CDT), then wait a day or check
things out directly from CVS.

P.P.S. I welcome testing of the new MHonArc::CharEnt, especially
with the new charsets that have been added.  I do not know a word
of Japanese, Chinese, Russian, et al., so it is very hard for me to
verify the accuracy of the conversions.  Testing under older
versions of Perl (<5.6.1) would be much appreciated.
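
For anyone who wants a quick sanity check on converted output, one
possible approach is a round-trip test: turn the entity references
back into characters and compare with the original text.  The sketch
below is in Python and is only an assumption about how such a check
could look; it does not call MHonArc, and the names to_entity_refs
and round_trips are hypothetical.  It catches lost or altered
characters, but it cannot tell you whether a charset table maps a
byte sequence to the *right* character, so a native reader is still
needed.

```python
import html


def to_entity_refs(text: str) -> str:
    """Return ASCII-only HTML text with numeric character references.

    Hypothetical stand-in for a converter's output; not MHonArc code.
    """
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def round_trips(raw: bytes, charset: str) -> bool:
    """True if the converted text is pure ASCII and decodes back intact."""
    text = raw.decode(charset)
    converted = to_entity_refs(text)
    is_ascii = all(ord(ch) < 128 for ch in converted)
    # html.unescape resolves both named and numeric references.
    return is_ascii and html.unescape(converted) == text


if __name__ == "__main__":
    samples = [
        ("日本語 & <test>".encode("iso2022_jp"), "iso2022_jp"),
        ("日本語".encode("euc_jp"), "euc_jp"),
        ("Привет".encode("koi8_r"), "koi8_r"),
    ]
    for raw, charset in samples:
        print(charset, "ok" if round_trips(raw, charset) else "MISMATCH")
```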

  • RFC: Japanese Text Conversion and other language issues, Earl Hood <=