Re: [Q] assume charset in raw 8bit headers

On January 11, 2003 at 19:02, Tomohiro KUBOTA wrote:

I have a question on CharsetConverters.  I am planning to
use a UTF-8 filter like the following.

<CharsetConverters override>
plain;          mhonarc::htmlize;
us-ascii;       mhonarc::htmlize;
default;        MHonArc::UTF8::str2sgml;     MHonArc/UTF8.pm
</CharsetConverters>

...

Thus, I'd like to assume that all raw 8-bit characters are KOI8-R
and convert these 8-bit characters into either
 - SGML entity expressions,
 - &#xxx; expressions where xxx means the decimal Unicode code point, or
 - UTF-8 characters.
How can I configure MHonArc to achieve this?


It looks like you will need to try out the latest development version
to achieve what you want.  The latest development version is now
frozen for new functionality and is being evaluated for any major
problems before release (and I'm looking for as many people as possible
to test things out before the release).  You can grab a copy of the
development version from <http://www.mhonarc.org/release/MHonArc/tar/>.
Just grab one of the -snap bundles.

With the latest code, much more character encoding support has been added,
including Russian sets like KOI8-R.

One way to get what you want with the latest snapshot build is with
the following resource settings:

  <!-- Want everything to go to UTF-8 -->
  <CharsetConverters override>
  us-ascii;       mhonarc::htmlize;
  default;        MHonArc::UTF8::str2sgml;     MHonArc/UTF8.pm
  </CharsetConverters>

  <!-- Make sure to register UTF-8-aware clipping function -->
  <TextClipFunc>
  MHonArc::UTF8::clip; MHonArc/UTF8.pm
  </TextClipFunc>

  <!-- Alias the special "plain" set to koi8-r to deal with
       improper mail headers -->
  <CharsetAliases>
  koi8-r; plain
  </CharsetAliases>

  <!-- If no charset specified, assume koi8-r as the default
       instead of us-ascii -->
  <DefCharset>
  koi8-r
  </DefCharset>

  <!-- ... HERE define *PGBEGIN resource to denote utf-8 document
           character set with <meta http-equiv="content-type"> tag.
           See utf-8.mrc example resource file in distribution.
       ... -->
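
To illustrate what these settings aim for, here is a minimal Python
sketch (not MHonArc code, which is Perl) of treating raw 8-bit header
bytes as KOI8-R and producing either UTF-8 text or decimal &#xxx;
references.  The function names are made up for this example, and
MHonArc's own converters may differ in detail.

```python
import html


def koi8r_to_utf8_html(raw: bytes) -> str:
    """Interpret raw bytes as KOI8-R and escape HTML special characters."""
    text = raw.decode("koi8-r")
    return html.escape(text, quote=False)


def koi8r_to_numeric_refs(raw: bytes) -> str:
    """Same as above, but emit non-ASCII characters as decimal &#xxx; refs."""
    text = html.escape(raw.decode("koi8-r"), quote=False)
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# A header value that arrived as undeclared 8-bit data (assumed KOI8-R).
raw_header = "Тест <a&b>".encode("koi8-r")

print(koi8r_to_utf8_html(raw_header))
print(koi8r_to_numeric_refs(raw_header))
```

The first form needs the page to declare utf-8 as its character set;
the second is plain ASCII and works regardless of the declared set.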

Another way would be:

  <!-- TEXTENCODE maps all character data to a given
       character encoding when messages are first read.
    -->
  <TextEncode>
  utf-8; MHonArc::UTF8::to_utf8; MHonArc/UTF8.pm
  </TextEncode>

  <!-- With data translated to UTF-8, it simplifies CHARSETCONVERTERS -->
  <CharsetConverters override>
  default; mhonarc::htmlize
  </CharsetConverters>

  <!-- Need to also register UTF-8-aware text clipping function -->
  <TextClipFunc>
  MHonArc::UTF8::clip; MHonArc/UTF8.pm
  </TextClipFunc>

  <!-- Alias the special "plain" set to koi8-r to deal with
       improper mail headers -->
  <CharsetAliases>
  koi8-r; plain
  </CharsetAliases>

  <!-- If no charset specified, assume koi8-r as the default
       instead of us-ascii -->
  <DefCharset>
  koi8-r
  </DefCharset>

  <!-- ... HERE define *PGBEGIN resource to denote utf-8 document
           character set with <meta http-equiv="content-type"> tag.
           See utf-8-encode.mrc example resource file in distribution.
       ... -->

Using the TEXTENCODE method is probably more efficient overall.
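
Both configurations register a UTF-8-aware clipping function.  The
likely reason is that UTF-8 characters can span several bytes, so
truncating by byte count can cut a character in half.  The Python
sketch below shows only that general idea; it does not reproduce the
behavior or interface of MHonArc::UTF8::clip, and the function names
are hypothetical.

```python
def clip_bytes_naive(data: bytes, max_bytes: int) -> bytes:
    """Byte-oriented clip: may cut a multibyte UTF-8 sequence in half."""
    return data[:max_bytes]


def clip_utf8(data: bytes, max_chars: int) -> bytes:
    """Character-oriented clip: decode, slice by code point, re-encode."""
    return data.decode("utf-8")[:max_chars].encode("utf-8")


subject = "Привет, мир".encode("utf-8")

# Check whether the naively clipped bytes are still valid UTF-8.
try:
    clip_bytes_naive(subject, 5).decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print("naive clip still valid UTF-8:", naive_ok)
print("aware clip:", clip_utf8(subject, 5).decode("utf-8"))
```

Each Cyrillic letter takes two bytes in UTF-8, so a 5-byte cut lands
in the middle of the third letter, while the character-aware version
returns five whole characters.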

Make sure to test the above first to confirm things work as
you want.  If you have any problems, you should follow up on the
mhonarc-dev(_at_)mhonarc(_dot_)org mailing list since the above is not yet
provided in an official release.

The snapshot builds do contain updated documentation (excluding the
nodoc bundles where the docs are not present).  You can also
check out the latest docs at
<http://www.mhonarc.org/release/MHonArc/snapshot/doc/>.

Check out the CHARSETCONVERTERS and TEXTENCODE resource pages for
more details about these resources.  Pages can currently be seen via
the Web at:
<http://www.mhonarc.org/release/MHonArc/snapshot/doc/resources/charsetconverters.html>
<http://www.mhonarc.org/release/MHonArc/snapshot/doc/resources/textencode.html>

You may want to start with the TEXTENCODE page since it provides
information on the differences and relationships of TEXTENCODE and
CHARSETCONVERTERS.

Side Note: You will notice that the docs mention Unicode::MapUTF8, and
MHonArc may use it depending on your Perl installation.  However,
I noticed conversion problems with Unicode::MapUTF8 when dealing
with Japanese character data: it did nothing, and it did not
complain.  It may be that I did not install the Jcode module correctly,
or that Unicode::MapUTF8 is failing to recognize it.

Therefore, if you are using a version of Perl older than 5.8 and have
Unicode::MapUTF8 installed, run tests with Japanese messages.  If you
run into problems, either use 5.8 (since the Encode module is available)
or uninstall Unicode::MapUTF8 and let MHonArc use the fallback
implementation for conversion to UTF-8.
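
One way to catch that kind of silent failure is to feed a converter a
known Japanese sample and compare the result against the expected
UTF-8 bytes.  The Python sketch below only shows the shape of such a
test; check_converter and both example converters are hypothetical
stand-ins, not MHonArc or Unicode::MapUTF8 code.

```python
def check_converter(convert, sample="日本語のテスト", charset="iso-2022-jp"):
    """Classify a converter as 'ok', 'no-op', or 'wrong' on one sample."""
    raw = sample.encode(charset)
    expected = sample.encode("utf-8")
    result = convert(raw, charset)
    if result == raw:
        return "no-op"  # input came back untouched and no error was raised
    if result != expected:
        return "wrong"
    return "ok"


def working_converter(raw: bytes, charset: str) -> bytes:
    """Stand-in for a converter that really transcodes to UTF-8."""
    return raw.decode(charset).encode("utf-8")


def silent_converter(raw: bytes, charset: str) -> bytes:
    """Stand-in for a converter that does nothing and does not complain."""
    return raw


print(check_converter(working_converter))
print(check_converter(silent_converter))
```

With a real archive, the equivalent check is to convert a few Japanese
messages and confirm the resulting pages display correctly as UTF-8.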

I am considering dropping support for Unicode::MapUTF8 since the
Encode module supersedes it and is standard with Perl 5.8.  Also,
it appears that Unicode::MapUTF8 is not being actively maintained
anymore.

--ewh
