TEXTENCODE

Syntax
Description
- TEXTENCODE vs CHARSETCONVERTERS
- Writing Encoders
Available Encoders
- MHonArc::UTF8::to_utf8
- MHonArc::Encode::from_to
Default Setting
Resource Variables
Examples
Version
See Also

Syntax

Envariable: N/A
Element: <TEXTENCODE>
charset; perl_function; source_file
</TEXTENCODE>
Command-line Option: N/A

Description

TEXTENCODE allows you to specify a destination character encoding for all message text data. For each message read, textual data (including message header fields and text body parts) is translated from the charset(s) used in the message to the charset specified by the TEXTENCODE resource.

For example, the following resource setting,

<TextEncode>
utf-8; MHonArc::UTF8::to_utf8; MHonArc/UTF8.pm
</TextEncode>

converts message text to UTF-8 (Unicode) by using the MHonArc::UTF8::to_utf8 function. A list of available encoding functions is provided below.

NOTE:

The terms character set (charset) and character encoding are used interchangeably within MHonArc documentation. The reason is that charset is the term used within the MIME RFCs, although it blurs the concepts of character encoding and coded character set, and probably a few other things. For the purposes of this document, such details are not really necessary, but if you want to learn more, see Unicode Technical Report #17: Character Encoding Model and Character Set Considered Harmful.

The syntax of the TEXTENCODE resource is as follows:

charset;routine-name;file-of-routine

The definition of each semi-colon-separated value is as follows:

charset: Character set name. See CHARSETCONVERTERS and CHARSETALIASES for the character sets MHonArc is aware of. The official list of registered character sets for use on the Internet is available from IANA.
routine-name: The actual routine name of the encoder. The name should be fully qualified by the package it is defined in (e.g. "mypackage::filter").
file-of-routine: The name of the file that defines routine-name. If the file is not a full pathname, MHonArc finds the file by looking in the standard include paths of Perl, and the paths specified by the PERLINC resource.
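
As a rough illustration of how a resource value breaks down into the three fields described above, the following Perl sketch splits a value on semicolons. The helper name `split_textencode` is hypothetical and this is not MHonArc's own parser; it only mirrors the documented syntax.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MHonArc): split a TEXTENCODE value
# into its charset, routine-name, and file-of-routine fields.
sub split_textencode {
    my ($value) = @_;
    $value =~ s/^\s+//;    # trim leading whitespace
    $value =~ s/\s+$//;    # trim trailing whitespace
    my ($charset, $routine, $file) = split /\s*;\s*/, $value, 3;
    return ($charset, $routine, $file);
}

my ($charset, $routine, $file) =
    split_textencode('utf-8; MHonArc::UTF8::to_utf8; MHonArc/UTF8.pm');
print "charset=$charset routine=$routine file=$file\n";
```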

TEXTENCODE vs CHARSETCONVERTERS

It is important to clarify the differences between TEXTENCODE and CHARSETCONVERTERS since reading about both resources may generate confusion.

The main difference between TEXTENCODE and CHARSETCONVERTERS is that TEXTENCODE is applied as the message is read, before the message is converted to HTML. TEXTENCODE's primary role is converting characters from one charset to another charset. CHARSETCONVERTERS' role is to convert characters into HTML.

The following crude text diagram shows the path message text data takes when converted to HTML:

  message-text --> TEXTENCODE --> CHARSETCONVERTERS --> HTML

In addition, TEXTENCODE is applied only once to message text data. MHonArc stores some message header information in the archive database. Without TEXTENCODE, that header text is stored in "raw" form, which can include non-ASCII MIME encoded data like the following:

From: =?US-ASCII?Q?Earl_Hood?= <earl@earlhood.com>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
 =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

Therefore, when a resource variable like $SUBJECT$ is used, the =?ISO-8859-1?B?SWYgeW9... data must be parsed and converted every time.

If TEXTENCODE is active, the message subject =?ISO-8859-1?B?SWYgeW9... is parsed and translated to the destination encoding when the message is first parsed, with the final result stored in the archive database. The non-ASCII MIME encoding is removed, and it no longer has to be parsed each time $SUBJECT$ is used.

In either case, CHARSETCONVERTERS is still invoked. Without TEXTENCODE, CHARSETCONVERTERS must handle every character set that appears in raw data such as =?ISO-8859-1?B?SWYgeW9.... With TEXTENCODE, CHARSETCONVERTERS only has to deal with the charset specified in the TEXTENCODE resource. Therefore, using TEXTENCODE greatly simplifies the CHARSETCONVERTERS resource value, since only the default converter needs to be defined. This is highlighted in the Usage sections for the Available Encoders listed below.
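
To make the "parse once, store the result" idea concrete, here is a minimal Perl sketch that decodes the sample Subject value shown earlier and re-encodes it into a single destination charset. It uses the core Encode module's MIME-Header encoding purely for illustration; it is an assumption that this resembles the work involved, not a description of MHonArc's internal code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw Subject value as it appears in the message header (folded line).
my $raw = "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\n"
        . " =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=";

# Step 1: parse the encoded-words and decode each from its own charset.
my $chars = decode('MIME-Header', $raw);

# Step 2: encode the result once into the single destination charset.
my $stored = encode('UTF-8', $chars);

print "$stored\n";
```

Once a value like `$stored` is kept instead of `$raw`, later consumers only ever see one charset.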

Writing Encoders

NOTE:

Before writing your own, first check out the list of available encoders to see if one already exists that satisfies your needs.

If you want to write your own text encoder for use in MHonArc, you need to know the Perl programming language. The following information assumes you know Perl.

Function Interface of Encoder

MHonArc interfaces with an encoder by calling a routine with a specific set of arguments. The prototype of the interface routine is as follows:

sub text_encoder {
    my($text_ref, $from_charset, $to_charset) = @_;

    # code here

    # The last statement should be the return value, unless an
    # explicit return is done. See the following for the format of the
    # return value.
}

Parameter Descriptions

`$text_ref`	A reference to the string to encode. The routine should do the text encoding in-place.
`$from_charset`	Name of the source character encoding of `$$text_ref`.
`$to_charset`	The destination encoding for `$$text_ref`. `$to_charset` is set to the `charset` component of the TEXTENCODE resource value.

Return Value

On error, the routine should return undef. Otherwise, it should return any true value.

CAUTION:

If your routine encounters an error, try to preserve the original value of $$text_ref or data may be lost in archive output.
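
The interface, return value, and caution above can be sketched as a small encoder built on the core Encode module. The package name `MyEncoder` is hypothetical, and this is only one possible way to satisfy the interface, not an encoder shipped with MHonArc. The conversion is done on a copy so that `$$text_ref` keeps its original value if anything fails.

```perl
package MyEncoder;    # hypothetical package name, for illustration only

use strict;
use warnings;
use Encode ();

sub text_encoder {
    my ($text_ref, $from_charset, $to_charset) = @_;

    # Work on a copy so the caller's text is untouched on failure.
    my $copy = $$text_ref;

    # Encode::from_to converts its first argument in place. It croaks on
    # an unknown charset name, so the call is wrapped in eval.
    my $ok = eval {
        defined Encode::from_to($copy, $from_charset, $to_charset);
    };
    return undef unless $ok;    # error: original text preserved

    $$text_ref = $copy;         # success: replace the text in place
    return 1;
}

1;
```

Under these assumptions, such a routine could be registered with a resource value like `utf-8; MyEncoder::text_encoder; MyEncoder.pm`, where the file name is likewise hypothetical.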

Available Encoders

The standard MHonArc distribution provides the following character encoding routines:

`MHonArc::UTF8::to_utf8`

Usage

<TextEncode>
utf-8; MHonArc::UTF8::to_utf8; MHonArc/UTF8.pm
</TextEncode>

<!-- With data translated to UTF-8, it simplifies CHARSETCONVERTERS -->
<CharsetConverters override>
default; mhonarc::htmlize
</CharsetConverters>

<!-- Need to also register UTF-8-aware text clipping function -->
<TextClipFunc>
MHonArc::UTF8::clip; MHonArc/UTF8.pm
</TextClipFunc>

Description

MHonArc::UTF8::to_utf8 converts text to UTF-8 (Unicode). Unicode is designed to represent all characters of all languages. UTF-8 is an encoding of Unicode that is 8-bit clean and immune to the byte ordering of computer systems. Most modern browsers support UTF-8, which makes it a good choice for multi-lingual archives.

MHonArc::UTF8::to_utf8 is designed to work with older versions of Perl that do not support UTF-8, while also utilizing UTF-8-aware modules in later versions of Perl. MHonArc::UTF8 checks for the following, in order of preference, when loaded:

Encode: Encode comes standard with Perl v5.8 and provides conversion capabilities between various character encodings.
Unicode::MapUTF8: Unicode::MapUTF8 is available via CPAN, and provides conversion capabilities between various character encodings to, and from, UTF-8. Unicode::MapUTF8 depends on other modules; see the Unicode::MapUTF8 module documentation for details.
fallback: If none of the above are present, then the fallback implementation is used. Fallback code is written in pure Perl, so it may not be as efficient as the modules listed above. However, many popular character encodings are supported.

NOTE:
Fallback code is automatically invoked for character encodings not recognized by the above listed modules.
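
The order of preference described above can be sketched as a simple load check. The function name `pick_backend` is hypothetical, and this is an illustration of the idea rather than the code MHonArc::UTF8 actually uses.

```perl
use strict;
use warnings;

# Return the name of the first conversion backend that can be loaded,
# or 'fallback' when neither module is installed.
sub pick_backend {
    for my $module ('Encode', 'Unicode::MapUTF8') {
        (my $file = "$module.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
        return $module if eval { require $file; 1 };
    }
    return 'fallback';
}

print pick_backend(), "\n";
```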

`MHonArc::Encode::from_to`

Usage

<TextEncode>
charset; MHonArc::Encode::from_to; MHonArc/Encode.pm
</TextEncode>

Description

MHonArc::Encode::from_to converts text to the specified charset encoding. This routine is useful for locales that prefer to have all archive data translated to the locale-preferred character set.

NOTE:

Since most locale-specific character sets are not universal sets (like Unicode), characters may be lost during translation.
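
A short sketch of this kind of loss, using the core Encode module directly rather than MHonArc itself: text containing a CJK character is converted to ISO-8859-1, which has no code point for it. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for "cafe" with an e-acute, a space, and U+65E5.
my $octets = Encode::encode('UTF-8', "caf\x{E9} \x{65E5}");

# Convert in place to ISO-8859-1. U+65E5 has no ISO-8859-1 code point,
# so Encode substitutes its default replacement character for it.
Encode::from_to($octets, 'UTF-8', 'iso-8859-1');

# Print the resulting octets in hexadecimal.
printf "%vX\n", $octets;
```

The e-acute survives because ISO-8859-1 can represent it; the CJK character does not.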

NOTE:

For UTF-8 encoding, use MHonArc::UTF8::to_utf8 instead since it provides more robust fallback capabilities and works under non-UTF-8-aware versions of Perl.

MHonArc::Encode::from_to works only if one of the following modules is available. They are checked, in order of preference, when MHonArc::Encode is loaded:

Encode: Encode comes standard with Perl v5.8 and provides conversion capabilities between various character encodings.
Unicode::MapUTF8: Unicode::MapUTF8 is available via CPAN, and provides conversion capabilities between various character encodings to, and from, UTF-8. Unicode::MapUTF8 depends on other modules; see the Unicode::MapUTF8 module documentation for details.

No fallback implementations are available.

If converting to a multi-byte encoding, the default TEXTCLIPFUNC may not be adequate. Therefore, you may have to avoid using resource variables with maximum length specifiers.
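
The following Perl sketch shows why a clipping routine that counts octets can be inadequate for a multi-byte encoding. It assumes an octet-based clip purely for illustration; whether the default TEXTCLIPFUNC behaves exactly this way is not stated here. UTF-8 is used as the multi-byte encoding because the core Encode module supports it everywhere.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three CJK characters; each takes 3 octets in UTF-8 (9 octets total).
my $octets = encode('UTF-8', "\x{65E5}\x{672C}\x{8A9E}");

# Octet-oriented clip to 4 octets: cuts the second character apart.
my $byte_clip = substr($octets, 0, 4);

# Character-aware clip: decode, take 2 characters, re-encode.
my $char_clip = encode('UTF-8', substr(decode('UTF-8', $octets), 0, 2));

# utf8::decode returns false when its argument is not well-formed UTF-8.
my $probe = $byte_clip;
my $byte_clip_valid = utf8::decode($probe) ? 1 : 0;

print "byte clip valid: $byte_clip_valid, ",
      "char clip octets: ", length($char_clip), "\n";
```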

NOTE:

There is support for ISO-2022-JP (Japanese). The following resource settings should serve as a basis when encoding to iso-2022-jp:

<TextEncode>
iso-2022-jp; MHonArc::Encode::from_to; MHonArc/Encode.pm
</TextEncode>

<!-- Make sure to use iso-2022-jp aware charset converter -->
<CharsetConverters override>
default; iso_2022_jp::str2html; iso2022jp.pl
</CharsetConverters>

<!-- Need to also register iso-2022-jp-aware text clipping function -->
<TextClipFunc>
iso_2022_jp::clip; iso2022jp.pl
</TextClipFunc>

For more information about using MHonArc in a Japanese locale, see (documents in Japanese): <http://www.mhonarc.jp/>.

Examples

<TextEncode>
iso-2022-jp; MHonArc::Encode::from_to; MHonArc/Encode.pm
</TextEncode>

<!-- Make sure to use iso-2022-jp aware charset converter -->
<CharsetConverters override>
default; iso_2022_jp::str2html; iso2022jp.pl
</CharsetConverters>

<!-- Need to also register iso-2022-jp-aware text clipping function -->
<TextClipFunc>
iso_2022_jp::clip; iso2022jp.pl
</TextClipFunc>

<IdxPgBegin>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>$IDXTITLE$</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
</head>
<body>
<h1>$IDXTITLE$</h1>
</IdxPgBegin>

<TIdxPgBegin>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>$TIDXTITLE$</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
</head>
<body>
<h1>$TIDXTITLE$</h1>
</TIdxPgBegin>


<MsgPgBegin>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>$SUBJECTNA$</title>
<link rev="made" href="mailto:$FROMADDR$">
<meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
</head>
<body>
</MsgPgBegin>

Version

2.6.0