
CHARSETCONVERTERS


Syntax

Envariable

N/A

Element

<CHARSETCONVERTERS>
charset-filter-specification
</CHARSETCONVERTERS>

Command-line Option

N/A


Description

The CHARSETCONVERTERS resource specifies the Perl routines to call for converting text in a given character set into legal HTML. The conversion is applied to message header data encoded according to the MIME standard. The following example shows a header with encoded data:

From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
 =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

The CHARSETCONVERTERS resource is also used by text-based MIMEFILTERS for message body text.
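To make the header example above concrete, the following is a minimal sketch of how one encoded word breaks down into a character set name and raw bytes. It is not MHonArc's own decoder; the routine name decode_encoded_word is hypothetical, and only the core MIME::Base64 module is assumed.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Decode one encoded word of the form =?charset?encoding?text?= and
# return the charset name plus the raw, still charset-encoded bytes.
# (Illustrative helper only; not part of MHonArc.)
sub decode_encoded_word {
    my ($word) = @_;
    my ($charset, $encoding, $text) =
        $word =~ /^=\?([^?]+)\?([QqBb])\?([^?]*)\?=$/
        or return;
    my $bytes;
    if (uc($encoding) eq 'B') {
        $bytes = decode_base64($text);                    # base64 form
    } else {
        ($bytes = $text) =~ tr/_/ /;                      # "_" stands for a space
        $bytes =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # =XX stands for one byte
    }
    return (lc($charset), $bytes);
}

my ($charset, $bytes) =
    decode_encoded_word('=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?=');
# $charset is "iso-8859-1"; $bytes is "Keld J\xF8rn Simonsen"
```

The byte 0xF8 in the result is still iso-8859-1 data. Turning it into something safe for an HTML page is the job of the converter registered for that character set.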

The CHARSETCONVERTERS resource can only be defined via the resource file. Each line of the element specifies a character set, the Perl routine for filtering the character set, and the Perl source file containing the routine.

Example:

<CharsetConverters>
iso-8859-1; MHonArc::CharEnt::str2sgml; MHonArc/CharEnt.pm
</CharsetConverters>

The first field is the character set specification. The second field is the routine name (which should contain a package qualifier). The third field is the source file in which the routine is defined. The source file is searched for as defined by the PERLINC resource.
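As an illustration of the three-field layout, here is a short sketch that splits one line of the element into its parts. The routine name parse_converter_line is hypothetical and this is not MHonArc's actual resource parser; it only assumes that fields are separated by semicolons with optional surrounding whitespace.

```perl
use strict;
use warnings;

# Split one line of the element into charset, routine, and source file.
# The source file may be absent, as for routines built into MHonArc.
# (Illustrative helper only; not part of MHonArc.)
sub parse_converter_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;                      # trim surrounding whitespace
    my ($charset, $routine, $file) = split /\s*;\s*/, $line;
    return (lc($charset), $routine, $file);       # MIME charset names are case-insensitive
}

my @fields = parse_converter_line(
    'iso-8859-1; MHonArc::CharEnt::str2sgml; MHonArc/CharEnt.pm');
# @fields: ('iso-8859-1', 'MHonArc::CharEnt::str2sgml', 'MHonArc/CharEnt.pm')
```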

There are some special character set specifications. They are as follows:

plain

This specifies text that is not explicitly encoded in a specific character set. The MIME RFCs specify that unencoded data should be treated as us-ascii. However, in some locales, this may not be the case.

default

The default routine to invoke for encoded data if no converter is defined for the given character set.
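The role of the two special specifications can be sketched as a lookup with fallbacks. The table below is a hypothetical in-memory form of a few registrations and select_converter is an illustrative name; MHonArc's internal data structures may differ. The defined-or operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical in-memory form of a few registered converters.
my %converters = (
    'plain'      => 'mhonarc::htmlize',
    'iso-8859-1' => 'MHonArc::CharEnt::str2sgml',
    'default'    => '-ignore-',
);

# Pick the converter for a piece of text. $charset is undef or empty
# when the text is not explicitly labeled with a character set.
sub select_converter {
    my ($charset) = @_;
    return $converters{'plain'}
        unless defined($charset) && length($charset);
    # Fall back to "default" when nothing is registered for the charset.
    return $converters{ lc($charset) } // $converters{'default'};
}
```

With this table, unlabeled text is handled by the plain entry, iso-8859-1 text by its own entry, and any other labeled character set by the default entry.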

There are some special character set converter routine values. They are as follows:

-ignore-

Leave the data "as-is", i.e., the MIME encoding will be preserved.

-decode-

Just decode the data. This is useful if it is known that the character set is the native character set for the system.

WARNING:

If the decoded data contains the characters '<', '>', or '&', it may conflict with HTML markup. -decode- should only be used if DECODEHEADS is active. See Examples below and DECODEHEADS for example uses of -decode-.

Each charset converter function is invoked as follows:

$converted_data = &function($data, $charset);

The data passed in will already be decoded from quoted-printable or base64 (as specified by the MIME syntax). Therefore, the called routine will be passed the raw byte data. It is important that the routine convert the data into a format suitable for inclusion within HTML markup.
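A minimal sketch of a custom converter that follows this calling convention is shown below. The package name MyConverters, the routine name latin1_to_html, and the file name MyConverters.pm are all hypothetical; under those assumptions the routine would be registered with a line such as "iso-8859-1; MyConverters::latin1_to_html; MyConverters.pm".

```perl
package MyConverters;    # hypothetical package, saved as MyConverters.pm
use strict;
use warnings;

# Converter for iso-8859-1 data: receives the already decoded bytes and
# the charset name, and returns text that is safe to place in HTML.
sub latin1_to_html {
    my ($data, $charset) = @_;
    $data =~ s/&/&amp;/g;        # escape "&" first so later references survive
    $data =~ s/</&lt;/g;
    $data =~ s/>/&gt;/g;
    $data =~ s/"/&quot;/g;
    $data =~ s/[\x80-\x9F]/?/g;  # C1 control codes have no printable form
    # In iso-8859-1 each byte value equals the Unicode code point, so the
    # remaining high bytes can be written as numeric character references.
    $data =~ s/([\xA0-\xFF])/sprintf('&#x%02X;', ord($1))/ge;
    return $data;
}

1;    # a file loaded with require must return a true value
```

Escaping the ampersand before anything else matters: doing it last would also rewrite the ampersands of the references produced by the earlier substitutions.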


Available Converters

The standard MHonArc distribution provides the following converters:

mhonarc::htmlize

Usage
<CharsetConverters>
charset-name; mhonarc::htmlize
</CharsetConverters>

mhonarc::htmlize is provided by the MHonArc core code base, so no source file specification is required.

Description

mhonarc::htmlize does a simple replacement of HTML special characters into entity references. The characters '<', '>', '&', and '"' are converted to '&lt;', '&gt;', '&amp;', and '&quot;', respectively.
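The replacement described above can be sketched in a few lines. This is an illustration of the behavior, not the actual mhonarc::htmlize source; the names %ENTITY and htmlize_sketch are hypothetical.

```perl
use strict;
use warnings;

# The four HTML special characters and their entity references.
my %ENTITY = (
    '<' => '&lt;',
    '>' => '&gt;',
    '&' => '&amp;',
    '"' => '&quot;',
);

# Replace every HTML special in one pass. A single pass guarantees that
# the "&" of an inserted entity reference is never escaped a second time.
sub htmlize_sketch {
    my ($text) = @_;
    $text =~ s/([<>&"])/$ENTITY{$1}/g;
    return $text;
}

print htmlize_sketch('<a href="x">R&D</a>'), "\n";
# prints: &lt;a href=&quot;x&quot;&gt;R&amp;D&lt;/a&gt;
```

Bytes outside the us-ascii range pass through untouched.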

This converter is appropriate for us-ascii data and for situations where the given character set is an 8-bit set that matches the locale settings for the archives. For example, if an archive contains iso-8859-7 (Greek) text data and archive readers' browsers are set to iso-8859-7 as the default encoding, then mhonarc::htmlize can be used to prevent the overhead of Greek characters being converted to entity references.

If you will be managing archives that include messages with multiple character encodings, it is recommended to limit the use of mhonarc::htmlize to us-ascii only.

MHonArc::CharEnt::str2sgml

Usage
<CharsetConverters>
charset-name; MHonArc::CharEnt::str2sgml; MHonArc/CharEnt.pm
</CharsetConverters>
Description

MHonArc::CharEnt::str2sgml converts a variety of character encodings into HTML 4 standard character entity references (e.g. &AElig;) and/or numeric Unicode character references (e.g. &#x017D;). Characters in the us-ascii domain are left as-is, with the exception of HTML specials, which are converted like mhonarc::htmlize. MHonArc::CharEnt::str2sgml attempts to be locale neutral and should be sufficient for most locales.
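The general idea of mapping charset-specific bytes to numeric character references can be sketched with Perl's core Encode module. This is an assumption made for illustration only: it does not show how MHonArc::CharEnt is implemented, it emits numeric references only (no named entities), and the routine name to_char_refs is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Convert bytes in the given charset to us-ascii text in which every
# non-ASCII character is written as a numeric character reference.
# (Illustrative helper only; not part of MHonArc.)
sub to_char_refs {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);    # bytes -> Unicode characters
    $text =~ s/&/&amp;/g;                   # HTML specials, "&" first
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord($1))/ge;
    return $text;
}

# Byte 0xAE in iso-8859-2 is LATIN CAPITAL LETTER Z WITH CARON (U+017D).
print to_char_refs("\xAEivot", 'iso-8859-2'), "\n";
# prints: &#x017D;ivot
```

Because the output contains only us-ascii characters, it displays correctly regardless of the encoding declared for the archive pages.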

The following character sets/encodings are supported:

Charset/encoding    Description
us-ascii            US ASCII
iso-8859-1          Latin 1
iso-8859-2          Latin 2
iso-8859-3          Latin 3
iso-8859-4          Latin 4
iso-8859-5          Cyrillic
iso-8859-6          Arabic
iso-8859-7          Greek
iso-8859-8          Hebrew
iso-8859-9          Latin 5
iso-8859-10         Latin 6
iso-8859-11         Thai
iso-8859-13         Latin 7 (Baltic Rim)
iso-8859-14         Latin 8 (Celtic)
iso-8859-15         Latin 9 (aka Latin 0)
iso-8859-16         Latin 10
iso-2022-jp         Japanese
iso-2022-kr         Korean
euc-jp              Japanese
utf-8               Unicode UTF-8
cp866               MS-DOS Cyrillic
cp932               Japanese (Shift-JIS)
cp936               Chinese (GBK)
cp949               Korean
cp950               Windows Chinese
cp1250              Windows Latin 2
cp1251              Windows Cyrillic
cp1252              Windows Latin 1
cp1253              Windows Greek
cp1254              Windows Turkish
cp1255              Windows Hebrew
cp1256              Windows Arabic
cp1257              Windows Baltic
cp1258              Windows Vietnamese
koi-0               Cyrillic
koi-7               Cyrillic
koi8-a              Cyrillic
koi8-b              Cyrillic
koi8-e              Cyrillic
koi8-f              Cyrillic
koi8-r              Cyrillic
koi8-u              Cyrillic
gost-19768-87       Cyrillic
viscii              Vietnamese
big5-eten           Chinese (Taiwan)
big5-hkscs          Chinese (Hong Kong)
gb2312              Chinese
macarabic           Apple Arabic
maccentraleurroman  Apple Central Europe
maccroatian         Apple Croatian
maccyrillic         Apple Cyrillic
macgreek            Apple Greek
machebrew           Apple Hebrew
macicelandic        Apple Icelandic
macromanian         Apple Romanian
macroman            Apple Roman (Latin)
macthai             Apple Thai
macturkish          Apple Turkish
hp-roman8           HP Roman (Latin)

Most of the above listed charsets are also known by different names. See the CHARSETALIASES resource for details.

MHonArc::UTF8::str2sgml

Usage
<CharsetConverters override>
plain;    mhonarc::htmlize
default;  MHonArc::UTF8::str2sgml; MHonArc/UTF8.pm
</CharsetConverters>

<!-- Need to also register UTF-8-aware text clipping function -->
<TextClipFunc>
MHonArc::UTF8::clip; MHonArc/UTF8.pm
</TextClipFunc>
Description

MHonArc::UTF8::str2sgml converts data to UTF-8, with HTML specials converted to entity references like mhonarc::htmlize.

Typical usage is to register it for all charsets, since only one TEXTCLIPFUNC can be specified. Having a mixture of UTF-8 and non-UTF-8 data can cause clipping problems in resource variables that include a length specifier.

See the utf-8.mrc example resource file for more details on how this converter can be used.
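The clipping problem can be illustrated with a short sketch. It is not the MHonArc::UTF8::clip implementation; the routine name clip_chars is hypothetical, the core Encode module is assumed, and no attempt is made to treat entity references as single characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Jørn" encoded as UTF-8: the "ø" occupies two bytes (C3 B8).
my $utf8_bytes = "J\xC3\xB8rn";

# Byte-oriented clipping to length 2 cuts the "ø" in half.
my $byte_clip = substr($utf8_bytes, 0, 2);    # "J\xC3", no longer valid UTF-8

# Character-oriented clipping decodes first, clips, then re-encodes.
# (Illustrative helper only; not part of MHonArc.)
sub clip_chars {
    my ($bytes, $max_chars) = @_;
    my $text = decode('UTF-8', $bytes);
    return encode('UTF-8', substr($text, 0, $max_chars));
}

my $char_clip = clip_chars($utf8_bytes, 2);   # "J\xC3\xB8", still valid UTF-8
```

A clipping function that counts bytes produces broken output on UTF-8 data, and one that counts UTF-8 characters misjudges data in other encodings, which is why a single encoding across the archive is the safer setup.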

iso_2022_jp::str2html

Usage
<CharsetConverters>
iso-2022-jp; iso_2022_jp::str2html; iso2022jp.pl
</CharsetConverters>
Description

iso_2022_jp::str2html is designed to work with iso-2022-jp within a Japanese locale. iso_2022_jp::str2html preserves the iso-2022-jp encoding format, but converts HTML specials into character entity references similar to mhonarc::htmlize.
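Preserving the encoding while escaping HTML specials requires some care, because in the two-byte modes of iso-2022-jp a byte with the value of '<' or '&' can be half of a Japanese character. The following sketch shows the idea under simplifying assumptions: it recognizes only the four standard escape sequences, it is not the iso2022jp.pl implementation, and the names %ENTITY, $ESCAPE_SEQ, and jp_htmlize_sketch are hypothetical.

```perl
use strict;
use warnings;

my %ENTITY = ('<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;');

# ESC $ @ and ESC $ B select two-byte JIS X 0208 text;
# ESC ( B and ESC ( J select one-byte ASCII / JIS X 0201 Roman text.
my $ESCAPE_SEQ = qr/\e\$[\@B]|\e\([BJ]/;

# Escape HTML specials only in one-byte runs, copying escape sequences
# and two-byte runs through unchanged.
sub jp_htmlize_sketch {
    my ($data) = @_;
    my $out      = '';
    my $one_byte = 1;    # iso-2022-jp text starts in ASCII mode
    # The capture group makes split return the escape sequences as well.
    for my $part (split /($ESCAPE_SEQ)/, $data) {
        my $chunk = $part;
        if ($chunk =~ /^\e/) {
            $one_byte = ($chunk =~ /^\e\(/) ? 1 : 0;
        } elsif ($one_byte) {
            $chunk =~ s/([<>&"])/$ENTITY{$1}/g;
        }
        $out .= $chunk;
    }
    return $out;
}
```

Applying a plain one-pass replacement to the whole string instead would rewrite bytes inside two-byte runs and corrupt the Japanese text.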

NOTE:

If using iso_2022_jp::str2html, you should also use the iso-2022-jp text clipping function:

<TextClipFunc>
iso_2022_jp::clip; iso2022jp.pl
</TextClipFunc>

Some Japanese-aware processing tools do not support Unicode character entity references, like those generated by MHonArc::CharEnt::str2sgml, so iso_2022_jp::str2html may be preferred over MHonArc::CharEnt::str2sgml for handling iso-2022-jp data.

For more information about using MHonArc in a Japanese locale, see (documents in Japanese): <http://www.mhonarc.jp/>


Default Setting

NOTE:

As of MHonArc v2.6.0, filters should only be defined for base charsets. The CHARSETALIASES resource can be used to map alternate names for base charsets.

<CharsetConverters>
plain;		    mhonarc::htmlize;
us-ascii;	    mhonarc::htmlize;
iso-8859-1;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-2;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-3;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-4;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-5;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-6;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-7;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-8;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-9;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-10;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-11;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-13;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-14;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-15;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-16;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-2022-jp;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-2022-kr;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
euc-jp;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
utf-8;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp866;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp936;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp949;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp950;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1250;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1251;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1252;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1253;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1254;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1255;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1256;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1257;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1258;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi-0;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi-7;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-a;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-b;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-e;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-f;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-r;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-u;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
gost-19768-87;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
viscii;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
big5-eten;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
big5-hkscs;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
gb2312;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macarabic;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
maccentraleurroman; MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
maccroatian;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
maccyrillic;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macgreek;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
machebrew;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macicelandic;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macromanian;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macroman;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macthai;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macturkish;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
hp-roman8;          MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
default;            -ignore-
</CharsetConverters>

Resource Variables

N/A


Examples

The following example tells MHonArc to just decode iso-8859-1 character data since it is the default character set used by most browsers:

<DecodeHeads>
<CharsetConverters>
iso-8859-1;-decode-
</CharsetConverters>

MHonArc's MHonArc::CharEnt module supports the conversion of many major character sets, including UTF-8 data, into standard HTML character entity references (e.g. &AElig;) and numeric Unicode character references (e.g. &#x203E;). However, if you want archive pages to be in native UTF-8, see the utf-8.mrc resource file example.


Version

2.0


See Also

CHARSETALIASES, DECODEHEADS, MIMEDECODERS, MIMEFILTERS, PERLINC, TEXTCLIPFUNC, TEXTENCODE



$Date: 2005/05/13 18:50:38 $
MHonArc
Copyright © 1997-2001, Earl Hood, mhonarc@mhonarc.org