Suggestions for improving MHA's i18n support

Here are some suggestions.

1.

Currently, charset conversion routines are not applied to HTML messages. That is, m2h_text_html::filter() ignores any converters registered with <CharsetConverters>.

As a result, we get pages in our archive in different encodings, which is a bad thing for at least three reasons:

(a) We may encounter difficulties using external tools such as search engines, grep(1), etc.

(b) We have no way to tell the browser what encoding the page is in (i.e. no <meta http-equiv> tag).

(c) The user's browser may not support all encodings.

My suggestion is to apply the charset conversion routines to HTML messages as well. The fact that the current conversion routines escape HTML special characters (i.e. '<', '>', '&') complicates matters a bit.

Although in some communities HTML messages are frowned upon, when one uses a bidirectional language, sometimes HTML is a must, because plain text lacks the higher protocol needed to specify directionality.
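To illustrate the idea, here is a rough Python sketch (MHA itself is Perl, so this is not its code). The function name transcode_html is hypothetical. It assumes the HTML body is re-encoded as a whole and that the markup characters are deliberately left unescaped, which is the part the current routines get in the way of:

```python
def transcode_html(raw: bytes, source_charset: str,
                   target_charset: str = "utf-8") -> bytes:
    """Re-encode an HTML body without touching its markup (hypothetical helper)."""
    # Decode using the charset the message declared.
    text = raw.decode(source_charset, errors="replace")
    # '<', '>' and '&' pass through untouched because they are markup here.
    # Characters the target charset cannot represent become numeric
    # character references instead of being lost.
    return text.encode(target_charset, errors="xmlcharrefreplace")


# A Hebrew paragraph as it might arrive in an iso-8859-8 HTML message.
hebrew_html = '<p dir="rtl">\u05e9\u05dc\u05d5\u05dd</p>'.encode("iso-8859-8")
utf8_page = transcode_html(hebrew_html, "iso-8859-8")
ascii_page = transcode_html(hebrew_html, "iso-8859-8", "ascii")
```

With an ASCII target, the Hebrew letters come out as numeric character references while the <p> tag survives intact.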


==================================================

Improvements to the new UTF-8 support:

2.

There are two instances where we (and MHA) don't know the charset of the data:

(a) We don't know the charset of the body of the message when no "charset=..." is present in the "content-type" header.

(b) We don't know the charset of the headers (e.g. "subject") when the MUA uses 8-bit octets instead of following RFC 1522's guidelines (e.g. some web-mails and even Outlook Express, when the "Allow 8-bit characters in headers" option is checked).

Since MHA doesn't know the charset of the data, a UTF-8 conversion can't be carried out.

Although m2h_text_plain::filter() has a "default" argument that allows us to specify a default charset, this doesn't apply to headers, and, besides, m2h_text_html::filter() doesn't have such an argument. Also, although MHA supports the pseudo charset "plain", we still have no way to tell the various conversion routines what charset the data is in.

My suggestion is to create a new resource, <DefaultCharset>, that allows one to specify a default charset. This charset will be passed to the conversion routines when no charset is explicitly specified (including for headers).


Example:

<DefaultCharset>
windows-1255
</DefaultCharset>
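
The fallback I have in mind is simple. Here is a rough Python sketch (not MHA's Perl code); effective_charset, decode_header and DEFAULT_CHARSET are hypothetical names standing in for whatever MHA would use internally:

```python
from typing import Optional

# Hypothetical: the value read from the proposed <DefaultCharset> resource.
DEFAULT_CHARSET = "windows-1255"


def effective_charset(declared: Optional[str],
                      default: str = DEFAULT_CHARSET) -> str:
    """Return the declared charset, or the default when none was given."""
    name = (declared or "").strip().lower()
    return name if name else default


def decode_header(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode a raw 8-bit header value, falling back to the default charset."""
    return raw.decode(effective_charset(declared), errors="replace")


# A subject line sent as bare 8-bit octets, with no charset information.
subject = decode_header(b"\xf9\xec\xe5\xed")
```

The same effective_charset() lookup would serve message bodies that lack a "charset=..." parameter.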

3.

Misconfigured MUAs, including some web-mails, may declare an incorrect charset. For example, Yahoo mail always appends "charset=us-ascii" to outgoing messages, even when the user writes in Hebrew.

As a result, the UTF-8 conversion routine assumes it is converting us-ascii data, while the data is actually in iso-8859-8.

My suggestion is to create a new resource, <CharsetAliases>, to have MHA treat some charsets as others. Then, for example, if I have a Hebrew mailing list, I'd write:


<CharsetAliases>
iso-8859-8;  us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined
</CharsetAliases>

(which reads: "us-ascii and iso-8859-1 and . . . are aliases for iso-8859-8")
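
To show how such a resource could be read and applied, here is a rough Python sketch (not MHA's Perl code). The names parse_charset_aliases and resolve_charset are hypothetical, and the "canonical; alias alias ..." line format is just the one proposed above:

```python
def parse_charset_aliases(resource_text: str) -> dict:
    """Map each alias to its canonical charset.

    Each non-empty line has the form 'canonical; alias alias ...'.
    """
    aliases = {}
    for line in resource_text.splitlines():
        line = line.strip()
        if not line or ";" not in line:
            continue  # skip blank or malformed lines
        canonical, _, names = line.partition(";")
        canonical = canonical.strip().lower()
        for alias in names.split():
            aliases[alias.lower()] = canonical
    return aliases


def resolve_charset(name: str, aliases: dict) -> str:
    """Return the charset to really use for a declared charset name."""
    name = name.strip().lower()
    return aliases.get(name, name)


ALIASES = parse_charset_aliases(
    "iso-8859-8;  us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined"
)
actual = resolve_charset("US-ASCII", ALIASES)
```

The lookup would run before a converter is chosen, so a message claiming "us-ascii" is handled as iso-8859-8, while charsets that are not listed pass through unchanged.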


BTW, in mhtxtplain.pl you wrote:

%asis = ('us-ascii' => 1);  # XXX: Should us-ascii always be "as-is"?

The answer is, "No!" If you always treat us-ascii "as-is", you don't give the administrator a chance to register a CharsetConverter with us-ascii in order to handle misconfigured MUAs.

4.

I see that UTF8.pm includes a few hard-coded aliases (e.g. "windows-1250" --> "cp1250"). It might be possible to extend <CharsetAliases> to have this function too; for example:


<CharsetAliases>
cp1250; windows-1250
. . .
cp1255; windows-1255
. . .
apple-hebrew; x-mac-hebrew
</CharsetAliases>

5.

Although UTF-8 has its advantages, some administrators might prefer their national 8-bit encoding (because it requires less disk space, because they already have 3rd party tools that work with it (e.g. search tools), etc). It seems that it won't be difficult to create a new conversion routine (one can start from MHonArc::UTF8::str2sgml) that converts everything to a common arbitrary encoding, which can be an 8-bit based one. A new resource, e.g. <TargetEncoding> or <ArchiveEncoding>, could determine this target encoding (which could also be "utf-8"(!), so this routine could eventually obsolete MHonArc::UTF8::str2sgml).


So, for example, for a windows-1255 archive we could have:

<ArchiveEncoding>
windows-1255
</ArchiveEncoding>

<CharsetConverters override>
default;        MHonArc::Converter::str2sgml;     MHonArc/Converter.pm
</CharsetConverters>

And for a UTF-8 archive:

<ArchiveEncoding>
utf-8
</ArchiveEncoding>

<CharsetConverters override>
default;        MHonArc::Converter::str2sgml;     MHonArc/Converter.pm
</CharsetConverters>

(Note that we don't list "us-ascii" in <CharsetConverters> (in contrast with the utf-8 mrc example you provide with MHA), because "us-ascii" might be <CharsetAliases>'ed to something else that can't be handled "as-is". Also, we don't list "plain" because it's the responsibility of the conversion routine: when it gets an empty charset argument, it looks for the one specified in the <DefaultCharset> resource.)
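
A converter along these lines could look roughly like the following Python sketch (not MHA's Perl code). The name str_to_archive is hypothetical and merely stands in for the proposed MHonArc::Converter::str2sgml. It assumes plain-text input, so HTML special characters are escaped, and it assumes that characters missing from the archive encoding should become numeric character references:

```python
import html


def str_to_archive(raw: bytes, source_charset: str,
                   archive_encoding: str) -> bytes:
    """Convert text in any charset to the archive's single target encoding."""
    text = raw.decode(source_charset, errors="replace")
    # Escape '&', '<' and '>' so the text is safe to embed in a page.
    escaped = html.escape(text, quote=False)
    # Anything the archive encoding lacks becomes '&#NNN;'.
    return escaped.encode(archive_encoding, errors="xmlcharrefreplace")


# Hebrew stays 8-bit in a windows-1255 archive...
hebrew = str_to_archive(b"\xf9 & \xec", "iso-8859-8", "windows-1255")
# ...while a Greek letter, absent from windows-1255, becomes a reference.
greek = str_to_archive(b"\xe1 < b", "iso-8859-7", "windows-1255")
# The same routine serves a UTF-8 archive.
utf8 = str_to_archive(b"\xf9", "iso-8859-8", "utf-8")
```

Because the target encoding is just a parameter, one routine covers both the 8-bit and the UTF-8 configurations shown above.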

We can also have a corresponding resource variable, $ArchiveEncoding$, and put a meta tag on every page:


<MsgPgBegin>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML//EN">
<html>
<head>

<meta http-equiv="content-type" content="text/html;charset=$ArchiveEncoding$">

<title>$SUBJECTNA$</title>
</head>
</MsgPgBegin>

======== End of UTF-8 notes, back to 8-bit: =============

6.

There are some conversion tables in the CharEnt directory (ISO8859_*.pm). Almost all of them are incorrect, because you're using entity names that don't exist in HTML. I know of no browser that recognizes these names. The HTML spec defines only a limited set of character names, so the correct way is to use numeric character references (that is, "&#", possibly an "x", the Unicode value, ";"). I'd be happy to help you fix these tables, but perhaps it would be better to abandon them and instead implement the routine I suggested in #5 using the various Unicode::* modules. (Yes, I know you don't want to tell your users they must have a Unicode::FooBar module installed, but isn't the alternative, including tens or hundreds of conversion tables with MHA, worse?)
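
To show what a correct table would contain, here is a rough Python sketch that derives the numeric character references from an existing codec instead of writing them by hand. It is not MHA's Perl code, build_ncr_table is a hypothetical name, and the sketch only makes sense for single-byte charsets:

```python
def build_ncr_table(charset: str) -> dict:
    """Map each high byte of a single-byte charset to '&#xNNNN;'."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte is undefined in the charset
        # Hexadecimal numeric character reference for the code point.
        table[byte] = "&#x%04X;" % ord(char)
    return table


hebrew_table = build_ncr_table("iso-8859-8")
latin1_table = build_ncr_table("iso-8859-1")
alef = hebrew_table[0xE0]  # byte 0xE0 is the Hebrew letter alef, U+05D0
```

A table built this way never contains an entity name, so every entry is something browsers are required to understand.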


If you decide to implement any of the above, let me know if I can help.

