Suggestions for improving MHA's i18n support

2002-09-10 17:57:58
 Following are some suggestions.


Currently, charset conversion routines are not applied to HTML messages. That is, m2h_text_html::filter() ignores any converters registered with <CharsetConverters>.

As a result we get pages in our archive in different encodings, which is a bad thing for at least three reasons:

(a) We may encounter difficulties using external tools such as search engines, grep(1), etc.
(b) We have no way to tell the browser which encoding the page is in (i.e. no <meta http-equiv> tag).
(c) The user's browser may not support all encodings.

My suggestion is to apply the charset conversion routines to HTML messages as well. The fact that the current conversion routines escape HTML special characters (i.e. '<', '>', '&') complicates matters a bit.
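To illustrate the complication, here is a minimal sketch in Python (used purely for illustration; MHonArc itself is Perl, and the function name html_to_ascii_ncr is hypothetical). It assumes the HTML body's charset is known, and it rewrites only non-ASCII characters as numeric character references, so that existing markup characters ('<', '>', '&') pass through unescaped:

```python
def html_to_ascii_ncr(raw: bytes, charset: str) -> str:
    """Decode an HTML body and re-express every non-ASCII character as a
    numeric character reference. Markup characters (<, >, &) are left
    untouched, because the input is already HTML."""
    text = raw.decode(charset, errors="replace")
    # 'xmlcharrefreplace' turns each unencodable character into &#NNNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# 0xE0 is the Hebrew letter alef (U+05D0) in iso-8859-8.
sample = b"<p>\xe0 &amp; b</p>"
print(html_to_ascii_ncr(sample, "iso-8859-8"))  # <p>&#1488; &amp; b</p>
```

The result is pure ASCII, so it displays the same regardless of the encoding the browser assumes for the page.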

Although in some communities HTML messages are frowned upon, when one uses a bidirectional language, sometimes HTML is a must, because plain text lacks the higher protocol needed to specify directionality.


Improvements to the new UTF-8 support:


There are two instances where we (and MHA) don't know the charset of the data:

(a) We don't know the charset of the body of the message when no "charset=..." is present in the "content-type" header.
(b) We don't know the charset of the headers (e.g. "subject") when the MUA uses 8-bit octets instead of following RFC 1522's guidelines (e.g. some web-mails and even Outlook Express, when the "Allow 8-bit characters in headers" option is checked).

Since MHA doesn't know the charset of the data, a UTF-8 conversion can't be carried out.

Although m2h_text_plain::filter() has a "default" argument that allows us to specify a default charset, this doesn't apply to headers, and, besides, m2h_text_html::filter() doesn't have such an argument. Also, although MHA supports the pseudo charset "plain", we still have no way to tell the various conversion routines what charset the data is in.

My suggestion is to create a new resource, <DefaultCharset>, that allows one to specify a default charset. This charset will be passed to the conversion routines when no charset is explicitly specified (including headers).
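The fallback logic could look roughly like the following Python sketch (illustration only; MHonArc is Perl, and the names effective_charset and decode_raw_header are hypothetical, not existing MHA functions). It assumes the value of the proposed <DefaultCharset> resource is available as a plain string:

```python
from typing import Optional


def effective_charset(declared: Optional[str], default_charset: str) -> str:
    """Return the declared charset if there is one, otherwise fall back to
    the value of the (proposed) <DefaultCharset> resource."""
    if declared and declared.strip():
        return declared.strip().lower()
    return default_charset.lower()


def decode_raw_header(raw: bytes, default_charset: str) -> str:
    """Decode a header that contains raw 8-bit octets and therefore
    carries no charset information of its own."""
    charset = effective_charset(None, default_charset)
    return raw.decode(charset, errors="replace")


# A body with no "charset=..." parameter on a Hebrew list:
print(effective_charset(None, "ISO-8859-8"))     # iso-8859-8
# A body that does declare its charset keeps it:
print(effective_charset("UTF-8", "ISO-8859-8"))  # utf-8
```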




Misconfigured MUAs, including some web-mails, may declare an incorrect charset. For example, Yahoo mail always appends "charset=us-ascii" to outgoing messages, even when the user writes in Hebrew.

As a result, the UTF-8 conversion routine thinks it converts us-ascii data, while the data is actually in iso-8859-8.

My suggestion is to create a new resource, <CharsetAliases>, to have MHA treat some charsets as others. Then, for example, if I have a Hebrew mailing list, I'd write:

iso-8859-8;  us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined

(which reads: "us-ascii and iso-8859-1 and . . . are aliases for iso-8859-8")
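As a rough sketch of how such a resource might be read and applied (Python for illustration only; the function names are hypothetical, and the "canonical; alias alias ..." syntax is simply the one proposed above, not an existing MHA format):

```python
def parse_charset_aliases(resource_text: str) -> dict:
    """Parse lines of the form 'canonical; alias1 alias2 ...' into a
    mapping from each alias to its canonical charset name."""
    aliases = {}
    for line in resource_text.splitlines():
        line = line.strip()
        if not line or ";" not in line:
            continue  # skip blank or malformed lines
        canonical, _, names = line.partition(";")
        canonical = canonical.strip().lower()
        for name in names.split():
            aliases[name.lower()] = canonical
    return aliases


def resolve_charset(name: str, aliases: dict) -> str:
    """Map a declared charset to the one that should actually be used."""
    name = name.strip().lower()
    return aliases.get(name, name)


aliases = parse_charset_aliases(
    "iso-8859-8;  us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined"
)
print(resolve_charset("US-ASCII", aliases))  # iso-8859-8
print(resolve_charset("utf-8", aliases))     # utf-8
```

The lookup would have to happen before a converter is selected, so that a message declared as "us-ascii" reaches the converter registered for iso-8859-8.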

BTW, in you wrote:

%asis = ('us-ascii' => 1);  # XXX: Should us-ascii always be "as-is"?

The answer is, "No!" If you always treat us-ascii "as-is" you don't give the administrator a chance to register a CharsetConverter with us-ascii in order to handle misconfigured MUAs.


I see that includes a few hard-coded aliases (e.g. "windows-1250" --> "cp1250"). It might be possible to extend <CharsetAliases> to have this function too; for example:

cp1250; windows-1250
. . .
cp1255; windows-1255
. . .
apple-hebrew; x-mac-hebrew


Although UTF-8 has its advantages, some administrators might prefer their national 8-bit encoding (because it requires less disk space, because they already have third-party tools that work with it (e.g. search tools), etc). It seems that it won't be difficult to create a new conversion routine (one can start from MHonArc::UTF8::str2sgml) that converts everything to a common, arbitrary encoding, which can be an 8-bit one. A new resource, e.g. <TargetEncoding> or <ArchiveEncoding>, could determine this target encoding (which could also be "utf-8"(!), so this routine could eventually obsolete MHonArc::UTF8::str2sgml).

So, for example, for a windows-1255 archive we could have:


<CharsetConverters override>
default;        MHonArc::Converter::str2sgml;     MHonArc/

And for a UTF-8 archive:


<CharsetConverters override>
default;        MHonArc::Converter::str2sgml;     MHonArc/

(Note that we don't list "us-ascii" in <CharsetConverters> (in contrast with the utf-8 mrc example you provide with MHA), because "us-ascii" might be <CharsetAliases>'ed to something else that can't be handled "as-is". Also, we don't list "plain" because it's the responsibility of the conversion routine: when it gets an empty charset argument, it looks for the one specified in the <DefaultCharset> resource.)
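The core of such a routine might look like the following Python sketch (for illustration only; the real routine would be Perl, and to_archive_encoding is a hypothetical name). It assumes plain-text input, so HTML special characters are escaped, and it assumes that characters the target encoding can't represent should become numeric character references:

```python
import html


def to_archive_encoding(raw: bytes, source_charset: str,
                        target_encoding: str) -> bytes:
    """Convert plain-text data from its source charset to the archive's
    target encoding, escaping HTML special characters on the way."""
    text = raw.decode(source_charset, errors="replace")
    # Escape <, > and & first, so the '&' of the character references
    # generated in the next step is not escaped again.
    text = html.escape(text, quote=False)
    # Anything the target encoding lacks becomes &#NNNN;
    return text.encode(target_encoding, errors="xmlcharrefreplace")


# Hebrew text keeps its 8-bit form in a windows-1255 (cp1255) archive:
print(to_archive_encoding(b"\xf9\xec\xe5\xed", "iso-8859-8", "cp1255"))
# A Greek alpha has no cp1255 code point, so it becomes a reference:
print(to_archive_encoding("\u03b1 < b".encode("utf-8"), "utf-8", "cp1255"))
```

With "utf-8" as the target, the last step never needs a character reference, which is why one routine could serve both kinds of archive.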

We can also have a corresponding resource variable, $ArchiveEncoding$, and put a meta tag on every page:

<meta http-equiv="content-type" content="text/html; charset=$ArchiveEncoding$ ">

======== End of UTF-8 notes, back to 8-bit: =============


There are some conversion tables in the CharEnt directory (ISO8859_*.pm). Almost all of them are incorrect, because they use entity names that don't exist in HTML. I know of no browser that recognizes these names. The HTML spec defines only a handful of character names, so the correct way is to use numeric character references (that is, "&#", possibly an "x", the Unicode value, ";"). I'd be happy to help you fix these tables. Perhaps, though, it would be better to abandon them and instead implement the routine I suggested in #5 using the various Unicode::* modules. Yes, I know you don't want to tell your users they must have a Unicode::FooBar module installed, but isn't the alternative, shipping tens or hundreds of conversion tables with MHA, worse?
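To show what a corrected table would contain, here is a small Python sketch (illustration only; build_ncr_table is a hypothetical name, and Python's built-in codecs stand in for the Unicode::* modules). It derives a numeric character reference for each high byte of an 8-bit charset instead of relying on a hand-maintained list of names:

```python
def build_ncr_table(charset: str) -> dict:
    """Map each byte 0x80..0xFF of an 8-bit charset to a hexadecimal
    numeric character reference. Undefined positions are skipped."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte has no meaning in the charset
        table[byte] = "&#x%04X;" % ord(char)
    return table


table = build_ncr_table("iso-8859-8")
print(table[0xE0])  # &#x05D0;  (Hebrew letter alef)
print(table[0xFA])  # &#x05EA;  (Hebrew letter tav)
```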

If you decide to implement any of the above, let me know if I can help.
