Re: Suggestions for improving MHA's i18n support

2002-09-12 17:03:15
On Wednesday 11 September 2002 09:29 pm, Earl Hood wrote:
On September 11, 2002 at 03:50, Mooffie wrote:
Currently, charset conversion routines are not applied to HTML messages.
. . .
My suggestion is to apply the charset conversion routines to HTML
messages as well.
. . .
A thing to consider is to have a pre-filtering step for any
text/* type that allows for a pre-conversion processing before a text
entity is passed to a filter. 

Yes, that's a better idea.

This relates to the TODO item of having
chained filters.

Chained? You mean that they'll run one after the other?

If you examine a filter in HSMA, you'll see that its structure is:

sub my_new_HTML_filter {
        # 1. do some processing...
        # 2. call MHA's HTML filter
        # 3. do some more processing...
}

And this looks more useful than a simple chain.
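
To make the wrapper structure concrete, here is a minimal runnable sketch. The
names stock_html_filter and my_new_html_filter are hypothetical, and the
stand-in inner filter is not MHA's real HTML filter or its calling convention;
it only shows that a wrapper can act both before and after the call, which a
plain one-after-the-other chain would not allow.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the stock HTML filter (NOT MHA's real code).
sub stock_html_filter {
    my ($data) = @_;
    return "<div>$data</div>";
}

# A wrapper filter: it works on the data before AND after the inner filter.
sub my_new_html_filter {
    my ($data) = @_;
    $data =~ s/\r\n/\n/g;                   # 1. pre-processing: normalize line ends
    my $html = stock_html_filter($data);    # 2. call the wrapped filter
    $html .= "\n<!-- wrapped -->";          # 3. post-processing: append a marker
    return $html;
}

print my_new_html_filter("line one\r\nline two"), "\n";
```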


There are two instances where we (and MHA) don't know the charset of the
. . .
Since MHA doesn't know the charset of the data, a UTF-8 conversion can't
be carried out.
. . .
My suggestion is to create a new resource, <DefaultCharset>, that allows
one to specify a default charset.
. . .
This is a nice shortcut to something like the following that has
the same effect:

    plain; MyMHADefault::str2html;


    package MyMHADefault;
    require '';

    my $default_charset = 'iso-8859-8';
    sub str2html {
      my $charcnv = readmail::load_charset($default_charset);
      &$charcnv($_[0], $default_charset);
    }

That's right, but you can't expect the user to be a Perl programmer... so this 
isn't merely a "shortcut".
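
For illustration, the behaviour a <DefaultCharset> resource would provide can
be sketched in a few lines. This is only an assumption about how such a
resource might behave, not MHA code: effective_charset and the
$default_charset variable are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical setting; in the proposal this would come from <DefaultCharset>.
my $default_charset = 'iso-8859-8';

# Return the charset to assume for a text entity.
# $declared is the charset parameter of the entity, or undef if there is none.
sub effective_charset {
    my ($declared) = @_;
    return $default_charset if !defined $declared || $declared eq '';
    return lc $declared;
}

print effective_charset(undef),   "\n";   # no charset declared: use the default
print effective_charset('UTF-8'), "\n";   # declared charset is kept (lowercased)
```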


Misconfigured MUAs, including some web-mails, may declare an incorrect
. . .
My suggestion is to create a new resource, <CharsetAliases>, to have MHA
treat some charsets as others. Then, for example, if I have a Hebrew
mailing list, I'd write:

iso-8859-8;  us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined
. . .
You can do the following to get a similar effect:

us-ascii; MyHebrewConverter::str2html;
iso-8859-1; MyHebrewConverter::str2html;
iso-8859-8; MyHebrewConverter::str2html;
iso-8859-8-i; MyHebrewConverter::str2html;
x-unknown; MyHebrewConverter::str2html;
x-user-defined; MyHebrewConverter::str2html;

Your suggestion may fail. Imagine the following situation:

Let's say MyHebrewConverter::str2html() converts its input to UTF-8.

Now, I have iso-8859-8 data which is incorrectly declared (by the MUA) as 
"us-ascii". MyHebrewConverter::str2html() will see "us-ascii" in the 
"charset" argument and thus the conversion will fail -- because it's not 

However, if "us-ascii" is an alias for "iso-8859-8" (using <CharsetAliases>), 
MyHebrewConverter::str2html() will think that the data is "iso-8859-8", which 
it really is. 

I guess the main semantic difference is that if CharsetAliases
were used and mhonarc sees "us-ascii", then when it calls
MyHebrewConverter::str2html it will actually pass in "iso-8859-8"
as $charset instead of "us-ascii".
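
A minimal sketch of that semantic difference, assuming the resource syntax
proposed above ("canonical; alias alias ..."). The functions parse_aliases and
resolve_charset are hypothetical and are not part of MHA; the point is only
that the name is resolved before any converter is called.

```perl
use strict;
use warnings;

# Parse lines of the proposed form "canonical; alias alias ..." into
# a lookup table that maps each alias to its canonical charset name.
sub parse_aliases {
    my ($text) = @_;
    my %alias;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^\s*([^;\s]+)\s*;\s*(.*?)\s*$/;
        my ($canonical, $names) = (lc $1, lc $2);
        $alias{$_} = $canonical for split ' ', $names;
    }
    return \%alias;
}

# Resolve a declared charset name before any converter sees it.
sub resolve_charset {
    my ($aliases, $declared) = @_;
    my $name = lc $declared;
    return exists $aliases->{$name} ? $aliases->{$name} : $name;
}

my $aliases = parse_aliases(
    "iso-8859-8;  us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined\n"
);

# The converter would be handed the resolved name, not the declared one.
print resolve_charset($aliases, 'US-ASCII'), "\n";
print resolve_charset($aliases, 'koi8-r'),   "\n";   # no alias: passed through
```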


In HSMA I indeed used the above two ways to solve the problem:

1. I provided a resource file with the elaborate <CharsetConverters> resource 
you just gave, for use when the user wants the archive encoding to be 
windows-1255 (which is an 8-bit encoding).

2. Because the user may want the archive encoding to be UTF-8, I had to 
include an aliases table in the code that does the UTF-8 conversion.

A <CharsetAliases> resource can solve this problem. You may have got the 
impression that it's only useful for buggy MUAs, but that's not so. For 
example, Hebrew messages may be marked as either "iso-8859-8" or 
"iso-8859-8-i". Both are standardized, and both stand for the same encoding; 
the first, which is deprecated nowadays, stands for "Visual Hebrew", and 
the latter for "Logical Hebrew" (the distinction is not important for our 
discussion). What I want is to let MHA know that "iso-8859-8-i" is actually 
the standard "iso-8859-8" encoding, and a <CharsetAliases> mechanism looks 
like an elegant solution.

Note that my suggestions are intended to help the ordinary user that manages 
MHA (the "administrator"). If all users were programmers, they wouldn't need 
them (albeit they'd have to spend a lot of time coding).

I can see the usefulness of it, especially when used with the existing



I see that includes a few hard-coded aliases (e.g.
"windows-1250" --> "cp1250"). It might be possible to extend
<CharsetAliases> to have this function too; for example:

cp1250; windows-1250
. . .
cp1255; windows-1255
. . .
apple-hebrew; x-mac-hebrew


Actually, it would be silly for <CharsetAliases> not to have this function, 
because the intent is to spare the user editing the source.


Although UTF-8 has its advantages, some administrators might prefer
their national 8-bit encoding (because it requires less disk space,
because they already have 3rd party tools that work with it (e.g. search
tools), etc). It seems that it won't be difficult to create a new
conversion routine (one can start from MHonArc::UTF8::str2sgml) that
converts everything to a common arbitrary encoding, which can be an 8-bit
based one. A new resource, e.g. <TargetEncoding> or <ArchiveEncoding>,
could determine this target encoding (which could also be "utf-8"(!), so
this routine could eventually obsolete MHonArc::UTF8::str2sgml).

Can't someone achieve something similar with:

<CharsetConverters override>
plain;          mhonarc::htmlize;
default;        -decode-
</CharsetConverters>

No, the intent is to convert all messages to a common encoding the user 
specifies. You already have a function, MHonArc::UTF8::str2sgml, that 
converts all messages to UTF-8. My suggestion is to extend this function so 
that any target encoding is possible, not just UTF-8.
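
As a rough sketch of what such a generalized routine could look like, the
following uses Perl's core Encode module rather than MHA's own conversion
tables, so it is a swapped-in technique and not how MHonArc::UTF8::str2sgml
works. The function name to_target is hypothetical. Characters that the target
encoding cannot represent fall back to numeric character references.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert raw bytes from a source charset to an arbitrary target encoding.
# Characters missing from the target become &#x...; references.
sub to_target {
    my ($bytes, $source, $target) = @_;
    my $text = decode($source, $bytes);                  # bytes -> characters
    return encode($target, $text, Encode::FB_XMLCREF()); # characters -> target bytes
}

my $hebrew = "\xF9\xEC\xE5\xED";    # the word "shalom" in iso-8859-8

print to_target($hebrew, 'iso-8859-8', 'utf-8'),      "\n";  # Hebrew as UTF-8 bytes
print to_target($hebrew, 'iso-8859-8', 'iso-8859-1'), "\n";  # falls back to references
```

Note that HTML-special characters (&, <, >) would still need escaping
separately; the sketch only covers the charset side.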

Of course, it does not allow one to explicitly specify a target
encoding to allow for "smart" conversion from one charset to the
final one since whatever is registered for the "plain" set is not
provided any information on what the source format really is.

Yes, but this can be solved with the <DefaultCharset> resource.

Ah, my SGML background is showing through.  The named entities are
standard in SGML, and unfortunately, never adopted by HTML.  I knew
this would eventually be a problem.

The named entities have the advantage of being usable across character
sets while numeric are tied to the current character set in use.

No, numeric character references are _independent_ of the encoding of the 
page, because they specify the Unicode number of the character.

That is, "&#254;" is always the letter Thorn, no matter what the encoding is.

A good explanation can be found in the HTML spec:

And while you're at it, please check the following list:

I welcome numeric character entity reference mappings using the
&#x...; notation.  This should work for most modern browsers and
avoid having dependencies on other modules (at least in the
default configuration).

Since the named entities don't work anyway (well... the bulk of them), using 
numeric character references can't be worse :-)
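
To illustrate the point about numeric character references, here is a small
sketch that maps every non-ASCII character of an already decoded string to the
&#x...; notation. The function name to_ncr is hypothetical and this is not
MHA's mapping code; it assumes the input is a Perl character string, not raw
bytes.

```perl
use strict;
use warnings;

# Replace every non-ASCII character of a decoded (character) string with
# a hexadecimal numeric character reference to its Unicode code point.
sub to_ncr {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord $1)/ge;
    return $text;
}

# U+00FE is the letter thorn; "&#254;" and "&#xFE;" name the same character.
print to_ncr("\x{FE}orn"), "\n";
```

Because the reference carries the Unicode number, the resulting page is pure
ASCII and displays the same under any declared page encoding.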

A major goal, and I think one reason for its usefulness, is that it
should be easy to install and get going.

I agree with you.

I urge you to read the HSMA source and *.mrc files in order to understand the 
rationales behind my suggestions. In the code I dealt with the above issues, 
issues that are not at all specific to Hebrew, and which, in my opinion, 
should be handled by MHA itself.
