mhonarc-dev

Re: Suggestions for improving MHA's i18n support

2002-09-11 11:29:19
On September 11, 2002 at 03:50, Mooffie wrote:

1.

Currently, charset conversion routines are not applied to HTML messages. 
That is, m2h_text_html::filter() ignores any converters registered with 
<CharsetConverters>.

As a result we get pages in our archive in different encodings, which is 
a bad thing for at least three reasons:

(a) We may encounter difficulties using external tools such as search 
engines, grep(1) etc.
(b) We have no way to tell the browser what encoding the page is in 
(i.e. no <meta http-equiv> tag).
(c) The user's browser may not support all encodings.

My suggestion is to apply the charset conversion routines to HTML 
messages as well. The fact that the current conversion routines escape 
HTML special characters (i.e. '<', '>', '&') complicates matters a bit.

I would consider this process separate from CharsetConverters.
CharsetConverters is responsible for taking the raw data and converting
it into HTML.  What you are asking for is straight character set/encoding
conversion.  The semantics are slightly different, but the difference is
important since CharsetConverters routines must properly do things like
converting '<', '>', '&' to entity references.
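
For example, feeding a text/html body through an HTML-izing
converter would turn

    <p>&quot;shalom&quot;</p>

into

    &lt;p&gt;&amp;quot;shalom&amp;quot;&lt;/p&gt;

which is why a straight charset/encoding conversion, with no entity
escaping, is the operation wanted here.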

A thing to consider is to have a pre-filtering step for any
text/* type that allows for pre-conversion processing before a text
entity is passed to a filter.  This relates to the TODO item of having
chained filters.  Something like this would require some restructuring
of readmail.pl, in which case compatibility concerns must be considered.

2.

There are two instances where we (and MHA) don't know the charset of the 
data:

(a) We don't know the charset of the body of the message when no 
"charset=..." is present in the "content-type" header.

According to the MIME specs, US-ASCII is to be assumed.  Of course,
you are trying to deal with non-compliant MUAs.
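
(Per the MIME specs, a body with no charset parameter is treated as
if it were labeled

    Content-Type: text/plain; charset=us-ascii

which is exactly the assumption that non-compliant MUAs break.)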

(b) We don't know the charset of the headers (e.g. "subject") when the 
MUA uses 8-bit octets instead of following RFC 1522's guidelines (e.g. 
some web-mails and even Outlook Express, when the "Allow 8-bit 
characters in headers" option is checked).

Ugh.

Since MHA doesn't know the charset of the data, a UTF-8 conversion can't 
be carried out.

Although m2h_text_plain::filter() has a "default" argument that allows 
us to specify a default charset, this doesn't apply to headers, and, 
besides, m2h_text_html::filter() doesn't have such an argument. Also, 
although MHA supports the pseudo charset "plain", we still have no way 
to tell the various conversion routines what charset the data is in.

My suggestion is to create a new resource, <DefaultCharset>, that allows 
one to specify a default charset. This charset will be passed to the 
conversion routines when no charset is explicitly specified (including 
headers).

Example:

<DefaultCharset>
windows-1255
</DefaultCharset>

This is a nice shortcut to something like the following that has
the same effect:

    <CharsetConverters>
    plain; MyMHADefault::str2html; MyMHADefault.pm
    </CharsetConverters>

In MyMHADefault.pm:

    package MyMHADefault;
    require 'readmail.pl';

    my $default_charset = 'iso-8859-8';

    # Look up the converter registered for the default charset and
    # apply it to the data passed in.
    sub str2html {
      my $charcnv = readmail::load_charset($default_charset);
      &$charcnv($_[0], $default_charset);
    }
    1;

This does require that you properly define a converter for what
$default_charset is set to.

I like the DefaultCharset idea.

3.

Misconfigured MUAs, including some web-mails, may declare an incorrect 
charset. For example, Yahoo mail always appends "charset=us-ascii" to 
outgoing messages, even when the user writes in Hebrew.

As a result, the UTF-8 conversion routine thinks it is converting 
us-ascii data, while the data is actually in iso-8859-8.

My suggestion is to create a new resource, <CharsetAliases>, to have MHA 
treat some charsets as others. Then, for example, if I have a Hebrew 
mailing list, I'd write:

<CharsetAliases>
iso-8859-8;  us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined
</CharsetAliases>

(which reads: "us-ascii and iso-8859-1 and . . . are aliases for 
iso-8859-8")

You can do the following to get a similar effect:

<CharsetConverters>
us-ascii; MyHebrewConverter::str2html; MyHebrewConverter.pm
iso-8859-1; MyHebrewConverter::str2html; MyHebrewConverter.pm
iso-8859-8; MyHebrewConverter::str2html; MyHebrewConverter.pm
iso-8859-8-i; MyHebrewConverter::str2html; MyHebrewConverter.pm
x-unknown; MyHebrewConverter::str2html; MyHebrewConverter.pm
x-user-defined; MyHebrewConverter::str2html; MyHebrewConverter.pm
</CharsetConverters>

Of course, MyHebrewConverter::str2html would have to be aware that
the $charset parameter could contain 'us-ascii'.
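
A minimal sketch of what such a converter could look like (the module
is hypothetical; it ignores the claimed charset, HTML-izes, and passes
the octets through, relying on the pages being labeled iso-8859-8):

    package MyHebrewConverter;

    # Ignore the (possibly bogus) claimed charset: escape the HTML
    # specials and pass the octets through untouched.
    sub str2html {
      my($data, $charset) = @_;  # $charset may be 'us-ascii', etc.
      $data =~ s/&/&amp;/g;      # must be done first
      $data =~ s/</&lt;/g;
      $data =~ s/>/&gt;/g;
      $data;
    }
    1;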

I guess the main semantic difference is that if CharsetAliases
were used and mhonarc sees "us-ascii", then when it calls
MyHebrewConverter::str2html it will actually pass in "iso-8859-8"
as $charset instead of "us-ascii" (and use the charset converter
registered for iso-8859-8).
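
I.e., for a message declaring charset=us-ascii, the call would
effectively become (writing $converter for the registered routine):

    &$converter($data, 'iso-8859-8');   # not ($data, 'us-ascii')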

I can see the usefulness of it, especially when used with the existing
converters.

BTW, in mhtxtplain.pl you wrote:

%asis = ('us-ascii' => 1);  # XXX: Should us-ascii always be "as-is"?

The answer is, "No!" If you always treat us-ascii "as-is" you don't give 
the administrator a chance to register a CharsetConverter with us-ascii 
in order to handle misconfigured MUAs.

Good point.  I took the angle that typically all charsets have us-ascii
as a formal subset, so I figured, why take the overhead of calling a
filter for it?  However, I did not consider the misconfigured MUA angle
(since we all know all MUAs do the right thing :-)

4.

I see that UTF8.pm includes a few hard-coded aliases (e.g. 
"windows-1250" --> "cp1250"). It might be possible to extend 
<CharsetAliases> to serve this function too; for example:

<CharsetAliases>
cp1250; windows-1250
. . .
cp1255; windows-1255
. . .
apple-hebrew; x-mac-hebrew
</CharsetAliases>

True.

5.

Although UTF-8 has its advantages, some administrators might prefer 
their national 8-bit encoding (because it requires less disk space, or 
because they already have third-party tools, such as search tools, that 
work with it). It seems it would not be difficult to create a new 
conversion routine (one could start from MHonArc::UTF8::str2sgml) that 
converts everything to a common arbitrary encoding, which could be an 
8-bit one. A new resource, e.g. <TargetEncoding> or <ArchiveEncoding>, 
could determine this target encoding (which could also be "utf-8"(!), so 
this routine could eventually obsolete MHonArc::UTF8::str2sgml).
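
A minimal sketch of such a routine, assuming Perl 5.8's Encode module
is available (the package name is hypothetical, and $charset must be
a name Encode recognizes):

    package MyTargetEnc;
    use Encode qw(from_to);

    my $target = 'windows-1255';  # would come from <TargetEncoding>

    sub str2html {
      my($data, $charset) = @_;
      # Transcode in place from the declared charset to the target...
      from_to($data, $charset, $target);
      # ...then escape the HTML specials, as any CharsetConverters
      # routine must.
      $data =~ s/&/&amp;/g;
      $data =~ s/</&lt;/g;
      $data =~ s/>/&gt;/g;
      $data;
    }
    1;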

Can't someone achieve something similar with:

<DecodeHeads>
<CharsetConverters override>
plain;          mhonarc::htmlize;
default;        -decode-
</CharsetConverters>

This basically decodes all non-ASCII encoded data, causing all
character data to be treated by the default locale setting (or what
is put in a <meta http-equiv> tag via layout resources).

Of course, it does not allow one to explicitly specify a target
encoding to allow for "smart" conversion from one charset to the
final one since whatever is registered for the "plain" set is not
provided any information on what the source format really is.

We can also have a corresponding resource variable, $ArchiveEncoding$, 
and put a meta tag on every page.

I prefer not to add conveniences like this since the regular layout
resource can give the same effect.
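
For example, something along these lines in the resource file gives
every message page the meta tag (the charset value is illustrative):

    <MsgPgBegin>
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1255">
    <title>$SUBJECTNA$</title>
    </head>
    <body>
    </MsgPgBegin>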

6.

There are some conversion tables in the CharEnt directory 
(ISO8859_*.pm). Almost all of them are incorrect, because you're using 
entity names that don't exist in HTML.

Ah, my SGML background is showing through.  The named entities are
standard in SGML but, unfortunately, were never adopted by HTML.  I
knew this would eventually be a problem.

The named entities have the advantage of being usable across character
sets, while numeric references are tied to the character set in use.

I know of no browser that recognizes these names. The HTML spec 
defines only a handful of character names, so the correct way is to 
use numeric character references (that is, "&#", possibly an "x", the 
Unicode value, and ";"). I'd be happy to help you fix these tables, 
but perhaps it would be better to abandon them and instead implement 
the routine I suggested in #5 using the various Unicode::* modules. 
(Yes, I know you don't want to tell your users they must have a 
Unicode::FooBar module installed, but isn't the alternative, shipping 
tens or hundreds of conversion tables with MHA, worse?)

I welcome numeric character entity reference mappings using the
&#x...; notation.  This should work for most modern browsers and
avoid having dependencies on other modules (at least in the
default configuration).
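
For example, an entry in a CharEnt table would then map an 8-bit code
position straight to a numeric reference (illustrative fragment; the
exact table layout in CharEnt/*.pm may differ):

    # ISO-8859-2: 0xA1 is LATIN CAPITAL LETTER A WITH OGONEK (U+0104)
    0xA1 => '&#x104;',   # instead of the SGML name '&Aogon;'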

As for keeping hundreds of conversion tables, I agree with your
statement.  However, it would be nice to have tables for the most
common/standard sets (like the ISO-8859 sets) so users do not have
to deal with trying to install third-party modules to get going.
A major goal, and I think one reason for MHonArc's usefulness, is
that it should be easy to install and get going.

--ewh
