Following are some suggestions.
1.
Currently, charset conversion routines are not applied to HTML messages.
That is, m2h_text_html::filter() ignores any converters registered with
<CharsetConverters>.
As a result we get pages in our archive in different encodings, which is
a bad thing for at least three reasons:
(a) We may encounter difficulties using external tools such as search
engines, grep(1) etc.
(b) We have no way to tell the browser which encoding the page uses
(i.e. no <meta http-equiv> tag).
(c) The user's browser may not support all encodings.
My suggestion is to apply the charset conversion routines to HTML
messages as well. The fact that the current conversion routines escape
HTML special characters (i.e. '<', '>', '&') complicates matters a bit.
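
To illustrate the idea, here is a minimal sketch in Python (not MHA's actual Perl code; the function name html_to_ascii_ncr is made up for this example). It assumes the HTML body arrives as raw bytes in a known charset, and it re-expresses every non-ASCII character as a numeric character reference while leaving '<', '>' and '&' alone, so the markup survives:

```python
def html_to_ascii_ncr(raw: bytes, charset: str) -> str:
    """Decode an HTML body and rewrite non-ASCII characters as numeric
    character references, without escaping '<', '>' or '&'."""
    text = raw.decode(charset)
    # 'xmlcharrefreplace' turns each character that ASCII cannot hold
    # into '&#NNNN;'; ASCII markup characters pass through unchanged.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# A Hebrew paragraph as an iso-8859-8 MUA might send it.
sample = '<p dir="rtl">\u05e9\u05dc\u05d5\u05dd &amp; hi</p>'.encode("iso-8859-8")
print(html_to_ascii_ncr(sample, "iso-8859-8"))
# prints: <p dir="rtl">&#1513;&#1500;&#1493;&#1501; &amp; hi</p>
```

The result is pure ASCII, so it displays correctly regardless of the encoding the page is served in.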
Although in some communities HTML messages are frowned upon, when one
uses a bidirectional language, sometimes HTML is a must, because plain
text lacks the higher protocol needed to specify directionality.
==================================================
Improvements to the new UTF-8 support:
2.
There are two instances where we (and MHA) don't know the charset of the
data:
(a) We don't know the charset of the body of the message when no
"charset=..." is present in the "content-type" header.
(b) We don't know the charset of the headers (e.g. "subject") when the
MUA sends raw 8-bit octets instead of encoding them as RFC 1522
prescribes (RFC 2047 has since superseded it). Some web-mails do this,
and so does Outlook Express when the "Allow 8-bit characters in
headers" option is checked.
Since MHA doesn't know the charset of the data, a UTF-8 conversion can't
be carried out.
Although m2h_text_plain::filter() has a "default" argument that allows
us to specify a default charset, this doesn't apply to headers, and,
besides, m2h_text_html::filter() doesn't have such an argument. Also,
although MHA supports the pseudo charset "plain", we still have no way
to tell the various conversion routines in what charset the data is.
My suggestion is to create a new resource, <DefaultCharset>, that allows
one to specify a default charset. This charset will be passed to the
conversion routines when no charset is explicitly specified (including
headers).
Example:
<DefaultCharset>
windows-1255
</DefaultCharset>
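
The fallback I have in mind could look roughly like this. It is a Python sketch only, not MHA code: DEFAULT_CHARSET stands in for the value of the proposed <DefaultCharset> resource, and effective_charset and decode_header are hypothetical helper names.

```python
# Stands in for the value read from the proposed <DefaultCharset> resource.
DEFAULT_CHARSET = "windows-1255"


def effective_charset(declared, default=DEFAULT_CHARSET):
    """Return the declared charset, or the default when none was given."""
    if declared is None or not declared.strip():
        return default
    return declared.strip().lower()


def decode_header(raw: bytes, declared=None) -> str:
    """Decode a raw 8-bit header, falling back to the default charset."""
    return raw.decode(effective_charset(declared))


# A subject header sent as raw octets, with no charset information at all.
print(decode_header(b"\xe0") == "\u05d0")  # True: 0xE0 is Alef in windows-1255
```

The same helper would serve both message bodies without a "charset=..." parameter and headers containing raw 8-bit octets.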
3.
Misconfigured MUAs, including some web-mails, may declare an incorrect
charset. For example, Yahoo mail always appends "charset=us-ascii" to
outgoing messages, even when the user writes in Hebrew.
As a result, the UTF-8 conversion routine assumes it is converting
us-ascii data, while the data is actually in iso-8859-8.
My suggestion is to create a new resource, <CharsetAliases>, to have MHA
treat some charsets as others. Then, for example, if I have a Hebrew
mailing list, I'd write:
<CharsetAliases>
iso-8859-8; us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined
</CharsetAliases>
(which reads: "us-ascii and iso-8859-1 and . . . are aliases for
iso-8859-8")
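
A sketch of how such a resource body could be parsed and applied, written in Python for brevity. The "target; alias alias ..." syntax is only my proposal above, and parse_charset_aliases and resolve are hypothetical names:

```python
def parse_charset_aliases(body: str) -> dict:
    """Parse lines of the form 'target; alias1 alias2 ...' into a mapping
    from each alias to its target charset (all names lowercased)."""
    aliases = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or ";" not in line:
            continue  # skip blank lines and anything without a target
        target, names = line.split(";", 1)
        for name in names.split():
            aliases[name.lower()] = target.strip().lower()
    return aliases


def resolve(charset: str, aliases: dict) -> str:
    """Return the charset the declared name should be treated as."""
    name = charset.strip().lower()
    return aliases.get(name, name)


aliases = parse_charset_aliases(
    "iso-8859-8; us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined"
)
print(resolve("US-ASCII", aliases))  # iso-8859-8
print(resolve("utf-8", aliases))     # utf-8 (no alias, returned unchanged)
```

The lookup would happen once, before a converter is chosen, so every later step sees only the resolved name.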
BTW, in mhtxtplain.pl you wrote:
%asis = ('us-ascii' => 1); # XXX: Should us-ascii always be "as-is"?
The answer is, "No!" If you always treat us-ascii "as-is" you don't give
the administrator a chance to register a CharsetConverter with us-ascii
in order to handle misconfigured MUAs.
4.
I see that UTF8.pm includes a few hard-coded aliases (e.g.
"windows-1250" --> "cp1250"). It might be possible to extend
<CharsetAliases> to have this function too; for example:
<CharsetAliases>
cp1250; windows-1250
. . .
cp1255; windows-1255
. . .
apple-hebrew; x-mac-hebrew
</CharsetAliases>
5.
Although UTF-8 has its advantages, some administrators might prefer
their national 8-bit encoding (because it requires less disk space,
because they already have 3rd party tools that work with it (e.g. search
tools), etc). It seems that it won't be difficult to create a new
conversion routine (one can start from MHonArc::UTF8::str2sgml) that
converts everything to a common arbitrary encoding, which can be an
8-bit one. A new resource, e.g. <TargetEncoding> or <ArchiveEncoding>,
could determine this target encoding (which could also be "utf-8"(!), so
this routine could eventually obsolete MHonArc::UTF8::str2sgml).
So, for example, for a windows-1255 archive we could have:
<ArchiveEncoding>
windows-1255
</ArchiveEncoding>
<CharsetConverters override>
default; MHonArc::Converter::str2sgml; MHonArc/Converter.pm
</CharsetConverters>
And for a UTF-8 archive:
<ArchiveEncoding>
utf-8
</ArchiveEncoding>
<CharsetConverters override>
default; MHonArc::Converter::str2sgml; MHonArc/Converter.pm
</CharsetConverters>
(Note that we don't list "us-ascii" in <CharsetConverters> (in contrast
with the utf-8 mrc example you provide with MHA), because "us-ascii"
might be <CharsetAliases>'ed to something else that can't be handled
"as-is". Also, we don't list "plain" because it's the responsibility of
the conversion routine: when it gets an empty charset argument, it looks
for the one specified in the <DefaultCharset> resource.)
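
To make the proposed routine more concrete, here is a rough Python sketch of what such a converter might do. This is not the existing MHonArc::UTF8::str2sgml; to_archive_encoding and its parameters are hypothetical, with archive_encoding and default_charset standing in for the <ArchiveEncoding> and <DefaultCharset> values. Characters that the target encoding cannot represent become numeric character references:

```python
import html


def to_archive_encoding(raw: bytes, charset: str, archive_encoding: str,
                        default_charset: str = "windows-1255") -> bytes:
    """Convert raw message text to the archive encoding, escaping HTML
    special characters and using numeric character references for
    anything the archive encoding cannot represent."""
    # An empty charset argument means "unknown": use the default charset.
    source = charset or default_charset
    text = raw.decode(source, errors="replace")
    # Escape '&', '<' and '>' (plain-text data, so markup must not leak).
    escaped = html.escape(text, quote=False)
    return escaped.encode(archive_encoding, errors="xmlcharrefreplace")


# Latin-1 text going into a windows-1255 archive: 'e acute' has no slot there.
print(to_archive_encoding("a<b \u00e9".encode("iso-8859-1"),
                          "iso-8859-1", "windows-1255"))
# b'a&lt;b &#233;'
```

With archive_encoding set to "utf-8" nothing ever needs a character reference, which is why the same routine could cover the UTF-8 case as well.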
We can also have a corresponding resource variable, $ArchiveEncoding$,
and put a meta tag on every page:
<MsgPgBegin>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=$ArchiveEncoding$">
<title>$SUBJECTNA$</title>
</head>
</MsgPgBegin>
======== End of UTF-8 notes, back to 8-bit: =============
6.
There are some conversion tables in the CharEnt directory
(ISO8859_*.pm). Almost all of them are incorrect, because you're using
entity names that don't exist in HTML. I know of no browser that
recognizes these names. The HTML spec defines only a handful of
character names, so the correct way is to use numeric character
references (that is, "&#", optionally an "x", the Unicode value, ";"). I'd be
happy to help you fix these tables (but perhaps it would be better to
abandon them and instead implement the routine I suggested in #5 using
the various Unicode::* modules (Yes, I know you don't want to tell your
users they must have a Unicode::FooBar module installed, but isn't the
alternative -- to include tens or hundreds of conversion tables with
MHA -- worse?)).
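
As an illustration of what corrected table entries would contain, the following Python sketch derives numeric character references for the upper half of an 8-bit charset from a codec library instead of a hand-written table. It is an assumption-laden example, not a proposal for MHA's table format, and ncr_table is a made-up name:

```python
def ncr_table(charset: str, first: int = 0xA0, last: int = 0xFF) -> dict:
    """Map each byte in [first, last] to a hexadecimal numeric character
    reference for the Unicode character it denotes in the given charset."""
    table = {}
    for byte in range(first, last + 1):
        try:
            char = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the charset
        table[byte] = "&#x%X;" % ord(char)
    return table


hebrew = ncr_table("iso-8859-8")
print(hebrew[0xE0])  # &#x5D0; (HEBREW LETTER ALEF)
```

Every browser that understands numeric character references can render such output, with no dependence on entity names.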
If you decide to implement any of the above, let me know if I can help.