Re: <CharsetConverters> for HTML?

2002-08-22 16:46:06
On August 22, 2002 at 12:08, Mooffie wrote:

Charset conversion is not being done to HTML messages.

m2h_text_plain::filter() does call a charset conversion function, if 
applicable, but m2h_text_html::filter() does not.

My Questions:

1. Why is that?

CHARSETCONVERTERS is designed to work on raw text data that is
to be converted to HTML, not the other way around.

2. Suppose I have HTML messages in different encodings. My archive 
contains the messages in their original encoding, and, apart from having 
a charset soup, it seems that I don't have any means to specify the 
correct <meta http-equiv="content-type" content="text/html; 
charset=XXX"> tag. Isn't this an i18n design flaw?

It is not an i18n flaw.  The solution wrt HTML is to convert to Unicode.

Currently, there is no easy way to set the <meta http-equiv>
charset dynamically in MHonArc; it would require custom coding.
It would be nice if MHonArc supported some way for a filter to
signal to mhonarc what charset the document should be set to,
but I'm not sure how the interface (for the user and for the
programmer) should be implemented (and what should mhonarc do
if it is given multiple charset values for the same message!?).
Plus, this kind of capability would require that all page layout
resource settings use only ASCII, with *named* entity references
for any non-7-bit characters.

3. One way to somehow solve the above problem is to instruct MHA, using 
<MIMEAltPrefs>, to prefer the text/plain media-type over text/html. 

This is good practice, mainly for security reasons.  But it does not
solve the general problem (see below).

Another way is to write my own HTML filter to do the conversion, and 
then call MHA's filter, m2h_text_html::filter. Am I correct?

That is a possibility, but it would require coding effort.

Are there other solutions?

The general problem is that a single MIME message can contain
textual data in a variety of character sets.  This becomes a problem
when one tries to convert it into a single document.

One approach is to have everything mapped to Unicode.  In the latest
release, there is some support for this, and an example resource
file, utf-8.mrc, shows how to do it (the resource file is provided
in an appendix of the docs and in the examples/ directory of the
distribution).  Note, if you are using a search engine for your mhonarc
archives, you may not be able to use UTF-8, since I do not know of any
(free) search engines that support it.  Check the docs of the
search engine software you are using.

Using UTF-8 does not, by itself, solve the HTML mail data problem,
since the HTML entities in messages may not be encoded in UTF-8, so
you still have the problem of doing a character set conversion of an
HTML document.  Using MIMEALTPREFS will work if a text/plain
alternative is available, but it does nothing if a message is all HTML.

A possible workaround is to use MIMEALTPREFS but, for cases where all
there is is HTML, save the HTML off as an external file.  The
filter can be used to do this, but the HTML is not filtered for
potential XSS exploits (if this is a concern for you -- which it
should be if your archives are made available on a public web server).
Hence, you would need a custom filter that strips out dangerous markup
before saving to a file.  The existing filter could be
leveraged for this task.
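Such a filter would be written in Perl inside MHonArc itself; as a
rough sketch of the kind of stripping involved, here is a Python
illustration (the function name and regex rules are my own, and a
real sanitizer should use a proper HTML parser rather than regexes):

```python
import re

def strip_dangerous_markup(html):
    """Remove the most obvious active content from an HTML string.

    A regex-based approach is only a heuristic; it shows the idea,
    not a production-grade sanitizer.
    """
    # Drop <script>...</script> blocks entirely.
    html = re.sub(r'(?is)<script\b.*?</script>', '', html)
    # Drop inline event-handler attributes (onclick, onload, ...).
    html = re.sub(r'(?i)\s+on\w+\s*=\s*("[^"]*"|\'[^\']*\'|\S+)', '', html)
    # Neutralize javascript: URLs in href/src attributes.
    html = re.sub(r'(?i)(href|src)\s*=\s*(["\']?)\s*javascript:[^"\'>\s]*',
                  r'\1=\2#', html)
    return html

print(strip_dangerous_markup(
    '<p onclick="evil()">hi</p><script>alert(1)</script>'))
# → <p>hi</p>
```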

The key advantage of saving the HTML to a separate file is that
it can have its own charset setting.  If the document does not
already define <meta http-equiv>, one can be added using the
charset value specified in the Content-Type field of the entity header.
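In concrete terms, the step is: check the saved document for an
existing charset declaration, and if there is none, inject a <meta
http-equiv> tag using the charset taken from the Content-Type
header.  A Python sketch of that step (the function name is mine;
MHonArc itself would do this in Perl):

```python
import re

def add_meta_charset(html, charset):
    """Insert a <meta http-equiv> tag declaring `charset` unless the
    document already declares one.  `charset` would come from the
    charset parameter of the entity's Content-Type header."""
    if re.search(r'(?i)<meta[^>]+charset\s*=', html):
        return html  # document already declares its charset
    meta = ('<meta http-equiv="Content-Type" '
            'content="text/html; charset=%s">' % charset)
    # Put the tag right after <head> if present, else prepend it.
    new_html, n = re.subn(r'(?i)(<head[^>]*>)', r'\1' + meta, html, count=1)
    return new_html if n else meta + html
```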

A possible quick solution for converting the HTML itself is to use
the Unicode::MapUTF8 module directly, i.e. convert all HTML data
from its specified charset to UTF-8.  This should be safe since
HTML markup should not be affected by the translation (it is
all in the ASCII range).
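The suggestion above is Perl's Unicode::MapUTF8; the equivalent
operation in Python (shown only to illustrate why the markup
survives the conversion untouched) is a plain decode/encode pair:

```python
def html_to_utf8(raw_bytes, charset):
    """Recode an HTML document from `charset` to UTF-8.

    The markup itself is pure ASCII, so tags come through unchanged;
    only the non-ASCII text content is re-encoded.
    """
    return raw_bytes.decode(charset).encode('utf-8')

# Latin-1 bytes: "caf\xe9" (cafe with an accent) inside ASCII markup.
src = b'<p>caf\xe9</p>'
print(html_to_utf8(src, 'iso-8859-1'))
# → b'<p>caf\xc3\xa9</p>'
```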

