Re: Archiving sent mail; Attachments with non-ascii names; Preserving charset of message

2002-12-16 16:50:55
On December 16, 2002 at 22:09, Tomasz Ostrowski wrote:

I needed to archive sent mail with MHonArc and I needed to put the
contents of the To: header into the message index. It was not possible with
MHonArc-2.5.13 so I wrote a small patch that added rc-variable $TO$.

Then I used
        $MSGLOCALDATE(CUR;%Y-%m-%d %H:%M)$<br>
        <em>From</em>: $FROM$<br>
        <em>To</em>: $TO$

Check the attached file MHonArc-2.5.13-to.patch.

The preferable method is to allow for arbitrary message header
variables instead of just To:.  Otherwise, you end up replicating
code when people want 'cc' or other fields.

I've considered such a feature, but it does impact things like
mha-dbrecover and the types of comments that should be placed in
message files to allow recovering.  And then address harvesters
complicate things (i.e., SPAMMODE).  May just have to punt on trying
to provide mha-dbrecover ability of arbitrary message header resource

What I envision is something like the following:


And then for resource variable usage, you would access them like
the following:


2. Attachments with non-ascii names

I had problems accessing attachments extracted with MHonArc from
Windows if they had non-ascii characters in their names or characters
forbidden for file names: \/:*?"<>| (when using m2h_external::filter;

So I have written a patch that converts both types of characters to
underscores, just as the original MHonArc does with spaces.

Check the attached file MHonArc-2.5.13-attachment_name.patch.

Good catch.

It would probably be more efficient to just exclude whitespace and
non-ascii characters in one tr// operation:

  $fname =~ tr/\0-\40\t\n\r\177-\377/_/;
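For comparison, here is a rough sketch of the same clean-up that also
covers the Windows-forbidden characters listed in the original report.
It is written in Python purely for illustration (MHonArc itself is
Perl), and the names sanitize_filename and FORBIDDEN are hypothetical.
Note that it works on decoded characters, so a non-ascii character
becomes one underscore, whereas a byte-wise tr// would emit one
underscore per byte.

```python
import re

# Characters Windows forbids in file names, as listed in the report above.
FORBIDDEN = '\\/:*?"<>|'

# Anything outside printable ASCII (0x21-0x7e), plus the forbidden punctuation.
_UNSAFE = re.compile(r'[^\x21-\x7e]|[' + re.escape(FORBIDDEN) + ']')


def sanitize_filename(name: str) -> str:
    """Replace every unsafe character in `name` with an underscore."""
    return _UNSAFE.sub('_', name)


if __name__ == '__main__':
    print(sanitize_filename('monthly report: Q4?.txt'))
```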

3. Preserving charset of message

Most mails I have to convert use the Central European ISO-8859-2
encoding. Converting it to named entities did not work - it lacks

The named entities are going away for most of the iso-8859-x sets
since they were based on SGML.  They will be replaced with
Unicode character entity references, and it has already been
done for the latest snapshot builds.

browsers support. Using UTF-8 would make my archives un-grep-able so
I wrote a patch that lets text/plain MIME parts preserve their
original charset by adding rc-variable $CHARSET$.

This feature is insufficient.  It assumes that messages only contain
a single text entity part, which of course is wrong when dealing
with MIME messages.  MIME allows you to have multiple text entities,
with each one having a different charset.  Therefore, with your patch,
the last filtered entity wins out while the text from the other
entities is mis-rendered in the browser.

(I have thought of doing something like your patch does in the past,
 but due to the multipart issue, I did not.)
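To make the multipart issue concrete, here is a small sketch using
Python's standard email package (an illustration only, not MHonArc
code; part_charsets is a hypothetical helper). It builds one message
with two text parts in different charsets, which is exactly the case
a single per-page charset value cannot describe.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def part_charsets(msg):
    """Return the declared charset of every text/* part, in message order."""
    return [part.get_content_charset()
            for part in msg.walk()
            if part.get_content_maintype() == 'text']


# One message, two text entities, two different charsets.
msg = MIMEMultipart('mixed')
msg.attach(MIMEText('Łódź', 'plain', 'iso-8859-2'))
msg.attach(MIMEText('Привет', 'plain', 'koi8-r'))

if __name__ == '__main__':
    print(part_charsets(msg))
```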

A more robust solution is under development where you will be able
to define a final text encoding that all text entities should be
converted to.  Generally, you would use it to map everything to
utf-8, but if certain Perl modules are installed, you could have all
text data encoded to what you choose, like iso-8859-2.  Of course,
choosing a non-universal encoding may cause characters to get "lost"
in text entities originally encoded in a different charset, but this
may be acceptable in some locales.
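The trade-off can be sketched as follows, again in Python for
illustration only. The transcode function is hypothetical and is not
the interface under development; it just decodes each entity from its
own charset and re-encodes to one chosen target, replacing anything
the target cannot represent with '?'.

```python
def transcode(raw: bytes, source: str, target: str) -> bytes:
    """Decode `raw` from `source` and re-encode it to `target`.

    Characters the target charset cannot represent are replaced with '?'.
    """
    return raw.decode(source).encode(target, errors='replace')


polish = 'Łódź'.encode('iso-8859-2')
russian = 'Привет'.encode('koi8-r')

if __name__ == '__main__':
    # utf-8 can hold both texts without loss.
    print(transcode(polish, 'iso-8859-2', 'utf-8').decode('utf-8'))
    print(transcode(russian, 'koi8-r', 'utf-8').decode('utf-8'))
    # iso-8859-2 has no Cyrillic, so those characters are lost.
    print(transcode(russian, 'koi8-r', 'iso-8859-2'))
```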

As for the un-grepable utf-8, I think people will eventually have to
deal with it if they want to have archives that are multi-lingual.
Perl 5.8 finally has robust utf8 support, so whipping up a grep-like
tool in Perl would solve the "grep" problem.
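As a rough idea of how small such a tool can be, here is a sketch in
Python rather than Perl (for illustration only; grep_utf8 is a
hypothetical name). It assumes the archive files are utf-8 encoded and
matches the pattern against decoded text, so non-ascii search terms
work as expected.

```python
import re


def grep_utf8(pattern, lines):
    """Yield (line_number, text) for each utf-8 encoded line matching `pattern`.

    `lines` is any iterable of bytes, e.g. a file opened in binary mode.
    Undecodable bytes are replaced rather than aborting the search.
    """
    rx = re.compile(pattern)
    for number, raw in enumerate(lines, start=1):
        text = raw.decode('utf-8', errors='replace').rstrip('\r\n')
        if rx.search(text):
            yield number, text


if __name__ == '__main__':
    sample = ['Subject: Łódź\n'.encode('utf-8'), b'Subject: hello\n']
    for number, text in grep_utf8('Łódź', sample):
        print(f'{number}: {text}')
```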

Unfortunately, HTML does not allow mixed character encodings in
the same document, making things problematic when trying to
convert MIME mail into HTML.

