I've been writing a small Perl script that
scans MHonArc archives and HTMLizes the message
bodies. I've been using Txt2Html as a library
to analyze the raw text and convert it to HTML.
The simplest messages are converted fine:
- paragraph recognition
- short line breaks
Sometimes the results are not convincing:
- quotations not always detected
- ordered lists get mixed up
First, it may be better to hook in your filter as a MIME filter
into mhonarc. This way the conversion is done as mhonarc processes
messages. See the MIMEFILTERS resource for details.
As for converting text to HTML, you will never get a perfect solution.
Any filter requires heuristics, and it is practically impossible to
get them working in all cases. From my experience, it is best to
compromise and use simple heuristics that will work all the time.
It is better to have all messages readable/usable than to have
messages that get garbled because the heuristics did not handle
particular cases properly.
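
To illustrate what "simple heuristics" could look like, here is a minimal
sketch in Python (not MHonArc's or Txt2Html's actual code; the function name
text_to_html and the two rules are assumptions made for this example, and the
same logic ports directly to Perl). It only escapes markup, splits on blank
lines, and treats a block as a quotation when every one of its lines starts
with ">". Anything it does not recognize stays as a plain paragraph, so no
text is ever dropped or reordered.

```python
import html
import re


def text_to_html(text: str) -> str:
    """Convert plain text to HTML using two conservative rules.

    Hypothetical helper for illustration only.
    """
    text = text.replace("\r\n", "\n").strip()
    if not text:
        return ""

    out = []
    # Rule 1: one or more blank lines separate blocks.
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        # Always escape markup and keep the original line breaks.
        escaped = "<br>\n".join(html.escape(line) for line in lines)
        # Rule 2: a block is a quotation only if every line starts with ">".
        if all(line.lstrip().startswith(">") for line in lines):
            out.append(f"<blockquote>{escaped}</blockquote>")
        else:
            out.append(f"<p>{escaped}</p>")
    return "\n".join(out)


sample = "Hello <world>\nsecond line\n\n> quoted & more\n> again"
print(text_to_html(sample))
```

A block that mixes quoted and unquoted lines simply falls through to the
paragraph case. That is less pretty than full quotation detection, but the
message stays readable.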
You also need to worry about performance. The more complicated
the filter, the longer it will take to do its job. This can be
a problem for archives that are updated automatically as new
messages come in.
If HTML formatting is important, people can always use text/html
messages. multipart/alternative can be used to provide both text/plain
and text/html for recipients who cannot read HTML directly
in their MUAs. There is also the text/setext type, which translates
easily to HTML but is still readable in raw form. MHonArc
comes with a Setext->HTML filter.
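
For reference, a multipart/alternative message can be built with the Python
standard library roughly as follows. This is only a sketch of the sender's
side. The function name build_alternative_message and the sample subject and
bodies are made up for the example.

```python
from email.message import EmailMessage


def build_alternative_message(plain: str, html_body: str) -> EmailMessage:
    """Build a message carrying the same content as text/plain and text/html.

    Hypothetical helper for illustration only.
    """
    msg = EmailMessage()
    msg["Subject"] = "Example"
    # The first part is the plain-text version, for MUAs that cannot show HTML.
    msg.set_content(plain)
    # Adding an alternative turns the message into multipart/alternative
    # and appends the HTML version as the second part.
    msg.add_alternative(html_body, subtype="html")
    return msg


msg = build_alternative_message(
    "Hello, world.\n",
    "<p>Hello, <b>world</b>.</p>\n",
)
print(msg.get_content_type())
print([part.get_content_type() for part in msg.iter_parts()])
```

The plain part comes first and the HTML part last, since the parts of a
multipart/alternative are ordered from least to most preferred.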