I've been trying to migrate a collection of messages from yahoogroups to
sympa (which uses mhonarc as its archiving engine).
There's a great little script, yahoo2mbox, that pulls messages from
yahoogroups and aggregates them into an mbox file - ideal for processing.
Unfortunately, when I run mhonarc on the mbox file, it seems to cut out
the bodies of many, though not all, of the messages, leaving the headers
intact. Messages that originated with MS Outlook seem particularly
likely to end up with empty bodies.
Now I've read the archives of this list, and this seems to be a known
problem with mhonarc filtering out malformed HTML, but I haven't seen
any recent traffic indicating a solution of any sort.
So... has anybody come up with a straightforward way to clean up an mbox
file sufficiently for mhonarc to process (e.g. a way to run the mbox
file through HTML Tidy or some such)? Or can anybody offer some
suggestions, recipes, recent experiences, etc.?
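
To make the question concrete, here is roughly the kind of preprocessing
I have in mind. It is only a sketch in Python (standard library only),
and I have not tested it against mhonarc. It assumes, without my having
confirmed it, that the empty bodies come from messages whose only body
is an HTML part. Rather than repairing the HTML with HTML Tidy, it
simply strips the tags and rewrites each message as plain text (any
attachments are dropped). The names html_to_text, plain_body and
clean_mbox are my own, not part of any existing tool.

```python
import mailbox
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div"):
            self.chunks.append("\n")  # keep rough paragraph breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags from (possibly malformed) HTML and return the text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered after the last tag
    return "".join(parser.chunks).strip()


def decode_part(part):
    """Decode one leaf MIME part to str, falling back to latin-1."""
    payload = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or "latin-1"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # charset name Python does not know
        return payload.decode("latin-1", errors="replace")


def plain_body(msg):
    """Return a plain-text body, preferring text/plain over text/html."""
    plain, html = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        if ctype == "text/plain":
            plain.append(decode_part(part))
        elif ctype == "text/html":
            html.append(html_to_text(decode_part(part)))
    return "\n".join(plain or html)


def clean_mbox(src_path, dst_path):
    """Copy src_path to dst_path, flattening every message to text/plain."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        for msg in src:
            body = plain_body(msg)
            # Drop the old MIME headers so set_payload can write new ones.
            for header in ("Content-Type", "Content-Transfer-Encoding"):
                del msg[header]
            msg.set_payload(body, charset="utf-8")
            dst.add(msg)
        dst.flush()
    finally:
        src.close()
        dst.close()
```

I would run it as clean_mbox("group.mbox", "group-clean.mbox") (file
names are placeholders) and then point mhonarc at the cleaned file. If
somebody knows that the real cause is something other than HTML-only
bodies, that would save me from going down this road.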