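I'm guessing you're hitting unescaped From lines. In an mbox file, a line
starting with "From " marks the start of a new message, so an unescaped one
inside a body cuts that message short at that point.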
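A quick way to test that guess, as a rough sketch only: real mbox separator lines normally follow a blank line, so a "From " line that comes straight after a non-blank line is probably body text that was never escaped. The function name `suspect_from_lines` is made up for this example, and the heuristic will miss a body "From " line that happens to follow a blank line.

```shell
#!/bin/sh
# Heuristic sketch (not part of the original script): print the line number
# and text of every "From " line that does NOT follow a blank line.
# suspect_from_lines is a hypothetical helper name.
suspect_from_lines() {
    awk 'NR > 1 && /^From / && prev != "" { print NR ": " $0 }
         { prev = $0 }' "$1"
}

# Usage: suspect_from_lines archive.mbox
```

If this prints anything, those lines are good candidates for escaping before running mhonarc.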
This is my script for processing individual messages. It escapes body lines
that start with From or with a dot, and rewrites each file in place.
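pid=$$
if [ -z "$*" ]; then
    # nothing to do without at least one message file
    echo "usage: $0 message-file..." >&2
    exit 1
fi
for f in "$@"; do
    sed -e '/^$/ q' "$f" >head.$pid      # headers, through the first blank line
    sed -e '1,/^$/ d' "$f" >tmp.$pid     # body, everything after that blank line
    sed -e 's/^From/>From/' -e 's/^\./ \./' tmp.$pid >body.$pid
    cat head.$pid body.$pid >"$f"        # overwrite the original message
    rm head.$pid tmp.$pid body.$pid
done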
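The same escaping can also be sketched as a single filter with no temporary files, which makes it easy to try on one sample message before touching real files. This is a variant under my own assumptions, not the script above: the function name `escape_body` is hypothetical, and it reads one message on standard input and writes the escaped copy to standard output.

```shell
#!/bin/sh
# Single-pass sketch: leave the header block (line 1 through the first blank
# line) untouched and escape only body lines.
# escape_body is a hypothetical name used for illustration.
escape_body() {
    sed -e '1,/^$/!s/^From/>From/' \
        -e '1,/^$/!s/^\./ ./'
}

# Usage: escape_body <message.txt >message.escaped
```

Writing to a separate output file keeps the original message around until you have checked the result.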
>I've been trying to migrate a collection of messages from yahoogroups to
>sympa (which uses mhonarc as its archiving engine).
>There's a great little script, yahoo2mbox, that pulls messages from
>yahoogroups and aggregates them into an mbox file - ideal for processing
>Unfortunately, when I run mhonarc on the mbox file, it seems to cut out
>the bodies of a lot of, but not all of, the messages - leaving the header
>intact. It seems like messages that originated with MS Outlook are
>particularly likely to end up with empty bodies.
>Now I've read the archives of this list, and this seems to be a known
>problem with mhonarc filtering out malformed HTML, but I haven't seen
>any recent traffic indicating a solution of any sort.
>So... has anybody come up with a straightforward way to clean up an mbox
>file sufficiently for mhonarc to process? (e.g. a way to run the mbox
>file through HTML Tidy or some such)? Or can anybody offer some
>suggestions, recipes, recent experiences, etc.?
pegmgr at peg dot com