mhonarc-users

Re: cleaning up yahoogroups messages?

2007-02-06 08:42:15
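I'm guessing you're hitting unescaped From lines. In an mbox file a line
beginning with "From " marks the start of a new message, so one that
appears unescaped inside a body splits the message at that point.
This is my script for processing individual messages. It rewrites each
file named on the command line in place, leaving the header alone and
escaping body lines that start with "From" or with a dot.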

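pid=$$
if [ $# -eq 0 ]
then
  exit
fi
for f in "$@"
do
  # Header: everything up to and including the first blank line.
  sed -e '/^$/ q' "$f" >head.$pid
  # Body: everything after the first blank line.
  sed -e '1,/^$/ d' "$f" >tmp.$pid
  # Escape body lines that start with "From" or with a dot.
  sed -e 's/^From/>From/' -e 's/^\./ \./' tmp.$pid >body.$pid
  cat head.$pid body.$pid >"$f"
done
rm head.$pid tmp.$pid body.$pid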
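
The three sed passes in the script above can also be folded into a single
filter that needs no temporary files. What follows is only a sketch under
that assumption, not a tested replacement: the function name escape_body
is made up for illustration, and it treats everything through the first
blank line as the header, just as the script does.

```shell
#!/bin/sh
# escape_body (hypothetical name): read one message on stdin and write it
# to stdout. Lines 1 through the first blank line (the header) are left
# untouched; the "!" applies each substitution only outside that range.
escape_body() {
  sed -e '1,/^$/!s/^From/>From/' \
      -e '1,/^$/!s/^\./ ./'
}

# Small demonstration on a made-up message.
printf 'From: a@example.com\nSubject: hi\n\nFrom here on\n.dot\nplain\n' |
  escape_body
# Prints:
#   From: a@example.com
#   Subject: hi
#
#   >From here on
#    .dot
#   plain
```

To rewrite a file in place, send the output to a new file and move it over
the original only if sed succeeded, e.g.
escape_body <msg >msg.new && mv msg.new msg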

   >Hi Folks,
   >
   >I've been trying to migrate a collection of messages from yahoogroups to 
   >sympa (which uses mhonarc as its archiving engine).
   >
   >There's a great little script, yahoo2mbox, that pulls messages from 
   >yahoogroups and aggregates them into an mbox file - ideal for processing 
   >by mhonarc.
   >
   >Unfortunately, when I run mhonarc on the mbox file, it seems to cut out 
   >the bodies of a lot of, but not all of the messages - leaving the header 
   >intact.  It seems like messages that originated with MS Outlook are 
   >particularly likely to end up with empty bodies.
   >
   >Now I've read the archives of this list, and this seems to be a known 
   >problem with mhonarc filtering out malformed HTML, but I haven't seen 
   >any recent traffic indicating a solution of any sort.
   >
   >So... has anybody come up with a straightforward way to clean up an mbox 
   >file sufficiently for mhonarc to process?  (e.g. a way to run the mbox 
   >file through HTML Tidy or some such)?  Or can anybody offer some 
   >suggestions, recipes, recent experiences, etc.?
   >
   >Thanks much,
   >
   >Miles Fidelman
   >
   >

-- 
PEG Manager
pegmgr at peg dot com
