Re: Tool to split huge archive before using MHonarc

2003-04-07 20:45:36
I would like to use MHonarc to process a huge 50meg file of mailing list
messages that go back to 1995. Before doing so I'd like to break the large
file up into separate monthly files in the form of YYYYMM. Are there any
fairly recent, stable tools that will do this? I tried an old program called
"spmail" but it craps out with a segmentation fault.

Hi Kevin,

I see that you're using Outlook Express as your mailer, so I don't
know whether this will help you, since it's strictly UNIX based.  If
not, maybe it will help others on the list.

I also don't know what format your mail archive is in; if it's in Berkeley
mbox format the UNIX tools procmail and formail will probably work for you.
If not, let me know and we can probably still devise a scheme based on 
formail to split up your mailbox.

So...  The assumptions here are that you have access to UNIX and your 
mailbox is in Berkeley mbox format.

The first thing to do is create a procmailrc file that looks like this 
(call the file whatever you want):

   :0 Wic
   * ? test ! -d $DIR
   | mkdir $DIR

   * ^From.*\/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+[0-9]+[ ]+[0-9:]+[ ]+(19|20)[0-9][0-9]

      * FROM ?? ^^\/[^   ]+
      { MONTH = $MATCH }
      * FROM ?? .*(19|20)\/[0-9][0-9]
      { YEAR = $MATCH }

Change DIR to be the directory where you want your mailboxes to 
appear (I recommend putting them in their own subdirectory).  Now run
this command:

   formail -s procmail -m procmailrc <

Be sure to replace procmailrc with the name of the file that you created
above, and give the name of your mailbox archive after the "<".

There's one main difference between what you asked for and what you're
getting here -- the filenames will be in the format of YYYYmmm, where
mmm is the three letter abbreviation for the month.  
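
Before splitting anything, it may be worth previewing which YYYYmmm
names your archive would produce.  The following is only a rough sketch,
not part of procmail or formail: preview_buckets is a name I made up,
and it assumes every message starts with a From_ line of the form
"From addr Dow Mon DD HH:MM:SS YYYY".  Lines that don't fit are counted
as "unknown".

```shell
#!/bin/sh
# Dry run: count how many messages would land in each YYYYmmm mailbox.
# preview_buckets is a hypothetical helper; $1 is the mbox file to scan.
# Assumes From_ lines look like: From addr Dow Mon DD HH:MM:SS YYYY
preview_buckets() {
    awk '
        /^From / {
            # Field 7 is the year, field 4 the month abbreviation.
            key = (NF >= 7) ? ($7 $4) : "unknown"
            n[key]++
        }
        END { for (k in n) print k, n[k] }
    ' "$1" | sort
}
```

Running "preview_buckets your-archive" prints one line per mailbox name
with its message count, so you can spot odd dates before running the
real split.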

If your system has the GNU date program installed (part of the GNU
sh-utils package and most likely installed on Linux systems) you can
change things even further.  Your procmailrc file could look like this:

   :0 Wic
   * ? test ! -d $DIR
   | mkdir -p $DIR

   * ^From [^ ]+[ ]+\/.*
      DATE=`date --date="$MATCH" +%Y%m`


If all went well, your mailboxes will be split up the way that you asked.
You can change the format by tweaking the format sequence passed to the
date command (%Y%m in this case).  If there are messages in $DIR/unknown,
they will have failed this process and may need to be handled manually.
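
If you want to try out a format sequence before rerunning the whole
split, something along these lines may help.  It is a sketch only:
from_line_key is a hypothetical helper of mine, it relies on the GNU
--date extension, and it strips the "From addr" prefix with sed in
roughly the way the recipe's match does.

```shell
#!/bin/sh
# Show the mailbox name that a given From_ line would map to.
# from_line_key is a hypothetical helper; --date needs GNU date.
#   $1: a complete From_ line
#   $2: optional date format sequence (defaults to %Y%m)
from_line_key() {
    # Drop "From addr " and keep the timestamp that follows it.
    stamp=$(printf '%s\n' "$1" | sed -n 's/^From [^ ][^ ]*  *//p')
    if [ -z "$stamp" ]; then
        echo unknown
        return
    fi
    # Fall back to "unknown" when date cannot parse the timestamp.
    LC_ALL=C date --date="$stamp" "+${2:-%Y%m}" 2>/dev/null || echo unknown
}
```

For example, from_line_key 'From kevin@example.com Mon Apr  7 20:45:36 2003'
should print 200304, and passing '%Y-%m' as a second argument should
print 2003-04 instead.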

