procmail

Re: generic questions

2003-12-08 15:39:37
At 14:58 2003-12-08 -0600, Chuck Campbell wrote:

1)My "ham" emails are in hundreds of separate mbox files.  Can they simply be
"catted" together, or do I need to run procmail with a single recipe to file
them in a new location?

mbx files can be concatenated, but you need to ensure that there's a blank line between the tail end of one message and the start of another.
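
If you do want to concatenate, a minimal sketch along these lines should keep the separator intact. The function name join_mbox and the output filename are made up for illustration, and it assumes every input file is a well-formed mbox whose messages start with a "From " line:

```shell
# join_mbox (hypothetical helper): write each named mbox to stdout,
# making sure a blank line separates one file's last message from
# the next file's first "From " line.
join_mbox() {
  for f in "$@"; do
    cat "$f"
    # $(...) strips a trailing newline, so a non-empty result means the
    # file's last line is unterminated; terminate it first.
    if [ -n "$(tail -c 1 "$f")" ]; then
      echo
    fi
    echo   # the blank separator line
  done
}

# Example (output name chosen so it does not match the *.mbx glob):
#   join_mbox *.mbx > combined.mbox
```

A file that already ends in a blank line picks up a second one this way, which should be harmless but does end up attached to that message's body.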

You could simply invoke a formail splitter on each mbx:

for x in *.mbx
 do
  formail -s procmail -m reprocessor.rc < "$x"
 done

Note that the reprocessor is assumed to contain the messageid cache code you mention below, or you could have it on the invocation line before the -s arg:

  formail -D 1000000 reprocessor.cache -s procmail -m reprocessor.rc < "$x"

By invoking it on the separate mailboxes, you avoid issues with the need for the separating lines between mailboxes if you were to concatenate them. You also avoid needing the interim disk space required in order to dump them all into one large mbx to THEN split for reprocessing.

2)Can procmail help me with weeding out duplicates in this file? I use it for removing duplicates in my normal rc file, but my historical mail have all been through this procmail recipe once before. I normally use this:

[snip]

The recipe in 'man procmailex' for eliminating duplicates based upon a messageid cache should work, although if you're processing a large number of messages all at once (versus messages arriving with some shred of chronological relationship), you probably want to bump the cache size way up. Example - you process a thousand messages from one MBX covering a couple of years on some discussion list, then go to another MBX which contains directly-addressed messages for the same time period - but the first mbx probably saturated the cache and caused it to eject old entries, so many of the dupes won't be matched by the time you run through the second mbx.

Will running my existing mbox folders through this again result in either
 a)confusion for my regular mail?

Use a different msgid.cache file (if you don't specify a hard-coded path before it in the rcfile, then it's going to be relative to $MAILDIR, so run the process in a different directory if you're including the code, and the cache file will show up there).

 b)skipping any messages seen previously?

Yes, if the cache is too small for the number of messages you're going to process. IMO, very likely if you're reprocessing very much email, and ESPECIALLY so if they're from separate files - if they were all in one file (written that way upon receipt, not simply concatenated into one), then they could reasonably be expected to have some general chronological ordering (allowing for some staggering due to mail delivery delays), so the dupes would be expected to be near one another, and therefore their individual messageids wouldn't expire from the cache before the second message had been encountered.

Should I just change the msgid.cache to say msgid2.cache to avoid this issue?

That'd be one solution. But if you're going to manually reprocess your mail, how about you do it from a directory OTHER than where your regular mail gets stuffed?
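
As a rough sketch of that, with the cache given on the formail command line so that it lands in whatever directory you run from. The function name reprocess_all and its three arguments are invented for illustration. Where the rcfile itself delivers mail still depends on how it sets MAILDIR, so this only keeps the duplicate cache away from your regular one:

```shell
# reprocess_all (hypothetical helper): run every *.mbx in SRC_DIR through
# formail/procmail from inside a separate working directory.
#   $1 = directory holding the *.mbx files (absolute path)
#   $2 = rcfile to reprocess with (absolute path)
#   $3 = scratch working directory, created if missing
reprocess_all() {
  src_dir=$1
  rc_file=$2
  work_dir=$3
  mkdir -p "$work_dir" || return 1
  (
    # Subshell, so the caller's working directory is left alone.
    cd "$work_dir" || exit 1
    for x in "$src_dir"/*.mbx; do
      [ -f "$x" ] || continue   # glob matched nothing
      # The relative cache name resolves against $work_dir.
      formail -D 1000000 reprocessor.cache -s procmail -m "$rc_file" < "$x"
    done
  )
}

# Example (all three paths are placeholders):
#   reprocess_all "$HOME/oldmail" "$HOME/reprocessor.rc" "$HOME/reprocess"
```

Absolute paths for the mailboxes and the rcfile matter here, since the loop runs after the cd.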

3)Is there any mechanism in procmail for helping me keep only the most recent
(n*1000) emails in this file?

Nothing IN procmail. Even the msgid cache (a function of formail, which is an ancillary utility included in the procmail distro) doesn't define a NUMBER of messages, but rather the SIZE of the cache file. Message-IDs vary considerably in length, but commonly you'll see them from 30-60 characters. So, without any other knowledge of how formail actually manipulates the cache file, I'd say an 8KB cache file is probably good for storing 180 or so most recent message ids. That's a drop in the bucket if you're reprocessing a significant amount of email. Adjust your messageid cache size accordingly - bump it right through the roof if you want to best ensure you're not missing dupes (at least those which have identical message-ids).
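
The same back-of-the-envelope arithmetic can be turned around to pick a cache size. This is only a sketch: the helper names are made up, the 45 and 60 character averages are assumptions taken from the range above, and counting "From " lines will overcount if a message body contains an unescaped one.

```shell
# ids_in_cache CACHE_BYTES AVG_ID_LEN
# Rough number of message-ids a cache of that size can hold.
ids_in_cache() {
  echo $(( $1 / $2 ))
}

# count_messages FILE...
# Approximate message count: the number of "From " separator lines.
count_messages() {
  cat "$@" | grep -c '^From '
}

# cache_bytes MESSAGE_COUNT AVG_ID_LEN
# Rough cache size needed to remember every message-id.
cache_bytes() {
  echo $(( $1 * $2 ))
}

# Examples:
#   ids_in_cache 8192 45                          # 8KB cache, 45-char ids
#   cache_bytes "$(count_messages *.mbx)" 60      # value to try for -D
```

Rounding the result up generously costs only some disk space, so err on the high side.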

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
