At 14:58 2003-12-08 -0600, Chuck Campbell wrote:
1)My "ham" emails are in hundreds of separate mbox files. Can they simply be
"catted" together, or do I need to run procmail with a single recipe to file
them in a new location?
mbx files can be concatenated, but you need to ensure that there's a blank
line between the tail end of one message and the start of another.
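If you do want to concatenate, something along these lines should take care of the separator. This is only a sketch: cat_mboxes is a made-up helper name, and it assumes that an extra blank line at the end of a message body is harmless to whatever reads the result.

```shell
# Hypothetical helper: join mbox files, guaranteeing a blank line after each one.
cat_mboxes() {
    for f in "$@"; do
        cat "$f"
        # If the file does not end in a newline, terminate its last line first.
        # (Command substitution strips a trailing newline, so a non-empty
        # result means the last byte was something else.)
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n'
        fi
        # Then add the blank line that must precede the next "From " line.
        printf '\n'
    done
}
```

Usage would be something like: cat_mboxes *.mbx > all.mbx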
You could simply invoke a formail splitter on each mbx:
for x in *.mbx
do
    formail -s procmail -m reprocessor.rc < "$x"
done
Note that reprocessor.rc is assumed to contain the messageid cache code
you mention below; alternatively, you could put the cache options on the
formail invocation line, before the -s arg:
formail -D 1000000 reprocessor.cache -s procmail -m reprocessor.rc < "$x"
By invoking it on the separate mailboxes, you avoid issues with the need
for the separating lines between mailboxes if you were to concatenate
them. You also avoid needing the interim disk space required in order to
dump them all into one large mbx to THEN split for reprocessing.
2)Can procmail help me with weeding out duplicates in this file? I use it
for removing duplicates in my normal rc file, but my historical mail has
all been through this procmail recipe once before. I normally use this:
[snip]
The recipe in 'man procmailex' for eliminating duplicates based upon a
messageid cache should work, although if you're processing a large number
of messages all at once (versus messages arriving with some shred of
chronologic relationship), you probably want to bump the cache size way
up. Example - you process a thousand messages from one MBX covering a
couple of years on some discussion list, then go to another MBX which
contains directly-addressed messages for the same time period - but the
first mbx probably saturated the cache and caused it to eject old entries,
so many of the dupes won't be matched by the time you run through the
second mbx.
Will running my existing mbox folders through this again result in either
a)confusion for my regular mail?
Use a different msgid.cache file. If you don't specify a hard-coded path
for it in the rcfile, then it's going to be relative to $MAILDIR, so run
the process in a different directory if you're including the code, and the
cache file will show up there.
b)skipping any messages seen previously?
Yes, if the cache is too small for the number of messages you're going to
process. IMO, very likely if you're reprocessing very much email, and
ESPECIALLY so if they're from separate files - if they were all in one file
(written that way upon receipt, not simply concatenated into one), then
they could reasonably be expected to have some general chronological
ordering (allowing for some staggering due to mail delivery delays), so the
dupes would be expected to be near one another, and therefore their
individual messageids wouldn't expire from the cache before the second
message had been encountered.
Should I just change the msgid.cache to say msgid2.cache to avoid this issue?
That'd be one solution. But if you're going to manually reprocess your
mail, how about you do it from a directory OTHER than where your regular
mail gets stuffed?
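Roughly, that could look like the following. Treat it as a sketch: reprocess_dir is a made-up name, the 1000000 cache size is just the figure used earlier, and it assumes reprocessor.rc already sits in the scratch directory and that the source directory is given as an absolute path (since the function changes directory).

```shell
# Sketch: reprocess every mailbox from a scratch directory with its own cache.
# reprocess_dir and both directory arguments are hypothetical.
reprocess_dir() (
    src=$1      # directory holding the *.mbx files (absolute path assumed)
    work=$2     # scratch directory, NOT where your regular mail lands
    mkdir -p "$work" && cd "$work" || exit 1
    for x in "$src"/*.mbx; do
        [ -f "$x" ] || continue   # skip the literal pattern if nothing matched
        # -D: while splitting, formail drops messages whose Message-ID is
        # already in the cache; the cache file is created in the scratch dir.
        formail -D 1000000 reprocessor.cache -s procmail -m reprocessor.rc < "$x"
    done
)
```

The body runs in a subshell (the parentheses), so your own working directory is left alone, and the regular msgid.cache is never touched.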
3)Is there any mechanism in procmail for helping me keep only the most recent
(n*1000) emails in this file?
Nothing IN procmail. Even the msgid cache (a function of formail, an
ancillary utility included in the procmail distro) doesn't define a
NUMBER of messages, but rather the SIZE of the cache file. Message-IDs
vary considerably in length, but commonly you'll see them from 30-60
characters. So, without any other knowledge of how formail actually
manipulates the cache file, I'd say an 8KB cache file is probably good for
storing 180 or so most recent message ids. That's a drop in the bucket if
you're reprocessing a significant amount of email. Adjust your messageid
cache size accordingly - bump it right through the roof if you want to best
ensure you're not hitting dupes (at least those which have identical
message-ids).
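To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each cache entry costs about one Message-ID plus one separator byte, which is my guess rather than documented formail behaviour, and both function names are made up.

```shell
# Rough estimate only: assumes one Message-ID plus one separator byte per
# entry; formail's real bookkeeping may differ.
cache_capacity() {    # usage: cache_capacity CACHE_BYTES AVG_ID_LEN
    echo $(( $1 / ($2 + 1) ))
}

cache_bytes_for() {   # usage: cache_bytes_for MESSAGES AVG_ID_LEN
    echo $(( $1 * ($2 + 1) ))
}
```

With a 45-character average, cache_capacity 8192 45 gives 178, in line with the figure above. Going the other way, cache_bytes_for 50000 60 gives 3050000, so a few megabytes covers tens of thousands of messages even with long IDs.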
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail