procmail

Re: Filtering already saved messages

1998-05-19 07:25:16


On Tue, 19 May 1998, era eriksson wrote:

On Mon, 18 May 1998 10:20:33 -0400 (EDT), Matt Cortes
<link(_at_)alpha(_dot_)pulsar(_dot_)net> wrote:
 > One question I'm wondering about..  I've got some older backed-up mail too that I
 > might want to reprocess back into my main mail.  Now that I have procmail
 > that makes it pretty damn easy, but of course the older mail I have has
 > literally thousands of messages I already have in my main mail.  Is there
 > any way I can create a filter that would look at my old mail..  compare it
 > to my main mail and copy over only messages I don't already have in there?

Formail has an option to do duplicate exclusion. You need to get the
message-id:s of the messages you already have in your inbox into the
message-id cache file (and perhaps even include a recipe in your
regular .procmailrc to update the file each time a new message comes
in) and then in the procmailrc you use to split out the old messages,
skip anything that is already in the message-id cache.
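
To make the cache idea above concrete, here is a rough Python sketch of
the same bookkeeping. It is not how formail works internally and it is no
substitute for the formail -D recipe; the function names (message_id,
merge_unseen) are made up for illustration, and it assumes both folders
are plain mbox files that nothing else is writing to while it runs (it
does no locking).

```python
import mailbox


def message_id(msg):
    """Return the stripped Message-ID header, or None if it is missing."""
    value = msg.get("Message-ID")
    return str(value).strip() if value else None


def merge_unseen(main_path, old_path):
    """Copy messages from old_path into main_path unless their Message-ID
    is already known. Returns a (copied, skipped) pair of counts."""
    main = mailbox.mbox(main_path, create=False)
    old = mailbox.mbox(old_path, create=False)
    try:
        # Plays the role of the message-id cache: every ID already in the inbox.
        seen = {mid for mid in map(message_id, main) if mid}
        copied = skipped = 0
        for msg in old:
            mid = message_id(msg)
            if mid is not None and mid in seen:
                skipped += 1  # duplicate: leave it out of the main folder
                continue
            main.add(msg)
            if mid is not None:
                seen.add(mid)  # also catches duplicates inside the old folder
            copied += 1
        main.flush()
        return copied, skipped
    finally:
        old.close()
        main.close()
```

Called as merge_unseen("mbox", "oldmail") (hypothetical file names), it
appends only the unseen messages to the first folder and reports how many
were copied and how many were skipped. Messages without a Message-ID are
always copied, since there is nothing to compare.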

Reading the formail and procmailex manual pages left as an exercise :-)

Ok, at first this was over my head.  Then I checked out the man pages and
well..  it's not too bad now.  :)
Just to make sure I got this though, I'm hoping for a little more feedback
from ya..

The procmailex man page has the following recipe:

:0 Wh: msgid.lock
| formail -D 8192 msgid.cache

Let me know if I get everything I'm planning to do right or if I need to
change my thoughts any..

-Firstly I'm pretty damn sure this recipe needs to be the first entry in
the first rc file that procmail looks at so that it doesn't filter away
any messages before formail has a chance to look at them all.

-Each message that is looked at by formail is compared to the msgid.cache
file.  If the messageid for that message is not present in that file, the
message id is recorded into the msgid.cache file and then allowed to pass
on to the rest of the recipes in .procmailrc.  If the messageid is in the
msgid.cache file formail then considers that message a duplicate message
and then does what with it??  Sends it to /dev/null or bounces it back to
the sender or what?  Can I choose what I want it to do with the message?
For example..  if it is a duplicate but I still want to see them, can I
tell formail to pass all duplicates to a mailbox called Dupes?  I don't
know what the syntax would be for things like that.  Chances are I'm just
going to want formail to get rid of the message, but it's still nice to
know my options.

-I have literally thousands of messages..  is an 8k cache size enough?
Perhaps I should make it 1024000 (roughly 1 meg) in size just to be safe?
If the cache gets to be around 100k or so though..  how slow is the
processing going to be?  Especially since it's going to be
creating and deleting a lock file for what looks like every message..

-I also wonder if formail is going to freak out when I pump a few thousand
messages through that postprocess procmail script.  If all goes well it
will hopefully (at a decent processing speed) check for dupes, then filter
the messages to the right mailboxes for my main mbox.  I'm guessing
it'll account for mail that was sent back to me from lists as well.  Then
I send my old mbox through the process where it will find TONS of dupes
but also a handful of unique messages that need to get filtered in.

-What I'm left with is a mailbox with a bunch of really old email at the
end of it and so I would love to sort it all by date.
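
On the cache size question above, a back-of-the-envelope estimate may
help. The sketch below assumes each cache entry costs roughly the length
of the Message-ID plus one separator byte, and that a typical Message-ID
is about 40 characters; both numbers are guesses, not taken from the
formail documentation, and ids_that_fit and bytes_needed are names
invented here.

```python
def ids_that_fit(cache_bytes, avg_id_len=40):
    """Roughly how many Message-IDs a cache of cache_bytes can remember."""
    return cache_bytes // (avg_id_len + 1)


def bytes_needed(message_count, avg_id_len=40):
    """Roughly how large a cache must be to remember message_count IDs."""
    return message_count * (avg_id_len + 1)


print(ids_that_fit(8192))   # 199
print(bytes_needed(5000))   # 205000
```

Under those assumptions an 8192-byte cache remembers only a couple of
hundred messages, so a few thousand messages would want a cache in the
low hundreds of kilobytes.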

Ok so you see anything wrong in my thinking there?


 > Also on another note..  If I eventually do decide to process some old mail
 > into my main mailboxes in the future..  Is there a way I can re-sort my
 > mboxes so that the messages are in order from time received (last message
 > being most recent)?  I don't want to have a bunch of old mail at the
 > bottom of my mbox.  And the client sorting it temporarily during reading

This is not necessarily trivial. It's been discussed here on the list
from time to time but none of the solutions I've seen have struck me
as particularly elegant. 

mush is a command-line mail client which has the commands to sort a
folder in date order. Search the archives for "mush". (Wasn't there
something like this in mh, too?)

Ah ok cool.  If mush does the job I'm going for it.  How do you mean
though that it's not particularly elegant?  Just that it was designed just
for the task?  As long as it does the job perfectly that's fine with me.
But I'd hate to find it stripping out all the dates or turning all my mail
from unread to read or something.  :)
Speaking of which..  is there a program out there that can manipulate the
read/unread flag on emails in an mbox?

Also forgive me, but I've seen "mh" mentioned all over the place.  What is
"mh"?  I'm guessing it's a mailbox format standard or something.


You could use Procmail to split messages into date order somehow based
on the Date: header or the date in the From_ line. The problem with
Date: is that a lot of mail has invalid date stamps in this header,
and it's not necessarily always unambiguous. From_ will contain the
date of arrival on your machine (at least on the machines where I've
investigated this; I understand there is room for some variation here
too) which may or may not be perfect for your purposes.
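
As a rough illustration of the approach described above, the Python sketch
below orders a folder by the date in the From_ line and falls back to the
Date: header when the From_ line has no usable date. It is only a sketch
under assumptions: the From_ date is assumed to be in the usual
"Mon May 18 10:20:33 1998" form, it is treated as UTC purely for ordering,
the function names are invented here, and nothing is locked. It writes a
new file rather than touching the original.

```python
import calendar
import mailbox
import time
from email.utils import mktime_tz, parsedate_tz


def from_line_time(msg):
    """Epoch seconds from the From_ line, or None if it has no usable date."""
    fields = (msg.get_from() or "").split()
    try:
        stamp = time.strptime(" ".join(fields[-5:]), "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None
    return calendar.timegm(stamp)  # no zone in From_; treated as UTC here


def date_header_time(msg):
    """Epoch seconds from the Date: header, or None if it cannot be parsed."""
    value = msg.get("Date")
    if not value:
        return None
    try:
        parsed = parsedate_tz(str(value))
        return mktime_tz(parsed) if parsed else None
    except (ValueError, OverflowError):
        return None


def message_time(msg):
    """Prefer the arrival date in From_, then Date:; undated mail sorts first."""
    stamp = from_line_time(msg)
    if stamp is None:
        stamp = date_header_time(msg)
    return stamp if stamp is not None else 0


def sort_mbox_by_date(src_path, dst_path):
    """Write the messages of src_path to dst_path, oldest first."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)  # created if missing, appended to otherwise
    try:
        ordered = sorted(src, key=message_time)  # stable: ties keep their order
        for msg in ordered:
            dst.add(msg)
        dst.flush()
        return len(ordered)
    finally:
        src.close()
        dst.close()
```

Something like sort_mbox_by_date("mbox", "mbox.sorted") (hypothetical file
names) leaves the original folder alone, so the result can be checked
before it replaces anything. Swapping the two lookups in message_time
would sort by the Date: header first instead.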

Ugh.  Well my guess is that I'd have the same problem with mush too then,
wouldn't I?  As far as various date formats..  incorrect date stamps are
to be expected all the time sadly.


GNU date is pretty good at getting "21st March 1998 I think" into a
machine-readable date stamp (such as seconds since the epoch); other
implementations of date(1) might be less capable. There are some
standalone date processors out there which you could try as well. A
program called mdate was posted to this list a year or two ago. I have
a link to it on the Procmail Links page at

    <http://www.iki.fi/~era/procmail/links.html>

Ugh.  At this point I think I've chopped up my mail into enough pieces.
hehe.  :)


Thank you so much for all the time and help you've been putting into my
questions.  I really, really appreciate it.

-Matt