procmail

Re: duplicates and delivery problems

1997-09-05 21:16:30
On Fri, Sep 05, 1997 at 12:57:02PM -0700, Alan K. Stebbens wrote:

This is the reason why I rewrote "dupcheck.rc": to provide for a
complete duplicate filtering solution, without worrying about mail
loss. 

    > One thing I wish I knew how to do is detect messages where the
    > _bodies_ have duplicate content, but came through list servers
    > that changed the message ID and perhaps tack on a trailer.

The other reason why I rewrote "dupcheck.rc" is so that it can also do
body matching, using md5 checksums.  So now, it does both "formail -D"
and "md5sum" filtering.

Using md5sum on the body is not quite sufficient for the task of
recognizing mail which is *mostly* like another.  The body must be
*exactly* like another, or the match fails.  This is not sufficient for
recognizing duplicate mail which has been processed by a "helpful" mail
gateway, one of which, for example, shifts the entire body one space
right (for no good reason).

Alan,

I've been experimenting with dupcheck.rc for a little while tonight
and have some feedback.  First, note that the ".md5sums" file must
already be there for the md5sum filtering to work.  If it's not, you
will get the following error and the md5 checksum won't be calculated
or recorded.

fgrep: can't open .md5sums 

Touching the .md5sums file will get things working, but it might be
better to have procmail create the file initially if it doesn't yet
exist.  Second, am I just missing something big, or does your approach
of using md5sum on the message body for duplicate checking have some
problems?
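
On the first point, here is a minimal sketch of the kind of guard I
mean. It assumes the checksum file is the ".md5sums" named in the
error above; the helper name ensure_md5_cache is made up for
illustration and is not part of dupcheck.rc.

```shell
# Hypothetical helper: make sure the checksum file exists before
# fgrep tries to read it.  The path is passed as the first argument;
# the real location used by dupcheck.rc may differ.
ensure_md5_cache() {
    cache=$1
    # Create an empty file only if none exists; an existing file
    # and its recorded checksums are left untouched.
    [ -f "$cache" ] || : > "$cache"
}
```

Calling something like ensure_md5_cache "$HOME/.md5sums" before the
fgrep step should make the "can't open" error go away on a fresh
setup without clobbering checksums that are already recorded.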

I don't think that an identical message body necessarily constitutes a
"duplicate" message.  Semantically it does, sure, but not practically.
For example, say a chum nearby is in the habit of sending off a
message every Friday near the close of the work day reminding me to
join him at the neighborhood pub for a brew.  For the sake of
argument, say his MUA doesn't generate Message-IDs, which will cause
dupcheck.rc to run the body through md5sum.

So, the first Friday:

warhammer$ echo "time to meet at the pub" | mail bnorton(_at_)mastaler(_dot_)com

The next Friday:

warhammer$ echo "time to meet at the pub" | mail bnorton(_at_)mastaler(_dot_)com

etc., etc., etc.

If I'm using dupcheck.rc, I will only see the first Friday's message.

These messages have the same message body content and thus the same
md5 checksum, but I think it's erroneous to delete the latter messages
in the name of their being duplicates.  Wouldn't it be a better idea
to have the md5sum take into account something else (more unique) in
addition to the message body to make this more practical?
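
To make the idea concrete, here is a rough sketch that checksums the
Date: header line together with the body. Picking Date: is my
assumption for "something more unique"; the function names and the two
sample messages are invented for illustration, and none of this is
taken from dupcheck.rc itself.

```shell
# Print the body of a message on stdin: everything after the first
# blank line, which ends the header.
msg_body() {
    sed '1,/^$/d'
}

# Print the first Date: line found in the header of a message on stdin.
msg_date() {
    sed -n -e '/^$/q' -e '/^[Dd]ate:/{p;q;}'
}

# Checksum of the body alone, as in the scheme under discussion.
body_sum() {
    msg_body | md5sum | cut -d' ' -f1
}

# Checksum of the Date: line followed by the body (the variation).
dated_body_sum() {
    msg=$(cat)    # slurp the message so it can be read twice
    { printf '%s\n' "$msg" | msg_date
      printf '%s\n' "$msg" | msg_body; } | md5sum | cut -d' ' -f1
}

# Two made-up messages: same body, sent a week apart.
first='From: chum
Date: Fri, 5 Sep 1997 16:55:00 -0700

time to meet at the pub'

second='From: chum
Date: Fri, 12 Sep 1997 16:55:00 -0700

time to meet at the pub'

printf '%s\n' "$first"  | body_sum
printf '%s\n' "$second" | body_sum          # same as the line above
printf '%s\n' "$first"  | dated_body_sum
printf '%s\n' "$second" | dated_body_sum    # differs from the line above
```

With body-only checksums the second Friday's note is treated as a
duplicate; with the Date: line mixed in it is not. This only helps if
the list servers in question pass Date: through unchanged, which I
believe is usual but have not verified for every gateway, and a
message with no Date: line at all falls back to a body-only checksum.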

Third, if I set "DUPCHECK_USE_MD5=on" in my .procmailrc, dupcheck.rc
will correctly apply the "formail -D" filter if a Message-Id: exists.
But if I set "dupcheck_use_md5=on" (lowercase) instead, dupcheck.rc will
not initiate the "formail -D" filter, even on messages where
Message-Ids exist.  This might be an issue independent of your
program, but I am still curious.  Thanks in advance.

Burnt