procmail

RE: removing duplicates based upon an excerpt from the msg. body

2006-02-10 21:59:05

From: Parv
Sent: Friday, February 10, 2006 8:41 PM


You may get a false positive if you get 400 bytes of quoted text.



Yeah, and for that sin alone, they should be banished. <g>  More
seriously, it is true that I might throw out a valid, non-duplicate,
but these are low-priority messages to begin with.  Perhaps better
is to carve 200 bytes off the front of the message, and 200 off the
tail end.
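
To make the head-and-tail idea concrete, here is a rough Python sketch (not the procmail recipe under discussion). The function name body_excerpt and the whitespace-collapsing step are my own assumptions; the 200/200 split is the one suggested above.

```python
def body_excerpt(body: bytes, head: int = 200, tail: int = 200) -> bytes:
    """Return the first `head` and last `tail` bytes of a message body.

    Hypothetical helper: the name and the whitespace handling are
    assumptions, not part of the original recipe.
    """
    # Collapse runs of whitespace so that re-wrapped copies of the same
    # text still produce the same excerpt (an assumption on my part).
    normalized = b" ".join(body.split())
    # Short bodies are used whole, so head and tail never overlap.
    if len(normalized) <= head + tail:
        return normalized
    return normalized[:head] + normalized[-tail:]
```

A message that merely opens with the same 400 bytes of quoted text would then still need a matching tail to be treated as a duplicate, which should cut down on the false positives mentioned above.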


But that destroys the original Message-ID:, potentially breaking
threading if you *just* happen to reply to one of the pundits.


No.  The original message is delivered intact.  True, the duplicate
database has no real message ID in it, but I'm not using it to check
for duplicate message IDs.


Note: we limit the string length to 400 to step around potential
problems with LINEBUF, shell environment variable size limits and
so on.  It could likely be set to a somewhat larger value without
problems.

I recently had to remove multiplicate messages based on body, as I
stupidly reprocessed the same mbox more than once along with an option
to add|update a Message-ID: header.  I did that in Perl by comparing
the MD5 checksum of the message body ...

I like that idea.  Will change the script accordingly.
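
For the archives, a minimal sketch of the checksum approach, written in Python rather than the Perl mentioned above. The helper names body_digest and drop_duplicates are hypothetical, and treating everything after the first blank line as the body is an assumption that ignores MIME structure.

```python
import hashlib


def body_digest(raw_message: bytes) -> str:
    """Return the MD5 hex digest of everything after the first blank line."""
    # Normalize CRLF so the header/body separator is always "\n\n".
    normalized = raw_message.replace(b"\r\n", b"\n")
    _headers, _sep, body = normalized.partition(b"\n\n")
    return hashlib.md5(body).hexdigest()


def drop_duplicates(messages):
    """Keep the first message seen for each distinct body digest."""
    seen = set()
    unique = []
    for raw in messages:
        digest = body_digest(raw)
        if digest not in seen:
            seen.add(digest)
            unique.append(raw)
    return unique
```

Because only the body is hashed, copies that differ solely in an added or rewritten Message-ID: header compare equal. Hashing the whole body also avoids the 400-byte limit, since only the fixed-length digest needs to be stored.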


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail