procmail

RE: removing duplicates based upon an excerpt from the msg. body

2006-02-10 23:57:23


-----Original Message-----
From: Professional Software Engineering
Sent: Friday, February 10, 2006 9:39 PM


After getting the cksum, you could append the email address of the author,
which would form a more typical looking messageid while also clearly
indicating who that message was posted by.  With the shrinkage in id
length, your history will increase by five or ten fold with typical
addresses, and you can certainly match against a larger proportion of the
message (though with discussion lists, there's always the issue with
list-inserted footers, which will generate uniqueness).

Using a bit of both ideas, I've decided on this:

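:0:
* PUNDIT ?? ^^YES^^
{
  # Hash the first 400 lines of formail's output; keep the 32 hex digits.
  MD5=`formail -I '' | head -400 | md5sum | head -c 32`
  # Reduce the From: value to an address.
  FROM_ADDR=`echo "$REALLY_FROM" |
             sed -e 's/.*\(\<[A-Za-z_.+0-9]*@[A-Za-z0-9_.]*\).*/\1/'`

  # formail -D succeeds when the ID is already cached; deliver only new IDs.
  :0
  * !? echo "Message-ID: $MD5-$FROM_ADDR" | formail -D 5000 $HOME/.pundit.cache
  pundit-mail

  # Anything else is a duplicate: discard it.
  :0E
  /dev/null
}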

Where REALLY_FROM is set as follows:

:0
* ^From:.*\/[^  ].*
{ REALLY_FROM=$MATCH }
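
The ID construction can be tried outside procmail with a minimal shell sketch. The function names (msg_body, body_digest, from_addr, synthetic_id) and the example.com address are made up for illustration. Header stripping uses sed in place of formail -I '', and address extraction uses grep -o in place of the sed expression, so results may differ from the recipe on unusual From: lines. It assumes md5sum and a grep that supports -o.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the recipe's steps for testing by hand.

# Print the body of the message on stdin: everything after the first
# blank line. (Stand-in for the formail step in the recipe.)
msg_body() {
    sed '1,/^$/d'
}

# MD5 of at most the first 400 body lines, reduced to its 32 hex digits.
body_digest() {
    msg_body | head -n 400 | md5sum | cut -c 1-32
}

# First thing in the From: value ($1) that looks like an address.
from_addr() {
    printf '%s\n' "$1" |
        grep -Eo '[A-Za-z0-9_.+-]+@[A-Za-z0-9_.-]+' |
        head -n 1
}

# Build the synthetic header: message on stdin, From: value in $1.
synthetic_id() {
    printf 'Message-ID: %s-%s\n' "$(body_digest)" "$(from_addr "$1")"
}

# Example: a two-header message with a one-line body.
printf 'From: Jane Doe <jane@example.com>\nSubject: hi\n\nhello\n' |
    synthetic_id 'Jane Doe <jane@example.com>'
```

The example prints one line of the form "Message-ID: <32 hex digits>-jane@example.com", which is the kind of line the recipe hands to formail -D.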

Thinking things over, matching on the full message is likely to be
problematic because of the trailing adverts and list information.  Thus,
truncating to a reasonably short length is acceptable; letting the odd
short message through as a duplicate is not so bad.
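
The effect of the cut-off can be seen with a small shell sketch. The helper name digest_first_n, the sample messages and the cut-off of 2 lines are invented for the demonstration; header stripping uses sed rather than formail, and md5sum is assumed to be available.

```shell
#!/bin/sh
# digest_first_n N: MD5 of the first N body lines of the message on stdin.
# Hypothetical helper; the recipe itself fixes N at 400.
digest_first_n() {
    sed '1,/^$/d' | head -n "$1" | md5sum | cut -c 1-32
}

# The same post as two lists might relay it, each adding its own footer.
copy_a='From: a@example.com

line one
line two
list-one footer'
copy_b='From: a@example.com

line one
line two
list-two footer'

# Cut-off inside the shared text: the digests agree, so the second copy
# would be caught as a duplicate.
[ "$(printf '%s\n' "$copy_a" | digest_first_n 2)" = \
  "$(printf '%s\n' "$copy_b" | digest_first_n 2)" ] &&
    echo "N=2: same digest"

# Cut-off beyond the footer: the digests differ, so both copies would be
# delivered. This is the short-message case accepted above.
[ "$(printf '%s\n' "$copy_a" | digest_first_n 400)" != \
  "$(printf '%s\n' "$copy_b" | digest_first_n 400)" ] &&
    echo "N=400: different digest"
```

A message shorter than the cut-off is hashed together with its footer, so only messages longer than the cut-off gain anything from the truncation.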




____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail