procmail

removing duplicates based upon an excerpt from the msg. body

2006-02-10 20:03:31

I'm on a few mailing lists that are populated by certain
people, whom I'll call "pundits", who post the same message/article
to multiple lists (as separate, distinct messages).  I've grown tired
of reading their contributions in duplicate/triplicate, and frankly,
I want to relegate them to a separate, lower-priority folder
for less frequent review.  To do this, I came up with the
following recipe:

:0:
* PUNDIT ?? ^^YES^^
{

T400=`formail -I '' | tr -c '[:alpha:][:digit:]' '_' |
       tr -s '_' | head -c 400`
 
:0
* !? echo "Message-ID: $T400" | formail -D 40101 $HOME/.pundit.cache 
pundit-mail
 
:0E
/dev/null

}
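
The tr/head part of the T400 assignment above can be tried on its own
at a shell prompt.  This is only a sketch: the function name fingerprint
and the sample text are made up for illustration, and the formail -I ''
step is left out, so the input here is treated as if it were already the
bare message body.

```shell
#!/bin/sh
# Hypothetical helper: reduce the text on stdin to the kind of key the
# recipe stores in T400 (non-alphanumerics become '_', runs of '_' are
# squeezed to one, and the result is cut off at 400 bytes).
fingerprint() {
    tr -c '[:alpha:][:digit:]' '_' | tr -s '_' | head -c 400
}

# Two made-up copies of one article that differ only in punctuation
# and whitespace.
a=$(printf 'Hello, world!\nSecond line.\n' | fingerprint)
b=$(printf 'Hello  world\n\nSecond line\n' | fingerprint)

echo "$a"    # Hello_world_Second_line_
echo "$b"    # Hello_world_Second_line_
```

Both copies collapse to the same key, which is what lets the cache
lookup treat them as one message even when a list has reflowed or
re-punctuated the text.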

If the message has been determined to be from (or refers to) a "pundit",
then PUNDIT=YES.  In that event, we take roughly the first 400 characters
of the body of the message and deposit that into the variable $T400.  Note
that we convert every non-alphanumeric character to '_' and then squeeze
each run of '_' down to a single one.  The choice of remapping character is
unimportant.  Once we have a string that is representative of the
message, we prefix it with Message-ID: and feed that into formail -D
to see if we've seen this message prefix before.  If this is the
first occurrence, we deposit the message into pundit-mail; otherwise
it is ditched into /dev/null.  The size of the cache (40101) is tuned
to ensure that we cache at least the last 100 messages (proof left
to the reader).  Note: we limit the string length to 400 to step around
potential problems with LINEBUF, shell environment variable size limits
and so on.  It could likely be set to a somewhat larger value without
problems.
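
One way to arrive at the 40101 figure is sketched below.  It rests on
two assumptions about the cache layout that are not stated above: that
each cached key is stored followed by a single terminating NUL byte, and
that one further byte is taken up marking the end of the cache.  The
variable names are made up for illustration.

```shell
#!/bin/sh
key_len=400    # longest key that head -c 400 can produce
entries=100    # number of messages we want the cache to remember

# Assumption: each entry costs key_len bytes plus one NUL terminator,
# and one extra byte is needed for the end-of-cache marker.
cache_size=$(( entries * (key_len + 1) + 1 ))

echo "$cache_size"    # 40101
```

Shorter keys take less room, so under these assumptions 100 is a lower
bound on the number of messages remembered rather than an exact count.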

Comments? Suggested improvements?

