Re: removing duplicates based upon an excerpt from the msg. body

in message <JCEPIPKHCJGDMPOHDOIGAEOGDCAA(_dot_)gary(_at_)intrepid(_dot_)com>,
wrote Gary Funck thusly...


:0:
* PUNDIT ?? ^^YES^^
{

T400=`formail -I '' | tr -c '[:alpha:][:digit:]' '_' | tr -s '_' | head -c 400`

:0
* !? echo "Message-ID: $T400" | formail -D 40101 $HOME/.pundit.cache
pundit-mail

:0E
/dev/null

}

If the message has been determined to be from (or refers to) a
"pundit", then PUNDIT=YES.  In that event, we take roughly the
first 400 characters of the body of the message and deposit that
into the variable $T400.
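
To see what ends up in $T400, the pipeline can be tried by hand outside
procmail. This is only a sketch: body_key is a made-up name, and sed stands
in for `formail -I ''` on the assumption that the headers end at the first
blank line, so that it runs without procmail installed.

```shell
# Hypothetical helper: reduce a message on stdin to a short body fingerprint.
# Assumption: sed replaces `formail -I ''` here; it drops everything up to
# and including the first blank line.
body_key() {
    sed '1,/^$/d' |                     # strip the header block
    tr -c '[:alpha:][:digit:]' '_' |    # non-alphanumeric bytes become _
    tr -s '_' |                         # squeeze runs of _ into one
    head -c "${1:-400}"                 # keep at most N bytes (default 400)
}

# Prints: Hello_world_Bye_
printf 'From: a@example.com\nSubject: hi\n\nHello,  world!\nBye.\n' | body_key
echo
```

Whitespace and punctuation all collapse to a single underscore, so two
copies that differ only in line wrapping or spacing should yield the same key.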


You may get false positives if a message starts with 400 bytes of quoted text.

Once we have a string that is representative of the message, we
prefix it with Message-ID: and feed that into formail -D to see if
we've seen this message prefix before.  If this is the first
occurrence, we deposit the message into pundit-mail, otherwise it
is ditched into /dev/null.
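
The seen-it-before test can be imitated without formail for experimenting.
What follows is a rough stand-in, not formail's own logic: seen_before is a
hypothetical name, the cache is a plain text file with one key per line, and
unlike the 40101-byte cache given to formail -D it is never trimmed.

```shell
# Hypothetical stand-in for the formail -D check: returns 0 if the key was
# seen before, otherwise records it and returns 1.
# Assumption: the cache is a flat file that grows without bound.
seen_before() {
    key=$1
    cache=$2
    touch "$cache"
    # -F fixed string, -x whole line, -q quiet
    if grep -Fxq -e "$key" "$cache"; then
        return 0
    fi
    printf '%s\n' "$key" >> "$cache"
    return 1
}

cache=$(mktemp)
for k in Hello_world_ Other_text_ Hello_world_; do
    if seen_before "$k" "$cache"; then
        echo "duplicate: $k"
    else
        echo "new: $k"
    fi
done
rm -f "$cache"
```

The return values follow the same sense as the recipe's condition: success
means "already seen", which is why the recipe negates the test with `!?`.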


But that destroys the original Message-ID:, potentially breaking
threading if you *just* happen to reply to one of the pundits.

Note: we limit the string length to 400 to step around potential
problems with LINEBUF, shell environment variable size limits and
so on.  It could likely be set to a somewhat larger value without
problems.


I recently had to remove duplicate messages based on body, as I
stupidly reprocessed the same mbox more than once along w/ an option
to add|update a Message-ID: header.  I did that in Perl by comparing
the MD5 checksums of the message bodies ...

 Program:
 http://www103.pair.com/parv/comp/src/perl/undupe-mail-body

 Documentation:
 http://www103.pair.com/parv/comp/src/perl/pod/undupe-mail-body.pod
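
For anyone who only wants the idea, here is a minimal shell sketch of
comparing bodies by MD5.  It is not the Perl program above.  It assumes one
message per file, that the headers end at the first blank line, and that GNU
md5sum is available; dupe_bodies is a made-up name.

```shell
# Hypothetical helper: print the names of files whose body has the same
# MD5 checksum as the body of an earlier file in the argument list.
# Assumptions: one message per file, GNU md5sum, no newlines in file names.
dupe_bodies() {
    for f in "$@"; do
        # checksum of everything after the first blank line
        sum=$(sed '1,/^$/d' "$f" | md5sum | cut -d' ' -f1)
        printf '%s %s\n' "$sum" "$f"
    done |
    awk '{ s = $1; sub(/^[^ ]+ /, ""); if (seen[s]++) print }'
}

d=$(mktemp -d)
printf 'Message-ID: <1@example.com>\n\nsame body\n'  > "$d/a"
printf 'Message-ID: <2@example.com>\n\nsame body\n'  > "$d/b"
printf 'Message-ID: <3@example.com>\n\nother body\n' > "$d/c"
dupe_bodies "$d/a" "$d/b" "$d/c"    # lists only the second copy, $d/b
rm -rf "$d"
```

Since only the body is hashed, copies that differ in nothing but the
Message-ID: header still compare equal.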


So, all I can say is, where were you when I needed your work?  (:
Thanks for showing the way.


  - Parv

-- 


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail