
Re: removing duplicates based upon an excerpt from the msg. body

2006-02-10 22:01:50
in message <JCEPIPKHCJGDMPOHDOIGAEOGDCAA(_dot_)gary(_at_)intrepid(_dot_)com>,
wrote Gary Funck thusly...

* PUNDIT ?? ^^YES^^

T400=`formail -I '' | tr -c '[:alpha:][:digit:]' '_' | tr -s '_' | head -c 

* !? echo "Message-ID: $T400" | formail -D 40101 $HOME/.pundit.cache



If the message has been determined to be from (or refers to) a
"pundit", then PUNDIT=YES.  In that event, we take roughly the
first 400 characters of the body of the message and deposit that
into the variable $T400.
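
To see what such a key looks like outside of procmail, here is a minimal
sketch in plain shell.  It is not the recipe itself: body_key is a
hypothetical helper name, and it uses sed to drop the headers instead
of formail, on the assumption that the first blank line ends them.

```shell
# Hypothetical helper: reads one mail message on stdin and prints a key
# built from roughly the first 400 characters of its body.
body_key() {
    sed '1,/^$/d' |                     # drop headers, up to the first blank line
    tr -c '[:alpha:][:digit:]' '_' |    # every non-alphanumeric becomes "_"
    tr -s '_' |                         # squeeze runs of "_" into one
    head -c 400                         # cap the key length
}

key=$(printf 'Subject: hi\n\nHello, world!\nBye.\n' | body_key)
echo "$key"
```

Squeezing the underscores means that differences in whitespace or
punctuation runs do not change the key.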

You may get false positives if a message starts with 400 bytes of
quoted text.

Once we have a string that is representative of the message, we
prefix it with Message-ID: and feed that into formail -D to see if
we've seen this message prefix before.  If this is the first
occurrence, we deposit the message into pundit-mail, otherwise it
is ditched into /dev/null.
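
The seen-before test can be imitated with a plain text file, which may
help when experimenting without formail.  This is only a sketch:
seen_before is a hypothetical helper, and unlike the size-limited
formail -D cache, this file grows without bound.

```shell
# Hypothetical helper: seen_before KEY CACHEFILE
# Exits 0 if KEY is already in CACHEFILE; otherwise records it and exits 1.
seen_before() {
    touch "$2"                          # make sure the cache file exists
    if grep -qxF -- "$1" "$2"; then     # exact, whole-line, fixed-string match
        return 0
    fi
    printf '%s\n' "$1" >> "$2"          # first sighting: remember the key
    return 1
}

cache=$(mktemp)
for k in Hello_world_ Other_text_ Hello_world_; do
    if seen_before "$k" "$cache"; then
        echo "duplicate: $k"
    else
        echo "new: $k"
    fi
done
rm -f "$cache"
```

The third key repeats the first, so only that one is reported as a
duplicate.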

But that destroys the original Message-ID:, potentially breaking
threading if you *just* happen to reply to one of the pundits.

Note: we limit the string length to 400 to step around potential
problems with LINEBUF, shell environment variable size limits and
so on.  It could likely be set to a somewhat larger value without

I recently had to remove duplicate messages based on the body, as I
had stupidly reprocessed the same mbox more than once with an option
that adds or updates the Message-ID: header.  I did that in Perl by
comparing the MD5 checksums of the message bodies ...
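
The same idea in shell, for comparison (this is not the Perl script,
just a rough sketch of the approach).  body_md5 is a hypothetical
helper, and md5sum from GNU coreutils is assumed to be installed.

```shell
# Hypothetical helper: prints the MD5 hex digest of the body of the
# message read on stdin.  Assumes GNU coreutils md5sum is available.
body_md5() {
    sed '1,/^$/d' |     # drop headers, up to the first blank line
    md5sum |            # digest of the body only
    cut -d' ' -f1       # keep the hex digest, drop the file name column
}

a=$(printf 'Message-ID: <1@example>\n\nSame body\n' | body_md5)
b=$(printf 'Message-ID: <2@example>\n\nSame body\n' | body_md5)
if [ "$a" = "$b" ]; then
    echo "same body"
fi
```

Because the headers are dropped before hashing, two copies that differ
only in Message-ID: get the same digest.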



So, all I can say is: where were you when I needed your work?  (:
Thanks for showing the way.

  - Parv

