procmail

Re: splitting digests, catching duplicates

1996-12-05 17:16:41
      - create a new ID for each message within the digest (md5?)
which could be used to get rid of duplicates.


Is this possible?  Parts of it?  How?

It seems to me that the general problem is:

    How do I uniquely identify messages without a Message-Id: field? 

And, a related problem is:

    How do I tell if two messages are identical in the body, regardless
    of their Message-Id: and other headers?

Using MD5 or another checksumming tool on the body of the messages would
be a nice way to uniquely identify messages.  Then, if a message goes
through several mailers, filters, etc., its body will look the same,
even though its headers may look different.  Of course, this won't work
when a mailer adds gratuitous headers or trailers, or other such fluff,
to the body.
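
To make the idea concrete outside of procmail, here is a minimal shell
sketch.  It assumes a POSIX shell with sed, md5sum, and cut available;
the function name "body_sum" is made up for illustration.  It treats
everything up to the first blank line as the header and checksums only
what follows.

```shell
# body_sum: print the MD5 checksum of a message body read from stdin.
# The header is taken to be everything up to and including the first
# blank line; only the lines after it are checksummed.
body_sum() {
    sed '1,/^$/d' | md5sum | cut -d' ' -f1
}
```

Running "body_sum < some-message" on two copies of a message that took
different routes should print the same value, as long as nothing along
the way touched the body.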

Here's how it might work:

    # If the current message does not have a Message-Id: field, or
    # if the variable "checksum_all_msgs" is set (non-empty), then feed
    # the body of the message through md5sum and save the result in a
    # new header, "X-Checksum: ".  The first recipe matches, and does
    # nothing, only when there is a Message-Id: and the variable is empty.
    :0
    * ^Message-Id:
    * ! checksum_all_msgs ?? .
    { }
    :0 E                # otherwise, derive a checksum
    {
        :0 b            # scan only the body
        SUM=| md5sum
        :0 fhw          # add the new header
        | formail -I"X-Checksum: $SUM"
    }

    # Remove duplicate messages
    OLDCOMSAT=${COMSAT:-off}    # Don't tell COMSAT anything
    COMSAT=off
    :0 Wh: .msgid.lock          # is there a Message-Id:?
    * ^Message-Id:
    | formail -D 16384 .msgid.cache

    # Now, if there is an "X-Checksum:" field, and we've already seen
    # this message, toss it.
    :0
    * ^X-Checksum: *\/[^ ].*
    {   SUM=$MATCH                      # save the current checksum
        LOCKFILE=.checksums.lock        # single access to this file
        # W: wait for fgrep's exit code, so that only a successful
        # match counts as delivery; i: fgrep never reads the pipe
        :0 Whi
        | fgrep -qs "$SUM" .checksums   # if fgrep succeeds, we've
                                        # tossed the mail
        JUNK=`echo "$SUM" >>.checksums` # add the new checksum
        LOCKFILE                        # done with the file
    }
    COMSAT=$OLDCOMSAT
    OLDCOMSAT

This is untested, but it should work without too much further effort.
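
The look-up-and-record step on the checksum file can also be tried by
hand.  Here is a rough sketch in plain shell, assuming a POSIX grep; the
function name "check_and_record" is hypothetical, and there is no
locking here (in the recipe, LOCKFILE takes care of that).

```shell
# check_and_record SUM FILE
# Exit status 0 if SUM is already listed in FILE (a duplicate);
# otherwise append SUM to FILE and return 1.
check_and_record() {
    # -q: no output, -s: no complaint if FILE is missing,
    # -x: match whole lines only, -F: treat SUM as a fixed string
    if grep -qsxF -- "$1" "$2"; then
        return 0
    fi
    printf '%s\n' "$1" >> "$2"
    return 1
}
```

The first call with a given checksum returns 1 and records it; any
later call with the same checksum returns 0, which is the case where
the recipe would toss the mail.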

BTW, "md5sum" is part of the GNU text utilities.

G'luck.
___________________________________________________________
Alan Stebbens <aks(_at_)sgi(_dot_)com>      http://reality.sgi.com/aks