procmail

Re: splitting digests, catching duplicates

1996-12-05 18:50:17
Uh, I forgot to use the proper flags on an important part of
the recipe, and I forgot to cycle the checksum cache file.  
Look for the lines with "-->".

    >     How do I uniquely identify messages without a Message-Id: field? 
    > 
    >     How do I tell if two messages are identical in the body, regardless
    >     of their Message-Id: and other headers?
    > 
    > Using an MD5 or other checksumming tool on the body of the messages would
    > be a nice way to uniquely identify messages.  Then, if a message goes
    > through several mailers, filters, etc., its body will look the same,
    > even though its headers may look different.  Of course, this won't catch
    > mailers which throw in gratuitous headers or trailers, or other such
    > fluff. 
    > 
    > Here's how it might work:
    > 
    >     # If the current message does not have a Message-Id: field, or
    >     # if the variable "checksum_all_msgs" is set, then feed the body of
    >     # the message through to md5sum and save the results in a new
    >     # header, "X-Checksum: "
    >     :0 
    >     * ^Message-Id:
    >     * $${checksum_all_msgs:-2}^0 .
    >     * 1^0 .
    >     { }
    >     :0 E              # derive a checksum
    >     {
    >       :0b             # scan only the body
    >       SUM=|md5sum
    >       :0fh            # add the new header
    >       |formail -I"X-Checksum: $SUM"
    >     }
    > 
    >     # Remove duplicate messages
    >     OLDCOMSAT=${COMSAT:-off}  # Don't tell COMSAT anything
    >     COMSAT=off
    >     :0 Wh: .msgid.lock                # is there a Message-Id:?
    >     * ^Message-Id:
    >     | formail -D 16384 .msgid.cache
    > 
    >     # Now, if there is a "X-Checksum:" field, and we've already seen
    >     # this message, toss it.
    >     :0
    >     * ^X-Checksum: *\/[^ ].*
    >     { SUM=$MATCH                      # save the current checksum
    >       LOCKFILE=.checksums.lock        # single access to this file
--> >       :0
--> >       |fgrep -s "$SUM" .checksums     # if fgrep succeeds, we've

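These should be written as follows, so that procmail waits for fgrep's
exit status (W), feeds it only the header (h), and ignores the write
error if fgrep exits without reading its input (i):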

            :0Whi
            |fgrep -s "$SUM" .checksums     # if fgrep succeeds, we've
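
The point of the W flag, as I understand it, is that a zero exit from
fgrep counts as a successful delivery, so the duplicate gets dropped.
Here is a minimal sketch of the same test as a plain shell function.
The name seen_sum and the optional cache argument are made up for
illustration; grep -qxF stands in for fgrep -s and matches whole lines
only.

```shell
# seen_sum SUM [CACHE]: exit 0 if SUM is already a line in CACHE.
# Hypothetical helper, not part of procmail or formail.
seen_sum() {
  [ -f "${2:-.checksums}" ] || return 1     # no cache yet: nothing seen
  grep -qxF -- "$1" "${2:-.checksums}"      # -x: whole line, -F: fixed string
}
```

Something like "seen_sum "$SUM" && echo duplicate" would exercise it
from the command line.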

I also neglected to show how the file should be cycled to avoid caching
too many checksums.  So, the following line:

--> >       JUNK=`echo "$SUM" >>.checksums` # add the new checksum

should be rewritten as:

            JUNK=`(tail -n 8000 .checksums; echo "$SUM") >.checksums.new &&
                  mv .checksums.new .checksums`

which will keep about 8000 of the most recent checksums (the last 8000
plus the new one).  If you receive 1000 unique pieces of mail a day,
this is 8 days' worth of checksums.  You could safely make this number
smaller: the worst case would be an undetected duplicate email message.
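
To try the cycling outside of procmail, the same idea can be sketched
as a small shell function.  The name remember_sum and its arguments are
hypothetical, and unlike the one-liner above it keeps exactly KEEP
lines, counting the new one.

```shell
# remember_sum SUM [KEEP] [CACHE]: append SUM to CACHE and trim CACHE
# to its newest KEEP lines.  Hypothetical helper for illustration.
remember_sum() {
  sum=$1
  keep=${2:-8000}
  cache=${3:-.checksums}
  touch "$cache"                            # first run: start with an empty cache
  # Write the trimmed copy first; replace the cache only if that worked.
  { tail -n "$((keep - 1))" "$cache"; echo "$sum"; } >"$cache.new" &&
    mv "$cache.new" "$cache"
}
```

The caller is still expected to hold the lock (LOCKFILE in the recipe)
around this, since the function does no locking of its own.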

    >       LOCKFILE                        # done with the file
    >     }
    >     COMSAT=$OLDCOMSAT
    >     OLDCOMSAT
    > 
    > This is untested, but it should work without too much further effort.
    > 
    > BTW, "md5sum" is part of the GNU text utilities.
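
A quick way to check the body-checksum idea by hand, assuming each
message is saved in its own file: strip everything up to the first
blank line and checksum the rest.  The function name body_sum is made
up, and sed is used here instead of the recipe's "b" flag.

```shell
# body_sum FILE: print the MD5 of the message body, i.e. everything
# after the first blank line.  Hypothetical helper for illustration.
body_sum() {
  sed '1,/^$/d' "$1" | md5sum | cut -d' ' -f1
}
```

Two copies of a message that differ only in their headers should then
print the same checksum.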

___________________________________________________________
Alan Stebbens <aks(_at_)sgi(_dot_)com>      http://reality.sgi.com/aks