Uh, I forgot to use the proper flags on an important part of
the recipe, and I forgot to cycle the checksum cache file.
Look for the lines with "-->".
> How do I uniquely identify messages without a Message-Id: field?
>
> How do I tell if two messages are identical in the body, regardless
> of their Message-Id: and other headers.
>
> Using an MD5 or other checksumming tool on the body of the messages would
> be a nice way to uniquely identify messages. Then, if a message goes
> through several mailers, filters, etc., its body will look the same,
> even though its headers may look different. Of course, this won't catch
> mailers which throw in gratuitous headers or trailers, or other such
> fluff.
>
> Here's how it might work:
>
> # If the current message does not have a Message-Id: field, or
> # if the variable "checksum_all_msgs" is set, then feed the body of
> # the message through to md5sum and save the results in a new
> # header, "X-Checksum: "
> :0
> * ^Message-Id:
> * $${checksum_all_msgs:-2}^0 .
> * 1^0 .
> { }
> :0 E # derive a checksum
> {
> :0b # scan only the body
> SUM=|md5sum
> :0fh # add the new header
> |formail -I"X-Checksum: $SUM"
> }
>
> # Remove duplicate messages
> OLDCOMSAT=${COMSAT:-off} # Don't tell COMSAT anything
> COMSAT=off
> :0 Wh: .msgid.lock # is there a Message-Id:?
> * ^Message-Id:
> | formail -D 16384 .msgid.cache
>
> # Now, if there is a "X-Checksum:" field, and we've already seen
> # this message, toss it.
> :0
> * ^X-Checksum: *\/[^ ].*
> { SUM=$MATCH # save the current checksum
> LOCKFILE=.checksums.lock # single access to this file
--> > :0
--> > |fgrep -s "$SUM" .checksums # if fgrep succeeds, we've
Should be written as:
:0Whi
|fgrep -s "$SUM" .checksums # if fgrep succeeds, we've
I also neglected to show how the file should be cycled to avoid caching
too many checksums. So, the following line:
--> > JUNK=`echo "$SUM" >>.checksums` # add the new checksum
should be rewritten as:
JUNK=`(tail -8000 .checksums; echo "$SUM") >.checksums.new;
mv .checksums.new .checksums`
which will keep roughly the 8000 most recent checksums (the last 8000
plus the new one). If you receive 1000 unique pieces of mail a day, this
is 8 days' worth of checksums. You could safely make this number
smaller: the worst case would be an undetected duplicate email message.
> LOCKFILE # done with the file
> }
> COMSAT=$OLDCOMSAT
> OLDCOMSAT
>
> This is untested, but it should work without too much further effort.
>
> BTW, "md5sum" is part of the GNU text utilities.
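
For anyone who wants to try the checksum-and-cycle idea outside of
procmail first, here is a minimal shell sketch. It is not part of the
recipe above: the function names body_checksum and remember_checksum are
made up for illustration, the body is assumed to be everything after the
first blank line, and md5sum is assumed to be on the PATH. Unlike the
recipe, it matches whole lines in the cache and trims to exactly the
requested number of entries.

```shell
#!/bin/sh
# Sketch only: body_checksum and remember_checksum are hypothetical
# helper names, not part of procmail or formail.

# Print the MD5 of a message body, i.e. everything after the first
# blank line of the file named in $1. Headers are ignored.
body_checksum() {
    sed '1,/^$/d' "$1" | md5sum | cut -d' ' -f1
}

# remember_checksum SUM CACHE [KEEP]
# Return 0 if SUM is already in CACHE (a duplicate). Otherwise append
# it, keep only the newest KEEP entries (default 8000), and return 1.
remember_checksum() {
    sum=$1 cache=$2 keep=${3:-8000}
    touch "$cache"
    if grep -qxF "$sum" "$cache"; then
        return 0
    fi
    # Keep KEEP-1 old entries so the file never exceeds KEEP lines,
    # then replace the cache by renaming the new file over it.
    { tail -n "$((keep - 1))" "$cache"; echo "$sum"; } >"$cache.new" &&
        mv "$cache.new" "$cache"
    return 1
}
```

A caller could run something like
    remember_checksum "$(body_checksum msg)" .checksums && echo duplicate
where "msg" is a file holding one message. The sketch does no locking of
its own, so it would still need something like the LOCKFILE used in the
recipe if several deliveries can run at once.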
___________________________________________________________
Alan Stebbens <aks(_at_)sgi(_dot_)com> http://reality.sgi.com/aks