- create a new ID for each message within the digest (md5?)
which could be used to get rid of duplicates.
Is this possible? Parts of it? How?
It seems to me that the general problem is:
How do I uniquely identify messages without a Message-Id: field?
And, a related problem is:
How do I tell if two messages are identical in the body, regardless
of their Message-Id: and other headers?
Running MD5 or another checksumming tool over the body of each message
would be a nice way to uniquely identify it. Then, if a message goes
through several mailers, filters, etc., its body will checksum the same,
even though its headers may look different. Of course, this won't catch
copies whose body a mailer has altered with gratuitous headers or
trailers, or other such fluff.
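To see the idea outside of procmail, here is a minimal shell sketch. It assumes a POSIX sh with sed, cut and md5sum on the PATH; "body_sum" and the two sample messages are made up for illustration and are not part of any mailer.

```shell
#!/bin/sh
# Sketch only: body_sum is a hypothetical helper, not part of procmail.
# It prints the MD5 of the body of a message read from stdin.
body_sum() {
    # Delete everything up to and including the first blank line (the
    # header block), then keep only the hash field of md5sum's output.
    sed '1,/^$/d' | md5sum | cut -d' ' -f1
}

# Two copies of one message that travelled different routes
# (made-up sample data).
msg_a='Message-Id: <1@example.com>
Received: from hub-a.example.com
Subject: test

Hello, world.'

msg_b='Received: from hub-b.example.com
Subject: test

Hello, world.'

sum_a=`printf '%s\n' "$msg_a" | body_sum`
sum_b=`printf '%s\n' "$msg_b" | body_sum`

if [ "$sum_a" = "$sum_b" ]; then
    echo "same body"
else
    echo "different body"
fi
```

The two samples differ only in their headers, so this prints "same body". Change one character in either body and the sums no longer match.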
Here's how it might work:
# If the current message does not have a Message-Id: field, or
# if the variable "checksum_all_msgs" is set, then feed the body of
# the message through to md5sum and save the results in a new
# header, "X-Checksum: "
:0
* ^Message-Id:
* $${checksum_all_msgs:-2}^0 .
* 1^0 .
{ }
:0 E                            # derive a checksum
{
  :0b                           # scan only the body
  SUM=|md5sum

  :0fh                          # add the new header
  |formail -I"X-Checksum: $SUM"
}
# Remove duplicate messages
OLDCOMSAT=${COMSAT:-off} # Don't tell COMSAT anything
COMSAT=off
:0 Wh: .msgid.lock # is there a Message-Id:?
* ^Message-Id:
| formail -D 16384 .msgid.cache
# Now, if there is a "X-Checksum:" field, and we've already seen
# this message, toss it.
:0
* ^X-Checksum: *\/[^ ].*
{
  SUM=$MATCH                    # save the current checksum
  LOCKFILE=.checksums.lock      # single access to this file

  :0 Whi                        # W: heed fgrep's exit code; i: it won't read stdin
  |fgrep -s "$SUM" .checksums   # if fgrep succeeds, we've
                                # tossed the mail
  JUNK=`echo "$SUM" >>.checksums` # add the new checksum
  LOCKFILE                      # done with the file
}
COMSAT=$OLDCOMSAT
OLDCOMSAT
This is untested, but it should work without too much further effort.
BTW, "md5sum" is part of the GNU text utilities.
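If you want to try the "have we seen this checksum?" logic by hand before trusting it to procmail, something like the following shell sketch should do. "seen_before" and the temporary cache path are hypothetical names for illustration, and there is no locking here; in the recipe that job is left to LOCKFILE.

```shell
#!/bin/sh
# Sketch only: seen_before and the cache path are hypothetical.
# Succeeds (exit 0) if checksum $1 is already listed in cache file $2;
# otherwise appends it to the cache and fails (exit 1).
seen_before() {
    _sum=$1
    _cache=$2
    # -x whole line, -F fixed string, -q quiet; -s hides the error
    # grep would print when the cache file does not exist yet.
    if grep -qsxF "$_sum" "$_cache"; then
        return 0
    fi
    printf '%s\n' "$_sum" >>"$_cache"
    return 1
}

cache=${TMPDIR:-/tmp}/checksums.$$
rm -f "$cache"

# Feed the same checksum through twice.
for try in 1 2; do
    if seen_before d41d8cd98f00b204e9800998ecf8427e "$cache"; then
        echo duplicate
    else
        echo new
    fi
done

rm -f "$cache"
```

This prints "new" for the first pass and "duplicate" for the second. Matching the whole line (-x) avoids a false hit when one stored entry merely contains another as a substring.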
G'luck.
___________________________________________________________
Alan Stebbens <aks(_at_)sgi(_dot_)com> http://reality.sgi.com/aks