procmail

Re: Alleviating Duplicates

1999-05-08 00:24:41
Professional Software Engineering wrote:

At 18:38 1999-05-07 -0500, David W. Tamkin wrote:

[snip]

There has been a little talk here in the past about using checksum programs
on the bodies of incoming email and keeping caches of the checksums.  The
shortcoming there is that a trivial change to the body can affect the
checksum and that potentially two very different messages can generate the
same checksum.
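
The collision concern can be seen with a small sketch. It assumes the
"simple checksum" is a plain byte sum modulo 65536 (the thread does not say
which checksum was meant), and the helper names are made up for
illustration. A CRC-32 from Python's zlib module is shown for comparison.

```python
import zlib

def simple_checksum(text):
    """Hypothetical 'simple checksum': sum of the bytes, modulo 65536."""
    return sum(text.encode("utf-8")) % 65536

def crc32_of(text):
    """CRC-32 of the UTF-8 bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

# Two different bodies that contain the same bytes in a different order.
body_a = "Meet at noon"
body_b = "Meet at onon"

# A byte sum ignores ordering, so these two bodies collide.
print(simple_checksum(body_a) == simple_checksum(body_b))

# The CRC depends on byte order, so the same pair is told apart.
print(crc32_of(body_a) == crc32_of(body_b))
```

A CRC can still collide on unrelated inputs; it is simply far less likely to
do so on small rearrangements like this one.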

There is another mechanism:  CRCs.  Takes a little more horsepower to
generate (not as trivial as a simple checksum, but hardly a CPU hog
either).  Take the CRC of the body and store it into one field, the length
of the body into another field, and the CRC of the subject and store it
into yet another.  CRCs of other select header elements could be used as
well (such as From, and of course, the Message-ID).
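
A minimal sketch of the three-field signature described above, assuming
single-part plain-text messages and an in-memory cache. The function names
and the `seen` set are hypothetical, not part of procmail or any existing
tool.

```python
import zlib
from email import message_from_string

def crc32_of(text):
    """CRC-32 of the UTF-8 bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

def message_signature(raw_message):
    """Return (body CRC, body length, subject CRC).

    Assumes a single-part message, for which get_payload() returns a str.
    """
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    subject = msg.get("Subject", "")
    return (crc32_of(body), len(body), crc32_of(subject))

# Hypothetical cache of signatures already seen; a real setup would
# persist this between deliveries.
seen = set()

def is_duplicate(raw_message):
    """True if a message with the same signature was seen before."""
    sig = message_signature(raw_message)
    if sig in seen:
        return True
    seen.add(sig)
    return False
```

Because From is not part of this signature, the same text sent from two
addresses counts as a duplicate. Adding a CRC of From or Message-ID to the
tuple would narrow the match.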

There is no real escaping the fact that a minor change to the body can
affect a signature.  One _possible_ mechanism to reduce the chances that a
simple reformat of the text would cause a mismatch would be to have the
signature generator IGNORE whitespace (tabs, spaces, newlines), and
probably quotation markers as well (although attribution headers would pose
a unique challenge) - thus forwarded copies without additional commentary
_could_ be classified as duplicate messages if there is nothing in them to
truly differentiate the "beef" of the message from another copy you've
already received.
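
One way the whitespace-ignoring signature might look, as a sketch only. It
assumes quotation markers are leading ">" or "|" characters, and it makes no
attempt to handle attribution lines, which the paragraph above flags as a
separate problem. The names are illustrative.

```python
import re
import zlib

# Assumed quoting style: any run of ">" or "|" markers at the start of a line.
QUOTE_PREFIX = re.compile(r"^[ \t]*(?:[>|][ \t]*)+")

def normalize_body(body):
    """Drop leading quote markers, then remove all whitespace."""
    unquoted = [QUOTE_PREFIX.sub("", line) for line in body.splitlines()]
    return re.sub(r"\s+", "", "".join(unquoted))

def normalized_crc(body):
    """CRC-32 of the normalized body, as an unsigned 32-bit integer."""
    return zlib.crc32(normalize_body(body).encode("utf-8")) & 0xFFFFFFFF

original = "The quick brown fox\njumps over the lazy dog.\n"
reflowed = "> The quick brown\n> fox jumps over\n>   the lazy dog.\n"
revised = "The quick brown fox\njumps over the lazy cat.\n"

# A quoted, re-wrapped copy has the same signature as the original.
print(normalized_crc(original) == normalized_crc(reflowed))

# A changed word still produces a different signature.
print(normalized_crc(original) == normalized_crc(revised))
```

A side effect of this normalization is that a line of original text that
happens to begin with ">" loses that character too.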

I think a CRC would be an excellent idea. It would be very accurate if
whitespace is ignored, and there would be little need to look beyond the
message body and a couple of header lines.

This would also catch duplicates from someone with too much time on their
hands, an elbow resting on the return key, or a need to send the same message
from another address. It would not catch a message in which even one letter
has changed, so revisions would still get through instead of being
discarded.

With whitespace ignored, the chance of two people writing the same letter
character for character is too small to matter. A CRC would make detecting
duplicate messages very nearly an exact science.

*********************************************************************
Signed,
Daniel D. Munden

