Re: Alleviating Duplicates

On Fri, 07 May 1999 17:17:47 -0700, PSE-L(_at_)mail(_dot_)professional(_dot_)org
(Professional Software Engineering) wrote:

At 18:38 1999-05-07 -0500, David W. Tamkin wrote:

There has been a little talk here in the past about using checksum programs
on the bodies of incoming email and keeping caches of the checksums.  The
shortcoming there is that a trivial change to the body can affect the
checksum and that potentially two very different messages can generate the
same checksum.

<...>

There is no real escaping the fact that a minor change to the body can
affect a signature.  One _possible_ mechanism to reduce the chances that a
simple reformat of the text would cause a mismatch would be to have the
signature generator IGNORE whitespace (tabs, spaces, newlines), and
probably quotation markers as well (although attribution headers would pose
a unique challenge) - thus forwarded copies without additional commentary
_could_ be classified as duplicate messages if there is nothing in them to
truly differentiate the "beef" of the message from another copy you've
already received.


All of this could actually be nicely generalized, in theory: The
solution amounts to finding a "canonical" format which neutralizes all
the kinds of changes that you want to disregard (in principle, all
changes people might make before they resend a message, but we're
already up in the higher spheres) and converting all messages into
this format before comparing them. Like Liviu points out, anything
like this is probably too heavy to be done on a routine basis, on
today's hardware, but it could be an interesting topic for a master's
thesis or something.
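
A minimal sketch of the canonical-format idea, restricted to the two
changes mentioned in the quoted text (whitespace and quotation markers).
The function name `canonicalize` and the choice of ">" and "|" as quote
markers are assumptions made for illustration, and attribution lines
are not handled at all:

```python
import re

# Assumption: quoted lines start with one or more ">" or "|" markers,
# optionally separated by whitespace (e.g. "> ", ">> ", "> | ").
_QUOTE_PREFIX = re.compile(r"^(?:\s*[>|])+\s?")


def canonicalize(body: str) -> str:
    """Return the body with quote markers and all whitespace removed."""
    unquoted = [_QUOTE_PREFIX.sub("", line) for line in body.splitlines()]
    # Joining the lines and then splitting on whitespace discards tabs,
    # spaces and line breaks, so rewrapped text compares equal.
    return "".join("".join(unquoted).split())


original = "Buy now!\nLimited   offer.\n"
forwarded = "> Buy now! Limited\n> offer.\n"
same = canonicalize(original) == canonicalize(forwarded)
```

Here `same` is true even though the forwarded copy was quoted and
rewrapped. Any further kind of change you want to disregard would need
its own normalization step before the comparison.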

Some possible ideas -- I'm not claiming any of these are mine, or
terribly original:

  * disregard everything except graphical words +not+ recognized by a
    database of known words (in essence, a spelling checker which also
    accepts common "variant" spellings such as my old favorites
    "definately", "seperate", and "grammer". Oh, and "varient"). You
    could also go so far as to reduce all words to their Soundex
    "fingerprints" and compare those instead of the words themselves.
    (This has the semi-obvious drawback that Soundex is specific to
    English, and that you'd probably have to find something else for
    messages in other languages, if you even can figure out what
    language they're in.)

    [Then there's the issue of whether you should be capable of
    recognizing that a URL pointing to a page at www.helsinki.fi is
    in fact the same as a URL pointing to brakteaatti.helsinki.fi if
    the rest of the URL is identical, and so forth for IP numbers for
    this host in a variety of formats. This is relatively esoteric at
    first glance, but might be important for identifying effectively
    identical material in e.g. spam and other material which is
    specifically designed -- to some extent -- to bypass conventional
    duplicate detection. Off the far end, you could be up against
    comparing the same message in two different languages. :-]

  * divide messages into "zones" and regard the whole message as a
    duplicate if you find duplicates of at least a few of them (where
    "a few" could be as low as one, depending on how you do the
    zoning, how you possibly canonicalize the messages before dividing
    them, etc)

  * give up on the messages themselves and extract various pieces of
    "signature" data which would include the Message-Id but also
    (various aspects of) "Sender", "Recipient", "Subject", and
    "References" (not the headers, more like what you intuitively
    understand by these concepts, keeping in mind that e.g. a
    forwarded message could have many different kinds of "senders").
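
The "zones" idea could be sketched roughly as follows, assuming that a
zone is simply a blank-line-separated paragraph and that whitespace
inside a zone is normalized before digesting it. The names
`zone_digests` and `looks_like_duplicate`, the `min_shared` parameter,
and the use of SHA-256 are choices made for this illustration only:

```python
import hashlib


def zone_digests(body: str) -> set[str]:
    """Digest each blank-line-separated zone of a message body."""
    digests: set[str] = set()
    words: list[str] = []
    # The trailing empty string acts as a sentinel that flushes the
    # last zone even when the body does not end with a blank line.
    for line in body.splitlines() + [""]:
        if line.strip():
            words.extend(line.split())
        elif words:
            zone = " ".join(words).encode("utf-8")
            digests.add(hashlib.sha256(zone).hexdigest())
            words = []
    return digests


def looks_like_duplicate(a: str, b: str, min_shared: int = 1) -> bool:
    """Treat two bodies as duplicates if enough zones coincide."""
    return len(zone_digests(a) & zone_digests(b)) >= min_shared


first = "Hello all,\n\nBuy our product today.\nIt is great.\n\nRegards, Bob\n"
second = "FYI, see below.\n\nBuy our product  today. It is great.\n\n-- \nAlice\n"
shared_one = looks_like_duplicate(first, second)
```

The two sample bodies share exactly one zone, so they match with the
default threshold but not with `min_shared=2`. How low "a few" can
safely go depends on how the zoning is done, as noted above.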

A lot of effort has been put into different "message digest"
algorithms which provide a "fingerprint" for each message. These are
easier to compare than entire messages, and of course you can do all
the preprocessing you want before passing the message to the digest
algorithm. This makes the whole procedure a lot more convenient from a
computational point of view, but there is always the theoretical
possibility that two different messages could produce the same
"fingerprint" (be it a CRC or an MD5 checksum or whatever). And so you
might still want to keep the messages in some sort of database so you
can verify whether you indeed have a true match when the fingerprint
says you do.
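
A small sketch of that last point: a cache keyed on the fingerprint
which still stores the bodies, so a fingerprint hit can be confirmed
against the real text. The class `DuplicateCache` and its `digest`
parameter are hypothetical; SHA-256 is used here as the default, and a
CRC or MD5 function could be passed in the same way:

```python
import hashlib
from typing import Callable


def sha256_hex(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class DuplicateCache:
    """In-memory cache: fingerprint -> bodies seen with that fingerprint."""

    def __init__(self, digest: Callable[[str], str] = sha256_hex) -> None:
        self._digest = digest
        self._seen: dict[str, list[str]] = {}

    def is_duplicate(self, body: str) -> bool:
        candidates = self._seen.setdefault(self._digest(body), [])
        # A matching fingerprint is only a hint; compare the stored
        # bodies to rule out two different messages colliding.
        if body in candidates:
            return True
        candidates.append(body)
        return False


# A deliberately weak fingerprint (the body length) to force collisions.
weak = DuplicateCache(digest=lambda body: str(len(body)))
results = [weak.is_duplicate(b) for b in ("abc", "xyz", "abc")]
```

With the weak fingerprint, "abc" and "xyz" collide, yet only the
repeated "abc" is reported as a duplicate, because the stored bodies
are compared as well. A real setup would presumably feed in the
preprocessed message rather than the raw body, and would need
persistent storage instead of a dictionary.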

Pardon the length of this,

/* era */

-- 
.obBotBait: It shouldn't even matter whether     <http://www.iki.fi/era/>
I am a resident of the state of Washington. <http://members.xoom.com/procmail/>
 * Sign the European spam petition! <http://www.politik-digital.de/spam/en/> *