A post from me on this topic is on its way to the list. It responds to a
further article from Daniel Munden, which Daniel had mailed directly to me
as well as to the list; I drafted my reply from the direct copy before I
checked mail from the list. When I did get to my folder for the Procmail
List, I saw this from Sean Straw:
| There is another mechanism: CRCs. Takes a little more horsepower to
| generate (not as trivial as a simple checksum, but hardly a CPU hog
| either). Take the CRC of the body and store it into one field, the length
| of the body into another field, and the CRC of the subject and store it
| into yet another. CRCs of other select header elements could be used as
| well (such as From, and of course, the Message-ID).
Much better idea. I would leave out CRCing the From: lines, because they're
often trivially different (and all forged) on multiple copies of spam, and
Daniel said from the first that the duplicate bodies were on messages with
different IDs, so comparing Message-Id: CRCs wouldn't help him.
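
For illustration only, here is a rough sketch of that scheme in Python rather
than in a procmail recipe. It is not anyone's actual setup: the names
fingerprint() and is_duplicate() are made up for the example, the "seen" set
stands in for whatever cache file you would really keep, and it assumes an
unfolded Subject: line and a body that starts after the first blank line.
From: and Message-Id: are deliberately left out of the fingerprint.

```python
import zlib


def subject_of(head):
    """Return the text of the first Subject: header, or '' if there is none.

    Assumption: the header is on a single line (folded headers are ignored).
    """
    for line in head.splitlines():
        if line.lower().startswith("subject:"):
            return line[len("subject:"):].strip()
    return ""


def fingerprint(raw):
    """Return (CRC of body, length of body, CRC of subject) for one message."""
    # Headers end at the first blank line; everything after it is the body.
    head, _, body = raw.partition("\n\n")
    body_bytes = body.encode("utf-8")
    subject_bytes = subject_of(head).encode("utf-8")
    return (zlib.crc32(body_bytes), len(body_bytes), zlib.crc32(subject_bytes))


def is_duplicate(raw, seen):
    """Report whether this fingerprint was seen before, recording it if not."""
    fp = fingerprint(raw)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

Fed two copies with the same Subject: and body but different From: and
Message-Id: lines, is_duplicate() should pass the first and flag the second.
A copy with even one character of the body changed gets a new fingerprint
and goes through.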
As for grouping the right spelling of a word and assorted common misspellings
together (yes, Era, but what's a "finterprint"?), I would guess that it's
unnecessary to allow for them, because most of these are canned texts that
go out again and again, and nobody edits them. A punctuation, spelling,
grammar, diction, or syntax mistake in the original almost always remains
unchanged. If someone does fix it, you'll be sent 200 copies with the typo
and 200 without, and your duplication detector will let one of each through
instead of only one; I think the likelihood and the damage are both so
slight that it's not worth the trouble.
Points to Era for using "intuitively" correctly, though, and we can live with
what a certain other poster did to "minor".