procmail

Re: Alleviating Duplicates

1999-05-10 01:59:32
On Mon, 10 May 1999 03:05:32 +0000, Bennett Todd
<bet(_at_)newritz(_dot_)mordor(_dot_)net> wrote:
> 1999-05-09-20:49:11 Stan Ryckman:
>> 1999-05-08-18:35 Bennett Todd:
>>> Suppose we pursued the "fingerprint" trick to an extreme. I'm
>>> thinking, toss the entire header, then start scanning through the
>>> body, ignoring whitespace and filler words, for each substantive
>>> word, generate the soundex code; accumulate say the first 16
>> The start of the message is probably a bad place to start for
>> anything useful; it's not uncommon to see a canned intro to
>> changing data (such as a list digest).
>> Also, note my quoting of you above... if several did this, your
>> filter would toss all but the first as "duplicates."
> Good point. Maybe the trick would work if you soundexed every
> nontrivial word in the entire body, then MD5-ed the string of
> soundexes to make the fingerprint?
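
For concreteness, the proposal quoted above could be sketched roughly like
this in Python. Everything named here is an assumption made for
illustration: the FILLER word list, the letters-only tokenization, the
common American Soundex variant, and the function names soundex() and
fingerprint(). None of it comes from an existing procmail recipe.

```python
import hashlib
import re

# Letter groups of the common American Soundex variant (an assumption;
# other variants exist and give different codes).
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit

# Hypothetical filler-word list; a real filter would need a longer one.
FILLER = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


def soundex(word):
    """Return the four-character Soundex code of word ("" if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


def fingerprint(body):
    """MD5 over the Soundex codes of every non-filler word in body."""
    words = re.findall(r"[A-Za-z]+", body)
    codes = [soundex(w) for w in words if w.lower() not in FILLER]
    return hashlib.md5(" ".join(codes).encode("ascii")).hexdigest()


if __name__ == "__main__":
    print(fingerprint("The quick brown fox"))
    print(fingerprint("the   QUICK\nbrown fox"))  # same fingerprint
```

Whitespace, case, and filler words drop out as intended, but so does
every distinction that Soundex itself discards: two bodies that differ only
in words with equal Soundex codes get the same fingerprint.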

I think it's good to keep in mind that Soundex solves only one kind
of problem (matching names that sound alike) and that it might not be
very useful as a generalized canonicalization of all sorts of strings.

(A particular example I was told about a long time ago was in a Fido
software package which would attempt to deliver to misspelled
addresses [which is misdirected anyhow -- there's an essay on this in
the Sendmail FAQ or somewhere] and used Soundex to sort them out.
Problem was, "sysop" and "uucp" both produce the same Soundex code.
[Both are pseudoaccounts of some importance on a Fido system with
UUCP.])
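
How easily unrelated strings collapse under Soundex can be checked with a
small sketch. It assumes the common American Soundex rules; the soundex()
helper is written for this illustration and is not taken from the Fido
package, whose exact variant is unknown. Under these rules "Robert" and
"Rupert" collide outright, while "sysop" and "uucp" share the digits 210
but keep different leading letters, so whether they collide depends on how
a given implementation treats the first letter.

```python
# Letter groups of the common American Soundex variant (an assumption).
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of word ("" if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "sysop", "uucp"):
        print(name, soundex(name))
    # Robert R163
    # Rupert R163
    # sysop S210
    # uucp U210
```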

/* era */

I'd also be a bit suspicious of anything that ran multiple "digest"
algorithms on top of each other, but maybe I'm just superstitious.

-- 
.obBotBait: It shouldn't even matter whether     <http://www.iki.fi/era/>
I am a resident of the state of Washington. <http://members.xoom.com/procmail/>
 * Sign the European spam petition! <http://www.politik-digital.de/spam/en/> *
