procmail

Re: Alleviating Duplicates

1999-05-09 20:17:40
1999-05-09-20:49:11 Stan Ryckman:
1999-05-08-18:35 Bennett Todd:
Suppose we pursued the "fingerprint" trick to an extreme. I'm thinking: toss
the entire header, then scan through the body, ignoring whitespace and filler
words. For each substantive word, generate the soundex code, and accumulate,
say, the first 16 soundex codes. Pad with a null value (say 0000) for short
messages that don't produce 16 soundex codes' worth of body.

The start of the message is probably a bad place to look for anything
useful; it's not uncommon to see a canned intro followed by changing data
(such as a list digest).

Also, note my quoting of you above... if several people did this, your
filter would toss all but the first reply as "duplicates."

Good point. Maybe the trick would work if you soundexed every nontrivial word
in the entire body, then MD5-ed the string of soundexes to make the
fingerprint?
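
For concreteness, here is a rough Python sketch of both variants: the
first-16-codes fingerprint and the MD5-over-everything one. It is only an
illustration, not something wired into procmail. The stopword list, the
"drop words shorter than three letters" rule, and all function names are
assumptions made up for the example; the soundex is the usual American
variant (h and w do not separate equal codes).

```python
import hashlib
import re

# Hypothetical filler-word list; a real filter would want a longer one.
STOPWORDS = {"the", "and", "for", "then", "that", "this", "with", "you"}

# Standard soundex letter groups.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character soundex code of a non-empty alphabetic word."""
    word = word.lower()
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code counts again
    return (result + "000")[:4]


def strip_header(message):
    """Toss everything up to the first blank line (the header)."""
    _header, sep, body = message.partition("\n\n")
    return body if sep else ""


def body_codes(body):
    """Soundex codes of every nontrivial word in the body, in order."""
    words = re.findall(r"[a-z]+", body.lower())
    return [soundex(w) for w in words
            if len(w) >= 3 and w not in STOPWORDS]


def fingerprint_first16(body):
    """First 16 soundex codes, padded with 0000 for short messages."""
    codes = body_codes(body)[:16]
    codes += ["0000"] * (16 - len(codes))
    return "".join(codes)


def fingerprint_md5(body):
    """MD5 of the soundex codes of the entire body."""
    joined = " ".join(body_codes(body))
    return hashlib.md5(joined.encode("ascii")).hexdigest()
```

With something like this, two replies that open with the same long quoted
paragraph get the same fingerprint_first16() but different
fingerprint_md5() values as soon as the replies themselves differ, which
is the problem Stan pointed out. The flip side is that any change at all
in the body, such as an appended list footer, changes the MD5 variant.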

-Bennett
