procmail

Re: Alleviating Duplicates

1999-05-08 11:51:08
An idea just hit me. Since the "formail -D" trick works fine for many people,
this idea could also be implemented as a separate standalone program, with
similar behavior.

Suppose we pursued the "fingerprint" trick to an extreme. I'm thinking: toss
the entire header, then scan through the body, ignoring whitespace and filler
words. For each substantive word, generate the soundex code, and accumulate,
say, the first 16 soundex codes. Pad with a null value (say 0000) for short
messages that don't produce 16 soundex codes' worth of body. If you want to
get even more efficient, pack the soundex codes down; with at most 26000
distinct possible codes (a letter plus three digits), packing each into two
bytes is a no-brainer.
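
Roughly what I have in mind, as a Python sketch rather than a finished tool.
Everything named here (FILLER, soundex, fingerprint, pack) is made up for
illustration; the filler-word list is an arbitrary guess, the header is
assumed to end at the first blank line with plain "\n" line endings, and
only ASCII letters count as words.

```python
import re
import struct

# Standard Soundex digit groups: BFPV=1, CGJKQSXZ=2, DT=3, L=4, MN=5, R=6.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)

# Hypothetical filler-word list; tune against a real archive.
FILLER = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
          "that", "for", "on", "with"}


def soundex(word):
    """Return the 4-character Soundex code for an ASCII alphabetic word."""
    word = word.upper()
    out = word[0]
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (out + "000")[:4]


def fingerprint(message, count=16):
    """Soundex codes of the first `count` substantive body words, null-padded."""
    # Toss the header: assumed to be everything before the first blank line.
    _, _, body = message.partition("\n\n")
    words = (w for w in re.findall(r"[A-Za-z]+", body) if w.lower() not in FILLER)
    codes = [soundex(w) for w in words][:count]
    return codes + ["0000"] * (count - len(codes))


def pack(codes):
    """Pack each code into two bytes; 0 is reserved for the null code."""
    values = [0 if c == "0000" else (ord(c[0]) - ord("A") + 1) * 1000 + int(c[1:])
              for c in codes]
    return struct.pack(">%dH" % len(values), *values)
```

Two messages whose bodies differ only in whitespace, case, or filler words
come out with the same 32-byte key from pack(fingerprint(msg)), which could
then be looked up in a cache much the way "formail -D" does with Message-IDs.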

This is beginning to sound like a pretty fun project. And I've got a great
archive, complete with scads of dups, to test against. Hmm. Maybe I'll score a
round tuit.

-Bennett
