Re: formail -D & using hashcodes instead?

"Karl" == Karl E Vogel 
<vogelke(_at_)c17mis(_dot_)wpafb(_dot_)af(_dot_)mil> writes:


    >>> Robert <dummy(_at_)c2(_dot_)org> writes:
    R> Has anyone considered the thought of using a hash function on
    R> the body of a message instead of merely using the Message-ID
    R> field for "formail - -D"?

    Karl>    A few years ago, the same folks who wrote "agrep" and
    Karl> "glimpse" presented a Usenix paper on a program called "sif"
    Karl> (similar files).  Their approach was to generate and store a
    Karl> series of hash values from a moving 70- or 80-byte window in
    Karl> a file and then look for recurring combinations of
    Karl> hashcodes.  It took up some disk space but worked well for
    Karl> finding occurrences of one file which had been included in
    Karl> another one, etc.

Robert did say "identical", so maybe if he can be more specific about
what "identical" means we could come up with a suggestion more
efficient than this general-purpose sif.  (A better name would be
"sift", "SImilar File Tagger" ;-)

I don't see how 'sif' would work exactly.  However, I suppose they
would have a hash code for each position in the file?  I can't see how
you could avoid sliding the window one byte at a time.  Still, the
disk space doesn't sound too great for Robert's application; I don't
see that you'd really need more than one-byte hash codes, so the
storage would be equivalent to one week's email.  Presumably he can
filter out most correspondents who are *not* susceptible to
duplicates....  So it would actually be less than that.

But the computation sounds awfully expensive for regular use by a
"normal user".  If you keep 50kB of mail in 50 messages from the last
week, that's 1000 hash codes for every message, matched against 50
hash code sequences.  Sounds great for spam-hunters and indexing
services (the kind of folks who can devote a half-dozen Pentia and
several GB of disk space...).

I suggest that Robert arrange that his spam-hunting recipe
automatically generate a complaint to the correspondents who are
consistently sending repeats, and get them to arrange a mailing list
instead sending dupes ... that would make the computation worth it!

-- 
                           Stephen John Turnbull
University of Tsukuba                                        Yaseppochi-Gumi
Institute of Policy and Planning Sciences  http://turnbull.sk.tsukuba.ac.jp/
Tennodai 1-1-1, Tsukuba, 305 JAPAN                 
turnbull(_at_)sk(_dot_)tsukuba(_dot_)ac(_dot_)jp