Re: formail -D & using hashcodes instead?

"Mark" == Mark J Bynum <bynum(_at_)cs(_dot_)fsu(_dot_)edu> writes:


    >>>>> Robert <dummy(_at_)c2(_dot_)org> writes:
    R> Has anyone considered the thought of using a hash function on
    R> the body of a message instead of merely using the Message-ID
    R> field for "formail -D"?

    Mark> Why couldn't Robert do something like write a checksum to a
    Mark> cache file and then on each successive message check to see
    Mark> if that checksum is in there. If so then don't accept (and,
[snip]
    Mark> Could it be as simple as this?

As I mentioned before, it depends on what he means by "identical".
For the checksum idea to work:

(1) leading and trailing whitespace should be deleted (no sweat)
(2) there must be no leading garbage (e.g., "Hey, check this out!")
(3) there must be no trailing garbage (but sigs are OK as long as they
    are strippable, that is, they follow the "\n-- \n" convention)
(4) the systems sending duplicates must treat whitespace the same way
    (some systems expand tabs, some automatically convert newlines to
    CRLF pairs)
(5) the systems sending duplicates must treat line length the same way
    (some systems automatically fold long lines; of those, some fold
    at whitespace, others at a fixed column)

Of course, stripping the leading and trailing whitespace and sigs
would demand the use of sed or gawk or a special-purpose filter.  But
those are easy enough tasks.
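
To make the caveats concrete, here is a rough sketch of such a
special-purpose filter, in Python rather than sed or gawk.  The function
names and the choice of SHA-256 are my own assumptions for illustration,
not anything formail provides.  It handles the line-ending and
whitespace half of (4), strips a sig that follows the "\n-- \n"
convention, and trims leading and trailing whitespace.  It does nothing
about leading garbage (2), and a fold at a fixed column that splits a
word would still defeat it.

```python
import hashlib
import re

# Conventional signature separator: dash, dash, space on a line by itself.
SIG_DELIMITER = "\n-- \n"


def normalize_body(body: str) -> str:
    """Reduce a message body to a canonical form before checksumming."""
    # Caveat (4): treat CRLF pairs and bare CRs as plain newlines.
    text = body.replace("\r\n", "\n").replace("\r", "\n")
    # Caveat (3): drop everything from the last sig separator onward.
    head, sep, _sig = text.rpartition(SIG_DELIMITER)
    if sep:
        text = head
    # Caveats (4) and (5): collapse every run of whitespace (tabs, spaces,
    # newlines) to one space, so expanded tabs and folds at whitespace
    # no longer matter.
    text = re.sub(r"\s+", " ", text)
    # Caveat (1): delete leading and trailing whitespace.
    return text.strip()


def body_checksum(body: str) -> str:
    """Hex digest of the normalized body (SHA-256 is an arbitrary choice)."""
    return hashlib.sha256(normalize_body(body).encode("utf-8")).hexdigest()
```

Two copies that differ only in line endings, folding at whitespace, and
sig would then produce the same checksum, while a copy with "Hey, check
this out!" prepended would not.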

That's my short list of caveats; I could probably come up with a
couple more.  The language used in my locale (Japanese) would make
life substantially harder: there are at least three encoding systems,
and almost all systems convert to the local format for storage and
editing and then send the message out in another form, which would
screw up a blind checksum approach.  But I assume none of the other
parties in this discussion have to worry about *that*.
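
For what it's worth, the encoding problem can be sidestepped if the
charset of each copy is known (say, from its Content-Type header): decode
to a common form first, and checksum that instead of the raw bytes.  The
sketch below is only an illustration under that assumption.  The function
name is hypothetical, and the three codecs used in the usage note
(ISO-2022-JP, EUC-JP, Shift_JIS) are my guess at typical examples, with
the names Python's standard codecs use for them.

```python
import hashlib


def charset_independent_checksum(raw: bytes, charset: str) -> str:
    """Checksum the decoded text rather than the bytes on the wire.

    Assumes `charset` correctly names the encoding of `raw`; a wrong
    label raises UnicodeDecodeError or silently yields different text.
    """
    text = raw.decode(charset)
    # Re-encode in one fixed encoding so the digest depends only on the
    # characters, not on how a particular system stored or sent them.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With this, the same text sent as "iso2022_jp", "euc_jp", and "shift_jis"
bytes gives one checksum, whereas a blind checksum of the raw bytes
gives three different ones.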

Anyway, the only plausible scenario I've heard for true duplicates
without identical Message-IDs is the "mouse bounce in Netscape" one,
and I can't reproduce it in my Netscape.  I guess you could be
subscribed to a mailing list that munges Message-IDs.  But both of the
copies of the message I'm replying to had the same Message-ID...  So
I think we really need to have a better idea of how these duplicates
are generated.

-- 
                           Stephen John Turnbull
University of Tsukuba                                        Yaseppochi-Gumi
Institute of Policy and Planning Sciences  http://turnbull.sk.tsukuba.ac.jp/
Tennodai 1-1-1, Tsukuba, 305 JAPAN                 
turnbull(_at_)sk(_dot_)tsukuba(_dot_)ac(_dot_)jp