procmail

Re: formail -D & using hashcodes instead?

1996-05-16 16:51:24
"Robert" == Robert  <dummy(_at_)c2(_dot_)org> writes:

    Robert> Has anyone considered the thought of using a hashing
    Robert> function on the body of a message instead of merely using
    Robert> the Message-ID field for "formail -D"?  The problem I'm
    Robert> having is that I often get the exact same message, but
    Robert> sent via two separate mailings (i.e., they don't have the
    Robert> same Message-ID).

    Robert> If no one has done such a thing, has anyone written a
    Robert> recipe which might do the equivalent, i.e., filter out
    Robert> messages which are exactly the same in the body?

I don't know if it's ever been done for private use with email, but
the spam-hunters were doing this kind of thing on Usenet.  I don't
know the current state of the art but if you go back a couple of years
in the archives and look at alt.current-events.net-abuse and its
successor news-group (news.admin.net-abuse ? it's been a while and I
got bored with it so I forget) you should find lots of discussion on
algorithms.

It's nowhere near as easy as doing a hash on the body, though;
typically people will add different quoting header strings, their own
sigs, etc., and in general you will have to allow for that.  YMMV,
though, depending on the exact circumstances.
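
To make "allowing for that" concrete, here is a rough Python sketch of
hashing a normalized body rather than the raw one.  Everything in it is
an assumption for illustration: the names normalize_body and body_hash
are made up, quoting is taken to be a leading ">" or "Name>" prefix,
the sig is taken to start at a "-- " line, and SHA-256 is just one
choice of hash.  It does not handle attribution lines or reworded text.

```python
import hashlib
import re

# Assumption: a quote prefix is optional whitespace, an optional word
# (e.g. "Robert"), then ">" and at most one space.
QUOTE_PREFIX = re.compile(r"^\s*\w*>\s?")


def normalize_body(body: str) -> str:
    """Strip quote prefixes, cut the signature, collapse whitespace."""
    kept = []
    for line in body.splitlines():
        # Remove nested quote prefixes one at a time ("> > text").
        while True:
            stripped = QUOTE_PREFIX.sub("", line, count=1)
            if stripped == line:
                break
            line = stripped
        # Conventional signature delimiter: everything after it is dropped.
        if line.rstrip() == "--":
            break
        line = " ".join(line.split())
        if line:
            kept.append(line)
    return "\n".join(kept)


def body_hash(body: str) -> str:
    """Hex digest of the normalized body, usable as a duplicate key."""
    return hashlib.sha256(normalize_body(body).encode("utf-8")).hexdigest()
```

Two copies that differ only in quoting, spacing, and sig then get the
same key; a line such as "x>y" would be mangled by the prefix rule, so
treat it as a heuristic.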

The only situation I can think of where you wouldn't have to worry
about sigs, quoting, and other minor variations is when somebody sends
to you and to a mailing list/email-gated newsgroup, and then the
Message-ID should be the same.
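
For that simple case the check is just "have I seen this Message-ID
before?", which is what "formail -D" does with an ID cache file.  A
minimal Python sketch of the same idea follows; the function name
is_duplicate and the in-memory set standing in for the cache are
assumptions made for illustration.

```python
from email import message_from_string


def is_duplicate(raw_message: str, seen_ids: set) -> bool:
    """Return True if this message's Message-ID was already recorded."""
    msg = message_from_string(raw_message)
    msg_id = (msg.get("Message-ID") or "").strip()
    if not msg_id:
        # No Message-ID header: nothing to compare, so keep the message.
        return False
    if msg_id in seen_ids:
        return True
    seen_ids.add(msg_id)
    return False
```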

-- 
                           Stephen John Turnbull
University of Tsukuba                                        Yaseppochi-Gumi
Institute of Policy and Planning Sciences  http://turnbull.sk.tsukuba.ac.jp/
Tennodai 1-1-1, Tsukuba, 305 JAPAN                 
turnbull(_at_)sk(_dot_)tsukuba(_dot_)ac(_dot_)jp