Dealing with spam/UBE (summary)

I'm sure spam filtering is not a new topic to this group, but the method
I'm currently using might be.  I'll briefly state the idea below along
with some issues I have yet to solve.  Your comments would be appreciated.


Thanks all for the feedback thus far.  It helps to know that this has
been tried before (I figured it had), and to what degree of success.
It sounds like this is a sensitive issue; I have certainly received my
share of thrashing so far...  :)

I'll address some of the comments in this summary, trying to keep them
more relevant to procmail and less philosophical.  I assume spam filtering
is interesting to most of you, but if some of you find the perpetuation of
this thread inappropriate for this list, let me know and I'll reply directly
to the senders instead.

I was hoping this discussion might get more into the mechanics of how to do
some of these things with specific recipes, but as long as the rationale
itself is interesting to the readers we can pursue that first...


Basically I agree that implementing this kind of filter may reduce the
amount of "desired" mail I receive, but I think I'm willing to live with
that in order to keep out the spam.  I use my mailbox more as a tool than
as a service or meeting place.  That doesn't mean I won't do what I can to
make it easy (or transparent) for people to reply -- I think there are some
effective ways to do that too.


Inconvenience to sender -

Several of you commented on the issue of people not bothering to resend
the message (i.e., register), especially when they are sending the message
directly to me in response to a Usenet posting or the like.

First off, I hope to make the registration as easy as hitting a reply
button (with include).  If that is the case, then it would be almost the
same amount of effort for someone to reply as to delete.  If they took the
time to compose a response in the first place, I figure they won't want
to just toss it away when they can push it through with almost the same
effort (most people are proud of what they write and want it to be read).

This is the area where I think the recipes get interesting -- how to
make this work so that a simple reply is sufficient (even without include,
assuming we cache the original message locally for some period of time).
This could go a lot of different ways.  I'd welcome some additional input
along these lines...
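
To make the reply-to-register idea concrete, here is a rough sketch of the
logic in Python rather than as a procmail recipe.  It is only one way this
could go.  Everything named here (PendingStore, the "reg-" token format, the
in-memory dictionaries) is made up for illustration; a real setup would keep
the cache on disk, expire old entries, and put the token in the autoreply's
Subject so that a plain reply, without include, carries it back.

```python
import secrets
from email.message import EmailMessage
from email.utils import parseaddr


class PendingStore:
    """Hypothetical in-memory cache of held messages plus registered senders."""

    def __init__(self):
        self.registered = set()   # lower-cased sender addresses
        self.pending = {}         # token -> (sender, held message)

    def handle_incoming(self, msg):
        """Return ('deliver', [messages]), ('challenge', token) or ('drop', None)."""
        sender = parseaddr(msg.get("From", ""))[1].lower()
        if not sender:
            return ("drop", None)          # nobody to send a challenge to
        if sender in self.registered:
            return ("deliver", [msg])

        # A reply to an earlier challenge carries the token in its Subject.
        subject = msg.get("Subject", "")
        for token, (held_sender, held_msg) in list(self.pending.items()):
            if token in subject and held_sender == sender:
                del self.pending[token]
                self.registered.add(sender)
                return ("deliver", [held_msg])   # release the cached original

        # Unknown sender: hold the message and hand back a token for the autoreply.
        token = "reg-" + secrets.token_hex(8)
        self.pending[token] = (sender, msg)
        return ("challenge", token)


# Example: the first message from a new address is held, not delivered.
store = PendingStore()
first = EmailMessage()
first["From"] = "Someone New <new@example.com>"
first["Subject"] = "about your post"
action, token = store.handle_incoming(first)
```

Because the original is cached, the reply itself can be thrown away once it
has released the held message; the sender never has to retype anything.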

Secondly, at least in my case, the vast majority of mail I receive from
"new" sender addresses is unsolicited.  Only rarely is it in response
to some question I posed, and when it is, it's because I sent the question
to a specific person or domain, in which case I can preregister them so
that they don't need to do it themselves.

I stopped posting to newsgroups a while ago when doing so inevitably got
me on lots of bulk email lists (from newsgroup address harvesting).  If
you often post to newsgroups and look for direct email responses, then this
kind of filtering scheme may not be for you.  It works best when the majority
of your mail from "new" senders is unsolicited.


Inconvenience for the mailbox owner -

The idea here is that it requires _no_ effort from the mailbox owner,
since all the registration is done by the new sender.  If you're really
paranoid about them not being willing to register, then yes you can
preregister them, but even that can be automated so you don't have to
spend any effort doing it.
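
As a rough illustration of automating the preregistration, the sketch below
(Python, not a procmail recipe) adds everyone I write to into the registered
set.  It assumes a copy of each outgoing message can be passed through some
hook; the function name and the idea of a plain set of addresses are
hypothetical.

```python
from email.message import EmailMessage
from email.utils import getaddresses


def preregister_recipients(outgoing, registered):
    """Add every To/Cc address of an outgoing message to the registered set."""
    fields = outgoing.get_all("To", []) + outgoing.get_all("Cc", [])
    for _name, addr in getaddresses(fields):
        if addr:
            registered.add(addr.lower())
    return registered


# Example: a question sent to two people preregisters both of them.
question = EmailMessage()
question["To"] = "Alice <alice@example.com>, bob@example.org"
question["Subject"] = "a question"
registered = preregister_recipients(question, set())
```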

I'd imagine all the active filtering-out of specific spam sites or content
phrases, and keeping that current, also constitutes an inconvenience for
the mailbox owner.  I guess it just depends on which is more inconvenient
for you. 


Bandwidth consumed by autoreplies -

I don't think it's fair to compare the bandwidth used by this kind of
autoreply to the bandwidth used by the spam in the first place.  For any
spam event, this is just one small reply compared to the many thousands
of copies of the spam that were sent out.  And as for autoreplies to
non-spam messages -- it is only one extra message compared to the entire
set of interchanges between you and the sender from that point into the
future.  I really don't consider the bandwidth consumption of the filter
replies a significant issue.


Mailing lists -

This is the challenge I'm most interested in addressing now.  In case you're
just tuning in, it has to do with how to distinguish mail received through
a mailing list from mail that comes in independently.

This whole idea of bouncing mail from unfamiliar addresses falls apart
quickly if you apply that to mail you receive from a mailing list.  I
basically won't use this kind of filter until I get that problem solved.
I'm not expecting list participants to have to register with me -- that
would be self-centered and presumptuous, and fundamentally wrong.

Era Eriksson (era(_at_)iki(_dot_)fi) mentioned something about using the
FROM_MAILER procmail qualifier, but I guess I'm not sure how that would
consistently trigger on a mailing-list mail and not on others (Era, perhaps
you can elaborate).

Any other comments regarding the mailing list issue?  How about some
discussion on common header formats or other match techniques to
identify list mail?
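
To give the header discussion a starting point, here is a hedged sketch of
the kind of test I have in mind, written in Python only to show the logic.
The header names are ones that some list software is known to add, but the
set is an assumption and certainly incomplete, and the function name is made
up.  Spammers can forge any of these, so matching the specific lists I
subscribe to would be safer than a generic test like this.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Headers added by some list software; an assumed, incomplete set.
LIST_HEADERS = ("List-Id", "List-Post", "List-Unsubscribe",
                "Mailing-List", "X-Mailing-List")


def looks_like_list_mail(msg):
    """Guess whether a message arrived through a mailing list."""
    # Any dedicated list header is taken as a sign of list traffic.
    if any(name in msg for name in LIST_HEADERS):
        return True
    # Many lists mark their traffic with "Precedence: bulk" or "list".
    if re.search(r"\b(bulk|list)\b", msg.get("Precedence", ""), re.I):
        return True
    # Some list managers set Sender to an owner-<listname> address.
    sender = parseaddr(msg.get("Sender", ""))[1].lower()
    return sender.startswith("owner-")


# Example message with a list-style Sender header.
sample = message_from_string(
    "From: someone@example.com\n"
    "Sender: owner-somelist@lists.example.com\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
is_list = looks_like_list_mail(sample)
```

Mail that passes a test like this would skip the registration step entirely
and go straight to the normal list folders.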

If you simply can't, in good conscience, contribute tips on how to
implement this scheme, then so be it.  However I'm sure some of you enjoy
working problems for the problem's sake...

Paul