How TO Filter Spam

On Thu, 19 Feb 2004, Iljitsch van Beijnum wrote:

I actually think that the spamassassin/procmail combination above is
nearly ideal on the MUA side,


It is not, because:

1. Bandwidth is used up by spam (which is fortunately usually not that 
big) and worms (which tend to be much bigger)


Close to a MB/day (on average) for me personally -- html is not compact.
Bouncing these (at the MTA or beyond) requires examining them and
applying tests to the entire message, generally, so bouncing at the MTA
saves no bandwidth -- it comes close to doubling it.

More than doubles it if the bounce generates further autogenerated mail
as it caroms off of bogus return addresses.  Note that server/network
load is likely dominated by latency, not bandwidth -- most email
messages are on the order of a single packet or two of data with ethernet
MTUs, but negotiating a transaction requires several rounds of small
packets.

Not enormous, agreed (although that is partly a matter of the size of
your organization and number of users and their internet "profile" to
spammers) but a significant fraction of all mail and not trivial.

2. A lot of processing time is used on your system(s)


This is the same for MTA-side and MUA-side processing as well.  In fact
it might be the same core tool used in the two cases.  Processing power
is linked to how sophisticated a filter you apply, and a stupid filter
is a cure FAR worse than the disease.  Either way you have to receive
the entire message at least to memory and apply the filter.

The processing power required, BTW, while non-trivial, is easily within
the reach of modern CPUs for up to perhaps a thousand users per server
(I don't really know how much beyond -- we're not close to a boundary
here with hundreds of users).  A fully userspace solution even permits
the filtering to be distributed to the users' own processors, offloading
it altogether from the mail server.

The one thing I can think of that an MTA-side bounce without any sort of
spooling or significant logging of the rejected messages saves relative
to user side sorting is disk spool, and this is really at the option of
the user, who can spool rejects into /dev/null.

3. Either:
3a. Legitimate senders who are tagged as spam are blackholed and are 
unaware that you don't read their message
3b. You must manually sort through all messages that are flagged as spam


Instead you have legitimate readers who may be unaware that a message to
them was bounced or blacklisted, you have one size fits all message
sorting, you have the aforementioned doubling of bandwidth consumption,
you have the possibility of your site being used as a "reflector" in a
DDOS attack.  And you (the user) CAN'T sort through all messages that
are flagged as spam because they aren't spooled, they are rejected!

In MUA-side sorting with no bounce you control the sorting process more
or less completely and can set rejection thresholds high enough that
your false positive rate is (in YOUR judgement, not mine, the IETF's, or
your local system's administrator) negligible and then neglect them.  At
5, SA's false positive rate is maybe one message in 100 days, from my
spot checks of it, and that rate actually decreases in time as they
refine the tests.  Its false negative rate is high -- a few percent of
the spam it sees makes it through -- but so does more or less all of the
spam-like legitimate mail I get including mail from vendors and vendor
quotes.  The spam that does make it through is generally "stealth spam"
(the kind with relatively short messages with lots of random words
intended to confuse word count filters, few or no graphics, and no
embedded html) and is easily rejected.  A weekly catalog newsletter from
e.g. musician's friend and a local bookstore make it through (I'm
subscribed on their lists but have NOT whitelisted their sites) and act
as coal miner birds for other desired spam-like traffic.

Those of my friends who use a lower spam threshold (one that lets almost
NO spam through to their primary spool at the expense of a larger false
positive rate) do manually sort their presorted spam folder.  However,
BECAUSE the folder is presorted, all they functionally have to do is
scan the subject/from lines and use their "internal" whitelists to pick
out the false positives, click a button or two to delete the rest en
masse, and move on.  As in it takes them maybe a minute or two to
process hundreds of rejects with very high reliability and without
having to look at message contents at all in almost all cases.  If you
like, they prefer a two-stage sort, with human judgement used before
final rejection but with a computer doing 80% of the work (winnowing
all the spam out of their mainline mail spool, where they DO read each
message and think about it one at a time).  I'm talking systems people,
mostly, who have a very low false positive threshold and who DON'T want
to explain to their user base (which might include their employer, for
example) why a message sent to them is flagged and bounced as SPAM.

With MUA-side processing, you can even do a three way sort -- sort all
spam-level 4-5 stuff into a "maybe" folder, and all 5+ stuff into a spam
folder or /dev/null.  At 5+ your false positive rate would be one to
three a year -- far lower than the human false positive or "oops"
factor, especially when dealing with a junk-filled mailbox.  Almost no
spam would make it into your mailbox.  The 4-5 "borderline" messages
would constitute maybe 10% of all spam, would have maybe 2% false
positives mixed in (mostly things like musician's friend catalogs that I
often don't read anyway) and would take a very short period every day to
terminally sort and reject.
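
As a rough illustration of the three-way sort just described (a sketch
only, not the procmail recipe actually in use here), the decision can be
written in a few lines of Python.  It assumes the score is available in
an X-Spam-Status header as "hits=N" or "score=N"; the folder names and
the function names spam_score and sort_folder are made up for the
example.

```python
import re
from email import message_from_string

# Thresholds taken from the text: 4-5 goes to "maybe", 5+ goes to spam.
MAYBE_THRESHOLD = 4.0
SPAM_THRESHOLD = 5.0

# Assumption: the score appears in X-Spam-Status as "hits=N" or "score=N".
_SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def spam_score(raw_message):
    """Return the numeric score from X-Spam-Status, or None if absent."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = _SCORE_RE.search(status)
    return float(match.group(1)) if match else None


def sort_folder(raw_message):
    """Pick one of three hypothetical folders: inbox, maybe, or spam."""
    score = spam_score(raw_message)
    if score is None or score < MAYBE_THRESHOLD:
        return "inbox"   # unscored or low-scoring mail is read normally
    if score < SPAM_THRESHOLD:
        return "maybe"   # borderline: skim subject/from lines once a day
    return "spam"        # or /dev/null, at the user's option
```

Because the thresholds are plain per-user constants, each user can move
them to suit their own tolerance for false positives.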

Note well the fineness of control.  I'm NOT totally against MTA-side
processing but would want an MTA-side agent to basically be "spam
assassin" and permit spooling of an intermediate class of "maybe" spam
according to its necessarily fuzzy sorting logic.  The bounce issue is
endlessly debatable.

At the moment, bounces of viruses are a pet peeve of mine -- they ARE a
form of abuse, and this should NEVER be done.  The IETF would do
everybody a favor if they articulated this, and the reasons for it, very
clearly; combined that with a statistical study that indicates that 0%
of the bounce messages actually reach the infected party and no more
than 1 or 2% reach somebody who might be on the same network as the
infected party and capable of passing the word to them; and took it to
the worst AV vendors to get them to deprecate the bounce features in
their existing products and remove them altogether in future ones.
Those features do no good, do a fair bit of harm in the steady state,
and have the potential to do a great deal of harm all at once if all the
AV products suddenly become reflector points in a timed, virus-driven
DDOS attack.  Or we can just tool along and wait for this to happen, and
I can continue to live with the ten or so messages a day, NONE of them
flagged as virus or spam, that are "virus bounces".

The big questions are then: what fraction of SPAM has forged headers and
is a form of "virus" in its own right?  And could the MTA filter be made
smart enough NOT to generate a bounce unless the from/return addresses
matched the registered domain of the originating host?  I actually think
that this sort of thing would be possible -- one way that I identify
viruses now WITHOUT the use of a virus scanner loaded with signatures is
to look at the headers and look for lines like:

X-Authentication-Warning: moorcock.acpub.duke.edu: cyrus set sender to
  bbooth(_at_)earthlink(_dot_)net using -f

or a clear indication that the message has forged headers.  spamassassin
does the same thing:

        * 2.6 FORGED_MUA_OUTLOOK Forged mail pretending to be from MS Outlook

Here is a message for which there is no point in generating a bounce,
for example:

From bobby20(_at_)earthlink(_dot_)net  Thu Feb 19 07:42:40 2004

Return-Path: <bobby20(_at_)earthlink(_dot_)net>
Delivered-To: rgb(_at_)phy(_dot_)duke(_dot_)edu
Received: from ipserv.phy.duke.edu (ipserv.phy.duke.edu [152.3.182.5])
        by mail.phy.duke.edu (Postfix) with ESMTP id CD802A77CF
        for <rgb(_at_)phy(_dot_)duke(_dot_)edu>; Thu, 19 Feb 2004 07:42:40 -0500 (EST)
Received: from 152.3.182.5 (unknown [62.150.67.242])
        by ipserv.phy.duke.edu (Postfix) with SMTP id 48A443AFFB
        for <rgb(_at_)phy(_dot_)duke(_dot_)edu>; Thu, 19 Feb 2004 07:42:38 -0500 (EST)
Received: from 188.106.194.65 by web365.mail.yahoo.com;
        Wed, 18 Feb 2004 12:33:04 -0300

The mail originates from an anonymous client on an unregistered network
and is sent to one of the spam-oozing pustules on the internet where
anybody can get an email account in seconds and abuse it for minutes,
and then throw it away.  It has a forged from and return path (ones that
clearly bear no resemblance to the originating more or less open yahoo
proxy).  It is also OTHERWISE identified as spam -- this particular
message has a spam level of 24.6 (5 required for ~0.01% false positive).

It should be (and was, until now:-) rejected unread and without tainting
the human mind.  It should NOT generate a bounce message, as a bounce is
utterly pointless and will actually create a "reverberation" as
bobby20(_at_)earthlink(_dot_)net is doubtless not a valid address any more
(it is in multiple blacklists) and the earthlink.net MTA will undoubtedly
generate a second bounce of your bounce or harass the hapless user
bobby20 whose fraternity brothers thought that they'd prank him as a
nerd.
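
A minimal sketch of the "bounce only if the return address matches the
originating host" test raised above might look like the following.  It
is an illustration under stated assumptions, not a proposal for any real
MTA: it assumes Postfix-style "from HELO (rdns-name [ip])" Received
lines like the ones in the example, trusts the recorded reverse-DNS name
without forward-confirming it, and uses the hypothetical helper names
in_domain, originating_host and should_bounce.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Assumption: Received headers start "from HELO (rdns-name [ip])",
# as in the Postfix-generated headers shown above.
_RECEIVED_RE = re.compile(r"from\s+\S+\s+\((\S+)\s+\[([0-9a-fA-F.:]+)\]\)")


def in_domain(host, domain):
    """True if host is the domain itself or a name underneath it."""
    return host == domain or host.endswith("." + domain)


def originating_host(msg, local_domain):
    """Recorded reverse-DNS name of the first relay outside local_domain.

    Received headers are listed newest first, so the first non-local
    name is the host that handed the message to our own servers.
    """
    for header in msg.get_all("Received", []):
        match = _RECEIVED_RE.search(header)
        if match is None:
            continue
        name = match.group(1).lower()
        if not in_domain(name, local_domain):
            return name
    return None


def should_bounce(raw_message, local_domain):
    """Bounce only when the Return-Path domain matches the sending host."""
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("Return-Path", ""))
    domain = address.rpartition("@")[2].lower()
    host = originating_host(msg, local_domain)
    if not domain or host is None or host == "unknown":
        return False  # null sender, no usable trace, or no reverse DNS
    return in_domain(host, domain)
```

Applied to a message shaped like the example above, the first non-local
relay is recorded as "unknown", so no bounce is generated.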

Going through my spam reject spool (and this is a REAL reject spool,
filled with current messages like the one above from this very morning)
I have a very hard time finding a single message for which a bounce is
appropriate.  "Real" vendor mail (even mass mail) has a from address and
delivery path that correspond and other things of that ilk such that
sure, a bounce would work but for a real vendor mailing, even a mass
mailing, a bounce isn't really necessary.  They will generally have the
required-by-law unsubscribe at the bottom, and will generally honor it
if you unsubscribe.  They MAY remarket your email address if you
unsubscribe.  With SA you are free to /dev/null their message and let
them continue to waste their resources talking to the hand or you can
actively unsubscribe.  A bounce message might or might not serve to get
you unsubscribed.  Finally, I shudder when I think of bounces of any
sort that make it through lists.  They drive me nuts on the beowulf list
although they frequently ARE caught -- some come "from" a whitelisted
user.

Overall, it boils down to a question of degree of specificity, utility,
and statistics.  Numbers, I truly do love them.  Looking at my own
admittedly anecdotal numbers (although I'm pleasantly exposed to a
horrifically wide range of spam as I seem to be in a few thousand
address books of complete strangers and have a middling large web
presence, so my anecdotal numbers are actually probably not a terrible
sample), a "blind" MTA bounce of spam is more or less totally useless as
not 1 bounce message in 100 will actually go back to the originator.
1% efficiency is a joke.

SO, if y'all really want to push MTA-side bounces:

  a) Use one of the very good MUA-side sorters such as spamassassin for
a few months and accumulate a nice, fat spool full of presumptive spam.
Do this with a few hundred volunteers, actually, to get better
statistics.

  b) Sort the messages into those with malformed headers and those
without, in some way that leads you to believe that there is nobody home
at the return addresses of the malformed ones.  Go ahead, test the
return addresses to be sure!  Just do it from YOUR account and not
mine...:-).

  c) Email the well-formed header return addresses as well.  For yuks,
do so from a new, pristine email address (one that is in NOBODY'S lists)
and measure the time required for the new address to appear in new spam.
For even more yuks, generate an email address for each reply and get
per-return statistics on same.

  d) Determine the statistics from the data.  What is the probability
that a bounce will reach the human or even the organization that
originated the spam?  What is the probability of nucleating ten new
messages from new spammers per bounce (even bounce) reply or
unsubscribe?

THEN you can do two things:  Propose an MTA bounce with a proper
foundation, and propose that foundation.  Presumably it will be one that
does NOT bounce to pretty much the entire class of spam with malformed
headers and may have other rules that you discover by sorting through
all the patterns -- spamassassin gives you a very DETAILED breakdown of
the rules and results used to tag a message spam, so you can actually
look for very subtle multivariate correlates if you want to do a
thorough job.

I personally am a great believer in the central limit theorem, and in
spite of possible biases in my limited sample I suspect that it is
actually very likely highly representative of the current profile of
spam today.  On the basis of my own sample, it is not WORTH it to winnow
out the one message in 10 (to be generous) or more likely one message in
100 (to be realistic) that might, and I say might, make it back to an
originating human.  It definitely isn't worth my own time to try "hand
bouncing" lots of messages to validate my anecdotal impressions.

So there's the gauntlet.  You propose that bounces are important.  Are
they important enough to justify bouncing 9 messages out of 10 to false
return addresses?  How about 99 messages out of 100?  You propose that
bounces will be effective.  Well, starting with a 1% to 10% MAXIMUM
success rate off the top you can't be VERY effective...:-) Perhaps
you'll be effective in warning the even smaller percentage (one that
depends on rejection threshold, to be sure) of false positives that
their message didn't get through.  Here is where one has to look hard at
the cost benefit equation.  Given the above first-approximation
statistics, IS it worth the very substantial costs in misdirected and
useless bounces to catch that small and controllable percentage of false
positives?  If you disagree with my statistics, come up with better
ones.  If you think that the answer is a matter of principle and that we
must leave No Legitimate Mail Behind (however tiny a fraction it is),
well, I would disagree, and we'd have to try to work out a consensus of
some sort.
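
The cost-benefit arithmetic above can be made concrete with a small
back-of-the-envelope calculation.  This is only a sketch: the function
name bounce_outcomes and its parameters are invented for illustration,
and it assumes that every bounce sent to a legitimate sender (a false
positive) actually reaches them.

```python
def bounce_outcomes(rejected, false_positive_rate, reach_rate):
    """Estimate where bounces go for a batch of rejected messages.

    rejected            -- number of messages rejected as spam
    false_positive_rate -- fraction of those that were legitimate mail
    reach_rate          -- fraction of spam bounces reaching the originator
    """
    legitimate = rejected * false_positive_rate  # senders usefully warned
    spam = rejected - legitimate                 # everything else is spam
    reached = spam * reach_rate                  # bounces that find a spammer
    misdirected = spam - reached                 # bounces to bogus addresses
    return {
        "legitimate_notified": legitimate,
        "spam_reached": reached,
        "misdirected": misdirected,
    }
```

With the figures quoted in this message (about 0.01% false positives at
a threshold of 5, and a 1% chance that a bounce reaches the originator),
bounce_outcomes(10000, 0.0001, 0.01) gives about one legitimate sender
notified against roughly 9,900 misdirected bounces.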

At the moment, I'm open minded.  Come up with an actual bounce algorithm
(based on spamassassin ratings unless you can come up with something
better) that bounces no MORE than one message in 100 to a non-existent
or false address and that leaves no room for a DDOS attack to propagate.
You can publish the same tests to the AV crowd so that they can fix
their bounce programs.  That would be an important first step in
convincing me that MTA-side bounces are feasible/desirable, especially
if they are driven by a truly intelligent agent that leaves a user with
PERSONALIZED control over their filtering.

   rgb

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     
email:rgb(_at_)phy(_dot_)duke(_dot_)edu