ietf-asrg

Re: [Asrg] Re: bounces, and anti-spam principles

2007-01-25 21:02:25
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

gep2(_at_)terabites(_dot_)com wrote:
I'm grouping together responses to several individual points on this
thread.

Before trying to reply to this whole thing, you might want to scroll
forward to Comment #6.  You seem to be assuming various things about how
filters have to work that simply aren't the case.

You _can_ handle false positives 100% without the recipient controlling
the rules or even necessarily knowing that a false positive has occurred,
in effect reducing your false positive level to zero without any
recipient involvement.

Our overall system is designed towards _zero_ false positives.  In at
least one way, we're theoretically (and near effectively) there, with
DNSBLs and other techniques.

In fact, in most ways we're far more aggressive than other filtering
schemes.  We can afford to have the filters misfire on legitimate email,
simply because we've paid as much attention to finding out about
misfires and remediating them as we have to the filtering itself.

What happens after being blocked is just as much part of a well designed
filtering strategy as the filters are.

[comment #1]

In any case, I still contend that simplistic blocking by IP address 
or domain name is a very poor approach, and for a whole variety of 
reasons.

I will contend that there cannot be a content filter that can 
reliably separate spam from non spam. 

It doesn't NEED to be 100.000% accurate.

Nor does any other form of filtering then.

The bulk of mail most people receive comes from people they are familiar
with, and which fits certain patterns. A given sender (mailing list etc)
will typically have a signature file, for instance.  I know that Aunt
Matilda is NOT going to send me an E-mail containing a JavaScript
decryption routine, or an ActiveX enclosure.  She also is not going to
send me an executable attachment.  If stuff like that arrives here, it
is safe to presume it is NOT from her, no matter what the From: address
says (and even if it WAS sent from her computer).

This does not work in the large scale beyond a limited subset of users.
Not everyone has that small a set of correspondents to cope with, and
the "new correspondent" issue remains a big problem.

I guess that depends on what you call "bulk", and how you propose to
detect it.  Again, whatever rule you put into effect (on a global-type
basis) is going to be discovered by spammers and they will engineer
their sending patterns to avoid violating it.  That's why you need a
really narrow and twisty 'gauntlet' they must negotiate, with DIFFERENT
RULES for different recipients, where they don't know and basically can
not figure out what rules they would have to comply with to get a
message through to a particular person.

Having had almost 4 years of intimate operational experience with what
is probably the most effective single anti-spam filtering method there's
been (one specific DNSBL), I can assure you both that it is more
reliable in terms of FPs than any content filter I've ever dealt with
(we've always run hybrid content+source+other techniques), and that its
effectiveness is _not_ declining - in fact it's getting better.

Is it perfect?  No.  Does adding other filtering techniques help?  Yes.
But to claim that it's useless/trivially defeatable/too many FPs/trivial
to do better turns out to simply not be the case.

The trick  is to stop accepting mail from that IP address only until
it has  cleaned up. 

Again, when you have a LOT of users (and possibly MANY servers) behind a
NAT router, denying mail from that IP address results in simply too much
collateral damage. More to the point, it's a very blunt instrument for
the job, and it's relatively simple to do very much better.

So far, there's no indication of the latter being true.  If there's a
methodology better than the CBL (or for that matter Zen) that (a)
doesn't abuse innocent third parties and (b) doesn't require personal
twiddling that is simply not achievable at large scale, I want to hear
about it.

Once the spam is gone there is no need to block the  address unless it
has proven to be a repeat offender without an  effective process for
shutting the spammers down.

What about when the flow of spam is interleaved with all sorts of
good/important traffic as well?

You make a cost-benefit analysis and/or apply other techniques.  Even
content.

Why the insistence on choosing only one?  You don't have to.

Effective spam filtering is best done with a hybrid solution.  No one
technique is complete on its own.

There is a lot of spam which is obvious.  That includes messages which
contain links to known-spam-promoted Web sites (at least in the absence
of contradicting factors, say being from a list discussing spam senders!)

SURBLs are blacklists too, and can be just as blunt as a source IP.
Or are you contending that the user should be explicitly entering them?
With _individual_ spammers using 1000 or more domain names to advertise
the same thing, how effective do you think that can be?

It also includes, for example, messages which are identical to messages
that some number (dozens? hundreds?) of other recipients at the same ISP
have already reported as being "spam".

Have you done any work with checksumming/hashing spam, a la DCC or Razor?
Yes, they're useful.  But these days most high volume spammers randomize
content such that even highly developed de-hashbusting techniques don't
work very reliably.  Doesn't even work on graphical spam anymore.
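
To make the hashbusting point concrete, here is a minimal sketch of a
normalize-then-hash checksum.  It is NOT the algorithm DCC or Razor
actually use; the normalization rules and sample messages are
assumptions chosen only for illustration.

```python
import hashlib
import re


def fuzzy_digest(body: str) -> str:
    """Hash a normalized body so trivial variations map to the same digest."""
    text = body.lower()
    text = re.sub(r"\d+", "", text)        # drop digits (order numbers etc.)
    text = re.sub(r"[^a-z]+", " ", text)   # collapse punctuation and whitespace
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


# Hypothetical sample messages.
original = "Buy CHEAP watches now!!!  Order #12345"
reformatted = "buy cheap   watches now. order #99881"
hashbusted = original + " xqzvj plorf wibble"  # random words added per copy

# Cosmetic changes are absorbed by the normalization...
print(fuzzy_digest(original) == fuzzy_digest(reformatted))
# ...but a few random words per copy give every message a new digest.
print(fuzzy_digest(original) == fuzzy_digest(hashbusted))
```

Stripping the random words back out is the "de-hashbusting" step, and
that is the part that no longer works very reliably.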

One would think that ISPs could
locate and perhaps recategorize identical messages (again, perhaps
tempered by a specific recipient rule) which are still queued and have
not yet been delivered to their remaining customers.

Yes, it will yield some results, but the overall results are highly
disappointing.

But let me state again (and this is part of what made me respond,
starting this sub-thread) that it is virtually NEVER a good idea to
send a bounce message after-SMTP-time, because you can't be sure where
to send it, and most likely you are just harassing another innocent
victim.

That's something I'll certainly echo.

Being able to "slam the phone down" on miscreant IP blocks at the
accept() or HELO is much, much less processing than going through the
entire SMTP interaction and whatever it takes to pass processing off
to an end-user.

It's true that it costs less, but it's also true that it blocks a lot of
innocent and legitimate mail that might be
originating from the same IP address (NAT router?).

This generally doesn't turn out to be a significant issue in practice,
if you spend the time you _should_ be spending to research the DNSBLs
(say) you plan on using.

There could be
dozens, hundreds, or even thousands of innocent users affected.

There could be, if the DNSBL is built in such a way that it's
susceptible to that.  But it doesn't have to be.  And the good ones
aren't in any meaningful way.

So the zombie becomes unable to emit spam, but there's no incentive to
fix it so it's still available to the botmaster for use as a C&C
machine, web/DNS server, and DDoS participant.  I'd prefer that it get
uninfected.

Obviously, that is ideal, but the problem is that after (first!) SMTP
time, the (intermediary, or final) recipient doesn't really know who
they ought to notify...! Notifying the wrong person, or someone who has
no control over the situation, probably does more harm than good.

You don't need to notify out-of-band.  Presuming that recipient systems
use DNSBLs in an appropriate fashion (inline rejects with pointers to
more information), legitimate senders find out that they're blocked and
why.  That's their notification.  With decent DNSBLs, that's sufficient
to initiate resolution of the problem.  It sure is a lot easier than
tweaking Bayes or SpamAssassin (we specifically reject whitelisting
email addresses because of the forgery problem).
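
As a rough illustration of "inline rejects with pointers to more
information", here is a minimal sketch of a DNSBL check at SMTP time.
The zone name, the information URL and the exact reply wording are
hypothetical; the lookup follows the usual DNSBL convention of querying
the reversed IPv4 octets under the list's zone.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets + DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Treat the IP as listed if the query name resolves to an address."""
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False  # no such name (or lookup failure): not listed


def smtp_reply(ip: str, zone: str, info_url: str,
               resolve=socket.gethostbyname) -> str:
    """Reply given during the SMTP session, so the real sender sees it."""
    if is_listed(ip, zone, resolve):
        return (f"550 5.7.1 Mail from {ip} refused, listed in {zone}; "
                f"see {info_url}?ip={ip}")
    return "250 OK"
```

Because the rejection happens inside the SMTP session, the sending
server is the one that reports the failure to its own user, and nothing
is bounced later to a possibly forged address.  The `resolve` parameter
exists only so the sketch can be exercised without live DNS.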

Also, there's a remark somewhere on the CBL web site that impressed me
with its simplicity - something to the effect that "we fully expect the
vast majority of infected/blocked users to _never_ notice that they're
listed, because they're using their provider's smart-hosts as they
should".  If an ISP wants to proactively scan for infected IPs to see
about getting them fixed, they can do that too.

[comment #4]

I'm skipping this one because it takes too long to comment on ;-) other
than:

 Executive summary...

 - blocking email, because it meets some technical criteria, is easier
   on the technical side, but introduces legal problems

It may, perhaps depending on exactly what the technical criteria _are_
and the rationale for blocking it, but taking risks when the business
case/result justifies it is what people do.

You can forestall a lot, for example, by simply saying "it's our policy
to reject email from IPs that appear to be dynamic".  That says nothing
about the spamminess of an individual sender, and if mistaken in
"appearance" simply needs to be fixed.  Or not.

 - blocking email, because the customer said so, may be harder
   technically, but avoids legal problems

The ISP could equally establish an infrastructure where the customer
explicitly delegates filtering decisions to the ISP.

And the protections in law for ISPs to be held harmless for mistakes in
good-faith filtering go a long way to shoot down any attempt even where
the customers haven't formally delegated.  It ain't easy for a sender to
prove bad-faith.  An ISP hasn't been sued in ages.

 - any complications on the anti-spam side are outweighed by equivalent
   complications on the spammers' end.  ISPs will have to enable end
   users to configure their own rules, and everybody's filters and
   whitelists will be slightly different.  Imagine how spammers will
   feel knowing that each of several million targets for a spam-run has
   a slightly different defense, that has to be overcome in order to
   deliver the email.

EXACTLY.  But also, knowing that all the classical ruses to avoid spam
classification (text as image, embedded links, attachments, scripting,
disguised HTML links, etc etc) are a priori denied them.... certainly
takes a major bite out of spammers.

It would if you could.  How can you tell that an image is text?
Blocking spam with embedded links or attachments would probably put us
out of business.  More likely get me fired.

And only allowing executable attachments, HTML, and "big" messages from
known/trusted senders basically eliminates E-mail as a vector for
virus/worm propagation,

"known/trusted" senders by what measure?  Explicit listing of them?
Well, I can tell you about lots of viral attachments from "known senders".

[comment #5]

That's certainly true, and one advantage of fine-grained recipient
blocking is that it doesn't require any great worldwide consensus, nor
any re-engineering of Internet infrastructure.

Nor do DNSBLs ;-)

What WOULD be helpful, though, would be a recognition by the IETF that:

 a) such fine-grained per-sender by-recipient blocking (and hopefully
augmented by subsequent content scanning) is an effective and desirable
approach to the problem, and

As I've been saying, that has yet to be established.

 b) in the general case, blocking of all non-whitelisted E-mails
containing HTML, scripting (probably covered under HTML... is it
possible to put in scripting without HTML?), or attachments is a "best
practice".  (It is probably a good idea to suggest including a maximum
message size, too, as a way of preventing "denial of service" attacks by
sending big E-mails to someone which would be expected to fill their
E-mail inbox to overflowing, blocking subsequent legitimate E-mails).

Obviously, you've not had to deal with the legitimate mail traffic of a
large corporation.  I mention those measures as comic relief at
meetings, because it always produces hysterical laughter.  It'd shut us
down.

This has been true in the past - consider the many DNSBLs and other
activities against spam. When we kept a list of spamming IP addresses
sending to our MTA, we found after 2 weeks that only 1% of the IPs had
sent more than one message. Our subscription to Spamhaus kills about
65% of incoming messages. That is a victory for cooperation and it
makes us think that more cooperation might be better.

Again, the problem is the degree of collateral damage that IP-based
blocking produces.

You haven't demonstrated what that degree _is_.  By long exposure, I can
assure you that the degree is surprisingly low, if you do your homework
as you're supposed to.

We're receiving 1-2 million emails per day.  80-90% of that is spam.  We
have less than 10 FPs per month against Spamhaus' sbl-xbl (which is
doing about 85% of our filtering).  We've arranged things so that the
sender finds out if they're blocked, and there's a well established
procedure by which they can notify us and we can override listings.

If an email is blocked, they contact us, and we forward their email and
fix the listing, is it really a FP?  No.

If every sender correctly interpreted the error message they got and
followed through, then there'd be zero FPs.

I wish our content filters even remotely approached being _that_ good.
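
For scale, the figures above can be turned into a worst-case false
positive rate.  This is only back-of-the-envelope arithmetic: the
30-day month and the choice of the least favourable end of each quoted
range are assumptions.

```python
# Figures quoted above; worst-case ends of each range are chosen on purpose.
emails_per_day_low = 1_000_000   # low end of "1-2 million emails per day"
spam_fraction_high = 0.90        # high end of "80-90% of that is spam"
fps_per_month_max = 10           # "less than 10 FPs per month"
days_per_month = 30              # assumption

# The fewest legitimate messages per month gives the most pessimistic rate.
legit_per_month = emails_per_day_low * (1 - spam_fraction_high) * days_per_month
fp_rate_upper_bound = fps_per_month_max / legit_per_month

print(f"legitimate mail per month (at least): {legit_per_month:,.0f}")
print(f"FP rate upper bound: {fp_rate_upper_bound:.6%}")
```

Even on these pessimistic assumptions the bound comes out at a few
false positives per million legitimate messages.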

[comment #6]

[how users configure their whitelist rules]

The problem being that out of the 60,000 seats here, perhaps fewer than
10 of them are able to competently configure a set of rules like what
you have.  

That's a software implementation issue, not an inherent problem in the
approach.  I envision a button to click on that simply says "allow
E-mails like this from the same sender in the future" and where the
software will open the keyway JUST enough to allow that type of message
if seen again from that sender.  How that recognition is accomplished,
whether by something crude like simple GREP-type scanning, or something
brain-damaged like RegEx pattern matching, or something still more
sophisticated like the pattern matching SNOBOL/SPITBOL offers, or even a
different sort of statistical ranking/rating approach like content
scanners use... will vary from one implementation to another.  The final
products will probably use a combination of techniques.
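
One way to picture the "allow E-mails like this" button is the sketch
below.  Everything in it is hypothetical: the three coarse traits stand
in for whatever pattern matching a real product would use, and the
default of small plain-text messages only (50 KB) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Traits:
    """Coarse features of a message; a stand-in for real pattern matching."""
    has_html: bool
    has_attachment: bool
    size_kb: int


# Assumed default for senders with no rule: small plain text only.
DEFAULT_LIMIT = Traits(has_html=False, has_attachment=False, size_kb=50)


@dataclass
class RecipientRules:
    """Per-recipient allowances, keyed by sender address."""
    allowed: dict = field(default_factory=dict)

    def permits(self, sender: str, msg: Traits) -> bool:
        limit = self.allowed.get(sender, DEFAULT_LIMIT)
        return ((limit.has_html or not msg.has_html)
                and (limit.has_attachment or not msg.has_attachment)
                and msg.size_kb <= limit.size_kb)

    def allow_like_this(self, sender: str, msg: Traits) -> None:
        """Widen the sender's rule JUST enough to admit this message."""
        old = self.allowed.get(sender, DEFAULT_LIMIT)
        self.allowed[sender] = Traits(
            old.has_html or msg.has_html,
            old.has_attachment or msg.has_attachment,
            max(old.size_kb, msg.size_kb),
        )


rules = RecipientRules()
newsletter = Traits(has_html=True, has_attachment=False, size_kb=120)
print(rules.permits("list@example.org", newsletter))   # blocked by default
rules.allow_like_this("list@example.org", newsletter)
print(rules.permits("list@example.org", newsletter))   # now allowed
# An attachment from the same sender is still outside the widened rule.
print(rules.permits("list@example.org", Traits(True, True, 120)))
```

Note that the rule is keyed on the sender address, which can be forged.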

I'm sorry, this simply isn't a human interface issue.  No amount or
technique of per-sender whitelisting comes remotely close to the
accuracy of our production filters, entirely aside from the new
correspondent issue.

You could give our users common filtering software (the reader is
already pretty much standardized) with every filtering knob known to
man, and perhaps three of our users could approach the effectiveness of
the production systems.  I'm only including _me_ in that list because
it's me who built the production systems...  Over a million decision
items are being changed every day in our filters.

Our users simply don't know effective spam filtering techniques.  They
push the wrong button, twist the wrong knob, and they're blocking
something business critical.  Or trusting malicious content with forged
credentials.  Or simply trusting...

My favourite incident was the user who repeatedly insisted that he
needed to receive the "important information from the FBI" that was
sitting in his quarantine, and the quarantine forwarder refused to
forward to his mailbox.

Sorry, I said, but as much as you may want to see it, forwarding the
virus is a really bad idea ;-)

Many of them don't even have a clear notion of what a "source IP" is,
let alone being able to make reasonable choices about, say, why you'd
want to block dynamic IPs or IPs in Korea.

Again, I consider IP-based blocking to be inherently flawed, to the
point where I consider it a dead-end.

It's a remarkably vigorous dead end ;-)

Furthermore, and with complete irony, I'll note that the only reason I
read this thread is that my very own, personally trained, UA bayesian
filtering flung it all in the junk folder! ;-)

:-)

Yeah, I admit that I usually at least give the Yahoo mail "spam" folder
a cursory eyeballing too, rather than just emptying it.
Occasionally I -do- find a non-spam message there.  (Although that
happens seldom, as I almost never give that E-mail address to anybody...
It's almost useful as a "personal honeypot" to see what's being spammed
out, before going to my more usual E-mail accounts and possibly
wondering if that curious E-mail just MIGHT be legitimate).

"Almost useful" is the key.  When you have users whose spam load ranges
from one or two per month to 4000+ per day, you can see that junk folders
have only limited usefulness, and not to everyone.  Nobody can find
legitimate email in a 4000 spams/day feed, no matter how the filters
are implemented.

Perhaps part of your problem is that you're not seeing the big picture
of how you can _use_ DNSBLs or any other filtering technique.

Your remarks seem to imply that DNSBLs necessitate no notification
anywhere, and that the email just disappears.  Or that filtering in
general is that way.

On the contrary, ours have never done that.  Indeed, without anybody
looking at quarantines, without anybody personally twiddling filters,
that 4000 spams/day user _does_ get the email that was accidentally
blocked.   Simply because we do inline rejects with instructions on what
to do, and problems get fixed fast and without harm.


We're achieving effectiveness rates in excess of 98% with our "one set
of rules" server based defences.  My personal account, which receives
400-600 emails/day, has 100 or more spams/day filtered out by the
central server solution.  I usually go a week or so between spams that
get past those central filters - I see _many_ more FPs with my bayesian
than I see spam getting through.

There will be FPs, and spams will get through, probably regardless of
what filtering technique you use.

No "probably" about it.  I wrote exactly that almost 10 years ago ;-)

The question is how to deal with it.

The important thing is that the RECIPIENT
controls that, so they can decide the rule that determines what gets
blocked and what gets through.  That way they don't have to wonder what
SHOULD have been delivered to them and wasn't.

But what if you arrange things so that the recipient doesn't have to
control the rules, and still doesn't have to wonder?

Our false positives are usually handled without the recipient even
knowing that something got blocked.


My personally trained bayesian filtering has an absolutely abysmal track
record.  

Spammers have gotten good at throwing enough random junk into E-mails to
confuse Bayesian filters.

And sender whitelisting and...

On the spam aimed at the false positive handling address, which
by design has _no_ filtering, Bayesian has an effectiveness rate of
about 50%.  Yuck.  No amount of personal twiddling, custom rules,
explicit pattern matching in my UA is going to make much difference to
that.

Some E-mails are going to get through.

Yes.  But 50% is absurd.  If I presented that solution to the boss, I'd
be looking for another job.

And meanwhile, giving the recipient the ability to at least not see the
SAME kind of stuff over and over again, if they choose to use those
features, demonstrates the ISP's trying to give the user the tools to
reduce the frustration.

If you can _detect_ the "SAME kind of stuff" over and over again.  Even
the best content techniques aren't very good at that anymore.

Correctly chosen and utilized DNSBLs do a vastly better job.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3-nr1 (Windows XP)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQCVAwUBRbl8+Z3FmCyJjHfhAQJnoQQAx3p41yIzTGB3gPlEIS1bRmfZ36yOcDBf
VCl5IeN6UKb9uiO1jtYOp1Zm3xz50QVtdj8XwdDHbG6qMA+eOzRouKvpGwUbu42M
MANtwZwqu72IU2PbqQ3V300VsNU4hMaiE2hoqUOdm4tWVQjsPp+dU+BAHuYMpIGE
g0/vaqER++8=
=cHX0
-----END PGP SIGNATURE-----

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg