Re: [Asrg] Re: "worm spam" and SPF

On Dec 09 2004, gep2(_at_)terabites(_dot_)com wrote:


I'm not IGNORING anything.  But that doesn't mean that you necessarily have to
believe everything that E-mail tells you, either.  :-)


Do you fully parse the MIME structure to extract HTML or not?  Do you
decode + scan each attachment, identify its type, further decode its
contents according to the identified type, and extract things such as
embedded HTML? Unless you answer yes to these questions, you are
ignoring the actual structure, and instead making assumptions about
certain string patterns. It doesn't matter if you use SNOBOL or
regular expressions, you make decisions without the full context,
which is risky in terms of content alterations.
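
To make the questions above concrete, here is a minimal sketch of
"fully parsing the MIME structure" using Python's standard email
package. The function name extract_html_parts and the sample message
are hypothetical, made up for illustration; nothing here describes
either poster's actual filter.

```python
import base64
import email
from email import policy


def extract_html_parts(raw_bytes):
    """Return the decoded text of every text/html part, however deeply nested."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    html_parts = []
    for part in msg.walk():  # walk() also visits nested multiparts
        if part.get_content_type() == "text/html":
            # get_content() undoes base64 / quoted-printable and the charset
            html_parts.append(part.get_content())
    return html_parts


# Hypothetical test message: the HTML part is base64-encoded, so no
# angle bracket appears anywhere in the raw bytes.
HTML_B64 = base64.b64encode(b"<b>hi</b>")
SAMPLE = b"\r\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/alternative; boundary="b1"',
    b"",
    b"--b1",
    b"Content-Type: text/plain; charset=us-ascii",
    b"",
    b"plain body",
    b"--b1",
    b"Content-Type: text/html; charset=us-ascii",
    b"Content-Transfer-Encoding: base64",
    b"",
    HTML_B64,
    b"--b1--",
    b"",
])

if __name__ == "__main__":
    print(b"<" in SAMPLE)               # is any tag visible in the raw text?
    print(extract_html_parts(SAMPLE))   # HTML recovered after decoding
```

A scan of the raw bytes finds no tag at all in this message, while the
structural walk recovers the HTML that a MIME-aware MUA would render.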

I accept your point that you are willing to pay that price, but since
there are effective competing solutions available without this flaw (I
consider alteration a flaw), users already have alternative
options. May the best offering win, as they say.

[snip steganography]
You're NEVER going to be able to prevent the transfer of 
all kinds of information.  No point in even trying.


That's what spam filtering is. A competition between those who want to
pass along information and those who want to censor it. Spammers have
shown that they are willing to think laterally to get their message
across, and don't shy away from building complex edge cases (as well
as simple ones if that works). Dismissing such things now will only
mean more work for you later, imho.

Whatever regular expression for an HTML tag you come up with, it can
easily be made unrecognizable.

Sure, but it can also in the process be made unrecognizable to MUAs, too.


As a filter, you take on the responsibility to censor for an unknown
class of MUAs with varying capabilities, unless you're building a
plugin for a single specific MUA. The latter problem is much easier,
but much more narrowly applicable, and certainly doesn't deserve to be
considered a general or universal solution to the spam problem.
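
As an illustration of the regular-expression point, the sketch below
hides a tag from a pattern that scans the raw text, using nothing more
exotic than quoted-printable encoding, which any MIME-aware MUA
decodes before display. The pattern NAIVE_TAG and the sample body are
assumptions made up for this example, not anyone's real filter rule.

```python
import quopri
import re

# A deliberately naive "HTML tag" pattern, applied to bytes.
NAIVE_TAG = re.compile(rb"<[A-Za-z][^>]*>")

# Hypothetical message body as it travels on the wire (quoted-printable).
raw_body = b'=3Ca href=3D"http://example.invalid/"=3Eclick=3C/a=3E'

# What a MIME-aware MUA works with after undoing the transfer encoding.
decoded_body = quopri.decodestring(raw_body)

seen_raw = bool(NAIVE_TAG.search(raw_body))
seen_decoded = bool(NAIVE_TAG.search(decoded_body))

if __name__ == "__main__":
    print(seen_raw, seen_decoded)
    print(decoded_body)
```

The pattern only fires once the body has been decoded the way the MUA
would decode it, which is the "full context" argued for above.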

  1)  the USE of such types of stuff is prima facie evidence of an E-mail
having something to hide;


If your (or any) filter becomes widely used, email will be crafted to fool
it if it has weaknesses. Being fooled literally means that the filter won't
recognize such evidence, even though it might be obvious to other filters.
So your point 1) only applies as long as your filter is unpopular enough
not to impact spam distribution.


  2)  such tricks are of little value if they confuse or break MUAs too;


Different MUAs break in different ways. There is no set of tricks which are
obviously going to break all MUAs and therefore can be dismissed a priori.

  3)  translating pointy brackets to curly brackets (or square brackets or
something else) will also effectively "neuter" such HTML, such that MUAs
won't try to process it;

  4)  it's relatively easy to (again, by default) simply say "NO HTML,
period" and divert offending mail.


By claiming to modify content, you're only making your system less
attractive to the section of the user population who want HTML; who
don't want HTML but feel it is crucial to accept it in case customers
send it; or who expect messages to be untampered with for technical or
ethical reasons. That's a lot of people who have an interest in
preventing your system from getting critical mass.
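
For reference, the bracket translation in point 3) amounts to
something like the sketch below. The helper name neuter_html is
hypothetical, and the real SNOBOL/SPITBOL implementation is not shown
in this thread, so treat this only as an approximation of the idea.

```python
# Map pointy brackets to curly ones, as point 3) describes.
NEUTER = str.maketrans({"<": "{", ">": "}"})


def neuter_html(text):
    """Replace every '<' and '>' so that an MUA will not see any tags."""
    return text.translate(NEUTER)


if __name__ == "__main__":
    print(neuter_html('<a href="x">y</a>'))
    # The same rule also rewrites text that was never HTML.
    print(neuter_html("retry if a < b"))
```

The second call shows the alteration concern: a blanket translation
cannot tell markup from an ordinary "less than" sign, so legitimate
content is changed as well.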

But say you keep up to date with tricks designed to make a complex
payload look innocuous to simple-minded filters, then you are on the
losing side of such an arms race, because a spammer need only change
their email, while you need to patch your software with new regular
expressions and redeploy it to all the customers every time.

Well, "patch" isn't really necessary.  :-) It's rather easy to add
new stuff to SNOBOL/SPITBOL programs, including at run time.


Can you do it in an evening and have all your deployed systems fixed
up world wide by the next morning? The spammer needs an evening to
figure out a loophole and send millions of mails exploiting it the
next day.

Some commercial systems already have elaborate world wide networks
designed to propagate new email signatures in a matter of
seconds. That's the technology you're competing against today.

Fair enough, although it's pretty extraneous to discuss them
publicly at this time.  As I've said, the current implementation is
"experimental" and like all such software, a work in
progress.... which I modify and improve from time to time as that
seems necessary.


Unless you have an argument that breaks the spam arms race, the
distinction between "experimental" and "release" is blurry, I think.
You can spend months or years polishing a user interface only to 
have your solution be outdated.

...It only makes discussion imprecise and harder to see any flaws.


The important thing is NOT whether there are "flaws" at the lowest
level (and undoubtedly there are, since all nontrivial programs
contain bugs or at least opportunities for improvement).  At this
point we ought to be talking concepts and approaches, rather than
getting bogged down in pointless minutiae and detail which in any
case is going to be implementation-dependent.


I'm not convinced that your high level description will handle the
rough seas of widespread deployment, but instead will get bogged down
in endless silly details.  Just my opinion ;-)

 OK, but at least it's not going to be something that they just
click on (again, by denying HTML, and hopefully by implementing
suitably dire-sounding warnings when they try to follow any other
links to external executables, whether EXE files or SCR files or DLL
files or ActiveX or whatever).


Denying HTML is useless if the MUA generates it on the fly whenever it
sees something that looks like a URL. People are inundated with
warning dialogs which they just click mindlessly. As is, I don't see
that your filter can change or address these issues (why should it?
those are some of your stated high level means of protection).
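
A rough sketch of the kind of on-the-fly link generation meant here,
assuming a hypothetical MUA that escapes the plain text and then wraps
anything URL-shaped in an anchor. The URL pattern is a deliberate
simplification, not what any particular MUA uses.

```python
import html
import re

# Simplified URL pattern (an assumption for illustration only).
URL = re.compile(r"https?://[^\s<>\"]+")


def linkify(plain_text):
    """Turn bare URLs in plain text into clickable anchors."""
    escaped = html.escape(plain_text, quote=False)
    return URL.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), escaped
    )


if __name__ == "__main__":
    print(linkify("see http://example.invalid/x now"))
```

The input contains no HTML at all, yet the user still ends up with
something to click on.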

 Perhaps so, and what may end up happening is that content filters
will be simplified and re-engineered to make them faster and more
tailored to use within a framework such as I propose.  (Although
some of those "hard cases" might still get through, from
"somewhat-trusted" senders).  Current content filters usually
presume that they are getting E-mail "raw" and therefore have to
handle cases that might be filtered out already by the time mail
would get to them through my filter.


Either that, or users will decide that your system is hampering their
content filtering and remove it. Nobody knows...

A few points about "Bayesian" systems:

To my knowledge, no successful attack has been performed on such
systems yet.

Depends on what you call an "attack", but certainly an awful lot of
spam contains bogus (random or unrelated) stuff that's designed to
confuse or evade such types of filters.


Most of the random stuff is designed to evade a different kind of filter,
namely those filters which keep a database of email hashes. It sticks
out to statistical filters because it contains novel tokens.
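
A small sketch of both halves of that statement, assuming an
exact-match hash database (SHA-256 here; deployed systems may use
fuzzier checksums) and a made-up vocabulary of previously seen tokens.
All names and sample strings are hypothetical.

```python
import hashlib

# Hypothetical vocabulary the statistical filter has already seen.
KNOWN_TOKENS = {"buy", "cheap", "pills", "at", "http://example.invalid/"}


def fingerprint(body):
    """Exact-match fingerprint of the kind a hash database would store."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def novel_tokens(body):
    """Tokens that the statistical filter has never seen before."""
    return set(body.lower().split()) - KNOWN_TOKENS


base = "Buy cheap pills at http://example.invalid/"
variant_a = base + " xqzv7k"   # random padding, copy 1
variant_b = base + " pw93hd"   # random padding, copy 2

if __name__ == "__main__":
    print(fingerprint(variant_a) == fingerprint(variant_b))
    print(novel_tokens(variant_a), novel_tokens(variant_b))
```

Each copy gets a different fingerprint, so the hash lookup misses,
while the padding itself shows up as a token no earlier mail
contained.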

I call an attack a repeatable procedure which can consistently bypass
the spam verdict with a sufficiently high success rate to be valuable.
An attack, in this sense, requires a modification of the filtering
algorithm to counter.

 > There is a lot of garbage in mail to try to pass through the
statistical filtering, but just like you look for nonsense tokens as
an indicator of spam on a case by case basis, such nonsense tokens
if present easily tip the balance toward spam in a statistical
filter, automatically.

Perhaps, and we agree that a good program can detect certain types
of such stuff, but at SOME point the spam E-mail in question is
going to look EXACTLY like a regular E-mail that you want to get,
except for the spam content (which might be JUST a URL or a phone
number or who knows what?)


Not quite. The spam content is what looks statistically different
(unless the recipient is in the habit of discussing this kind of spam
content regularly), while the statistical commonalities between spam
and nonspam are automatically discounted if they appear in both types
of mail.  That, and the fact that there is no need to do complex text
parsing, is why it's difficult to evade. Any string added to a spam
message either

1) looks like some nonspam string, in which case it counts for
approximately nothing, or

2) doesn't look like some nonspam string, in which case it tips the
balance towards spam.
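
The discounting of tokens common to both kinds of mail can be sketched
with a toy per-token log-odds score of the naive Bayes kind. The
corpora, the whitespace tokenizer and the add-one smoothing are all
assumptions chosen to keep the arithmetic small; real statistical
filters differ in each of these, and how a never-seen token is scored
depends on the smoothing choice.

```python
import math
from collections import Counter

# Tiny hypothetical training sets (six tokens each).
SPAM = ["buy pills now", "cheap pills now"]
HAM = ["meeting notes now", "lunch meeting now"]


def train(messages):
    """Count whitespace-separated tokens over a list of messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts


def token_log_odds(token, spam_counts, ham_counts, smoothing=1.0):
    """log P(token|spam) / P(token|ham) with add-one smoothing."""
    vocab = len(set(spam_counts) | set(ham_counts)) + 1  # +1 for unseen
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    p_spam = (spam_counts[token] + smoothing) / (spam_total + smoothing * vocab)
    p_ham = (ham_counts[token] + smoothing) / (ham_total + smoothing * vocab)
    return math.log(p_spam / p_ham)


def spam_score(text, spam_counts, ham_counts):
    """Sum of token log-odds; positive values lean towards spam."""
    return sum(
        token_log_odds(t, spam_counts, ham_counts) for t in text.lower().split()
    )


spam_counts = train(SPAM)
ham_counts = train(HAM)

if __name__ == "__main__":
    for tok in ("now", "pills", "meeting"):
        print(tok, token_log_odds(tok, spam_counts, ham_counts))
    print(spam_score("buy pills", spam_counts, ham_counts))
    print(spam_score("buy pills now now now", spam_counts, ham_counts))
```

In this toy setup "now" occurs equally often in both corpora, so
padding a message with it leaves the score unchanged, while the
spam-only token "pills" pushes the score up.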

Fine.  In any case, it is POSSIBLE to create spam E-mail that looks
just like legitimate E-mail, at least within statistical
uncertainties.  There are limits to what can be achieved going down
that path (but that doesn't necessarily mean that it's not worth
going there, if there's useful progress there nevertheless).


There are always limits, e.g. a single statistical system shared by
many people is vulnerable to contradictory inputs. As far as
advantages are concerned, the statistical systems have the simplest
user interface yet devised: the user only needs to label the message,
without analysis, and the system updates automatically without
programmer intervention.

-- 
Laird Breyer.

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg