Regarding HTML and MIME obfuscation tricks:
I still don't agree that you are proposing things that aren't easily
evaded by determined spammers (explained below).
I'll deal with those points as you raise them.
When asked about mislabeled attachment types, you suggest (unless I
misunderstand) that you can simply ignore the problem...
I'm not IGNORING anything. But that doesn't mean that you necessarily have to
believe everything that E-mail tells you, either. :-)
...and scan the body directly for the kind of regular expressions associated
with HTML tags and remove them.
Not quite.
First of all, I am *not* using "regular expressions" (a basically braindead
excuse for pattern matching, which dates back to the days of early, primitive
Unix systems and mechanical teletype terminals). SNOBOL and SPITBOL
(programming languages; SPITBOL is to SNOBOL sort of what Turbo Pascal is to
ordinary Pascal) allow for MUCH more sophisticated pattern matching than is
possible using reg-ex type patterns.
Besides destroying (and in the process subtly breaking) the message
contents, which has serious user privacy issues,
Well, yes and no.
Agreed that if the mail is PGP-signed or something, then changing the mail
contents will (of course) show up as a "changed" message body. Frankly, that
is a price that *I* am very willing to pay... a lot of the stuff that I get in
a lot of my incoming E-mails is simply repetitive and annoying, and I'd rather
simply have those kinds of things just disappear from the mail that arrives
here. Obviously, a different recipient might make different choices than I
would, and that's part of why my system is designed for the RECIPIENT to be
able to control, in a fine-grained way and based on who the sender was, what
it does and doesn't do.
...it won't stop all HTML from bypassing the filter.
Maybe, although:
1) it will prevent MANY kinds of (recognizable) HTML from being passed
(including, hopefully, kinds that are likely to be dangerous or even just
confusing);
2) it can transform the remainder so that ANY OTHER program is unlikely to
recognize it as HTML and thus act on it (for example, by changing pointy
brackets to curly brackets).
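To make point 2 concrete, here is a minimal sketch of the bracket swap, written in Python purely for illustration (it is not the SPITBOL code of the filter discussed here). The helper name neuter_markup and the exact character mapping are assumptions.

```python
# Illustrative sketch only: the helper name and the mapping are assumptions,
# not the filter's actual implementation.
_NEUTER_TABLE = str.maketrans({"<": "{", ">": "}"})


def neuter_markup(text: str) -> str:
    """Swap pointy brackets for curly ones so the text no longer looks like tags."""
    return text.translate(_NEUTER_TABLE)


sample = '<a href="http://example.invalid/">click</a>'
neutered = neuter_markup(sample)
```

The text stays readable to a human, but a program looking for "<" to start a tag should find nothing to act on.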
For example, message parts can be encoded in several formats
(UUENCODE, Base64, Quoted-Printable etc) with arbitrary levels of
nesting. (e.g. a message/rfc822 containing a message/rfc822 containing
a message/rfc822 containing a message/rfc822 containing a
message/rfc822 containing..., with each layer encoded differently. And
the very first layer might be labeled a GIF file.)
Of course, and there's all manner of ways to encode information (including
steganographic encryption of information, such as for example sending a
pornographic JPEG photo also containing on the nightstand table a clock where
the hands point at 11:17:32 which time maybe has some secret meaning to the
intended recipient). You're NEVER going to be able to prevent the transfer of
all kinds of information. No point in even trying.
Fortunately, one doesn't HAVE to get that paranoid to essentially solve the
problem, since MUAs aren't all that aggressive and suicidal in ferreting out
dangerous HTML to interpret, either. :-)
In my current experimental incoming mail filter, I use a recursive subroutine
to deal with nested encodings and message parts. I suppose I could even
(trivially, in fact) limit that routine to only support a limited nesting
depth, but I suspect that most mail clients would crash on such mails long
before my mail filter would. :-)
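As a rough illustration of the recursive approach with a nesting limit, here is a sketch using Python's standard email package (again, not the SPITBOL routine itself). The names collect_leaf_types, NestingTooDeep and MAX_DEPTH, and the limit of 8, are assumptions made up for this example.

```python
from email.message import Message
from typing import List

MAX_DEPTH = 8  # assumed limit, chosen arbitrarily for illustration


class NestingTooDeep(Exception):
    """Raised when message parts are nested deeper than the allowed limit."""


def collect_leaf_types(msg: Message, depth: int = 0,
                       max_depth: int = MAX_DEPTH) -> List[str]:
    """Recursively walk nested parts and return the content type of each leaf."""
    if depth > max_depth:
        raise NestingTooDeep("parts nested deeper than %d levels" % max_depth)
    if msg.is_multipart():
        # For multipart/* (and parsed message/rfc822) the payload is a list
        # of sub-messages, so recurse into each one.
        types: List[str] = []
        for part in msg.get_payload():
            types.extend(collect_leaf_types(part, depth + 1, max_depth))
        return types
    return [msg.get_content_type()]
```

A caller could treat NestingTooDeep as one more reason to divert the mail rather than deliver it. Note that the declared content type is only what the mail claims; this sketch does not verify it.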
Whatever regular expression for an HTML tag you come up with, it can
easily be made unrecognizable.
Sure, but it can also in the process be made unrecognizable to MUAs, too.
...Even the interpretation of HTML tags
can be redefined on-the-fly if it comes to that.
Probably, but again:
1) the USE of such types of stuff is prima facie evidence of an E-mail
having something to hide;
2) such tricks are of little value if they confuse or break MUAs too;
3) translating pointy brackets to curly brackets (or square brackets or
something else) will also effectively "neuter" such HTML, such that MUAs won't
try to process it;
4) it's relatively easy to (again, by default) simply say "NO HTML, period"
and divert offending mail.
But say you keep up
to date with tricks designed to make a complex payload look innocuous
to simple minded filters, then you are on the losing side of such an
arms race, because a spammer need only change their email, while you
need to patch your software with new regular expressions and redeploy
it to all the customers every time.
Well, "patch" isn't really necessary. :-) It's rather easy to add new stuff
to SNOBOL/SPITBOL programs, including at run time.
But again, that's why one doesn't just look for a FIXED, limited number of
specific things.
If one simply bans (default case, for unknown senders) *all* attachments and
*all* HTML, then it's pretty hard to argue that they'll figure out some new
kind of HTML (but if and when they do, then one MIGHT have to tweak the filter
a little bit). If it doesn't look enough like HTML to be recognized by the
MUA, then it clearly doesn't have to be recognized by the filter, either.
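The default-deny idea, with the recipient granting exceptions per sender, can be sketched roughly as follows. This is Python for illustration only; SenderPolicy, policy_for, accept and the example addresses are all hypothetical names, not part of the actual filter.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SenderPolicy:
    """What the recipient permits from a given sender (hypothetical structure)."""
    allow_html: bool = False
    allow_attachments: bool = False


# Unknown senders get the strict default: no HTML, no attachments.
DEFAULT_POLICY = SenderPolicy()


def policy_for(sender: str, overrides: Dict[str, SenderPolicy]) -> SenderPolicy:
    """Look up the recipient's per-sender override, else fall back to the default."""
    return overrides.get(sender.strip().lower(), DEFAULT_POLICY)


def accept(sender: str, has_html: bool, has_attachments: bool,
           overrides: Dict[str, SenderPolicy]) -> bool:
    """Return False when the mail should be diverted under the sender's policy."""
    policy = policy_for(sender, overrides)
    if has_html and not policy.allow_html:
        return False
    if has_attachments and not policy.allow_attachments:
        return False
    return True


overrides = {
    "friend@example.org": SenderPolicy(allow_html=True, allow_attachments=True),
}
```

The point of the design is that nothing new has to be recognized for unknown senders: anything that is HTML or an attachment at all is diverted unless the recipient has said otherwise.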
Note also that it is straightforward for spammers to deduce the checks
made if they have access to your software (as they invariably will if
it becomes widely deployed), so there is little point in not
discussing specific parsing techniques publicly.
Fair enough, although it's pretty extraneous to discuss them publicly at this
time. As I've said, the current implementation is "experimental" and like all
such software, a work in progress.... which I modify and improve from time to
time as that seems necessary.
...It only makes discussion imprecise and harder to see any flaws.
The important thing is NOT whether there are "flaws" at the lowest level (and
undoubtedly there are, since all nontrivial programs contain bugs or at least
opportunities for improvement). At this point we ought to be talking concepts
and approaches, rather than getting bogged down in pointless minutiae and detail
which in any case is going to be implementation-dependent.
Some direct points:
You argue that perhaps the most important overall function...
I don't know that I would characterize it that way, although I *do* believe
that it's sorta silly to try to address the spam problem while ignoring the
kinds of ridiculously suspect worm/virus stuff that clueless users naively
click on.
...is to block the spread of viruses, worms and zombies, as these are the
current enabling technology.
Again, let's not say "the" (which implies one).
If so, you should address that problem directly, as it has much wider scope
than the "attachment" problem.
I think it makes sense to attack the attachment problem DIRECTLY, and HEAD ON,
since it is important NOT ONLY JUST to worms/viruses, but ALSO for spam evasion
of content filters (e.g. text-as-image or even just content-as-image).
Blocking attachments, if widespread, will only achieve that the payload
is moved from the email body to an external server.
That's fine.
Users are then tricked to open an external connection...
OK, but at least it's not going to be something that they just click on (again,
by denying HTML, and hopefully by implementing suitably dire-sounding warnings
when they try to follow any other links to external executables, whether EXE
files or SCR files or DLL files or ActiveX or whatever).
Hey, we DO agree that "social engineering" can result in people doing pretty
stupid things (like giving their secret passwords because someone calls on the
phone and asks for them, etc etc) but at least we can offer REASONABLE
safeguards and "are you SURE?" type things to at least make them have a second
thought before proceeding with such things.
A person who is DETERMINED to sink their own ship, of course, CAN still do
that, and at some point one simply has to cut the rope and let them go.
...which downloads the malware in any of a wide variety of ways, and still
sends spam from then on.
Right. It's important to at least make that less "encouraged". That's one
good reason for also (by default, from unknown senders) getting rid of HTML,
which tends to encourage (and conceal/misrepresent) external links.
Meanwhile, in the process you destroy the user's reasonable
expectation that their email is delivered as-is, unless they are in
some first class relationship with you.
If you don't know me (and even if you DO) you do NOT have any right to expect
that I will even receive or choose to read your mail AT ALL, let alone without
my modifying it beforehand to suit MY tastes. You lose your right to its
absolute integrity as soon as you seal it up and send it to someone else.
Perhaps the key to your point, though, is the word "delivered". And I suppose
that your point is okay, since it _is_ "delivered" (to my incoming mail
processing filter!), and that filter (AS I HAVE INSTRUCTED IT) then chooses to
modify the incoming mail according to rules *I* have established to help make
it acceptable to me, before I need to look at it. Perhaps this is another good
reason to implement the filter at the recipient end, rather than somewhere
en route.
Another issue is the use of your system in conjunction with a content
filter. If you remove/modify the mail content before passing it to a
content filter which is expected to handle the hard cases, you may be
shooting yourself in the foot. Modern content filters often have many
rules which are optimized to work together, but are not necessarily
optimized to work on mangled email.
Perhaps so, and what may end up happening is that content filters will be
simplified and re-engineered to make them faster and more tailored to use
within a framework such as I propose. (Although some of those "hard cases"
might still get through, from "somewhat-trusted" senders). Current content
filters usually presume that they are getting E-mail "raw" and therefore have
to handle cases that might be filtered out already by the time mail would get
to them through my filter.
A few points about "Bayesian" systems:
To my knowledge, no successful attack has been performed on such
systems yet.
Depends on what you call an "attack", but certainly an awful lot of spam
contains bogus (random or unrelated) stuff that's designed to confuse or evade
such types of filters.
There is a lot of garbage in mail to try to pass through
the statistical filtering, but just like you look for nonsense tokens
as an indicator of spam on a case by case basis, such nonsense tokens
if present easily tip the balance toward spam in a statistical filter,
automatically.
Perhaps, and we agree that a good program can detect certain types of such
stuff, but at SOME point the spam E-mail in question is going to look EXACTLY
like a regular E-mail that you want to get, except for the spam content (which
might be JUST a URL or a phone number or who knows what?)
In some ways, these systems are a generalization of where you are headed.
For example, where you have code such as "if rule X is triggered or rule Y
is triggered" (with rules X and Y being statements about email structure or
presence of HTML etc), a Bayesian system will put weights on rule X and rule
Y, combining the weights to obtain a belief about the message. But that is
for another discussion.
Fine. In any case, it is POSSIBLE to create spam E-mail that looks just like
legitimate E-mail, at least within statistical uncertainties. There are limits
to what can be achieved going down that path (but that doesn't necessarily mean
that it's not worth going there, if there's useful progress there nevertheless).
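For readers unfamiliar with the weighting idea mentioned above, here is one common way such a combination can be done: each triggered rule carries a probability that a message triggering it is spam, and the probabilities are combined as log-odds under a (naive) independence assumption. This is a sketch in Python; the function name, the rule names and the weight values are invented for illustration and do not describe any particular filter.

```python
import math
from typing import Dict, Iterable


def spam_probability(triggered: Iterable[str], weights: Dict[str, float],
                     prior: float = 0.5) -> float:
    """Combine per-rule spam probabilities into one belief, assuming independence."""
    log_odds = math.log(prior / (1.0 - prior))
    for rule in triggered:
        p = weights[rule]  # assumed P(spam | rule triggered), strictly between 0 and 1
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up weights for two structural rules, e.g. "contains HTML" and
# "has an attachment"; real systems learn these from training mail.
example_weights = {"X": 0.9, "Y": 0.8}
belief = spam_probability(["X", "Y"], example_weights)
```

Where a hard "X or Y" test would divert the mail on either rule alone, the weighted version lets each rule nudge the overall belief, and the recipient picks the threshold.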
Gordon Peterson http://personal.terabites.com/
1977-2002 Twenty-fifth anniversary year of Local Area Networking!
Support free and fair US elections! http://stickers.defend-democracy.org
12/19/98: Partisan Republicans scornfully ignore the voters they "represent".
12/09/00: the date the Republican Party took down democracy in America.