ietf-asrg

Re: [Asrg] Re: "worm spam" and SPF

2004-12-07 19:38:25
 What my current mail filter here does is to first strip
HTML-burdened alternative attachments, and then to further strip the
majority of HTML tags that it finds in "plain text" parts of the
message as well.

How does the scheme deal with mislabeled attachments? Spam messages
don't play by the rules. 

The current implementation I use here basically handles that by:

  1)  IF there is a plain text version and an alternative HTML version of the
message body, the alternative HTML version is simply removed;  the plain text
version is retained.

  2)  HTML tags (most of them, at least;  I retain those that might be useful
in determining spam origin or otherwise tracking back to the perps) are
stripped from the message body, for essentially ALL types of message body.  So
even if the part type is mislabelled, most HTML tags will still be removed
from it.
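
The two steps above can be sketched roughly as follows.  This is only an
illustration using Python's standard email package, not the SNOBOL/SPITBOL
implementation discussed in this thread.  The names TAG_RE, strip_tags and
filter_message are invented for the example, the tag pattern is a deliberately
crude assumption, and the sketch strips ALL tag-like text rather than retaining
the tags useful for tracing.  It also does not decode base64 or
quoted-printable bodies.

```python
import re
from email import message_from_string
from email.message import Message

# Assumed, simplistic pattern: anything shaped like <tag ...> or </tag>.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_tags(text: str) -> str:
    """Remove tag-like text, regardless of how the part is labelled."""
    return TAG_RE.sub("", text)


def filter_message(msg: Message) -> Message:
    """Step 1: drop text/html alternatives when a text/plain sibling exists.
    Step 2: strip tag-like markup from every remaining text part."""
    if msg.is_multipart():
        parts = msg.get_payload()
        if msg.get_content_type() == "multipart/alternative":
            if any(p.get_content_type() == "text/plain" for p in parts):
                parts = [p for p in parts
                         if p.get_content_type() != "text/html"]
        msg.set_payload([filter_message(p) for p in parts])
    elif msg.get_content_maintype() == "text":
        # Sketch only: assumes an unencoded (7bit/8bit) text body.
        msg.set_payload(strip_tags(msg.get_payload()))
    return msg
```

Because step 2 runs on every text part, a part labelled text/plain that
actually carries markup is cleaned in the same way as an honestly labelled one.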

...And some popular mail programs virtually
ignore attachment types altogether and use the filename extension, or
rely on defaults known for the target mail reading software.

Sure, and presumably those will still work that way once they've picked up
their (filtered, processed) mail.  The way they handle it doesn't really
impact the filter.

How does the scheme identify HTML in plain text? 

It looks for things that "resemble" HTML tags, given a list of common types and 
what those types look like.  It also uses a variety of rules to identify most 
cases of "bogus" HTML tags (that will end up being ignored when the mail is 
read) that are only there to break up keywords that otherwise would trigger 
content filters (and it removes those, too).
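
The actual rules are deliberately not spelled out here.  Purely to illustrate
the idea, a much cruder heuristic would be to treat any tag-shaped token whose
name is not on a list of known tags as "bogus" and delete it, so that a keyword
split by it is rejoined.  The KNOWN_TAGS list, the pattern and the function
name below are assumptions made for this sketch, and plain regular expressions
stand in for the SPITBOL pattern matching mentioned in the next paragraph.

```python
import re

# Hypothetical list of tag names a mail reader would actually honor.
KNOWN_TAGS = {
    "a", "b", "i", "u", "p", "br", "font", "img", "html", "head",
    "body", "div", "span", "table", "tr", "td",
}

# Captures the tag name so it can be checked against KNOWN_TAGS.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def remove_bogus_tags(text: str) -> str:
    """Delete tag-shaped tokens whose name is not a known HTML tag."""
    def keep_or_drop(match: re.Match) -> str:
        name = match.group(1).lower()
        return match.group(0) if name in KNOWN_TAGS else ""
    return TAG_RE.sub(keep_or_drop, text)
```

With this, a word that was broken up by a junk tag reads as one keyword again
by the time a content filter sees it, while recognized tags are left for the
later stripping or rewriting stage.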

Although I don't think it makes a lot of sense to describe in detail what those 
rules are (and we must presume that spammers are monitoring our discussions 
here) it's fair to point out that SNOBOL/SPITBOL offers FAR more sophisticated 
and powerful pattern matching than the braindead reg-ex type pattern matching 
more typically found in Perl and similar.

Does it correctly recognize a discussion about HTML tags from a text using 
HTML markup?

Not at present, although (again) if you had a correspondent or a mailing list 
where that might be expected (and desired) you could easily enough whitelist 
those specific senders to send you that sort of message content.  It's probably 
fair to presume that if you're sophisticated and enlightened enough to be 
discussing such points, you're less likely than the average Joe (or Gertrude!) 
to be confused or tricked by them.

The only sure way to protect users against HTML attachments is to
prohibit them from using mail software which displays HTML,

There's no point in "prohibiting users" from doing most ANYTHING.
The whole point is that users should be able to do anything they CAN
do now, but encouraging them to do it selectively, for those cases
where they trust the senders.

But as I understood it, your scheme is supposed to block the receipt
of email containing HTML from unknown/untrusted senders originally,
and only allow such email through after the sender is trusted. 

Right.  Now, what one DOES with such "untrusted mail" is something the
receiving user ought to be able to configure... whether they blindly T-can it,
or quarantine it, or flag it, or whatever.

...If you let HTML through to begin with (assuming you can always identify 
it), it's much less effective?

Well, not necessarily. 

For example, one thing that one can do (and again, no reason why this can't be
configurable depending on how the recipient wishes to deal with such things)
is to simply replace the pointy-brackets in HTML tags with curly-brackets
before finishing with it.  This allows the recipient, IF desired, to examine
the particular tags that the filter chose to pass through (and nothing says it
can't still simply remove other stuff).  And there could be a suitable
'goback' program provided (or a retrievable, unmodified copy archived
somewhere) that could allow the recipient to go back to the unmodified
original copy (or whatever) should they want to do that.

Anyhow, by letting {a href="gotcha.phishing.ru"}security.sunbank.com{/a}
through I think you'll agree that the user is pretty well protected.
Meanwhile, things like font sizes and colors and boldface and the like, while
arguably interesting if you like stuff like that, can usually be simply
stripped from E-mail messages without harming their actual information value.
And likewise, "junk" HTML tags like {qewhjwhfqeasdf} (using pointy brackets,
of course) can be deleted from incoming mails without affecting their content,
too.
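
The bracket replacement described above is simple to sketch.  The pattern and
the name defang are assumptions for illustration;  a real filter would decide
per tag whether to rewrite it, strip it, or pass it through.

```python
import re

# Assumed pattern: "<", optional "/", a letter, then anything up to ">".
TAG_RE = re.compile(r"<(/?[A-Za-z][^<>]*)>")


def defang(text: str) -> str:
    """Rewrite <tag ...> as {tag ...} so a mail reader shows it as text
    instead of rendering it."""
    return TAG_RE.sub(r"{\1}", text)
```

Applied to a link whose visible text claims one host while the href points at
another, both names end up visible to the reader side by side.  Text such as
"1 < 2" is left alone because the "<" is not followed by a letter.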

What you perhaps don't realize is that an attachment marked plain
text but containing HTML tags is often displayed as HTML by mail
reading software anyway. Some software reads plain text, looking for
anything that resembles a web address and generates a clickable URL
(thereby turning the plain text into HTML).

Fine, but at least in that case the URL will NOT be spoofed or
misrepresented, right?  :-)

The art of misrepresenting URLs to the public is called phishing, 

Well, that's one USE for it, but that's certainly not the ONLY use for it.  A 
more common use is simply to obscure the URL (using any of a whole variety of 
techniques) in an attempt to evade filters which try to block "disreputable" 
domains.  

...and is fairly well developed ;-)

Of course.  Anyhow, limiting the ability to HIDE such obfuscations makes it far 
more apparent that the URL is not trustworthy.

Clearly, but the point is to force the spammers into areas that are
VERY, VERY gray (and which for most users, simply don't exist at
all).

The fact that most users aren't in the gray area is irrelevant. If
(your, any) scheme lets through too many spams, it is as useless as if
spams weren't being identified in the first place.

Not so.

First, spam blocking is only ONE of the functions performed by my approach,
although it's certainly one of the notable ones.  Just as important (even
MORE?) is the fact that it shields users from the GREAT majority of (and
perhaps virtually all) viruses and worms, INCLUDING those which set up armies
of zombie spambots.

Second, "too many" is a very squishy number.  If it blocks 90% of spams and
99% of viruses and worms, is it "useless" because some spam still gets
through?  I don't think so.  Likewise, it doesn't particularly bother me if a
few relatively benign spams get through now and again.  (At SOME point, spams
look a lot like minimal messages from friends that I *might* want to still
get).

Third, remember that my approach is intended to be used ALONG WITH a good
antispam content filter (and, it's IMPORTANT to note that antispam content
filters will be a LOT more effective on the type of mails that have already
passed through a filter such as mine, since most of the tricks used to evade
and defeat them have already been blocked).

However, the amount of spam passing through depends solely on the spam
filter and how clever the spammers are. 

Most of their "clever" depends on tricks based on attachments and HTML.  By 
slashing the utility of those, spammers are denied most of their favorite 
deceptions.

So discussions naturally gravitate toward spammer tricks.

Of course.  The way we're going to defeat them, though, is by changing the
rules, and I think we can do that in a way that's minimally (or not at all)
destructive to legitimate users while still creating a very difficult gauntlet
for spammers to try to navigate their way through.

 I'm not naive enough to presume that ANY one approach will work for
EVERYBODY or for ABSOLUTELY ALL possible (past/present/future)
software.  I don't think that any other approach I've seen proposed
will protect so many people, so automatically, with so little
interference with their existing methods and systems, and with so
many other advantageous effects.

I'm not suggesting you are. However, current state of the art systems
achieve 99+ percent success, 

...I guess that depends on how you define "success"...

...and for organizations with very large userbases, even that isn't enough.  

In any case, my proposed fine-mesh permissions system would eliminate a LARGE 
amount of the stuff that still is getting through.

...Unless your scheme can claim these sorts of numbers for nearly all people 
to begin with, then it isn't worth trying to convince people here. 

I don't agree.  I don't think we have to use a SINGLE tool and a SINGLE
approach (and said as much above in fact).

And in particular, virtually NONE of the other approaches we've been
discussing here substantially reduce viruses/worms.  Those are how the zombie
spambots typically propagate, and those spambots in fact prove SO
problematical for the other non-starting approaches that so many people seem
so focused on.

Most people on this list are only interested in systems which can deal with 
vast numbers of messages with little extra work.

Most people here seem to be fascinated by (and fixated on) technologically
complicated solutions that require a major transformation of the planet's
E-mail systems.  While it's tempting to look at those, and there are
admittedly advantages to trying to truncate spam as close to the entry point
as possible, I personally think it's more important to squash it (at least)
before the end user is greatly inconvenienced by it.  And, of course, at SOME
point of success in preventing it from being seen by anybody, it will cease
being profitable for spammers and they'll simply go off to pursue some other
scam instead.

There's nothing in my proposal that prevents it from being used in dealing with 
"vast numbers of messages", especially if it's implemented at the recipient 
level (Outlook/Outlook Express/Eudora/Pegasus/whatever).  There's PLENTY of 
end-user computing resources, if used responsibly, to deal with that stuff.

[snip]

Moreover, you might not realize that SpamAssassin is bundled with
bulk mail sending software precisely so that spammers can design
their emails against SpamAssassin and tweak them until they pass the
tests performed by SA, and only then start spamming.

Of course.  Most of those deceptions [mis-]use HTML or attachments.
That's one big reason for encouraging users to limit the use of such
features only to trusted senders.

I don't seem to remember whether you explained how to initially deal
with (untrusted) people who (unwittingly, incapably) send exclusively
HTML burdened messages.

Probably, during a transition period a recipient will choose to receive (at
least SOMEWHAT) more HTML than they maybe will allow in the long run.  Most
people (clueless AOL and HOTMAIL types) probably don't send (even
INTENTIONALLY, let alone by default) anything more complicated than fonts,
sizes, colors, bold, underline, and the like.  Happily, those are pretty
benign, for the most part.  As they get more familiar with the system,
recipients will have whitelisted their familiar senders, and might choose to
clamp the default "allow" filter somewhat tighter than they initially did.

Also, as an approach like mine becomes more widespread, groups like AOL and 
HOTMAIL that presently casually and sloppily turn on HTML-burdening with nary a 
thought might be inclined to be more responsible in that regard, realizing that 
HTML-burdened mail is less likely to be delivered in a more spam-defensive 
world.

Again, my approach really is a FINE-MESH filter, where it doesn't have to be 
(and indeed, SHOULD NOT be) as gross as "HTML" or "no HTML".  (Although 
honestly, I _personally_ would be inclined to block or at least quarantine HTML 
mail overall, if from a non-trusted sender).  But that should be a recipient 
option.

 Well, it WOULD be sufficient to block "large" (as defined by the
recipient) messages coming from unknown senders.

It is debatable whether large attachments are in fact an indicator of 
friendly (ie nonspam) messages. 

It doesn't really matter.  The FACT is that most folks have limited Inbox
space and are NOT willing to devote unlimited amounts of it to people they
don't know.  ISPs tend to have 5-10 MB Inboxes (although we're seeing some
recent improvements in that area, fortunately) but still often limit incoming
individual E-mail messages to some smaller size.  If someone checks their
E-mail every day or two, they obviously don't want an E-mail to arrive 30
seconds after they look, which fills their inbox and causes all their other
mail to bounce until they get around to checking again.

But there are NUMEROUS reasons to not want big messages from unknown folks.  
Limited Inbox size, more likely virus/worm content, and so forth.  It's just 
another (useful!) tool that can and should be made available in the war against 
E-mail nuisances and annoyances.

Spammers tend to send small messages, unless they're trying to be clever, 

Sure, and that's nice... at least they aren't then wasting huge(r) amounts of 
bandwidth.

...while worms are as you pointed out a bit larger. 

Usually a GOOD bit larger... generally 45K to 200K, by the time they're encoded 
for E-mail attachment.

But the truly large messages are sent by clueless people who will
write two lines in Word, include a BMP image, then attach the whole
document to an empty email message and send the lot...

Fine, and if I have a correspondent I want to hear from that tends to be
belligerent about stuff like that, I can whitelist them and continue to let
them do that.  Hopefully, before simply sighing and accepting that, I can ask
them to please be more responsible in the future.  Meanwhile, we don't do
ANYBODY on the Net any favors by bending over backwards to support and
continue that kind of irresponsible practice.

There are two avenues for worms under your scheme. Worms can get
smaller in size, or worms can stay the same size and spread via
trusted senders, by looking for regular correspondents in address
books etc. Fifteen years ago viruses were less than 1K in size;
there is no reason beyond laziness why worms need to be around 25K.

All this use of heuristics can be evaded by various strategies.  If
worms shrink to less than 25K, it might be that a recipient will
shift their "allowed maximum unsolicited E-mail size" to 15K, or
10K, or maybe even 5K or something.  THAT IS *THEIR* CHOICE, and
they aren't constrained to maintain the SAME choice.
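
As an illustration of a recipient-adjustable size limit, here is a minimal
sketch.  The class name, field names and the 25K starting value are
assumptions chosen to match the figures mentioned in this exchange;  nothing
here is taken from an existing product.

```python
from dataclasses import dataclass, field


@dataclass
class SizePolicy:
    """Per-recipient limit on the size of mail from unknown senders."""
    # Recipient's own choice; can be tightened or loosened at any time.
    max_unsolicited_bytes: int = 25 * 1024
    known_senders: set = field(default_factory=set)

    def accepts(self, sender: str, size_bytes: int) -> bool:
        # Known senders are not subject to the unsolicited-size limit.
        if sender.lower() in self.known_senders:
            return True
        return size_bytes <= self.max_unsolicited_bytes
```

A recipient who finds that worms have shrunk below the limit simply assigns a
smaller value to max_unsolicited_bytes;  mail from known senders is unaffected.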

None of these suggestions is preventative, only reactive. You're
suggesting that attacks on your proposed spam defense should be
best handled by an arms race between the spammer and the user.
The spammers still have the initiative.

Ultimately, if we do our job well, we'll force the spammers (and
viruses/worms) OUT of the E-mail arena entirely (much the way that diskette
boot sector viruses have virtually gone extinct);  I suspect they'll retreat
to malicious Web sites.  But that is a different war, and we'll proceed to
that one (undoubtedly) in due course.

But I disagree with your claim that my approach isn't "preventative".  Most
users, who install the system with default rules and who aren't deceived into
re-enabling executable attachment delivery, will PREVENT the delivery of
executable attachments, PERIOD, and that doesn't depend on constantly updating
antivirus files (which ARE reactive and INVARIABLY lag the appearance of new
viruses).

The devil is in the details, because the high level concepts in
which you describe the scheme do not map perfectly into the low
level concepts required for implementation.

There is no one SINGLE set of details required.  There is a wide
leeway in terms of how the various aspects get implemented, and in
fact this is an advantage...  it allows different companies to
differentiate their products, while at the same time creating
distinct products which probably will not share common potential
weaknesses that a spammer might exploit.

Right, various elements of what you propose (scanning for HTML,
blocking attachments) are widely implemented in many different
anti-spam products. 

Right, although I'm not aware of ANY of those which use a fine-grained
permissions approach applied on a sender-by-sender basis.

I'm sure that fact will change, over time, since I'm convinced that it's a VERY 
sound approach and eventually other companies, I'm quite sure, will come to the 
same conclusion.

One could argue that companies already have several pieces of your puzzle and 
have mixed and matched them as they like.

And that's a good sign.  Hopefully they'll get the pieces together in the right 
way.  

One BAD sign I've seen is that Microsoft (sigh) seems determined to build
their antispam defenses into Exchange rather than Outlook/Outlook Express,
which is terribly clueless;  the spam problem (like the virus/worm problem)
will NEVER be solved if we only work on it at the corporate/enterprise level.
It needs to work in the mass-market tools that clueless individual Net users
can use and understand.

Some problems with whitelists: 

Once you've whitelisted all your friends and colleagues, you depend on
how good *their* spam defenses are. If they receive a worm, it can
travel to you freely. 

No, ABSOLUTELY NOT, since my proposed "whitelist" is NOT just a simple yes/no
whitelist.

You've talked about examining a limited set of properties such as existence of
attachments and of HTML tags. 

That's good on defaults, but the goal is for a much finer set of permissions
than that.  Executable/PIF/scripting attachments are clearly a LOT less safe
than GIF or text attachments are.  Javascript or ActiveX HTML tags are clearly
a lot less safe than boldface or italics markup tags are.  Again, I think the
real goal is to whitelist correspondents based on what you EXPECT and TRUST
them to send;  it's hardly as if there's NO choice other than "guest" versus
"root".

Your whitelisted friend may have a trojan or worm which will send plain text 
advertisements to all people in his address book.

Sure, and if that "advertisement" makes it past the additional content filter,
it might actually get delivered.  Again, I don't think it's possible to
eliminate ALL illegitimate mail.  (I have one friend who tends to come home
drunk on his butt and send almost completely illegible E-mails at three or
four in the morning;  I don't expect or require that my E-mail filter block
things like that... it's part of his personal charm and personality, I
suppose.)  But as long as I can reduce the really egregious volume of wasteful
and intellectually insulting spam/worms/viruses to an occasional message or
two, I personally don't have a big problem with that.

If some combination of tags and attachments can get through, it will
get through eventually.
 
Sure, and as long as the volume of that is negligible, I don't consider it a 
huge problem.  I'm going to receive, after all, some amount of LEGITIMATE mail 
which will be less interesting and useful.

  1)  Most of my "friends and colleagues" don't REQUIRE explicit whitelisting
because the stuff they send me does not exceed the default (very safe) rules.

That's great for you, but these issues must be addressed when proposing a
spam solution for the masses.

I think that'll be true for the masses, too.  Most people do NOT receive EXE 
files or JavaScript or ActiveX or JavaScript decryptors and obscured URLs in 
legitimate mail.  One can, I feel, make some pretty general rules which block 
(or at least tag, whatever) a lot of stuff without greatly inconveniencing 
recipients (but that's why THEY must be in control of the filter).

  2)  The ones I *would* whitelist don't automatically receive "wide open"
permissions.  I would allow individual senders to send me the type of stuff
that I might expect THEM as an INDIVIDUAL to send.  NONE of them, for example,
are likely to be permitted to send me (say) PIF attachments.  :-)   Likewise,
VERY few (if any) would be whitelisted to send me (say) ActiveX or cookies or
scripting.
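
One way to picture a whitelist that is not a simple yes/no is a table of
allowed features per sender, layered on top of safe defaults.  The feature
labels ("plain_text", "html_basic", "attach_gif", "attach_pif") and the class
below are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field

# Assumed default: unknown senders may send plain text and nothing else.
DEFAULT_ALLOWED = frozenset({"plain_text"})


@dataclass
class Permissions:
    default_allowed: frozenset = DEFAULT_ALLOWED
    # Maps a lower-cased sender address to the extra features it may use.
    per_sender: dict = field(default_factory=dict)

    def allowed_for(self, sender: str) -> frozenset:
        extra = self.per_sender.get(sender.lower(), frozenset())
        return self.default_allowed | extra

    def violations(self, sender: str, features) -> set:
        """Return the features found in a message that this sender has
        not been allowed to use (empty set means deliverable)."""
        return set(features) - self.allowed_for(sender)
```

Under this model a whitelisted correspondent who normally sends pictures can
keep doing so, yet a PIF attachment arriving under that same address is still
flagged because that feature was never granted.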

How do you recognize a PIF attachment? 

Clearly, by extension... the same way the operating system does.

...What if it's inside a ZIP file?

Then it's a ZIP file (which might contain one or more files inside it).  I
don't need (or WANT!) to receive unsolicited ZIP files, especially if they're
encrypted.  If someone has a legitimate need to send me a ZIP file, they can
arrange that with me in advance... and I'll whitelist them to send me that
specific type of content.

Inside an LHA file? Inside a zipped HTML file containing an embedded
OBJECT tag with a particular class ID pointing to external data? 

Likewise.  I'm willing to simply DENY anything suspicious like that to anybody 
who doesn't bother to get my permission for such stuff in advance.
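
Classifying by extension, and treating any archive as its own category rather
than trying to look inside it, can be sketched as below.  The extension lists
are assumptions for illustration and are certainly incomplete.

```python
import os

# Hypothetical, incomplete lists used only for this example.
EXECUTABLE_EXTS = {".exe", ".pif", ".scr", ".bat", ".com", ".vbs"}
ARCHIVE_EXTS = {".zip", ".lha", ".lzh"}


def attachment_class(filename: str) -> str:
    """Classify an attachment by its final filename extension only."""
    ext = os.path.splitext(filename.lower())[1]
    if ext in EXECUTABLE_EXTS:
        return "executable"
    if ext in ARCHIVE_EXTS:
        # Never opened: an archive is denied or allowed as a whole.
        return "archive"
    return "other"
```

Only the last extension counts here, so a double-extension name such as
"readme.txt.exe" is classified as an executable.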

There are innumerable ways of sending malicious code with various levels of
obfuscation. 

Sure, and by simply blocking attachments and HTML from unknown/untrusted
senders you eliminate basically ALL of those (at least for E-mail!).

Spammers don't use them because the simple stuff works
quite well for now.

Sure, but (again) the basic approach of denying HTML and attachments from
unknown senders eliminates basically ALL of those, in ONE FELL SWOOP.  Maybe
you can come up with some more devious scheme, but then my feeling is that one
can deny THAT class of deceptions in unsolicited mail from unknown senders,
too.

 One of the advantages of my approach is that, in fact, people using
it would block VIRTUALLY ALL incoming worms and viruses, EVEN IF
they came (ostensibly) from people who they otherwise know and
trust.

And worm writers realize this, so it is one of their priorities.

You greatly underestimate the difficulty of evading the type of
permissions-list filter such as I'm proposing.  Perhaps you still
don't understand my approach; I'll be glad to try to explain it
further.

I think I sort of understand your approach. Your filter scans an
email's MIME structure, looking for particular types of
attachments. It also scans for some simple tags such as <FONT>, <A>,
etc., in the plain text attachments.

Likewise, of course in decodable text attachments (another common spammer 
trick).

[Not all email uses MIME, not all MIME using email is correctly
generated. Not all MIME attachments contain what they are labeled as
containing, not all mail software interprets attachments according to
the labels either, but others do]. 

Of course, and that's why I remove HTML tags from "plain text" attachments as
well.

If the sender is unknown, the mail is blocked if it contains the tags
you scan for or attachments labeled as forbidden. 

Yes, although my CURRENT implementation simply REMOVES the HTML alternative 
attachments entirely, in addition to removing HTML tags from remaining portions 
(including supposedly "plain text" portions).

If the sender is known, the mail is blocked if it contains tags or attachments
which are prohibited for that sender.

OK, although I'd change that to "not allowed" rather than "prohibited", because 
in general I think it makes more sense to specify (perhaps automatically, for a 
future implementation level) ALLOWED types.

[It blocks mail from unknown people who use HTML or attachments or
large mail bodies, which is a sizable proportion of the email sending
population. People need to know enough about email structure to maintain
complex sets of permissions.]

Currently, I don't block mail that contains HTML attachments... as you point 
out, virtually all AOL mail and most HOTMAIL messages have HTML parts.  (OTOH, 
those tend to at least have 'well formed' HTML and MIME types...).  Instead, my 
current implementation simply strips the HTML portions from those messages (and 
HTML from the remaining parts) and proceeds with delivery.

As for knowing about "complex sets of permissions", that would depend on how
the implementing software is designed.  Obviously, one approach would be that
the recipient user could see a summary of the blocked mail(s) (who it's from,
the subject line, who it was addressed to, how big it was, why the filter
blocked it, etc etc) and be given an option to "allow messages like this one
from this sender in the future" which could open the filter rules and
permissions for that sender to allow (just) those features through for future
mails from that sender.  That doesn't really mean that the recipient has to
understand each of those individual permissions.
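
The "allow messages like this one" option could work roughly like this:  the
filter records which features caused the block, and a single user action
grants exactly those features to that one sender.  The class, method names and
feature labels are hypothetical, chosen only to illustrate the flow.

```python
from collections import defaultdict


class SenderRules:
    """Tracks which message features each sender has been allowed."""

    def __init__(self):
        # Lower-cased sender address -> set of allowed feature labels.
        self._allowed = defaultdict(set)

    def blocked_features(self, sender: str, features) -> set:
        """Features in this message that the sender may not (yet) use."""
        return set(features) - self._allowed[sender.lower()]

    def allow_like_this(self, sender: str, features) -> None:
        """Grant this sender just the features seen in one blocked mail."""
        self._allowed[sender.lower()] |= set(features)
```

The recipient never has to name or understand an individual permission;  the
rules grow only by the features of mail the recipient chose to accept, and
only for the sender who sent it.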

Blocked mail can be inspected individually, possibly by grouping 
according to some heuristics based on header fields or message structure.

Right.

[For spam, header fields are both faked and randomized to thwart easy
grouping, as that would also make filtering easy. 

Fine, but at least you can see if the E-mails LOOK like those things you EXPECT 
to receive... say, mailing lists you know you belong to.

...If you receive 500 spams a day and wait a week to check them at your 
leisure, you have to wade through 3500 messages looking for the elusive 
incorrectly blocked email - 

If you had to check all of them individually, yes.  Again, that's not what I'm
seeing here... I've got about 2000 E-mails quarantined since mid-August.  And
there are some types of detection rules that I'm willing to provide global
T-can rules on... such as non-listed senders who send me mails containing URLs
like the-rxsite.com or cyberbargain.biz or hotpersonalads.info.  Honestly, I
don't need to look at those E-mails, I'm reasonably sure that they're not
legitimate (or, at a minimum, not important).
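
A global discard rule of that kind might look like the sketch below.  The
three domains are the ones named above;  the URL pattern, the function name
and the choice to also match subdomains are assumptions of the example.

```python
import re
from urllib.parse import urlsplit

# Domains named in the paragraph above.
DISCARD_DOMAINS = {"the-rxsite.com", "cyberbargain.biz", "hotpersonalads.info"}

# Assumed, simplistic URL pattern for http/https links in a text body.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def should_discard(body: str, sender_is_listed: bool) -> bool:
    """Discard unseen if a non-listed sender links to a blocked domain."""
    if sender_is_listed:
        return False
    for url in URL_RE.findall(body):
        host = (urlsplit(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in DISCARD_DOMAINS):
            return True
    return False
```

Mail caught by such a rule never reaches the quarantine at all, which keeps
the pile that might need a human look much smaller.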

...that's still a chore, once a week, and the unfortunate sender
has been waiting for a response for a week].

If something is IMPORTANT and urgent, they can pick up the phone and call me.  
:-)  Occasional E-mails can (and do) go astray.  E-mail has NEVER been a 
guaranteed-delivery medium.

Based on the above objections, I'm not convinced your system is
competitive (accuracy wise) with other systems for large
organizations. For individual use by power users, it appears to me to
be less work than custom keyword filters, but more work than personal
Bayesian filters.

Well, for one, it works FAR better than Bayesian filters do, since so many 
spammers devote so much effort to messing up those types.

More to the point, I'm not EXPECTING that I will *ever* convince everyone, and
maybe not even you.  I know what *I* think works, and I'm convinced that it
will work for most people (and better overall, than competing approaches).

This is just my opinion based on our discussion, of course.
For a serious evaluation, nothing beats physical deployment at
multiple locations, and you at least pass the test of using your own
system yourself.

Thanks (at least to the extent that my present experimental implementation
models portions of the more complete and polished implementation that I
envision and propose).

Gordon Peterson                  http://personal.terabites.com/
1977-2002  Twenty-fifth anniversary year of Local Area Networking!
Support free and fair US elections!  http://stickers.defend-democracy.org
12/19/98: Partisan Republicans scornfully ignore the voters they "represent".
12/09/00: the date the Republican Party took down democracy in America.



_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg