ietf-asrg

RE: [Asrg] Trust relationships etc.

2005-07-21 10:32:06
You just described statistical filtering, a well-known variant being
Bayesian filtering. Later generation Bayesian filters can be very
effective and if Aunt Mildred gets zombied the filter will allow her
email through while still keeping the spam sent from her machine out. 

No, not really.

Those approaches basically look at the words used, etc., and that's not really
what I am talking about (not for MY purposes, anyway).

My approach involves primarily things like the presence or absence of HTML
(and, more finely, what TYPES of HTML tags are present), the presence or
absence of attachments (and, more specifically, what TYPE of files are
attached), message size, and so forth.

I think you misunderstand what statistical filters can do. Statistical
filters don't, strictly speaking, look at words, they look at events or
tokens. You can decide whatever these events mean. Most statistical
filters simply equate an event with a single word. More sophisticated
implementations can feed the filter other events such as IPs, dollar
amounts, appropriately processed time, etc. For example, in my
implementation I consider very long words as a type of event. I don't
pass the filter the word itself, just an event telling the filter
"very-long-word-here". You can easily extend a filter to know about
attachments.
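
As a rough illustration of "events rather than words", here is a minimal Python sketch of a tokenizer that hands the filter events. The threshold, the event names other than "very-long-word-here", and the function name are assumptions made for this example, not details of any particular filter.

```python
import re

LONG_WORD = 20  # hypothetical cutoff for "very long"; tune to taste

DOLLARS = re.compile(r"\$\d[\d,]*(\.\d+)?")

def events(text, attachment_names=()):
    """Turn a message body (plus attachment names) into filter events."""
    out = []
    for word in re.findall(r"\S+", text):
        if len(word) >= LONG_WORD:
            # Pass an event describing the word, not the word itself.
            out.append("very-long-word-here")
        elif DOLLARS.fullmatch(word):
            out.append("dollar-amount-here")
        else:
            out.append(word.lower())
    for name in attachment_names:
        # One event per attachment, keyed on the file extension.
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "none"
        out.append("attachment-ext:" + ext)
    return out
```

The statistical part of the filter never needs to know what an event means; it only counts how often each one shows up in spam versus legitimate mail.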

Sure, if you want to define them that way, EVERY filter is a "statistical" 
filter.  I'm less concerned with semantics and definitions than I am with 
function.

Bayesian filters can not only detect the absence/presence of HTML, they
can also tell you which tags or even attributes are relevant
automatically. I would never have thought that #ffffff or #ff0000 were
very good indicators of spam mail until I went through the statistical
data of my filter. A human can't be as thorough or as accurate as a
machine.
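
A sketch of how such indicators can fall out of the statistics: the Python below scores each event by how much more often it appears in spam than in legitimate mail. The formula is a simplified per-token ratio chosen for illustration, and the function name and `min_count` parameter are assumptions; real filters add smoothing and combine many token scores.

```python
from collections import Counter

def token_scores(spam_msgs, ham_msgs, min_count=2):
    """Score each token in [0, 1]; values near 1 mean 'mostly seen in spam'.

    Each message is a list of tokens (events). Both lists must be non-empty.
    """
    # Count the number of messages containing each token, not raw occurrences.
    spam_counts = Counter(t for m in spam_msgs for t in set(m))
    ham_counts = Counter(t for m in ham_msgs for t in set(m))
    scores = {}
    for tok in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[tok], ham_counts[tok]
        if s + h < min_count:
            continue  # too rare to say anything about
        p_spam = s / len(spam_msgs)
        p_ham = h / len(ham_msgs)
        scores[tok] = p_spam / (p_spam + p_ham)
    return scores
```

Sorting the result by score is one way to discover that an event such as a particular colour attribute is a strong spam indicator without anyone having guessed it in advance.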

We agree that the volume involved calls for computer-based algorithms.

... Those would be checked subject to a fine-grained "permissions list"
established by each recipient, based upon who the stated sender of the
message was.

This is fundamentally a whitelist being based on heuristics instead of
email addresses. 

It is a sort of whitelist, yes, in that the privilege of sending more advanced 
E-mail features would be restricted to a recipient-established set of senders 
who had negotiated that right with the intended recipient, in advance.

Non-technical users will be stumped trying to configure it. 

That depends ENTIRELY on how it's implemented in the software.  There's nothing 
inherently complicated about "allow future mail that looks like this from this 
sender."  What the specific criteria and finer-level permissions are need not 
necessarily be understood or handled by the recipient, any more than that you 
have to understand the inner workings of your automatic transmission to drive a 
car.
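
To make "allow future mail that looks like this from this sender" concrete, here is a minimal Python sketch of a per-sender permissions table. The `Permissions` fields, the defaults, and both function names are assumptions made for illustration; the proposal itself does not specify a data structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Permissions:
    allow_html: bool = False
    attachment_exts: frozenset = frozenset()  # allowed file extensions
    max_bytes: int = 50_000

DEFAULT = Permissions()  # what an unknown sender gets

def permissions_for(table, sender):
    """Look up a sender's permissions, falling back to the defaults."""
    return table.get(sender.lower(), DEFAULT)

def allow_like_this(table, sender, has_html, attachment_exts, size):
    """Widen a sender's permissions just enough to admit a message like this."""
    old = permissions_for(table, sender)
    table[sender.lower()] = Permissions(
        allow_html=old.allow_html or has_html,
        attachment_exts=old.attachment_exts
        | frozenset(e.lower() for e in attachment_exts),
        max_bytes=max(old.max_bytes, size),
    )
```

The recipient would only ever click something like "allow mail like this"; the mail client would fill in the finer-grained fields behind the scenes.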

BTW, heuristics, in general, are dodgy - they are static, need manual
maintenance, and are usually easy to evade anyway.

Sigh.  Two remarks in a row that are clearly bogus.

I guess it's true that it's possible to "evade" _some_ types of heuristics.

But if the default rules for unknown senders are:

  1)  There shall be no attachments.

  2)  There shall be no HTML tags.

  3)  There shall be no unexpected E-mails bigger than (say) 50K bytes.

...then it's hard to imagine how those are "easy to evade".  Either the mail 
contains attachments, or it doesn't.  Either it's got HTML tags, or it doesn't. 
It is bigger than 50K, or it's not.  

If attachments are simply not allowed, then I'd love to see you explain how
it's possible to "evade" such a heuristic and send an attachment.  Likewise,
if the software doesn't allow E-mails bigger than 50K, then (if the software
works, of course) it's hard to imagine how you could "evade" the "heuristic"
and successfully send an E-mail of 150K.
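
The three default rules for unknown senders can be sketched in Python with the standard `email` package. The function name, the returned labels, and the crude regular expression for spotting tags in plain-text parts are assumptions made for this example; a real implementation would need more careful MIME handling.

```python
import re
from email import message_from_bytes

MAX_BYTES = 50_000  # "bigger than (say) 50K bytes"

# Crude tag detector: matches <b>, </a>, <a href=...>, but not <user@host>.
HTML_TAG = re.compile(rb"<\s*/?\s*[a-zA-Z][a-zA-Z0-9]*(\s[^>]*)?/?\s*>")

def violations(raw):
    """Return the set of default rules that a raw message breaks."""
    msg = message_from_bytes(raw)
    found = set()
    if len(raw) > MAX_BYTES:
        found.add("too-big")
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_disposition() == "attachment" or part.get_filename():
            found.add("attachment")
        elif part.get_content_type() == "text/html":
            found.add("html")
        elif part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) or b""
            if HTML_TAG.search(body):
                found.add("html")
    return found
```

An empty result means the message would be acceptable from an unknown sender; anything else would need that sender to hold the matching permission.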

Now, yes, we can agree that the CONTENT FILTER part of things (something
vaguely on the order of SpamAssassin) uses weighting and heuristics, and spammers 
traditionally tweak their E-mails to pass (at least default settings for) 
commonly encountered versions of such filters.  But that is a separate question 
from *my* concern, which is predominantly to put a crimp in viruses, worms, 
zombie spambots, and other malware stuff (much of which is used by spammers to 
confuse and evade content filters).  This also has the effect of reducing the 
bulk of spam E-mails (and, ideally, we can push their success rate below the 
threshold at which point there is 'enough' money to be made by spamming).

You're right in that a suitable Bayesian filter MIGHT recognize the
difference between Aunt Mildred's vocabulary and that of a spammer or other
abuser, but spammers for the last year or more have been targeting Bayesian
filters by adding large amounts of gobbledygook to their spams to confuse
their signature vocabulary.

Yes, and as in any arms race, there is a response. One approach is to pass
the filter n-grams as events rather than single words. Look up sparse binary
polynomial hashing (SBPH) for an example of this. And it works pretty well
too, especially with some fine-tuning :)
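
For a feel of what such phrase features look like, here is a simplified Python sketch in the spirit of SBPH: for each word, it emits every combination of the preceding words in a short window, with the left-out words replaced by a placeholder. The function name, the "<skip>" marker, and the omission of the hashing step are simplifications assumed for this example; it is not a faithful copy of any existing implementation.

```python
from itertools import product

def phrase_features(tokens, window=5):
    """Emit sparse n-gram features ending at each token.

    For every position, the newest token is always kept and each of the
    up to (window - 1) preceding tokens is either kept or skipped.
    """
    feats = []
    for i in range(len(tokens)):
        older = tokens[max(0, i - window + 1):i]
        for mask in product((False, True), repeat=len(older)):
            parts = [t if keep else "<skip>" for t, keep in zip(older, mask)]
            feats.append(" ".join(parts + [tokens[i]]))
    return feats
```

With the default window of five, each word that has four predecessors produces 16 features, so random filler words scattered through a spam disturb only some of the phrases the filter has learned.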

Sure.  But the more of that you use (the bigger "n" is) the bigger a text
sample you need in order to have confidence in the conclusion.

And, as I also mentioned the other day, in many cases users might not agree 
about whether a specific E-mail is "spam" or maybe just something that they 
don't want to deal with, for whatever personal reason.  Ultimately, the choice 
should be theirs.

Keeping the statistical data for
each recipient is expensive. In organisations it might be possible to
keep statistical data on a per-department basis with a possible
loss of accuracy, but for ISPs this can't be done.

Straw man.  I don't think we need to limit our discussion to just those
techniques which are suitable for ISP-level implementation.

Indeed. I was just pointing out that ISPs have significant costs in
network bandwidth, administrative hassles etc. 

That's one good reason why they shouldn't be expected to meddle in the details 
that are better set and controlled by the endpoints.  I'm convinced that most 
users would just as soon not have to PAY someone else to tweak their OWN E-mail 
filtering decisions.

I prefer to do filtering
on my machine than let an ISP filter it for me. 

Agreed.  I finally turned off the antispam mail filtering provided by my domain
provider; there were simply too many false positives and I tired of fighting
the software they had available at the time for controlling their filters.  I
simply decided that I personally preferred to deliver it all, and let me and my
software deal with it once it arrived here.

I was just pointing out
that huge ISPs like AOL see *lots* of spam and they too are trying to
reduce the spam problem. 

Right.  The BEST approach IMHO would be for them to make suitable software
tools available to their users (and I think my proposal would be a GREAT start).

I've been trying to talk Microsoft into adding my proposed fine-grained, 
sender-specific filtering ideas into Outlook and Outlook Express.  No luck so 
far, but I'm still hoping.   :-)

This mail was checked for viruses by GFI MailSecurity. 
GFI also develops anti-spam software (GFI MailEssentials), a fax server (GFI 
FAXmaker), and network security and management software (GFI LANguard) - 
www.gfi.com 

This is another laugh... has anybody ever seen a legitimate, dangerous virus 
(the kind that antivirus software might catch, I mean) contained in a PLAIN 
ASCII TEXT E-mail?

Gordon Peterson                  http://personal.terabites.com/
1977-2002  Twenty-fifth anniversary year of Local Area Networking!
Support free and fair US elections!  http://stickers.defend-democracy.org
12/19/98: Partisan Republicans scornfully ignore the voters they "represent".
12/09/00: the date the Republican Party took down democracy in America.



_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg