ietf-asrg

RE: [Asrg] Trust relationships etc.

2005-07-21 01:37:57

You just described statistical filtering, a well-known variant being
Bayesian filtering. Later generation Bayesian filters can be very
effective and if Aunt Mildred gets zombied the filter will allow her
email through while still keeping the spam sent from her machine out. 

No, not really.

Those approaches basically look at the words used, etc., and that's not
really what I am talking about (not for MY purposes, anyway).

My approach involves primarily things like the presence or absence of
HTML (and, more finely, what TYPES of HTML tags are present), the
presence or absence of attachments (and, more specifically, what TYPE of
files are attached), message size, and so forth.

I think you misunderstand what statistical filters can do. Statistical
filters don't, strictly speaking, look at words, they look at events or
tokens. You can decide whatever these events mean. Most statistical
filters simply equate an event with a single word. More sophisticated
implementations can feed the filter other events such as IPs, dollar
amounts, appropriately processed time, etc. For example, in my
implementation I consider very long words as a type of event. I don't
pass the filter the word itself, just an event telling the filter
"very-long-word-here". You can easily extend a filter to know about
attachments.
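
To make the "events, not words" point concrete, here is a minimal sketch
of a tokenizer that feeds a filter such events. The function name, the
event labels and the 20-character cutoff are all assumptions made up for
illustration; it is not the code of any particular filter.

```python
import re
from email.message import EmailMessage

LONG_WORD = 20  # assumed cutoff for what counts as a "very long" word


def events(msg):
    """Yield filter events (not just words) for one parsed message."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        filename = part.get_filename()
        if filename:
            # Report only the attachment type, never its contents.
            ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else "none"
            yield "attachment-type:" + ext
            continue
        if part.get_content_maintype() != "text":
            continue
        if part.get_content_subtype() == "html":
            yield "has-html"
        for word in re.findall(r"\S+", part.get_content()):
            if len(word) >= LONG_WORD:
                # The word itself is dropped; only the event is passed on.
                yield "very-long-word-here"
            else:
                # The prefix keeps real words from colliding with event names.
                yield "word:" + word.lower()
```

The filter downstream never needs to know that some events are words
and others are attachment types; it just counts them.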

Bayesian filters can not only detect the absence/presence of HTML, they
can also tell you automatically which tags or even attributes are
relevant. I would never have thought that #fffff or #ff0000 were
very good indicators of spam mail until I went through the statistical
data of my filter. A human can't be as thorough or as accurate as a
machine.
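
As a rough illustration of how such indicators fall out of the
statistics, the sketch below estimates a per-token spam probability from
training counts and ranks tokens by how far they sit from 0.5. The
function names, the add-one smoothing, the equal-priors assumption and
the `min_count` threshold are my own choices for the example, not a
description of any specific filter.

```python
from collections import Counter


def token_spam_probabilities(spam_msgs, ham_msgs, min_count=5):
    """Estimate P(spam | token) per token, assuming equal class priors.

    Each message is an iterable of event strings; a token is counted
    at most once per message.
    """
    spam_counts = Counter(t for m in spam_msgs for t in set(m))
    ham_counts = Counter(t for m in ham_msgs for t in set(m))
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[tok], ham_counts[tok]
        if s + h < min_count:
            continue  # too rare to say anything reliable
        p_s = (s + 1) / (n_spam + 2)  # smoothed P(token | spam)
        p_h = (h + 1) / (n_ham + 2)   # smoothed P(token | ham)
        probs[tok] = p_s / (p_s + p_h)
    return probs


def most_indicative(probs, k=10):
    """Return the k tokens whose probability is furthest from 0.5."""
    return sorted(probs, key=lambda t: abs(probs[t] - 0.5), reverse=True)[:k]
```

Running `most_indicative` over a trained table is how a token such as a
colour attribute can surface as a strong indicator without anyone
having thought to look for it.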


... Those would be checked subject to a fine-grained "permissions list"
established by each recipient, based upon who the stated sender of the
message was.

This is fundamentally a whitelist based on heuristics instead of
email addresses. Non-technical users will be stumped trying to configure
it.

BTW, heuristics, in general, are dodgy - they are static, need manual
maintenance, and are usually easy to evade anyway.


You're right in that a suitable Bayesian filter MIGHT recognize the
difference between Aunt Mildred's vocabulary and that of a spammer or
other abuser, but spammers for the last year or more have been targeting
Bayesian filters by adding large amounts of gobbledygook to their spams
to confuse their signature vocabulary.

Yes, and as in any arms race, there is a response. One approach is to
pass n-grams, rather than single words, as events. Look up sparse binary
polynomial hashing (SBPH) for an example of this. And it works pretty
well too, especially with some fine-tuning :)
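
For a feel of what passing n-grams instead of single words looks like,
here is a simplified sketch in the spirit of SBPH: slide a window over
the tokens and, for each newest token, emit every combination of the
older tokens in the window with the gaps marked. The function name, the
`<skip>` marker and the default window of 5 are assumptions for the
example, and real implementations hash these features rather than
storing strings, so treat this as an illustration only.

```python
from itertools import combinations

SKIP = "<skip>"  # placeholder marking a position that was left out


def sparse_ngram_events(tokens, window=5):
    """Emit sparse n-gram events ending at each token.

    For every token, each subset of the (up to window - 1) preceding
    tokens is kept in its original position, the rest become SKIP, and
    leading SKIPs are dropped. A full window yields 2 ** (window - 1)
    events for its newest token.
    """
    out = []
    for i, newest in enumerate(tokens):
        older = tokens[max(0, i - window + 1):i]
        for r in range(len(older) + 1):
            for keep in combinations(range(len(older)), r):
                slots = [older[j] if j in keep else SKIP
                         for j in range(len(older))]
                while slots and slots[0] == SKIP:
                    slots.pop(0)  # "<skip> pills" is the same event as "pills"
                out.append(" ".join(slots + [newest]))
    return out
```

With events like `buy <skip> pills` in the table, random filler words
dropped between the real ones no longer hide the phrase, at the cost of
a much larger number of events per message.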


Keeping the statistical data for each recipient is expensive. In
organisations it might be possible to keep statistical data on a
per-department basis with a possible loss of accuracy, but for ISPs
this can't be done.

Straw man.  I don't think we need to limit our discussion to only those
techniques which are suitable for ISP-level implementation.

Indeed. I was just pointing out that ISPs have significant costs in
network bandwidth, administrative hassles etc. I prefer to do filtering
on my machine rather than let an ISP filter it for me, but huge ISPs
like AOL see *lots* of spam and they too are trying to reduce the spam
problem.

Brian Azzopardi

  

