You just described statistical filtering, a well-known variant being
Bayesian filtering. Later generation Bayesian filters can be very
effective and if Aunt Mildred gets zombied the filter will allow her
email through while still keeping the spam sent from her machine out.
No, not really.
Those approaches basically look at the words used, etc., and that's not really
what I am talking about (not for MY purposes, anyway).
My approach involves primarily things like the presence or absence of HTML (and,
more finely, what TYPES of HTML tags are present), the presence or absence of
attachments (and, more specifically, what TYPE of files are attached), message
size, and so forth.
I think you misunderstand what statistical filters can do. Statistical
filters don't, strictly speaking, look at words, they look at events or
tokens. You can decide whatever these events mean. Most statistical
filters simply equate an event with a single word. More sophisticated
implementations can feed the filter other events such as IPs, dollar
amounts, appropriately processed time, etc. For example, in my
implementation I consider very long words as a type of event. I don't
pass the filter the word itself, just an event telling the filter
"very-long-word-here". You can easily extend a filter to know about
attachments.
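
To make the "events, not words" idea concrete, here is a minimal sketch in Python. Everything named in it is an assumption for illustration: the function `extract_events`, the 20-character cutoff for a "very long word", and the event label formats are hypothetical and are not taken from any particular filter. It only shows that attachments and over-long words can be handed to a statistical filter as ordinary events.

```python
import re
from email import message_from_string
from email.message import Message

LONG_WORD_LEN = 20  # assumed cutoff; the post does not give a number


def extract_events(msg: Message) -> list[str]:
    """Turn a parsed message into a flat list of events for a statistical filter."""
    events = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        filename = part.get_filename()
        if filename:
            # The filter never sees the file, only the fact and the type.
            ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else "none"
            events.append("attachment-present")
            events.append(f"attachment-type:{ext}")
            continue
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        text = payload.decode(part.get_content_charset() or "utf-8", errors="replace")
        for word in re.findall(r"\S+", text):
            if len(word) >= LONG_WORD_LEN:
                # Pass the event, not the word itself.
                events.append("very-long-word-here")
            else:
                events.append("word:" + word.lower())
    return events


if __name__ == "__main__":
    raw = "From: a@example.com\nSubject: hi\n\nhello " + "x" * 25 + " World\n"
    print(extract_events(message_from_string(raw)))
```

The same pattern would extend to other events (IPs, dollar amounts, HTML tags) by adding more branches that emit labels.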
Sure, if you want to define them that way, EVERY filter is a "statistical"
filter. I'm less concerned with semantics and definitions than I am with
function.
Bayesian filters can not only detect the absence/presence of HTML but they
can also tell you which tags or even attributes are relevant
automatically. I would never have thought that #fffff or #ff0000 were
very good indicators of spam mail until I went through the statistical
data of my filter. A human can't be as thorough or as accurate as a
machine.
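
A rough sketch of how such per-token statistics could be read off a training corpus follows. The function name `token_spam_scores`, the add-one smoothing, and the `min_count` cutoff are assumptions made for this illustration; real Bayesian filters differ in how they smooth and combine scores.

```python
from collections import Counter


def token_spam_scores(spam_msgs, ham_msgs, min_count=2):
    """Score each token by how strongly it indicates spam (0.0 to 1.0).

    spam_msgs / ham_msgs are lists of token lists, one list per message.
    """
    # Count messages containing each token, not raw occurrences.
    spam_counts = Counter(t for m in spam_msgs for t in set(m))
    ham_counts = Counter(t for m in ham_msgs for t in set(m))
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)

    scores = {}
    for tok in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[tok], ham_counts[tok]
        if s + h < min_count:
            continue  # too rare to say anything about
        p_spam = (s + 1) / (n_spam + 2)  # add-one smoothing (an assumption)
        p_ham = (h + 1) / (n_ham + 2)
        scores[tok] = p_spam / (p_spam + p_ham)
    return scores


if __name__ == "__main__":
    spam = [["color:#ff0000", "word:free"], ["color:#ff0000", "word:hello"]]
    ham = [["word:hello", "word:lunch"], ["word:hello"]]
    for tok, score in sorted(token_spam_scores(spam, ham).items()):
        print(f"{tok}: {score:.2f}")
```

Sorting the result by score is what surfaces indicators a human would not have guessed, such as a colour attribute.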
We agree that the volume involved calls for computer-based algorithms.
... Those would be checked subject to a fine-grained
"permissions list" established by each recipient, based upon who the stated
sender of the message was.
This is fundamentally a whitelist based on heuristics instead of
email addresses.
It is a sort of whitelist, yes, in that the privilege of sending more advanced
E-mail features would be restricted to a recipient-established set of senders
who had negotiated that right with the intended recipient, in advance.
Non-technical users will be stumped trying to configure it.
That depends ENTIRELY on how it's implemented in the software. There's nothing
inherently complicated about "allow future mail that looks like this from this
sender." What the specific criteria and finer-level permissions are need not
necessarily be understood or handled by the recipient, any more than you
have to understand the inner workings of your automatic transmission to drive a
car.
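
As an illustration of how little the user would need to see, here is a hypothetical sketch of a per-sender permissions list with a single "allow future mail like this" operation. The class and field names (`Permissions`, `MessageProfile`, `PermissionsList`, `allow_like_this`) are invented for this example, and the 50,000-byte default is only an assumed stand-in for the "(say) 50K" figure.

```python
from dataclasses import dataclass, field


@dataclass
class Permissions:
    """What one sender may send. Defaults are the strict rules for strangers."""
    allow_html: bool = False
    attachment_types: set[str] = field(default_factory=set)
    max_bytes: int = 50_000  # assumed default size limit


@dataclass
class MessageProfile:
    """Features observed on one received message."""
    has_html: bool
    attachment_types: set[str]
    size_bytes: int


class PermissionsList:
    def __init__(self):
        self._by_sender: dict[str, Permissions] = {}

    def lookup(self, sender: str) -> Permissions:
        # Unknown senders get the strict defaults.
        return self._by_sender.get(sender.lower(), Permissions())

    def allow_like_this(self, sender: str, profile: MessageProfile) -> Permissions:
        """Widen the sender's permissions just enough to admit this message."""
        current = self.lookup(sender)
        updated = Permissions(
            allow_html=current.allow_html or profile.has_html,
            attachment_types=current.attachment_types | profile.attachment_types,
            max_bytes=max(current.max_bytes, profile.size_bytes),
        )
        self._by_sender[sender.lower()] = updated
        return updated


if __name__ == "__main__":
    plist = PermissionsList()
    photo_mail = MessageProfile(has_html=True, attachment_types={"jpg"}, size_bytes=120_000)
    print(plist.allow_like_this("Mildred@example.com", photo_mail))
```

The user-facing part would be one button; the finer-grained fields stay under the hood.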
BTW, heuristics, in general, are dodgy - they are static, need manual
maintenance, and are usually easy to evade anyway.
Sigh. Two remarks in a row that are clearly bogus.
I guess it's true that it's possible to "evade" _some_ types of heuristics.
But if the default rules for unknown senders are:
1) There shall be no attachments.
2) There shall be no HTML tags.
3) There shall be no unexpected E-mails bigger than (say) 50K bytes.
...then it's hard to imagine how those are "easy to evade". Either the mail
contains attachments, or it doesn't. Either it's got HTML tags, or it doesn't.
It is bigger than 50K, or it's not.
If attachments are simply not allowed, then I'd love to see you explain how it's
possible to "evade" such a heuristic and send an attachment. Likewise, if the
software doesn't allow E-mails bigger than 50K, then (if the software works, of
course) it's hard to imagine how you could "evade" the "heuristic" and
successfully send an E-mail of 150K.
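
The three default rules above are simple enough to sketch with Python's standard `email` package. This is only an illustration under stated assumptions: the function name `check_unknown_sender` is hypothetical, "50K" is taken as 50 * 1024 bytes of raw message, and the regular expression is a rough stand-in for real HTML tag detection.

```python
import re
from email import message_from_string

MAX_BYTES = 50 * 1024  # assumed reading of "50K bytes"

# Rough tag detector: "<name>", "</name>", "<name attr=...>", "<name/>".
# It deliberately does not match "<user@example.com>".
HTML_TAG = re.compile(r"<\s*/?\s*[a-zA-Z][a-zA-Z0-9]*(\s[^>]*)?/?>")


def check_unknown_sender(raw: str) -> list[str]:
    """Return the default rules that a raw message from an unknown sender breaks."""
    msg = message_from_string(raw)
    has_attachment = False
    has_html = False
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_filename() or part.get_content_disposition() == "attachment":
            has_attachment = True
        elif part.get_content_type() == "text/html":
            has_html = True
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            text = payload.decode(part.get_content_charset() or "utf-8", errors="replace")
            if HTML_TAG.search(text):
                has_html = True

    violations = []
    if has_attachment:
        violations.append("attachment")   # rule 1
    if has_html:
        violations.append("html")         # rule 2
    if len(raw.encode("utf-8", errors="replace")) > MAX_BYTES:
        violations.append("too-large")    # rule 3
    return violations


if __name__ == "__main__":
    plain = "From: a@example.com\nSubject: hi\n\nJust text here.\n"
    print(check_unknown_sender(plain))
```

Each check is a yes/no test on the message structure; there are no weights to tune.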
Now, yes, we can agree that the CONTENT FILTER part of things (something vaguely
on the order of SpamAssassin) uses weighting and heuristics, and spammers
traditionally tweak their E-mails to pass (at least default settings for)
commonly encountered versions of such filters. But that is a separate question
from *my* concern, which is predominantly to put a crimp in viruses, worms,
zombie spambots, and other malware stuff (much of which is used by spammers to
confuse and evade content filters). This also has the effect of reducing the
bulk of spam E-mails (and, ideally, we can push their success rate below the
threshold at which point there is 'enough' money to be made by spamming).
You're right in that a suitable Bayesian filter MIGHT recognize the difference
between Aunt Mildred's vocabulary and that of a spammer or other abuser, but
spammers for the last year or more have been targeting Bayesian filters by
adding large amounts of gobbledygook to their spams to confuse their signature
vocabulary.
Yes, and as in any arms race, there is a response. One approach is to
pass n-grams, rather than single words, as events. Look up sparse binary
polynomial hashing (SBPH) for an example of this. And it works pretty well
too, especially with some fine-tuning :)
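
For readers who have not seen the idea, here is a simplified sketch of sparse n-gram features in the spirit of SBPH. It is not the actual SBPH implementation: the function name `sparse_ngrams`, the window of 3 tokens, the `<skip>` placeholder, and the omission of the hashing step are all assumptions made to keep the example short.

```python
from itertools import product


def sparse_ngrams(tokens, window=3, skip="<skip>"):
    """Emit every phrase that ends at each token, with earlier tokens in the
    window either kept or replaced by a placeholder."""
    features = []
    for i, newest in enumerate(tokens):
        earlier = tokens[max(0, i - window + 1):i]
        # Each earlier token is independently kept (True) or skipped (False).
        for mask in product((False, True), repeat=len(earlier)):
            kept = [t if keep else skip for t, keep in zip(earlier, mask)]
            # Leading placeholders add no information, so drop them.
            while kept and kept[0] == skip:
                kept.pop(0)
            features.append(" ".join(kept + [newest]))
    return features


if __name__ == "__main__":
    for feature in sparse_ngrams(["buy", "cheap", "pills"]):
        print(feature)
```

Inserting gobbledygook between spammy words still leaves features like `buy <skip> pills` intact. Note that the number of features per token doubles with each extra position in the window.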
Sure. But the more of that you use (the bigger "n" is) the bigger a text sample
you need in order to have confidence in the conclusion.
And, as I also mentioned the other day, in many cases users might not agree
about whether a specific E-mail is "spam" or maybe just something that they
don't want to deal with, for whatever personal reason. Ultimately, the choice
should be theirs.
Keeping the statistical data for
each recipient is expensive. In organisations it might be possible to
keep statistical data on a per-department basis with a possible
loss of accuracy, but for ISPs this can't be done.
Straw man. I don't think we need to limit our discussion to only those
techniques which are suitable for ISP-level implementation.
Indeed. I was just pointing out that ISPs have significant costs in
network bandwidth, administrative hassles, etc.
That's one good reason why they shouldn't be expected to meddle in the details
that are better set and controlled by the endpoints. I'm convinced that most
users would just as soon not have to PAY someone else to tweak their OWN E-mail
filtering decisions.
I prefer to do filtering
on my machine than let an ISP filter it for me.
Agreed. I finally turned off the antispam mail filtering provided by my domain
provider; there were simply too many false positives and I tired of fighting the
software they had available at the time for controlling their filters. I simply
decided that I personally preferred to deliver it all, and let me and my
software deal with it once it arrived here.
I was just pointing out
that huge ISPs like AOL see *lots* of spam and they too are trying to
reduce the spam problem.
Right. The BEST approach IMHO would be for them to make suitable software tools
(and I think my proposal would be a GREAT start) available to their users.
I've been trying to talk Microsoft into adding my proposed fine-grained,
sender-specific filtering ideas into Outlook and Outlook Express. No luck so
far, but I'm still hoping. :-)
This mail was checked for viruses by GFI MailSecurity.
GFI also develops anti-spam software (GFI MailEssentials), a fax server (GFI
FAXmaker), and network security and management software (GFI LANguard) -
www.gfi.com
This is another laugh... has anybody ever seen a legitimate, dangerous virus
(the kind that antivirus software might catch, I mean) contained in a PLAIN
ASCII TEXT E-mail?
Gordon Peterson http://personal.terabites.com/
1977-2002 Twenty-fifth anniversary year of Local Area Networking!
Support free and fair US elections! http://stickers.defend-democracy.org
12/19/98: Partisan Republicans scornfully ignore the voters they "represent".
12/09/00: the date the Republican Party took down democracy in America.