On Wed, 20 Jul 2005, "Brian Azzopardi" <briana(_at_)gfi(_dot_)com> wrote:
I simply believe it makes a LOT more sense to identify most spam by
observing its variance from accepted and agreed form.
E-mail coming from a given correspondent which DOES NOT LOOK LIKE the
mail you expect to get from that correspondent
can, and probably should, be quarantined or even t-canned until a
different treatment is indicated.
If I get a 170K-byte PIF file attachment from my dear old Aunt
Mildred, it's a pretty safe bet that
it's a virus or worm... she would simply never legitimately send me
anything like that (nor, in fact,
would probably anybody else).
You just described statistical filtering, a well-known variant being
Bayesian filtering. Later generation Bayesian filters can be very
effective and if Aunt Mildred gets zombied the filter will allow her
email through while still keeping the spam sent from her machine out.
No, not really.
Those approaches basically look at the words used, etc., and that's not really
what I am talking about (not for MY purposes, anyway).
My approach involves primarily things like the presence or absence of HTML (and,
more finely, what TYPES of HTML tags are present), the presence or absence of
attachments (and, more specifically, what TYPE of files are attached), message
size, and so forth. Those would be checked subject to a fine-grained
"permissions list" established by each recipient, based upon who the stated
sender of the message was. This could either be managed directly, or indirectly
via some sort of "allow this sender to send this type of material in the future"
dialog which would open the restrictions for that sender to allow the specific
(perhaps even unidentified) features which caused the mail to be questioned.
The DEFAULT (for unrecognized/previously unknown senders) would be NO HTML, NO
attachments of any kind, and limited message size (25K, 50K, 100K, or whatever,
but probably on that order). NOTE SPECIFICALLY that these defaults would block
essentially all worms and viruses and other e-mail-borne malware exploits coming
from unrecognized senders... and the narrow established permissions would
probably block most or all such stuff coming from RECOGNIZED senders too.
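To make the idea concrete, here is a minimal sketch of such a per-sender
"permissions list" check. All names, policy fields, and the default values
(no HTML, no attachments, ~50K size cap) are illustrative assumptions, not a
definitive implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SenderPolicy:
    """What one recipient permits from one stated sender."""
    allow_html: bool = False                                     # default: NO HTML
    allowed_attachment_types: set = field(default_factory=set)   # default: none
    max_size_bytes: int = 50 * 1024                              # default: ~50K cap

def is_suspect(sender, has_html, attachment_exts, size_bytes, policies):
    """Return True if the message exceeds what this stated sender is permitted.
    Unknown senders fall back to the restrictive defaults above."""
    policy = policies.get(sender, SenderPolicy())
    if has_html and not policy.allow_html:
        return True
    if any(ext not in policy.allowed_attachment_types for ext in attachment_exts):
        return True
    return size_bytes > policy.max_size_bytes

# A 170K PIF attachment "from" Aunt Mildred gets flagged: her permissions
# (hypothetical here) allow only small plain-text mail with JPEG photos.
policies = {"mildred@example.com":
            SenderPolicy(allowed_attachment_types={"jpg"}, max_size_bytes=200 * 1024)}
print(is_suspect("mildred@example.com", False, ["pif"], 170 * 1024, policies))
```

The point of the sketch is that the decision never looks at the message
vocabulary at all, only at its gross form versus the sender's established
permissions.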
A nice extension to my approach would be to add additional content expected to
be found in a message from the given sender. For example, a message body from a
familiar newsletter would be expected to contain the masthead or copyright
notice found in every legitimate copy of that newsletter. A message coming from
a particular correspondent might be expected to contain their characteristic
signature file. A message from a Yahoogroups mailing list might be expected to
contain a Yahoogroups-type ad, or perhaps the group name in square brackets as
part of the subject. One might even include the characteristic mailer software
(based on the message header tag) that the sender is known to use. A message
claiming to be from that sender and NOT containing their characteristic content
and style would be immediately treated as suspect.
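The expected-content extension might be sketched as a per-sender table of
patterns that every legitimate message from that sender should match. The
senders and patterns below are invented for illustration:

```python
import re

# Hypothetical per-sender "expected content" patterns. A claimed sender whose
# message lacks their characteristic content is treated as suspect.
EXPECTED = {
    "newsletter@example.com": [r"Copyright \d{4} Example Newsletter"],  # masthead
    "list@yahoogroups.example": [r"^\[mygroup\]"],  # group tag in the Subject
}

def matches_expected(sender, subject, body):
    """True if the message contains all characteristic content for this
    claimed sender (senders with no recorded patterns pass vacuously)."""
    patterns = EXPECTED.get(sender, [])
    text = subject + "\n" + body
    return all(re.search(p, text, re.MULTILINE) for p in patterns)
```

A spoofed newsletter that omits the copyright notice would fail this check
even though the From address looks right, which is exactly the "content style,
not sender ID" distinction argued below.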
You're right in that a suitable Bayesian filter MIGHT recognize the difference
between Aunt Mildred's vocabulary and that of a spammer or other abuser, but
spammers for the last year or more have been targeting Bayesian filters by
adding large amounts of gobbledygook to their spams to confuse their signature
vocabulary.
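To see why padding works against a Bayesian filter, consider a toy
Graham-style token scorer. The per-token spam probabilities here are invented
for illustration; real filters learn them from a corpus:

```python
import math

def spam_score(tokens, p_spam_given_token):
    # Combine per-token spam probabilities in log-odds space (Graham-style).
    # Tokens not in the table are treated as neutral (p = 0.5, zero log-odds).
    log_odds = sum(math.log(p / (1 - p))
                   for t in tokens
                   for p in [p_spam_given_token.get(t, 0.5)])
    return 1 / (1 + math.exp(-log_odds))

probs = {"viagra": 0.99, "free": 0.9,               # spammy tokens
         "meeting": 0.1, "thanks": 0.1, "weather": 0.1}  # ham-looking filler

spammy = ["viagra", "free"]
padded = spammy + ["meeting", "thanks", "weather"]  # gobbledygook padding

print(spam_score(spammy, probs))  # well above 0.9: clearly spam
print(spam_score(padded, probs))  # dragged down near 0.5 by the filler
```

The same spammy payload scores much lower once innocuous-looking words are
mixed in, whereas a form-based check (HTML, attachment type, size) is
unaffected by padding.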
SPF, reputation, et al can't do that.
Of course. And I consider it a (perhaps-)fatal and fundamental flaw in those
approaches that they are based more on HOW the message was sent than WHAT was
sent.
It's much harder for a spam or virus or worm to spoof (in a general and
universal way) the CONTENT STYLE of the owner of the machine they've
commandeered than it is to simply spoof a sender ID. Even more, a typical
recipient will likely have **NOBODY AT ALL** authorized to send them executable
content... which would essentially make it an impossibility to spoof ANYBODY and
get viruses, worms, or other zombie-spambot software into their machine via an
E-mail vector.
The "problem" with Bayesian filtering and other content checking methods
is that you need to work on at least a non-trivial part, say 4k, of the
message body. This is fine for most organisations and individuals, but
maybe too resource intensive for ISPs.
I'm not proposing that this filtering be done by ISPs, at least not at the
final levels. There is simply (in aggregate) far more computing power available
(and more FREELY available) at the recipient machines than there is anywhere
else en route.
At least as a first-cut, I think it's fine to do all the filtering at the
recipient level (and especially if the downloading and filtering can be done in
a nonblocking way, hopefully mostly transparently, which is of particular
concern for dialup users).
Keeping the statistical data for
each recipient is expensive. In organisations it might be possible to
keep statistical data on a per-department basis, with a possible
loss of accuracy, but for ISPs this can't be done.
Straw man. I don't think we need to limit our discussion to only those
techniques which are suitable for ISP-level implementation, especially if that
seems to eliminate consideration of the most promising and useful approaches.
Gordon Peterson http://personal.terabites.com/
1977-2002 Twenty-fifth anniversary year of Local Area Networking!
Support free and fair US elections! http://stickers.defend-democracy.org
12/19/98: Partisan Republicans scornfully ignore the voters they "represent".
12/09/00: the date the Republican Party took down democracy in America.
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg