ietf

Re: Why Spam is a problem

2002-08-19 10:03:31
From: John Stracke <jstracke(_at_)centivinc(_dot_)com>

That would be somewhat less useful in this case, though, since each user 
has their own table of keywords.

That contradicts other assumptions about this mechanism

Whose? The author of the original article was very explicit that he was 
advocating users have individual tables.

please read what I wrote instead of what you wish I had written.


...
be related to future samples from new spammers?  If spam is uniform,
then why do users need private tables of keywords?

I think it was to reduce false positives--because the profile of 
different users' legitimate mail is nonuniform.

I think the purpose is to reduce false negatives at least as much as to
reduce false positives.  As far as I can tell, the nature of the system
makes it at best difficult to adjust per-user tables to reduce false
positives.  (The standard usage is that a "false positive" is rejected
legitimate mail while a "false negative" is spam that leaks past a
filter.)  I've assumed that if implemented in production, the system
would use a corpus of manually identified spam and that the system
would automatically recompute the scoring with only the samples
from the last year or so.
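
A minimal sketch of that kind of recomputation, assuming a per-user
corpus of hand-labelled messages with receipt dates.  The function
name, the one-year default, and the smoothed per-token ratio are
illustrative assumptions, not details taken from the original article;
the sketch also assumes the corpus holds labelled legitimate mail
alongside the spam.

```python
from collections import Counter
from datetime import datetime, timedelta


def recompute_scores(corpus, now, max_age_days=365):
    """Rebuild one user's keyword table from recent labelled mail.

    corpus: iterable of (received: datetime, is_spam: bool, text: str).
    Returns {token: score}, where a score near 1.0 means the token is
    seen mostly in spam and a score near 0.0 mostly in legitimate mail.
    """
    cutoff = now - timedelta(days=max_age_days)
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0

    for received, is_spam, text in corpus:
        if received < cutoff:
            continue  # samples older than the window no longer count
        tokens = set(text.lower().split())  # count each token once per message
        if is_spam:
            n_spam += 1
            spam_counts.update(tokens)
        else:
            n_ham += 1
            ham_counts.update(tokens)

    scores = {}
    for token in set(spam_counts) | set(ham_counts):
        # Smoothed fraction of spam / legitimate messages containing the token.
        p_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_ham = (ham_counts[token] + 1) / (n_ham + 2)
        scores[token] = p_spam / (p_spam + p_ham)
    return scores
```

Note that tokens seen only in samples older than the window drop out
of the table entirely, so the table can only follow new spammers
through fresh, manually identified samples.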


The major problem is that the mechanism requires a significant and
continuing false-negative rate to keep the scoring tuned as spammers
come and go.

I dunno; keeping the tuning up to date sounds like a strength to me.  It 
requires some level of effort, but a much lower level than deleting 
every piece of spam by hand.

That's a straw man.  Of course it's good to keep your filters up to
date, but there are many other tactics that require less work of
individuals and fewer false positives than this scheme.  The reason
the spam problem exists is that more than 99.99% of users cannot be
bothered to report spam to ISPs.  This scheme requires false negatives,
probably at least 5% or 10%.  That's a lot of spam users would have
to read compared to some other tactics.


Of course, the main problem with any and every such system is that it
is looking for characteristics other than "unsolicited" and "bulk."

Yes, and the main problem with the DCC is that it does not.

When I moved last fall, I went through old mail, harvested the addresses 
of old friends, and sent out mail with my new address.  Some of these 
people had never received email from me (they and I were CC:ed on the 
same messages from other friends), so I would not have been on their 
whitelist.  I don't know how many people I sent to, but it was certainly 
more than 10--which you say counts as bulk.  So, if at least 10 of those 
people had been using the DCC, then my message would have been tagged as 
UBE, and some of them would not have gotten it.  I suppose one might 
argue this message was bulk email, but I knew every one of those people 
personally, considered them friends (even if I hadn't seen them since 
college), and had reason to believe that they would be at least somewhat 
pleased to keep track of me.  Why should that be filtered?

If those friends had sent mail to you, you might well be on their
whitelists.  If they had never sent mail to you, and since you had
never sent mail to them, then why would you presume to clutter their
mailboxes with news of your move?

One good reason to filter such mail is that contrary to your hope, it
would have been viewed as "spam" or at least useless by many recipients
in similar situations, albeit not necessarily those friends of yours.
Most of us receive more than enough "new address" mail from people
we don't know very well and have never sent mail to.

Another reason to filter such mail is that in general it is useless
noise even from the point of view of its sender, if the sender sets
aside the normal human perspective of being the unique center of the
universe.  It is useless noise because, unless you send only a very
little mail, you cannot hope to reach more than a small fraction
of your correspondents with your change of address notice.  The only
workable tactics to deal with moving are to make new friends, hope your
old friends can track you down, or get a permanent address.

Such change of address mail is usually motivated by the same human
frailty that causes spam.  Everyone thinks that spam is something that
other people do.

A better way to summarize the main problem with the DCC is that it
requires the use of per-user or at least per-enterprise white lists.
It is not a small problem.
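
A rough sketch of why the white lists are unavoidable, assuming a
shared counter of message checksums and the "more than 10 counts as
bulk" threshold mentioned above.  The class and function names are
hypothetical, and a plain SHA-256 of the body stands in for the fuzzy
checksums a real clearinghouse would use; this is not the DCC's actual
interface.

```python
import hashlib
from collections import Counter

BULK_THRESHOLD = 10  # assumption: more than 10 reports counts as bulk


class ChecksumCounter:
    """Hypothetical stand-in for a shared checksum clearinghouse."""

    def __init__(self):
        self.counts = Counter()

    def report(self, body):
        """Record one more sighting of this body; return the running total."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self.counts[digest] += 1
        return self.counts[digest]


def is_filtered(counter, sender, body, whitelist):
    """Decide whether one recipient's filter rejects this message.

    The checksum says only that the message is bulk.  Whether it was
    solicited is known only to the recipient, via the white list.
    """
    count = counter.report(body)  # every recipient reports, listed sender or not
    if sender in whitelist:
        return False
    return count > BULK_THRESHOLD
```

With an empty white list, the eleventh recipient of an identical
change-of-address note would see it rejected, while a recipient who
had listed the sender would still receive it.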


I'm not advocating the Bayesian approach as a silver bullet, mind you; 
but I think it's an interesting area to look into.  Even if it doesn't 
work, the general idea of filtering based on personalized statistics 
could lead to something that works better.

We agree about that.  

I'm irritated by the hype this notion has received, such as the use
of the phrase "Bayesian approach" to imply it is a revolutionary
invention.  That phrase has some relevance for tuning the scoring, but
more as a formal description and computerization of what people have
been doing informally and manually for years.


Vernon Schryver    vjs(_at_)rhyolite(_dot_)com


