Re: [Asrg] 3. Requirements - Non Spam must go through

On Thu, Jul 10, 2003 at 07:47:08AM -0400, C. Wegrzyn wrote


Walter Dnes wrote:

On Tue, Jul 08, 2003 at 07:52:10AM -0400, C. Wegrzyn wrote

I might suggest something slightly different: why not have it
delivered and marked as SPAM in the Subject line? In this way at
least I can check to see if it really is spam? I'm afraid that if it
is returned and we have false positives some mail might be returned
that I do want to see.


 a) Why would I want to wade through 15 megabytes of spam after not
    reading my mail for a few days ?

 b) My inbox may be the *TARGET* of 15 megs of spam, but it can only
    hold 10 megabytes in total... oops.

Because detecting spam isn't exact. I can get the number down - to a 
mere handful - but I know for certain that I have seen email from this 
list be tagged as SPAM on my client. There was some feature in it that 
classified it as spam...until I trained the system I would have lost 
them all...


  Welcome to...

  Walter's 1st (f)law of content-based spam detection;
  You can *NOT* use content-based spam detection against email from a
mailing list that discusses spam.

  Think about it for a minute.  On any spam-discussion mailing list, it
is on topic to show examples of spam.  If a spam-detector is working
properly, it *WILL* detect the spam-samples, and flag the spam-discussion
email as spam.

  Corollary 1
  Content-based spam detection is imperfect, but if you *INSIST* on
using it, the best approach requires that you...
  a) absolutely whitelist any spam-discussion mailing-lists
  b) do *NOT* include emails from spam-discussion mailing-lists in the
     filter's "learning" mode.

  Item a) is obvious.  Item b) becomes obvious with a little thought.
By including the spam-discussion list, and its spam samples, in your
"not-spam" corpus, you pollute the "not-spam" database, and encourage
the filter to accept email that contains spammy content.
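  For illustration only, here is a rough Python sketch of Corollary 1. It is not taken from any particular filter: the `List-Id` value, the `handle` function and the `ham_corpus` list are all hypothetical names, and it assumes the list software adds a `List-Id` header. List mail bypasses the content filter (item a) and is never added to the "not-spam" training corpus (item b).

```python
from email import message_from_string

# Assumed List-Id value, for illustration only.
WHITELISTED_LISTS = {"asrg.ietf.org"}


def list_id(msg):
    """Return the bare List-Id of a message, or None if it has none."""
    raw = msg.get("List-Id")
    if raw is None:
        return None
    raw = raw.strip()
    # The header usually looks like: Description <list.example.org>
    if "<" in raw and raw.endswith(">"):
        raw = raw[raw.rfind("<") + 1:-1]
    return raw.lower()


def handle(msg, content_filter, ham_corpus):
    """Return 'deliver' or 'spam'; train the ham corpus on non-list mail only.

    content_filter is any callable that returns True for spam (hypothetical).
    """
    if list_id(msg) in WHITELISTED_LISTS:
        # (a) whitelisted: skip the content filter entirely
        # (b) do not learn from it: nothing is appended to ham_corpus
        return "deliver"
    if content_filter(msg):
        return "spam"
    ham_corpus.append(msg)
    return "deliver"
```

  With a toy filter that flags any body containing a spammy keyword, a list post quoting a spam sample is still delivered, and the "not-spam" corpus never sees it.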

  Walter's 2nd (f)law of content-based spam detection;
  Even 100% correct (0% false positives and 0% false negatives) content-
based spam detection that properly flags 14 megabytes of spam and 1
megabyte of non-spam is useless given an inbox with 5 or 10 megabytes of
capacity.

  Think about it for a minute.  If your inbox can hold 5 or 10 megabytes
total, enough spam can fill it up, and cause legitimate email to be
bounced/rejected due to insufficient disk space.
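  The arithmetic can be sketched in a few lines of Python. This is a toy model, not a real mail store: it assumes messages are 1 megabyte each, that the spam arrives before the legitimate mail, and that tagging a message does not shrink it. The function name and quota value are made up for the example.

```python
MB = 1024 * 1024


def deliver_in_order(messages, quota_bytes):
    """Deliver (label, size_bytes) pairs in arrival order into a fixed quota.

    Returns (accepted_labels, bounced_labels). A message that does not fit
    in the remaining space is bounced.
    """
    used = 0
    accepted, bounced = [], []
    for label, size in messages:
        if used + size <= quota_bytes:
            used += size
            accepted.append(label)
        else:
            bounced.append(label)
    return accepted, bounced


# 14 MB of (perfectly tagged) spam arrives, followed by 1 MB of real mail.
arrivals = [("spam", MB)] * 14 + [("legit", MB)]

# Accept-and-tag: every message still occupies space in a 10 MB inbox.
tag_accepted, tag_bounced = deliver_in_order(arrivals, 10 * MB)

# Block at the MTA: spam never reaches the inbox at all.
not_blocked = [m for m in arrivals if m[0] != "spam"]
blk_accepted, blk_bounced = deliver_in_order(not_blocked, 10 * MB)
```

  Under these assumptions the tagged inbox fills with 10 spam messages and the legitimate message lands in `tag_bounced`, while the blocking setup accepts it with room to spare.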

  As you can tell, I use an MTA-based approach that blocks, rather than
accepts-and-tags, email considered to be spam.  In addition to avoiding
the 2 (f)laws of content-based spam detection, it doesn't contribute to
the mailbombing of innocent 3rd parties whose addresses have been forged
by spammers.
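
  To make the last point concrete, here is a deliberately simplified Python model of who the *receiving* server ends up mailing under each policy. The policy names and the function are invented for this sketch; it only encodes the general SMTP behaviour that a message refused during the SMTP session is never accepted by the receiver, whereas a message accepted and bounced later produces a notice to the envelope sender, which spammers typically forge.

```python
def bounce_recipients(envelope_sender, is_spam, policy):
    """Return the addresses the receiving server itself sends a bounce to.

    policy is one of the hypothetical labels:
      'reject_at_smtp'      refuse with a 5xx reply during the session
      'accept_and_tag'      accept, mark as spam, deliver to the inbox
      'accept_then_bounce'  accept, then return the message to the sender
    """
    if not is_spam:
        return []
    if policy == "reject_at_smtp":
        # The message was never accepted, so the receiver generates nothing.
        return []
    if policy == "accept_and_tag":
        # Nothing is returned, but the message still consumes inbox space.
        return []
    if policy == "accept_then_bounce":
        # The bounce goes to the (probably forged) envelope sender.
        return [envelope_sender]
    raise ValueError("unknown policy: %r" % policy)


forged = "innocent@example.com"  # example address only
hit_by_bounce = bounce_recipients(forged, True, "accept_then_bounce")
hit_by_reject = bounce_recipients(forged, True, "reject_at_smtp")
```

  In this model only the accept-then-bounce policy ever mails the forged address.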

-- 
Walter Dnes <waltdnes(_at_)waltdnes(_dot_)org>
Email users are divided into two classes;
1) Those who have effective spam-blocking
2) Those who wish they did

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg