On Thu, Jul 10, 2003 at 07:47:08AM -0400, C. Wegrzyn wrote
> Walter Dnes wrote:
>> On Tue, Jul 08, 2003 at 07:52:10AM -0400, C. Wegrzyn wrote
>>> I might suggest something slightly different: why not have it
>>> delivered and marked as SPAM in the Subject line? In this way at
>>> least I can check to see if it really is spam? I'm afraid that if it
>>> is returned and we have false positives some mail might be returned
>>> that I do want to see.
>>
>> a) Why would I want to wade through 15 megabytes of spam after not
>> reading my mail for a few days ?
>> b) My inbox may be the *TARGET* of 15 megs of spam, but it can only
>> hold 10 megabytes in total... oops.
>
> Because detecting spam isn't exact. I can get the number down - to a
> mere handful - but I know for certain that I have seen email from this
> list be tagged as SPAM on my client. There was some feature in it that
> classified it as spam...until I trained the system I would have lost
> them all...

Welcome to...
Walter's 1st (f)law of content-based spam detection:
You can *NOT* use content-based spam detection against email from a
mailing list that discusses spam.
Think about it for a minute. On any spam-discussion mailing list, it
is on topic to show examples of spam. If a spam-detector is working
properly, it *WILL* detect the spam-samples, and flag the spam-discussion
email as spam.
Corollary 1
Content-based spam detection is imperfect, but if you *INSIST* on
using it, the best approach requires that you...
a) absolutely whitelist any spam-discussion mailing-lists
b) do *NOT* include emails from spam-discussion mailing-lists in the
filter's "learning" mode.
Item a) is obvious. Item b) becomes obvious with a little thought.
By including the spam-discussion list, and its spam samples, in your
"not-spam" corpus, you pollute the "not-spam" database, and encourage
the filter to accept email that contains spammy content.
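
Items a) and b) can be sketched in a few lines of Python. This is only an illustration, not a description of any particular filter: the List-Id value "spam-talk.example.org", the whitelist set, and the function names are all hypothetical, and it assumes the list software adds a List-Id header to each message.

```python
from email import message_from_string

# Hypothetical whitelist of List-Id values for spam-discussion lists.
SPAM_DISCUSSION_LISTS = {"spam-talk.example.org"}

def list_id(msg):
    """Return the bare identifier from a List-Id header, or None."""
    raw = msg.get("List-Id")
    if not raw:
        return None
    raw = raw.strip()
    # The header usually looks like: Description <list.example.org>
    if "<" in raw and raw.endswith(">"):
        raw = raw[raw.rfind("<") + 1:-1]
    return raw.strip().lower()

def plan(msg):
    """Decide what to do with a message before any content scoring."""
    whitelisted = list_id(msg) in SPAM_DISCUSSION_LISTS
    return {
        # Item a): whitelisted list mail bypasses the content filter.
        "run_content_filter": not whitelisted,
        # Item b): whitelisted list mail never enters either corpus.
        "use_for_training": not whitelisted,
    }

sample = message_from_string(
    "List-Id: Spam talk <spam-talk.example.org>\n"
    "Subject: today's sample\n\nBUY NOW!!!\n"
)
print(plan(sample))
```

The point of checking the list header first is that the decision never depends on the message body, so spam samples quoted on the list cannot influence either the verdict or the training data.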
Walter's 2nd (f)law of content-based spam detection:
Even 100% correct (0% false positives and 0% false negatives) content-
based spam detection that properly flags 14 megabytes of spam and 1
megabyte of non-spam is useless given an inbox with 5 or 10 megabytes of
capacity.
Think about it for a minute. If your inbox can hold 5 or 10 megabytes
total, enough spam can fill it up, and cause legitimate email to be
bounced/rejected due to insufficient disk space.
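
The arithmetic can be checked with a small simulation. This is a sketch under stated assumptions: every message is 1 megabyte, the legitimate message arrives last, and the detector is the perfect one described in the 2nd (f)law. The function names are made up for illustration.

```python
MB = 1_000_000

def accept_and_tag(messages, quota):
    """Store every message, tagged or not, until the quota is exhausted."""
    used = 0
    stored, bounced = [], []
    for size, is_spam in messages:
        if used + size <= quota:
            used += size
            stored.append((size, is_spam))
        else:
            bounced.append((size, is_spam))
    return stored, bounced

def block_at_mta(messages, quota):
    """Refuse spam before delivery, so it never counts against the quota."""
    return accept_and_tag([m for m in messages if not m[1]], quota)

# 14 megabytes of spam followed by 1 megabyte of legitimate mail.
backlog = [(MB, True)] * 14 + [(MB, False)]

for policy in (accept_and_tag, block_at_mta):
    stored, bounced = policy(backlog, 10 * MB)
    lost = sum(1 for _, is_spam in bounced if not is_spam)
    print(policy.__name__, "stored:", len(stored), "legitimate bounced:", lost)
```

With a 10 megabyte quota, the accept-and-tag policy fills the inbox with ten correctly tagged spam messages and bounces the legitimate one. Blocking before delivery stores only the legitimate message.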
As you can tell, I use an MTA-based approach that blocks, rather than
accepts-and-tags, email considered to be spam. In addition to avoiding
the 2 (f)laws of content-based spam detection, it doesn't contribute to
the mailbombing of innocent 3rd parties whose addresses have been forged
by spammers.
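
The difference between blocking and accepting can be modelled as follows. This is a toy model, not my actual MTA configuration: the function names and the looks_like_spam flag are placeholders for whatever test the MTA applies, and the reply text is illustrative. The only facts relied on are that a 5xx reply is an SMTP permanent failure and 250 signals acceptance.

```python
def reject_at_smtp(envelope_from, looks_like_spam):
    """Block during the SMTP session: nothing is stored, no bounce is created."""
    if looks_like_spam:
        # The connecting client gets the refusal directly.
        return {"reply": "550 message refused", "stored": False, "bounce_to": None}
    return {"reply": "250 OK", "stored": True, "bounce_to": None}

def accept_then_bounce(envelope_from, looks_like_spam):
    """Accept first, return later: the bounce goes to the envelope sender."""
    if looks_like_spam:
        # If the sender address was forged, an innocent 3rd party gets this.
        return {"reply": "250 OK", "stored": False, "bounce_to": envelope_from}
    return {"reply": "250 OK", "stored": True, "bounce_to": None}

forged = "victim@example.org"  # hypothetical forged sender address
print(reject_at_smtp(forged, True))
print(accept_then_bounce(forged, True))
```

In the first case the refusal is seen only by the machine that connected. In the second, the receiving system itself sends mail to the forged address.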
--
Walter Dnes <waltdnes(_at_)waltdnes(_dot_)org>
Email users are divided into two classes;
1) Those who have effective spam-blocking
2) Those who wish they did
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg