
Re: [Asrg] Comments: draft-irtf-asrg-criteria-00.txt

2007-01-25 11:38:49

David Nicol wrote:
> On 1/25/07, Chris Lewis <clewis(_at_)nortel(_dot_)com> wrote:
>
>> My personally trained bayesian filtering has an absolutely abysmal track
>> record.  On the spam aimed at the false positive handling address, which
>> by design has _no_ filtering, Bayesian has an effectiveness rate of
>> about 50%.  Yuck.  No amount of personal twiddling, custom rules,
>> explicit pattern matching in my UA is going to make much difference to
>> that.
>
> I wonder if Paul Graham was being underhanded and evil or merely naive when
> he claimed that the level of noise allowed to pass through Bayesian
> filters would be acceptable http://www.paulgraham.com/spam.html

I had a long discussion (people bought popcorn to watch the fireworks,
but were disappointed ;-) with another of the 2 or 3 "big name" Bayesian
proselytizers at a big virus conference some years ago.

At the time, I realized one simple truth - the whole thing was based on
ridiculously small sample sizes, e.g. fewer than 100 emails/day.  I laughingly
suggested I'd aim _my_ load at him (3-4 orders of magnitude higher), at
which point he visibly winced and got rather quiet ;-)

I've also found it rather disconcerting that Graham's (and others')
papers about the possibility of fooling Bayes are rather egregiously
flagrant arm-waving and sleight-of-hand.

In all fairness, the guy I spoke to did say that TBird's Bayes was
rather badly (he almost thought _deliberately_) crippled.  When I reset
TBird's Bayes training, the filtering actually gets better (except for a
short-lived spike in FPs), but then gets worse over time.  Explain
_that_ ;-)

[I have this possibly naive impression that (some?) Bayes classification
is vulnerable to "flooding" - smearing/smudging the spamminess/
non-spamminess of words into a vast morass of "not very anything",
leading to most computed scores being so close to neutral that Bayes'
conservative score thresholding treats them as "I dunno", and lets them
through.  Of course, spam with few words (e.g. graphical spam) is
particularly hard to deal with via Bayes techniques.]
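
[To make the "flooding" worry concrete, here is a rough sketch, not
anybody's actual filter.  It assumes a Graham-style naive-Bayes
combination of per-token spam probabilities and a conservative
two-threshold decision.  The thresholds (0.1/0.9), the token
probabilities, and the smear() helper that pulls each probability toward
0.5 are all made-up assumptions standing in for whatever flooding really
does to the training data.]

```python
import math

def combine(probs):
    """Naive-Bayes combination of per-token spam probabilities.

    Computes prod(p) / (prod(p) + prod(1 - p)) in log space.
    Assumes every p is strictly between 0 and 1.
    """
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

def classify(score, ham_max=0.1, spam_min=0.9):
    """Conservative thresholding: anything in between is 'unsure'.

    ham_max and spam_min are illustrative values, not taken from any
    particular product.
    """
    if score >= spam_min:
        return "spam"
    if score <= ham_max:
        return "ham"
    return "unsure"

def smear(p, weight):
    """Hypothetical model of flooding: pull p toward the neutral 0.5.

    weight=0 leaves p untouched; weight=1 makes the token fully neutral.
    """
    return 0.5 + (p - 0.5) * (1.0 - weight)

# Hypothetical token probabilities for one message, before flooding.
sharp = [0.95, 0.90, 0.85, 0.20]
# The same tokens after their statistics have been heavily diluted.
smeared = [smear(p, 0.9) for p in sharp]

print(classify(combine(sharp)))    # well-separated tokens
print(classify(combine(smeared)))  # near-neutral tokens
```

[With these made-up numbers the first message lands well above the spam
threshold, while the smeared version ends up in the "I dunno" band and
would be let through.]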

He was showing interesting results with his Bayes - it even did a decent
job of distinguishing viral binaries from non-virus binaries.  But that
wouldn't work with per-emit munging.  His viral detection appeared to be
largely based on making Bayes "score" variable length binary fragments,
and noticing the commonality of particular code fragments in malware.
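
[I never saw his code, so the following is only my guess at the general
shape of the idea: treat every short byte substring as a "word" and
estimate how strongly each one is associated with malware.  The fragment
lengths (2 to 4 bytes), the add-one smoothing, and the toy samples are
all assumptions of mine.  It also shows why munging each copy would hurt:
the scheme depends on the same fragments recurring across samples.]

```python
from collections import Counter

def fragments(data, min_len=2, max_len=4):
    """Return the set of byte substrings with length min_len..max_len."""
    return {
        data[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(data) - n + 1)
    }

def train(samples):
    """Count, per fragment, how many samples contain it."""
    counts = Counter()
    for sample in samples:
        counts.update(fragments(sample))
    return counts

def fragment_prob(frag, bad_counts, n_bad, good_counts, n_good):
    """Smoothed estimate of how 'viral' a fragment is.

    Uses add-one smoothing on the per-corpus frequencies.  An unseen
    fragment scores exactly 0.5 only when both corpora are the same size.
    """
    bad = (bad_counts[frag] + 1) / (n_bad + 2)
    good = (good_counts[frag] + 1) / (n_good + 2)
    return bad / (bad + good)

# Toy corpora; the byte strings are invented for illustration.
malware = [b"\x90\x90\xeb\xfe\x01", b"\x02\x90\x90\xeb\xfe"]
clean = [b"hello world", b"plain text"]

bad_counts = train(malware)
good_counts = train(clean)

# A fragment shared by both "malware" samples scores above neutral.
print(fragment_prob(b"\xeb\xfe", bad_counts, len(malware),
                    good_counts, len(clean)))
```

[The per-fragment numbers would then be combined into a message score the
same way word probabilities are.]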

But even if it worked as claimed, the vast majority of our users simply
aren't interested in training Bayes (or ANYTHING ELSE) to recognize spam.

"Dear CEO, senior executives, staff and shareholders, we require you to
spend [in some cases VERY expensive] manhours to train/configure your
filters".

To which I'd get the reply "that's _your_ job, you're obviously not
doing it...".

I'd fire _myself_ for incompetence ;-)

They want spam stopped, _period_.  This is as true in the ISP world as
it is in the corporate one.

Out of a population of 60,000-120,000 users, with more than 10 years of
experience, we've had fewer than five people ask to have their filters
turned off.  All of those changed their minds after I told them how many
spams we block for them ;-)

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg
