Re: Spammish?

On 18 Feb, fleet(_at_)teachout(_dot_)org wrote:
| Ok.  In the last 17 days I've received 4771 messages.  Of these, 242
| messages have been identified as spam by ONE recipe (Received:
| .*daemon@localhost).  There have been NO false positives.  How
| close to a fish do you have to get before you call it a fish - genetically?
| 
| Will I get a false positive?  Probably.  But will it be 1 in 1,000 or 1 in
| 10,000 or ...?
| 
| If the message gets 1000 points for "good" things - proper address, valid
| Message-Id, reasonable Subject, etc., then the score for this message
| would need to be 1001 for spam (at least).
| 
| I'm getting grumpy.  Time to quit and go to bed.

Ok Grumpy. Sorry for beating a dead horse. I know you're bailing out,
but I *wanted* to reply to this before you said that, so that should
count for something. ;-)

Let's make sure we're on the same page on a couple of things.  Nobody
is saying that anything short of a gradation method is wrong, or even
necessarily less good.  The beauty of all this wonderful software is
anyone can do almost anything they want, limited only by their own
imagination.

Maybe I'm misreading your fuzziness on this, but I think this might
help. My SPAMSCORE threshold right now is 4.  Some of the recipes I
have increment SPAMSCORE by 5, meaning they are by themselves enough
for me to identify spam. That's just like your Message-ID recipe which
(at least right now) is 100% reliable for you.  You needn't feel less of
an accomplishment for having that tool in your box, or for utilizing it
the way you do. It's a good thing, even under a spammishness scheme.
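
To make the threshold idea concrete, here is a minimal sketch in Python
rather than procmail, purely as an illustration.  The threshold of 4 and
the weight of 5 come from the paragraph above; the rule labels, the
classify() helper, and the use of ">=" for the comparison are
assumptions made for the example, not anybody's actual rcfile.

```python
# Threshold taken from the message; treating "score >= threshold" as spam
# is an assumption of this sketch.
THRESHOLD = 4


def classify(hits, threshold=THRESHOLD):
    """Sum the weights of the rules that fired and compare to the threshold.

    `hits` is a list of (label, weight) pairs, one per rule that matched.
    Returns (score, is_spam).
    """
    score = sum(weight for _label, weight in hits)
    return score, score >= threshold
```

A single rule weighted 5 crosses the threshold by itself, which is the
black-or-white case.  Two weak rules weighted 1 each only reach 2 and
leave the message alone.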

Note above I added parenthetically "at least right now". I think that's
an important consideration in the spam fight.  The spammers constantly
upgrade their arsenals to foil filters. Just think of recent discussions
regarding html comments nonsensically embedded in content and urls, and
unnecessary encoding of text messages.  These are obvious attempts to
bypass content filters. The spammers have, over the years, learned to
quit sending obvious signs in the headers, although they replace them
with others.  Some will remember discussing X-UIDL and X-PMFLAGS headers
here. At one time, filtering on them was a real find. They figured it
out and stopped whatever it was they were doing with those headers, and
the effectiveness was almost completely degraded.

So what does that mean in the context of this discussion?  One of the
first spam filters I used was !(To|Cc): me.  Years ago it probably
caught 95% or more of the spam.  Today it catches almost none. I keep
referring to the small amount of spam I get, but the address I have with
my provider is a spam magnet because it's in the whois databases for my
domain registrations.  All it gets is spam, literally, and I have thousands
of them saved.  In recent months probably fewer than 1 in 10 is a Bcc:, so
that recipe is mostly deprecated but not totally useless. Where it
might have incremented the SPAMSCORE by enough to blow through the
threshold back then, it only adds 1 today.
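
As a rough illustration of the "not addressed to me" test, here is a
Python sketch.  It compares parsed addresses instead of doing the looser
regex match a procmail condition would do, so it is an analogue and not
a translation.  MY_ADDRESS and the function name are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

MY_ADDRESS = "me@example.com"  # hypothetical address, for illustration only


def not_addressed_to_me(raw_message, me=MY_ADDRESS):
    """Return True when `me` appears in neither To: nor Cc: (a likely Bcc:)."""
    msg = message_from_string(raw_message)
    # Collect every To: and Cc: header; a missing header yields an empty list.
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = getaddresses(headers)
    return all(addr.lower() != me.lower() for _name, addr in recipients)
```

Under a scoring scheme, a True result here would add a small weight
(1 in the description above) instead of deciding the message's fate.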

So why bother with it?  Because in conjunction with other recipes it
still has value, even if it has virtually none by itself. Another recipe
tests for "foes" [e.g. \<((hot|pronto)mail|msn|yahoo)\.com\>] and still
another tests for html.  Given my email flow, I can't plonk a message
for being a Bcc:.  And I can't plonk it just for being html.  And I
can't plonk it just for purporting to be from yahoo.  I can't even
plonk it for all 3 of those.  BUT, the foes recipe adds 1 for each
different foe identified.  If the message has a yahoo Return-Path and
an msn Received: header, AND is html AND a Bcc:, THEN it's clear enough
to me that it's spam.  Of course, one of your Message-Id or other
recipes might already have identified it as spam. Another of mine
might also unilaterally do the same. Either way, we both get there by
different methods.
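
The combination described above can be sketched like this, again in
Python as an illustration only.  The pattern is the "foes" regex from
the text, with \< and \> rewritten as \b for Python's re module.  The
weight of 1 for html is an assumption (the text only gives the weights
for foes and Bcc:), as are the function names and the idea of passing
the html and Bcc: results in as flags.

```python
import re

# The "foes" pattern from the text; \b stands in for the \< and \> anchors.
FOE_RE = re.compile(r"\b((?:hot|pronto)mail|msn|yahoo)\.com\b", re.IGNORECASE)

THRESHOLD = 4  # threshold quoted earlier in the message


def distinct_foes(header_text):
    """Return the set of different foe domains named anywhere in the headers."""
    return {match.group(1).lower() for match in FOE_RE.finditer(header_text)}


def spam_score(header_text, is_html, is_bcc):
    """Add 1 per distinct foe, 1 for html (assumed weight), and 1 for a Bcc:."""
    score = len(distinct_foes(header_text))
    if is_html:
        score += 1
    if is_bcc:
        score += 1
    return score
```

A yahoo Return-Path plus an msn Received: header, html, and a Bcc: give
1 + 1 + 1 + 1 = 4, which reaches the threshold.  Drop the msn header and
the same message scores 3 and passes.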

If you have recipes that work, that's great. But remember that what
works today might not work tomorrow.  The ratware that spews this stuff
might someday use Message-Id's that don't trip your filter. If you're
catching enough spam now with one-stop black or white recipes, that's
good but it might not last.  I know I've snagged some spam that wouldn't
have been caught any other way than cumulative scoring.  There was no
smoking gun, if you will, just a preponderance of evidence from
multiple sources.

Nobody here is saying you're doing it wrong, or that they're doing it
better.  They're just trying to show you other and/or supplemental
ways.  Whether they're better or not is up to you and nobody else.  A
couple of times I've added context along the lines of "Given my mail
flow".  I haven't stressed it, but it's an important consideration. 
What works for one person doesn't necessarily work for another.  I get
few Bcc:'s (other than lists which are always identified before getting
to the spam checks), and few html messages.  For me that's reasonably
predictable, but obviously wouldn't be for others.  You asked some good
questions, got some good answers, but I wouldn't pretend to tell you
what's "right".  That's a dubious notion within this context anyway, and
definitely something that you must answer for yourself. This is just
about being able to choose from a wider universe of possibilities if
that interests you.

Good luck.


-- 
Email address in From: header is valid  * but only for a couple of days *
This is my reluctant response to spammers' unrelenting address harvesting



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail