At 17:31 2004-05-28 +0200, Kai Schaetzl wrote:
mention by just rejecting all HTML messages
IMO, that was pretty dumb.
I admit that I assign HTML-*ONLY* messages a spammishness score, but that
alone doesn't make them inherently spam, though it's proven to be a very
solid indicator.
am not interested in such. I just wanted to point Jim Witte to a solution
which might fit him better in the long run.
In the process however, you made the statement that nobody could do as good
a job with procmail as SA can. Not that SA might involve less personal
investment of time into managing the process, but that outright, SA was
better than anything anyone could implement in procmail.
I've personally witnessed both sides. My personal experience says that
procmail wins out.
> 2) I didn't re-invent the wheel. My rules came first.
Again, I didn't talk about you. Or did I? Adding a few obfuscation rules
*is* reinventing the wheel.
The wheels on my cars differ from the wheels on your car, but then several
of my cars can do 160MPH in stock fitment, so perhaps there's a reason
they didn't just slap some 13" tin wheel from a Hyundai on
them. Materials, appearance, features, etc. all vary. A different DESIGN
doesn't imply reinvention of the basic concept.
However, on the point of the basic concept -- a *LOT* of spam filtering
techniques were around before SA, with Procmail certainly chief among the
tools providing access to spam filtering. I think it's safe to say that SA did a
lot of reinvention itself. If reinventing the wheel is such a stupid idea,
what on earth were the people involved in SA thinking?
People here CHOOSE to use Procmail, and while you're certainly welcome to
the opinion that SA (resource-hungry as it is) is more capable than
procmail, don't expect that people who've been using procmail nearly since
its inception are really going to believe that they can't produce a more
effective and tunable solution than something directed at the masses.
Even without spam, I'd be using procmail, and not just for mere filing into
folders either. If I've got it already, why add ANOTHER tool to the mix
when procmail can handle it quite readily?
While I can send any of my cars to a garage and have them do work on them,
I choose to do most everything myself (except wheels, and machining
obviously goes out to a machine shop) - *I* retain far more control that
way. Since I understand how they work and have the necessary tools and
experience, this is a feasible option for me. It's not the same for
somebody picking up a wrench for the first time, without complete service
documentation thinking they can dive in and do a better job at an engine
overhaul than a well-equipped professional mechanic.
Somebody new to mail filtering is certainly likely to find taking their
mail to a canned solution is going to work better for them. That DOES NOT
mean that the canned solution is in fact better than what someone with a
clue can manage themselves (or via a forum such as this).
> 3) Procmail as I have it configured is about 100 (or is it 500?)
> times lighter on the machine. Most of my tests are headers-only.
In that case I'd just need to change some names and get lots of spam thru.
You cannot fight spam almost only with header rules unless you use a lot of
name-specific blocking and risk a high FP rate (which you have).
That statement says you simply do not comprehend the sheer number of spammy
factors which are found in the headers. Bogus/stale datestamps,
message-ids, forged sender hosts, hostnames with consumeresque naming,
from=to, plurality of similar-named recipients, no from, domain in from not
resolvable, etc. -- and none of those factors even involve lists of things
such as keywords or domains, which remain useful header-based methods as well.
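
As a rough illustration of how a few of the header factors listed above can be
tested without reading the body, here is a minimal Python sketch. It is not the
procmail ruleset discussed in this message: the function name, the flag labels
and the six-hour skew limit are assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def header_flags(raw_message, received_at, max_skew=timedelta(hours=6)):
    """Return labels for header-only characteristics found in raw_message.

    received_at must be a timezone-aware datetime (when we took delivery).
    max_skew is an assumed tolerance, not a value taken from the post.
    """
    msg = message_from_string(raw_message)
    flags = []

    # Only the address part matters here; display names are ignored.
    from_addr = parseaddr(msg.get("From", ""))[1].lower()
    to_addr = parseaddr(msg.get("To", ""))[1].lower()

    if not from_addr:
        flags.append("no from")
    elif from_addr == to_addr:
        flags.append("from=to")

    if not msg.get("Message-ID"):
        flags.append("no message-id")

    date_header = msg.get("Date")
    try:
        sent_at = parsedate_to_datetime(date_header) if date_header else None
    except (TypeError, ValueError):
        sent_at = None

    if sent_at is None:
        flags.append("missing or unparseable date")
    else:
        if sent_at.tzinfo is None:
            # A "-0000" zone parses as naive; treat it as UTC.
            sent_at = sent_at.replace(tzinfo=timezone.utc)
        if abs(sent_at - received_at) > max_skew:
            flags.append("bogus/stale datestamp")

    return flags
```

Each label would then carry its own weight in the scoring, so that no single
characteristic decides the outcome by itself.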
I've got dozens of header-only checks in my arsenal, and the only things
that check the body are:
* Body starts with HTML tags, but was identified as text/plain
(in the headers)
* Message claims (in the headers) to be multipart/alternative, but
lacks a text/plain section.
* abundance of HTML comments
* opt-out references
* non-digest message with an abundance of exclamations (I call it
"enthusiastic twit")
* Nigerian scams
* "this isn't spam" disclaimers
That's it. While these are useful tests, more often than not, they're not
necessary for identifying spam - the messages will have crossed the
spammishness threshold based upon the header evaluations alone.
Here are a few examples of messages flagged (in this morning's status
message) - and none of the characteristics were body related:
SPAM: +30 Advisory - may be forged warning
SPAM: +75 received without messageid, injected by local mailserver
SPAM: +75 Received headers include suspicious reference
SPAM: +125 relay hostname appears to be consumer dialup/broadband
SPAM: +45 MIME - multipart/related
SPAM: Advisory - spammishness is 350
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00 SBS 20040512/1941
>From ali(_at_)minuevaweb(_dot_)com Thu May 27 02:55:39 2004
Subject: Venta de PC's
Folder: gzip -9fc >> spam.gz
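
The arithmetic behind a status entry like the one above is just a weighted sum
compared against a threshold. The Python sketch below reproduces it with the
weights and the threshold of 249 copied from the log; the function names and
the list-of-tuples layout are assumptions for illustration, not how the
procmail rules are actually written.

```python
THRESHOLD = 249  # "spammishness exceeds threshold of 249" in the status lines


def spammishness(hits):
    """Sum the weights of every (weight, description) pair that matched."""
    return sum(weight for weight, _description in hits)


def is_spam(hits, threshold=THRESHOLD):
    """A message is flagged only when its total exceeds the threshold."""
    return spammishness(hits) > threshold


# Weights and descriptions copied from the first status entry above.
first_message = [
    (30, "Advisory - may be forged warning"),
    (75, "received without messageid, injected by local mailserver"),
    (75, "Received headers include suspicious reference"),
    (125, "relay hostname appears to be consumer dialup/broadband"),
    (45, "MIME - multipart/related"),
]

print(spammishness(first_message), is_spam(first_message))
```

No single characteristic in that list reaches the threshold by itself; it is
the accumulation that flags the message.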
one message that matched *12* different header issues:
SPAM: +75 received without messageid, injected by local mailserver
SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +50 allcaps subject
SPAM: +300 Foreign character set encoding (gb2312) in body.
SPAM: +(249*2) raw 8-bit characters in the Subject/From/To
SPAM: +(249*2) No To/cc/bcc
SPAM: +45 no X-Envelope-To
SPAM: +75 no non-list cleartext recipient matching X-Envelope-To
SPAM: +50 embedded space on subject
SPAM: +249+262001 Subject Scoring match 262001
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 264400
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00 SBS 20040512/1941
>From asdf2(_at_)tom(_dot_)net Thu May 27 16:55:01 2004
Subject: 快速有效低成本的管理变革
Folder: gzip -9fc >> spam.gz
(that allcaps subject is a bit misleading here - I don't check specifically
for caps, but rather the absence of lowercase when there are at least some
number of characters in the subject)
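
That "absence of lowercase" test can be sketched in a few lines of Python. The
minimum length of 8 is a made-up placeholder, since the actual number isn't
stated here, and the function name is likewise hypothetical.

```python
MIN_SUBJECT_CHARS = 8  # hypothetical; the real minimum is not given in the post


def lacks_lowercase(subject, min_chars=MIN_SUBJECT_CHARS):
    """True when the subject is long enough and contains no lowercase letter.

    This deliberately does not look for capitals, so a subject made up
    entirely of digits, punctuation or 8-bit/non-Latin characters also
    matches.
    """
    stripped = subject.strip()
    if len(stripped) < min_chars:
        return False
    return not any(ch.islower() for ch in stripped)
```

This is why a subject in a foreign character set can pick up the "allcaps"
score even though it contains no capital letters at all.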
SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +100 Date is suspicious at 42841 seconds {365 11:54:01} AFTER reception
SPAM: +80 bogus reply context headers
SPAM: +80 spam type statements (80)
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 819
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00 SBS 20040512/1941
>From Zuiderveld(_at_)webmages(_dot_)com Thu May 27 18:55:14 2004
Subject: qeg software
Folder: gzip -9fc >> spam.gz
Yes, "spam type statements" is a body check, but you'll see the score was
already well above what it needed to be to flag the message.
SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +100 Date is suspicious at 42841 seconds {365 11:54:01} AFTER reception
SPAM: +175 IP 219.249.141.190 listed in dialup DNSBL
SPAM: +80 bogus reply context headers
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 914
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00 SBS 20040512/1941
>From layers(_at_)cci(_dot_)dk Thu May 27 19:52:54 2004
Subject: bda software
Folder: gzip -9fc >> spam.gz
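
For the "listed in dialup DNSBL" line above: a DNSBL is conventionally queried
by reversing the octets of the IPv4 address and prepending them to the list's
zone. The Python sketch below only builds that query name and performs no
lookup. The zone `dnsbl.example.org` is a placeholder, not the list used here.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone.

    Raises ValueError if ip is not a valid IPv4 address.
    """
    addr = ipaddress.IPv4Address(ip)
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# The address from the status entry above, against a placeholder zone.
print(dnsbl_query_name("219.249.141.190", "dnsbl.example.org"))
```

If an A-record lookup on the resulting name returns an address, the IP is
listed; an NXDOMAIN answer means it is not.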
> False-positive: all dates: 1.1%
> False-negative: all dates: 0.1%
FTR, given the load of my email, I have perhaps a false-positive rate of
circa 2% on individual _characteristics_ -- but that is merely identifying
a characteristic of a message as spammy (say, a forged freemail address,
which alone isn't enough to say it's spam), not flagging the message as
spam. When multiple characteristics are taken together, I have a false
positive rate of 0.25 - 0.5%, and that is almost purely restricted to
discussion lists, where it so happens that the two heavily-weighted
characteristics which trigger the most falses for me are:
* furrin
* excessive punctuation in subject
I've considered setting these two rules up to NOT be invoked for discussion
lists - if I did that, my false positive rate would drop to virtually nil.
FTR, after implementing rules to check for consumer hostnames (broadband
and dialup type patterns in the host relaying to me) as well as a broadband
DNSBL (checked in procmail, not at the MTA), spam pretty much dropped off my
radar. A series of messages earlier this month prompted me to implement a
rule to deal with broadband type senders, and I made the change around the
14th. I've had ZILCH for nearly two weeks since, and then just one spam got
through yesterday morning - and that message was to a contact account which
for various reasons can't be subjected to the risk of losing ANYTHING to
filtering.
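
To give an idea of what a consumer-hostname check might look like, here is a
small Python sketch. The keyword list and the embedded-IP pattern are guesses
at typical broadband/dialup naming, not the patterns in the actual rules, and
a real ruleset would need exceptions for legitimate hosts that happen to match.

```python
import re

# Hypothetical keywords often seen in ISP-assigned customer hostnames.
# Each must be delimited by a dot, dash, underscore, digit or string edge.
CONSUMER_KEYWORDS = re.compile(
    r"(?:^|[.\-_\d])"
    r"(?:adsl|dsl|cable|dialup|dial|ppp|dhcp|pool|dyn|dynamic|broadband)"
    r"(?=[.\-_\d]|$)",
    re.IGNORECASE,
)

# Four numeric groups in a row usually mean the IP is baked into the name.
EMBEDDED_IP = re.compile(r"\d{1,3}[.\-_]\d{1,3}[.\-_]\d{1,3}[.\-_]\d{1,3}")


def looks_like_consumer_host(hostname):
    """Heuristic: does the relay hostname resemble a dialup/broadband client?"""
    return bool(
        CONSUMER_KEYWORDS.search(hostname) or EMBEDDED_IP.search(hostname)
    )
```

A match would add to the spammishness score rather than reject the message
outright, in keeping with the weighted approach described above.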
> Oh, and all the users running spamc on our system causes the mail server
> to overload with regularity.
>
Of course. Running spamc from procmail is inefficient in particular and
letting each user combat spam by himself is inefficient in general.
Letting each user determine what works for them however is what keeps
clients happy. In a medium to large hosting environment, it makes sense to
set things up so that users can opt in to a central spam filtering
solution, and it's certainly a sell-point for some users -- but it
shouldn't be forced upon them.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.