
Re: Rule to filter for letter-number combinations?

2004-05-28 11:16:38
At 17:31 2004-05-28 +0200, Kai Schaetzl wrote:
mention by just rejecting all HTML messages

IMO, that was pretty dumb.

I admit that I give HTML-*ONLY* messages a spammishness score. That doesn't make them inherently spam, though it has proven to be a very solid indicator.

am not interested in such. I just wanted to point Jim Witte to a solution
which might fit him better in the long run.

In the process however, you made the statement that nobody could do as good a job with procmail as SA can. Not that SA might involve less personal investment of time into managing the process, but that outright, SA was better than anything anyone could implement in procmail.

I've personally witnessed both sides. My personal experience says that procmail wins out.

> 2) I didn't re-invent the wheel.  My rules came first.

Again, I didn't talk about you. Or did I? Adding a few obfuscation rules
*is* reinventing the wheel.

The wheels on my cars differ from the wheels on your car, but then several of my cars can do 160MPH in stock fitment, so perhaps there's a reason they didn't just slap some 13" tin wheel from a Hyundai on them. Materials, appearance, features, etc. all vary. A different DESIGN doesn't imply reinvention of the basic concept.

However, on the point of the basic concept -- a *LOT* of spam filtering techniques were around before SA, with procmail certainly chief among the tools providing access to spam filtering. I think it's safe to say that SA did a lot of reinvention itself. If reinventing the wheel is such a stupid idea, what on earth were the people involved in SA thinking?

People here CHOOSE to use procmail, and while you're certainly welcome to the opinion that SA (resource-hungry as it is) is more capable than procmail, don't expect that people who've been using procmail nearly since its inception are going to accept that they can't produce a more effective and tunable solution than something directed at the masses.

Even without spam, I'd be using procmail, and not just for mere filing into folders either. If I've got it already, why add ANOTHER tool to the mix when procmail can handle it quite readily?

While I can send any of my cars to a garage and have them do work on them, I choose to do most everything myself (except wheels, and machining obviously goes out to a machine shop) - *I* retain far more control that way. Since I understand how they work and have the necessary tools and experience, this is a feasible option for me. It's not the same for somebody picking up a wrench for the first time, without complete service documentation, thinking they can dive in and do a better job at an engine overhaul than a well-equipped professional mechanic.

Somebody new to mail filtering is certainly likely to find that taking their mail to a canned solution works better for them. That DOES NOT mean that the canned solution is in fact better than what someone with a clue can manage themselves (or via a forum such as this).

> 3) Procmail as I have it configured is about 100 (or is it 500?)
>    times lighter on the machine.  Most of my tests are headers-only.

In that case I'd just need to change some names and get lots of spam thru.
You cannot fight spam almost only with header rules unless you use a lot of
name-specific blocking and risk a high FP rate (which you have).

That statement says you simply do not comprehend the sheer number of spammy factors which are found in the headers. Bogus/stale datestamps, message-ids, forged sender hosts, hostnames with consumeresque naming, from=to, plurality of similar-named recipients, no from, domain in from not resolvable, etc. -- and none of those factors even involves lists of things such as keywords or domains, which remain useful header-based methods as well.
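To make a few of those header factors concrete, here is a minimal sketch in Python (not procmail, and not the actual rules discussed in this post). The function name `header_flags`, the flag labels, and the 12-hour skew limit are all assumptions chosen for illustration; it only covers from=to, no from, missing message-id, no To/Cc/Bcc, and an implausible Date header.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def header_flags(raw, received_at, max_skew=timedelta(hours=12)):
    """Return a list of suspicious header traits found in a raw message.

    received_at must be a timezone-aware datetime; max_skew is an
    arbitrary illustrative limit, not a value taken from the post.
    """
    msg = message_from_string(raw)
    flags = []

    # parseaddr() returns (display name, address); only the address matters here.
    from_addr = parseaddr(msg.get("From", ""))[1].lower()
    to_addr = parseaddr(msg.get("To", ""))[1].lower()

    if not from_addr:
        flags.append("no from")
    elif from_addr == to_addr:
        flags.append("from=to")

    if not msg.get("Message-ID"):
        flags.append("no message-id")

    if not any(msg.get(h) for h in ("To", "Cc", "Bcc")):
        flags.append("no to/cc/bcc")

    try:
        sent = parsedate_to_datetime(msg.get("Date", ""))
        if sent.tzinfo is None:
            # A "-0000" zone yields a naive datetime; treat it as UTC.
            sent = sent.replace(tzinfo=timezone.utc)
        if abs(sent - received_at) > max_skew:
            flags.append("bogus/stale date")
    except (TypeError, ValueError):
        # A missing or unparseable Date header is treated as bogus too.
        flags.append("bogus/stale date")

    return flags
```

Nothing here reads the body; each flag would then feed a score rather than decide the message's fate on its own.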

I've got dozens of header-only checks in my arsenal, and the only things that check the body are:

        * Body starts with HTML tags, but was identified as text/plain
                (in the headers)

        * Message claims (in the headers) to be multipart/alternative, but
                lacks a text/plain section.

        * abundance of HTML comments

        * opt-out references

        * non-digest message with an abundance of exclamations (I call it
                "enthusiastic twit")

        * Nigerian scams

        * "this isn't spam" disclaimers

That's it. While these are useful tests, more often than not, they're not necessary for identifying spam - the messages will have crossed the spammishness threshold based upon the header evaluations alone.
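The first two body checks in that list compare what the headers claim with what the message actually contains. A rough Python sketch of the idea follows; the function names and the short list of HTML tags are assumptions for illustration, not the procmail conditions actually in use.

```python
import re
from email import message_from_string

# Hypothetical, deliberately short list of tags that suggest an HTML body.
HTML_START = re.compile(
    r"\s*<(!doctype|html|head|body|font|table|div|p|center)\b", re.I
)


def html_disguised_as_plain(msg):
    """Headers say text/plain, but the body opens with an HTML tag."""
    if msg.is_multipart() or msg.get_content_type() != "text/plain":
        return False
    # Undecoded payload: a base64 or quoted-printable body would need
    # decoding first, which this sketch does not attempt.
    return bool(HTML_START.match(msg.get_payload()))


def alternative_without_plain(msg):
    """Headers claim multipart/alternative, but no text/plain part exists."""
    if msg.get_content_type() != "multipart/alternative":
        return False
    # Only direct child parts are inspected; nested multiparts are ignored.
    parts = msg.get_payload() if msg.is_multipart() else []
    return not any(p.get_content_type() == "text/plain" for p in parts)
```

Note that Python's parser treats a message with no Content-Type header as text/plain, so the first check also fires for such messages; whether that is desirable is a judgment call.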

Here are a few examples of messages flagged (in this morning's status message) - and none of the characteristics were body related:

SPAM: +30 Advisory - may be forged warning
SPAM: +75 received without messageid, injected by local mailserver
SPAM: +75 Received headers include suspicious reference
SPAM: +125 relay hostname appears to be consumer dialup/broadband
SPAM: +45 MIME - multipart/related
SPAM: Advisory - spammishness is 350
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00  SBS  20040512/1941
>From ali(_at_)minuevaweb(_dot_)com  Thu May 27 02:55:39 2004
 Subject: Venta de PC's
  Folder:  gzip -9fc >> spam.gz


One message that matched *12* different header issues:

SPAM: +75 received without messageid, injected by local mailserver
SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +50 allcaps subject
SPAM: +300 Foreign character set encoding (gb2312) in body.
SPAM: +(249*2) raw 8-bit characters in the Subject/From/To
SPAM: +(249*2) No To/cc/bcc
SPAM: +45 no X-Envelope-To
SPAM: +75 no non-list cleartext recipient matching X-Envelope-To
SPAM: +50 embedded space on subject
SPAM: +249+262001 Subject Scoring match 262001
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 264400
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00  SBS  20040512/1941
>From asdf2(_at_)tom(_dot_)net  Thu May 27 16:55:01 2004
Subject: ¿ìËÙÓÐЧµÍ³É±¾µÄ¹ÜÀí±ä¸ï
  Folder:  gzip -9fc >> spam.gz

(that allcaps subject is a bit misleading here - I don't check specifically for caps, but rather the absence of lowercase when there are at least some number of characters in the subject)
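That "absence of lowercase" test can be sketched in a few lines of Python. The name `lacks_lowercase` and the minimum length of 8 are assumptions (the actual minimum isn't stated here), and the sketch only treats ASCII a-z as lowercase, which is why a subject made entirely of 8-bit or non-Latin characters would also trip it.

```python
def lacks_lowercase(subject, min_chars=8):
    """True when the subject is long enough and contains no ASCII lowercase.

    min_chars is a hypothetical cutoff; short subjects are never flagged.
    """
    stripped = subject.strip()
    if len(stripped) < min_chars:
        return False
    # Only a-z counts as lowercase, so digits, punctuation, and
    # non-ASCII characters do not rescue a subject from being flagged.
    return not any("a" <= ch <= "z" for ch in stripped)
```

So "BUY NOW!!!" is flagged, "Venta de PC's" is not, and a subject with no Latin letters at all is flagged even though nothing in it is really "caps".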


SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +100 Date is suspicious at 42841 seconds {365 11:54:01} AFTER reception
SPAM: +80 bogus reply context headers
SPAM: +80 spam type statements (80)
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 819
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00  SBS  20040512/1941
>From Zuiderveld(_at_)webmages(_dot_)com  Thu May 27 18:55:14 2004
 Subject: qeg software
  Folder:  gzip -9fc >> spam.gz

Yes, "spam type statements" is a body check, but you'll see the score was already well above what it needed to be to flag it.

SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +100 Date is suspicious at 42841 seconds {365 11:54:01} AFTER reception
SPAM: +175 IP 219.249.141.190 listed in dialup DNSBL
SPAM: +80 bogus reply context headers
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 914
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00  SBS  20040512/1941
>From layers(_at_)cci(_dot_)dk  Thu May 27 19:52:54 2004
 Subject: bda software
  Folder:  gzip -9fc >> spam.gz
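The status reports above are additive: each matched characteristic contributes its weight, and the message is flagged once the total exceeds the threshold of 249. As an illustration only, this Python sketch re-adds the "SPAM: +N" lines from such a report. The names `term_value` and `spammishness` are made up for the example, and the parser only understands the three forms seen in these logs (a plain number, `(249*2)`, and `249+262001`).

```python
import re

THRESHOLD = 249  # the threshold quoted in the log lines

# Matches "SPAM: +75 ...", "SPAM: +(249*2) ..." and "SPAM: +249+262001 ...".
SCORE_LINE = re.compile(r"^SPAM: \+(\(?[0-9+*]+\)?) ")


def term_value(expr):
    """Evaluate '75', '(249*2)' or '249+262001' without using eval()."""
    total = 0
    for addend in expr.strip("()").split("+"):
        product = 1
        for factor in addend.split("*"):
            product *= int(factor)
        total += product
    return total


def spammishness(log_lines):
    """Sum every 'SPAM: +N ...' line; advisory lines without '+' are skipped."""
    return sum(
        term_value(m.group(1))
        for m in map(SCORE_LINE.match, log_lines)
        if m
    )


report = [
    "SPAM: +30 Advisory - may be forged warning",
    "SPAM: +75 received without messageid, injected by local mailserver",
    "SPAM: +75 Received headers include suspicious reference",
    "SPAM: +125 relay hostname appears to be consumer dialup/broadband",
    "SPAM: +45 MIME - multipart/related",
    "SPAM: Advisory - spammishness is 350",
]
score = spammishness(report)
print(score, score > THRESHOLD)  # prints "350 True"
```

Run against the first report, the weights add up to the 350 shown in its advisory line, and none of them came from a body test.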

>    False-positive: all dates: 1.1%
>    False-negative: all dates: 0.1%

FTR, given the load of my email, I have perhaps a false-positive rate on individual _characteristics_ of circa 2% -- but that is merely identifying a characteristic of a message as being spammy (say, a forged freemail address, which alone isn't enough to say it's spam), not flagging the message as spam. Taken as multiple characteristics, I have a false positive rate of 0.25 - 0.5%, and that is almost purely restricted to discussion lists, where it so happens that the two heavily-weighted characteristics which trigger the most falses for me are:

        * furrin
        * excessive punctuation in subject

I've considered setting these two rules up to NOT be invoked for discussion lists - if I did that, my false positive rate would drop to virtually nil.

FTR, after implementing rules to check for consumer hostnames (broadband and dialup type patterns in the host relaying to me) as well as a broadband DNSBL (but at procmail, not the MTA), spam pretty much dropped off my radar - I had a series of messages earlier this month which prompted me to implement a rule to deal with broadband type senders, and around the 14th, I implemented the change. I've had ZILCH for nearly two weeks, and then just one spam through yesterday morning - and that message was to a contact account which for various reasons needs to not be subjected to risking losing ANYTHING to filtering.
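For readers unfamiliar with the two techniques, here is a hedged Python sketch of the general idea (the real rules run in procmail and are not reproduced here). The hostname patterns are a small hypothetical sample, and the zone `dialups.dnsbl.example` is a placeholder, not a real blocklist.

```python
import ipaddress
import re

# Hypothetical sample of "consumeresque" naming: access-type keywords,
# or an IP address embedded in the hostname (e.g. 68-1-2-3.example.net).
CONSUMER_HOST = re.compile(
    r"(dial|dsl|cable|ppp|pool|dhcp|dyn|broadband)"
    r"|(\d{1,3}[-.]){3}\d{1,3}",
    re.I,
)


def looks_consumer(hostname):
    """True if the relay hostname resembles a dialup/broadband pattern."""
    return bool(CONSUMER_HOST.search(hostname))


def dnsbl_query_name(ip, zone="dialups.dnsbl.example"):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone.

    DNSBLs are queried by reversing the octets and appending the zone;
    the default zone here is a placeholder for illustration only.
    """
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone
```

For the relay in the last report above, `dnsbl_query_name("219.249.141.190")` gives `190.141.249.219.dialups.dnsbl.example`; if that name resolves in the real zone, the address is listed. Doing the lookup at delivery time rather than at the MTA means a listing only adds to the score instead of rejecting the message outright.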

> Oh, and all the users running spamc on our system causes the mail server
> to overload with regularity.
>

Of course. Running spamc from procmail is inefficient in particular and
letting each user combat spam by himself is inefficient in general.

Letting each user determine what works for them, however, is what keeps clients happy. In a medium to large hosting environment, it makes sense to set things up so that users can opt in to a central spam filtering solution, and it's certainly a selling point for some users -- but it shouldn't be forced upon them.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


