Re: Rule to filter for letter-number combinations?

At 17:31 2004-05-28 +0200, Kai Schaetzl wrote:

mention by just rejecting all HTML messages


IMO, that was pretty dumb.

I admit that I classify HTML-*ONLY* with a spammishness score, but it doesn't make it inherently spam, though it's proven to be a very solid indicator.

am not interested in such. I just wanted to point Jim Witte to a solution
which might fit him better in the long run.

In the process however, you made the statement that nobody could do as good a job with procmail as SA can. Not that SA might involve less personal investment of time into managing the process, but that outright, SA was better than anything anyone could implement in procmail.

I've personally witnessed both sides. My personal experience says that procmail wins out.

> 2) I didn't re-invent the wheel.  My rules came first.

Again, I didn't talk about you. Or did I? Adding a few obfuscation rules
*is* reinventing the wheel.

The wheels on my cars differ from the wheels on your car, but then several of my cars can do 160MPH in stock fitment, so perhaps there's a reason they didn't just slap some 13" tin wheel from a Hyundai on them. Materials, appearance, features, etc. all vary. A different DESIGN doesn't imply reinvention of the basic concept.

However, on the point of the basic concept -- a *LOT* of spam filtering techniques were around before SA. Procmail was certainly chief among the tools providing access to spam filtering. I think it's safe to say that SA did a lot of reinvention itself. If reinventing the wheel is such a stupid idea, what on earth were the people involved in SA thinking?

People here CHOOSE to use Procmail, and while you're certainly welcome to the opinion that SA (resource-hungry as it is) is more capable than procmail, don't expect that people who've been using procmail nearly since its inception are going to accept that they can't produce a more effective and tuneable solution than something directed at the masses.

Even without spam, I'd be using procmail, and not just for mere filing into folders either. If I've got it already, why add ANOTHER tool to the mix when procmail can handle it quite readily?

While I can send any of my cars to a garage and have them do work on them, I choose to do most everything myself (except wheels, and machining obviously goes out to a machine shop) - *I* retain far more control that way. Since I understand how they work and have the necessary tools and experience, this is a feasible option for me. It's not the same for somebody picking up a wrench for the first time, without complete service documentation, thinking they can dive in and do a better job at an engine overhaul than a well-equipped professional mechanic.

Somebody new to mail filtering is certainly likely to find that taking their mail to a canned solution works better for them. That DOES NOT mean that the canned solution is in fact better than what someone with a clue can manage themselves (or via a forum such as this).

> 3) Procmail as I have it configured is about 100 (or is it 500?)
>    times lighter on the machine.  Most of my tests are headers-only.

In that case I'd just need to change some names and get lots of spam thru.
You cannot fight spam almost only with header rules unless you use a lot of
name-specific blocking and risk a high FP rate (which you have).

That statement says you simply do not comprehend the sheer number of spammy factors which are found in the headers. Bogus/stale datestamps, message-ids, forged sender hosts, hostnames with consumeresque naming, from=to, plurality of similar-named recipients, no from, domain in from not resolvable, etc. -- and none of those factors even involve lists of things such as keywords or domains, which remain useful header-based methods as well.
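
As an illustration only, here is a minimal Python sketch of three of the header factors named above (no from, from=to, missing message-id) plus a missing To/cc/bcc check. The actual filter discussed here is written in procmail; the function name `header_flags` and the flag strings are hypothetical, chosen for this example.

```python
from email import message_from_string
from email.utils import parseaddr


def header_flags(raw_message):
    """Return a list of header-only characteristics found in raw_message.

    Illustrative only: the names and the set of checks are assumptions,
    not the checks used by the filter described in this post.
    """
    msg = message_from_string(raw_message)
    flags = []

    # parseaddr() returns (display name, address); only the address matters here.
    from_addr = parseaddr(msg.get("From", ""))[1].lower()
    to_addr = parseaddr(msg.get("To", ""))[1].lower()

    if not from_addr:
        flags.append("no from")
    if from_addr and from_addr == to_addr:
        flags.append("from=to")
    if not msg.get("Message-ID"):
        flags.append("no message-id")
    if not any(msg.get(h) for h in ("To", "Cc", "Bcc")):
        flags.append("no To/cc/bcc")
    return flags
```

None of these checks reads the body, so the message content never has to be scanned to evaluate them.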

I've got dozens of header-only checks in my arsenal, and the only things that check the body are:


        * Body starts with HTML tags, but was identified as text/plain
                (in the headers)

        * Message claims (in the headers) to be multipart/alternative, but
                lacks a text/plain section.

        * abundance of HTML comments

        * opt-out references

        * non-digest message with an abundance of exclamations (I call it
                "enthusiastic twit")

        * Nigerian scams

        * "this isn't spam" disclaimers

That's it. While these are useful tests, more often than not, they're not necessary for identifying spam - the messages will have crossed the spammishness threshold based upon the header evaluations alone.

Here are a few examples of messages flagged (in this morning's status message) - and none of the characteristics were body related:


SPAM: +30 Advisory - may be forged warning
SPAM: +75 received without messageid, injected by local mailserver
SPAM: +75 Received headers include suspicious reference
SPAM: +125 relay hostname appears to be consumer dialup/broadband
SPAM: +45 MIME - multipart/related
SPAM: Advisory - spammishness is 350
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00  SBS  20040512/1941
>From ali(_at_)minuevaweb(_dot_)com  Thu May 27 02:55:39 2004
 Subject: Venta de PC's
  Folder:  gzip -9fc >> spam.gz
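
The arithmetic in the status message above can be sketched in Python as follows. The weights and the threshold of 249 are taken from that log. The function names are hypothetical, and the strict "greater than" comparison is an assumption based on the "exceeds threshold" wording.

```python
THRESHOLD = 249  # threshold value reported in the status message


def spammishness(hits):
    """Sum the weights of all matched characteristics.

    hits is a list of (weight, description) pairs.
    """
    return sum(weight for weight, _ in hits)


def is_spam(hits, threshold=THRESHOLD):
    """A message is flagged once its total exceeds the threshold (assumed strict)."""
    return spammishness(hits) > threshold


# The five characteristics from the status message above.
example_hits = [
    (30, "Advisory - may be forged warning"),
    (75, "received without messageid, injected by local mailserver"),
    (75, "Received headers include suspicious reference"),
    (125, "relay hostname appears to be consumer dialup/broadband"),
    (45, "MIME - multipart/related"),
]
```

With these weights the total comes to 350, matching the "spammishness is 350" advisory line, and no single characteristic is enough to flag the message by itself.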


one message that matched *12* different header issues:

SPAM: +75 received without messageid, injected by local mailserver
SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +50 allcaps subject
SPAM: +300 Foreign character set encoding (gb2312) in body.
SPAM: +(249*2) raw 8-bit characters in the Subject/From/To
SPAM: +(249*2) No To/cc/bcc
SPAM: +45 no X-Envelope-To
SPAM: +75 no non-list cleartext recipient matching X-Envelope-To
SPAM: +50 embedded space on subject
SPAM: +249+262001 Subject Scoring match 262001
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 264400
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00  SBS  20040512/1941
>From asdf2(_at_)tom(_dot_)net  Thu May 27 16:55:01 2004

Subject:¿ìËÙÓÐÐ§µÍ³É±¾µÄ¹ÜÀí±ä¸ï

  Folder:  gzip -9fc >> spam.gz

(that allcaps subject is a bit misleading here - I don't check specifically for caps, but rather the absence of lowercase when there are at least some minimum number of characters in the subject)
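
A rough Python sketch of that kind of test, for illustration only. The minimum length of 8 is a hypothetical value (the real figure isn't given here), and treating "lowercase" as ASCII a-z is an assumption. Under that assumption a subject made up entirely of 8-bit characters has no lowercase letters, which would explain why the rule fired on the message above.

```python
MIN_CHARS = 8  # hypothetical minimum subject length; the actual value is not stated


def subject_lacks_lowercase(subject, min_chars=MIN_CHARS):
    """True if the subject is long enough and contains no ASCII lowercase letter.

    This is deliberately not an "is all caps" test: digits, punctuation and
    non-ASCII characters also count as "not lowercase".
    """
    if len(subject) < min_chars:
        return False
    return not any("a" <= ch <= "z" for ch in subject)
```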



SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +100 Date is suspicious at 42841 seconds {365 11:54:01} AFTER reception
SPAM: +80 bogus reply context headers
SPAM: +80 spam type statements (80)
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 819
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00  SBS  20040512/1941
>From Zuiderveld(_at_)webmages(_dot_)com  Thu May 27 18:55:14 2004
 Subject: qeg software
  Folder:  gzip -9fc >> spam.gz

Yeah, spam type statements is a body check, but you'll see the score was already well above what it needed to be to flag it.


SPAM: +125 Single received header for foreign sender
SPAM: +35 from_domain not found in received chain
SPAM: +150 No rDNS for host passing message to our MX
SPAM: +100 Date is suspicious at 42841 seconds {365 11:54:01} AFTER reception
SPAM: +175 IP 219.249.141.190 listed in dialup DNSBL
SPAM: +80 bogus reply context headers
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 914
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.07.00  SBS  20040512/1941
>From layers(_at_)cci(_dot_)dk  Thu May 27 19:52:54 2004
 Subject: bda software
  Folder:  gzip -9fc >> spam.gz
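
For illustration, a check along the lines of the "Date is suspicious" line in the two status messages above might look like this in Python. The tolerance of six hours is a hypothetical value, and how the original filter obtains the reception time isn't shown here, so the caller passes it in.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_SKEW_SECONDS = 6 * 3600  # hypothetical tolerance; the actual value is not stated


def date_skew_seconds(date_header, received_at):
    """Seconds by which the Date: header is AFTER reception (negative if before).

    received_at must be a timezone-aware datetime, and date_header must
    carry a numeric timezone so that both values are comparable.
    """
    sent = parsedate_to_datetime(date_header)
    return int((sent - received_at).total_seconds())


def is_date_suspicious(date_header, received_at, max_skew=MAX_SKEW_SECONDS):
    """Flag a Date: header too far from reception time in either direction."""
    return abs(date_skew_seconds(date_header, received_at)) > max_skew
```

A message received at 19:52:54 UTC whose Date: header claims 07:46:55 UTC the next day would come out 42841 seconds (11:54:01) after reception, the same offset as in the logs.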

>    False-positive: all dates: 1.1%
>    False-negative: all dates: 0.1%

FTR, given the load of my email, I have a false-positive rate on individual _characteristics_ of circa 2% -- but that is merely for identifying a characteristic of a message as being spammy (say, a forged freemail address, which alone isn't enough to say it's spam), not for flagging the message as spam. When taken as multiple characteristics, I have a false positive rate of 0.25 - 0.5 %, and that is almost purely restricted to discussion lists, where it so happens that the two heavily-weighted characteristics which trigger the most falses for me are:


        * furrin
        * excessive punctuation in subject

I've considered setting these two rules up to NOT be invoked for discussion lists - if I did that, my false positive rate would drop to virtually nil.

FTR, after implementing rules to check for consumer hostnames (broadband and dialup type patterns in the host relaying to me) as well as a broadband DNSBL (but at procmail, not the MTA), spam pretty much dropped off my radar. I had a series of messages earlier this month which prompted me to implement a rule to deal with broadband type senders, and around the 14th, I implemented the change. I've had ZILCH for nearly two weeks, and then just one spam through yesterday morning - and that message was to a contact account which for various reasons must not risk losing ANYTHING to filtering.
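
To make the two ideas concrete, here is a small Python sketch, not the procmail rules themselves. The hostname pattern is a hypothetical, deliberately crude example of "consumer type" naming, and the zone name `dnsbl.example` is a placeholder. The reversed-octet query name is the usual DNSBL convention, and the lookup itself is left out.

```python
import ipaddress
import re

# Hypothetical pattern: common access-network keywords, or an IPv4 address
# embedded in the hostname with dots or dashes. Real rules would need tuning.
CONSUMER_HOST = re.compile(
    r"(dial|dsl|cable|ppp|dhcp|pool|dynamic)|(\d{1,3}[-.]){3}\d{1,3}",
    re.IGNORECASE,
)


def looks_consumer(hostname):
    """True if the relay hostname resembles a dialup/broadband customer host."""
    return bool(CONSUMER_HOST.search(hostname))


def dnsbl_query_name(ip, zone):
    """Build the name to look up in a DNSBL: reversed octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone
```

For the address in the log above, `dnsbl_query_name("219.249.141.190", "dnsbl.example")` gives `190.141.249.219.dnsbl.example`. A keyword pattern this simple will misfire on some legitimate hostnames, which is one reason to treat a match as a score contribution rather than an outright rejection.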

> Oh, and all the users running spamc on our system causes the mail server
> to overload with regularity.
>

Of course. Running spamc from procmail is inefficient in particular and
letting each user combat spam by himself is inefficient in general.

Letting each user determine what works for them however is what keeps clients happy. In a medium to large hosting environment, it makes sense to set things up so that users can opt in to a central spam filtering solution, and it's certainly a sell-point for some users -- but it shouldn't be forced upon them.


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


