RE: Rule to filter for letter-number combinations?

Kai Schaetzl wrote:

Dallman Ross wrote on Thu, 27 May 2004 13:54:29 +0200:

1) It doesn't work better than my procmail rules.


But maybe it would work better for Jim Witte? I don't deny that
your rules work quite well for you. They have grown over years for
the type of spam that you get. About two years ago I had about the
same FP/FN rates you mention by just rejecting all HTML messages and
using about 100 - 150 "buzz" words. And if I reactivated it and added
some regexps to it, it would still work for most spam, maybe a bit
less efficiently. But it doesn't work on a large scale. I'm sorry for
starting this debate; I didn't intend to and am not interested in one. I
just wanted to point Jim Witte to a solution which might fit him
better in the long run.

2) I didn't re-invent the wheel.  My rules came first.


Again, I didn't talk about you. Or did I? Adding a few obfuscation
rules *is* reinventing the wheel.

3) Procmail as I have it configured is about 100 (or is it 500?)
times lighter on the machine.  Most of my tests are headers-only.

In that case I'd just need to change some names and get lots of spam
through.  You cannot fight spam almost exclusively with header rules unless you
use a lot of name-specific blocking and risk a high FP rate (which you
have).


I agree that SpamAssassin may be just what the doctor ordered for
Jim Witte, or any other individual who finds it useful.
I have no real bone to pick with SA.  I just find that my stuff
works better, and far more efficiently, for me.

However, I do not have any recipes that look at specific words,
except one "gimme" that looks for "^AD(V)?(ert)?\>" or something
like that, which doesn't catch much but which really doesn't
ever false-poz, so it's harmless and can stay there.

I don't use any name-specific blocking.  I only have a high
FP rate because I choose not to get any spam in my main inbox.
My advertised FP rate *includes my "maybe" file* (actually
called "purgatory")!!  I have a three-pronged decision-tree:
spam, not spam, or maybe spam.  I expect not to get much in the
"maybe" pile, so when I do, I am disappointed, and I count
good mail landing there as a false-poz.  I probably am being
too hard on myself.  If I took that out, my FP rate would be
quite a bit lower.  Still, as I said in an earlier post, I
have 22 domains converging here; and besides getting lots of
spam, I get much more weirdly formatted good mail than I
think the average person does.  And I get lots of mail from
unknown persons in foreign countries, and in a couple of
languages.  So my FPs are going to be higher than they would be
for people with a more normal mail flow.  :-)
Finally, I could minimize FPs easily and have more iffy stuff 
land in the inbox.  But I choose not to do it that way.  So the 
decision about how to cause things to land in FP or FN piles is 
one (flexible) choice; but the total of false things in the 
aggregate is more important.

Oh, an afterthought on FPs: you should understand that I
do allow Bcc's in general.  I'm really not making it easy
on myself, in other words.  If your FP stats don't include
Bcc'd mail (e.g., because you're blocking it), then you
can see that my stats are conservatively stated.

SA on my good mail gets me about 5% FPs.  I don't have any
customizations in place for it.

   They are effective as follows (stats over more than a year):

   False-positive: all dates: 1.  False-negative: all dates: 0.  1%


I consider this a very bad FP rate. It would not be acceptable for our
clients. If it were the other way around, it would be acceptable.


If I were running this for clients, I would turn the decision-
tree on its head and be much more conservative about how I
characterize the mail.  That's not really an issue, except that
my stats are broken down that way because I chose to be aggressive
for my own mail.  Again, SA does much worse for me -- about
five times as badly.

Another thing, and this is important to me:  I do not want my
mail service provider (e.g., ISP) filtering my mail!!  Not any
of it!!  I trust my rules much more than I trust any they would
concoct.  I like my shell service very much indeed, and one
reason is that they will not filter my mail (unless I want
them to turn some SMTP-transaction stuff on, but then I wouldn't
see the rejects).  I won't have an account where my main mail
goes, where the provider does filtering for me!  (The only 
exception is that panix does a bit of virus rejection at the 
SMTP connection level.)

Oh, and all the users running spamc on our system causes
the mail server to overload with regularity.


Of course. Running spamc from procmail is inefficient in particular
and letting each user combat spam by himself is inefficient in
general.


So you're saying you want to do systemwide filtering, then.
Not on my mail, you're not!  :)

I will give you an example of a recipe of mine that catches
mondo spam.  Here is an English description of what this
one example does:

It simply looks at how many Received headers there are and what the
X-Mailer line says, if there is one; and deduces, e.g., that Outlook
Express (what the spammer claimed was Outlook Express) isn't going
to be hopping to my upstream provider with no ISP's Received header 
in between.  What's an Outlook Express or Eudora or whatever 
(home-PC-type) mail user doing mainlining his mail to my SMTP server?  
It's bogus, and obviously so.[1]
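For anyone who wants to play with the idea outside procmail, here is a rough shell sketch of that check.  It is not my actual recipe: the function names, the list of mail clients, and the default threshold of one Received header are all assumptions made up for illustration, and it has none of the exceptions or greenlists the real thing relies on.

```shell
#!/bin/sh
# Sketch only: threshold and MUA list are illustrative assumptions,
# not the author's actual recipe.

# Print only the header section (everything up to the first blank line).
msg_headers() {
    sed '/^$/q' "$1"
}

# Succeed (exit 0) when the message claims a desktop mail client yet shows
# too few Received headers to have gone through an ISP's outgoing server.
# Usage: looks_direct_from_desktop_mua FILE [MAX_HOPS]
looks_direct_from_desktop_mua() {
    msg=$1
    max_hops=${2:-1}    # hypothetical: this many Received headers or fewer is suspicious

    # Folded continuation lines start with whitespace, so this counts
    # each Received header once.
    received=$(msg_headers "$msg" | grep -c '^Received:' || true)

    if msg_headers "$msg" | grep -Eiq '^X-Mailer:.*(Outlook Express|Eudora)'; then
        [ "$received" -le "$max_hops" ]
    else
        return 1
    fi
}
```

Something like `looks_direct_from_desktop_mua msg.Ey0X && echo suspicious` would then flag a message.  The right threshold depends on how many hops your own setup adds, so treat the default as a placeholder.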

I have under forty recipes in my spam-fighting arsenal!  All
but four are headers-only.  The four body-checks only run if the
other 36 didn't catch anything (and, of course, we passed
my greenlists and mailing lists).  "Didn't catch anything" for
me means more than one headers-only recipe has to hit, or
we go for confirmation in the body.  Those four body checks
really don't run very often.  Here's a distribution I run when 
I feel like it (shell script).  It shows the names I gave the 
recipes, and how they hit against the last-100 spam messages -- 
which is generally all I keep around:

 12:51am [~/Mail] 463[1]> distro
Finding distribution for "^X-Recipe-ID: " within selected file(s)
(default: [*])
  50 UBE.RC.MYUPSTREAM
  44 UBE.RC.LOW_COUNT+TO.!ME+TRUST<HIGH
  39 UBE.TRUST<LOWEST
  36 UBE.DT.!FR_.DATE_SPOTTY:FUTUREDAY
  28 UBE.RC.BOTTOMFEEDERS
  27 UBE.XM.NONBULK+PIPELINED
  22 UBE.RC.DODGEY
  19 UBE.FR+RC.DELTA-TLD
  19 UBE.RC.SPLIT
  16 UBE.VH.!HOTHOO
  16 UBE.VH.TOO_SHORT|LONG
  11 UBE.ID.!RFC:1
   9 UBE.DT.!RC.DATE_SPOTTY:0
   7 UBE.CT.HTML+BASE64
   7 UBE.DT.!FR_.DATE_SPOTTY:PASTDAY
   7 UBE.SJ.END+(SPACEY|NUMS|NOVOWELS)
   7 UBE.VH.REPEATS
   7 UBE.XM.FELONS
   6 UBE.DT.!FR_.DATE_SPOTTY:PASTHOUR
   6 UBE.ID.FAKE:1
   6 UBE.VH.BOGEY
   5 UBE.RC.QUADRAPHONY
   5 UBE.SJ.LOCALTO
   3 UBE.DT.!RC.DATE_SPOTTY:2
   3 UBE.SJ.CAPPY
   3 UBE.SJ.PUNKY
   3 UBE.VH.RETROFIT-MUA
   2 UBE.DT.!FR_.DATE_SPOTTY:FUTUREHOUR
   2 UBE.FR.!(VOWEL|CONSONANT)
   2 UBE.ID.FAKE:4
   2 UBE.ID.MYUPSTREAM
   1 UBE.B.SPAMISH:A=14
   1 UBE.B.SPAMISH:A=8
   1 UBE.B.SPAMISH:A=9
   1 UBE.DT.BOGUS
   1 UBE.RP.3+NUMS+TO.!ME
   1 UBE.SJ|FR.HI-BIT

You can't tell all that much from this, except you can infer
a few things by the names.  But note that 50 of my last 100
spam messages were tagged by one single recipe!  And it's
a headers-only one, of course.  
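My "distro" script itself isn't shown here, but a tally like the one above can be approximated in a few lines of shell.  The following is a guess at the general shape, not the real script: the function name and the assumption that recipe names are comma-separated on a single X-Recipe-ID line (as in my saved spam) are the only things it relies on.

```shell
#!/bin/sh
# Sketch only: a hypothetical stand-in for the "distro" script.

# Usage: distro FILE...
# Prints "count name" for every recipe named in an X-Recipe-ID line,
# most frequent first, ties in name order.
distro() {
    sed -n 's/^X-Recipe-ID: //p' "$@" |   # keep only the recipe lists
        tr ',' '\n' |                     # one recipe name per line
        sed 's/^ *//; /^$/d' |            # trim leading blanks, drop empties
        sort | uniq -c |                  # tally identical names
        sort -k1,1nr -k2                  # by count descending, then name
}
```

Run as `distro msg.*` inside the spam folder.  Because one message can carry several recipe names, the counts add up to more than the number of messages.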

The second one, tagging 44 of 100, found a low Received
count combined with the message's not being addressed
to me (and I do let *@one-of-my-domains.dom mail into
my servers), and also one or more of various other
shady things that I have found to go often with the
above.

Okay, explaining the distro a bit further, the first part of 
the name  (after UBE) between dots describes the type of 
checking the recipe does.  E.g., "CT" is Content-Type; "DT" 
is Date; "XM" is X-Mailer; "VH" is "various headers"; "ID" 
is Message-ID; and so on.  "B" is body.  Three of my
last-100 spam messages were ID'd under body checks.  That's pretty
high, actually.  Let's look at those more specifically:

 12:57am [~/Mail/.myspam] 465[0]> grep ^X-Recipe- * | grep -w B
msg.Ey0X:X-Recipe-ID: UBE.XM.NONBULK+PIPELINED, UBE.B.SPAMISH:A=8
msg.Hy0X:X-Recipe-ID: UBE.RC.MYUPSTREAM, UBE.XM.FELONS, UBE.B.SPAMISH:A=9
msg.tooH:X-Recipe-ID: UBE.XM.NONBULK+PIPELINED, UBE.B.SPAMISH:A=14

Okay, they were all also caught by header-check recipes first,
but with only one hit -- which caused the run through the
body sniffers for confirmation of spamminess.  On some
of my header checks, only one hit on a recipe isn't enough to
land the message in the spam pile.  It would land in the
"maybe" pile instead, but for the confirming body test on those
emails.
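Boiled down, the three-way decision looks roughly like the sketch below.  It is a simplification: it treats every header recipe as equal, whereas only some of my header checks need a second opinion, and what it does with a body hit when no header recipe fired at all (file under "maybe") is purely an assumption for the sake of a complete example.  The function name and labels are made up.

```shell
#!/bin/sh
# Sketch only: a simplified model of the spam / maybe / not-spam decision.

# Usage: classify HEADER_HITS BODY_HIT
#   HEADER_HITS  number of headers-only recipes that matched
#   BODY_HIT     1 if a body check matched, 0 otherwise
classify() {
    header_hits=$1
    body_hit=$2

    if [ "$header_hits" -ge 2 ]; then
        echo spam        # more than one header recipe hit: no body check needed
    elif [ "$header_hits" -eq 1 ] && [ "$body_hit" -eq 1 ]; then
        echo spam        # single header hit, confirmed by the body
    elif [ "$header_hits" -eq 1 ] || [ "$body_hit" -eq 1 ]; then
        echo maybe       # lone, unconfirmed hit goes to "purgatory"
    else
        echo notspam
    fi
}
```

So `classify 1 1` prints `spam`, while `classify 1 0` prints `maybe`.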

Most of my recipes could be adapted by others.  In fact,
one fellow panix user does incorporate my rc, though he
has to weed out some personal stuff of mine in order to do
so.  I have planned for a while to make some of it modular
in the way that my Virus Snaggers is, but I have too many
other projects.

Last thing: if the recipes don't hit anything in a month,
I delete them from my rc.  (I have a file hash based on the
recipe names, shown above, that gets touched when I run
my "distro" script.)
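One way to get that kind of bookkeeping, sketched under the assumption that the "file hash" is simply a directory holding one empty stamp file per recipe name (the directory layout, function names, and 30-day default are all invented for this example):

```shell
#!/bin/sh
# Sketch only: stamp-file bookkeeping for spotting recipes that stopped hitting.

# Usage: ... | mark_hits STAMP_DIR
# Reads recipe names on stdin, one per line, and touches a stamp file for each.
mark_hits() {
    stamp_dir=$1
    mkdir -p "$stamp_dir"
    while IFS= read -r name; do
        if [ -n "$name" ]; then
            touch "$stamp_dir/$name"
        fi
    done
}

# Usage: stale_recipes STAMP_DIR [DAYS]
# Lists recipes whose stamp has not been touched in more than DAYS days.
stale_recipes() {
    find "$1" -type f -mtime +"${2:-30}" | sed 's|.*/||' | sort
}
```

Feeding the name column of a tally into `mark_hits` refreshes the stamps, and whatever `stale_recipes` prints is a candidate for deletion from the rc.  This works because none of the recipe names contain a slash.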

Okay, hope that was useful.


[1] Waiting for Sean to remind me that he uses Eudora. :-)  But
I have a couple of exceptions built into the recipe; and I also
have greenlists.

Dallman

