Re: Counting hits

On Thu, Jun 03, 2004 at 09:21:29AM -0400, fleet(_at_)teachout(_dot_)org wrote:

On Thu, 3 Jun 2004, Dallman Ross wrote:

On Wed, Jun 02, 2004 at 10:02:00PM -0400, fleet(_at_)teachout(_dot_)org wrote:

I've come up with the following scheme to count "hits." I have a
couple of questions below the example:


I am not clear on what you are counting.  What do you consider
"hits"?


"Hit" is any reaction to the condition line that would "activate" the
action line.  I'm looking for spam indicators, of course.

[snip]

I want to be able to identify the condition that got tripped, and
total the "hits" for each message.

You ought to be able to count most anything right in native procmail
language.


[snip]

Basically, I'm trying to identify a spam condition, assign a number
value to it, and "formail" the condition description into the header
of the message. (I can do the formail part - in an equally inefficient
manner - and don't have that part included in the example.  I suppose
it would replace the "LOG" function eventually.)


Here is how I do this in my own rc.  I have about 40 spamtrap recipes
(all but four of which operate only on headers).  I explained all this
in a (longish) message (to Kai) last week, actually, so I won't repeat
all that now.  But I want to explain how I have procmail identify the
recipes that hit.  The recipes below are fake, of course, but the
action lines are a couple of my actual action lines.  I assign the
running string of recipe names hit to the $RX var:

  :0 flags
  * conditions
  { RX = "${RX:+$RX, }UBE.TO.ILLEGAL" }

  :0 flags
  * conditions
  { RX = "${RX:+$RX, }UBE.ID.MYUPSTREAM" }

  :0 flags
  * conditions
  { RX = "${RX:+$RX, }UBE.ID.!RFC:1" }


and so on.
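
The ${RX:+$RX, } expansion is what keeps the list tidy: it expands to
nothing while RX is empty or unset, and to the current value plus ", "
once something is in there.  POSIX shell has the same expansion, so
here is a minimal shell sketch of the effect.  The add_hit function is
a made-up name for illustration only; it is not part of the rc above.

```shell
#!/bin/sh
# Append a recipe name to the running list held in RX.
# ${RX:+$RX, } expands to "" while RX is empty, else to "$RX, ".
add_hit() {
    RX="${RX:+$RX, }$1"
}

RX=""
add_hit 'UBE.TO.ILLEGAL'
add_hit 'UBE.ID.MYUPSTREAM'
add_hit 'UBE.ID.!RFC:1'
echo "$RX"
```

The first call leaves no leading comma, and each later call adds
exactly one ", " separator.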

Note that I run anything that's gotten to this area of my rc through
all the header tests.  I actually don't need to -- I could stop at
three spamtrap hits and be sufficiently convinced it's spam.  But I
do all of them for statistical reasons.

As I explained last week, the body checks don't happen unless we have
zero or one header-check hits.  I save a lot of MIPS that way.  :-)
And body checks only happen on about 2-3% of my spam.
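
For totaling the hits on a message, one option is simply to count the
entries in the $RX string.  A rough shell sketch follows, assuming the
names are joined with ", " as above and that no recipe name contains a
comma.  Both count_hits and needs_body_check are hypothetical helper
names, and the threshold just mirrors the "zero or one header hits"
rule described above.

```shell
#!/bin/sh
# Count the comma-separated recipe names in a string like "$RX".
# Assumes no recipe name itself contains a comma.
count_hits() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    # Put each name on its own line, then count the lines.
    printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

# Succeeds when body checks are still worth running,
# i.e. at zero or one header-check hits.
needs_body_check() {
    [ "$(count_hits "$1")" -le 1 ]
}

RX='UBE.TRUST<LOWEST, UBE.DT.BOGUS, UBE.ID.FAKE:1'
echo "hits: $(count_hits "$RX")"
if needs_body_check "$RX"; then
    echo "run body checks"
else
    echo "skip body checks"
fi
```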

I do have a system in mind that I intend to code that dynamically
changes the order and stops when a sufficient calculus is reached.
But I've got way too many irons in the fire and don't know when I'll
get to that.

Okay, so now assume we're at the bottom of the spamtrap section, and
we want to see what happened.  Here's what's there:

  :0  # 030105 () if any VIR/UBE recipe succeeded, dispatch message
   * $  RX  ??  $TRUE
   { SWITCHRC = $RX_DELIVER }


Note the date code (YYMMDD) in the first line of the recipe, which I
use to see when I last changed a recipe.  I haven't messed with that
in nearly 1.5 years.  Okay, what's in the $RX_DELIVER SWITCHRC?
All right, here's part of it.  (The var called $LACUNA is what I
call some empty space I place in my log for visual aid.  The
word _lacuna_ in the English dictionary means, among other things,
"an empty space.")

-------------------------------
## Dispatch UBE mail

   # log Recipe-ID assignments
   logtext = "Recipe-ID: $RX"
   LOG = "$angle_L $logtext $angle_R $LACUNA"

   # bail now if we're in test mode
   INCLUDERC = $KILLONTEST



  :0 fhw  # 021205 () brand the spam with our snag info
   | formail -I"X-Recipe-ID: $RX"
-------------------------------

That's part of the rc.  The last recipe above is your "branding."
Then there's some more stuff, and after that, this:

-------------------------------
  :0  # 021205 () no lockfile; .myspam is a directory, not a flat file
   .myspam

  :0 e:  # 020325 () else, uh, that didn't work, so name file manually
   .myspam/msg.$$.$HOST.fallback
-------------------------------



Note that I haven't changed some of this in over two years.
It's getting long in the tooth, though it still works fine.  But
I have newer ideas I'd like to code when I get around to it.
The very last one, the failsafe, is because in 2002 and early
2003 I was experiencing some NFS file-write problems that caused
some spam not to be archived (saved) correctly on rare occasion.
I actually haven't seen that problem recur in about a year now,
though.  I believe panix (my shell provider) changed some of the
mechanisms of its NFS in the interim.  Anyway, the recipe just sits
there and doesn't hurt anything.


Here's what the X-Recipe-ID: header shows me.  I'll show the most recent ten.
Sorry for the line-wrap:

 1:56pm [~/Mail/.myspam] 313[0]> ls -t | head | xargs grep ^X-Recipe-
msg.eiPK:X-Recipe-ID: UBE.DT.!RC.DATE_SPOTTY:2, 
UBE.DT.!FR_.DATE_SPOTTY:FUTUREHOUR, UBE.SJ.LOCALTO, UBE.RC.DODGEY
msg.AiPK:X-Recipe-ID: UBE.TRUST<LOWEST, UBE.DT.BOGUS, UBE.ID.FAKE:1, 
UBE.VH.RETROFIT-MUA
msg.farF:X-Recipe-ID: UBE.TRUST<LOWEST, UBE.DT.!FR_.DATE_SPOTTY:PASTDAY, 
UBE.FR+RC.DELTA-TLD, UBE.RC.BOTTOMFEEDERS
msg.earF:X-Recipe-ID: UBE.TRUST<LOWEST, UBE.DT.BOGUS, UBE.ID.FAKE:1, 
UBE.VH.RETROFIT-MUA, UBE.RC.QUADRAPHONY
msg.darF:X-Recipe-ID: UBE.VH.!HOTHOO, UBE.FR.!(VOWEL|CONSONANT), UBE.RC.SPLIT, 
UBE.RC.QUADRAPHONY
msg.carF:X-Recipe-ID: UBE.TRUST<LOWEST, UBE.FR+RC.DELTA-TLD, UBE.RC.QUADRAPHONY
msg.barF:X-Recipe-ID: UBE.VH.!HOTHOO, UBE.TRUST<LOWEST, 
UBE.SJ.END+(SPACEY|NUMS|NOVOWELS), UBE.VH.REPEATS, UBE.RC.DODGEY
msg.aarF:X-Recipe-ID: UBE.TRUST<LOWEST, UBE.ID.!RFC:1, 
UBE.DT.!FR_.DATE_SPOTTY:FUTUREDAY, UBE.RC.SPLIT, UBE.XM.NONBULK+PIPELINED
msg.ZarF:X-Recipe-ID: UBE.TRUST<LOWEST, UBE.DT.BOGUS, 
UBE.DT.!FR_.DATE_SPOTTY:PASTHOUR, UBE.FR+RC.DELTA-TLD, UBE.RC.DODGEY
msg.YarF:X-Recipe-ID: UBE.TRUST<LOWEST, UBE.ID.!RFC:1, 
UBE.DT.!FR_.DATE_SPOTTY:PASTDAY, UBE.ID.FAKE:1, UBE.RC.SPLIT, 
UBE.XM.NONBULK+PIPELINED


Note as an aside that every one of the last ten spams was caught on
headers-only tests, and each tripped multiple "hits," as you call them.
There's very little risk of a false positive here, with such results!
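
Since the recipe names ride along in the X-Recipe-ID: header, the
per-message total can also be recovered from the saved files after the
fact.  Below is a hedged shell/awk sketch, not anything from the rc
itself.  It assumes one message per file, names separated by commas,
and it joins folded continuation lines in case the header was wrapped
on disk.  The name hits_per_message is made up for the example.

```shell
#!/bin/sh
# Print "<count> <file>" for each message file given as an argument.
# Only the header is read (awk stops at the first blank line), and
# continuation lines starting with a space or tab are joined on.
hits_per_message() {
    for f in "$@"; do
        awk -v file="$f" '
            /^$/ { exit }                      # end of header
            /^X-Recipe-ID:/ {
                val = $0
                sub(/^X-Recipe-ID:[ \t]*/, "", val)
                inhdr = 1
                next
            }
            inhdr && /^[ \t]/ { val = val " " $0; next }
            { inhdr = 0 }                      # some other header line
            END {
                n = (val == "") ? 0 : split(val, parts, ",")
                print n, file
            }
        ' "$f"
    done
}
```

Something like "hits_per_message msg.* | sort -rn | head" would then
list the messages with the most hits first.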


Note, also, that when I see a bunch of the same recipe monikers grouped
together in time -- especially when the particular spamtrap of mine is
not an exceedingly popular one in the overall stats -- I can be pretty
sure that all those messages were generated by one particular spammer
on his regular run through his spew.  :-)  The "QUADRAPHONY" one is
probably that.

Just for grins, let's look:

 DING! [~/Mail/.myspam] 315[1]> ls -t | head
msg.eiPK
msg.AiPK
msg.farF
msg.earF
msg.darF
msg.carF
msg.barF
msg.aarF
msg.ZarF
msg.YarF

[217.86.12.223 -> panix5] {dman} [2.46]
 2:00pm [~/Mail/.myspam] 316[0]> frm `!!`
frm `ls -t | head`
Uvula Q. Postbox      gain 3in. to your manhood Dman or your money back.
Rod Webb              Discount Internet Pharmacy - FREE Prescriptions Written w
peggie karnes         Fwd: V _ X:A:Nax < V1(_at_)gra > Valiu+m+ S:o:ma < 
Pnter.m.in 
Dick Higgins          buy Xanax cheap - Diazepam is used to relieve anxiety, mu
Josie Moran           Online ordering is the greatest
Della May             Get Ambienn today from London's DrugOutlet
Aurelio Morse         Windows 2000 Server $60  mitten                    
georgianna Bullett    invention   Pa1n r3lief    present
nicolasNikolas Sievers  Fwwd: Earn huge moneey quickly from hhome...  (locate s
Barney Hairston       

See the fourth, fifth, and sixth Subject?  All about pharmacological
relief? :-) All without word-munging?  Probably all sent by the same
spammer in the same run.  I could cull through the headers more
carefully, but who cares?  Anyway, QUADRAPHONY (which looks for fake
repeating dotted-quads in the lowest Received) usually gets 3-8%
of my spam, but right now it got 14 of the last 100.  So it may be that this
particular spammer is on the loose this hour.  :-)

 2:04pm [~/Mail] 324[0]> distro | grep QUAD
  14 UBE.RC.QUADRAPHONY

As I posted last week (to Kai), actually about 70% of my spam can be
corralled with the top few headers-only recipes.  Here's the current distro,
top bit:

 2:06pm [~/Mail] 326[1]> distro | head
Finding distribution for "^X-Recipe-ID: " within selected file(s) (default: [*])
  55 UBE.TRUST<LOWEST
  45 UBE.XM.NONBULK+PIPELINED
  42 UBE.RC.MYUPSTREAM
  30 UBE.RC.BOTTOMFEEDERS
  28 UBE.RC.DODGEY
  26 UBE.RC.LOW_COUNT+TO.!ME+TRUST<HIGH
  26 UBE.VH.!HOTHOO
  19 UBE.FR+RC.DELTA-TLD
  17 UBE.RC.SPLIT


I'll end this with a display of a section of the log, running a recent spam
through my test harness (uses same rc, but sends log to my screen):

 2:10pm [~/Mail] 330[1]> rctest SPAMPLE

  [snip of most of log result]

  : We're exiting Section SPAMSNAG 
  : We're entering Section DELIVERY 

          > Recipe-ID: UBE.TRUST<LOWEST, UBE.RC.MYUPSTREAM, UBE.SJ.PUNKY, 
UBE.RC.LOW_COUNT+TO.!ME+TRUST<HIGH, UBE.DT.!RC.DATE_SPOTTY:0, 
UBE.DT.!FR_.DATE_SPOTTY:FUTUREDAY, UBE.RC.DODGEY, UBE.XM.NONBULK+PIPELINED < 
    
    From foo(_at_)uk(_dot_)com  Wed Jun  2 00:50:47 2004
 Subject: softwa;re for c;heap
  Folder:                                                                  1316


All right, that's it for now.

-- 
dman

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail