Two recipe suggestions



I have a couple of recipe ideas that I'll probably implement to deal
with spam, but I wanted to throw them out here first to see if anyone
has already done them. If so, maybe I can be saved the work.


1) match on some regexp like this

   "reply (with )?(the )?(word )?['"]?remove['"]? in the (subject|body)"

   and put such messages wherever you like. this could be modified to
   not match lines where the phrase appears in included (quoted) text,
   so that people can still send me mail that includes a copy of some
   spam.

   i wonder how well a recipe like this would do. if it were general
   enough, it would certainly clobber quite a bit of the spam i get.
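
   as a rough sketch of the idea (in python rather than as a procmail
   recipe), the check could look something like the following. it
   assumes included text is marked with a leading ">", and the names
   REMOVE_RE, QUOTED_RE and has_remove_instruction are made up for
   illustration. it works line by line, so a phrase that wraps across
   two lines would be missed.

```python
import re

# loose version of the pattern above; each optional word carries its own
# trailing whitespace so the phrase still matches when words are left out
REMOVE_RE = re.compile(
    r"""reply\s+(with\s+)?(the\s+)?(word\s+)?['"]?remove['"]?\s+in\s+the\s+(subject|body)""",
    re.IGNORECASE,
)

# assumption: included (quoted) text starts with ">", possibly indented
QUOTED_RE = re.compile(r"^\s*>")


def has_remove_instruction(body: str) -> bool:
    """Return True if some non-quoted line asks the reader to reply with 'remove'."""
    for line in body.splitlines():
        if QUOTED_RE.match(line):
            continue  # skip included text so forwarded spam is not flagged
        if REMOVE_RE.search(line):
            return True
    return False
```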



2) look for regexps like these: 

     "^[^a-z]*!!!![     ]*$"

     "!!!![     ]*$"
     "!!![      ]*$"
     "!![       ]*$"
     "![        ]*$"

     "!!!!"
     "!!!"
     "!!"
     "!"

   i'd like to look for these in (something like) this order, and
   weight them too. what i have above is not correct, but it should
   give the idea:

       penalize the hell out of lines that 
              ARE ALL UPPERCASE WITH MANY TRAILING !!!!!

       penalize them slightly less (but still a lot) for fewer trailing !!!s
       penalize mixed-case lines which end with many !!!s slightly less again

       and then just penalize !'s in general, depending on repetition counts.

   i suppose this could be done with the weighted scoring mechanism in
   procmail, but it might be better to write a standalone ! detector.
   an even simpler approach would be to invoke something that simply
   counted the !'s in a mail and returned 0 if the density was below
   some threshold.


   my guess is that with this one mechanism in place, i could clobber
   a LOT of spam, and pretty efficiently too (no need to fgrep on 800
   domain names and add more domains as they multiply). at least in my
   case, i don't have many people sending me mail which is as
   liberally sprinkled with !'s as almost all the spam i get is.
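
   to make the ordering and weighting idea a bit more concrete, here
   is a small python sketch of a standalone ! scorer. the weights, the
   rule list and the names (RULES, line_penalty, bang_score,
   bang_density) are all invented for illustration and would need
   tuning against real mail. rules are tried strongest first and only
   the first match on a line counts.

```python
import re

# (pattern, penalty) pairs, strongest first; the weights are hypothetical
RULES = [
    (re.compile(r"^[^a-z]*[A-Z][^a-z]*!{4,}\s*$"), 10.0),   # all uppercase, many trailing !
    (re.compile(r"^[^a-z]*[A-Z][^a-z]*!{2,3}\s*$"), 7.0),   # all uppercase, fewer trailing !
    (re.compile(r"!{4,}\s*$"), 5.0),                        # any case, many trailing !
    (re.compile(r"!{2,3}\s*$"), 3.0),                       # any case, two or three trailing !
    (re.compile(r"!\s*$"), 1.0),                            # a single trailing !
    (re.compile(r"!{2,}"), 1.0),                            # repeated ! anywhere in the line
    (re.compile(r"!"), 0.25),                               # a lone ! anywhere in the line
]


def line_penalty(line: str) -> float:
    """Penalty for one line: the weight of the first rule that matches, else 0."""
    for pattern, weight in RULES:
        if pattern.search(line):
            return weight
    return 0.0


def bang_score(body: str) -> float:
    """Sum of the per-line penalties over the whole body."""
    return sum(line_penalty(line) for line in body.splitlines())


def bang_density(body: str) -> float:
    """The simpler approach: fraction of non-whitespace characters that are '!'."""
    visible = sum(1 for ch in body if not ch.isspace())
    return body.count("!") / visible if visible else 0.0
```

   either bang_score or bang_density could then be compared against
   some threshold to decide what to do with the message.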



this leads me to realize that what i might be looking for is a
standalone spam detector. a program which reads stdin, produces
nothing on stdout (unless you say -v which makes it give its reasons),
but which exits 0 or 1 depending on whether it guesses stdin was spam.

it could also look for density of uppercase in the message body, known
spam domains, known bad email addresses etc.
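
a minimal python sketch of such a detector might look like the
following. it only covers the ! density and uppercase density checks,
and several things in it are assumptions rather than anything settled
above: the two thresholds, the function names, looking only at the
text after the first blank line, and the choice of exit status 1 for
"looks like spam" and 0 otherwise.

```python
import sys

# hypothetical thresholds; they would need tuning against real mail
MAX_BANG_DENSITY = 0.02    # fraction of non-whitespace body characters that are "!"
MAX_UPPER_DENSITY = 0.5    # fraction of body letters that are uppercase


def split_body(message: str) -> str:
    """Return the text after the first blank line, or everything if there is none."""
    _head, sep, body = message.partition("\n\n")
    return body if sep else message


def spam_reasons(message: str) -> list:
    """Return a list of human-readable reasons; an empty list means 'not spam'."""
    body = split_body(message)
    visible = [ch for ch in body if not ch.isspace()]
    letters = [ch for ch in visible if ch.isalpha()]
    reasons = []
    if visible:
        bang = body.count("!") / len(visible)
        if bang > MAX_BANG_DENSITY:
            reasons.append(f"! density {bang:.3f} exceeds {MAX_BANG_DENSITY}")
    if letters:
        upper = sum(1 for ch in letters if ch.isupper()) / len(letters)
        if upper > MAX_UPPER_DENSITY:
            reasons.append(f"uppercase density {upper:.3f} exceeds {MAX_UPPER_DENSITY}")
    return reasons


def exit_status(message: str) -> int:
    """Assumed convention: 1 if the message looks like spam, 0 otherwise."""
    return 1 if spam_reasons(message) else 0


def main(argv: list) -> int:
    """Read a message on stdin; print the reasons only when -v is given."""
    message = sys.stdin.read()
    if "-v" in argv[1:]:
        for reason in spam_reasons(message):
            print(reason)
    return exit_status(message)

# when saved as a script, finish with: sys.exit(main(sys.argv))
```

checks for known spam domains or known bad addresses would slot in as
further tests that append to the same list of reasons.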

this of course will blossom into a complete expert system.



although i'm not too much in favor of monolithic programs, having one
such as this would probably make all this spam processing much faster,
and would make spam detection/rejection via procmail much easier to
set up for new procmail users. just give them the source and a single
simple procmail recipe.


i'd be happy to receive comments. if no one has such a standalone
spam detector, i'll write one.


regards,
Terry Jones (terry(_at_)teclata(_dot_)es).