procmail

Re: No good spamming bastards are using new tricks to get by the filters

2003-01-20 03:16:40
Louis LeBlanc <leblanc+procmail(_at_)keyslapper(_dot_)org> wrote:

> Is there a relatively easy way to filter every HTML message thru a dump
> to eliminate any HTML tags?

Yes:

  :0:
  * ^Content-Type:.*htm
  $TRASH

works pretty well, without even searching the body.  :-)

There's a large false-negative rate, which worries me more than the
false positives do.  $TRASH is not /dev/null.  Where I have a modicum
of control over a source, e.g., Yahoo groups, I at least configure it
not to convert messages to HTML.  Most of the HTML-containing messages
are of type multipart/alternative or multipart/mixed, but the ones I've
gotten that try to defeat content filters with HTML comments were pure
text/html.

I see no reason to tie my own hands behind my back by not searching at
least the headers for telltale verbiage.  I rely on whitelists, after
virus and dupe checks, as my defense against false positives.  After
that, I don't care if I'm a little brutal detecting the likely spam.
It seems to me that advertisers want to get past the filters, but they
also want the recipient to know what they're offering.  In most cases,
their product is mentioned, or alluded to more or less directly, right
in the subject.  When the subject is innocuous, the sender's address or
name hints, not too subtly, at their intent.  I look for these pointed
cues like this:

  :0:
  * $ 1^0 ^(From |(From|Sender|Reply-To|Return-Path):).*($ADNAMES)
  * $ 1^0 ^(Subject|X-[]A-Za-z_0-9.*=#|$?!/&~[^+-]+):.*($ADWORDS)
  $TRASH  
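
For readers who want to try this, here is a minimal sketch of how the
variables and the earlier checks might be laid out.  The word lists,
the file names msgid.cache and $HOME/.whitelist, and the mail paths are
placeholders, not taken from the recipes in this post, and the virus
check is left out entirely.

```procmail
# Sketch only: word lists, file names and paths are placeholders.
MAILDIR=$HOME/Mail
TRASH=$MAILDIR/trash

# Plain alternations, so they can sit inside (...) in a condition
# that carries the $ flag.
ADNAMES="casino|mortgage|pharmacy"
ADWORDS="refinance|free money|act now"

# 1. Drop duplicates by Message-ID (the pattern from procmailex(5)).
:0 Wh: msgid.lock
| formail -D 8192 msgid.cache

# 2. Deliver whitelisted senders before any spam test runs.
#    $HOME/.whitelist is assumed to hold one address per line, with
#    no empty lines (an empty line would match every sender).
:0:
* ? formail -x From: -x Sender: | fgrep -i -q -f $HOME/.whitelist
$DEFAULT

# 3. Only after these would the keyword recipe above be applied.
```

Putting the whitelist ahead of the keyword test is what makes a
deliberately brutal word list tolerable: known senders never reach it.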

By that time, a message has already had to get past other tripwires,
like a bad year (not 2003), no subject, and my favorite (not my
invention), less than 20% lowercase in the subject.  This catches not
only shouting but also non-letters, not the least of which is the
non-English quoted-printable crap.

  :0 D: # SJ.CAPS subject 80% non-LC (snags quoted hibit stuff too)
  * $ $GO^0 ! ^Subject:[$WS]*\<\/.*
  *     1^1    MATCH ?? [^a-z]
  *    -4^1    MATCH ?? [a-z]
  $TRASH

Those familiar with Dallman's style will easily recognize $GO and $WS.
I believe someone else originally gave Dallman the idea for this
recipe.  It gives me satisfaction not so much because it's effective,
which it is, but because the messages it snags are the most annoying.
:-)
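
To make the arithmetic concrete: each character of MATCH that is not a
lowercase letter adds 1 and each lowercase letter subtracts 4, so the
total stays positive only while lowercase letters make up less than a
fifth of the subject.  The sketch below spells out assumed definitions
for the two helper variables and works two subjects through the
scoring by hand.  The values of GO and WS and both sample subjects are
illustrative guesses, not taken from this post.

```procmail
# Assumed helper variables; the values are guesses at the usual idiom.
# GO: a weight large enough to swamp any other score.
GO=9876543210
# WS: the characters for a whitespace class, a space and a literal tab.
WS=" 	"

# Scoring by hand, assuming MATCH holds exactly the text shown.
# The D flag matters: without it [a-z] would also match capitals.
#
#   MATCH = "FREE MONEY!!"
#     12 non-lowercase x  1 =  12
#      0 lowercase     x -4 =   0
#     total 12, positive, so the message is filed in $TRASH
#
#   MATCH = "Re: lunch on Friday"
#      6 non-lowercase x  1 =   6
#     13 lowercase     x -4 = -52
#     total -46, not positive, so the recipe does not fire
```

Spaces and punctuation count as non-lowercase, which is why a short
subject padded with exclamation marks trips the recipe so easily.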

Mike

-- 
--       Con    In                                       Hanc marginis
    --   Tact   Formation                                    exiguitas
    --                                                     non caperet.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail