procmail

RE: rule to catch a certain number of characters

2007-05-26 13:01:34
wolfgang wrote on Saturday, May 26, 2007 8:58 PM:

They are spam mails, Content-Type: text/html;

They start with:
<title>
lineofmorethanonehundredthousandcharacters
</title>
and then the HTML spam message starts, containing links to URLs and 
images on the web plus DIV and FONT tags.

I think they are designed to provoke spamassassin/spamc
timeouts, which works on older machines with weak processors,
as in my case.

I want to match that one overly long unbroken line without
spaces. MIME or uuencode would contain line breaks, wouldn't
they? So I assumed my "egrep style" algorithm wouldn't catch those.

So, how would I - not familiar with scoring so far - match that long 
line?

The starting place would be "man procmailsc", but this is
an interesting challenge, and I will try to help more specifically
now.

First of all, we could look for "<title>" and capture from there,
newlines included, up to "</title>" using the match token (\/).
But 100,000 chars is a lot to capture into MATCH, and implies a big
LINEBUF and thereby a big memory footprint.  (The same reason
the spammer is using the technique against SpamAssassin, I presume.)

Or we could just use scoring of non-spaces in lines.  But then
we'd have to cycle through each line looking for our max with
a recursive INCLUDERC.  Messy.  And, again, we'd be using 
MATCH and would need a big LINEBUF.

I have an idea.  Can you imagine a legitimate message
with even a 1,000-char title string without whitespace?  I can't.
So why not trash at that level instead of looking for 100,000?

First of all, let's not bother unless it's that Content-Type.
Then, let's not bother unless it's bigger than 100K.

  SPACE = ' '
  TAB   = '	'    # one literal tab character between the quotes
  WS    = $SPACE$TAB

  xWS8    = [^$WS][^$WS][^$WS][^$WS][^$WS][^$WS][^$WS][^$WS]
  xWS64   = $xWS8$xWS8$xWS8$xWS8$xWS8$xWS8$xWS8$xWS8 xWS8   # unset xWS8
  xWS384  = $xWS64$xWS64$xWS64$xWS64$xWS64$xWS64     xWS64  # unset xWS64
  xWS1152 = $xWS384$xWS384$xWS384                    xWS384 # unset xWS384

  :0:
  * ^Content-Type:.*/html
  *   B ?? > 100000
  * $ B ?? $xWS1152.*$*.*</title>
  spampile
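
For anyone who wants to sanity-check the building-block idea without
touching a live mailbox, here is a minimal sketch in Python.  It is only
an approximation: Python's re module stands in for procmail's regex
engine (the two are not identical), the closing tag is written as
"</title>", and the names NON_WS, xws8 ... xws1152 and
looks_like_title_bomb are my own, chosen to mirror the rc variables.

```python
import re

# "Not a space or tab". Newline is excluded as well, because a negated
# character class in procmail never matches a newline.
NON_WS = r"[^ \t\n]"

# Build the pattern by repeated concatenation, mirroring the rc variables.
xws8 = NON_WS * 8
xws64 = xws8 * 8
xws384 = xws64 * 6
xws1152 = xws384 * 3

# Long unbroken run, rest of that line, optional newlines, closing tag.
PATTERN = re.compile(xws1152 + r".*\n*.*</title>")


def looks_like_title_bomb(body: str) -> bool:
    """True if the body has a run of 1152+ non-whitespace chars
    on one line, followed (same or next line) by </title>."""
    return PATTERN.search(body) is not None


# Synthetic test bodies (made up for illustration, not real mail).
spam = "<title>\n" + "x" * 100_000 + "\n</title>\n<div>buy now</div>\n"
ham = "<title>\nWeekly newsletter\n</title>\n<p>" + "word " * 30_000 + "</p>\n"

print(looks_like_title_bomb(spam))  # True
print(looks_like_title_bomb(ham))   # False
```

Note that the ham body is well over 100K too, so it would pass the size
test; it is the unbroken run that separates the two.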

Actually, we could now look for the full 100K if we wanted to,
without needing more LINEBUF.  The last condition above would simply
become:

  * $ B ?? $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152

But this seems unnecessary to me.  (The count there
is 100,224, for what it's worth.)
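
The arithmetic behind that count can be checked quickly.  This is just
a sketch in Python; the variable names are mine and simply follow the
rc variable definitions and the layout of the long condition.

```python
# Characters matched by each building block, following the rc definitions.
n8 = 8
n64 = 8 * n8        # eight copies of xWS8
n384 = 6 * n64      # six copies of xWS64
n1152 = 3 * n384    # three copies of xWS384

# The long condition: 17 continued lines of 5 copies each, plus a last line of 2.
copies = 17 * 5 + 2
total = copies * n1152

print(n1152, copies, total)  # 1152 87 100224
```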

Dallman


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail