wolfgang wrote on Saturday, May 26, 2007 8:58 PM:
They are spam mails, Content-Type: text/html;
They start with:
<title>
lineofmorethanonehundredthousandcharacters
</title>
and then the HTML spam message starts, containing links to URLs and 
images on the web plus DIV and FONT tags.
I think they are designed to provoke spamassassin/spamc
timeouts, which works on older machines with weak processors,
as in my case.
I want to match that one overly long unbroken line without
spaces. MIME or uuencode would contain linebreaks wouldn't they
- so I assumed my "egrep style" algorithm wouldn't catch those.
So, how would I - not familiar with scoring so far - match that long 
line?

The starting place would be "man procmailsc", but this is
an interesting challenge, and I will try to help more specifically
now.
First of all, we could look for "<title>" and capture from there
including newlines up to "</title>" using the match token.
But 100,000 chars is a lot to capture, and implies a big
LINEBUF and thereby a big memory footprint.  (Same reason
the spammer is using the technique against SpamAssassin, I presume.)
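
To illustrate what that capture amounts to, here is a small sketch in Python rather than procmail. The helper name title_text is made up for this example, and Python's re module only stands in for the match token; the point is that the whole title has to sit in memory before it can be measured.

```python
import re

# Hypothetical helper, not procmail: grab everything between <title> and
# </title>, newlines included, roughly as a MATCH-style capture would.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def title_text(body):
    """Return the text captured between the title tags, or None if absent."""
    m = TITLE_RE.search(body)
    return m.group(1) if m else None

if __name__ == "__main__":
    # Made-up sample shaped like the spam described above.
    sample = "<title>\n" + "x" * 100000 + "\n</title>\n<div>spam</div>\n"
    print(len(title_text(sample)))  # 100002: the long line plus two newlines
```
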
Or we could just use scoring of non-spaces in lines.  But then
we'd have to cycle through each line looking for our max with
a recursive INCLUDERC.  Messy.  And, again, we'd be using 
MATCH and would need a big LINEBUF.
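
To make the "looking for our max" bookkeeping concrete, here is the same idea in a few lines of Python. It is only a sketch of the logic, not procmail scoring, and longest_run is a name invented for the example.

```python
def longest_run(body):
    """Length of the longest stretch without a space or tab, line by line."""
    best = 0
    for line in body.splitlines():
        # Treat tabs like spaces, then measure each unbroken chunk.
        for chunk in line.replace("\t", " ").split(" "):
            best = max(best, len(chunk))
    return best

if __name__ == "__main__":
    print(longest_run("a bb\tcccc\ndd"))  # 4
```

Running something like this over a sample of legitimate mail is one way to see how far below 1,000 the real maximum usually sits.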
I have an idea.  Can you imagine a legitimate message
with even a 1,000-char title string without whitespace?  I can't.
So why not trash at that level instead of looking for 100,000?
First of all, let's not bother unless it's that Content-Type.
Then, let's not bother unless it's bigger than 100K.
  SPACE = ' '
  TAB   = '	'   # one literal tab character between the quotes
  WS    = $SPACE$TAB
  xWS8    = [^$WS][^$WS][^$WS][^$WS][^$WS][^$WS][^$WS][^$WS]
  xWS64   = $xWS8$xWS8$xWS8$xWS8$xWS8$xWS8$xWS8$xWS8 xWS8 #unset   8
  xWS384  = $xWS64$xWS64$xWS64$xWS64$xWS64$xWS64    xWS64 #unset  64
  xWS1152 = $xWS384$xWS384$xWS384                  xWS384 #unset 384
  :0:
  * ^Content-Type:.*/html
  *   B ?? > 100000
  * $ B ?? $xWS1152.*$*.*</title>
  spampile
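
If it helps to see the recipe's three tests outside procmail, here is a rough Python equivalent. Treat it as a sketch under assumptions: the function name is hypothetical, Python's regex engine is not procmail's, the size test counts characters rather than bytes, and I have assumed the unbroken run may not cross a newline.

```python
import re

RUN = 1152  # same threshold as xWS1152 in the recipe

# RUN characters that are not space, tab, or newline (newline exclusion is
# an assumption), then the rest of that line, optional newlines, and a
# closing title tag. Case-insensitive, as procmail conditions are by default.
LONG_TITLE_RE = re.compile(r"[^ \t\n]{%d}.*\n*.*</title>" % RUN, re.IGNORECASE)

def looks_like_title_bomb(content_type, body):
    """Mirror the recipe: HTML type, big body, long unbroken run before </title>."""
    if "/html" not in content_type.lower():
        return False
    if len(body) <= 100000:
        return False
    return LONG_TITLE_RE.search(body) is not None

if __name__ == "__main__":
    spam = "<title>\n" + "x" * 100000 + "\n</title>\n<div>buy</div>\n"
    print(looks_like_title_bomb("text/html; charset=us-ascii", spam))  # True
```
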
Actually, we could look for the full 100K if we wanted to now
without needing more LINEBUF.  Last condition above would simply
be:
  * $ B ?? $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
           $xWS1152$xWS1152
But this seems unnecessary to me.  (The count there
is 100,224, for what it's worth.)
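
The count is easy to double-check by replaying the variable expansion. A throwaway Python sketch, with names chosen only for illustration:

```python
ATOM = "[^ \t]"  # one "not space, not tab" bracket expression

def build_counts():
    """Replay the xWS8 -> xWS64 -> xWS384 -> xWS1152 expansion and count atoms."""
    x8 = ATOM * 8
    x64 = x8 * 8
    x384 = x64 * 6
    x1152 = x384 * 3
    copies = 17 * 5 + 2  # seventeen rows of five, plus two on the last row
    full = x1152 * copies
    return x1152.count(ATOM), copies, full.count(ATOM)

if __name__ == "__main__":
    print(build_counts())  # (1152, 87, 100224)
```
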
Dallman
____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail