wolfgang wrote on Saturday, May 26, 2007 8:58 PM:
They are spam mails, Content-Type: text/html;
They start with:
<title>
lineofmorethanonehundredthousandcharacters
</title>
and then the HTML spam message starts, containing links to URLs and
images on the web plus DIV and FONT tags.
I think they are designed to provoke spamassassin/spamc
timeouts, which works on older machines with weak processors,
as in my case.
I want to match that one overly long unbroken line without
spaces. MIME or uuencoded content would contain line breaks,
wouldn't it, so I assumed my "egrep style" algorithm wouldn't
catch those.
So, how would I - not familiar with scoring so far - match that long
line?
The starting place would be "man procmailsc", but this is
an interesting challenge, and I will try to help more specifically
now.
First of all, we could look for "<title>" and capture from there
including newlines up to "</title>" using the match token.
But 100,000 chars is a lot to capture, and implies a big
LINEBUF and thereby a big memory footprint. (Same reason
the spammer is using the technique against SpamAssassin, I presume.)
Or we could just use scoring of non-spaces in lines. But then
we'd have to cycle through each line looking for our max with
a recursive INCLUDERC. Messy. And, again, we'd be using
MATCH and would need a big LINEBUF.
I have an idea. Can you imagine a legitimate message
with even a 1,000-char title string without whitespace? I can't.
So why not trash at that level instead of looking for 100,000?
First of all, let's not bother unless it's that Content-Type.
Then, let's not bother unless it's bigger than 100K.
SPACE = ' '
TAB = '	' # a literal tab character between the quotes
WS = $SPACE$TAB
xWS8 = [^$WS][^$WS][^$WS][^$WS][^$WS][^$WS][^$WS][^$WS]
xWS64 = $xWS8$xWS8$xWS8$xWS8$xWS8$xWS8$xWS8$xWS8 xWS8 #unset 8
xWS384 = $xWS64$xWS64$xWS64$xWS64$xWS64$xWS64 xWS64 #unset 64
xWS1152 = $xWS384$xWS384$xWS384 xWS384 #unset 384
:0:
* ^Content-Type:.*/html
* B ?? > 100000
* $ B ?? $xWS1152.*$*.*</title>
spampile
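For anyone who wants to try the idea against sample messages without running procmail, here is a rough Python sketch of the same three tests. It is only an approximation under stated assumptions: the function name is_title_flood is made up for illustration, the body size is counted in characters rather than bytes, and the trailing part of the last condition is simplified to "a closing title tag appears somewhere after the long run". As I read procmailrc(5), a negated bracket expression also excludes newline, so the Python atom spells that out.

```python
import re

# Build the pattern the way the rc file does, by repeated concatenation.
# Assumption: the atom lists "\n" explicitly so the run stays on one line.
ATOM = r"[^ \t\n]"
xWS8 = ATOM * 8
xWS64 = xWS8 * 8
xWS384 = xWS64 * 6
xWS1152 = xWS384 * 3
RUN = re.compile(xWS1152)


def is_title_flood(content_type: str, body: str) -> bool:
    """Rough stand-in for the three conditions of the recipe."""
    if "/html" not in content_type.lower():    # * ^Content-Type:.*/html
        return False
    if len(body) <= 100000:                    # * B ?? > 100000
        return False
    run = RUN.search(body)                     # 1,152 non-whitespace chars
    # Simplification: only require a closing title tag after the run.
    return run is not None and "</title>" in body[run.end():].lower()


if __name__ == "__main__":
    spam = "<title>\n" + "x" * 100000 + "\n</title>\n<div>buy now</div>\n"
    ham = "<title>Hello</title>\n" + "word " * 30000
    print(xWS1152.count(ATOM))                 # atoms in the expansion
    print(is_title_flood("text/html", spam))
    print(is_title_flood("text/html", ham))
```

The nested multiplication (8 x 8 x 6 x 3) is what keeps the rc file short while the expanded pattern still demands 1,152 characters in a row.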
Actually, we could look for the full 100K if we wanted to now
without needing more LINEBUF. Last condition above would simply
be:
* $ B ?? $xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152$xWS1152$xWS1152$xWS1152\
$xWS1152$xWS1152
But this seems unnecessary to me. (The count there
is 100,224, for what it's worth.)
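The count is easy to re-check. A small sketch, assuming the condition is laid out as 17 rows of five copies plus a final row of two:

```python
# Layout of the long condition: 17 rows of 5 copies, then a row of 2.
full_rows, per_row, last_row = 17, 5, 2
copies = full_rows * per_row + last_row
total = copies * 1152          # each copy demands 1,152 characters
print(copies, total)           # 87 100224
```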
Dallman