At 21:01 2009-12-01 -0500, Eric Wood wrote:
> I've been running a similar rule for years. Sometimes a MS Office document
> would get caught but that's rare for me.
> ## Check for emails which have many float: and single char divs
> :0
> * -11^0
> * 1^1 ()(float |div> . <)
I'm fairly certain you missed a B flag or a B ?? in the condition.
I would think you would want to ensure the float was part of a style (sort
of like how I'm doing it), and the div should be inclusive of the open
tag. Your spaces, BTW, will be interpreted as mandatory spaces in there.
* 1^1 B ?? ()<div[^>]*>[ ]?.[ ]?<
That excluded character class allows the div to contain optional
identifiers, such as a style, class, or id. Eliminate the bracketed
space+tab if they're actually not wanted.
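To make the "B flag or B ??" choice concrete, here is a minimal sketch of the same recipe written both ways. The possible-spam folder name is a placeholder and not something from the thread; the -11 prep is carried over from the quoted recipe. With that prep, the recipe only fires once the div pattern has matched at least 12 times in the body, since the total score has to end up positive.

```procmail
# Form 1: B flag on the recipe line, so every condition greps the body.
:0 B
* -11^0
* 1^1 ()<div[^>]*>[ ]?.[ ]?<
possible-spam

# Form 2: no recipe-wide flag; "B ??" points just that one condition
# at the body, leaving any other conditions on the default (header).
:0
* -11^0
* 1^1 B ?? ()<div[^>]*>[ ]?.[ ]?<
possible-spam
```

Use one form or the other, not both. The second is the handier one if header conditions are going to be mixed into the same recipe later.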
I'll set that up in an analysis filter to watch for how often it matches
new mail, but I threw a sizeable spam corpus at the above div recipe (sans
float), and I only got single event matches on a few messages (i.e. not
enough to overcome the negative prep). I suspect the single-character div
doesn't occur often enough to be useful by itself. Oh, and I ran it to
count hits on your recipe (sans float) and my revision, and yours got
*ZERO* hits -- because the divs that were matching actually had only a
single character between them and the next tag (which wasn't a div closure,
BTW):
<div align=center> <a href="
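An analysis filter of the kind mentioned above could look roughly like the following: a non-delivering recipe that only logs how many events the div condition produced. This is a sketch, assuming LOGFILE is already set elsewhere in the rcfile; the log text is arbitrary. $= is procmail's read-only variable holding the score of the last recipe, which here equals the number of matches because there is no negative prep.

```procmail
# Analysis only: nothing is delivered, the message carries on
# to the rest of the rcfile.
:0
* 1^1 B ?? ()<div[^>]*>[ ]?.[ ]?<
{
  LOG="single-char div events: $=
"
}
```

The closing quote on its own line is deliberate: it puts a newline at the end of the log entry. Messages with no matches score 0, fail the recipe, and log nothing.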
Also worth noting, but would be virtually impossible to check for reliably
in a procmail recipe, is that the divs on those messages were not balanced
- there were more opening divs than closing ones. Spammers can't even
craft HTML with proper syntax. Whatever has become of the work ethic?
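For what it's worth, scoring can produce a crude surplus count of opening over closing divs, though it is nowhere near a reliable balance check. It counts raw substrings, so it sees nothing inside base64-encoded parts, it counts tags inside HTML comments, and it knows nothing about nesting. A sketch, with an arbitrary log message:

```procmail
# +1 for every "<div", -1 for every "</div"; the recipe matches
# only when the total is positive, i.e. more opens than closes.
:0
* 1^1 B ?? ()<div
* -1^1 B ?? ()</div
{
  LOG="surplus opening divs: $=
"
}
```

The leading () keeps procmail from reading the < at the start of the condition as a size comparison.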
A side observation is that, with only a couple of exceptions, all of the
messages that did have any events had the same basic subject line involving
pharma and a varying percentage off. This wasn't merely a scatter of hits
for one or two days either.
I was recently experimenting with something to try to weigh how many short
words (really, letter jumbles) there were in a message as compared to
longer ones. I've seen a certain amount of spew which has in the text
portion a lot of 2-4 character jumbles in a paragraph, with very few longer
jumbles. However, they tended to be the text portion of a multipart which
included an HTML portion which, surprise, used float...
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
____________________________________________________________
procmail mailing list Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail