Re: HTML float style

2009-12-01 23:53:43
At 21:01 2009-12-01 -0500, Eric Wood wrote:
I've been running a similar rule for years. Sometimes a MS Office document would get caught but that's rare for me.
## Check for emails which have many float: and single char divs
:0
* -11^0
* 1^1 ()(float |div> . <)

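I'm fairly certain you missed a B flag or a B ?? in the condition. Without one, procmail tests conditions against the header only, so the recipe never sees the HTML body.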

I would think you would want to ensure the float was part of a style (sort of like how I'm doing it), and the div should be inclusive of the open tag. Your spaces, BTW, will be interpreted as mandatory spaces in there.

* 1^1 B ?? ()<div[^>]*>[        ]?.[    ]?<

That negated character class ([^>]*) allows the div to carry optional attributes, such as a style, class, or id. Eliminate the bracketed space+tab if they're actually not wanted.
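
For anyone wanting to poke at that expression outside procmail, here is a rough Python approximation. Python's re module is not procmail's regex engine, so treat it as a sketch: IGNORECASE stands in for procmail's default case-insensitive matching, [ \t] for the bracketed space+tab, and the name SINGLE_CHAR_DIV and the sample strings are made up for illustration.

```python
import re

# Approximation of:  ()<div[^>]*>[ \t]?.[ \t]?<
# [^>]* lets the opening tag carry attributes (style, class, id, ...).
SINGLE_CHAR_DIV = re.compile(r"<div[^>]*>[ \t]?.[ \t]?<", re.IGNORECASE)

# Hypothetical samples, not taken from real mail.
samples = [
    '<div>x<',                     # bare div, one character
    '<div class="a"> x <',         # attributes plus optional padding
    '<DIV style="float:left">q<',  # matching is case-insensitive
    '<div>four<',                  # more than one character: no match
]

hits = [bool(SINGLE_CHAR_DIV.search(s)) for s in samples]
for s, hit in zip(samples, hits):
    print(f"{hit!s:5}  {s}")
```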

I'll set that up in an analysis filter to watch how often it matches new mail. In the meantime I threw a sizeable spam corpus at the above div recipe (sans float), and I got only single-event matches on a few messages (i.e. not enough to overcome the negative prep). I suspect the single-character div doesn't occur often enough to carry a recipe by itself. Oh, and I counted hits for both your recipe (sans float) and my revision: yours got *ZERO* hits, because the divs that did match had only a single character between them and the next tag (which wasn't a div closure, BTW):

        <div align=center> <a href="
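
To make the difference concrete, here is a rough Python sketch of how the two div conditions score against that line. It is an approximation, not procmail itself: re stands in for procmail's regex engine, the helper score() is hypothetical, and it assumes that with 1^1 each non-overlapping match adds 1 to the -11 starting score, with the recipe firing only on a positive total.

```python
import re

# Eric's div alternative: literal "div>", then space, any char, space, "<".
ORIGINAL = re.compile(r"div> . <", re.IGNORECASE)
# Revised condition: attributes allowed, padding optional.
REVISED = re.compile(r"<div[^>]*>[ \t]?.[ \t]?<", re.IGNORECASE)

def score(pattern, body, start=-11, weight=1):
    """Hypothetical helper: start score plus weight per match (x = 1)."""
    return start + weight * len(pattern.findall(body))

line = '<div align=center> <a href="'

original_once = score(ORIGINAL, line)        # no "div>" in the line
revised_once = score(REVISED, line)          # a single hit
revised_twelve = score(REVISED, line * 12)   # twelve hits

for name, total in [("original x1", original_once),
                    ("revised x1", revised_once),
                    ("revised x12", revised_twelve)]:
    print(f"{name:12} score {total:+d}  fires: {total > 0}")
```

Under those assumptions a message needs at least twelve such hits before the recipe fires, which squares with single-event matches never overcoming the prep.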

Also worth noting, though virtually impossible to check for reliably in a procmail recipe, is that the divs in those messages were not balanced - there were more opening divs than closing ones. Spammers can't even craft HTML with proper syntax. Whatever has become of the work ethic?
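
Outside procmail, a crude balance check is easy enough to sketch, for example in a script the body is piped to. The Python below is just that, a sketch: it counts tags with regexes rather than parsing HTML, so commented-out or quoted markup will fool it, and the function name div_imbalance is made up.

```python
import re

OPEN_DIV = re.compile(r"<div\b[^>]*>", re.IGNORECASE)
CLOSE_DIV = re.compile(r"</div\s*>", re.IGNORECASE)

def div_imbalance(body):
    """Opening divs minus closing divs; positive means unclosed divs."""
    return len(OPEN_DIV.findall(body)) - len(CLOSE_DIV.findall(body))

# Hypothetical body in the style of the sample line quoted earlier.
body = '<div align=center> <a href="x">y</a><div>z</div>'
print(div_imbalance(body))
```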

A side observation is that, with only a couple of exceptions, all of the messages that did have any events had the same basic subject line involving pharma and a varying percentage off. This wasn't merely a scatter of hits over one or two days either.

I was recently experimenting with something to try to weigh how many short words (really, letter jumbles) there were in a message as compared to longer ones. I've seen a certain amount of spew which has in the text portion a lot of 2-4 character jumbles in a paragraph, with very few longer jumbles. However, they tended to be the text portion of a multipart which included an HTML portion which, surprise, used float...
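
A minimal sketch of that kind of weighing, in Python rather than procmail, might look like the following. It is an illustration, not the experiment described above: the 2-4 character range comes from the description, while the function name short_word_ratio and the sample text are made up.

```python
import re

def short_word_ratio(text, lo=2, hi=4):
    """Fraction of alphabetic tokens whose length is within lo..hi."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    short = sum(1 for w in words if lo <= len(w) <= hi)
    return short / len(words)

# Hypothetical jumble paragraph versus ordinary prose.
jumble = "qzx ab wert pol mnb xc lkj tyu bv dfgh"
prose = "Spammers frequently generate paragraphs containing random letters"

print(round(short_word_ratio(jumble), 2))
print(round(short_word_ratio(prose), 2))
```

Real prose is full of short words too ("the", "and", "of"), so any threshold on this ratio would need tuning against actual ham before it could feed a score.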

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail