procmail

Re: HTML float style

2009-12-03 04:22:33


Professional Software Engineering wrote:

I had some aliases which had been forwarding mail to other users of a
domain hosted on one of my servers, and one of them recently
complained about how much spew was being forwarded (despite a few
DNSBLs).  Since they were forwarded directly from the MTA, and not
locally delivered, they weren't subject to filtering, and my extreme
procmail filtering is on my own host, not on the one I run friends'
mail through.

Anyway, changed the aliases to pipe through procmail before forwarding
(with an appropriate envelope change) and in the process set things up
so I could examine some of the spew before adding a few choice filters.

I noted a fair number of the HTML based spams are using span tags
along with a float style - intended to split commonly filtered pharma
words so that you can't match them easily.  However, I don't see this
technique applied to legitimate messages.  To protect against
accidentally flagging a legit message that might happen to use FLOAT
for its intended purpose, I give the recipe an initial negative score
-- all the spams have a *LOT* of these floats, while a legitimate
message that happens to use float in a span probably won't use it a lot.

    The number of meaningless HTML combinations is unlimited.  I have
    heard of programs that tidy/clean HTML files.  Maybe you can pipe
    the body through a program like that and check the result?  Say, if
    the tidied document shrinks by 80%, you can treat it as spam.

Bye,
  Udi


Within my own corpus, the spams tend to have enough other indicators
that they've been classified as spam without this, but this is a
pretty good test by itself - enough so that this and a couple of other
tests tacked into the procmailrc for the forwarded messages seem to
be catching most of the stuff that gets through the DNSBLs (though one
of the tests is a "does the relay appear on more than x of these
secondary DNSBLs?" <g>).  It's a bit heavy because of the body scan, but
that could be mitigated by employing a message size condition.

Generally, I don't like to have to dip into the message body, but
increasingly that's necessary to get all of the goop out.

:0
* -10^0
* 1^1 B ?? ()<span[^>]*style="[^"]*FLOAT:
{
    # HTML spam breaking filtered words up using float to move text around
}
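
As a rough sketch of the message size condition mentioned above, the
same recipe could be guarded like this.  The 262144-byte limit and the
FLOATSPAM variable are assumptions for illustration only and are not
part of the original recipe.  With these weights the score starts at
-10 and each matching span adds 1, so the recipe should only match
once a message contains 11 or more such spans.

:0
# Assumed size cap: leave messages of 256 KB or more alone.  It is
# listed first so that large messages should fail here, before the
# body scan below is attempted.
* < 262144
* -10^0
* 1^1 B ?? ()<span[^>]*style="[^"]*FLOAT:
{
    # Hypothetical marker variable that a later recipe could test
    FLOATSPAM=yes
}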
---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail
