Re: matching words that are laced with html

At 12:21 2003-10-30 +0100, Dallman Ross wrote:
[snip]

Although I didn't spot it stated anywhere, I believe what Dallman is saying is that, rather than expecting to match on the drug keyword, the fact that you match a lot of HTML COMMENTS in an EMAIL should be sufficient to tag it as spam. Go ahead and look for your drug keywords, but separately check for an "abundance of comments" (I've seen a few HTML mails that still had a few legit comments in them). Then it doesn't matter that you didn't match on the drug keywords.

The recipe I use for this (note I use it as a heavily-weighted spammishness factor, but there are ALWAYS a few other things wrong with these messages):



# Hokey HTML commenting
# We threshold at 10 comments, and we avoid running this on mammoth messages
# NOTE: if you're a webdev and someone is sending you an email with a new
# page layout, this could be a problem.  Of course, it's a good idea to
# simply greenlist your dev team...
:0
* < 25000
* -10^0
* 1^1 B ?? (<!)
{
        SPAMVAL="+175"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Advisory - abundance of HTML comment constructs${NL}"
}
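
The recipe above relies on three variables that this message does not define: NL, SPAMMISHNESS and SPAMNOTES. A minimal sketch of how they might be set up earlier in the rcfile follows. It is an assumption about the surrounding configuration, not a copy of the actual setup:

# Assumed setup, not taken from the original rcfile.
# NL holds a literal newline so each advisory note ends its own line.
NL="
"
# Start with an empty running tally and empty notes; later recipes append.
SPAMMISHNESS=""
SPAMNOTES=""

With that in place, each scoring recipe appends its own weight and advisory text, and a later recipe can act on the accumulated values.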

This, however, DOES NOT identify bogus HTML tags, that is, words broken up by something other than a valid HTML tag (or, for that matter, by a VALID one), such as:


<randomsequence>ban<differentrandomsequence>k mort<anotherrandomsequence>gages

If that's a need, then it's worth noting that piping the message through Lynx DOES eliminate these tags.
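
A rough sketch of what that could look like in an rcfile, assuming lynx is installed and on procmail's PATH. The variable name RENDERED, the keyword and the "+100" weight are made up for illustration. Captured output is, as far as I know, limited by LINEBUF, so it is raised here as a precaution:

# Illustrative only: RENDERED, the keyword and the weight are hypothetical.
# Allow a larger captured value than the default buffer would hold.
LINEBUF=32768

# Feed only the body to Lynx and keep the rendered plain text.
# Same size guard as the comment-counting recipe.
:0 b
* < 25000
RENDERED=| lynx -dump -stdin -force_html -nolist

# Test the rendered text, where the bogus tags are already gone.
:0
* RENDERED ?? mortgages
{
        SPAMVAL="+100"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Advisory - keyword found after HTML rendering${NL}"
}

Lynx wraps and indents its dump output, so a single word is a safer pattern than a phrase that might straddle a line break.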

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

