procmail

Re: matching words that are laced with html

2003-10-30 09:44:35
At 12:21 2003-10-30 +0100, Dallman Ross wrote:
[snip]

Although I didn't spot it stated outright, I believe what Dallman is saying is that, rather than expecting to match on the drug keyword, the fact that you match a lot of HTML COMMENTS in an EMAIL should be sufficient to tag it as spam. Go ahead and look for your drug keywords, but separately check for an "abundance of comments" (use a threshold: I've seen a few HTML mails that still had a few legit comments in them). Then it doesn't matter if you fail to match on the drug keywords.

The recipe I use for this (note that I use it as a heavily-weighted spammishness factor, but there are ALWAYS a few other things wrong with these messages):


# Hokey HTML commenting
# We threshold at 10 comments, and we avoid running this on mammoth messages
# NOTE: if you're a webdev and someone is sending you an email with a new
# page layout, this could be a problem.  Of course, it's a good idea to
# simply greenlist your dev team...
#
# Scoring: the score starts at -10 and each "<!" found in the body adds 1,
# so the recipe only fires once the total goes positive (more than 10
# matches).  The size condition skips messages of 25000 bytes or more.
:0
* < 25000
* -10^0
* 1^1 B ?? (<!)
{
        SPAMVAL="+175"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Advisory - abundance of HTML comment constructs${NL}"
}


This, however, DOES NOT identify bogus HTML tags, that is, words broken up by something other than a comment: a made-up tag (or, for that matter, a VALID one), such as:

<randomsequence>ban<differentrandomsequence>k mort<anotherrandomsequence>gages

If you need to catch those, it's worth noting that piping the message through Lynx DOES eliminate these tags.
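
A minimal, untested sketch of that approach might look like the following. It assumes lynx and grep are on procmail's PATH; the keyword, the +100 weight, and the size cap are placeholders for illustration, not tuned values:

# Sketch only: render the message through Lynx, then grep the rendered text.
# -dump writes the formatted page to stdout, -stdin reads the page from stdin,
# -force_html treats the input as HTML, -nolist leaves out the link list.
# 'mortgage' is a placeholder keyword; +100 is a placeholder weight.
:0
* < 25000
* ? lynx -dump -stdin -force_html -nolist | grep -i -q 'mortgage'
{
        SPAMVAL="+100"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Advisory - keyword found after HTML rendering${NL}"
}

Because Lynx sees the message as delivered, a base64 or quoted-printable body would have to be decoded first; that step is not shown here.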
---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
