At 12:21 2003-10-30 +0100, Dallman Ross wrote:
[snip]
Although I didn't see it spelled out anywhere, I believe what Dallman is
saying is that, rather than relying on a match against the drug keyword, the
fact that you match a lot of HTML COMMENTS in an EMAIL should be sufficient
to tag it as spam. Go ahead and look for your drug keywords, but separately
check for an "abundance of comments" (I've seen a few HTML mails that still
had a few legit comments in them, hence a threshold rather than a single
match). Then it doesn't matter that you didn't match on the drug keywords.
The recipe I use for this (note I use it as a heavily-weighted spammishness
factor, but there are ALWAYS a few other things wrong with these messages):
# Hokey HTML commenting
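# We fire on more than 10 comment openers: the score starts at -10, each "<!"
# in the body adds 1, and the recipe matches once the total is positive.
# We also avoid running this on mammoth messages (25000 bytes or more).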
# NOTE: if you're a webdev and someone is sending you an email with a new
# page layout, this could be a problem. Of course, it's a good idea to
# simply greenlist your dev team...
:0
* < 25000
* -10^0
* 1^1 B ?? (<!)
{
  SPAMVAL="+175"
  SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
  SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Advisory - abundance of HTML comment constructs${NL}"
}
This, however, DOES NOT identify bogus HTML tags, that is, words broken up
by inserting something other than a valid HTML tag (or, for that matter, a
VALID one) between the letters, such as:
<randomsequence>ban<differentrandomsequence>k mort<anotherrandomsequence>gages
If you need to catch those, it's worth noting that piping the message through
Lynx DOES eliminate these tags.
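
A minimal sketch of the Lynx approach, offered as an illustration and not as
a recipe from my own rcfile. It assumes lynx is on the PATH, and the LINEBUF
value, the "<html" guard, the RENDERED variable name, the "bank mortgages"
test phrase, and the +100 weight are all made up for the example:

```procmail
# Hypothetical sketch: render the body through Lynx, then test the result.
# procmail truncates captured output at LINEBUF, so raise it first.
LINEBUF=32768

# Feed only the body (flag b) to lynx and capture its stdout in RENDERED.
# Skip mammoth messages and anything that doesn't look like HTML.
:0 b
* < 25000
* B ?? <html
RENDERED=| lynx -dump -stdin -force_html -nolist

# With the tags stripped, the split-up words read as plain text again.
:0
* RENDERED ?? bank mortgages
{
  SPAMVAL="+100"
  SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
  SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Advisory - keyword hidden by tags${NL}"
}
```

Note that lynx -dump wraps its output, so a phrase can end up split across
two lines; keeping the test to single words, or to short phrases, is the
safer bet.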
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.