I have a couple of recipe ideas that I'll probably implement to deal
with spam, but I wanted to throw them out here to see if people have
done them. If so, maybe I can be saved the work.
1) match on some regexp like this
"reply (with )?(the )?(word )?['"]?remove['"]? in the (subject|body)"
and put such messages wherever you like. this could be modified to
not match lines which had this as included text, to allow people to
send me mail that included a copy of some spam.
i wonder how well a recipe like this would do. if it were general
enough, it would certainly clobber quite a bit of the spam i get.
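
a rough sketch of idea 1 in python, just to make it concrete. it assumes that included text is marked with a leading ">" (my assumption, mailers differ), and it uses the regexp with the spaces moved inside the optional groups so that the short forms of the phrase still match. the function name is made up for illustration.

```python
import re

# The regexp from idea 1, with the spaces inside the optional groups so
# "reply with remove in the subject" matches as well as the long form.
REMOVE_RE = re.compile(
    r"reply (with )?(the )?(word )?['\"]?remove['\"]? in the (subject|body)",
    re.IGNORECASE,
)

# Assumption: included (quoted) text starts with ">" after optional whitespace.
QUOTED_RE = re.compile(r"^\s*>")


def has_remove_phrase(body):
    """Return True if any unquoted line contains the 'reply with remove' phrase."""
    for line in body.splitlines():
        if QUOTED_RE.match(line):
            continue  # someone forwarding a copy of spam should not be caught
        if REMOVE_RE.search(line):
            return True
    return False
```

a phrase that is wrapped across two lines would slip past this, so a real version would probably want to unfold the body first.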
2) look for regexps like these:
"^[^a-z]*!!!![ ]*$"
"!!!![ ]*$"
"!!![ ]*$"
"!![ ]*$"
"![ ]*$"
"!!!!"
"!!!"
"!!"
"!"
i'd like to look for these in (something like) this order, and
weight them too. what i have above is not correct, but it should
give the idea:
penalize the hell out of lines that
ARE ALL UPPERCASE WITH MANY TRAILING !!!!!
penalize them slightly less (but still a lot) for fewer trailing !!!s
penalize slightly less lines with mixed case which end with many !!!s
and then just penalize !'s in general, depending on repetition counts.
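
the ordering above could be sketched like this in python. the numeric weights are invented (only the ordering comes from the list), "many" is taken to mean four or more, and the function names are hypothetical.

```python
import re

TRAILING_BANGS = re.compile(r"(!+)$")
BANG_RUN = re.compile(r"!+")


def score_line(line):
    """Return a penalty for one line; the weights are illustrative guesses."""
    line = line.rstrip()  # ignore trailing blanks, like the [ ]*$ in the regexps
    m = TRAILING_BANGS.search(line)
    trailing = len(m.group(1)) if m else 0
    has_upper = any(c.isupper() for c in line)
    has_lower = any(c.islower() for c in line)
    shouting = has_upper and not has_lower

    if shouting and trailing >= 4:
        return 8  # all uppercase with many trailing !s
    if shouting and trailing >= 1:
        return 6  # all uppercase with fewer trailing !s
    if trailing >= 4:
        return 4  # mixed case ending in many !s
    # Otherwise penalize by the longest run of !s anywhere, capped at 3.
    longest = max((len(run) for run in BANG_RUN.findall(line)), default=0)
    return min(longest, 3)


def message_score(body):
    """Sum the per-line penalties over the whole body."""
    return sum(score_line(line) for line in body.splitlines())
```

a caller would compare message_score() against some cutoff, which would need tuning against real mail.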
i suppose this could be done with the weighted scoring mechanism in
procmail, but it might be better to write a standalone ! detector.
an even simpler approach would be to invoke something that simply
counted the !'s in a mail and returned 0 if the density was below
some threshold.
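
the simple counter might look like this. the threshold is a made-up number, density is taken as !'s per non-blank character (one possible definition), and the names are hypothetical.

```python
# Assumed threshold: 2% of the non-blank characters being "!" counts as dense.
BANG_THRESHOLD = 0.02


def bang_density(text):
    """Return the fraction of non-whitespace characters that are '!'."""
    visible = sum(1 for c in text if not c.isspace())
    if visible == 0:
        return 0.0
    return text.count("!") / visible


def exit_status(text, threshold=BANG_THRESHOLD):
    """Return 0 if the density is below the threshold, 1 otherwise."""
    return 0 if bang_density(text) < threshold else 1
```

a script built on this would end with something like sys.exit(exit_status(sys.stdin.read())) so that procmail can act on the exit code.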
my guess is that with this one mechanism in place, i could clobber
a LOT of spam, and pretty efficiently too (no need to fgrep on 800
domain names and add more domains as they multiply). at least in my
case, i don't have many people sending me mail which is as
liberally sprinkled with !'s as almost all the spam i get is.
this leads me to realize that what i might be looking for is a
standalone spam detector. a program which reads stdin, produces
nothing on stdout (unless you say -v which makes it give its reasons),
but which exits 0 or 1 depending on whether it guesses stdin was spam.
it could also look for density of uppercase in the message body, known
spam domains, known bad email addresses etc.
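
a minimal sketch of what such a detector could look like in python. everything specific here is an assumption: the thresholds, the example domain and address lists, the choice of exit 1 for spam and 0 otherwise, and the function names. it only shows the shape (read stdin, stay quiet unless -v, report through the exit status).

```python
import re
import sys

# Illustrative values only; real lists and thresholds would need tuning.
BANG_THRESHOLD = 0.02
UPPER_THRESHOLD = 0.5
SPAM_DOMAINS = {"spam.example"}
BAD_ADDRESSES = {"bulk@mailer.example"}

ADDRESS_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")


def reasons_for(text):
    """Return a list of reasons why the message looks like spam."""
    head, _, body = text.partition("\n\n")  # header ends at first blank line
    reasons = []
    visible = [c for c in body if not c.isspace()]
    letters = [c for c in body if c.isalpha()]
    if visible and body.count("!") / len(visible) >= BANG_THRESHOLD:
        reasons.append("high density of ! in body")
    if letters and sum(c.isupper() for c in letters) / len(letters) >= UPPER_THRESHOLD:
        reasons.append("high density of uppercase in body")
    for m in ADDRESS_RE.finditer(head):
        address, domain = m.group(0).lower(), m.group(1).lower()
        if address in BAD_ADDRESSES:
            reasons.append("known bad address " + address)
        elif domain in SPAM_DOMAINS:
            reasons.append("known spam domain " + domain)
    return reasons


def run(argv, stdin=sys.stdin, stdout=sys.stdout):
    """Read a message from stdin; return 1 if it looks like spam, else 0."""
    reasons = reasons_for(stdin.read())
    if "-v" in argv:  # only explain ourselves when asked
        for reason in reasons:
            print(reason, file=stdout)
    return 1 if reasons else 0
```

the script itself would finish with sys.exit(run(sys.argv[1:])). treating any single reason as enough is the crudest possible rule; a weighted total would fit the scoring idea better.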
this of course will blossom into a complete expert system.
although i'm not too much in favor of monolithic programs, having one
such as this would probably make all this spam processing much faster,
and would make spam detection/rejection via procmail much easier to
set up for new procmail users. just give them the source and a single
simple procmail recipe.
i'd be happy to receive comments. if no one has such a standalone
spam detector, i'll write one.
regards,
Terry Jones (terry(_at_)teclata(_dot_)es).