Re: A few rule questions
At 15:57 2003-12-14 -0500, JoeHill wrote:
Note that as I explain on my website, I use what's called "SPAMMISHNESS" -
a running score, with a threshold I define at which a particular message is
deemed spam. Then, different rules which are intended to identify spammy
traits can add a number to the running total, allowing me to use less
definite things as indicators - I wouldn't heave a piece of mail just
because the date stamp shows it a day older than when I received it, but
it's certainly good for adding a few points towards the message being
identified as spam. The same goes for freemail messages which don't appear
to have actually come through the freemail service they claim to be from.
Through the use of spammishness, you can make use of criteria which aren't
always indicators of spam - perhaps your mother is one of those
idiots^H^H^H^H^Hpeople who send HTML-only email (instead of plain text with
an alternate HTML copy attached, which is how a legit mailer would do
it). You can still use that as a spammish indicator (even without having
to whitelist your mother), so long as the score you assign to it doesn't
exceed the threshold. I have a threshold of like 250, and quite a few
recipes that add only 25 to 40 points to the spam score - these are there
to push the bigger offenders over the limit on some other criteria, or if
there's a LOT of such little things wrong with a message, there is another
rule that'll add more points just because of the number of problems.
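As a rough illustration of the running-total idea, here is a Python sketch
(not the procmail recipes themselves). The 250 threshold and the 100/175
date weights come from this message; the rule names, the other weights, and
the many_count / many_bonus values for the "lots of little things" rule are
invented for the example.

```python
THRESHOLD = 250  # "like 250", per the message

# Rule name -> points. Only the two date weights are taken from the
# message; the rest are made-up values in the 25-40 range it mentions.
WEIGHTS = {
    "html_only": 30,
    "forged_freemail": 40,
    "odd_subject": 25,
    "date_stale": 100,
    "date_invalid": 175,
}


def spammishness(hits, weights=WEIGHTS, many_count=4, many_bonus=50):
    """Sum the points of every rule that fired.

    many_count and many_bonus are hypothetical stand-ins for the rule
    that adds extra points purely because of the number of problems.
    """
    score = sum(weights[name] for name in hits)
    if len(hits) >= many_count:
        score += many_bonus
    return score


def is_spam(hits, threshold=THRESHOLD):
    # Assumption: a message is spam only when the score exceeds the threshold.
    return spammishness(hits) > threshold
```

With these numbers an HTML-only message from your mother scores 30 and
passes, while an invalid date plus a stale date (275) goes over the limit.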
But I *could* make them into separate recipes, one for To, one for Cc, though
this would not be as elegant as below, of course, no?
It wouldn't allow you to accurately _count_ the matches, or allow for a
variable number of contributors between the two headers. Yes, if you
wanted to check for three in either header INDIVIDUALLY, you could do that,
but that's not the same as a _total_ of three or more.
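The difference between a per-header check and a combined total can be shown
with a short Python sketch. This is an illustration only, not the recipe
under discussion; count_in_domain is a made-up helper and "sympatico.ca" is
assumed to be the domain meant.

```python
from email import message_from_string
from email.utils import getaddresses


def count_in_domain(msg, domain, headers=("To", "Cc")):
    """Count recipients at `domain` across all the given headers combined."""
    values = []
    for name in headers:
        values.extend(msg.get_all(name, []))
    suffix = "@" + domain.lower()
    return sum(
        1 for _, addr in getaddresses(values)
        if addr.lower().endswith(suffix)
    )


raw = (
    "To: a@sympatico.ca, b@example.org\n"
    "Cc: c@sympatico.ca, D@Sympatico.CA\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
total = count_in_domain(msg, "sympatico.ca")
to_only = count_in_domain(msg, "sympatico.ca", headers=("To",))
cc_only = count_in_domain(msg, "sympatico.ca", headers=("Cc",))
```

Here neither header reaches three on its own, but the total across both does,
which is the case separate To and Cc recipes would miss.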
That's a beaut, thanks. I *never* get legit mail which is addressed/cc'd
to more than 2 people in the Sympatico domain. In fact, I can't remember
the last time I got a legit mail which was addressed/cc'd to *only* 2
people in the Sympatico domain. 99% of my mail is from lists or people who
run their own mailservers (i.e. not newbs like me).
I have a recipe sequence that identifies the recipient name in an email and
checks for duplication of that username in other recipients - some spammers
send messages to a series of "joe(_at_)domain, joe(_at_)domain2, joe(_at_)domain3" addresses.
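The duplicated-username check might look something like this in Python. It
is a sketch of the idea only (the actual recipe sequence is not shown in
this message), and same_user_elsewhere is a hypothetical name.

```python
from email.utils import getaddresses


def same_user_elsewhere(header_values, my_address):
    """Count the other domains at which my own username also appears.

    header_values is a list of raw To:/Cc: header strings.
    """
    my_local, _, my_domain = my_address.lower().rpartition("@")
    other_domains = set()
    for _, addr in getaddresses(header_values):
        local, sep, domain = addr.lower().rpartition("@")
        if sep and local == my_local and domain != my_domain:
            other_domains.add(domain)
    return len(other_domains)


recipients = [
    "joe@domain.example, joe@domain2.example",
    "joe@domain3.example, amy@domain.example",
]
dupes = same_user_elsewhere(recipients, "joe@domain.example")
```

How many duplicates should count, and for how many points, is left open
here since the message doesn't say.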
Bogus dates! Brilliant! If a piece of mail takes 3 days to get to me, it
probably ain't worth reading anyway, right? Love it.
Well, there's old or advance-dated mail, then there's mail where the Date:
field can't be parsed by the unix 'date' program as valid. Both are useful
indicators. I even score higher for WAY out of range dates, and have an
allowance for list-delivered email to lag more. A lag of >18 hours is iffy,
worth 100 points, and >72 hours is 100 more points. If the date is more
than 2 hours ADVANCED of my clock (time zones are already factored in),
that's 100 points. An invalid date header is worth 175.
So, with the setting I'm using, a bogus date header all by itself isn't
enough to trash a message as spam - but there's almost _ALWAYS_ a number of
other spammish characteristics about a spam message.
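For reference, the date tiers described above can be sketched in Python as
follows. Note the swap: this uses Python's email.utils parser where the
message uses the unix 'date' program, so what counts as "invalid" may differ.
list_allowance_hours is a hypothetical parameter, since the size of the
list-mail allowance isn't stated.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def date_points(date_header, received, list_allowance_hours=0):
    """Score a Date: header against the (timezone-aware) time of receipt."""
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return 175                      # Date: header can't be parsed
    if sent.tzinfo is None:
        # "-0000" yields a naive datetime; treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)

    lag = received - sent
    if lag < -timedelta(hours=2):
        return 100                      # more than 2 hours in the future

    points = 0
    if lag > timedelta(hours=18 + list_allowance_hours):
        points += 100                   # iffy
    if lag > timedelta(hours=72 + list_allowance_hours):
        points += 100                   # way out of range
    return points


hdr = "Sun, 14 Dec 2003 15:57:00 -0500"          # 20:57 UTC
a_day_later = datetime(2003, 12, 15, 20, 57, tzinfo=timezone.utc)
stale = date_points(hdr, a_day_later)
```

None of these results reaches a 250 threshold alone, which matches the point
that a bogus date needs company from other indicators.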
...and of course as soon as I implement this rule, this particular piece
of spam will die out...heh. So far I'm getting one a day though.
It'll probably resurface again though. I can't say I've ever received any
which conformed to that.
Good point. That explains why even though I have some rules that check for
"viagra" in the body (a lot simpler, you would think), they still come through.
Checking the body is also costly, processor-wise, as you're scanning it
over and over again looking for each keyword.
You'll enjoy a perusal through the procmail list archives (which are
searchable - see the link on the procmail homepage), where you'll find a
great many antispam rules. Abundance of symbols or runs of whitespace in
the subject; recipient username identified in the subject; apparent website
in the subject; etc.
I think, based on your advice, I'll leave the body checks out :-)
Well, there are times they are useful, and times they aren't. When you get
more familiar with procmail, you can do things like:
# only for messages less than 30K in size
* < 30000
* CLEANBODY ?? (plain|keywords)
No, there isn't a ready-made html_base65_scrubber kicking around, though
there are some adaptable programs, such as lynx and mimencode.
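To show what a size-limited check against a cleaned-up body involves, here
is a Python sketch. It swaps in Python's email package for external tools
like lynx or mimencode, so it is not the setup described here; the keyword
list is an example only, and the tag-stripping regex is deliberately crude.

```python
import re
from email import message_from_string

MAX_SIZE = 30000                                   # "less than 30K"
KEYWORDS = re.compile(r"viagra|mortgage", re.IGNORECASE)  # example words only
TAGS = re.compile(r"<[^>]*>")                      # crude HTML tag stripper


def clean_body(msg):
    """Decode each text part (base64 / quoted-printable) and drop HTML tags."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:                        # unknown charset name
            text = payload.decode("ascii", errors="replace")
        chunks.append(TAGS.sub("", text))
    return "\n".join(chunks)


def body_hit(raw_message):
    """True if a small message's cleaned body contains a keyword."""
    if len(raw_message) >= MAX_SIZE:
        return False                               # skip the costly scan
    msg = message_from_string(raw_message)
    # One pass with a single alternation, rather than one scan per keyword.
    return bool(KEYWORDS.search(clean_body(msg)))
```

Decoding first matters because a plain match against the raw body won't see
a keyword hidden in base64, or one split up by empty HTML tags.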
lifetime of learning, at least for me. Main point is, if I can just keep
use of the delete key down to once or twice a day, I'll consider it a victory!
I receive on average 600-700 email messages in my inbox each day
(well, that many which actually reach my MUA). In there, I get 6-8 spams a
month, and that number has been tapering off. I use DNSBLs at the MTA
level, and my own collection of procmail recipes.
I also don't receive viruses, because executable attachment types are
shuttled off in a server-global procmailrc with an advisory notice
forwarded to the intended recipient.
BTW, I had a good chuckle over the Red Hat comments in the disclaimer page,
though I'm hesitant to ask what you think of Mandrake...;-)
Mandrake's main liability is funding (go commercial or remain fully open
source - the somewhere-in-between state isn't really good). I've used
Slackware for most of my own boxes for a very long time (and I compile
everything - I don't use packages or RPMs), while I administer a small
fleet of FreeBSD systems and a Debian box or two.
My beef with Red Hat hinges on their desire to do everything
_differently_ than everybody else, which doesn't make for a portable or
familiar setup, and their repeated demonstrations of an inability to
produce a thoroughly tested distro (shipping a bogus version of a C
compiler in a boxed version is inexcusable in my book).
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.