At 08:39 2003-09-12 -0400, R A Lichtensteiger wrote:
[originally offlist, but we've agreed to pop it back onlist]
I wrote some recipes like that a while ago and found that the one that
nails future dates is a good indicator, but that past dates were almost
always a false indicator.
How does your experience compare?
I score date problems as "spammish", and as a result, a funky date in and
of itself isn't enough to identify something as junk, so even false
positives aren't a problem - it's all taken in conjunction with other
characteristics of the message. This allows me to be a bit more arbitrary
about my use of the filter - it doesn't have to be 100% because I'm really
not likely to lose legitimate email because of it.
I also score for INVALID date formats (which typically seem to have some
bogus text describing what the timezone is) - they seem to almost
universally be spam, though I merely score them with a higher spammishness
score.
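
As a rough illustration of the invalid-format check (a Python sketch, not
the procmail recipe itself), the test below accepts only a simplified
RFC 2822-style Date: value and scores anything else. The pattern is a
deliberate simplification of the full grammar, and INVALID_DATE_SCORE is a
hypothetical figure, since no actual value is given here.

```python
import re

# Simplified RFC 2822-style Date: pattern (not the complete grammar).
_DAY = r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)"
_MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
_ZONE = r"(?:[+-]\d{4}|UT|GMT|[ECMP][SD]T)"
DATE_RE = re.compile(
    rf"^\s*(?:{_DAY},\s+)?\d{{1,2}}\s+{_MONTH}\s+\d{{2,4}}\s+"
    rf"\d{{2}}:\d{{2}}(?::\d{{2}})?\s+{_ZONE}\s*(?:\([^()]*\))?\s*$"
)

INVALID_DATE_SCORE = 150  # hypothetical value, chosen only for illustration


def invalid_date_score(date_header: str) -> int:
    """Return 0 for a well-formed Date: value, else the penalty score."""
    return 0 if DATE_RE.match(date_header) else INVALID_DATE_SCORE
```

A value such as "Tue, 9 Sep 2003 22:57:55 Pacific Daylight Time" fails the
zone part of the pattern and picks up the penalty, while the same date with
"-0700 (PDT)" passes.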
Messages dated < 200K sec BEFORE reception tend to be list-delayed mail or
mail from twits with erratic clocks, but I have an 18H threshold there
anyway (yes, less than 3D or 5D - but as I said, I'm using it as an
indicator, not an absolute). Bugtraq for instance seems to frequently have
140Ksec+ delays (that list strips incoming Received: headers, so it's
difficult to determine exactly where the delay was inserted, but it isn't
critical because the single characteristic isn't enough to flag it as spam).
Very LARGE lags in the clock seem to be indicative of spam:
SPAM: +100+100 Date is suspicious at 121651249 seconds {312 00:00:49} BEFORE reception
SPAM: +100+100 Date is suspicious at 121651249 seconds {312 00:00:49} BEFORE reception
Curiously, both of those are from _SEPARATE_ messages from the same spammer
and are messages sent at different times.
I threshold advanced clocks at +2H, since it seems most legit mail which
has an advanced clock skew is under about 5K seconds (about 1.4 hours),
which can sometimes be attributed to morons having their machine set to the
wrong timezone.
Aside from skews below that threshold, pretty much any advancement of the
clock is a consistent indicator of spam. Reviewing filtered messages since
the beginning of this month, I see that a clock in excess of +2H has been
spam in every instance except one, a bugtraq message
("SRT2003-09-11-1120 - setgid man MANPL overflow"). Because the date
characteristic is merely contributory, that message was NOT classified as
spam. All the others, however, suffered from MULTIPLE spam characteristics,
for example:
SPAM: +125 Single received header for foreign sender
SPAM: +135 Advisory - relayed through backup MX
SPAM: +300 Foreign character set encoding (Windows-1250) in body.
SPAM: +100+100 Date is suspicious at 2678343 seconds {030 23:59:03} AFTER reception
SPAM: +75 Advisory - no non-list cleartext recipient matching X-Envelope-To
SPAM: +249+58 Subject Scoring match 58
SPAM: +(249*0.75) text/html ONLY
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 1577.75
SPAM: spammishness exceeds threshold of 249
INFO: SpamFilter v03.05.00 SBS 20030517/1243
From gold(_at_)web2mail(_dot_)com Tue Sep 9 22:57:55 2003
Subject: Do YOU know how to earn lot of money on gold rate change?
Folder: gzip -9fc >> spam.gz 2440
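
The arithmetic behind that report can be checked with a few lines of Python
(a worked sum only, not the filter itself). The labels are shortened from
the log lines, and the assumption is that the scores simply add up and are
then compared against the 249 threshold.

```python
THRESHOLD = 249  # threshold quoted in the log excerpt

# (label, score) pairs taken from the SPAM: lines of the excerpt.
contributions = [
    ("Single received header for foreign sender", 125),
    ("Relayed through backup MX", 135),
    ("Foreign character set encoding", 300),
    ("Date suspicious (base + far-out bonus)", 100 + 100),
    ("No matching cleartext recipient", 75),
    ("Subject scoring match", 249 + 58),
    ("text/html ONLY", 249 * 0.75),
    ("Abundance of triggers", 249),
]

spammishness = sum(score for _, score in contributions)
is_spam = spammishness > THRESHOLD

print(spammishness)  # 1577.75, matching the advisory line in the log
```

The date test contributes 200 of the 1577.75 total, so this message would
have exceeded the threshold several times over without it.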
If a message is dated more than 18H BEFORE or more than 2H AFTER reception,
I add 100 to my spammishness. If it's >72H out in either direction, I add
an additional 100.
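
That date rule can be sketched in Python as follows (an illustration, not
the procmail recipe). The function name and the use of epoch seconds are
assumptions, as is the treatment of a skew exactly on a limit as acceptable.

```python
HOUR = 3600
BEFORE_LIMIT = 18 * HOUR  # Date: may lag reception by up to 18H
AFTER_LIMIT = 2 * HOUR    # Date: may lead reception by up to 2H
FAR_LIMIT = 72 * HOUR     # beyond this, in either direction, score extra


def date_skew_score(date_epoch: int, received_epoch: int) -> int:
    """Score a message by how far its Date: is from the reception time."""
    skew = date_epoch - received_epoch  # positive means AFTER reception
    if -BEFORE_LIMIT <= skew <= AFTER_LIMIT:
        return 0
    score = 100
    if abs(skew) > FAR_LIMIT:
        score += 100  # the "+100+100" seen in the log lines
    return score
```

With these limits, both skews quoted in the logs (121651249 seconds BEFORE
and 2678343 seconds AFTER) score 200, while a 140Ksec list delay scores 100
and a 5K second advance scores nothing.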
Overall, what I have has been working wonderfully for me - just 5 spams so
far this month have actually gotten past my filters, and three of those
were some eBay scam received nearly concurrently (for which my spewhosts
filter has been updated; that filter adds a score based on whether the
message appears to have passed through a mailserver associated with the
domain of the From: address, and is used to flag potential forgeries).
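
The idea behind that forgery check can be sketched in Python (this is NOT
the spewhosts filter, whose internals aren't shown here). The function
names, the matching rule, and FORGERY_SCORE are all hypothetical: the sketch
just looks for the From: domain as a hostname suffix in any Received: line.

```python
import re
from email.utils import parseaddr

FORGERY_SCORE = 100  # hypothetical value, for illustration only


def passed_through_sender_domain(from_header, received_headers):
    """True if any Received: line names a host in the From: domain."""
    _, addr = parseaddr(from_header)
    domain = addr.rpartition("@")[2].lower()
    if not domain:
        return False
    # Match the domain as a whole host name or as a host name suffix.
    pattern = re.compile(
        r"(?:^|[\s.(\[])" + re.escape(domain) + r"(?=[\s)\];]|$)",
        re.IGNORECASE,
    )
    return any(pattern.search(line) for line in received_headers)


def forgery_score(from_header, received_headers):
    """Score messages that never touched a host in the sender's domain."""
    if passed_through_sender_domain(from_header, received_headers):
        return 0
    return FORGERY_SCORE
```

A real check would have to allow for outsourced mail and for lists that
strip Received: headers, which is one more reason to treat the result as a
contributory score rather than a verdict.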
In fact, both of the other two spams would now be tagged, because I have
since expanded some subject keyword filters (adding prostitute and
underwear), narrowed the advanced clock threshold (from +18H to +2H), and
bumped up the scoring for invalid date formats.
I also recently modified the recipes to allow for a list skew of 24H if a
LISTNAME variable has been defined, so there's an automatic allowance for
delays on discussion lists (which in my system already get a boost to their
allowed spammishness threshold). This sharply reduces the number of entries
in my logfile when handling lists such as bugtraq. (I have a spam report
emailed daily, and it includes messages which were spammish but not
strictly tagged as spam, so I can see how close iffy messages are.)
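
One way to read the list allowance, sketched in Python: the 24H list skew
is ADDED to the normal 18H limit for dates before reception whenever a list
name is set. That additive reading is an assumption (the allowance could
equally replace the limit), and the function name is hypothetical.

```python
HOUR = 3600
BASE_BEFORE_LIMIT = 18 * HOUR  # normal allowance for dates BEFORE reception
LIST_SKEW = 24 * HOUR          # extra allowance when a list is identified


def before_limit_seconds(listname):
    """Allowed lag in seconds; listname stands in for the LISTNAME variable."""
    extra = LIST_SKEW if listname else 0  # assumption: additive, not a swap
    return BASE_BEFORE_LIMIT + extra
```

Under this reading a list message may lag by 151200 seconds (42H) before
it is scored, which would cover a 140Ksec bugtraq delay; non-list mail keeps
the 64800 second (18H) limit.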
Dates are but one characteristic of my filtering, and they've been useful
thus far.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.