Okay, lately, the only stuff that's really been getting through my defences
has been forged hotmail and lycos trash. Normally, only a few messages in
any given week, but this morning, I had _three_ spams in my inbox. Fsck that!
Here, I endeavour to provide a reasonably tuneable checker for Received:
lines. I had a few simple regexps before, but some large ISPs relay mail
for several of their domains through mailservers named under just one of
them, which made it a bit too hit-n-miss for my taste (I had been
extracting a MATCH from a collection of domains, and then checking the
Received: for that specific domain - hotmail/msn often breaks that).
I'll readily admit that this recipe hasn't been subjected to extensive
testing at this point (though my testing thus far has resulted in _no_
false positives, and a pretty consistent hit on actual spew, as well as
spew which originally bypassed my filters), but I'm interested in feedback
from anyone interested in checking it out, including additional host pairs.
Instead of cut-n-paste from your email, you can download the necessary
files at:
<http://www.professional.org/procmail/spewhosts.html>
My .procmailrc file sets the basic framework for all of my mail filtering -
a large part of that is extracting certain frequently used bits of
information from the message headers. Like many of my recipes, this one
depends upon some of those variables being defined. Rather than repeat
them for the umpteenth time, just refer to my information on sandboxes
(which is linked from the above document, as well as my disclaimer).
ENVFROM, CLEANFROM, and FROM_DOMAIN are my common variables which this
recipe uses.
SPAMDIR is defined in my main spam.rc file, and is where spam-specific
files reside (lists of domains, etc), in an effort to keep the procmail dir
a bit less cluttered. This rcfile expects its datafile to be in that
directory.
The script works within my "SPAMMISHNESS" model, wherein all the spam rules
are invoked, even if some would have already indicated a message was
positively spam. Some factors are merely contributory, and several of them
taken together can indicate that a message is spam. I process them in an
additive fashion (as detailed in a past procmail list post). For the
purposes of this post, detailing all of that is unimportant - suffice it to
say, you can handle your flagged mail any way you want.
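To make the additive idea concrete outside of procmail, here is a minimal
Python sketch. It assumes the spammishness value is simply a string of
signed integers that gets summed at the end; the function name and the
sample values are made up for illustration, and how the total is actually
evaluated and acted upon is not covered in this post.

```python
import re

def total_spammishness(expr):
    """Sum a string of signed integers such as '+100+150-50'."""
    tokens = re.findall(r"[-+]\d+", expr)
    # Refuse anything that is not purely a run of signed integers.
    if "".join(tokens) != expr:
        raise ValueError("unexpected content in score string: %r" % expr)
    return sum(int(t) for t in tokens)

spammishness = ""        # starts out empty
spammishness += "+100"   # one rule contributes a default-sized score
spammishness += "+150"   # another rule contributes a tuned score
spammishness += "-50"    # a negative entry compensates
print(total_spammishness(spammishness))
```

Because every contribution carries its own sign, a rule can push the total
in either direction without the summing code needing to know which rule
produced it.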
LISTNAME is defined by a script I have which attempts to determine if the
message is from a recognizable list source, using X-headers,
listname-owner and owner-listname, etc constructs. The specifics of that
recipe aren't important here; what matters is that some lists _strip_
Received: headers off of submitted messages (even the procmail list used to
do that, which was frightfully annoying), which makes this recipe a bit
moot for them. Several of the technical lists I am subbed to do this, so I
elected to have a condition which disables this recipe if it is a
recognized list message.
There are two files - the hosts file and the recipe itself. The datafile
is first:
# spewhosts.list
#
# see bottom for comments.
#
altavista.com = altavista\.com
google.com = google\.com
msn.com = hotmail\.(msn\.)?com
hotmail.com +150 hotmail\.(msn\.)?com
yahoo.com = yahoo(mail)?\.com
yahoomail.com = yahoo(mail)?\.com
juno.com +150 juno\.com
rocketmail.com +150 rocketmail\.com
mailexcite.com +150 (mail)?excite\.com
lycos.com +150 lycos(email)?\.com
lycosemail.com +150 lycos(email)?\.com
mailcity.com = mailcity\.com
yahoo.co.jp = yahoo\.co\.jp
aol.com = aol\.com
delphi.com = delphi\.com
prodigy.com = prodigy\.(com|net)
prodigy.net = prodigy\.(com|net)
usa.com = usa\.com
sprintmail.com = sprintmail\.com
sprynet.com = sprynet\.com
gte.net = gte\.net
mci.net = mci\.net
concentric.net = concentric\.net
mindspring.net +50 (earthlink|mindspring)\.(com|net)
earthlink.net +50 (earthlink|mindspring)\.(com|net)
bellsouth.net = bellsouth\.net
worldnet.att.net = worldnet\.att\.net
compuserve.com = (cs|compuserve|aol)\.com
cs.com = (cs|compuserve|aol)\.com
# (comments at bottom for efficiency of matching)
#
# Note: the listing of any given host/isp in this file is *NOT* an indicator
# of spam. Quite the opposite actually - these domains are frequently FORGED,
# or can at least be expected to generally relay mail consistently through
# specific servers.
#
# This file is grepped for the string literals on the LHS. The second column
# contains a value indicating the "spammishness" to be assigned to messages
# associated with this host which fail the test. THE SIGN IS SIGNIFICANT -
# THE WAY THIS STRING IS ADDED TO THE SPAMMISHNESS STRING, IT WILL BE ASSUMED
# TO INCLUDE A MATH OPERATOR. An '=' in this column means to use the default
# as defined in the rcfile. The RHS should contain a REGEXP identifying
# expected mailservers, which should appear in the Received: lines as a
# "....by (somehost)REGEXP" type of string.
#
# The intended object of this file is to both allow you to identify services
# which might send messages through differently named mail servers, and to
# TUNE the spammishness associated with certain services.
#
# If it isn't obvious, the file supports comments, but only when STARTING
# IN THE LHS COLUMN. The egrep operation used to match the LHS token is
# anchored to the beginning of the line, and the tokens are always domain
# literals.
#
# end of file
Extend this list as you desire - I'd be interested in the known domains
associated with a few of the larger ISPs. Note that the above is not
well-tuned at this point, so use it at your own risk. I plan to run it
against volumes of offline mail in the next few days in an effort to better
identify hosts associated with some of the providers.
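As a rough illustration of how a line of the datafile is meant to be read,
here is a minimal Python sketch of the lookup. The rcfile does this with
egrep; the function name, the excerpt, and the return shape below are
illustrative assumptions, not part of the recipe.

```python
DEFAULT_LEVEL = "+100"  # mirrors SPAMLEVEL in the rcfile

# A short excerpt in the same three-column layout as spewhosts.list.
SAMPLE = r"""
# spewhosts.list (excerpt)
msn.com        =     hotmail\.(msn\.)?com
hotmail.com    +150  hotmail\.(msn\.)?com
earthlink.net  +50   (earthlink|mindspring)\.(com|net)
"""

def lookup(domain, text):
    """Return (score, relay_regex) for a sender domain, or None if unlisted."""
    for line in text.splitlines():
        fields = line.rstrip().split(None, 2)
        # Comments and blank lines fall through here: the first column must
        # equal the domain exactly (case-insensitively), from column one.
        if len(fields) != 3 or fields[0].lower() != domain.lower():
            continue
        # '=' in the second column means "use the default level".
        score = DEFAULT_LEVEL if fields[1] == "=" else fields[1]
        return score, fields[2]
    return None

print(lookup("hotmail.com", SAMPLE))
print(lookup("example.org", SAMPLE))
```

An unlisted domain returns None, which corresponds to WATCHEDMAIL coming
back empty in the rcfile, so the Received: check is skipped entirely.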
Note that similar (though not identical) logic can be applied to messageid
checks and even to your own domain mail. The use of a scoring column which
has a sign is very deliberate -- it is possible to put certain senders
(say, the envelope-sender, or From_ for a mailing list) in here to
auto-compensate for certain messages appearing to not come through their
respective mail host. I don't choose to run my config this way, but for
those that do, this could be useful.
Now, the rcfile:
#--------------------------------------------------------------------------
# File: spewhosts.rc
# Description: procmail script to flag mail claiming an ISP sender but
# not relayed through that ISP's own mailservers.
# Author: Sean B. Straw
# Source: <http://www.professional.org/procmail/spewhosts.html>
# Copyright: Portions copyright (c) 2000-2003, Sean B. Straw
# Disclaimer: <http://www.professional.org/procmail/disclaimer.html>
# Licensing: Free for use by the procmail community.
# Support: Visit the official procmail discussion list to ask
# procmail questions. If you need custom procmail
# work performed (including modifications to this
# rcfile), the author is available for paid consulting.
# Limitations: There is no included support for reducing a
# host.domain to the base domain.
# Requires: grep (external program)
#
#
# This is a procmail filter intended to identify (using an external datafile)
# messages where certain hosts used in the From: address should be paired
# with a specific sending host in the Received: headers.
#
# DO NOT AUTO-SUBMIT MESSAGES TO SPAM DATABASES BASED SOLELY UPON THE
# RESULTS OF THIS SCRIPT.
#
# This file is dependent upon some externally defined variables, and also
# utilizes a "SPAMMISHNESS" system described elsewhere in the author's
# writings.
# ==========================================================================
# First, skip this entire process if we're processing a recognized mailing
# list. Some strip the received lines, and that renders these checks
# invalid.
:0
* LISTNAME ?? ^^^^
{
# set a default spamlevel (overridden in the spewhosts.list file)
SPAMLEVEL=+100
# Check the envelope first (we need to extract the domain part).
:0
* ENVFROM ?? @\/.*
{
CL_MATCH=$\MATCH
:0h
WATCHEDMAIL=|egrep -i ^$CL_MATCH[[:space:]] $SPAMDIR/spewhosts.list
}
# Now, only if WATCHEDMAIL is non-empty
:0A
* WATCHEDMAIL ?? ^[-_\.a-z0-9]+[ ]+[-+=0-9]+[ ]+\/[^ ].*
* 1^0
* $ -1^1 ^Received:.*\>by\>+[-_\.a-z0-9]*${MATCH}
{
SPAMNOTES="${SPAMNOTES}SPAM: From service doesn't appear in Received lines${NL}"
# extract spammishness from the WATCHEDMAIL string
:0
* WATCHEDMAIL ?? ^[-_\.a-z0-9]+[ ]+\/[-+=0-9]+
* 9876543210^0 MATCH ?? ^\/[-+0-9]+
* 9876543210^0 MATCH ?? ^\/
{
SPAMMISHNESS="${SPAMMISHNESS}${MATCH:-$SPAMLEVEL}"
}
}
# then, From: - ONLY IF DIFFERENT from the envelope
# (no need to double-penalize a sender for ACTUALLY sending mail
# from the same address as their From:)
:0
* $ ! ENVFROM ?? ^^$\CLEANFROM^^
* ! FROM_DOMAIN ?? ^^^^
{
CL_MATCH=$\FROM_DOMAIN
:0h
WATCHEDMAIL=|egrep -i ^$CL_MATCH[[:space:]] $SPAMDIR/spewhosts.list
}
# Now, only if WATCHEDMAIL is non-empty
:0A
* WATCHEDMAIL ?? ^[-_\.a-z0-9]+[ ]+[-+=0-9]+[ ]+\/[^ ].*
* 1^0
* $ -1^1 ^Received:.*\>by\>+[-_\.a-z0-9]*${MATCH}
{
SPAMNOTES="${SPAMNOTES}SPAM: From service doesn't appear in Received lines${NL}"
# extract spammishness from the WATCHEDMAIL string
:0
* WATCHEDMAIL ?? ^[-_\.a-z0-9]+[ ]+\/[-+=0-9]+
* 9876543210^0 MATCH ?? ^\/[-+0-9]+
* 9876543210^0 MATCH ?? ^\/
{
SPAMMISHNESS="${SPAMMISHNESS}${MATCH:-$SPAMLEVEL}"
}
}
}
# ==========================================================================
# The module which includes this one should take action based on variables
# which are set in the recipes above.
# here ends the rcfile, but for the purpose of you dumping mail in a test
# scenario, this added recipe would file away the individual messages
# which are flagged (I include this when running the filter from my sandbox):
:0:
* ! SPAMNOTES ?? ^^^^
spewtum.mbx
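
For those who find procmail's weighted scoring conditions opaque, the
Received: test in the rcfile boils down to roughly the following, sketched
here in Python. The "* 1^0" condition starts the score at 1, and each
Received: line naming an expected mailserver after "by" subtracts 1, so the
recipe only fires when no such line exists. This sketch assumes the headers
have already been unfolded onto single lines, approximates procmail's \>
with \b and \s+, and uses invented host names and function names throughout.

```python
import re

def relay_score(headers, relay_regex):
    """Start at 1 and subtract 1 per Received: line naming an expected relay."""
    pattern = re.compile(
        r"^Received:.*\bby\s+[-_.a-z0-9]*" + relay_regex,
        re.IGNORECASE | re.MULTILINE,
    )
    matches = sum(1 for _ in pattern.finditer(headers))
    return 1 - matches

def is_flagged(headers, relay_regex):
    # As with a procmail scoring recipe, only a positive total counts.
    return relay_score(headers, relay_regex) > 0

HOTMAIL = r"hotmail\.(msn\.)?com"

# Claims hotmail in "from", but no hotmail server ever appears after "by".
forged = (
    "Received: from hotmail.com (unknown [192.0.2.7]) by mx.example.org with SMTP\n"
    "From: someone@hotmail.com\n"
)
# One hop was handled "by" a hotmail host, so the score drops to zero.
genuine = (
    "Received: from relay7.hotmail.com by mx.example.org with SMTP\n"
    "Received: from client by relay7.hotmail.com with HTTP\n"
    "From: someone@hotmail.com\n"
)

print(is_flagged(forged, HOTMAIL))
print(is_flagged(genuine, HOTMAIL))
```

Note that a host named only in the "from" part of a Received: line does not
count, which is the point: that part is trivially forged, whereas the "by"
part is written by the server that actually accepted the message.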
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.