
Freemail/Large ISP Received checking

2003-04-24 23:26:30

Okay, lately, the only stuff that's really been getting through my defences has been forged hotmail and lycos trash. Normally, only a few messages in any given week, but this morning, I had _three_ spams in my inbox. Fsck that!

Here, I endeavour to provide a reasonably tunable checker for Received: lines. I had a few simple regexps before, but some large ISPs relay mail for several of their domains through the servers of just one of them, which made that approach a bit too hit-n-miss for my taste (I had been extracting a MATCH from a collection of domains, and then checking the Received: lines for that specific domain - hotmail/msn often breaks that).
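
To make the hotmail/msn problem concrete, here is a rough Python sketch (not part of the recipe). The Received: line and both function names are made up for illustration, and the regexp only approximates procmail's `\>by\>+` construct. It compares the old literal-domain check with a check against a per-domain host regexp:

```python
import re

# Hypothetical, already-unfolded Received: header for a message whose
# sender address is at msn.com but which was handled by a hotmail host.
RECEIVED = "Received: from client.example by mail7.hotmail.com with SMTP"


def literal_check(sender_domain, received):
    """Old approach: expect the sender's own domain in the 'by' host."""
    pattern = r"\bby\W+[-_.a-z0-9]*" + re.escape(sender_domain)
    return re.search(pattern, received, re.IGNORECASE) is not None


def paired_check(host_regexp, received):
    """New approach: expect a host matching the regexp paired with the domain."""
    pattern = r"\bby\W+[-_.a-z0-9]*" + host_regexp
    return re.search(pattern, received, re.IGNORECASE) is not None


print(literal_check("msn.com", RECEIVED))
print(paired_check(r"hotmail\.(msn\.)?com", RECEIVED))
```

The literal check rejects this legitimate msn.com message, while the paired regexp accepts it.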

I'll readily admit that this recipe hasn't been subjected to extensive testing at this point (though my testing thus far has resulted in _no_ false positives, and a pretty consistent hit on actual spew, including spew which originally bypassed my filters), but I'm interested in feedback from anyone interested in checking it out, including additional host pairs.

Instead of cut-n-paste from your email, you can download the necessary files at:
<http://www.professional.org/procmail/spewhosts.html>


My .procmailrc file sets the basic framework for all of my mail filtering - a large part of that is extracting certain frequently used bits of information from the message headers. Like many of my recipes, this one depends upon some of those variables being defined. Rather than repeat them for the umpteenth time, just refer to my information on sandboxes (which is linked from the above document, as well as my disclaimer).

ENVFROM, CLEANFROM, and FROM_DOMAIN are my common variables which this recipe uses.

SPAMDIR is defined in my main spam.rc file, and is where spam-specific files reside (lists of domains, etc), in an effort to keep the procmail dir a bit less cluttered. This rcfile expects its datafile to be in that directory.

The script works within my "SPAMMISHNESS" model, wherein all the spam rules are invoked, even if some would have already indicated a message was positively spam. Some factors are merely contributory, and several of them taken together can indicate that a message is spam. I process them in an additive fashion (as detailed in a past procmail list post). For the purposes of this post, detailing all of that is unimportant - suffice it to say, you can handle your flagged mail any way you want.

LISTNAME is defined by a script I have which attempts to determine if the message is from a recognizable list source, using X-headers, listname-owner and owner-listname, etc constructs. The specifics of that recipe aren't important here. What matters is that some lists _strip_ Received: headers off of submitted messages (even the procmail list used to do that, which was frightfully annoying), which makes this recipe a bit moot for such messages. Several of the technical lists I am subbed to do this, so I elected to have a condition which disables this recipe if it is a recognized list message.


There are two files - the hosts file and the recipe itself. The datafile is first:

# spewhosts.list
#
# see bottom for comments.
#
altavista.com           =       altavista\.com
google.com              =       google\.com
msn.com                 =       hotmail\.(msn\.)?com
hotmail.com             +150    hotmail\.(msn\.)?com
yahoo.com               =       yahoo(mail)?\.com
yahoomail.com           =       yahoo(mail)?\.com
juno.com                +150    juno\.com
rocketmail.com          +150    rocketmail\.com
mailexcite.com          +150    (mail)?excite\.com
lycos.com               +150    lycos(email)?\.com
lycosemail.com          +150    lycos(email)?\.com
mailcity.com            =       mailcity\.com
yahoo.co.jp             =       yahoo\.co\.jp
aol.com                 =       aol\.com
delphi.com              =       delphi\.com
prodigy.com             =       prodigy\.(com|net)
prodigy.net             =       prodigy\.(com|net)
usa.com                 =       usa\.com
sprintmail.com          =       sprintmail\.com
sprynet.com             =       sprynet\.com
gte.net                 =       gte\.net
mci.net                 =       mci\.net
concentric.net          =       concentric\.net
mindspring.net          +50     (earthlink|mindspring)\.(com|net)
earthlink.net           +50     (earthlink|mindspring)\.(com|net)
bellsouth.net           =       bellsouth\.net
worldnet.att.net        =       worldnet\.att\.net
compuserve.com          =       (cs|compuserve|aol)\.com
cs.com                  =       (cs|compuserve|aol)\.com

# (comments at bottom for efficiency of matching)
#
# Note: the listing of any given host/isp in this file is *NOT* an indicator
# of spam.  Quite the opposite actually - these domains are frequently FORGED,
# or can at least be expected to generally relay mail consistently through
# specific servers.
#
# This file is grepped for the string literals on the LHS.  The second column
# contains a value indicating the "spammishness" to be assigned to messages
# associated with this host which fail the test.  THE SIGN IS SIGNIFICANT -
# THE WAY THIS STRING IS ADDED TO THE SPAMMISHNESS STRING, IT WILL BE ASSUMED
# TO INCLUDE A MATH OPERATOR.  An '=' in this column means to use the default
# as defined in the rcfile.  The RHS should contain a REGEXP identifying
# expected mailservers, which should appear in the Received: lines as a
# "....by (somehost)REGEXP" type of string.
#
# The intended object of this file is to both allow you to identify services
# which might send messages through differently named mail servers, and to
# TUNE the spammishness associated with certain services.
#
# If it isn't obvious, the file supports comments, but only when STARTING
# IN THE LHS COLUMN.  The egrep operation used to match the LHS token is
# anchored to the beginning of the line, and the tokens are always domain
# literals.
#
# end of file

Extend this list as you desire - I'd be interested in the known domains associated with a few of the larger ISPs. Note that the above is not well-tuned at this point, so use it at your own risk. I plan to run it against volumes of offline mail in the next few days in an effort to better identify hosts associated with some of the providers.
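
If you want to sanity-check additions to the datafile outside of procmail, a small parser is enough. The following Python sketch is only an illustration of the three-column format described in the file's comments. The function name `parse_spewhosts` and the sample excerpt are assumptions made for the example, and the `+100` default simply mirrors the SPAMLEVEL default in the rcfile further down:

```python
DEFAULT_LEVEL = "+100"  # mirrors SPAMLEVEL=+100 in the rcfile

# Excerpt in the same format as spewhosts.list (raw string keeps the backslashes).
SAMPLE = r"""# excerpt from spewhosts.list
msn.com                 =       hotmail\.(msn\.)?com
hotmail.com             +150    hotmail\.(msn\.)?com
earthlink.net           +50     (earthlink|mindspring)\.(com|net)
"""


def parse_spewhosts(text, default=DEFAULT_LEVEL):
    """Return {domain: (signed_score, host_regexp)} from datafile text."""
    table = {}
    for line in text.splitlines():
        # Comments are only recognized when '#' starts in the first column.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip malformed rows rather than guess
        domain, score, host_regexp = fields
        # '=' in the second column means "use the default level".
        if score == "=":
            score = default
        table[domain.lower()] = (score, host_regexp.strip())
    return table


for domain, (score, host_regexp) in parse_spewhosts(SAMPLE).items():
    print(domain, score, host_regexp)
```

Any run of whitespace separates the columns here, which matches how the recipe's own regexps treat the file.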

Note that similar (though not identical) logic can be applied to messageid checks and even to your own domain mail. The use of a scoring column which has a sign is very deliberate -- it is possible to put certain senders (say, the envelope-sender, or From_ for a mailing list) in here to auto-compensate for certain messages appearing to not come through their respective mail host. I don't choose to run my config this way, but for those that do, this could be useful.
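
As a rough illustration of why the sign matters: each flagged check appends a signed term to one string, so a negative entry can offset a positive one. How the string is finally evaluated is covered in the earlier list post and isn't shown here, so the Python sketch below only assumes a plain sum of signed integers. The function name and the `-200` compensating entry are hypothetical:

```python
import re


def total_spammishness(expr):
    """Sum a string of signed integer terms such as '+100+150-200'."""
    terms = re.findall(r"[-+]\d+", expr)
    # Refuse anything that is not purely a run of signed integers.
    if "".join(terms) != expr:
        raise ValueError("unexpected characters in %r" % expr)
    return sum(int(term) for term in terms)


spammishness = ""
spammishness += "+100"  # a sender listed with '=' (default level)
spammishness += "+150"  # a sender listed with +150
spammishness += "-200"  # hypothetical compensating entry for a known sender
print(total_spammishness(spammishness))
```

An unsigned value in the second column would run into the preceding term when concatenated, which is why the datafile comments insist on the operator.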


Now, the rcfile:

#--------------------------------------------------------------------------

#       File:           spewhosts.rc
#       Description:    procmail script to flag mail from ISP addresses
#                       which was not relayed through the ISP's own
#                       mailservers.
#       Author:         Sean B. Straw
#       Source:         <http://www.professional.org/procmail/spewhosts.html>
#       Copyright:      Portions copyright (c) 2000-2003, Sean B. Straw
#       Disclaimer:     <http://www.professional.org/procmail/disclaimer.html>
#       Licensing:      Free for use by the procmail community.
#       Support:        Visit the official procmail discussion list to ask
#                       procmail questions.  If you need custom procmail
#                       work performed (including modifications to this
#                       rcfile), the author is available for paid consulting.
#       Limitations:    There is no included support for reducing a
#                       host.domain to the base domain.
#       Requires:       grep (external program)
#
#
# This is a procmail filter intended to identify (using an external datafile)
# messages where certain hosts used in the From: address should be paired
# with a specific sending host in the Received: headers.
#
# DO NOT AUTO-SUBMIT MESSAGES TO SPAM DATABASES BASED SOLELY UPON THE
# RESULTS OF THIS SCRIPT.
#
# This file is dependent upon some externally defined variables, and also
# utilizes a "SPAMMISHNESS" system described elsewhere in the author's
# writings.

# ==========================================================================

# First, skip this entire process if we're processing a recognized mailing
# list.  Some strip the received lines, and that renders these checks
# invalid.
:0
* LISTNAME ?? ^^^^
{
        # set a default spamlevel (overridden in the spewhosts.list file)
        SPAMLEVEL=+100

        # Check the envelope first (we need to extract the domain part).
        :0
        * ENVFROM ?? @\/.*
        {
                CL_MATCH=$\MATCH
                :0h
                WATCHEDMAIL=|egrep -i ^$CL_MATCH[[:space:]] $SPAMDIR/spewhosts.list
        }

        # Now, only if WATCHEDMAIL is non-empty
        :0A
        * WATCHEDMAIL ?? ^[-_\.a-z0-9]+[        ]+[-+=0-9]+[    ]+\/[^  ].*
        * 1^0
        * $ -1^1 ^Received:.*\>by\>+[-_\.a-z0-9]*${MATCH}
        {
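                SPAMNOTES="${SPAMNOTES}SPAM: From service doesn't appear in Received lines${NL}"

                # Extract spammishness from the WATCHEDMAIL string.  The two
                # weighted conditions act as an either/or: a signed number is
                # kept in MATCH, while a bare '=' leaves MATCH empty so that
                # the SPAMLEVEL default is used.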
                :0
                * WATCHEDMAIL ?? ^[-_\.a-z0-9]+[        ]+\/[-+=0-9]+
                * 9876543210^0 MATCH ?? ^\/[-+0-9]+
                * 9876543210^0 MATCH ?? ^\/
                {
                        SPAMMISHNESS="${SPAMMISHNESS}${MATCH:-$SPAMLEVEL}"
                }
        }

        # then, From: - ONLY IF DIFFERENT from the envelope
        # (no need to double-penalize a sender for ACTUALLY sending mail
        # from the same address as their From:)
        :0
        * $ ! ENVFROM ?? ^^$\CLEANFROM^^
        * ! FROM_DOMAIN ?? ^^^^
        {
                CL_MATCH=$\FROM_DOMAIN
                :0h
                WATCHEDMAIL=|egrep -i ^$CL_MATCH[[:space:]] $SPAMDIR/spewhosts.list
        }

        # Now, only if WATCHEDMAIL is non-empty
        :0A
        * WATCHEDMAIL ?? ^[-_\.a-z0-9]+[        ]+[-+=0-9]+[    ]+\/[^  ].*
        * 1^0
        * $ -1^1 ^Received:.*\>by\>+[-_\.a-z0-9]*${MATCH}
        {
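                SPAMNOTES="${SPAMNOTES}SPAM: From service doesn't appear in Received lines${NL}"

                # Extract spammishness from the WATCHEDMAIL string (same
                # either/or trick as above: signed number, or empty MATCH
                # for '=' so that SPAMLEVEL applies).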
                :0
                * WATCHEDMAIL ?? ^[-_\.a-z0-9]+[        ]+\/[-+=0-9]+
                * 9876543210^0 MATCH ?? ^\/[-+0-9]+
                * 9876543210^0 MATCH ?? ^\/
                {
                        SPAMMISHNESS="${SPAMMISHNESS}${MATCH:-$SPAMLEVEL}"
                }
        }
}

# ==========================================================================

# The module which includes this one should take action based on variables
# which are set in the recipes above.


# here ends the rcfile, but for the purpose of you dumping mail in a test
# scenario, this added recipe would file away the individual messages
# which are flagged (I include this when running the filter from my sandbox):

:0:
* ! SPAMNOTES ?? ^^^^
spewtum.mbx
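
Read as plain logic, the rcfile does roughly the following. This Python sketch is for reasoning about the flow only and is no substitute for the recipe. The function names, the two-row `HOSTS` table, and the sample headers in the usage lines are assumptions made for illustration, and the regexp only approximates procmail's matching:

```python
import re

DEFAULT_LEVEL = "+100"  # same default as SPAMLEVEL in the rcfile

# Two rows in the spirit of spewhosts.list: domain -> (score, host regexp).
HOSTS = {
    "msn.com": ("=", r"hotmail\.(msn\.)?com"),
    "hotmail.com": ("+150", r"hotmail\.(msn\.)?com"),
}


def relayed_by(host_regexp, received_headers):
    """True if any Received: header has a 'by <host>' matching host_regexp."""
    pattern = re.compile(r"\bby\W+[-_.a-z0-9]*" + host_regexp, re.IGNORECASE)
    return any(pattern.search(header) for header in received_headers)


def spewhost_score(envelope_from, header_from, received_headers, listname=""):
    """Return the signed terms this check would append to SPAMMISHNESS."""
    if listname:
        return ""  # recognized list: Received: lines may have been stripped
    # Check the envelope sender first, then From: only if it differs.
    addresses = [envelope_from]
    if header_from.lower() != envelope_from.lower():
        addresses.append(header_from)
    score = ""
    for address in addresses:
        domain = address.rpartition("@")[2].lower()
        if domain not in HOSTS:
            continue  # not a watched domain
        level, host_regexp = HOSTS[domain]
        if not relayed_by(host_regexp, received_headers):
            score += DEFAULT_LEVEL if level == "=" else level
    return score


forged = ["Received: from x.example by mx.example.net with SMTP"]
genuine = ["Received: from a.example by bay1.hotmail.com with SMTP"]
print(repr(spewhost_score("joe@hotmail.com", "joe@hotmail.com", forged)))
print(repr(spewhost_score("joe@msn.com", "joe@msn.com", genuine)))
```

As in the rcfile, a message whose envelope sender and From: address are identical is only penalized once.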
---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

