Freemail/Large ISP Received checking

Okay, lately the only stuff that's really been getting through my defences has been forged hotmail and lycos trash. Normally it's only a few messages in any given week, but this morning I had _three_ spams in my inbox. Fsck that!

Here, I endeavour to provide a reasonably tuneable checker for Received: lines. I had a few simple regexps before, but some large ISPs send multiple domains through one of their domains, which made it a bit too hit-n-miss for my taste (I had been extracting a MATCH from a collection of domains, and then checking the Received: lines for that specific domain - hotmail/msn often breaks that).

I'll readily admit that this recipe hasn't been subjected to extensive testing at this point (though my testing thus far has resulted in _no_ false positives, and a pretty consistent hit on actual spew, as well as spew which originally bypassed my filters), but I'm interested in feedback from anyone interested in checking it out, including additional host pairs.

Instead of cut-n-paste from your email, you can download the necessary files at:

<http://www.professional.org/procmail/spewhosts.html>

My .procmailrc file sets the basic framework for all of my mail filtering - a large part of that is extracting certain frequently used bits of information from the message headers. Like many of my recipes, this one depends upon some of those variables being defined. Rather than repeat them for the umpteenth time, just refer to my information on sandboxes (which is linked from the above document, as is my disclaimer).

ENVFROM, CLEANFROM, and FROM_DOMAIN are my common variables which this recipe uses.

SPAMDIR is defined in my main spam.rc file, and is where spam-specific files reside (lists of domains, etc), in an effort to keep the procmail dir a bit less cluttered. This rcfile expects its datafile to be in that directory.

The script works within my "SPAMMISHNESS" model, wherein all the spam rules are invoked, even if some would have already indicated a message was positively spam. Some factors are merely contributory, and several of them taken together can indicate that a message is spam. I process them in an additive fashion (as detailed in a past procmail list post). For the purposes of this post, detailing all of that is unimportant - suffice it to say, you can handle your flagged mail any way you want.
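
For anyone who hasn't seen the additive model, here is a rough sketch of the idea in Python (not procmail, purely for illustration). It assumes SPAMMISHNESS is built up as a string of signed terms, which is what the datafile comments further down imply; how my actual setup evaluates that string isn't shown in this post, so the `total_spammishness` helper is a hypothetical stand-in.

```python
import re


def total_spammishness(terms: str) -> int:
    """Sum a string of signed integer terms such as '+100+150-50'.

    Hypothetical helper: the real evaluation step is not part of this post.
    """
    return sum(int(term) for term in re.findall(r"[-+]\d+", terms))


# Each rule appends its own signed contribution, whether or not an
# earlier rule already flagged the message.
spammishness = ""
for contribution in ("+100", "+150", "-50"):
    spammishness += contribution

print(spammishness, "=", total_spammishness(spammishness))
```

The sign on every term is what lets one rule compensate for another; what total counts as "spam" is entirely up to you.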

LISTNAME is defined by a script I have which attempts to determine if the message is from a recognizable list source, using X-headers, listname-owner and owner-listname constructs, etc. The specifics of that recipe aren't important here; what matters is that some lists _strip_ Received: headers off of submitted messages (even the procmail list used to do that, which was frightfully annoying), which makes this recipe a bit moot. Several of the technical lists I am subbed to do this, so I elected to have a condition which disables this recipe if it is a recognized list message.

There are two files - the hosts file and the recipe itself. The datafile is first:


# spewhosts.list
#
# see bottom for comments.
#
altavista.com           =       altavista\.com
google.com              =       google\.com
msn.com                 =       hotmail\.(msn\.)?com
hotmail.com             +150    hotmail\.(msn\.)?com
yahoo.com               =       yahoo(mail)?\.com
yahoomail.com           =       yahoo(mail)?\.com
juno.com                +150    juno\.com
rocketmail.com          +150    rocketmail\.com
mailexcite.com          +150    (mail)?excite\.com
lycos.com               +150    lycos(email)?\.com
lycosemail.com          +150    lycos(email)?\.com
mailcity.com            =       mailcity\.com
yahoo.co.jp             =       yahoo\.co\.jp
aol.com                 =       aol\.com
delphi.com              =       delphi\.com
prodigy.com             =       prodigy\.(com|net)
prodigy.net             =       prodigy\.(com|net)
usa.com                 =       usa\.com
sprintmail.com          =       sprintmail\.com
sprynet.com             =       sprynet\.com
gte.net                 =       gte\.net
mci.net                 =       mci\.net
concentric.net          =       concentric\.net
mindspring.net          +50     (earthlink|mindspring)\.(com|net)
earthlink.net           +50     (earthlink|mindspring)\.(com|net)
bellsouth.net           =       bellsouth\.net
worldnet.att.net        =       worldnet\.att\.net
compuserve.com          =       (cs|compuserve|aol)\.com
cs.com                  =       (cs|compuserve|aol)\.com

# (comments at bottom for efficiency of matching)
#
# Note: the listing of any given host/isp in this file is *NOT* an indicator
# of spam.  Quite the opposite actually - these domains are frequently FORGED,
# or can at least be expected to generally relay mail consistently through
# specific servers.
#
# This file is grepped for the string literals on the LHS.  The second column
# contains a value indicating the "spammishness" to be assigned to messages
# associated with this host which fail the test.  THE SIGN IS SIGNIFICANT -
# THE WAY THIS STRING IS ADDED TO THE SPAMMISHNESS STRING, IT WILL BE ASSUMED
# TO INCLUDE A MATH OPERATOR.  An '=' in this column means to use the default
# as defined in the rcfile.  The RHS should contain a REGEXP identifying
# expected mailservers, which should appear in the Received: lines as a
# "....by (somehost)REGEXP" type of string.
#
# The intended object of this file is to both allow you to identify services
# which might send messages through differently named mail servers, and to
# TUNE the spammishness associated with certain services.
#
# If it isn't obvious, the file supports comments, but only when STARTING
# IN THE LHS COLUMN.  The egrep operation used to match the LHS token is
# anchored to the beginning of the line, and the tokens are always domain
# literals.
#
# end of file

Extend this list as you desire - I'd be interested in the known domains associated with a few of the larger ISPs. Note that the above is not well-tuned at this point, so use it at your own risk. I plan to run it against volumes of offline mail in the next few days in an effort to better identify hosts associated with some of the providers.
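
If you want to sanity-check your own additions to the list before letting procmail loose on them, something like the following Python sketch reads the three columns the way the comments in the datafile describe them. It is only an approximation of what the egrep lookup in the rcfile does; the `Entry` and `lookup` names are made up for the example, and `DEFAULT_LEVEL` simply mirrors the SPAMLEVEL default set in the rcfile.

```python
from typing import Iterable, NamedTuple, Optional

DEFAULT_LEVEL = "+100"  # mirrors SPAMLEVEL in the rcfile


class Entry(NamedTuple):
    domain: str        # LHS domain literal
    score: str         # signed spammishness, '=' already replaced by default
    relay_regexp: str  # RHS regexp for the expected mailservers


def lookup(domain: str, lines: Iterable[str]) -> Optional[Entry]:
    """Return the first entry whose LHS equals `domain` (case-insensitive)."""
    for line in lines:
        # Entries must start in the first column; comments start with '#'.
        if not line or line[0] in "# \t":
            continue
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        lhs, score, regexp = fields
        if lhs.lower() == domain.lower():
            if score == "=":
                score = DEFAULT_LEVEL
            return Entry(lhs, score, regexp.strip())
    return None


SAMPLE = r"""# spewhosts.list
msn.com                 =       hotmail\.(msn\.)?com
hotmail.com             +150    hotmail\.(msn\.)?com
"""

print(lookup("hotmail.com", SAMPLE.splitlines()))
print(lookup("example.org", SAMPLE.splitlines()))
```

A domain that isn't listed comes back as None, which corresponds to WATCHEDMAIL being empty in the recipe, so the message is not checked at all.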

Note that similar (though not identical) logic can be applied to message-id checks and even to your own domain mail. The use of a scoring column which has a sign is very deliberate -- it is possible to put certain senders (say, the envelope-sender, or From_ for a mailing list) in here to auto-compensate for certain messages appearing to not come through their respective mail host. I don't choose to run my config this way, but for those that do, this could be useful.



Now, the rcfile:

#--------------------------------------------------------------------------

#       File:           spewhosts.rc
#       Description:    procmail script to flag mail from ISPs which was not
#                       relayed through their own mailservers.
#       Author:         Sean B. Straw
#       Source:         <http://www.professional.org/procmail/spewhosts.html>
#       Copyright:      Portions copyright (c) 2000-2003, Sean B. Straw
#       Disclaimer:     <http://www.professional.org/procmail/disclaimer.html>
#       Licensing:      Free for use by the procmail community.
#       Support:        Visit the official procmail discussion list to ask
#                       procmail questions.  If you need custom procmail
#                       work performed (including modifications to this
#                       rcfile), the author is available for paid consulting.
#       Limitations:    There is no included support for reducing a
#                       host.domain to the base domain.
#       Requires:       grep (external program)
#
#
# This is a procmail filter intended to identify (using an external datafile)
# messages where certain hosts used in the From: address should be paired
# with a specific sending host in the Received: headers.
#
# DO NOT AUTO-SUBMIT MESSAGES TO SPAM DATABASES BASED SOLELY UPON THE
# RESULTS OF THIS SCRIPT.
#
# This file is dependent upon some externally defined variables, and also
# utilizes a "SPAMMISHNESS" system described elsewhere in the author's
# writings.

# ==========================================================================

# First, skip this entire process if we're processing a recognized mailing
# list.  Some strip the received lines, and that renders these checks
# invalid.
:0
* LISTNAME ?? ^^^^
{
        # set a default spamlevel (overridden in the spewhosts.list file)
        SPAMLEVEL=+100

        # Check the envelope first (we need to extract the domain part).
        :0
        * ENVFROM ?? @\/.*
        {
                CL_MATCH=$\MATCH
                :0h
                WATCHEDMAIL=|egrep -i ^$CL_MATCH[[:space:]] $SPAMDIR/spewhosts.list
        }

        # Now, only if WATCHEDMAIL is non-empty
        :0A
        * WATCHEDMAIL ?? ^[-_\.a-z0-9]+[        ]+[-+=0-9]+[    ]+\/[^  ].*
        * 1^0
        * $ -1^1 ^Received:.*\>by\>+[-_\.a-z0-9]*${MATCH}
        {

                SPAMNOTES="${SPAMNOTES}SPAM: From service doesn't appear in Received lines${NL}"


                # extract spammishness from the WATCHEDMAIL string
                :0
                * WATCHEDMAIL ?? ^[-_\.a-z0-9]+[        ]+\/[-+=0-9]+
                * 9876543210^0 MATCH ?? ^\/[-+0-9]+
                * 9876543210^0 MATCH ?? ^\/
                {
                        SPAMMISHNESS="${SPAMMISHNESS}${MATCH:-$SPAMLEVEL}"
                }
        }

        # then, From: - ONLY IF DIFFERENT from the envelope
        # (no need to double-penalize a sender for ACTUALLY sending mail
        # from the same address as their From:)
        :0
        * $ ! ENVFROM ?? ^^$\CLEANFROM^^
        * ! FROM_DOMAIN ?? ^^^^
        {
                CL_MATCH=$\FROM_DOMAIN
                :0h
                WATCHEDMAIL=|egrep -i ^$CL_MATCH[[:space:]] $SPAMDIR/spewhosts.list
        }

        # Now, only if WATCHEDMAIL is non-empty
        :0A
        * WATCHEDMAIL ?? ^[-_\.a-z0-9]+[        ]+[-+=0-9]+[    ]+\/[^  ].*
        * 1^0
        * $ -1^1 ^Received:.*\>by\>+[-_\.a-z0-9]*${MATCH}
        {

                SPAMNOTES="${SPAMNOTES}SPAM: From service doesn't appear in Received lines${NL}"


                # extract spammishness from the WATCHEDMAIL string
                :0
                * WATCHEDMAIL ?? ^[-_\.a-z0-9]+[        ]+\/[-+=0-9]+
                * 9876543210^0 MATCH ?? ^\/[-+0-9]+
                * 9876543210^0 MATCH ?? ^\/
                {
                        SPAMMISHNESS="${SPAMMISHNESS}${MATCH:-$SPAMLEVEL}"
                }
        }
}

# ==========================================================================

# The module which includes this one should take action based on variables
# which are set in the recipes above.


# here ends the rcfile, but for the purpose of you dumping mail in a test
# scenario, this added recipe would file away the individual messages
# which are flagged (I include this when running the filter from my sandbox):

:0:
* ! SPAMNOTES ?? ^^^^
spewtum.mbx
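
If you'd rather poke at the idea outside of procmail first, the core test (does any Received: line contain a "by (somehost)REGEXP" for the sender's service?) can be roughed out in Python as below. This is a sketch, not a translation: the Python pattern only approximates the procmail regexp used above, and the sample headers and hostnames are invented for the example.

```python
import re


def relayed_via(headers: str, relay_regexp: str) -> bool:
    """True if some Received: line has a 'by <host>' matching relay_regexp."""
    # Unfold continuation lines so each header sits on one logical line.
    unfolded = re.sub(r"\r?\n[ \t]+", " ", headers)
    pattern = re.compile(
        r"^Received:.*\bby\s+[-_.a-z0-9]*" + relay_regexp,
        re.IGNORECASE | re.MULTILINE,
    )
    return pattern.search(unfolded) is not None


HOTMAIL = r"hotmail\.(msn\.)?com"  # RHS of the hotmail.com entry

# Invented headers: one message handled by the service, one which was not.
GENUINE = (
    "Received: from f12.hotmail.com ([192.0.2.1])\n"
    "\tby mx.example.org with SMTP\n"
    "Received: from 192.0.2.50 by f12.hotmail.com with HTTP\n"
    "From: someone@hotmail.com\n"
)
FORGED = (
    "Received: from dialup-1.example.net (unknown [192.0.2.7])\n"
    "\tby mx.example.org with SMTP\n"
    "From: someone@hotmail.com\n"
)

print(relayed_via(GENUINE, HOTMAIL))
print(relayed_via(FORGED, HOTMAIL))
```

A message for which this comes back false is the one that would pick up the spammishness from the datafile; a message whose sender domain isn't in the datafile is never tested.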
---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

