Re: stochastic spam detection

Attached is a ~/.procmailrc fragment that evaluates the header and
body construction of e-mail messages to determine the probability
that a message is spam.

The numbers were derived by analyzing two large e-mail repositories,
one containing only spam, the other containing only non-spam. Let A be
the number of messages in the spam archive that have a certain header
characteristic divided by the total number of messages in the spam
archive, and B be the number of messages in the non-spam archive that
have the same characteristic divided by the total number of messages
in the non-spam archive. (A is the likelihood of a spam message
containing the characteristic, and B is the likelihood of a false
positive.)

Do this for many characteristics, divide A by B for each, and pick the
dozen or so largest values to determine which characteristics will be
used.
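
The selection step can be sketched as follows. This is only an
illustration in Python (procmail cannot do the arithmetic itself); the
characteristic names and hit counts are made up, and skipping
characteristics that never occur in the non-spam archive is my
assumption, not something the analysis above prescribes.

```python
def rank_characteristics(counts, n_spam, n_ham, keep=12):
    """Return the `keep` characteristic names with the largest A / B.

    counts maps a (hypothetical) characteristic name to a tuple
    (hits in the spam archive, hits in the non-spam archive).
    """
    ratios = []
    for name, (spam_hits, ham_hits) in counts.items():
        if ham_hits == 0:
            # Assumption: B = 0 makes A / B undefined, so skip it here.
            continue
        a = spam_hits / n_spam   # likelihood of the characteristic in spam
        b = ham_hits / n_ham     # likelihood of a false positive
        ratios.append((a / b, name))
    ratios.sort(reverse=True)    # largest A / B first
    return [name for _, name in ratios[:keep]]


# Made-up counts for a 1000 message spam archive and a 2000 message
# non-spam archive.
example = {
    "no_to_header": (600, 2),
    "subject_has_bang": (400, 76),
    "has_date_header": (990, 1990),
}
top = rank_characteristics(example, 1000, 2000, keep=2)
print(top)
```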

Multiply the negative of the natural logarithm of B by 10000, round to
the nearest integer, and use that value as a conditional test in the
attached recipe.

Example:

    For a spam archive of 1000 messages, where 600 have a common
    characteristic, A = 600 / 1000 = 0.6.

    For a non-spam archive of 2000 messages, where 2 have the same
    characteristic, B = 2 / 2000 = 0.001.

    A / B = 0.6 / 0.001 = 600. If 600 is larger than any of the A / B
    values for the dozen characteristics already in use, then one of
    those characteristics should be replaced with this one.

    ln (B) = ln (0.001) = -6.90775527898, which, when multiplied by
    -10000 and rounded, gives the value used in the recipe:

        * 69078^0 test_for_some_characteristic

    which should be included in the recipe below.
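
The arithmetic of the example can be checked with a few lines of
Python. This is a sketch only; the function name recipe_weight is
mine, and the condition text is the same placeholder used above.

```python
import math


def recipe_weight(ham_hits, n_ham):
    """Weight for a procmail scoring condition: -10000 * ln(B), rounded."""
    b = ham_hits / n_ham              # B, the likelihood of a false positive
    return round(-10000 * math.log(b))


weight = recipe_weight(2, 2000)       # the non-spam archive of the example
print(f"* {weight}^0 test_for_some_characteristic")
```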

The way it works is that the conditional values (69078 in the
example) for each characteristic that is true for a message are added
together. Since the values are logarithms, this sum corresponds to a
product of the probabilities. And, since the values represent
probabilities of false positives, the chance of a total false positive
can be compared against a standard deviation (sigma) certainty
requirement, and "binned" accordingly (assuming the characteristics
are statistically independent).
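
A rough Python sketch of the sum-as-product idea and the binning
follows. The weights and the two limits are taken from the attached
recipe; the function and bin names are mine, and the joint probability
is only meaningful under the independence assumption just mentioned.

```python
import math

TRASH_LIMIT = 150650   # 5 sigma limit used in the attached recipe
JUNK_LIMIT = 18410     # 1 sigma limit used in the attached recipe


def classify(weights):
    """Add the weights of the matching conditions and bin the message."""
    total = sum(weights)
    # Adding logarithms multiplies probabilities: this is the joint
    # chance of a false positive, if the conditions are independent.
    joint_fp = math.exp(-total / 10000)
    if total > TRASH_LIMIT:
        verdict = "trash"
    elif total > JUNK_LIMIT:
        verdict = "junk"
    else:
        verdict = "pass"
    return total, joint_fp, verdict


# Undisclosed recipients, "!" in the subject, and "unsubscribe" in the
# body, using the weights from the recipe.
total, joint_fp, verdict = classify([83733, 32669, 76834])
print(total, verdict)
```

With only the "!" in the subject (32669) the same function files the
message as junk, and with only a weak condition such as 12809 it
passes.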

Note that no single characteristic (ORBS, etc.) is capable of
generating a false positive, resulting in an inappropriately trashed
message.

It turns out to be about 90-95 percent effective (but one can tinker
with the sigma limits, and probably tweak a little more out of it).

The characteristics used in the attached recipe require:

    1) A sendmail.cf, (or equivalent,) configuration that puts
       "... (unknown ..." in a "Received: " record if RDNS fails
       for the HELO.

    2) Some method of verifying whether the IP address contained in a
       "Received: " record is a black listed host. (I used
       http://www.johncon.com/john/receivedIP/index.html, but
       rblcheck(1), etc., can be used just as well.)

    3) The name of your smtp host, somesmtpserverdomain.com.

    4) Your e-mail address, someone@somedomain.com.

but can be modified to suit.

        John
--

John Conover        Tel. 408.370.2688  conover(_at_)rahul(_dot_)net
631 Lamont Ct.      Fax. 408.379.9602  http://www.johncon.com/
Campbell, CA 95008

######################################################################
#
# General conditionals. (All mailing list traffic, corporate traffic,
# etc., should have been disposed of before the following
# fragment is executed.) There are two variables, HEADERSCORE, and
# BODYSCORE, that will be incremented by a specific value for each
# condition that is true.
#
# Save the trusted return address.
#
:0 whc
SENDER=| formail -rtzx To:
#
# Save the machine generated return address.
#
:0 whc
FROM=| formail -rzx To:
#
# Save the domain name.
#
:0 whc
DOMAIN=| formail -rzx To: | sed 's/^.*@//'
#
HEADERSCORE="0"
#
# Any "Received: " header IP address black listed?
#
# (Note: this uses the off line black list from:
# http://www.johncon.com/john/receivedIP/index.html. Use any convenient
# method, like rblcheck(1), etc.; if black listed, set the HEADERSCORE
# to 60713.)
#
:0
* ? test -f "${HOME}/.procmail.reject"
* 60713^0 ? /usr/local/bin/receivedIPdb "${HOME}/.procmail.reject"
{
    HEADERSCORE=$=
}
#
# Evaluate the headers in the message, set HEADERSCORE to the total,
# (210000 is beyond 6 sigma, i.e., always reject if condition is true).
# The conditions are, no "To: " header, a "To: " header with "<>",
# a "To: " header with undisclosed recipients, a "Cc: " header with
# the recipient list not shown, different trusted and untrusted return
# addresses, the "Message-ID: " header not containing the domain name
# of the trusted return address, no "Received: " record containing the
# domain name, a "Subject: " header with an exclamation mark, the
# existence of an "X-Advertisement: " record, and a "Subject: " record
# containing "adv:":
#
:0
* $$HEADERSCORE^0
* 63590^0 !^to:
* 210000^0 ^to:.*< *>
* 83733^0 ^to:.*undisclosed.*recipient
* 67357^0 ^cc:.*recipient.*list.*not.*shown
* 74576^0 ? test "${SENDER}" != "${FROM}"
* 13755^0 ?! formail -c -x "message-id:" | fgrep -i -s -e "${DOMAIN}"
* 12809^0 ?! formail -c -x "received:" | fgrep -i -s -e "${DOMAIN}"
* 32669^0 ^subject:.*!
* 210000^0 ^x-advertisement:
* 210000^0 ^subject:.*adv(ertise(ment)?.*)?(:|$)
{
    HEADERSCORE=$=
}
#
BODYSCORE="0"
#
# Evaluate the body of the message, set BODYSCORE to the total. The
# conditions are, the words "delete", "mailing", "mailto:", "remove",
# "unsolicited", "unsubscribe", (i.e., opt-out words,) or if the
# message is base64 encoded.
#
:0
* < 250000
{
    :0 B
    * 37714^0 base64
    * 32291^0 delete
    * 69903^0 mailing
    * 35245^0 "mailto:
    * 26728^0 remove
    * 51985^0 unsolicited
    * 76834^0 unsubscribe
    {
        BODYSCORE=$=
    }

}
#
######################################################################
#
# Context sensitive score, conditions are account and machine
# dependent. One required per account. Sendmail(1) specific: if the
# smtp server had to generate a "Message-ID: " record, (and the
# message was not from someone in the smtp server's domain,) or the
# smtp server's sendmail(1) RDNS lookup failed, increase the
# HEADERSCORE total, and evaluate the total:
#
:0
* $$HEADERSCORE^0
* !^(from|reply-to):.*somesmtpserverdomain\.com
* 38741^0 ^message-id:.*somesmtpserverdomain\.com
{
    HEADERSCORE=$=
}
#
:0
* $$HEADERSCORE^0
* $$BODYSCORE^0
* 23101^0 !^(to|cc):.*someone@somedomain\.com
* 50597^0 ^received:.*\(unknown +.*by.*somesmtpserverdomain\.com
{
    SPAMSCORE=$=
    #
    :0 wfh
    | formail -A "X-Audit-Log: $SPAMSCORE = $HEADERSCORE + $BODYSCORE"
    #
    # Sigma values:
    #
    # 10000 * ln (1 sigma False Positive = 1 in 6.30297437513) =
    #         10000 * -1.84102164502 = -18410
    # 10000 * ln (2 sigma False Positive = 1 in 43.9557890318) =
    #         10000 * -3.78318433404 = -37832
    # 10000 * ln (3 sigma False Positive = 1 in 740.796695584) =
    #         10000 * -6.60772622151 = -66077
    # 10000 * ln (4 sigma False Positive = 1 in 31,574.3873622) =
    #         10000 * -10.3601014878 = -103601
    # 10000 * ln (5 sigma False Positive = 1 in 3,488,555.79111) =
    #         10000 * -15.0649983951 = -150650
    # 10000 * ln (6 sigma False Positive = 1 in 1,013,594,748.34) =
    #         10000 * -20.7367690058 = -207368
    #
    # A score beyond the 5 sigma false positive limit can be safely
    # trashed. Raising the value from 5 sigma (-150650) to 4 sigma
    # (-103601) increases spam rejection, at the risk of false
    # positives.
    #
    :0
    * -150650^0
    * $$SPAMSCORE^0
    /dev/null
    #
    # A score beyond the 1 sigma false positive limit, but short of 5
    # sigma, is filed in the junk folder for evaluation; a score short
    # of 1 sigma has a significant probability of being non-spam.
    # Raising the value from 1 sigma (-18410) to 0 increases spam
    # rejection, at the risk of false positives, which will be filed
    # in the junk folder. Decreasing the value to 2 sigma (-37832)
    # increases the risk of false negatives, which will be passed.
    #
    :0:
    * -18410^0
    * $$SPAMSCORE^0
    junk
}
#
# Probably not spam, continue.
#
######################################################################
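
The sigma values tabulated in the recipe comments appear to match the
one-sided tail of a normal distribution. Assuming that is how they
were derived (an assumption on my part), they can be reproduced with
this Python sketch; the function name sigma_limit is mine.

```python
import math


def sigma_limit(k):
    """10000 * ln(P), where P is the one-sided normal tail beyond k sigma."""
    # Assumption: P(Z > k) for a standard normal Z, computed from the
    # complementary error function.
    tail = 0.5 * math.erfc(k / math.sqrt(2))
    return round(10000 * math.log(tail))


limits = {k: sigma_limit(k) for k in range(1, 7)}
for k, value in limits.items():
    print(f"{k} sigma: {value}")
```

Intermediate limits, such as 1.5 sigma, can be obtained the same way
when tuning the two thresholds.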

Empirically, the 69 lines of procmail code, (with the sigma limit
values shown,) filtered 97.442455243% of the spam, with 64.9616368%
being automatically trashed.

The cost is that 43.597263% of non-spam e-mail will be false
positives, (and end up in the junk folder, to be sorted by hand, from
the remaining spam.)

The sigma limit values, (150650, and 18410,) were chosen to maximize
the detection of spam, while at the same time minimizing the automatic
trashing of false positives, and setting the number of false positives
and false negatives filed in the junk folder about equal, at about
50%. It's just the way I set the numbers, (i.e., to protect my Internet
visible e-mail in-folder from spam, while not trashing any legitimate
e-mail, either.)

If the junk folder is changed to one's personal folder, (or the
complete conditional/action removed,) then 65% of the spam will be
trashed, i.e., it would filter about 2 out of 3 spam messages from the
folder, and have virtually no effect on legitimate e-mail.

The sigma limit values, (150650, and 18410,) can be varied by about a
factor of 2, either way, depending on what one wants to do.

BTW, one can use the binary search facility, look(1), to construct
a very fast lookup for e-mail addresses:

    :0:
    * ? test -f "${HOME}/.procmail.accept"
    * ? look -f "${SENDER}" "${HOME}/.procmail.accept"
    {
      # handle e-mail from a known, legitimate address
    }

where the .procmail.accept file is a sorted list (look(1) needs sorted
input) of legitimate return addresses of people one expects mail from
(i.e., anyone and everyone who has sent e-mail to you). The recipe
should go before the above ~/.procmailrc fragment, right after the
SENDER variable is set. The only e-mail that would go through the spam
detection fragment would then be new e-mail from folks one does not
know. The list of return addresses is constructed by running something
like:

    formail -rtzx To:

on your e-mail archive. (I use SmartList and rel(1); there are
procmail scripts in the rel source distribution at
http://www.johncon.com/nformatix/ to make relevance searchable e-mail
archives.) The above code fragment could use more aggressive sigma
limit values since it does not have to contend with trashing e-mail
from someone one knows.

(If you do not have look(1), there is a binary search program,
bsearchtext, in the receivedIP distribution at
http://www.johncon.com/john/receivedIP/index.html, that does the same
thing.)
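
For anyone curious about what the lookup does, here is a small Python
sketch of the same idea: a case-insensitive binary search for a prefix
in a sorted list. It is not the look(1) or bsearchtext source, only an
illustration; the function names and the addresses are made up.

```python
import bisect


def build_accept_list(addresses):
    """Lower-case, de-duplicate, and sort the return addresses."""
    return sorted({a.strip().lower() for a in addresses if a.strip()})


def is_known(sender, accept_list):
    """Binary search for an entry that starts with the sender address."""
    key = sender.strip().lower()
    if not key:
        return False
    # First entry that sorts at or after the key; if any entry starts
    # with the key, this is the first one.
    i = bisect.bisect_left(accept_list, key)
    return i < len(accept_list) and accept_list[i].startswith(key)


accept = build_accept_list(
    ["Bob@Example.org\n", "alice@example.com", "alice@example.com"]
)
print(is_known("ALICE@example.com", accept))
print(is_known("mallory@example.net", accept))
```

The search cost grows with the logarithm of the list size, which is
why the lookup stays fast even for a large list of addresses.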

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail