The spamometer...


Here's something that people might want to use from their
~/.procmailrc

You can pick it up at

           ftp://ftp.teclata.es/pub/terry/spamometer.tar.gz


This is really just a first hack. I am sure there are typos. There may
well be other errors. I know there are many things that could be added
(in particular, what I consider the most important piece). This is
basically a framework I wrote on which to hang the kind of suggestion
I made yesterday.

If anyone wants to re-write my perl into something decent, I'd be very
happy to receive a copy.

If you write your own spamometer functions, feel free to send them to
me and I'll either incorporate them or add your functions to the
distribution as further examples. They're easy to do, even if you only
know minimal perl.


I've included parts of the README file.


Terry Jones (terry(_at_)teclata(_dot_)es).



--------------------------------- DESCRIPTION --------------------------------

The spamometer is a perl script (you'll need perl 5) that attempts to
guess whether a mail message (read from STDIN) is spam. Most people
will want to call it from something like procmail.


It allows you to do very sophisticated mail processing. In fact, you
should be able to do anything you want. The .spamometerrc file
included with this distribution shows just a few examples (I'll write
more soon; this was a one-night hack). The examples include functions
that look for mail messages that

   - have too many initial lines that end in ! or !! or !!!
   - are addressed to too many people
   - have a ratio of upper to lower case that exceeds a threshold

This is not meant to be a replacement for procmail in any way. You can
do some things easily that are difficult or impossible in procmail
(without an external helper program, such as this one). In particular,
the spamometer knows nothing about delivering mail!



The spamometer knows nothing at all about what constitutes spam. All
of this is decided by user-supplied spamometer functions which the
spamometer calls according to user-defined priorities. These functions
are of 3 types:

  1) Functions that get called when only the header of the mail
     message has been read.

  2) Functions that get called after each line of the body is
     read. These functions also have the full header at their
     disposal.

  3) Functions that get called only when the entire mail body is
     read. These functions are passed the name of a file whose
     contents are the body of the message. These functions also have
     the full header at their disposal.


Users supply files of these functions on the command line. This allows
people to run the spamometer with large collections of functions for
spam detection. Typically, once this thing gets further along, users
will not need to be writing their own spamometer functions.


--------------------------------- INVOCATION ---------------------------------

Invocation is 

  spamometer [-v] [-i] [function-files]  < file
     
     (or through procmail, see the file INSTALL).

     -v means be verbose. You'll see why the spamometer considers the
        mail to be spam, plus you'll get a header message that includes
        the Message-ID: from the mail.

     -i means do not read the $HOME/.spamometerrc file


     Additional arguments will be taken as spamometer function
     files. These will be read with perl's 'do' statement. For what
     these files may contain, see SPAMOMETER FUNCTION FILES below.


--------------------------------- DIAGNOSTICS --------------------------------

Normal output (produced from -v) appears on STDOUT. Error output
appears on STDERR.

By default, the spamometer produces no output. The exit status (see
below) of the process indicates spam or non-spam.

--------------------------------- EXIT STATUS --------------------------------

The spamometer exits with status

  0 if it believes stdin contains spam
  1 if it believes stdin does not contain spam
  2 if you get the usage wrong
  3 if there is a problem with a function file

This usage is convenient for procmail. If the spamometer cannot
determine things one way or another, it exits with the non-spam value.
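
If you call the spamometer from a perl wrapper rather than from
procmail, the exit codes above can be decoded roughly as follows. This
is only a sketch: classify_status is a made-up helper name, and the
commented-out call assumes spamometer is on your PATH and that the
message sits in a hypothetical file message.txt.

```perl
use strict;
use warnings;

# Exit codes as listed in the EXIT STATUS section.
my %MEANING = (
    0 => 'spam',
    1 => 'not spam',
    2 => 'usage error',
    3 => 'function file problem',
);

# Turn a wait status (the value of $? after system()) into a label.
# classify_status is a hypothetical helper, not part of the spamometer.
sub classify_status {
    my ($wait_status) = @_;
    return 'could not run'    if $wait_status == -1;   # system() itself failed
    return 'killed by signal' if $wait_status & 127;   # abnormal termination
    my $code = $wait_status >> 8;                      # the real exit code
    return exists $MEANING{$code} ? $MEANING{$code} : "unknown ($code)";
}

# Typical use (assumptions as described above):
#   system('spamometer < message.txt');
#   print classify_status($?), "\n";
```

Note that 0 means spam, which is the reverse of what you might expect
from a "success" code, so test for it explicitly.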


-------------------------- SPAMOMETER FUNCTION FILES -------------------------
     
The spamometer function files (of which $HOME/.spamometerrc is the
default) contain perl functions that look at mail headers and bodies
and try to figure out if the mail is spam. These functions can
interact with one another, they can use common subroutines, they can
maintain state with normal perl variables, etc.

The $HOME/.spamometerrc will always be the first function file
included (using perl's 'do' statement). Avoid this with the -i option.


The idea of the design is for the spamometer to deal with getting the
mail, calling the perl functions (in an appropriate order) in the
spamometer function file, and dealing with their results. You get to
simply concentrate on the function files. You can include spamometer
functions from anyone you please. You should be able to do almost
anything you want. You shouldn't need to modify the spamometer to add
new tests for spam.

In order to have one of your functions run at the appropriate time,
you must write it, place it in a file that you pass to the spamometer,
and register the function. Registration looks like this:

  register("subject_re_test",     $HEADER_FUNC,    priority);
  register("too_many_bangs_test", $BODY_FUNC,      priority);
  register("uppercase_test",      $FULL_BODY_FUNC, priority);

The first arg is the (string) name of the function you want to
register. Next is the spamometer function type (see above). There are
only 3; use the predefined variables to indicate which yours is.

Lastly, give a priority. Priorities run from 0 (or $HIGHEST_PRIORITY)
up to 100 ($LOWEST_PRIORITY). You may also use $DEFAULT_PRIORITY to
get something in between. The priority is used to determine the order
in which the functions (of the same type) will be called.

The highest priority $HEADER_FUNC spamometer functions are called
following the reading of the header. **If no $HEADER_FUNC functions
are registered, the header is simply skipped and will not be available
to any other functions.** This can easily be avoided by defining a
$HEADER_FUNC function that simply returns $NO_OPINION.

Then, lower priority header functions are invoked. Then, assuming the
message has not yet been classified as spam or non-spam, the body is
read line by line. After each line, all $BODY_FUNC functions are
called, from highest to lowest priority.

Finally, when the entire message body has been read, $FULL_BODY_FUNC
functions are called, also from highest to lowest priority.
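
To make the ordering concrete, here is a toy model of registration and
calling order. It is not the spamometer's own code: the register() here
is a stand-in, the type tags and the default priority of 50 are
assumptions (only 0 and 100 come from the text above), and
domain_list_test is a made-up function name.

```perl
use strict;
use warnings;

# Stand-ins for the spamometer's predefined variables. The values 0 and
# 100 come from the text; the type tags and the default of 50 are assumed.
our ($HEADER_FUNC, $BODY_FUNC, $FULL_BODY_FUNC) = ('header', 'body', 'full_body');
our ($HIGHEST_PRIORITY, $DEFAULT_PRIORITY, $LOWEST_PRIORITY) = (0, 50, 100);

my %registry;    # function type => list of [name, priority]

# Toy version of register(): remember the name under its type.
sub register {
    my ($name, $type, $priority) = @_;
    push @{ $registry{$type} }, [ $name, $priority ];
}

# Names of one type in calling order: smallest number (highest
# priority) first. Ties are broken by name to keep the order predictable.
sub call_order {
    my ($type) = @_;
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
           @{ $registry{$type} || [] };
}

register('domain_list_test',    $HEADER_FUNC, $LOWEST_PRIORITY);
register('subject_re_test',     $HEADER_FUNC, $HIGHEST_PRIORITY);
register('too_many_bangs_test', $BODY_FUNC,   $DEFAULT_PRIORITY);

# Prints: subject_re_test, domain_list_test
print join(', ', call_order($HEADER_FUNC)), "\n";
```

Functions of different types never compete with each other; the
priority only orders functions within one type.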


It might be useful to think of priorities as expected return times. If
a spamometer function can do its work very quickly (e.g., by simply
looking for a regexp match in one header line), give it a high
priority (a low number). If a function does a grep through 800 domain
names, giving it a lower priority (a higher number) will ensure that
simpler and faster tests (if any) will be run first.



Spamometer functions are all expected to return one of the following
values:

  $IS_SPAM
  $IS_NOT_SPAM
  $NO_OPINION
  $I_GIVE_UP      ($BODY_FUNC functions only)
  (0.0 ... 1.0)   (i.e., a real value greater than 0.0 and less than 1.0)

These have the following semantics:

  $IS_SPAM      = The message is spam with probability 1.0. Exit.
  $IS_NOT_SPAM  = The message is not spam with probability 1.0. Exit.
  $NO_OPINION   = I don't know anything (yet).
  $I_GIVE_UP    = I don't know, and I give up. Please don't call me again.
  (0.0 ... 1.0) = I estimate this as the probability this message is spam.

The last option is currently ignored (see the section on FUTURE below).


In the case of $BODY_FUNC functions, a return value of $I_GIVE_UP will
cause the spamometer to stop calling that function on subsequent body
lines. This allows for spamometer functions that attempt to identify
spam by looking at the start of the body of a message. If they cannot
identify the message as spam, they return $I_GIVE_UP to eliminate
themselves. This is a big advantage, since the entire body of the mail
message need not be read. Typically, such functions will either make a
quick decision that a mail is or is not spam, or will give up early.
For an example, see the too_many_bangs_test in the distributed
.spamometerrc file.
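
A $BODY_FUNC of this kind might look something like the sketch below.
It is not the distributed too_many_bangs_test. The function name, the
two thresholds and the values given to $IS_SPAM and friends are all
assumptions made so the example can run on its own; in a real function
file the spamometer supplies those variables. It also assumes $nlines
already counts the line just read.

```perl
use strict;
use warnings;

# Stand-in values: the real $IS_SPAM etc. are defined by the spamometer
# and their actual values are not given in this README.
our ($IS_SPAM, $NO_OPINION, $I_GIVE_UP) = ('spam', 'no-opinion', 'give-up');

my $MAX_LINES = 10;   # only inspect this many body lines (assumed)
my $MAX_BANGS = 3;    # this many lines ending in "!" means spam (assumed)
our $bangs    = 0;    # state kept between calls in a normal variable

# Hypothetical $BODY_FUNC. Only the first two documented arguments are
# used; the rest of the argument list is ignored.
sub shouting_start_test {
    my ($line, $nlines) = @_;
    $bangs++ if $line =~ /!\s*$/;                  # line ends in one or more "!"
    return $IS_SPAM    if $bangs >= $MAX_BANGS;
    return $I_GIVE_UP  if $nlines >= $MAX_LINES;   # past the start: drop out
    return $NO_OPINION;
}
```

Once this returns $I_GIVE_UP it should not be called again, so a long
message costs it nothing beyond the first ten lines.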



The calling interface of these 3 function types is as follows:

    $HEADER_FUNC:    (%headers, $verbose)
    $BODY_FUNC:      ($line, $nlines, $nchars, %headers, $verbose)
    $FULL_BODY_FUNC: ($file, $nlines, $nchars, %headers, $verbose)


In all cases, $verbose indicates whether the user specified -v (in
which case messages can be printed (to STDOUT) indicating reasons for
rejection as spam, or otherwise). $line is the just-read body
line. $nlines and $nchars are the number of lines and chars read so
far. In the case of the $FULL_BODY_FUNC functions, this will be the
total number of lines and chars in the file. These are passed in the
hope that they will save on recomputation (or independent computation
by several spamometer functions). $file contains the name of the file
with the body of the mail message in it. This file needs to be opened,
and you should remember to close it (there may be many functions
reading this file after yours).
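
As a rough illustration of the open/read/close pattern, here is a
sketch of a $FULL_BODY_FUNC that compares upper and lower case letter
counts. It is not the distributed uppercase_test; the function name,
the threshold and the stand-in values for $IS_SPAM and $NO_OPINION are
assumptions, and it deliberately uses only the first argument.

```perl
use strict;
use warnings;

our ($IS_SPAM, $NO_OPINION) = ('spam', 'no-opinion');   # stand-in values
my $MAX_RATIO = 2.0;   # assumed threshold: upper-case per lower-case letter

# Hypothetical $FULL_BODY_FUNC. Only the file name (first argument) is
# used; the remaining documented arguments are ignored.
sub mostly_uppercase_test {
    my ($file) = @_;
    my ($upper, $lower) = (0, 0);

    open(my $fh, '<', $file) or return $NO_OPINION;   # cannot read: abstain
    while (my $line = <$fh>) {
        $upper += ($line =~ tr/A-Z//);   # tr counts without changing $line
        $lower += ($line =~ tr/a-z//);
    }
    close($fh);                          # done; others may read the file next

    return $NO_OPINION if $upper + $lower == 0;        # no letters at all
    return $IS_SPAM    if $upper > $MAX_RATIO * $lower;
    return $NO_OPINION;
}
```

Counting letters this way duplicates nothing the spamometer passes in;
$nlines and $nchars would only help for tests based on message size.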


The %headers variable is an associative array that holds the header of
the mail message. This contains keys such as $headers{'subject:'} and
$headers{'to:'}. Note that the : is left in the key and the key is
always lower case. The colon is left in to allow you to look at both
$headers{'from:'} and $headers{'from'}.

If the spamometer finds continuation lines in the header, these are
simply concatenated (with the newline removed). The extra whitespace
is left intact.
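
If you want to try your tests outside the spamometer, a hash with the
same layout can be built along these lines. This is a sketch, not the
spamometer's own parser. parse_header is a made-up name, and two
details are guesses: whitespace right after the colon is dropped, and
a repeated header overwrites the earlier value. The 'from' key without
a colon is not handled here.

```perl
use strict;
use warnings;

# Hypothetical helper that builds a hash laid out as described above:
# lower-case keys with the colon kept, continuation lines appended
# with their leading whitespace intact.
sub parse_header {
    my (@lines) = @_;
    my (%headers, $key);
    for my $line (@lines) {
        chomp(my $text = $line);
        if ($text =~ /^\s+/ && defined $key) {
            $headers{$key} .= $text;             # continuation line
        }
        elsif ($text =~ /^([^:\s]+:)\s*(.*)$/) {
            $key = lc $1;                        # "Subject:" becomes "subject:"
            $headers{$key} = $2;
        }
    }
    return %headers;
}

my %headers = parse_header(
    "Subject: MAKE MONEY\n",
    "\tFAST!\n",
    "To: terry\n",
);
print "subject looks loud\n" if $headers{'subject:'} =~ /!$/;
```

Because the continuation's whitespace is kept, a regexp that spans the
join has to allow for it (here the subject is "MAKE MONEY", a tab,
then "FAST!").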


----------------------------------- FUTURE -----------------------------------

The spamometer is basically a piece of scaffolding to support the
program I really wanted to write.

The piece that is missing is treatment of probabilities other than 0.0
and 1.0.

Someone with experience in Bayesian inference (or similar) might be
able to help me out here. The basic idea is to accumulate evidence
that a piece of mail is spam (or otherwise).  So, for example, if you
know that roughly half the mail you receive with a subject line that
is all uppercase and which ends with a ! is spam, you could write a
tiny $HEADER_FUNC that returned 0.5 if this were the case. The
evidence would be incorporated with evidence from other sources to
determine (or, more likely, guess educatedly) if the mail were
spam. This evidence might be combined with the presence of 3 or more $
signs in the first 10 lines, plus the high probability that mail
coming from a domain that looked like "@.*sales.*\.com" was
spam. These sorts of things would be offset by low probabilities when
your functions encountered reassuring things in the mail (for example,
the presence of a subject line that started with "Re:", the presence
of an "In-Reply-To" header, or mail that comes from a .edu domain).
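
To give a feel for what accumulating evidence could mean, here is one
textbook way to combine such estimates: multiply the odds, treating
the tests as independent and starting from even odds. This is only an
illustration. The spamometer does nothing of the sort today,
combine_evidence is a made-up name, and apart from the 0.5 mentioned
above the probabilities are invented.

```perl
use strict;
use warnings;

# Combine per-test spam probabilities, each strictly between 0 and 1,
# treating the tests as independent and starting from even odds.
# combine_evidence is a hypothetical name.
sub combine_evidence {
    my (@probs) = @_;
    my ($spam, $ham) = (1, 1);
    for my $p (@probs) {
        $spam *= $p;        # weight for "this is spam"
        $ham  *= 1 - $p;    # weight for "this is not spam"
    }
    return $spam / ($spam + $ham);
}

# Loud subject (0.5), several dollar signs (0.8, invented),
# subject starting with "Re:" (0.2, invented).
printf "%.2f\n", combine_evidence(0.5, 0.8, 0.2);   # prints 0.50
```

A test that returns 0.5 leaves the result unchanged, and a reassuring
0.2 exactly cancels an alarming 0.8. Whether independence is a fair
assumption for real mail is another matter.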

As the arms race of spammer vs spamee evolves, a program such as the
spamometer can be easily modified to reflect the current state of the
art. For example, where perhaps it was once a reliable indicator that
mail was not spam if it contained a "Comments: Authenticated Sender is
.*+@" header, this can now be taken as a red flag. The spamometer (if
what I want to get done ever gets implemented by someone who knows
how) can be nicely adjusted by simply altering probabilities or adding
simple functions.