Re: Matching # of recipients in To:?

dattier(_at_)wwa(_dot_)com (David W. Tamkin) writes:

Dan Smith wrote,

| Here's what I use in my spam heuristics, courtesy of David Tamkin.

Thanks, Dan.  As I recall, though, Philip Guenther and at least one other
person deserve shares in the credit.


Looking through my procmail outbox I think I found the original thread.
It appears it was raised by Ken Marsh, and in the end David and I ended
up with the following recursive INCLUDERC to count the comma separated
items in any number of headers.  This is more generic than the solutions
presented recently as it will work whether or not there are multiple
To: or Cc: headers, and will correctly (for some values of correctly)
handle Resent-* headers.

Here's the last mail I have from the thread, with one small bugfix
applied.  In this case the goal was to bounce any message with more
than 19 recipients.

To: dattier(_at_)wwa(_dot_)com (David W. Tamkin)
cc: procmail(_at_)informatik(_dot_)rwth-aachen(_dot_)de (Procmail Mailing List)
Subject: Re: Counting score program exit code and negation 
In-reply-to: Your message of "Thu, 06 Feb 1997 14:13:50 CST."
            <m0vsaCd-000k8qC(_at_)miso(_dot_)wwa(_dot_)com> 
Date: Thu, 06 Feb 1997 16:19:25 -0600
From: Philip Guenther <guenther(_at_)gac(_dot_)edu>

dattier(_at_)wwa(_dot_)com (David W. Tamkin) writes:

Well, ok, I've deconstructed it.  (You knew I would.)


To summarize for those trying to follow David's deconstruction,
into the 'calling' procmailrc, put:


# Put the regexp of the lines to examine in REGEXP
# Also count Apparently-To: headers, just because they're obnoxious.
:0
* ^Resent-(From|Date|To|Cc|Message-Id):
{ REGEXP = "^(Resent-(To|Cc)|Apparently-To):" }
:0E
{ REGEXP = "^((Apparently-)?To|Cc):" }


# Put the string to match against into $HEADERLINES
:0 # H is implicit
* $ ()\/$REGEXP(.*$)*
{
 # Set up the initial run.
 HEADERLINES = $MATCH

 # What's the maximum number of items allowed?  This many passes; one
 # more gets torched.
 EXCESS = -19

 # Let it rip.
 INCLUDERC = count_comma_sep.rc

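 # EXCESS started at -19, so it is positive only past the limit.
 # Mis-setting HOST stops rcfile processing; procmail then exits
 # with EXITCODE.
 :0
 * $ $EXCESS^0
 { EXITCODE=77 HOST }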
}
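
If bouncing is harsher than you want, the same test can file the message
instead.  A minimal sketch, assuming a mailbox named too-many-recipients
(a made-up name, not from the thread); it would stand in for the final
recipe inside the braces above:

 # EXCESS is positive only when the limit was exceeded.
 # The trailing colon on :0: asks for a local lockfile on the mailbox.
 :0:
 * $ $EXCESS^0
 too-many-recipients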



And in the count_comma_sep.rc file:


# Goal: count the number of comma-separated items in $HEADERLINES in
# lines that begin with $REGEXP.  EXCESS will start at the negative of
# the maximum allowed, then count up towards zero.

# Clear MATCH, then count the items in the first match, if any.
MATCH=
:0
* 1^0 HEADERLINES ?? $ $REGEXP\/.*
* 1^1 MATCH ?? ,
{
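   # Okay, increment EXCESS by $=.  The sum lands in $= whether or not
   # it is positive, so do the assignment outside the (empty) action;
   # the braces would be skipped while the score is still below one.
   :0
   * $ $EXCESS^0
   * $ $=^0
   { }
   EXCESS = $=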

   # Now, do we need to recurse?  Only if there's still another
   # match to $REGEXP.  If so, we reset $HEADERLINES to contain only
   # the second match and beyond, and then we take another ride.
   :0
   * HEADERLINES ?? $ $REGEXP(.*$)*\/$REGEXP(.*$)*
   {
      HEADERLINES = $MATCH

      # Recurse!
      INCLUDERC = $_
   }
}


The one optimization that I can still see is to add the following
condition to the recursion recipe:

   :0
   * ! EXCESS ?? ^[1-9]
   * HEADERLINES ?? $ $REGEXP(.*$)*\/$REGEXP(.*$)*
   {
      ...
   }

That'll avoid the recursion if we've already gone positive.  Note that
this is a Good Thing if only to avoid the copies that the very next
condition will cause in copying the matched text back and forth from
HEADERLINES to MATCH and back again.

HOWEVER...

With this added condition, it becomes more complicated if you want to
get the total count with no upper limit, rather than test against a
limit as in the original goal: once EXCESS goes positive the recursion
stops, so the count stops there too.  If you think it unlikely that
you'll need to do that, or having to set EXCESS to negative 10 million
doesn't bother you, go ahead and add the condition.
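
For the record, if the extra condition is left out, getting a plain total
looks simple enough: start EXCESS at zero so that it ends up holding the
count itself.  A minimal sketch along the lines of the calling recipe;
RECIPIENTS is a made-up variable name, and REGEXP is assumed to be set
already:

:0
* $ ()\/$REGEXP(.*$)*
{
 HEADERLINES = $MATCH

 # No limit: start from zero and let the rcfile count upward.
 EXCESS = 0
 INCLUDERC = count_comma_sep.rc

 # EXCESS should now hold the number of comma-separated items found.
 RECIPIENTS = $EXCESS
}

If no header matches $REGEXP at all, the block is skipped and RECIPIENTS
is simply left unset.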



Philip Guenther