More Bouncing off of Infinity

2003-01-12 16:06:55
I wrote this morning:

Meanwhile, I woke up inspired, because I have a further extension of
the "infinity bounce" (IB) in mind.  I'll document it and send it in
shortly.

Here it is.  I will preface this by saying that my personal procmail
strategy is to avoid body-greps as much as I possibly can.  This has
turned into a symbolic campaign for me, by which I mean that my
distaste for body-greps is probably exaggerated in comparison to 
the actual load they impose on the mail system.  Nevertheless,
by refraining from their use in all but the most egregious
circumstances, I do find that I am forced to keep my options open 
and my creative juices flowing.  It's better for my code, in other
words.

So it is that at the very end of the spam-snag section of my
.procmailrc, I deign to allow a couple of body-greps; but *only if no
other recipe of mine has characterized the message as spam*.  These
are my last-chance babies, as it were.  Since my spam outweighs my
non-spam by about 3:1,
and since most non-spam is trickled off early out of the sieve by
my whitelists, and since my header-only spam snaggers are very
effective, I almost never actually have to rely on these last-chance 
body-grep recipes to catch something.  Thus, their impact on my system 
is pretty low overall.

Nevertheless, sometimes body-greps are useful.  Moreover, if I can 
make them less invasive, then I'd feel free to use them more broadly.

The IB technique now allows me to do that.

Consider a typical body-grep action with scoring.  As soon as you
start counting, as in, for example,

        * B ?? 1^1  ()\<FREE\>

you're normally going to have to swallow the entire body before 
you're through.  Feeding the body of a largish message to the pipe 
is one aspect of the load -- and one we won't be able to 
escape regardless of our trickery.  And running procmail's 
egreppy operation on the whole megillah is another aspect
of the load.  This second aspect, we can now limit, in cases
where we can accept an upper boundary on the count and then
be willing to stop counting.

One of my "just-in-case" reserved body-grep recipes has been this:

 :0  # 021216 () where's the "multipart"?  There's just one encoded part
  * CTYPE    ?? ^^multipart/mixed
  *  2^0  B  ??  ^Content-Transfer-Encoding:(.*\<)?base64
  * -1^1  B  ??  ^Content-Type:
  { RX = "${RX:+$RX, }UBE.B.BASE64" }

I only run that near the end, if and when I haven't yet determined by 
less invasive means (header checks) that a message is spam.  If I get 
a series of long multipart-encoded messages, this recipe would have to 
scour all the full bodies.  Using the "IB" technique, though, I can 
limit the search somewhat.  I can stop as soon as I hit the second 
message part (of the "multipart," remember?), skipping any subsequent 
parts and all text below the section marker.  For big multiparts, this 
is an advantage.

The only place I *won't* be able to cut back is on messages with
only one part.  Those are almost always going to be the very spam
that I am looking for in the first place.  They won't have two parts
despite the multipart moniker; so the search mechanism will keep
going until the end.  Luckily, most of these messages are pretty
small in size.  (I just checked the last-100 spam messages that I
always keep on-hand, and fifteen of them would have been caught by
the above recipe.  All but one are under 5K in size.)

Here's the same recipe rewritten to take advantage of IB:

 :0  # 030112 () where's the "multipart"?  There's just one encoded part
  *                    CTYPE ??  ^^multipart/mixed
  *            2^0  B        ??  ^Content-Transfer-Encoding:(.*\<)?base64
  * $ -$SUPREMUM^0
  *           -1^1  B        ??  ^Content-Type:
  * $  $SUPREMUM^0
  { RX = "${RX:+$RX, }UBE.B.BASE64" }


What we're doing is place-shifting the count.  The initial recipe set
the score to two and then counted down.  If it found two (or more)
instances of ^Content-Type: in the message, then the "multipart"
assertion in the header was not a lie.  In the new version, instead of
counting down from plus-two, we initialize our count at
minus-"infinity"-plus-two.  Then we start decrementing.  Remember:
when a weighted recipe's score reaches positive or negative infinity,
all other weighted conditions in the recipe are skipped.  If I find
two ^Content-Type: markers, the recipe has satisfied itself that, at
least insofar as "multipart" goes, all's well.  Since, with the
place-shifting, two instances of our marker bring us up against
minus-infinity, the recipe exits quietly.  If fewer than two markers
turn up, the final condition adds $SUPREMUM back, leaving the same
small positive score the original recipe would have produced, and the
recipe matches.
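
To make the place-shifting concrete, here is the running score traced
through the rewritten recipe, written out as rc-file comments.  This is
only a sketch of the arithmetic.  It assumes SUPREMUM = 2147483647, as
in the addendum, and assumes the score floor is -2147483647, per the
behavior described above; n stands for the number of ^Content-Type:
lines found in the body.

```procmail
# Score trace for the IB recipe (comments only; nothing here is executed).
# Assumption: the floor is -SUPREMUM = -2147483647.
#
#   condition                          running score
#   CTYPE ?? ^^multipart/mixed         (unweighted; must match)
#   2^0  base64 line found             2
#   -SUPREMUM^0                        2 - 2147483647 = -2147483645
#   -1^1 ^Content-Type:, n = 0         -2147483645
#                        n = 1         -2147483646
#                        n = 2         -2147483647  <- floor: stop, no match
#   +SUPREMUM^0 (n = 0)                2  -> recipe matches
#               (n = 1)                1  -> recipe matches
#
# If no base64 line is found, the score is 0 - 2147483647, which is already
# the floor, so by the same rule the Content-Type scan would be skipped
# entirely.
```

The final scores for n = 0 and n = 1 are the same ones the original
recipe would have reached; only the n >= 2 case is cut short.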

I never have to count past two markers in my big, multipart messages.
I can feel better about employing this recipe more freely, and about
limiting the impact of the body-grep as much as possible.

Hope this is useful to some.

Addendum:  $CTYPE contains what was in the Content-Type: header;
and $SUPREMUM has been set to 2147483647, about which see the
procmailsc man pages.
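
A minimal sketch of how those two variables might be set earlier in the
rc file.  The SUPREMUM value is the one given above; the recipe that
fills CTYPE is an assumption for illustration, not necessarily how the
original rc file does it.

```procmail
# Largest score value, as given in the addendum (see procmailsc).
SUPREMUM = 2147483647

# Assumed way of capturing the Content-Type: header value into CTYPE.
# The brackets contain one space and one tab, so that CTYPE starts at
# the first non-blank character and ^^multipart/mixed can anchor on it.
:0
* ^Content-Type:[ 	]*\/.*
{ CTYPE = "$MATCH" }
```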

P.S.  Another good use for "IB" will be to save a preset, manageable
number of chars to a MATCH variable for later manipulation and
testing.  We couldn't do that before without resorting to shell
tools to help with the parsing; and that was costly.  An example
of this use would be in searches for viruses, whose characteristic
signatures are typically found near the front of message bodies.


-- 
Dallman Ross

"If you find a path with no obstacles, it probably does not lead to
anywhere."
        Thoughts of Rev. Sunnan Kubose, from _Zen in the Markets_ 


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
