At 12:01 2003-11-27 -1000, Michael J Wise wrote:
On Nov 27, 2003, at 11:45 AM, Dallman Ross wrote:
That seems tremendously wasteful to me.
It isn't.
And it allows you to whitelist based on ANYTHING in the headers at all.
If you want to greenlist on _anything_, then it isn't a traditional 
address-based greenlist.  That's not necessarily bad (as detailed below, I 
do something similar, but I really don't consider it a greenlist).
Procmail has already stored it all in memory.
Groovy, but you still get to regexp through it all.
The biggest performance hit is in launching fgrep in the first place.
That's not always the case - as a list grows in size, the search time seems 
to grow exponentially.
It is in YOUR power to decide what matches, Sir.
If you specify a poorly-chosen regexp, please don't blame me.
Blame the author of the regexp.
Since _fgrep_ isn't going to be performing much of a REGEXP evaluation, 
you're going to be matching on simple strings.
The Subject: of this thread is the "Simplest Whitelist".
As I've mentioned many times on this list, I extract common fields from 
messages into variables - one time, right at the top of the main 
procmailrc.  That way, they're available to all the recipes, which then 
don't each have to extract those elements themselves.  Sure, I take the 
hit on extraction for every message, even when a 
message might not be subjected to a recipe needing to evaluate for those 
fields, but I don't end up with duplication of effort, and the whole 
process is kept tidy.  Among those are the components of the return address 
(ignoring any Reply-To: field).  To wit:
        # get the From: address as an address component ONLY (no comments)
        :0 h
        CLEANFROM=|formail -IReply-To: -rtzxTo:
        # username portion
        :0
        * CLEANFROM ?? ^\/[^@]+
        {
                FROM_USER=$MATCH
        }
        # domain portion
        :0
        * CLEANFROM ?? @\/.*
        {
                FROM_DOMAIN=$MATCH
        }
Armed with that, you can relatively easily check that address against a 
whitelist if you want, WITHOUT obscure substring mismatches.  $CLEANFROM is 
the whole address (without username/comment stuff), and $FROM_DOMAIN is 
handy for domain-based whitelists.  While there's some setup at the top of 
the rcfile, this I think really represents one of the "simplest" yet 
_non_vague_ matching greenlist implementations:
:0
* ! ? grep ^${CLEANFROM}$ $greenlistfile
{
        # not in the greenlist (see ! in condition), which presumably means
        # you're going to subject the message to more rigorous evaluation.
}
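
One caveat on the grep call above: without -F, the dots in an address act as regexp wildcards, so an anchored pattern can still match a list entry that differs at a dot. A minimal shell sketch of a whole-line, fixed-string lookup follows. The function name in_greenlist and the sample addresses are illustrative assumptions, not part of the setup described here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the address ($1) appears as a whole line
# in the list file ($2).
#   -F  treat the pattern as a fixed string (dots are literal)
#   -x  match whole lines only (no substring hits)
#   -q  quiet; only the exit status matters
#   -i  ignore case
in_greenlist() {
    grep -Fxqi -e "$1" "$2"
}

# Usage example with a throwaway list file.
list=$(mktemp)
printf '%s\n' 'alice@example.com' 'bob@example.org' > "$list"

in_greenlist 'alice@example.com' "$list" && echo "listed"
in_greenlist 'alice@exampleXcom' "$list" || echo "not listed"   # dot is literal
in_greenlist 'lice@example.com'  "$list" || echo "not listed"   # no substring hit
rm -f "$list"
```

In the recipe itself, the condition would presumably become `* ! ? grep -Fxqi -e "$CLEANFROM" "$greenlistfile"`; treat that line as an untested assumption rather than a drop-in.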
Alternately (and this is how I actually do a simple NOSPAM bypass, which 
isn't precisely a greenlist, since I separately check for twits, which are 
categorically different than spam):
:0h
FAILKEY=| (formail -ISubject: -ITo: -ICc: | $MEGAGREP -i -f $NOSPAMLIST)
# If failkey is blank, we didn't match anything in the greenlist
# note that failkey conveniently contains the entry which matched in the list,
# which means we can use it for logging, or other purposes.
:0
* FAILKEY ?? ^^^^
{
        LOCKFILE=$TEMP/spamrc$LOCKEXT
        INCLUDERC=$SPAMDIR/spam.rc
        LOCKFILE
}
The spam.rc can be *VERY* intensive, so in order to reduce loading, there's 
an explicit LOCKFILE operation around the INCLUDERC.
MEGAGREP isn't really a GREP, but an AVL tree parser I wrote for this 
specialized purpose - each line in the greenlist file is loaded into an AVL 
tree, and then the material passed in on STDIN is broken into certain 
address-friendly word-boundary tokens and searched against that AVL 
tree.  The searches are EXTREMELY fast.  Plus, unlike greps, the passed 
file isn't built into some obscene sized construct when it is loaded into 
memory (though the file IS loaded entirely into memory).  Run grep sometime 
with some large files and monitor memory usage on your system - it gets 
really UGLY when you're using the -w argument.  While my greenlist is 
relatively small (94 entries), the same sort of invocation is used for 
checking a private namebased blacklist, which exceeds 200K entries (about 
3MB in size), so speed and memory consumption are concerns, making a 
traditional GREP wholly unsuitable to the purpose.
Anyway, the benefit to parsing the message data second (otherwise, I'd load 
the message text into the AVL and parse through the greenlist entry by 
entry, which would be a lot less memory intensive) is that the address 
tokens can receive special sub-parsing: first a search is performed for a 
complete string, say:
        first.last@host.domain.tld
If that fails, and the string is found to contain an @, then the string is 
chopped there and the search re-performed:
        host.domain.tld
then:
        domain.tld
and finally:
        tld
The only reason tld would match in the greenlist (or blacklist, depending 
upon how you're using it), is if you entered that into the list, which 
might be useful if you were blacklisting com.tw or somesuch.
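
The fallback order just described can be sketched in plain shell. This is a hypothetical stand-in, not the megagrep source: the names fallback_keys and lookup are invented for illustration, grep takes the place of the AVL tree, and tokens without an @ are looked up as-is only (an assumption, since the description above only covers the @ case).

```shell
#!/bin/sh
# Print the lookup keys for one token, most specific first.
fallback_keys() {
    key=$1
    printf '%s\n' "$key"                 # 1. the complete token
    case $key in
        *@*) key=${key#*@} ;;            # 2. chop at the @ ...
        *)   return 0 ;;                 # assumption: no @, no fallback
    esac
    printf '%s\n' "$key"                 # ... and try host.domain.tld
    while :; do                          # 3. drop one leading label per pass
        case $key in
            *.*) key=${key#*.}; printf '%s\n' "$key" ;;
            *)   break ;;
        esac
    done
}

# Print the first key found as a whole line in the list file ($2);
# return non-zero when nothing matched.  Tokens are assumed to contain
# no whitespace or glob characters.
lookup() {
    for k in $(fallback_keys "$1"); do
        if grep -Fxqi -e "$k" "$2"; then
            printf '%s\n' "$k"
            return 0
        fi
    done
    return 1
}

# Usage example with a throwaway list file.
list=$(mktemp)
printf '%s\n' 'friend@example.org' 'example.com' 'com.tw' > "$list"
lookup 'first.last@mail.example.com' "$list"    # prints example.com
lookup 'someone@host.com.tw' "$list"            # prints com.tw
lookup 'nobody@example.net' "$list" || echo "no match"
rm -f "$list"
```

Printing the matched key mirrors the way FAILKEY ends up holding the list entry that matched, which keeps it available for logging.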
Of course, you can redirect whatever you want at the megagrep operation - I 
happen to check the complete headers (minus a few) because any of several 
headers might contain reference to a blacklisted domain (such as a 
Received: header), but when checking for greenlisting, I eliminate 
references to the Subject:, To: and Cc: lines, where the spammer may have 
identified an address of someone who is in the greenlist.
---
 Sean B. Straw / Professional Software Engineering
 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail@lists.RWTH-Aachen.DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail