Re: Simplest Whitelist?

At 12:01 2003-11-27 -1000, Michael J Wise wrote:

On Nov 27, 2003, at 11:45 AM, Dallman Ross wrote:

That seems tremendously wasteful to me.


It isn't.
And it allows you to whitelist based on ANYTHING in the headers at all.

If you want to greenlist on _anything_, then it isn't a traditional address-based greenlist. That's not necessarily bad (as detailed below, I do something similar, but I really don't consider it a greenlist).

Procmail has already stored it all in memory.


Groovy, but you still get to regexp through it all.

The biggest performance hit is in launching fgrep in the first place.

That's not always the case - as a list grows in size, the search time seems to grow exponentially.

It is in YOUR power to decide what matches, Sir.
If you specify a poorly-chosen regexp, please don't blame me.
Blame the author of the regexp.

Since _fgrep_ isn't going to be performing much of a REGEXP evaluation, you're going to be matching on simple strings.

The Subject: of this thread is the "Simplest Whitelist".

As I've mentioned many times on this list, I extract common fields from messages into variables - one time, right at the top of the main procmailrc. That way, they're available to all the recipes, which then don't have to individually handle extracting commonly extracted elements. Sure, I take the hit on extraction for each message, even when a message might not be subjected to a recipe that needs those fields, but I don't end up with duplication of effort, and the whole process is kept tidy. Among those fields are the components of the return address (ignoring the Reply-To field). To wit:


        # get the From: address as an address component ONLY (no comments)
        :0 h
        CLEANFROM=|formail -IReply-To: -rtzxTo:

        # username portion
        :0
        * CLEANFROM ?? ^\/[^@]+
        {
                FROM_USER=$MATCH
        }

        # domain portion
        :0
        * CLEANFROM ?? @\/.*
        {
                FROM_DOMAIN=$MATCH
        }

Armed with that, you can relatively easily check that address against a whitelist if you want, WITHOUT obscure substring mismatches. $CLEANFROM is the whole address (without username/comment stuff), and $FROM_DOMAIN is handy for domain-based whitelists. While there's some setup at the top of the rcfile, this I think really represents one of the "simplest" yet _non_vague_ matching greenlist implementations:


# -F: literal string (dots aren't wildcards), -x: whole line, -q: quiet
:0
* ! ? grep -Fxq "$CLEANFROM" "$greenlistfile"
{
        # not in the greenlist (see ! in condition), which presumably means
        # you're going to subject the message to more rigorous evaluation.
}
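
To show why the whole-line match in the recipe above matters, here is a minimal Python sketch contrasting it with a plain substring test. It is not part of the procmail setup; the function names and the sample list entries are made up for illustration.

```python
def substring_listed(address, entries):
    """Substring test, roughly what an unanchored fixed-string grep does."""
    return any(address in entry for entry in entries)


def exact_listed(address, entries):
    """Whole-line test: the address must equal an entry exactly."""
    return any(address == entry.strip() for entry in entries)


# Hypothetical greenlist contents, one address per line.
greenlist = ["bob@example.com", "alice@example.org"]

print(substring_listed("ob@example.com", greenlist))  # True: accidental hit
print(exact_listed("ob@example.com", greenlist))      # False
print(exact_listed("bob@example.com", greenlist))     # True
```

The substring version accepts "ob@example.com" because it happens to be contained in a listed address, which is exactly the sort of obscure mismatch the anchored check avoids.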

Alternately (and this is how I actually do a simple NOSPAM bypass, which isn't precisely a greenlist, since I separately check for twits, which are categorically different from spam):



:0h
FAILKEY=| (formail -ISubject: -ITo: -ICc: | $MEGAGREP -i -f $NOSPAMLIST)

# If failkey is blank, we didn't match anything in the greenlist
# note that failkey conveniently contains the entry which matched in the list,
# which means we can use it for logging, or other purposes.
:0
* FAILKEY ?? ^^^^
{
        LOCKFILE=$TEMP/spamrc$LOCKEXT
        INCLUDERC=$SPAMDIR/spam.rc
        LOCKFILE
}

The spam.rc can be *VERY* intensive, so in order to reduce loading, there's an explicit LOCKFILE operation around the INCLUDERC.

MEGAGREP isn't really a GREP, but an AVL tree parser I wrote for this specialized purpose - each line in the greenlist file is loaded into an AVL tree, and then the material passed in on STDIN is broken into certain address-friendly word-boundary tokens and searched against that AVL tree. The searches are EXTREMELY fast. Plus, unlike greps, the passed file isn't built into some obscenely sized construct when it is loaded into memory (though the file IS loaded entirely into memory). Run grep sometime with some large files and monitor memory usage on your system - it gets really UGLY when you're using the -w argument. While my greenlist is relatively small (94 entries), the same sort of invocation is used for checking a private name-based blacklist, which exceeds 200K entries (about 3MB in size), so speed and memory consumption are a concern, making a traditional GREP wholly unsuitable to the purpose.

Anyway, the benefit to parsing the message data second (otherwise, I'd load the message text into the AVL and parse through the greenlist entry by entry, which would be a lot less memory intensive) is that the address tokens can receive special sub-parsing: first a search is performed for a complete string, say:


        first.last@host.domain.tld

If that fails, and the string is found to contain an @, then the string is chopped there and the search re-performed:


        host.domain.tld

then:

        domain.tld

and finally:

        tld

The only reason tld would match in the greenlist (or blacklist, depending upon how you're using it), is if you entered that into the list, which might be useful if you were blacklisting com.tw or somesuch.
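
The fallback sequence described above can be sketched in a few lines of Python. This is only an illustration of the lookup order, not the MEGAGREP source: it uses a plain Python set where MEGAGREP uses an AVL tree, the names candidate_keys and lookup are invented here, and it assumes that tokens without an @ also get the label-by-label chopping and that list entries are stored in lowercase.

```python
def candidate_keys(token):
    """Yield the token, then progressively shorter domain suffixes."""
    yield token
    if "@" in token:
        # Chop at the @ and continue with the host part only.
        token = token.split("@", 1)[1]
        yield token
    # Drop one leading label at a time: host.domain.tld -> domain.tld -> tld
    while "." in token:
        token = token.split(".", 1)[1]
        yield token


def lookup(token, entries):
    """Return the first list entry that matches, or None if nothing does."""
    for key in candidate_keys(token.lower()):
        if key in entries:
            return key
    return None


# Hypothetical list contents (a set stands in for the AVL tree).
entries = {"example.com", "friend@elsewhere.example"}

print(list(candidate_keys("first.last@host.domain.tld")))
print(lookup("first.last@mail.example.com", entries))  # example.com
print(lookup("someone@other.test", entries))           # None
```

As with FAILKEY in the recipe, the value returned is the list entry that matched, so it can be logged or reused.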

Of course, you can redirect whatever you want at the megagrep operation - I happen to check the complete headers (minus a few) because any of several headers might contain reference to a blacklisted domain (such as a Received: header), but when checking for greenlisting, I eliminate reference to Subject:, To: and Cc: lines, where the spammer may have identified an address of someone who is in the greenlist.
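
For anyone who wants to experiment with that header filtering outside of procmail, here is a rough Python equivalent of dropping Subject:, To: and Cc: before the greenlist check. It uses the standard library email module rather than formail; the function name headers_for_greenlist and the sample message are made up for illustration.

```python
from email import message_from_string


def headers_for_greenlist(raw_message, skip=("subject", "to", "cc")):
    """Return 'Name: value' header lines, minus the headers named in skip."""
    msg = message_from_string(raw_message)
    return [
        f"{name}: {value}"
        for name, value in msg.items()
        if name.lower() not in skip
    ]


# Hypothetical message: the spammer names a greenlisted address in To: and Subject:.
raw = (
    "Received: from mail.example.net by mx.example.org\n"
    "From: someone@example.net\n"
    "To: friend@example.com\n"
    "Cc: other@example.org\n"
    "Subject: hello from friend@example.com\n"
    "\n"
    "body text\n"
)

for line in headers_for_greenlist(raw):
    print(line)
```

Only the Received: and From: lines survive, so a greenlisted address that appears solely in To:, Cc: or Subject: cannot trigger the bypass.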


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail