procmail

Re: Simplest Whitelist?

2003-11-28 12:52:03
At 12:01 2003-11-27 -1000, Michael J Wise wrote:
> On Nov 27, 2003, at 11:45 AM, Dallman Ross wrote:
>
>> That seems tremendously wasteful to me.
>
> It isn't.
> And it allows you to whitelist based on ANYTHING in the headers at all.

If you want to greenlist on _anything_, then it isn't a traditional address-based greenlist. That's not necessarily bad (as detailed below, I do something similar, but I really don't consider it a greenlist).

> Procmail has already stored it all in memory.

Groovy, but you still get to regexp through it all.

> The biggest performance hit is in launching fgrep in the first place.

That's not always the case - as a list grows in size, the search time seems to grow exponentially.

> It is in YOUR power to decide what matches, Sir.
> If you specify a poorly-chosen regexp, please don't blame me.
> Blame the author of the regexp.

Since _fgrep_ isn't going to be performing much of a REGEXP evaluation, you're going to be matching on simple strings.

> The Subject: of this thread is the "Simplest Whitelist".

As I've mentioned many times on this list, I extract common fields from messages into variables - one time, right at the top of the main procmailrc. That way, they're available to all the recipes, which then don't each have to extract the same elements themselves. Sure, I take the hit on extraction for every message, even when a message never reaches a recipe that needs those fields, but I don't end up with duplication of effort, and the whole process is kept tidy. Among those fields are the components of the return address (ignoring any Reply-To: field). To wit:

        # get the From: address as an address component ONLY (no comments)
        :0 h
        CLEANFROM=|formail -IReply-To: -rtzxTo:

        # username portion
        :0
        * CLEANFROM ?? ^\/[^@]+
        {
                FROM_USER=$MATCH
        }

        # domain portion
        :0
        * CLEANFROM ?? @\/.*
        {
                FROM_DOMAIN=$MATCH
        }


Armed with that, you can relatively easily check that address against a whitelist if you want, WITHOUT obscure substring mismatches. $CLEANFROM is the whole address (without the real name/comment stuff), and $FROM_DOMAIN is handy for domain-based whitelists. While there's some setup at the top of the rcfile, I think this really represents one of the "simplest" yet _non_vague_ matching greenlist implementations:

# -F: take the address literally (a "." is not a wildcard); -x: whole lines only
:0
* ! ? grep -Fxq -e "$CLEANFROM" "$greenlistfile"
{
        # not in the greenlist (see ! in condition), which presumably means
        # you're going to subject the message to more rigorous evaluation.
}
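
For testing a greenlist file outside procmail, here is a minimal shell sketch of the same whole-line check. It assumes a POSIX grep (for -F, -x and -q) and that mktemp is available; the file contents and the in_greenlist helper name are made up for illustration and are not part of the recipe above.

```shell
#!/bin/sh
# Hypothetical greenlist file, one address per line.
greenlistfile=$(mktemp)
printf '%s\n' 'alice@example.com' 'bob@example.org' > "$greenlistfile"

# in_greenlist ADDRESS: succeed only if ADDRESS equals a whole line of the list.
#   -F  treat the address as a literal string, so "." is not a wildcard
#   -x  require the entire line to match, so substrings do not count
#   -q  print nothing; only the exit status is used
# An empty address is rejected up front so it can never match a blank line.
in_greenlist() {
    [ -n "$1" ] && grep -Fxq -e "$1" "$greenlistfile"
}

in_greenlist 'alice@example.com'  && echo "alice: listed"
in_greenlist 'malice@example.com' || echo "malice: not listed"
in_greenlist 'alice@exampleXcom'  || echo "lookalike: not listed"
```

The second and third calls are the kind of near-miss that a plain substring or unescaped regexp search could let through.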


Alternately (and this is how I actually do a simple NOSPAM bypass, which isn't precisely a greenlist, since I separately check for twits, which are categorically different from spam):


:0h
FAILKEY=| (formail -ISubject: -ITo: -ICc: | $MEGAGREP -i -f $NOSPAMLIST)

# If failkey is blank, we didn't match anything in the greenlist
# note that failkey conveniently contains the entry which matched in the list,
# which means we can use it for logging, or other purposes.
:0
* FAILKEY ?? ^^^^
{
        LOCKFILE=$TEMP/spamrc$LOCKEXT
        INCLUDERC=$SPAMDIR/spam.rc
        LOCKFILE
}


The spam.rc can be *VERY* intensive, so in order to reduce system load, there's an explicit LOCKFILE operation around the INCLUDERC, so that only one procmail instance at a time runs through it.

MEGAGREP isn't really a GREP, but an AVL tree parser I wrote for this specialized purpose - each line in the greenlist file is loaded into an AVL tree, and then the material passed in on STDIN is broken into certain address-friendly word-boundary tokens, each of which is searched against that AVL tree. The searches are EXTREMELY fast. Plus, unlike with greps, the passed file isn't built into some obscenely large construct when it is loaded into memory (though the file IS loaded entirely into memory). Run grep sometime with some large files and monitor memory usage on your system - it gets really UGLY when you're using the -w argument. While my greenlist is relatively small (94 entries), the same sort of invocation is used for checking a private name-based blacklist, which exceeds 200K entries (about 3MB in size), so speed and memory consumption are real concerns, making a traditional GREP wholly unsuitable for the purpose.

Anyway, the benefit of parsing the message data second (the alternative would be to load the message text into the AVL tree and walk through the greenlist entry by entry, which would be a lot less memory intensive) is that the address tokens can receive special sub-parsing. First, a search is performed for the complete string, say:

        first.last@host.domain.tld

If that fails, and the string is found to contain an @, then the string is chopped there and the search re-performed:

        host.domain.tld

then:

        domain.tld

and finally:

        tld

The only reason tld would match in the greenlist (or blacklist, depending upon how you're using it), is if you entered that into the list, which might be useful if you were blacklisting com.tw or somesuch.
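
The fallback order described above can be sketched in plain shell. This is only an illustration of the lookup sequence, not the MEGAGREP implementation: it uses grep -Fxq against a flat file where MEGAGREP uses an in-memory AVL tree, the list contents and the lookup function name are hypothetical, and stopping when a token has no @ is just one reading of the description.

```shell
#!/bin/sh
# Hypothetical list file: full addresses, domains, or bare suffixes, one per line.
listfile=$(mktemp)
printf '%s\n' 'first.last@host.domain.tld' 'example.org' 'com.tw' > "$listfile"

# lookup TOKEN: print the list entry that matched and return 0, else return 1.
lookup() {
    key=$1
    # 1. Try the complete string first.
    if grep -Fxq -e "$key" "$listfile"; then
        printf '%s\n' "$key"
        return 0
    fi
    # 2. If it contains an @, chop there and keep the host part.
    case $key in
        *@*) key=${key##*@} ;;
        *)   return 1 ;;
    esac
    # 3. Try host.domain.tld, then domain.tld, and finally tld.
    while [ -n "$key" ]; do
        if grep -Fxq -e "$key" "$listfile"; then
            printf '%s\n' "$key"
            return 0
        fi
        case $key in
            *.*) key=${key#*.} ;;   # drop the leftmost label
            *)   return 1 ;;        # nothing left to strip
        esac
    done
    return 1
}

lookup 'someone@mail.example.org'   # prints: example.org
lookup 'x@foo.com.tw'               # prints: com.tw
```

Because the matching entry is what gets printed, the caller can log it, much as FAILKEY is used above.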

Of course, you can redirect whatever you want at the megagrep operation - I happen to check the complete headers (minus a few) because any of several headers might contain reference to a blacklisted domain (such as a received: header), but when checking for greenlisting, I eliminate reference to Subject, To: and Cc: lines, where the spammer may have identified an address of someone who is in the greenlist.
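
To see what dropping those three headers leaves behind without having formail installed, a small awk filter can stand in for `formail -ISubject: -ITo: -ICc:`. This is a rough approximation for experimenting, not a replacement for formail: the strip_headers name and the sample header are made up, and the filter assumes its input is the header only.

```shell
#!/bin/sh
# strip_headers: copy a mail header from stdin to stdout, dropping the
# Subject:, To: and Cc: fields together with their folded continuation lines.
strip_headers() {
    awk '
        # A line starting with whitespace continues the previous field,
        # so it is kept or dropped along with that field.
        /^[ \t]/ { if (!skip) print; next }
        # Any other line starts a new field: decide whether to drop it.
        { skip = (tolower($0) ~ /^(subject|to|cc):/) }
        !skip { print }
    '
}

# Sample header (hypothetical): the To: line names a greenlisted address.
printf '%s\n' \
    'From: stranger@example.net' \
    'Subject: hello' \
    ' from a folded subject' \
    'To: friend@example.org' \
    'Received: from relay.example.net' \
    ' by mx.example.org' \
    'Cc: other@example.org' | strip_headers
```

Only the From: and Received: fields survive, so an address that a spammer placed in To: or Cc: never reaches the greenlist search.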

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
