
Re: Thoughts on speed-up?

1999-12-02 12:01:16
At 16:56 1999-12-02 +0200, Brock Rozen wrote:
> Hi,
>
> I have so many list filters that it can take up to a minute for an email
> to get processed by them all and finally hit my INBOX (on the assumption
> that it matches none of them).
>
> That's annoying.
>
> I realized that while I may have (for example) 5 filters, it might be of
> benefit to have one "pre-filter" that does the exact same thing all of
> them do. If it doesn't match that one "pre-filter" then it can move along,

Along the same lines, perhaps a "white list" is what you're looking for. I have some serious anti-spam stuff going down (let's say every message header is grepped against a ~4MB file of bad domains), as well as a slew of spam and twit rules.

What I do is have a "notwit.rc" wrapped around an INCLUDERC for twits.rc (and similarly, a nospam.rc). Those grep the From: header against a whitelist of addresses, either individuals or lists, which I deem safe. Note that I filter for twits BEFORE spam, so if there is an obnoxious fellow on a list, he'll be eliminated even if the list is on the spam-safe exclusion list, AND since twit filtering is considerably less costly than spam filtering, it is quicker to do it that way.
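The shape of that wrapper is roughly as follows. This is an untested sketch, not a paste from my actual rc files: the paths are made up, and the formail/fgrep pipeline is just one way to do the whitelist grep (procmail feeds the header to the condition program on stdin, so formail can pull the From: field out of it):

  # notwit.rc -- only run the costly twit rules when the sender
  # is NOT found in the whitelist file (one address per line).
  :0
  * ! ? formail -x From: | fgrep -i -q -f $HOME/.whitelist
  {
    INCLUDERC=$HOME/.procmail/twits.rc
  }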

The spam and twit filtering are also both enveloped in manually-specified lockfiles to keep multiple instances from running concurrently (the grep operation is a major memory pig).
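In procmail, a manually-specified lockfile is just an assignment: setting LOCKFILE grabs the lock (waiting if another procmail instance holds it), and clearing it releases the lock. Something like this, with made-up names:

  # Serialize the expensive grep work across concurrent deliveries.
  LOCKFILE=$HOME/.spamgrep.lock
  INCLUDERC=$HOME/.procmail/nospam.rc
  # Release the lock so the next procmail instance can proceed.
  LOCKFILE=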

I strive to order my filters so that the most frequently received messages are filtered earliest, while still balancing that against putting the quicker filters (even for less frequent messages) up front.

> So it would look like this:

[snip]

I guess this _could_ speed things up a fraction, but really only if you have several such blocks of rules. Basically, you'd be applying something like an n-ary search principle to your filters: if you had 16 filters and broke them down into four groups of four, your worst-case match (the very last rule) would take four 'envelopes' (these grouped blocks of rules) plus four individual tests. The best case would be one 'envelope' plus one specific test (the very first rule). So 2-8 tests, versus 1-16 linearly.

It's a bit more overhead for the first rule (actually, for all the rules matched in the first block), but by the second block you're ahead of the game: the first rule in the second block would take three matches to arrive at (two envelopes, plus that specific rule), whereas in a linear arrangement it would have taken five.
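A hypothetical shape for one such block, with made-up list names (the 'envelope' is one cheap alternation; the specific rules run only when it hits):

  :0
  * ^TO_(lista|listb|listc|listd)@example\.com
  {
    :0:
    * ^TO_lista@example\.com
    lists/lista

    :0:
    * ^TO_listb@example\.com
    lists/listb

    # ... and likewise for listc and listd
  }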

Phillip's use of $MATCH is of course going to be much faster (and, IMO, the ordering issue he points out shouldn't be much of a concern: if you're filtering for two lists or recipients, the To: is _probably_ the one you want to consider prime anyway).

Grouping filters into blocks AND using $MATCH could greatly speed things up, IF you're always searching on the same headers (at least within any one grouping) and the ordering can be maintained easily. The more complex your filters get, the less likely you can take advantage of these schemes.

BTW, in answer to your questions posted in a later article:

The \/ in the condition line marks the beginning of the part of the regexp match which will be stored in $MATCH. This is why his filter will run quicker: the header is processed only once for the blocked set of rules, that ONE time, instead of once for each ^TO_ rule.
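Schematically, the idiom looks like this (the folder layout and address pattern here are assumptions on my part, not Phillip's exact rule):

  # One pass over the header; whatever matches after \/ is
  # captured into $MATCH and used as the folder name.
  :0:
  * ^TO_\/[-a-z0-9_]+
  lists/$MATCH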

No, the word won't be matched in the body, for two reasons: the 'B' flag isn't present (so conditions are matched against the header only, which is the default), and the ^TO_ macro looks for common addressee headers (To:, Cc:, and the like) at the BEGINNING of a line (^). While such lines could appear in the body, there would be few instances (forwards of mailbox files, perhaps) that might match, and even in those cases you'd still need the 'B' flag for the match to happen.
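An artificial pair of recipes to make the distinction concrete:

  # Default: conditions match against the header only.
  :0:
  * ^TO_somelist@example\.com
  lists/somelist

  # Only with the B flag are conditions matched against the body,
  # where a forwarded message's To: line might then be seen.
  :0B:
  * ^TO_somelist@example\.com
  suspect-forwards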

---
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

 Sean B. Straw / Professional Software Engineering
 Post Box 2395 / San Rafael, CA  94912-2395
