
Re: Thoughts on speed-up?

1999-12-02 12:01:16
At 16:56 1999-12-02 +0200, Brock Rozen wrote:
> Hi,
>
> I have so many list filters that it can take up to a minute for an email
> to get processed by them all and finally hit my INBOX (on the assumption
> that it matches none of them).
>
> That's annoying.
>
> I realized that while I may have (for example) 5 filters, it might be of
> benefit to have one "pre-filter" that does the exact same thing all of
> them do. If it doesn't match that one "pre-filter" then it can move along,

Along the same lines, perhaps a "white list" is what you're looking for. I have some serious anti-spam stuff going down (let's say every message header is grepped against a ~4MB file of bad domains), as well as a slew of spam and twit rules.

What I do is have a "notwit.rc" wrapped around an INCLUDERC for twits.rc (and similarly, a nospam.rc). Those grep the From: header against a whitelist of addresses, either individuals or lists, which I deem safe. Note that I filter for twits BEFORE spam, so if there is an obnoxious fellow on a list, he'll be eliminated even if the list is on the spam-safe exclusion list, AND since twit filtering is considerably less costly than spam filtering, it is quicker to do it that way.
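The shape of that wrapper is roughly as follows. This is an untested sketch, not a paste from my actual rc files: the paths are made up, and the formail/fgrep pipeline is just one way to do the whitelist grep (procmail feeds the header to the condition program on stdin, so formail can pull the From: field out of it):

  # notwit.rc -- only run the costly twit rules when the sender
  # is NOT found in the whitelist file (one address per line).
  :0
  * ! ? formail -x From: | fgrep -i -q -f $HOME/.whitelist
  {
    INCLUDERC=$HOME/.procmail/twits.rc
  }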

The spam and twit filtering are also both enveloped in manually-specified lockfiles to keep multiple instances from running concurrently (the grep operation is a major memory pig).
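In procmail, a manually-specified lockfile is just an assignment: setting LOCKFILE grabs the lock (waiting if another procmail instance holds it), and clearing it releases the lock. Something like this, with made-up names:

  # Serialize the expensive grep work across concurrent deliveries.
  LOCKFILE=$HOME/.spamgrep.lock
  INCLUDERC=$HOME/.procmail/nospam.rc
  # Release the lock so the next procmail instance can proceed.
  LOCKFILE=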

I strive to order my filters so that the most frequently received messages are filtered earliest, while still balancing that against putting the quicker filters (even for less frequent messages) up front.

> So it would look like this:

[snip]

I guess this _could_ speed things up a fraction, but really only if you have several such blocks of rules. Basically, you'd be applying something like an n-ary search principle to your filters: if you had 16 filters and broke them down into four groups of four, your worst-case match (the very last rule) would take four 'envelopes' (these grouped blocks of rules) plus four individual tests. The best case would be one 'envelope' plus one specific test (the very first rule). So 2-8 tests, versus 1-16 linearly.

It's a bit more overhead for the first rule (actually, for all the rules matched in the first block), but by the second block you're ahead of the game: the first rule in the second block would take three matches to arrive at (two envelopes, plus that specific rule), whereas in a linear arrangement it would have taken five.
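A hypothetical shape for one such block, with made-up list names (the 'envelope' is one cheap alternation; the specific rules run only when it hits):

  :0
  * ^TO_(lista|listb|listc|listd)@example\.com
  {
    :0:
    * ^TO_lista@example\.com
    lists/lista

    :0:
    * ^TO_listb@example\.com
    lists/listb

    # ... and likewise for listc and listd
  }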

Phillip's use of $MATCH is of course going to be much faster (and, IMO, the ordering issue he points out shouldn't be much of a concern: if you're filtering for two lists or recipients, the To: is _probably_ the one you want to consider prime anyway).

Grouping filters into blocks AND using $MATCH could greatly speed things up, IF you're always searching on the same headers (at least within any one grouping) and the ordering can be maintained easily. The more complex your filters get, the less likely you can take advantage of these schemes.

BTW, in answer to your questions posted in a later article:

The \/ in the condition line marks the beginning of the part of the regexp match which will be stored in $MATCH. This is why his filter will run quicker: the header is processed only once for the blocked set of rules, that ONE time, instead of once for each ^TO_ rule.
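Schematically, the idiom looks like this (the folder layout and address pattern here are assumptions on my part, not Phillip's exact rule):

  # One pass over the header; whatever matches after \/ is
  # captured into $MATCH and used as the folder name.
  :0:
  * ^TO_\/[-a-z0-9_]+
  lists/$MATCH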

No, the word won't be matched in the body, for two reasons: the 'B' flag isn't present (so conditions are matched against the header only, which is the default), and the ^TO_ macro looks for common addressee headers (To:, Cc:, and the like) at the BEGINNING of a line (^). While such lines could appear in the body, there would be few instances (forwards of mailbox files, perhaps) that might match, and even in those cases you'd still need the 'B' flag for the match to happen.
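An artificial pair of recipes to make the distinction concrete:

  # Default: conditions match against the header only.
  :0:
  * ^TO_somelist@example\.com
  lists/somelist

  # Only with the B flag are conditions matched against the body,
  # where a forwarded message's To: line might then be seen.
  :0B:
  * ^TO_somelist@example\.com
  suspect-forwards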

---
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

 Sean B. Straw / Professional Software Engineering
 Post Box 2395 / San Rafael, CA  94912-2395
