Re: stumped by egrep -f for whitelist

2002-06-20 17:14:30
At 00:26 2002-06-21 +0200, Eric Smith did say:
I am using egrep -if to test for the presence of email addresses
in a whitelist.  Once I get to about 200 entries in my whitelist,
then the process time is about 4 seconds but doubles with
250 entries.

It's a known issue that grep pigs out on memory when given a large pattern file. You should see what grep will do to a system when your -f expression list file is 3+ MB in size and you're also using -w. Now, contemplate doing THAT on several thousand messages a day.

Whitelist could easily grow to thousands of entries.

My blacklist of domains (separate from the dnsbl I maintain, which is handled at the MTA) is 202K+ lines. It takes my megagrep tool about 3 seconds to scan the whole list (which needs to be loaded into memory and parsed into a tree) against a message of just about any size. Megagrep is optimized for the way I parse headers: I match word tokens.
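
As a rough illustration of word-token matching against a list held in memory, here is a minimal Python sketch. It is not megagrep: it swaps the tree for a plain Python set, and the tokenizer pattern and the function names (load_domains, matching_tokens) are assumptions made up for this example.

```python
import re

# Hypothetical tokenizer: runs of letters, digits, dots and hyphens.
# This is an assumption, not the tokenization megagrep uses.
TOKEN_RE = re.compile(r"[A-Za-z0-9.-]+")


def load_domains(lines):
    """Build an in-memory lookup set from one domain per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def matching_tokens(text, domains):
    """Return the tokens in text that appear verbatim in the domain set."""
    hits = []
    for raw in TOKEN_RE.findall(text):
        # Normalize case and drop stray leading/trailing punctuation.
        token = raw.lower().strip(".-")
        if token in domains:
            hits.append(token)
    return hits
```

Each token costs one hash lookup, so the time per message depends on the size of the message rather than on the length of the list. Note that this matches whole tokens only: a listed spam.example will not catch mail.spam.example unless the tool also tries the parent domains.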

I archived the source code someplace else (only the compiled binary is on my servers right now), but it's a compiled C++ tool based on some other material of mine.

Someone suggested look, which does binary searches. This looks
like a good solution, but it does not support regexes.

My AVL tree code doesn't do regexps (well, I could link in a regexp library and diddle with that, but then the AVL tree wouldn't gain me much).

What is the best solution here?

One approach might be to maintain two separate greenlist files: one of plain non-regexp strings, and a (presumably much smaller) one of regexps. Don't hand grep anything more than you want it to be parsing.
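
The two-list idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names (load_greenlist, is_greenlisted) are hypothetical, plain entries are taken to be complete addresses compared case-insensitively, and the regexp entries are assumed to be Python-compatible patterns.

```python
import re


def load_greenlist(plain_lines, regex_lines):
    """Split the greenlist into a set of literal addresses and a
    (presumably short) list of compiled regexps."""
    plain = {line.strip().lower() for line in plain_lines if line.strip()}
    patterns = [
        re.compile(line.strip(), re.IGNORECASE)
        for line in regex_lines
        if line.strip()
    ]
    return plain, patterns


def is_greenlisted(address, plain, patterns):
    """Try the cheap exact lookup first, then fall back to the regexps."""
    addr = address.strip().lower()
    if addr in plain:
        return True
    return any(p.search(addr) for p in patterns)
```

The literal entries never reach the regexp engine, so only the short regexp list is scanned linearly. When staying with grep, its -F option (fixed strings) does a similar job for the plain file.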

The alternative approach is to write a specialized tool that does little more than EXACTLY what you need it to do, and does that extremely efficiently.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
