Re: Not scaling well

2003-09-15 10:26:41
On 15 Sep, Enzo wrote:
| Wondering if anyone sees any inefficiencies in this beast.  I've trolled 
| the FAQ's and message boards, but just seem to break it whenever I try 
| to improve it.  It worked fine for 250 users for a number of years, but 
| now we're pushing a 1000 and, as we all know, the level of spam has 
| skyrocketed.   It's running globally out of the main .procmailrc file on 
| a QMail-LDAP server on FreeBSD 4.6.  Watching top, it's just filled with 
| 'procmail' and 'egrep' processes chewing up all available CPU.  
| 
| The user defined white and blacklists contain various possible entries 
| including 'spammer(_at_)test(_dot_)com', 'test.com', etc., etc., all the way
| down to single word matching like 'spammer' which would catch it anywhere
| in the
| address.  A few users have lists with several hundred entries.
| 
| We originally found this script on a procmail FAQ/examples page 
| somewhere, and it's worked great, but now we're hoping we can avoid 
| throwing faster hardware at the problem.  Searching around I know there 
| has to be a better way, but can't seem to come up with the magic 
| incantation. Hoping someone out there might see something in there and 
| be able to lend a hand.
| 
| Thanks in advance!
| 
| 
| > # Test if the email's sender is in user defined whitelist, if so
| > # deliver it.
| > :0
| > * ? formail -x"From" -x"From:" -x"Sender:" -x"Reply-To:" \
| >   -x"Return-Path:" -x"To:" | egrep -is -f \
| >   /usr/local/apache/htdocs/secure/usermaint/nobounce/${USER}
| > ${MY_MAILDIR}
| >
| > # Test if the email's sender is in user defined blacklist
| > # if so, send it back to sender w/ bogus user unknown
| > # mark with "Recipient Refusal" so it can be traced back
| >
| > # Define getting the sender's address, discard any leading and trailing
| > # whitespace
| > FROM_=`formail -rt -xTo: \
| >   | expand | sed -e 's/^[ ]*//g' -e 's/[ ]*$//g'`
| >  
| > #Return certain blacklisted email
| > :0
| > * ? formail -x"From" -x"From:" -x"Sender:" -x"Reply-To:" \
| >   -x"Return-Path:" -x"To:" | egrep -is -f \
| >   /usr/local/apache/htdocs/secure/usermaint/blacklist/${USER}
| > # Avoid forgeries that pretend to be from my own site
| > * ! $ ? echo ${FROM_} | fgrep -is 'boothcreek.com'
| > * $ ? echo ${FROM_} | fgrep -is '.'
| > * $ ? echo ${FROM_} | fgrep -is '@'
| > # Avoid email loops
| > * ! ^X-Loop: postmaster(_at_)mydomain\(_dot_)com
| > {
| >   # Make a temporary file of the message to be returned
| >   :0c:formail.lock
| >   # Discard whitespaces, insert a leading blank
| >   | expand | sed -e 's/[ ]*$//g' | sed -e 's/^/ /' > return.tmp
| >   # Prepare and send the rejection
| >   :0:formail.lock
| >   | (formail -r -I"Subject: Rejected mail: Recipient refusal" \
| >     -I"From: ${ALTFROM}" \
| >     -I"Return-Path: noreply(_at_)mydomain(_dot_)com" \
| >     -A"X-Loop: postmaster(_at_)mydomain(_dot_)com" ; \
| >     echo "" ; \
| >     echo "    This user has chosen not to receive emails from this address." ; \
| >     echo "    Please contact them in a different manner (#5.1.1)" ; \
| >     echo "  " ; \
| >     echo "--- below is a copy of the rejected mail ---" ; \
| >     echo " " ; \
| >     cat return.tmp ; \
| >     echo "--- end rejected mail ---" ; \
| >     rm -f return.tmp) \
| >     | /usr/sbin/sendmail -t
| > } 
| 
| 

There are at least two obvious bottlenecks above.

1. You assign to a variable FROM_ then echo it and pipe it to fgrep. 
These are unnecessary processes that procmail can handle itself.  It's
probably not the cause of the slowdown, but it doesn't help.  Rewrite
the 3 conditions as:

* ! FROM_ ?? boothcreek\.com
*   FROM_ ?? ()\.
*   FROM_ ?? @

2. Sending the notices has to slow things down considerably.  On top
of that, the lock file (formail.lock) throttles procmail processes so
that only one can execute that code at a time.  Any others running (i.e.
when multiple messages arrive in short order) are put in a holding
pattern until they can get the lock.  My suggestion is to skip the
notification and deliver the message to /dev/null.  Some will argue
against that, but since you're not saving a copy anyway, nothing is
lost.  Whether that's advisable is a separate debate.  Notices get to
a person who cares about them probably less than 1% of the time.  The
rest are either undeliverable or go to people whose addresses have
been forged and who don't need the added insult of a deluge of such
notices.  It's almost always the wrong thing to auto-ack spam, especially
from procmail.  If you can't do it from the MTA, it's too late.  If you
insist on sending these notices, which IMO calls into question your
competence as a mail admin, use a unique file name for the tmp files
(e.g. return$$.tmp) and omit the lock file.
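
For anyone who does keep the notices, here is a minimal shell sketch of
the "unique temp file, no lock" idea.  It swaps in mktemp(1) instead of
a $$-based name, which is an assumption and not something the quoted
recipe uses.  The function name indent_copy and the return.XXXXXX
template are made up for illustration.

```shell
# indent_copy: read a message on stdin, print it framed and indented.
# Hypothetical helper, not part of the original recipe.
indent_copy() {
    # mktemp creates a file with a unique name, so concurrent runs
    # never share a file and no lock file is needed.
    tmp=$(mktemp "${TMPDIR:-/tmp}/return.XXXXXX") || return 1
    # Same cleanup as the quoted recipe: strip trailing spaces,
    # then indent every line by one space.
    expand | sed -e 's/[ ]*$//' -e 's/^/ /' > "$tmp"
    echo "--- below is a copy of the rejected mail ---"
    cat "$tmp"
    echo "--- end rejected mail ---"
    rm -f "$tmp"
}
```

The temp file is kept only to mirror the shape of the quoted recipe,
where the copy is written in one step and read back in another.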

If none of these things help, you might consider breaking the external
files that grep uses into smaller pieces.  I don't have personal
experience with it, but recall reading that grep -f can be a bottleneck
when the external file gets large.  (Sean Straw may have some more
meaningful input here.)  I use a directory for each user's whitelists.
They are encouraged to break them into smaller pieces and to consider
the individual file names so that more frequent matches are in files
likely to be processed earlier.  It has the beneficial side effect of
providing more fine-grained information as to why a message was
whitelisted.  For example it might be "approved deh news" or "approved
deh Friend" or "approved deh wcpss", etc.
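
A rough shell sketch of the directory-of-lists idea follows.  It is not
the setup described above, whose recipes aren't shown, just one way the
lookup could work: try each list file in glob (sorted) order, stop at
the first one that matches, and report its name.  The function name
first_matching_list and the file names in the usage note are
hypothetical.

```shell
# first_matching_list DIR
# Reads header text on stdin.  Prints the name of the first list file
# in DIR (in sorted glob order) that contains a matching pattern and
# returns 0; returns 1 if no file matches.
first_matching_list() {
    dir=$1
    headers=$(cat)                 # keep stdin so it can be tested repeatedly
    for list in "$dir"/*; do
        [ -f "$list" ] || continue # skip subdirectories and an unmatched glob
        # grep -E -i -q -f: extended regexps, ignore case, quiet,
        # patterns read from the list file (same idea as egrep -is -f)
        if printf '%s\n' "$headers" | grep -E -i -q -f "$list"; then
            basename "$list"
            return 0
        fi
    done
    return 1
}
```

With files named so that the busiest lists sort first (say 10-news,
20-friend), something like
`formail -x"From:" -x"Sender:" | first_matching_list /path/to/lists`
would print the name of the list that approved the message.  Note that
a blank line in any list file matches everything.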


-- 
Email address in From: header is valid  * but only for a couple of days *
This is my reluctant response to spammers' unrelenting address harvesting


