Re: NFS screwup -> hung procmail

From: "Aaron D. Turner" <aturner(_at_)best(_dot_)com>
Subject: NFS screwup -> hung procmail

Hmm...  Got a kinda situation here:  We've got an NFS server that provides
the mail directories and inbox folders for our accounts on another server.
On the server where the accounts are, I run procmail, but over the
weekend the two servers had a lot of "issues".  Mail wasn't being
delivered properly after this.  About half an hour ago, mail was restored
and the queue started emptying, only now there are 47 procmail instances
of mine that seem "hung" and aren't delivering my mail.  I probably have
even more email that procmail hasn't even started processing.

I noticed that there were a bunch of lock files.  I figured they were stale
from earlier, so I deleted them.  That didn't seem to help.  What should I
do? Kill the processes? HUP them? It's pretty important that I don't lose
any email.


We've seen a similar situation, I believe.  Here's a simplified version
of what we have:

fileserver - runs NFS only, has home directories and mail spools
mailserver - runs sendmail with local delivery to NFS mounted spools
             via procmail
client     - runs sendmail in queue process mode only, user applications
             read the NFS mounted spool files

There seems to be a bug where locks between the client (running AIX)
and the fileserver (Auspex) get lost.  To completely clear the problem,
we have to:

0) figure out which user account is stuck (look for lots of procmail
   processes just hanging around)
1) kill sendmail, all of its children, and all procmail processes on
   mailserver
2) kill any MUAs on client
3) kill rpc.statd and rpc.lockd on client
4) cp the affected user's spool file to a new file, rm the old file,
   and mv the copy to the original name (the spool file now has a new
   inode number, so none of the lockd daemons can still believe that
   they hold a lock on it)
5) restart rpc.statd and rpc.lockd on client
6) restart sendmail on mailserver
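
Step 4 can be sketched in Python as follows.  This is only an
illustration of the cp/rm/mv sequence, not a tool from the original
setup: the function name and the ".reinode" suffix are made up here, and
it assumes the copy is made in the same directory (same filesystem) as
the spool file and that steps 1 through 3 have already stopped everything
that touches the file.

```python
import os
import shutil


def replace_with_new_inode(path):
    """Swap `path` for a byte-identical copy that lives on a new inode.

    Returns (old_inode, new_inode).  Intended to run only while nothing
    is delivering to or reading the file.
    """
    old_inode = os.stat(path).st_ino
    tmp = path + ".reinode"      # hypothetical temporary name
    shutil.copy2(path, tmp)      # cp: contents, mode bits and timestamps
    os.remove(path)              # rm: unlink the old name
    os.rename(tmp, path)         # mv: the copy takes over the original name
    return old_inode, os.stat(path).st_ino
```

Because the copy exists alongside the original before the original is
removed, the two inode numbers necessarily differ.  Note that
shutil.copy2 does not carry over the owner and group, so when this runs
as root the spool file's ownership would still need to be restored
afterwards.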

Fortunately, this doesn't happen too often, as it is a real pain.  Mail
delivery can stop if enough procmail processes accumulate; people get
upset, we do steps 0 through 6, the queues have to flush, and calm returns.
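
Spotting the pile-up (step 0) before delivery stops could be automated
along these lines.  This is a rough sketch, not something from the setup
described above: the function names and the threshold of 10 are
assumptions, and it relies only on the POSIX "ps -e -o user= -o comm="
output format of one "user command" pair per line.

```python
import os
import subprocess
from collections import Counter


def count_procmail_by_user(ps_output):
    """Count procmail processes per user in 'user command' lines."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue
        user, command = fields
        # Some ps implementations print a full path, others a bare name.
        if os.path.basename(command.strip()) == "procmail":
            counts[user] += 1
    return counts


def stuck_users(ps_output, threshold=10):
    """Return (user, count) pairs at or above `threshold`, largest first."""
    counts = count_procmail_by_user(ps_output)
    return [(u, n) for u, n in counts.most_common() if n >= threshold]


def current_ps_output():
    """Capture the live process list (not called automatically)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running stuck_users(current_ps_output()) from cron on the mailserver and
mailing any non-empty result to the admins would be one way to use it.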

I would like to have an alarm timer around the procmail locking calls
so that a lock attempt would only block for a limited time.  If the
lock attempt times out, the mail could be requeued and (even better) a
warning logged.  Does anyone know if something like this is possible
with the current code?
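
To make the idea concrete, here is a minimal Python sketch of an alarm
timer wrapped around a blocking fcntl-style lock request.  It is not
procmail's code and says nothing about what procmail currently supports;
the names LockTimeout and lock_with_timeout are invented for the example.
It also assumes the blocked lock request can be interrupted by a signal
at all, which on NFS may depend on how the filesystem is mounted.

```python
import fcntl
import signal


class LockTimeout(Exception):
    """Raised by the SIGALRM handler to break out of a blocked lock call."""


def _on_alarm(signum, frame):
    # Raising here stops Python from silently retrying the system call.
    raise LockTimeout()


def lock_with_timeout(fileobj, seconds):
    """Try to take an exclusive lock; give up after `seconds` seconds.

    Returns True if the lock was obtained, False if the timer expired.
    Must be called from the main thread (a requirement of signal.signal).
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)                    # start the timer
    try:
        fcntl.lockf(fileobj, fcntl.LOCK_EX)  # blocks until granted
        return True
    except LockTimeout:
        return False
    finally:
        signal.alarm(0)                      # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
```

A delivery program built this way could log a warning when the function
returns False and exit with EX_TEMPFAIL (75), which sendmail treats as a
temporary failure and answers by keeping the message in the queue for a
later attempt.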

-- 
Keith Pyle
Systems/Network Engineering
Motorola Somerset PowerPC Design Center
keith(_at_)ibmoto(_dot_)com