procmail

Re: mail corruption with dotlock/nfs

2008-03-02 10:32:19
On Sat, Mar 1, 2008 at 12:04 PM, Fletcher Mattox 
<fletcher(_at_)cs(_dot_)utexas(_dot_)edu> wrote:
 Looking at the code, I notice procmail does this before
 writing to the mailbox:

        open(a,O_WRONLY|O_APPEND|O_CREAT,NORMperm);
        ...
        lasttell=lseek(s,(off_t)0,SEEK_END);

 That is, it opens the file O_APPEND *and* seeks to eof.  Isn't the
 seek redundant?  Perhaps it is done only to compute "lasttell",
 and the seek is considered just a harmless side effect.  But what
 if, for some reason, it increases the file pointer too far?

Even if it did increase the pointer too far, that would still be a bug
somewhere other than in procmail.

       O_APPEND
              The file is opened in append mode. Before each write, the  file
              pointer is positioned at the end of the file, as if with lseek.

So if an explicit lseek were going to cause a problem, the implicit
one would too.  On my RHEL4 box the doc goes on:

              O_APPEND may lead to corrupted files on  NFS  file  systems  if
              more  than one process appends data to a file at once.  This is
              because NFS does not support appending to a file, so the client
              kernel  has  to simulate it, which can't be done without a race
              condition.

That explains your procmail log corruption.
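For what it's worth, here is a small sketch of the O_APPEND behavior quoted above, meant for a local file system. It is not procmail code: the function name append_after_rewind and the 0600 mode are made up for illustration. It opens the file the way procmail does, takes the same lseek to get "lasttell", then deliberately rewinds the offset to 0 before writing. With O_APPEND honored, the data should still land at the end of the file, which is why an lseek that left the offset in the wrong place should not matter by itself.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of procmail.
 * Appends msg to path and returns the resulting file size, or -1 if
 * anything failed or the data did not land at the old end of file. */
off_t append_after_rewind(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0600);
    if (fd < 0)
        return (off_t)-1;

    /* Same kind of call as in the quoted code: remember the old EOF. */
    off_t lasttell = lseek(fd, (off_t)0, SEEK_END);
    size_t len = strlen(msg);
    struct stat st;
    off_t result = (off_t)-1;

    if (lasttell >= 0 &&
        /* Put the offset somewhere "wrong" on purpose. */
        lseek(fd, (off_t)0, SEEK_SET) == 0 &&
        /* O_APPEND repositions to EOF before this write. */
        write(fd, msg, len) == (ssize_t)len &&
        fstat(fd, &st) == 0 &&
        /* The file grew by exactly len bytes past the old EOF. */
        st.st_size == lasttell + (off_t)len)
        result = st.st_size;

    close(fd);
    return result;
}
```

Calling it twice on a fresh local file should grow the file both times instead of overwriting the first message. On NFS the same calls are subject to the race the man page describes, so a check like this only says something about the local case.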

What they don't say is that a truncate and an append happening at the
same time cause similar problems, and with async mounts you never know
when "the same time" is.

 > A change in the operating system on either the client or the server,
 > in the way the NFS filesystem is mounted, or even in the email readers
 > (or pop/imap server, etc.) that people are using, could cause this to
 > begin happening even though you've never seen it before.

 I agree, these explanations seem far more likely, but I am at the end of
 my rope trying to find it on the NFS client (and our NetApp admin assures
 me nothing has changed on the server). [...] So it seems *something* has
 changed in the way Linux and NetApp do NFS, but what?

Other things that might affect this are the size and/or layout of the
data on the disks, or the number of processes accessing the files.  If
for any reason -- more network traffic, even -- the server is taking
longer to process a given write than it was before, you could start
losing a race that previously was always being won.

 filer4b:/vol/vol18/v18q001 on /v/filer4b/v18q001 type nfs \
   (rw,sync,tcp,rsize=8192,wsize=8192,intr,grpid,quota,retry=2,noac,addr=x.x.x.x)

 Can you see anything else I can do to improve this?

I don't, but my days of involvement with NFS administration for an
installation of any significant size are some years behind me.

 Oh.  One more question.  Do you see any significance in the procmail
 logfiles being corrupted in exactly the same way?  (Described in my
 original mail.)  This seems important to me, since procmail does not
 seem to use dotlocking for the logfile.  i.e. maybe file locking is
 not involved at all?

Well, file locking is meant to enforce that only one process appends
to the file at a time, so with no locking it's not surprising at all
that the logfile sometimes gets corrupted.  Even on local disks you
can get interleaved logs.

The dotlocking scheme is supposed to enforce this order of events:

Process X creates the lock file
Server creates lock file inode
Process Y encounters the lock file and waits for it to be removed
X opens the mailbox
X writes the message to the file
X closes the file
Server flushes the write to disk
X removes the lock file
Server removes the lock file inode
Y creates the lock
(repeat open/write/close/flush/unlock)

The problem is that with an async mount, the server is allowed to
delay creating the lock file inode or to change the order of "flushes
the write" and "removes the inode", either of which can cause Y to
open and write the file before the server flushes X's changes, and
then all bets are off.  This is especially problematic if X and Y are
on different NFS clients, where the state of the file may not even
appear the same when they begin the operation.
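The intended order of events listed above can be sketched roughly as follows. This is a simplified illustration and not procmail's actual locking code: the names dotlock_try, dotlock_acquire and deliver_locked, the one-second retry interval, and the file modes are all assumptions of mine. It relies on open() with O_CREAT|O_EXCL to create the lock atomically, which is fine on a local disk but is exactly the step that depends on the NFS client and server agreeing about when the inode exists.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: one attempt to create the lock file.
 * Returns 0 if we now hold the lock, -1 otherwise (errno is EEXIST
 * when another process already holds it). */
int dotlock_try(const char *lockpath)
{
    int fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0444);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

/* "Y encounters the lock file and waits for it to be removed":
 * retry up to max_tries times, sleeping one second between attempts. */
int dotlock_acquire(const char *lockpath, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (dotlock_try(lockpath) == 0)
            return 0;
        if (errno != EEXIST)
            return -1;              /* real error, not just "held" */
        if (i + 1 < max_tries)
            sleep(1);
    }
    return -1;
}

/* lock, open, write, flush, close, unlock. Returns 0 on success. */
int deliver_locked(const char *mbox, const char *lockpath,
                   const char *msg, int max_tries)
{
    if (dotlock_acquire(lockpath, max_tries) != 0)
        return -1;

    int rc = -1;
    int fd = open(mbox, O_WRONLY | O_APPEND | O_CREAT, 0600);
    if (fd >= 0) {
        size_t len = strlen(msg);
        /* fsync asks for the data to be pushed out before we unlock;
         * what an NFS server does with it depends on how it is exported. */
        if (write(fd, msg, len) == (ssize_t)len && fsync(fd) == 0)
            rc = 0;
        if (close(fd) != 0)
            rc = -1;
    }

    unlink(lockpath);               /* "X removes the lock file" */
    return rc;
}
```

The point of the sketch is only the ordering: the unlink comes after the write, the fsync and the close. If the server is free to reorder or delay any of those relative to the lock file operations, the code can be correct on the client and the mailbox can still end up corrupted.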
____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail