Re: mail corruption with dotlock/nfs

Bart Schaefer writes:

On Fri, Feb 29, 2008 at 2:44 PM, Fletcher Mattox 
<fletcher(_at_)cs(_dot_)utexas(_dot_)edu> wrote:

 The corruption seems to always be of the form:

        \000 ... \000From user(_at_)dom(_dot_)ain date

 where \000 is the null byte.  That is, we are seeing a series of nulls
 (30 to 3000+) prepended to the message (or perhaps appended from the
 previous message).
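A minimal sketch of how one might scan mailbox data for this pattern, assuming the whole file has already been read into memory. The function name count_nul_runs_before_from is made up for illustration and is not part of procmail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of procmail): count how many runs of
 * NUL bytes in buf are immediately followed by the mbox separator
 * "From ".  Each such run matches the corruption pattern described
 * above.
 */
size_t count_nul_runs_before_from(const char *buf, size_t len)
{
    size_t i = 0, hits = 0;

    assert(buf != NULL);
    while (i < len) {
        if (buf[i] != '\0') {
            i++;
            continue;
        }
        /* Skip over the whole run of NUL bytes. */
        while (i < len && buf[i] == '\0')
            i++;
        /* i now indexes the first byte after the run (or equals len). */
        if (len - i >= 5 && memcmp(buf + i, "From ", 5) == 0)
            hits++;
    }
    return hits;
}
```

A count above zero for a mailbox (or a procmail logfile, with a different marker string) would point at the affected messages without reading the file by eye.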

 We deliver to the user's home directory mounted via NFS from a Network
 Appliance file server.


This is probably related to some sort of caching issue.  It's most
likely to occur when a mail reader truncates the file and releases the
lock, and then procmail grabs the lock, seeks to the end, and begins
writing.  If the NFS server hasn't completed processing the file
truncation when the write begins, everything between the truncation
point and the seek point will be filled with NUL bytes.
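The mechanism can be imitated on a local filesystem. This is only a sketch: it does not reproduce NFS caching, it merely shows what happens when a writer uses an end-of-file offset that became stale after a truncation. The function name and the sizes (100 bytes written, truncated to 10) are arbitrary choices for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: write at an end-of-file offset that was recorded
 * before the file was truncated, then count the NUL bytes that end up
 * in the file.  Returns the count, or -1 on error.
 */
long nul_gap_after_stale_seek(const char *path)
{
    char buf[256];
    const char msg[] = "From user@dom.ain date\n";
    off_t stale_end;
    ssize_t n, i;
    long nuls = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* The "mailbox" starts out 100 bytes long. */
    memset(buf, 'x', 100);
    if (write(fd, buf, 100) != 100)
        goto fail;
    stale_end = lseek(fd, (off_t)0, SEEK_END);   /* the writer's view: 100 */

    /* The "mail reader" truncates the mailbox to 10 bytes. */
    if (ftruncate(fd, (off_t)10) != 0)
        goto fail;

    /* The writer still believes end-of-file is at offset 100. */
    if (lseek(fd, stale_end, SEEK_SET) < 0)
        goto fail;
    if (write(fd, msg, sizeof msg - 1) != (ssize_t)(sizeof msg - 1))
        goto fail;

    /* Read the file back and count NUL bytes. */
    if (lseek(fd, (off_t)0, SEEK_SET) < 0)
        goto fail;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (i = 0; i < n; i++)
            if (buf[i] == '\0')
                nuls++;
    if (n < 0)
        goto fail;

    close(fd);
    unlink(path);
    return nuls;

fail:
    close(fd);
    unlink(path);
    return -1;
}
```

On a local POSIX filesystem the bytes between the truncation point (10) and the stale offset (100) read back as NULs, followed directly by the "From " line, which is the same shape as the reported corruption.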


Understood.  And you are probably right.  But I wonder about another
possibility.  Looking at the code, I notice procmail does this before
writing to the mailbox:

        open(a,O_WRONLY|O_APPEND|O_CREAT,NORMperm);
        ...
        lasttell=lseek(s,(off_t)0,SEEK_END);

That is, it opens the file O_APPEND *and* seeks to eof.  Isn't the
seek redundant?  Perhaps it is done only to compute "lasttell",
and the seek is considered just a harmless side effect.  But what
if, for some reason, it increases the file pointer too far?
Admittedly, I do not know how this could happen, but wouldn't that
result in *exactly* the type of corruption I am seeing?
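For what it is worth, on a local POSIX filesystem the seek position should not matter for an O_APPEND descriptor, because every write is placed at the current end of file. The sketch below checks that by seeking far past the end before writing. The function name and the offset 1000 are made up for illustration. NFS is a different matter: the Linux open(2) manual page notes that NFS has no native append and the client kernel has to simulate it, so a local result does not prove anything about the NFS case.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: create a 4-byte file, reopen it with O_APPEND,
 * seek far past end-of-file, write one byte, and return the resulting
 * file size (or -1 on error).
 */
long size_after_append_past_eof(const char *path)
{
    off_t size;
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "AAAA", 4) != 4) {
        close(fd);
        unlink(path);
        return -1;
    }
    close(fd);

    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) {
        unlink(path);
        return -1;
    }

    /* Deliberately move the file pointer well beyond end-of-file. */
    if (lseek(fd, (off_t)1000, SEEK_SET) != (off_t)1000 ||
        write(fd, "B", 1) != 1) {
        close(fd);
        unlink(path);
        return -1;
    }

    size = lseek(fd, (off_t)0, SEEK_END);
    close(fd);
    unlink(path);
    return (long)size;
}
```

If the local filesystem honors O_APPEND, the size comes back as 5 rather than 1001, i.e. no gap is created. A gap would therefore suggest that the client's idea of end-of-file was wrong at write time, rather than that the extra lseek moved the pointer too far.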

A change in the operating system on either the client or the server,
in the way the NFS filesystem is mounted, or even in the email readers
(or pop/imap server, etc.) that people are using, could cause this to
begin happening even though you've never seen it before.


I agree, these explanations seem far more likely, but I am at the end of
my rope trying to find it on the NFS client (and our NetApp admin assures
me nothing has changed on the server).  I have done a full restore to the
day before the corruption was first reported, and the problem persists!
I do not know what else to do.

Another factor which seems to support your theory is that the corruption
happens only on our Linux mail servers.  Solaris remains unaffected
(this is the only way we have been able to survive two weeks without
a user mutiny).  So it seems *something* has changed in the way Linux and
NetApp do NFS, but what?

The fix is to be sure that the NFS filesystem is mounted noasync and
with attribute caching etc. turned off.  You may also need to tune the
read or write block sizes.  Depending on the version of NFS, the mount
may also need to be "hard" rather than "soft".


Hmm.  Thanks for saying that.  I was mildly surprised that Linux
NFS defaults to async and ac.  (I always thought async was especially
dangerous, but most of my NFS experience is limited to Solaris.)  At least
when I add "sync,noac" to the mount options, they seem to stick, and
they were not present before, so now the complete set reads:

filer4b:/vol/vol18/v18q001 on /v/filer4b/v18q001 type nfs \
(rw,sync,tcp,rsize=8192,wsize=8192,intr,grpid,quota,retry=2,noac,addr=x.x.x.x)

Can you see anything else I can do to improve this?  I will test it
for mail corruption next week.  Thanks for the suggestion!

Oh.  One more question.  Do you see any significance in the procmail
logfiles being corrupted in exactly the same way?  (Described in my
original mail.)  This seems important to me, since procmail does not
seem to use dotlocking for the logfile.  i.e. maybe file locking is
not involved at all?

Fletcher
____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail