procmail

Re: mail corruption with dotlock/nfs

2008-03-09 17:15:47
Ariel Biener writes:
     We've been using postfix+procmail with NFS on netapp for
many years, on what I believe to be a largish setup
for an NFS environment, about 60,000 users. We never
saw mail corruption except in one case, that is, when
one of the mounts was using UDP and the filer hit
100% cpu for more than a few minutes. 
    I never used the `sync' mount option, due to its
enormous hit on performance. Also, I believe that the examples
and data quoted about "data corruption when using async"
were related to exporting filesystems over NFS from a Linux
NFS server(!) with the sync/async option enabled in the
export.

Hi Ariel,

I am not sure what "examples and data" you are referring to, but I hope
you are not thinking of anything I wrote.  The only corruption, other
than our own, which I mentioned was described in NetApp documentation.
I believe it is fairly safe to assume they are not talking about a Linux
NFS server.  :)  That's not really an example or data anyway, but rather
a prediction about their own NFS servers.  To paraphrase, the NetApp
document went something like this:

    NetApp does not support asynchronous NFS writes.
    If an NFS client requests an async mount, the
    mount request may or may not succeed.  If the mount
    succeeds, the filer will perform synchronous writes.
    Such a mount can lead to data corruption.

I consider this a dangerous policy by NetApp.  Why not simply refuse
all async mount requests?

Are you using NFS over UDP ?  Are you using soft mounts ?

I assume you meant to write "TCP", not "NFS".  Yes, we use TCP hard
mounts.  (I would never consider using a soft mount for anything I
cared about.)

Our setup:

2 postfix mail servers
2 imap servers (used to be UW, now using dovecot)
2 pop servers (using UW ipop3d).

All 6 servers mount the same /var/spool/mail, and
the same home directory trees, all via NFS from the
same filer (FAS3050, not clustered). All servers are
active (load balanced via an external load balancer),
so a mailbox may be open from at least 3 locations
(2 mails delivered to the mailbox at the same time
by mailserver1 or 2, and also an active pop or imap
session for the user).

We use NFSv3 only, and on the netapp we're not allowing
mounts with rsize/wsize greater than 8k (due to switches
having buffering problems with larger packets).

We have similar environments.  Yours is larger, but it is qualitatively
similar.  We use sendmail, procmail, UW imap/pop, and a FAS3050 to serve about
1500 users from 600+ ubuntu clients with 200+ automounted file systems.
One difference is that we put the inbox in home directories rather
than /var/spool/mail.  This adds one more point of possible concurrency,
i.e. users who prefer mail readers which access their inbox directly.  (We
have shown, however, that this is not necessary for corruption.  In fact
we can get corruption with two procmail processes on different hosts;
no imap/pop is needed.)
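
For readers unfamiliar with dotlocking, here is a minimal Python sketch of
the classic NFS-safe way to create a "mailbox.lock" file: create a uniquely
named file, hard-link it to the lock name, and trust the link count rather
than the return value of link().  This is an illustration only and is not
procmail's actual code; the function names and the ".lk.host.pid" naming
scheme are hypothetical.

```python
import os
import socket


def acquire_dotlock(mailbox_path):
    """Try once to create mailbox_path + '.lock'; return True on success."""
    lock_path = mailbox_path + ".lock"
    directory = os.path.dirname(lock_path) or "."
    # Hypothetical naming scheme: unique per host and process, so two
    # clients never fight over the temporary file itself.
    unique = os.path.join(
        directory, ".lk.%s.%d" % (socket.gethostname(), os.getpid())
    )
    fd = os.open(unique, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    os.close(fd)
    try:
        try:
            os.link(unique, lock_path)
        except OSError:
            # Over NFS the reply can be lost or the lock may already
            # exist; either way the link count below decides.
            pass
        # Two links means our file is now also known as lock_path.
        return os.stat(unique).st_nlink == 2
    finally:
        os.unlink(unique)


def release_dotlock(mailbox_path):
    """Remove the lock file created by acquire_dotlock."""
    os.unlink(mailbox_path + ".lock")
```

A real delivery agent would retry with a delay and expire stale locks; the
sketch shows only the single atomic step on which the scheme depends.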

Our mount options:

rw,rsize=8192,wsize=8192,lock,tcp,nfsvers=3,hard,intr,bg,nosuid,nodev

Again, we are similar.  From /proc/mounts:

   rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=tcp,
   timeo=600,retrans=2,sec=sys,addr=udb 0 0

Question: why doesn't linux explicitly list "async" here (and in
/etc/mtab)?  It took me a long time to learn this really was an async
mount.  I come from a long history of NFSv2 experience on Solaris, where
we would never dream of using an async mount for anything important,
let alone as a default.  I will admit to being shocked when
I finally discovered this was an async mount.
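
Since "async" is never printed, the only way I know to spot such mounts is
to look for the absence of "sync".  A rough Python sketch of that check
follows, assuming (as described above) that an NFS mount with no explicit
"sync" option is an async one.  The function name is made up for
illustration, and it takes the text of /proc/mounts as a string so it can
be tried on sample data.

```python
def nfs_mounts_without_sync(mounts_text):
    """Return (mount_point, options) for NFS mounts lacking explicit 'sync'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype not in ("nfs", "nfs4"):
            continue
        opts = options.split(",")
        # "sync" appears only when requested; "async" is never listed.
        if "sync" not in opts:
            result.append((mount_point, opts))
    return result
```

On a Linux client it could be fed the contents of /proc/mounts, e.g.
nfs_mounts_without_sync(open("/proc/mounts").read()).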

On the netapp, our options for NFS (and other relevant stuff) are:

[ deleted ]

Nothing stands out in your config.  Our settings are similar.  But that
is not surprising.  I imagine there are a thousand others just like
us for whom NFS locking with procmail has worked fine for many years.
After all, it had worked flawlessly for us ever since our first NetApp
in 2001.  But something suddenly changed in our environment three weeks
ago which broke it, and I cannot determine what it was.  Even if you
assume, as I do, that our problem is a mismatch in NFS client and server
mount options, why did it work flawlessly for seven years before breaking?

In any event, I am now convinced this is NOT a procmail issue, and
will pursue the problem with NFS developers.  But we will probably
end up either (1) moving all our mail servers to Solaris, or (2) living
with synchronous I/O on our linux mail servers.

Fletcher
____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail