The MRDs were up but not responding, due to full log filesystems.
The xproxies would time out on an ldapsearch() call, and then
try to reconnect.  They would then hang in an ldapbind() call.
Chris
Shufei Wen wrote:
Chris,
I might have missed your point. What's the conclusion on the root cause 
of the problem? Were the MRDs up or down at the time?
First you said the xproxy processes hung when the MRDs were shut down, and 
later you indicated that the MRD returns ACKs?
Thanks,
Shufei
Chris Eastlund wrote:
Steve,
The problem was with the MRD, which filled the log filesystem, which 
halted the MRD.  Both MRD machines filled their logs.
The proxy processes were hung waiting on the MRD.  When the MRD was 
shut down, the proxy processes all recorded an MRD bind failure and a
lot of sessions (5000) with sl=63000 or so.  And then logins started
again.
The question is why any proxy process kept trying an MRD query
for 63,000 seconds, or 17.5 hours.
The LDAP search call has a timeout (the m2k default is 90 seconds)
and gets tried twice by xproxy, and once (on timeout failures)
in the libstdxdir library.  This should take 3 minutes, max.
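
As a rough check of those numbers, here is a small sketch of the arithmetic. It assumes the single libstdxdir attempt happens inside each xproxy attempt rather than in addition to it, which is the reading that gives 3 minutes. The constant and function names are illustrative only and do not come from the m2k code.

```python
# Values quoted in the thread; the names are illustrative only.
SEARCH_TIMEOUT_S = 90       # default m2k LDAP search timeout, in seconds
XPROXY_ATTEMPTS = 2         # xproxy tries the search twice
LIB_ATTEMPTS = 1            # libstdxdir tries once (assumed nested, not additive)
OBSERVED_SESSION_S = 63000  # approximate sl value seen in the proxy logs


def worst_case_search_seconds():
    """Upper bound on time spent in the search path before giving up."""
    return SEARCH_TIMEOUT_S * XPROXY_ATTEMPTS * LIB_ATTEMPTS


def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600


if __name__ == "__main__":
    budget = worst_case_search_seconds()
    print(f"expected worst case: {budget} s ({budget / 60:.0f} min)")
    print(f"observed: {OBSERVED_SESSION_S} s "
          f"({seconds_to_hours(OBSERVED_SESSION_S):.1f} h)")
    print(f"observed / expected: {OBSERVED_SESSION_S / budget:.0f}x")
```

The gap between the two figures is why the search timeout alone cannot explain the hang.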
After the search fails, the proxy will attempt to close and reopen the 
session.  That's where I think things hang.  The MRD system returns
an ACK, so the connection seems up and the connection timeout doesn't 
apply.
When ldapsearch is run against such a listener, truss shows:
    connect()  # returns EINPROGRESS
    pollsys()
    time()
    write(4,....) # seems to be the login sequence
    pollsys(0xFFBFF0E8,5,0,0)
I think this means a poll() call with no timeout.  I can't find a 
pollsys() man page, as it's a Solaris internal call.
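
The situation can be reproduced in miniature. The sketch below is in Python, not the C of openldap, and the function names are made up for illustration. It connects to a listener whose kernel completes the TCP handshake but whose process never reads or replies, then waits for a response with a bounded poll. Passing None instead of a millisecond value would block indefinitely, which is the behaviour suspected from the truss output.

```python
import select
import socket


def wait_for_reply(sock, timeout_s):
    """Return True if sock becomes readable within timeout_s seconds."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    # poll() takes milliseconds; None here would mean "wait forever".
    events = poller.poll(int(timeout_s * 1000))
    return bool(events)


def demo(timeout_s=0.5):
    """Connect to a listener that never answers and wait with a timeout."""
    # The listener is bound and listening, so the kernel ACKs the
    # connection, but nothing ever calls accept() or sends data back.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    try:
        client = socket.create_connection(listener.getsockname(), timeout=2.0)
        try:
            # Stand-in for the bind/login request; the write succeeds
            # because the data is simply buffered by the kernel.
            client.sendall(b"placeholder bind request")
            return wait_for_reply(client, timeout_s)
        finally:
            client.close()
    finally:
        listener.close()


if __name__ == "__main__":
    print("got reply:", demo())
```

The connect and the write both succeed here, so a connection timeout never fires. Only a timeout on the wait for the reply lets the caller give up.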
There are web pages noting this problem from about 2002, and our
version of openldap is older than that.
Chris
Steve Prisco wrote:
including George
 
Steve
------------------------------------------------------------------------
*From:* Al Robinson [mailto:awr(_at_)maillennium(_dot_)att(_dot_)com]
*Sent:* Thursday, January 08, 2009 3:39 PM
*To:* 'M2K Development Team'
*Cc:* Mail Testers; PRISCO, STEVE (ATTLABS)
*Subject:* PXC02 OS patch
Patrick was running a load test on lzfwpxc02 last night and it was 
running fine until it wasn't.
 
It currently thinks all POP proxy processes on the blpop interface 
are busy. At least that's what the logs are saying, and it is the 
response from mailman when a new connection is attempted.
 
 
 
The offered load was 2161 simultaneous sessions. For most of the 
night, mailman reported hiwater at approximately 3500/5000. At 4:32 
it reported a hiwater of 4439/5000 followed by an XSFLOOD at 4:34. 
At this point, it doesn't seem to respond any more. Subsequent XSTAT 
logs with the load still active report hiwater of 0/5000 and 500k+ 
xnconns. The latest XSTAT showed a hiwater of 0/5000 with 347 xnconns.
 
We need development to look at the server and determine if it is a 
mailman problem or a problem with the OS patch.
 
I assume a core file will be needed, but we haven't touched the 
system yet.
 
Al Robinson