Re: PXC02 OS patch

The MRDs were up but not responding, due to full log filesystems.

The xproxies would time out on an ldapsearch() call, and then
try to reconnect.  They would then hang in an ldapbind() call.

Chris

Shufei Wen wrote:

Chris,
I might have missed your point. What's the conclusion on the root cause of the problem? Were the MRDs up or down at the time? First you said the xproxy processes hang when the MRDs were shut down, and later you indicated that the MRD returns ACKs?
Thanks,
Shufei

Chris Eastlund wrote:
Steve,
The problem was with the MRD, which filled the log filesystem, which halted the MRD. Both MRD machines filled their logs.
The proxy processes were hung waiting on the MRD. When the MRD was shut down, the proxy processes all recorded an MRD bind failure and a
lot of sessions (5000) with sl=63000 or so.  And then logins started
again.

The question is why any proxy process kept trying an MRD query
for 63,000 seconds, or 17.5 hours.

The ldap search call has a timeout (default for m2k of 90 seconds)
and gets tried twice by xproxy, and once (for timeout fails)
in the libstdxdir library.  This should take 3 minutes, max.
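
As a quick check of the numbers above, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and it assumes the single try inside libstdxdir adds no extra wait beyond the two xproxy attempts.

```python
def worst_case_search_seconds(timeout_s=90, attempts=2):
    """Upper bound on the search wait: each attempt can block for the full timeout."""
    return timeout_s * attempts


def seconds_to_hours(seconds):
    """Convert a session length in seconds to hours."""
    return seconds / 3600


# 90 s default timeout, tried twice by xproxy
print(worst_case_search_seconds())   # 180 seconds, i.e. 3 minutes
# approximate sl value recorded for the stuck sessions
print(seconds_to_hours(63000))       # 17.5 hours
```

So the stuck sessions lasted roughly 350 times longer than the search timeout alone can explain.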
After the search fails, the proxy will attempt to close and reopen the session. That's where I think things hang. The MRD system returns an ACK, so the connection seems up and the connection timeout doesn't apply.
When ldapsearch is run against such a listener, truss shows:
    connect()  # returns EINPROGRESS
        pollsys()
        time()
        write(4,....) # seems to be the login sequence
        pollsys(0xFFBFF0E8,5,0,0) #
I think this means a poll() call with no timeout. I can't find a pollsys() man page, as it's a Solaris internal call.
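
To illustrate the difference a poll timeout makes, here is a small Python sketch using the standard select.poll interface on a local socket pair. It is only an analogy for the truss output, not the openldap code path, and the helper names are hypothetical.

```python
import select
import socket


def wait_for_reply(sock, timeout_ms):
    """Return True if sock becomes readable within timeout_ms milliseconds.

    Passing timeout_ms=None would block until data arrives, which is the
    unbounded behaviour suspected in the trace above.
    """
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    events = poller.poll(timeout_ms)
    return bool(events)


def demo():
    """Poll a peer that stays silent, then poll again after it writes."""
    a, b = socket.socketpair()
    try:
        silent = wait_for_reply(a, 200)     # peer never writes: poll gives up
        b.sendall(b"x")
        answered = wait_for_reply(a, 200)   # data is pending: poll returns at once
        return silent, answered
    finally:
        a.close()
        b.close()


print(demo())
```

With a bounded timeout the caller gets control back and can fail the session; with no timeout it waits for as long as the peer stays silent.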
There are web pages noting this problem from about 2002, and our
version of openldap is older than that.
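
The hang described above can be imitated in miniature. The sketch below is an assumption-laden stand-in, not our MRD or xproxy code: a local socket listens but never accepts or answers, so the kernel completes the TCP handshake (the ACK) while the application stays silent. The function name and the "bind request" payload are hypothetical.

```python
import socket


def probe_silent_listener(read_timeout_s=0.5):
    """Connect to a listener that ACKs the connection but never replies."""
    # Stand-in for a halted MRD: the socket is listening, so the kernel
    # completes the handshake, but nothing calls accept() or sends data.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    host, port = listener.getsockname()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        client.settimeout(read_timeout_s)
        client.connect((host, port))      # succeeds, so a connect timeout never fires
        client.sendall(b"bind request")   # succeeds, the data is just buffered
        try:
            client.recv(1024)             # with no timeout set, this would block forever
            return "reply"
        except socket.timeout:
            return "read timed out"
    finally:
        client.close()
        listener.close()


print(probe_silent_listener())
```

The connect and the write both succeed, and only a timeout on the read gets the client out, which matches the idea that the connection timeout doesn't apply here.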

Chris


Steve Prisco wrote:
including George
Steve
------------------------------------------------------------------------
*From:* Al Robinson [mailto:awr(_at_)maillennium(_dot_)att(_dot_)com]
*Sent:* Thursday, January 08, 2009 3:39 PM
*To:* 'M2K Development Team'
*Cc:* Mail Testers; PRISCO, STEVE (ATTLABS)
*Subject:* PXC02 OS patch
Patrick was running a load test on lzfwpxc02 last night and it was running fine until it wasn't.
It currently thinks all pop proxy processes on the blpop interface are busy. At least that's what the logs are saying and the response from mailman when a new connection is attempted.
The offered load was 2161 simultaneous sessions. For most of the night, mailman reported hiwater at approximately 3500/5000. At 4:32 it reported a hiwater of 4439/5000 followed by an XSFLOOD at 4:34. At this point, it doesn't seem to respond anymore. Subsequent XSTAT logs with the load still active report hiwater of 0/5000 and 500k+ xnconns. The latest XSTAT showed a hiwater of 0/5000 with 347 xnconns.
We need development to look at the server and determine if it is a mailman problem or a problem with the OS patch.
I assume a core file will be needed, but we haven't touched the system yet.
Al Robinson
