
Re: PXC02 OS patch

2009-01-09 12:48:06
The MRDs were up but not responding, due to full log filesystems.

The xproxies would time out on an ldapsearch() call, and then
try to reconnect.  They would then hang in an ldapbind() call.

Chris

Shufei Wen wrote:
Chris,

I might have missed your point. What is the conclusion on the root cause of the problem? Were the MRDs up or down at the time? First you said the xproxy processes hung when the MRDs were shut down, and later you indicated that the MRD returns ACKs.

Thanks,
Shufei

Chris Eastlund wrote:
Steve,

The problem was with the MRD, which filled the log filesystem, which halted the MRD. Both MRD machines filled their logs.

The proxy processes were hung waiting on the MRD. When the MRD was shut down, the proxy processes all recorded an MRD bind failure and a
lot of sessions (5000) with sl=63000 or so.  And then logins started
again.

The question is why any proxy process kept trying an MRD query
for 63,000 seconds, or 17.5 hours.

The ldap search call has a timeout (default for m2k of 90 seconds)
and gets tried twice by xproxy, and once (for timeout fails)
in the libstdxdir library.  This should take 3 minutes, max.
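
The arithmetic behind that 3-minute bound, and how far the observed session length exceeds it, can be checked with a short sketch. The variable names are illustrative only, and the library attempt count of 1 is my reading of "once (for timeout fails)":

```python
# Values are taken from the message above; names are illustrative only.
SEARCH_TIMEOUT_S = 90   # default m2k ldap search timeout, in seconds
XPROXY_ATTEMPTS = 2     # xproxy tries the search twice
LIBRARY_ATTEMPTS = 1    # assumed: libstdxdir makes one attempt on a timeout failure

# Worst case if every attempt runs to its full timeout.
worst_case_s = SEARCH_TIMEOUT_S * XPROXY_ATTEMPTS * LIBRARY_ATTEMPTS

observed_s = 63_000     # approximate sl value recorded for the hung sessions

print("worst case (minutes):", worst_case_s / 60)
print("observed (hours):", observed_s / 3600)
print("observed / worst case:", observed_s / worst_case_s)
```

So the hung sessions lasted roughly 350 times longer than the search timeouts alone could explain, which points at a step that has no timeout at all.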

After the search fails, the proxy will attempt to close and reopen the session. That's where I think things hang. The MRD system returns an ACK, so the connection seems up and the connection timeout doesn't apply.

When ldapsearch is run against such a listener, truss shows:
    connect()                    # returns EINPROGRESS
    pollsys()
    time()
    write(4,....)                # seems to be the login sequence
    pollsys(0xFFBFF0E8,5,0,0)

I think this means a poll() call with no timeout. I can't find a pollsys() man page, as it's a Solaris internal call.
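
To illustrate the failure mode, here is a minimal Python sketch. It is not the xproxy or openldap code. It assumes a local listener that completes the TCP handshake but never answers, standing in for a stalled MRD; `wait_for_reply` is a hypothetical helper name. The connect and the write both succeed, so only a bounded poll keeps the client from waiting forever.

```python
import select
import socket
import time


def wait_for_reply(sock, timeout_s):
    """Return True if data is readable within timeout_s seconds, else False."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    # poll() takes milliseconds. Passing None here instead would block
    # indefinitely, which is the behaviour suspected in the truss output.
    events = poller.poll(int(timeout_s * 1000))
    return bool(events)


# Stand-in for a stalled server: it listens, so the kernel completes the
# TCP handshake, but the process never accepts the connection or replies.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

# The connect succeeds, so a connection timeout never fires.
client = socket.create_connection(listener.getsockname(), timeout=5)
client.sendall(b"placeholder for the bind request")

start = time.monotonic()
got_reply = wait_for_reply(client, timeout_s=0.5)
elapsed = time.monotonic() - start
print("reply received:", got_reply, "after", round(elapsed, 1), "seconds")

client.close()
listener.close()
```

With a finite timeout the caller gets control back and can report a bind failure; with no timeout it stays in the poll until the peer closes the connection, which matches the proxies recovering only once the MRD was shut down.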

There are web pages noting this problem from about 2002, and our
version of openldap is older than that.

Chris


Steve Prisco wrote:
including George
Steve

------------------------------------------------------------------------
*From:* Al Robinson [mailto:awr(_at_)maillennium(_dot_)att(_dot_)com]
*Sent:* Thursday, January 08, 2009 3:39 PM
*To:* 'M2K Development Team'
*Cc:* Mail Testers; PRISCO, STEVE (ATTLABS)
*Subject:* PXC02 OS patch

Patrick was running a load test on lzfwpxc02 last night and it was running fine until it wasn't. It currently thinks all pop proxy processes on the blpop interface are busy. At least that's what the logs are saying and the response from mailman when a new connection is attempted.

The offered load was 2161 simultaneous sessions. For most of the night, mailman reported hiwater at approximately 3500/5000. At 4:32 it reported a hiwater of 4439/5000 followed by an XSFLOOD at 4:34. At this point, it doesn't seem to respond any more. Subsequent XSTAT logs with the load still active report hiwater of 0/5000 and 500k+ xnconns. The latest XSTAT showed a hiwater of 0/5000 with 347 xnconns.

We need development to look at the server and determine if it is a mailman problem or a problem with the OS patch. I assume a core file will be needed, but we haven't touched the system yet.

Al Robinson

_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www.ietf.org/mailman/listinfo/ietf
