From:	SMTP%"lauri@elwing.fnal.gov" 29-JUN-1994 14:30:01.96
To:	EVERHART
CC:
Subj:	Re: Problem with Logging In

From: lauri@fndcd.fnal.gov (Laurelin of Middle Earth)
X-Newsgroups: comp.os.vms
Subject: Re: Problem with Logging In
Date: 29 Jun 1994 14:48:48 GMT
Organization: Fermi National Accelerator Lab
Lines: 69
Message-ID: <2us1kh$hgv@fnnews.fnal.gov>
Reply-To: lauri@elwing.fnal.gov
NNTP-Posting-Host: dcd00.fnal.gov
To: Info-VAX@CRVAX.SRI.COM
X-Gateway-Source-Info: USENET

In article <1994Jun29.164710.5@southpower.co.nz>, buxtonr@southpower.co.nz
writes:
> Hi Folks,
>
> Has anybody experienced a problem whereby the system is running fine but no
> one can log in.  All attempts stop at Username, the password is not
> requested.  On another node within the cluster all is well and you can log
> into the same account okay.  The Username request does not time out.

... And others have joined in with culprits like SYSUAF, VMSMAIL_PROFILE,
etc.  I'd like to add RIGHTSLIST to the list of Files That Can Be Locked
And Prevent People From Logging In (FTCBLAPPFLI, tm ;-).

> We've experienced this twice recently each time we've had to reboot the
> machine.  On the first occasion the shutdown stalled while trying to shut
> down the Queue Manager.  Once we'd manually killed this everything burst
> into life.  We continued with the shut down.
> On the second occasion, killing the Queue Manager made no difference.  On
> both occasions we couldn't perform queue functions.

We see this quite frequently, and refer to it as Creeping Cluster Grunge.
The first symptom is that queue commands start hanging on *some* nodes
(this includes any SUBMIT, PRINT, some MAIL (to SMTP addresses via
MULTINET, which uses queues) and any SHOW QUEUE commands).  On other nodes
things are fine -- for a while.  Then little by little more and more nodes
are affected.  AND, a really troubling phenomenon -- OTHER commands start
hanging as well, nice innocent little commands like SHOW DEVICE and
DIRECTORY.
We suspect, but have very little in the way of proof, that what is
happening involves disk problems and I/O locks.  Some disk starts flaking
out (going into MountVerify, or disappearing entirely, or whatnot).  This
disk contains a file referenced by the queue manager -- a .COM file being
submitted, a .LOG file specification, etc.  The queue manager can't
complete some I/O on some node and hangs that node.  Little by little the
queue manager shows the I/O problems first -- but as I/O gets locked up
throughout the cluster, other commands involving I/O also start locking
up.  Needless to be needless, it is *quite* a painful experience if you
don't catch it early.  (Well, for us it is -- we have 107 nodes to reboot
if all solutions fail.)

Our first and foremost line of attack is MOVE THE QUEUE MANAGER.  Move it
quickly, at the first sign of trouble!  Sometimes you might need to move
it a couple of times, back and forth from node to node.  But it seems to
clear the problem *most* of the time.

For the uninitiated, the exact sequence of steps to take:

  - find the node currently running QUEUE_MANAGER.
  - from a *DIFFERENT* node, $ START/QUEUE/MANAGER/ON=different-node::
  - wait for that command to complete.
  - start doing SHOW QUEUE commands cluster-wide to see if things are ok
    again.  (SPAWNing these commands can be life-saving, by the way; you
    don't want to lock up all of your active sessions!)
  - repeat until queue commands are ok again.

--
lauri
/-----------------------------------------------------------------------------\
| Lauri Loebel Carpenter       "All that is gold does not glitter,            |
| lauri@elwing.fnal.gov         Not all those who wander are lost..." - JRRT  |
| #include                      /* I only speak for myself */                 |
\-----------------------------------------------------------------------------/
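[Editor's sketch] The failover sequence described in the post can be collected into a small DCL fragment. This is only a sketch of the commands the author names, not their actual procedure; OTHERNODE is a placeholder for whichever cluster member you are moving the queue manager to, and it must be entered from a node *other* than the one currently running QUEUE_MANAGER:

```
$! Creeping-Cluster-Grunge first aid (sketch; OTHERNODE is a placeholder).
$! 1. See where the queue manager is running now.
$ SHOW QUEUE /MANAGER /FULL
$! 2. From a DIFFERENT node, fail the queue manager over.
$ START /QUEUE /MANAGER /ON=OTHERNODE::
$! 3. Probe from a spawned subprocess so a hang doesn't take your
$!    interactive session with it.
$ SPAWN /NOWAIT SHOW QUEUE /ALL
$! 4. If queue commands still hang, repeat step 2 with another node.
```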