From:	SMTP%"lauri@elwing.fnal.gov" 29-JUN-1994 14:30:01.96
To:	EVERHART
CC:
Subj:	Re: Problem with Logging In

From: lauri@fndcd.fnal.gov (Laurelin of Middle Earth)
X-Newsgroups: comp.os.vms
Subject: Re: Problem with Logging In
Date: 29 Jun 1994 14:48:48 GMT
Organization: Fermi National Accelerator Lab
Lines: 69
Message-ID: <2us1kh$hgv@fnnews.fnal.gov>
Reply-To: lauri@elwing.fnal.gov
NNTP-Posting-Host: dcd00.fnal.gov
To: Info-VAX@CRVAX.SRI.COM
X-Gateway-Source-Info: USENET

In article <1994Jun29.164710.5@southpower.co.nz>, buxtonr@southpower.co.nz
writes:
> Hi Folks,
>
> Has anybody experienced a problem whereby the system is running fine but no
> one can log in.  All attempts stop at Username, the password is not
> requested.  On another node within the cluster all is well and you can log
> into the same account okay.  The Username request does not time out.

... And others have joined in with culprits like SYSUAF, VMSMAIL_PROFILE,
etc.  I'd like to add RIGHTSLIST to the list of Files That Can Be Locked
And Prevent People From Logging In (FTCBLAPPFLI, tm ;-).

> We've experienced this twice recently each time we've had to reboot the
> machine.  On the first occasion the shutdown stalled while trying to shut
> down the Queue Manager.  Once we'd manually killed this everything burst
> into life.  We continued with the shut down.
> On the second occasion, killing the Queue Manager made no difference.  On
> both occasions we couldn't perform queue functions.

We see this quite frequently, and refer to it as Creeping Cluster Grunge.
The first symptom is that queue commands start hanging on *some* nodes
(this includes any SUBMIT, PRINT, some MAIL (to SMTP addresses via
MULTINET, which uses queues) and any SHOW QUEUE commands).  On other nodes
things are fine -- for a while.  Then little by little more and more nodes
are affected.  AND, a really troubling phenomenon -- OTHER commands start
hanging as well, nice innocent little commands like SHOW DEVICE and
DIRECTORY.
We suspect, but have very little in the way of proof, that what is
happening involves disk problems and I/O locks.  Some disk starts flaking
out (going into MountVerify, or disappearing entirely, or whatnot).  This
disk contains a file referenced by the queue manager -- a .COM file being
submitted, a .LOG file specification, etc.  The queue manager can't
complete some I/O on some node and hangs that node.  Little by little the
queue manager shows the I/O problems first -- but as I/O gets locked up
throughout the cluster, other commands involving I/O also start locking
up.  Needless to be needless, it is *quite* a painful experience if you
don't catch it early.  (Well, for us it is -- we have 107 nodes to reboot
if all solutions fail.)

Our first and foremost line of attack is MOVE THE QUEUE MANAGER.  Move it
quickly, at the first sign of trouble!  Sometimes you might need to move
it a couple of times, back and forth from node to node.  But it seems to
clear the problem *most* of the time.

For the uninitiated, the exact sequence of steps to take:

  - find the node currently running QUEUE_MANAGER.
  - from a *DIFFERENT* node, $ START/QUEUE/MANAGER/ON=different-node::
  - wait for that command to complete.
  - start doing SHOW QUEUE commands cluster-wide to see if things are ok
    again.  (SPAWNing these commands can be life-saving, by the way; you
    don't want to lock up all of your active sessions!)
  - repeat until queue commands are ok again.

--
lauri
/-----------------------------------------------------------------------------\
| Lauri Loebel Carpenter       "All that is gold does not glitter,            |
| lauri@elwing.fnal.gov         Not all those who wander are lost..." - JRRT  |
| #include                      /* I only speak for myself */                 |
\-----------------------------------------------------------------------------/
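[Editor's sketch] The failover sequence described in the post can be collected into a small DCL fragment. This is only a sketch of the commands the author names, not their actual procedure; OTHERNODE is a placeholder for whichever cluster member you are moving the queue manager to, and it must be entered from a node *other* than the one currently running QUEUE_MANAGER:

```
$! Creeping-Cluster-Grunge first aid (sketch; OTHERNODE is a placeholder).
$! 1. See where the queue manager is running now.
$ SHOW QUEUE /MANAGER /FULL
$! 2. From a DIFFERENT node, fail the queue manager over.
$ START /QUEUE /MANAGER /ON=OTHERNODE::
$! 3. Probe from a spawned subprocess so a hang doesn't take your
$!    interactive session with it.
$ SPAWN /NOWAIT SHOW QUEUE /ALL
$! 4. If queue commands still hang, repeat step 2 with another node.
```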