Article 166662 of comp.os.vms:

In article: <1997Feb27.122232.1@eisner> cornelius@eisner.decus.org (George Cornelius) writes:
>
> In article <3314B094.6E1D@spf.nsc.com>, Tim Hewitt writes:
> > You might consider the SYSGEN parameter PE1, which affects the scope of
> > dynamic lock remastering. If you have a large number of locks and they
> > are moving from node 1 to node 2, this may account for the stall.
>
> Classic symptoms of dynamic lock remastering, and you can in fact repair
> things by setting PE1 low. DEC recommends setting it at 100 when you have
> a problem like this, so that small lock trees will dynamically remaster
> and large ones will remain statically mastered. I currently have PE1 at
> 1000.
>
> Don't forget to set LOCKDIRWT to zero on underpowered nodes, though, and
> you may want to consider setting it to larger values on the more powerful
> nodes.
>
> One of the things I found worked for this problem long ago - before I had
> heard Digital's recommendations - was to intentionally imbalance the load.
> Since my users were all LAT-based and my primary node could handle most of
> the load, I set CPU_RATING under LATCP to a low value on the secondary
> node so the bulk of the database users wound up on the primary, and that
> seemed to solve the problem as well.
>
> These days I just use PE1.
>
> --
> George Cornelius    cornelius@eisner.decus.org
>                     cornelius@mayo.edu

That's the problem - and the solution!

I spent an hour watching MONITOR LOCK on both nodes. During that time one
node typically had 18000 locks while the other had 24000. Three times in
that hour the lock counts swapped over between the nodes, and each time
the machines 'froze' for the users.

I set PE1 to 1000 as suggested (a sketch of the commands follows below),
and that seems to have stopped the effect completely.

Many thanks.

--
Dave Pickles
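
For anyone wanting to try the same fix, a rough sketch of the sequence (not
a transcript of an actual session; check the SYSGEN help on your VMS
version). PE1 is a dynamic parameter, so it can be changed on a running
node; the MODPARAMS.DAT entry only exists so the next AUTOGEN run does not
undo the change:

  $ MONITOR LOCK /INTERVAL=5       ! watch "Total Locks" on each node
  $ RUN SYS$SYSTEM:SYSGEN
  SYSGEN> USE ACTIVE               ! work on the in-memory parameter set
  SYSGEN> SET PE1 1000             ! remaster only trees with < 1000 locks
  SYSGEN> WRITE ACTIVE             ! dynamic - takes effect immediately
  SYSGEN> EXIT
  $ ! and in SYS$SYSTEM:MODPARAMS.DAT, add:   PE1 = 1000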
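
LOCKDIRWT is different: it is a static parameter, so the change goes into
the on-disk parameter set and only takes effect after a reboot. A sketch
for an underpowered node, per George's suggestion:

  $ RUN SYS$SYSTEM:SYSGEN
  SYSGEN> USE CURRENT              ! work on the on-disk parameter set
  SYSGEN> SET LOCKDIRWT 0          ! no share of the lock directory here
  SYSGEN> WRITE CURRENT            ! static - applied at the next reboot
  SYSGEN> EXIT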
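
The LAT load-imbalancing trick is done in LATCP on the node to be unloaded.
The rating of 10 below is only an illustration (ratings run roughly 1-100;
terminal servers steer new sessions toward nodes advertising higher
ratings, so a low value pushes users onto the other node):

  $ RUN SYS$SYSTEM:LATCP
  LATCP> SET NODE /CPU_RATING=10   ! advertise low capacity on this node
  LATCP> EXIT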