Everhart, Glenn

From:    andrew.harrison@uk.sun.nospam.com [andrew.harrison@uk.sun.com]
Sent:    Friday, January 08, 1999 5:57 AM
To:      Info-VAX@Mvb.Saic.Com
Subject: Re: OVMS Sales.

Rob Young wrote:
>
> In article <3694EC43.1D40F235@uk.sun.com>, "andrew.harrison@uk.sun.nospam.com" writes:
> > Rob Young wrote:
> >>
> >
> > This may be how the gfs group intended to implement a UNIX-based
> > Cluster Filesystem; it isn't the way Sun is doing it.
> >
> > Sun intends to be able to maintain consistency at the disk
> > block level using the Cluster File System.
> >
>
> Your method looks more realistic, as getting disparate
> disk drive manufacturers to support DLOCK *correctly* looks like
> a stretch.
>
> > Using exclusive locking of drives would only result in
> > a very granular system.
> >
>
> But the DLOCK method seems like a hope for the Linux crowd
> to gain clustering ... seems to be the thrust.
>

I suppose it has the virtue of being relatively simple to architect, though
implementation and subsequent performance may well be more interesting.

> > http://www.sun.com/software/white-papers/wp-sunclusters/sunclusterswp.pdf
> >
> > Gives you a high-level view of what the system will do but
> > skips any implementation details; reading up on doors might
> > give you some idea of how the consistency of the global filesystem
> > and access to the global device pool will be managed.
> >
> > Most UNIXes do currently support, via Oracle Parallel Server or
> > Informix XPS, multiple nodes simultaneously accessing
> > the same disk device, with access to the device being
> > managed by a cluster volume manager and a Distributed
> > Lock Manager.
> >
>
> You do have a tremendous challenge there to support existing
> APIs and to ensure apps run unmodified. I guess you are
> sparing yourselves the pain of a true DLM and this method is
> scalable enough to suit your needs. Not to be too disparaging,
> but it looks like a glorified NFS server.. looking at page
> 22:
>

Implementing a DLM would be possible, and actually not that difficult to do,
but it would need to be implemented for every filesystem that the OS
supports, and these are numerous. The proxy layer approach adds global
filesystem support to any of the standard filesystems without having to
modify the filesystem itself. This in turn preserves the APIs that the
filesystems support, since the proxy layer preserves the underlying
filesystem's interfaces.

This approach has been used successfully in other products that add
functionality on top of an underlying filesystem; examples are UPFS, which
adds remote replication to any UNIX filesystem, and the Translucent
Filesystem. There is a protection mechanism; you could call it a DLM, though
it isn't. I cannot give you any more details without shooting you first, but
reading up on Spring would give you some insights as to how it works.
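To make the interposition idea a little more concrete, here is a rough
user-space sketch of it. The types and names are my own toy inventions, not
anything from the actual implementation, and a trivial stand-in sits where
the real cluster-wide protection mechanism would go. The point is only the
shape: the proxy exports the same operations table as the filesystem
underneath it, adds its coordination, and then simply delegates, so
applications and the filesystem itself see nothing new.

/*
 * Hypothetical sketch of a proxy (interposition) layer over an
 * unmodified local filesystem. Not Sun code; illustration only.
 */
#include <stdio.h>
#include <stddef.h>

/* The interface every filesystem exposes (a vnode-ops style table). */
struct fs_ops {
    int (*read)(const char *path, char *buf, size_t len, long off);
    int (*write)(const char *path, const char *buf, size_t len, long off);
};

/* Pretend "local" filesystem, left completely unmodified. */
static int local_read(const char *path, char *buf, size_t len, long off)
{
    printf("local fs: read  %s off=%ld len=%lu\n", path, off, (unsigned long)len);
    return 0;
}

static int local_write(const char *path, const char *buf, size_t len, long off)
{
    printf("local fs: write %s off=%ld len=%lu\n", path, off, (unsigned long)len);
    return 0;
}

static struct fs_ops local_fs = { local_read, local_write };

/* Stand-in for whatever protection mechanism is really used. */
static void cluster_enter(const char *path) { printf("proxy: protect %s\n", path); }
static void cluster_exit(const char *path)  { printf("proxy: release %s\n", path); }

/* The proxy layer: same signatures, coordination added, then delegation. */
static struct fs_ops *underlying = &local_fs;

static int proxy_read(const char *path, char *buf, size_t len, long off)
{
    int rc;
    cluster_enter(path);
    rc = underlying->read(path, buf, len, off);
    cluster_exit(path);
    return rc;
}

static int proxy_write(const char *path, const char *buf, size_t len, long off)
{
    int rc;
    cluster_enter(path);
    rc = underlying->write(path, buf, len, off);
    cluster_exit(path);
    return rc;
}

static struct fs_ops proxy_fs = { proxy_read, proxy_write };

int main(void)
{
    char buf[64];
    /* Applications keep calling the interface they always did. */
    proxy_fs.write("/global/export/data", "hello", 5, 0);
    proxy_fs.read("/global/export/data", buf, sizeof buf, 0);
    return 0;
}

The extra behaviour lives entirely in the wrapper; the table underneath
could belong to any filesystem with the same interface, which is why the
existing APIs survive untouched.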
> "Figure 12 illustrates this scenario. A client on one node makes a request to a
> server on another node. Each server has a secondary node that can be used for
> failover if necessary. The server sends "checkpoint" messages (meta data) to the
> secondary server for each write operation to guarantee the integrity of data in
> case of failover. If the primary server fails for any reason, the secondary
> server assumes the identity of the primary, and the framework automatically
> redirects requests so that replication and failover are transparent to the
> client."
>

It depends on the topology of the storage subsystem. In effect, if only one
node has access to a disk, then that node has to access the device for any
other node that may require a block from the device. The mechanism used to
do this between nodes would use SCI or reflective memory; reflective memory
is the same technology as Memory Channel. This is useful for devices that
cannot be connected to multiple nodes. However, if both nodes have access to
the disk, then this is not necessary. FC-AL storage area networks give you
the ability to have multiple nodes connected to arrays or disks, and
currently most of the storage that Sun ships is FC-AL based. In this case
each node's access to the disk would be direct rather than via another node.

> Maybe an extension of NFS serving. How can this be a cluster common
> file system? Seems distributed to me. Perhaps marketing semantics.
>

Again, it depends on the architecture of the storage subsystems and the way
that the storage is connected to the nodes that make up the cluster. Using,
say, directly connected SCSI JBOD would result in what looks like NFS file
serving. It would of course be much lower level than NFS, which is RPC based
and uses an external data representation for all data that is moved over the
network. However, the Cluster File System also allows you to use a storage
area network with all the storage connected to all the nodes; this results
in each node directly accessing the shared storage.

> So let me get this straight. All writes are checkpointed in case
> of failover? Why not an IO Database to track all IOs? Then maybe
> in the future you can move that IO Database into shared memory ;-).
>

The checkpoint is for the metadata, not the actual I/O itself; in effect the
metadata checkpoint is the database that tracks all the I/Os. An example of
this is where you have two nodes, both with direct access to a shared
storage device. Node A writes one block directly to disk and checkpoints the
metadata; its direct disk connection then fails, and the next write is
routed via node B and checkpointed. Of course this is not necessary if you
use dual-connected devices like FC-AL disks, because you have two paths from
each host to each disk.
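To give a feel for what a metadata checkpoint has to carry, here is a toy
illustration. The record layout, field names and functions are invented for
the example; they are nothing like the real wire format. The idea is only
that the secondary needs enough to know which operations were in flight, not
the data blocks themselves.

/*
 * Hypothetical sketch of "checkpoint the metadata, not the data".
 * Illustration only; not the actual protocol.
 */
#include <stdio.h>

struct ckpt_record {
    unsigned long seq;      /* per-server sequence number           */
    unsigned long file_id;  /* which file the write touched         */
    unsigned long offset;   /* where in the file                    */
    unsigned long length;   /* how much                             */
    int           state;    /* 0 = started, 1 = confirmed on disk   */
};

/* Stand-in for the secondary's copy of the checkpoint "database";
 * in reality the records travel over the cluster interconnect. */
static struct ckpt_record log_on_secondary[128];
static int nrecords;
static unsigned long next_seq;

static void send_checkpoint(struct ckpt_record r)
{
    int i;
    for (i = 0; i < nrecords; i++) {
        if (log_on_secondary[i].seq == r.seq) {
            log_on_secondary[i] = r;   /* update the in-flight record */
            return;
        }
    }
    log_on_secondary[nrecords++] = r;
}

static void primary_write(unsigned long file_id, unsigned long off, unsigned long len)
{
    struct ckpt_record r;
    r.seq = ++next_seq; r.file_id = file_id; r.offset = off; r.length = len; r.state = 0;
    send_checkpoint(r);     /* tell the secondary before the write     */
    /* ... the data block itself goes straight to the shared disk ...  */
    r.state = 1;
    send_checkpoint(r);     /* and again once it is safely on disk     */
}

/* On failover the secondary only has to sort out operations that were
 * started but never confirmed; the data is already on the shared disk. */
static void secondary_takeover(void)
{
    int i;
    for (i = 0; i < nrecords; i++)
        if (log_on_secondary[i].state == 0)
            printf("recheck file %lu offset %lu\n",
                   log_on_secondary[i].file_id, log_on_secondary[i].offset);
}

int main(void)
{
    struct ckpt_record r;

    primary_write(42, 0, 8192);
    primary_write(42, 8192, 8192);

    /* Pretend the primary died after starting a third write but before
     * confirming it, and the secondary then takes over. */
    r.seq = ++next_seq; r.file_id = 42; r.offset = 16384; r.length = 8192; r.state = 0;
    send_checkpoint(r);
    secondary_takeover();
    return 0;
}

Because the data went straight to the shared device, the secondary only has
to resolve the operations it never saw confirmed; it never has to carry the
blocks themselves.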
> But what I find very disturbing, calling into question the
> reliability of the file system, is this statement:
>
> "All disks in the cluster are replicated with mirroring (RAID1,5), so data is
> protected at the disk level as well."
>

This is only to protect data at a block level from disk drives failing.

> Man ... if I were you I would get ahold of the marketing folks
> quick as a bunny, have them shrink the font of that puppy and bury it
> in a footnote somewhere. Or perhaps re-word it a bit:
>

No, in this case the marketing folks are right: RAID 1 or 5 is simply to
protect against disk or controller failure.

> "This new File System is lightning quick and sometimes has a tendency to
> scramble a disk ( in *very* RARE situations, but our analysis proves it is
> possible ) so what we are recommending is that you Mirror or RAID5 your data
> just in case. We sure as heck don't want to be sued if you lose a very large
> transaction and you would like to keep your job wouldn't you? Contact our
> marketing department for a FREE special analysis whitepaper on how to sell
> Upper-Management on RAIDing every drive you own, ask for CLUS08-RAID."
>

You are getting it the wrong way round: mirroring storage will not protect
you from an unreliable filesystem, since the mirroring layer sits under the
filesystem. If my filesystem writes a corrupted block to a mirrored disk,
the mirroring system simply replicates the bad block. In the same way,
mirroring does not help me if I delete a file by mistake, since the file is
deleted from both legs of the mirror.

In practice filesystems tend to be very reliable, unless you are running
something like NT or Linux with async writes for data and metadata. You are
much more likely to see corruption introduced into files by applications,
and your filesystem does not help you here.

As an example, the Ultra server that sits under my desk has been there since
1996. It is my own personal workstation (it has a head), and it is also a
web server and fileserver. Our office has had a number of power outages in
this time, and since my server is not in the machine room with a UPS it has
been down each time. We have not, to my knowledge, lost any data yet, and it
is a fairly heavily used machine. On the other hand, a beta version of
Netscape trashed the contents of my .netscape directory 6 months ago.

regards

Andrew Harrison
Enterprise IT Architect