Everhart, Glenn

From:    andrew.harrison@uk.sun.nospam.com [andrew.harrison@uk.sun.com]
Sent:    Friday, January 08, 1999 5:57 AM
To:      Info-VAX@Mvb.Saic.Com
Subject: Re: OVMS Sales.

Rob Young wrote:
>
> In article <3694EC43.1D40F235@uk.sun.com>, "andrew.harrison@uk.sun.nospam.com" writes:
> > Rob Young wrote:
> >>
> >
> > This may be how the gfs group intended to implement a UNIX-based
> > Cluster Filesystem; it isn't the way Sun is doing it.
> >
> > Sun intends to be able to maintain consistency at the disk
> > block level using the Cluster File System.
> >
>
> Your method looks more realistic, as getting disparate
> disk drive manufacturers to support DLOCK *correctly* looks like
> a stretch.
>
> > Using exclusive locking of drives would only result in
> > a very granular system.
> >
>
> But the DLOCK method seems like a hope for the Linux crowd
> to gain clustering ... seems to be the thrust.
>

I suppose it has the virtue of being relatively simple to architect, though
implementation and subsequent performance may well be more interesting.

> > http://www.sun.com/software/white-papers/wp-sunclusters/sunclusterswp.pdf
> >
> > Gives you a high-level view of what the system will do but
> > skips any implementation details; reading up on doors might
> > give you some idea of how the consistency of the global filesystem
> > and access to the global device pool will be managed.
> >
> > Most UNIXes do currently support, via Oracle Parallel Server or
> > Informix XPS, multiple nodes simultaneously accessing
> > the same disk device, with access to the device being
> > managed by a cluster volume manager and a Distributed
> > Lock Manager.
> >
>
> You do have a tremendous challenge there to support existing
> APIs and to ensure apps run unmodified. I guess you are
> sparing yourselves the pain of a true DLM and this method is
> scalable enough to suit your needs. Not to be too disparaging,
> but it looks like a glorified NFS server.. looking at page
> 22:
>

Implementing a DLM would be possible, and actually not that difficult to do,
but it would need to be implemented for every filesystem that the OS
supports, and these are numerous. The proxy layer approach adds global
filesystem support to any of the standard filesystems without having to
modify the filesystem itself. This in turn preserves the APIs that the
filesystems support, since the proxy layer preserves the underlying
filesystem's interfaces.

This approach has been used successfully in other products that add
functionality on top of an underlying filesystem; examples are UPFS, which
adds remote replication to any UNIX filesystem, and the Translucent
Filesystem. There is a protection mechanism; you could call it a DLM, though
it isn't. I cannot give you any more details without shooting you first, but
reading up on Spring would give you some insights as to how it works.
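To make the interposition idea a little more concrete, here is a rough
user-space sketch of it. The types and names are my own toy inventions, not
anything from the actual implementation, and a trivial stand-in sits where
the real cluster-wide protection mechanism would go. The point is only the
shape: the proxy exports the same operations table as the filesystem
underneath it, adds its coordination, and then simply delegates, so
applications and the filesystem itself see nothing new.

/*
 * Hypothetical sketch of a proxy (interposition) layer over an
 * unmodified local filesystem. Not Sun code; illustration only.
 */
#include <stdio.h>
#include <stddef.h>

/* The interface every filesystem exposes (a vnode-ops style table). */
struct fs_ops {
    int (*read)(const char *path, char *buf, size_t len, long off);
    int (*write)(const char *path, const char *buf, size_t len, long off);
};

/* Pretend "local" filesystem, left completely unmodified. */
static int local_read(const char *path, char *buf, size_t len, long off)
{
    printf("local fs: read  %s off=%ld len=%lu\n", path, off, (unsigned long)len);
    return 0;
}

static int local_write(const char *path, const char *buf, size_t len, long off)
{
    printf("local fs: write %s off=%ld len=%lu\n", path, off, (unsigned long)len);
    return 0;
}

static struct fs_ops local_fs = { local_read, local_write };

/* Stand-in for whatever protection mechanism is really used. */
static void cluster_enter(const char *path) { printf("proxy: protect %s\n", path); }
static void cluster_exit(const char *path)  { printf("proxy: release %s\n", path); }

/* The proxy layer: same signatures, coordination added, then delegation. */
static struct fs_ops *underlying = &local_fs;

static int proxy_read(const char *path, char *buf, size_t len, long off)
{
    int rc;
    cluster_enter(path);
    rc = underlying->read(path, buf, len, off);
    cluster_exit(path);
    return rc;
}

static int proxy_write(const char *path, const char *buf, size_t len, long off)
{
    int rc;
    cluster_enter(path);
    rc = underlying->write(path, buf, len, off);
    cluster_exit(path);
    return rc;
}

static struct fs_ops proxy_fs = { proxy_read, proxy_write };

int main(void)
{
    char buf[64];
    /* Applications keep calling the interface they always did. */
    proxy_fs.write("/global/export/data", "hello", 5, 0);
    proxy_fs.read("/global/export/data", buf, sizeof buf, 0);
    return 0;
}

The extra behaviour lives entirely in the wrapper; the table underneath
could belong to any filesystem with the same interface, which is why the
existing APIs survive untouched.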
> "Figure 12 illustrates this scenario. A client on one node makes a request to a
> server on another node. Each server has a secondary node that can be used for
> failover if necessary. The server sends "checkpoint" messages (meta data) to the
> secondary server for each write operation to guarantee the integrity of data in
> case of failover. If the primary server fails for any reason, the secondary
> server assumes the identity of the primary, and the framework automatically
> redirects requests so that replication and failover are transparent to the
> client."
>

It depends on the topology of the storage subsystem. In effect, if only one
node has access to a disk, then that node has to access the device for any
other node that may require a block from the device. The mechanism used to
do this between nodes would use SCI or reflective memory; reflective memory
is the same technology as Memory Channel. This is useful for devices that
cannot be connected to multiple nodes. However, if both nodes have access to
the disk, then this is not necessary. FC-AL storage area networks give you
the ability to have multiple nodes connected to arrays or disks, and
currently most of the storage that Sun ships is FC-AL based. In this case
each node's access to the disk would be direct rather than via another node.

> Maybe an extension of NFS serving. How can this be a cluster common
> file system? Seems distributed to me. Perhaps marketing semantics.
>

Again, it depends on the architecture of the storage subsystems and the way
that the storage is connected to the nodes that make up the cluster. Using,
say, directly connected SCSI JBOD would result in what looks like NFS file
serving. It would of course be much lower level than NFS, which is RPC based
and uses an external data representation for all data that is moved over the
network. However, the Cluster File System also allows you to use a storage
area network with all the storage connected to all the nodes; this results
in each node directly accessing the shared storage.

> So let me get this straight. All writes are checkpointed in case
> of failover? Why not an IO Database to track all IOs? Then maybe
> in the future you can move that IO Database into shared memory ;-).
>

The checkpoint is for the metadata, not the actual I/O itself; in effect the
metadata checkpoint is the database that tracks all the I/Os. An example of
this is where you have two nodes, both with direct access to a shared
storage device. Node A writes one block directly to disk and checkpoints the
metadata; its direct disk connection then fails, and the next write is
routed via node B and checkpointed. Of course this is not necessary if you
use dual-connected devices like FC-AL disks, because you have two paths from
each host to each disk.
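To give a feel for what a metadata checkpoint has to carry, here is a toy
illustration. The record layout, field names and functions are invented for
the example; they are nothing like the real wire format. The idea is only
that the secondary needs enough to know which operations were in flight, not
the data blocks themselves.

/*
 * Hypothetical sketch of "checkpoint the metadata, not the data".
 * Illustration only; not the actual protocol.
 */
#include <stdio.h>

struct ckpt_record {
    unsigned long seq;      /* per-server sequence number           */
    unsigned long file_id;  /* which file the write touched         */
    unsigned long offset;   /* where in the file                    */
    unsigned long length;   /* how much                             */
    int           state;    /* 0 = started, 1 = confirmed on disk   */
};

/* Stand-in for the secondary's copy of the checkpoint "database";
 * in reality the records travel over the cluster interconnect. */
static struct ckpt_record log_on_secondary[128];
static int nrecords;
static unsigned long next_seq;

static void send_checkpoint(struct ckpt_record r)
{
    int i;
    for (i = 0; i < nrecords; i++) {
        if (log_on_secondary[i].seq == r.seq) {
            log_on_secondary[i] = r;   /* update the in-flight record */
            return;
        }
    }
    log_on_secondary[nrecords++] = r;
}

static void primary_write(unsigned long file_id, unsigned long off, unsigned long len)
{
    struct ckpt_record r;
    r.seq = ++next_seq; r.file_id = file_id; r.offset = off; r.length = len; r.state = 0;
    send_checkpoint(r);     /* tell the secondary before the write     */
    /* ... the data block itself goes straight to the shared disk ...  */
    r.state = 1;
    send_checkpoint(r);     /* and again once it is safely on disk     */
}

/* On failover the secondary only has to sort out operations that were
 * started but never confirmed; the data is already on the shared disk. */
static void secondary_takeover(void)
{
    int i;
    for (i = 0; i < nrecords; i++)
        if (log_on_secondary[i].state == 0)
            printf("recheck file %lu offset %lu\n",
                   log_on_secondary[i].file_id, log_on_secondary[i].offset);
}

int main(void)
{
    struct ckpt_record r;

    primary_write(42, 0, 8192);
    primary_write(42, 8192, 8192);

    /* Pretend the primary died after starting a third write but before
     * confirming it, and the secondary then takes over. */
    r.seq = ++next_seq; r.file_id = 42; r.offset = 16384; r.length = 8192; r.state = 0;
    send_checkpoint(r);
    secondary_takeover();
    return 0;
}

Because the data went straight to the shared device, the secondary only has
to resolve the operations it never saw confirmed; it never has to carry the
blocks themselves.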
> But what I find very disturbing, calling into question the
> reliability of the file system, is this statement:
>
> "All disks in the cluster are replicated with mirroring (RAID1,5), so data is
> protected at the disk level as well."
>

This is only to protect data at a block level from disk drives failing.

> Man ... if I were you I would get ahold of the marketing folks
> quick as a bunny, have them shrink the font of that puppy and bury it
> in a footnote somewhere. Or perhaps re-word it a bit:
>

No, in this case the marketing folks are right: RAID 1 or 5 is simply to
protect against disk or controller failure.

> "This new File System is lightning quick and sometimes has a tendency to
> scramble a disk ( in *very* RARE situations, but our analysis proves it is
> possible ) so what we are recommending is that you Mirror or RAID5 your data
> just in case. We sure as heck don't want to be sued if you lose a very large
> transaction and you would like to keep your job wouldn't you? Contact our
> marketing department for a FREE special analysis whitepaper on how to sell
> Upper-Management on RAIDing every drive you own, ask for CLUS08-RAID."
>

You are getting it the wrong way round: mirroring storage will not protect
you from an unreliable filesystem, since the mirroring layer sits under the
filesystem. If my filesystem writes a corrupted block to a mirrored disk,
the mirroring system simply replicates the bad block. In the same way,
mirroring does not help me if I delete a file by mistake, since the file is
deleted from both legs of the mirror.

In practice filesystems tend to be very reliable, unless you are running
something like NT or Linux with async writes for data and metadata. You are
much more likely to see corruption introduced into files by applications,
and your filesystem does not help you here.

As an example, the Ultra server that sits under my desk has been there since
1996. It is my own personal workstation (it has a head), and it is also a
web server and fileserver. Our office has had a number of power outages in
this time, and since my server is not in the machine room with a UPS it has
been down each time. We have not, to my knowledge, lost any data yet, and it
is a fairly heavily used machine. On the other hand, a beta version of
Netscape trashed the contents of my .netscape directory 6 months ago.

regards

Andrew Harrison
Enterprise IT Architect