Linux High Availability HOWTO: Attaching Shared Storage

7.1 Disk Technologies -- SCSI

Initially, Linux-HA will support multi-tailed SCSI attachments. One or more external storage boxes will be attached to two or more nodes.

This requires a new piece of hardware that currently doesn't exist in the mass PC market: a Y cable which attaches to the SCSI adapter in a node (see diagram) and allows for external bus termination. If internal SCSI termination is used on the adapter, an adapter or power outage will also disable the active termination on this adapter. The remainder of the SCSI installation will at best react unpredictably. The Y cable either attaches the bus terminator resistor pack or another SCSI F/W/D cable leading to the next node. The ends of the bus need to be terminated. A simple example is shown below, featuring 2 cluster nodes and 2 external disks or disk boxes.

Since the nodes need not be at the end of the SCSI chain, a symmetrical Y cable is needed which can attach either 2 cables or one cable and one terminator pack.

Only differential SCSI will be supported, the reason being the lack of stability of long single-ended cables caused by electrical crosstalk. Ideally, you will want to use Ultra or fast/wide/differential SCSI buses. Single-ended Ultra SCSI (and probably all potential upcoming parallel SCSI technologies with even higher transfer rates) is sort of unusable for multi-host attachments due to cable length restrictions.

Due to the limitations in the ANSI SCSI standard, the maximum bus length for Fast and Ultra wide/differential SCSI is 25 meters except if you use fiberoptic or other bus extenders. This has to be evaluated.

If you use SCSI, please make sure you use multiple adapters in a node, multiple buses and multiple external storage devices (potentially with disk mirroring). Otherwise, an adapter or cable outage will cause a node failover.

In multi-host attachments, make sure the adapter SCSI IDs are all different.

For BusLogic (or Adaptec), the SCSI ID can be set from AutoSCSI (or SCSI-Select), the configuration utilities that are available by typing Ctrl-B (or Ctrl-A) during the host adapter's BIOS initialization. The Host Adapter SCSI ID is stored in nonvolatile memory along with the other configurable parameters. (Leonard Zubkoff) The same is the case with the Symbios Logic 8751D card -- the SCSI ID can be changed from the ROM configuration utility (type Ctrl-C during boot) and is stored in the NVRAM.

On a W/D SCSI bus, there are only 16 usable SCSI addresses. This limits the number of devices that can be attached. External RAID boxes are probably the best bet. Make sure these boxes are designed with redundancy, that is multiple power supplies etc. Some RAID boxes are "host-based", which means they need device drivers on the host they are attached to.

When it comes to adapters, only a subset of the adapters supported by the standard Linux kernel will be supported by Linux-HA, the reason being the need for standardized error messages for certain HA relevant situations (e.g. SCSI adapter permanent failure, disk intermittent error etc.). Authors of other device drivers are invited to join in and re-write their error reporting code. This should be coordinated by the Linux SCSI maintainer (currently Leonard Zubkoff as he suggested this; he wrote me, In fact, the place we really need this is in the mid and upper levels, not so much in the drivers themselves. If the drivers return uniform results from the last resort error recovery functions, then the mid/upper levels can make sure the appropriate uniform HA-aware messages are emitted.).

Please keep in mind that the disks may also be a SPOF. You should either use RAID boxes or mirror the data using the MD driver. A setup that is definitely better than the previous one is shown below. It mirrors data across multiple disk adapters, disks and the associated cabling. Running software RAID across a single SCSI channel doesn't make very much sense, for performance and availability reasons.

There are also some RAID boxes which have multiple SCSI connectors, e.g. the ones from Kingston Technology. Thus, Y cables are not needed. (Thanks to Chel van Gennip for this hint.)

Sharing The Same TERMPWR Line

One of the downsides of twintailing external SCSI disks (or disk boxes) to multiple nodes is that those nodes and the disks need to be attached to the same electrical power source, or more so to the same phase!

The reason is that the SCSI attachment is not potential free, and the devices share the same ground lines. For differential SCSI this doesn't harm as far as the data lines are concerned (up to a certain limit) but the termination power line (TERMPWR) is always asymmetrical and protected by a fuse.

Now if one of the devices/machines attached to the bus draws much more power from the power outlet, there will be a potential difference on the individual ground lines (due to the fact that the wire resistance is not equal to zero) which can be up to several volts. This will most probably blow the TERMPWR fuses on one or more of the termination power suppliers on the SCSI bus. This has been proven by practical experience in the past ...

The bottomline is, attach all the SCSI devices including the nodes to the same power outlet, as close as possible, to reduce GND potential shifts. BUT: this power outlet will be a SPOF which has to be protected by a UPS!

7.2 Disk Technologies -- Serial Storage Architecture (SSA)

Serial Storage Architecture (SSA) is a high performance serial computer and peripheral interface that is being developed by X3T10.1, a Task Group of the X3T10 Technical Committee of Accredited Standards Committee X3. Initially developed by IBM, SSA is today an open technology promoted by the SSA Industry Association.

SSA is a serial technology which basically runs the SCSI-2 software protocol. This means device drivers for SSA adapters should be easily integrated into the existing Linux SCSI subsystem. The bottomline is, data is transferred via twisted-pair cables running at 200 MBit/s full-duplex which is a lot easier to handle than the 68-wire parallel Wide SCSI technology. For more information see the SSA Fact Sheet.

The good news about SSA compared to SCSI is:

it is far easier configured and cabled -- no termination needed!
it is built with HA features. The SSA loop architecture (as opposed to a SCSI bus) has no SPOF (see diagram below). If part of a loop fails, the device driver will automatically and transparently reconfigure itself to make sure all SSA devices can be accessed without any noticable interruption.
it uses no SCSI ID addressing which means no hassle with setting up the adapters.
the SSA loop can transport 4 times 20 MByte/s -- two independent reads and two independant writes across each loop direction. Current actual adapter implementations allow for 35 MByte/s per adapter.
SSA uses no bus arbitration as opposed to SCSI. Rather than that, a network-like scheme is used. Data is sent and received in 128 Byte packets, and all devices on the loop can request time slots independantly. SCSI in turn needs bus arbitration which can lead to performance deadlocks if an initiator doesn't release the bus in time.
SSA allows for 25 meters between each two devices. Plus, there is a fiberoptical extender which allows for data transfers across 50 Micrometer optical cables over distances up to 2.4 km. This makes it even suitable for site disaster recovery if configured properly.
Most SSA adapters support two independent loops which makes it possible to attach mirrored disks to different loops for higher availability.
The SSA loops are symmetrical, twisted-pair, potential free. No TERMPWR potential shift problem.

The simple example shown in the diagram can be enhanced by adding a second SSA adapter to each node, enabling this node to perform an intra-node adapter failover.

There are also some downsides of SSA:

Practically, SSA disks can only be purchased from IBM. Although there are commitments from other disk manufacturers that are members of the SSA Industry Association, no one has as of yet jumped on the bandwagon to actually sell disks. The disk prices are comparable to SCSI disks that can be bought from the various Unix hardware vendors -- and these prices are much higher than for disks from the mass PC market. These disks are more rugged, though, and they are tested more thoroughly before leaving the manufacturing plant. Please keep that in mind before complaining about disk prices. Customers do not want to put their businesses on risk just because a disk is a mere 1000 USD cheaper.

Linux SSA Device Driver

There is no Linux SSA device driver today, however IBM Systems Storage Division (SSD) will support us to get one written for the IBM PCI SSA adapter -- and possibly officially endorse it, at least that is the usual process. There are at least two well-known SCSI device driver authors in the Linux community who are more than keen to write the device driver. I will get loan hardware and the device driver development toolkit in time. The loan contract will only be valid for 90 days, though, after which we will probably need to return the loan hardware. Donations are more than welcome!

I also contacted Pathlight Technology Inc.. Pathlight offers a single loop PCI SSA adapter which is simpler and probably cheaper than the IBM adapter. Pathlight is also going to support us but they currently have a problem with releasing the driver source code because it contains some parts which are considered confidential.

SSA documentation is available for free on the SSA Industry Association's Documentation and Standards page or from the Symbios Logic, Inc. anonymous FTP server.

A couple of prerequisites have to be fulfilled prior to getting the SSA device driver reasonably done. A major difference of SSA compared to external RAID boxes is that all disks are seen by the operating system individually. This currently limits us to 16 SCSI/SSA disks due to the major/minor device number scheme in the Linux kernel. (It appears the current development trend is to go for a 64 bit dev_t value which would allow us to use at least 16 bits for the minor device number.)

Remark: There is also a PCI SSA RAID adapter from IBM which does RAID levels 0 (striping), 1 (mirroring) and 5 (striping with parity) in hardware. The current adapter firmware doesn't support more than one adapter in a loop, however, which makes it unusable for multi-host attachments. This could change before too long. RAID 5 organized bunches of disks are presented to the operating system as single entities, and the adapter can group up to 14 plus 1 disks in a RAID 5 configuration. Since the biggest SSA disk is currently 9.1 GB, this sums up to 14 times 9.1 GB = 127.4 GB as a single entity. Please keep in mind that due to parity sum calculations which are needed for every write operation, mirrored setups (RAID level 1) are usually faster on writes than RAID level 5.

According to the SCSI development folks, the current Linux SCSI mid layer code is too slow to handle SSA data rates efficiently. This is a known problem which will hopefully be fixed as soon as there is a SSA device driver available.

Design Request: I herewith ask the Linux SCSI developers/maintainers to fix this as soon as possible, as well as re-design the disk and device numbering scheme to support more than 16 SCSI disks.

The SSA driver will also need to support SSA Target Mode (see section Non-IP Heartbeat).

The Vicom SLIC

Another way to attach SSA disks to a Linux machine is to use the so-called Serial Loop Interface Card (SLIC) built by Vicom Systems, Inc.. This device is also offered by IBM with the feature code #7190. The SLIC basically converts a F/W/D SCSI bus to a single SSA loop which can have up to 32 disks. The SLIC presents the disks to the operating system as 32 SCSI LUNs (logical units) on a single SCSI ID (remember we still have the 16 SCSI disks limitation inside Linux). The good news is that as long as the SLIC is the only SCSI target on the SCSI bus, there is no bus arbitration overhead. This results in a maximum sustained throughput of 18 MByte/s. The current SLIC supports only one SSA initiator per loop which is not good for a twin-tailed configuration but the requirement for multiple-initiator is recognized and will eventually be fulfilled. There will also be Ultra-SCSI versions which will sort of double the throughput.

Pathlight Technology Inc.

Pathlight who was mentioned before is also going to release several SCSI-to-SSA converter tools. At NAB 97, they announced some interesting new products: a SSA 16 port hub is ready and in production, SSA Network Agent which interconnects SSA and Ultra Wide SCSI and Ethernet in one box and enables SCSI and SSA initiators and targets to transfer data between each other. Also they announced a new small black box called "Magic" that extends the SSA cable length to 100 meters (300 feet) on twisted pair copper cable.

7.3 Other Disk Technologies

Other disk technologies may be supported when they become available, e.g. Fibre Channel Arbitrated Loop (FC-AL), IEEE 1394 ("Firewire") etc.

7.4 Raw Disk Devices

Currently, all /dev/sd* and /dev/hd* devices are buffered, i.e. writes are not committed instantly (although the Linux kernel code suggests otherwise; in reality O_SYNC is there but not implemented). Instead, reads and writes are routed through the buffer cache. While this is probably nice for filesystem performance, this is not good for storage ruggedness. Several database products (e.g. Adabas D and Yard) could do better with raw devices, having entire control of what's written on the disk. I was also told that the port of Oracle for Linux was stopped due to the fact that raw devices weren't there. Plus, concurrent access is not possible at all with the current buffered devices.

I appreciate any hint on any other database which could be accomodated by raw devices.

According to the SCSI development folks, raw devices will eventually be introduced during the SCSI mid layer code review.

7.5 Accessing Disks Safely

We need to make sure disks are only accessed by the "active" node. There is no way to mount filesystems on multiple nodes at the same time since there is currently no locking mechanism. (If you need to concurrently access filesystems, I recommend to use a networked filesystem like NFS instead. NFS V2 is a bad performer, though.)

AIX has a mechanism called "varying on/off a volume group". Since we have no Logical Volume Manager yet in Linux (although there is a project in the works, see the Linux Virtual Partition System Project, see also here), there is no such mechanism. We can simulate the desired behaviour, though, if we load/unload the appropriate device driver modules during a takeover. There are two scenarios to this:

load/unload the SCSI disk module sd.o -- this is only possible if the internal disks aren't SCSI but rather EIDE.
load/unload the respective SCSI adapter device driver module -- this is only possible if potential internal SCSI disks are attached to a different adapter type, e.g. NCR vs. DPT.
activate/deactivate SCSI disk by command via the /proc/scsi/scsi interface

Module loading/unloading must not be done through kerneld because Linux-HA needs to control disk access. We will use explicit insmod/rmmod calls instead.

The same logic will apply to SSA as well.

The cluster manager daemon (or a subprocess) may run a "disk watchdog" which attempts to read/modify/write specific data blocks on the disks in short intervals to make sure they are still alive. This will also check the adapters, cables and bus termination (and an SSA SLIC if present) at a small performance penalty. This is only possible if there are raw disk devices, though, otherwise we'll always read the buffer cache instead of the disk.

SCSI disks are commonly reserved by an initiator by using the "SCSI Reserve" command. After a node failure, the Reserve Flag will still be set on the disks so that the takeover node will have to force a "break reservation". This feature isn't implemented in the SCSI code yet.

The third way of activating/deactivating disks via the /proc/scsi/scsi interface could lead to the situation where an integrating node has the disks enabled first and needs to immediately deactivate it. Plus, when handling a large number of disks, this can become very difficult and error-prone. It allows for having external and internal SCSI disks attached to the same adapter type, though.

7.6 Redundant Disk Setups

As long as there is support for just 16 SCSI disks, many users will go for external RAID boxes. On the other hand, the standard solution in the commercial AIX market is software mirroring on SSA disks. The md (multiple devices) driver which is in the standard kernel allows for mirroring, striping and concatenation on a partition basis. RAID 5 is currently under development.

If the requirement for raw disk devices (see section Raw Disk Devices) holds true, md also has to provide for them.

7.7 Filesystems

Log Structured Filesystems

The current quasi-standard Extended-2 filesystem (ext2) is pretty stable and robust on a single machine. However in case of a node failure, filesystems are not unmounted cleanly, causing time consuming filesystem checks on the takeover node. Plus, it is not guaranteed that an automatic filesystem check (fsck -a -A) detects and repairs all errors encountered. Manual intervention may be needed.

This is completely and utterly unusable in a HA environment where customers expect a takeover node to be up and running within a couple of seconds or minutes without user intervention.

Thus, we need filesystems which work in a transaction-oriented manner: so-called log-structured filesystems. Currently, 4.5 potential solutions are identified:

The Linux Log-structured Filesystem Project which was set up to write a log-structured filesystem off scratch, based on the 4.4BSD LFS. Some people say BSD LFS is a bad performer. Currently, development seems to be stuck.
The Windows NT Filesystem (NTFS) which is written and maintained by Martin v. Loewis (see http://www.informatik.hu-berlin.de/~loewis/ntfs/). NTFS is transaction oriented as far as metadata are concerned and should have the properties that are needed. Currently the Linux NTFS driver is read-only, though.
Adam Richter of Yggdrasil is working on a yet incomplete compressed log structured filesystem which in the Fall 1994 and Fall 1995 releases of Yggdrasil Plug and Play Linux in /usr/src/linux/fs/logfs. You can find a modularized but still not yet functional version of this filesystem in: ftp://ftp.yggdrasil.com/private/adam/linux-2.0.12.ygg.tar.gz. There is a matching mkfs program in: ftp://ftp.yggdrasil.com/private/adam/mklogfs.c.
The Enhanced File system project by Russell Coker. It is still in the design stage.
Purportedly, ext3fs should be journaling too (not log-structured though). Plus, Stephen Tweedie works on ext2 logging. Possibly, the changes will be made in the VFS layer so that all filesystems accessing the VFS layer will benefit from logging.

The work on a transaction-oriented filesystem probably needs a major part of the work for Linux-HA. Developers are invited to join one of the projects to get it done sooner.

In the meantime, we have to live with the ext2 filesystem running on buffered block devices. Remember there is no real O_SYNC flag! We can't even really mount filesystems in synchronous mode.

When taking over filesystems, we need to make sure /etc/fstab is in sync on all attached nodes. If the number of internal SCSI disks is the same on all nodes (or if you only have internal EIDE disks), the interesting portion of /etc/fstab simply needs to be copied around. Otherwise we will have to adjust the entries for the external disks accordingly. Alternatively, the configuration interface may include a facility to configure/match disk partitions and filesystems, and the disk takeover event scripts may explicitly mount the filesystems according to the configuration database.

Since the partition table will be read off the disks we do not need to do anything about synchronizing them across nodes. This is true even if something was changed on the active node. Only if disk partitions / filesystems are added, /etc/fstab has to be updated accordingly.

When performing a "clean" (i.e. manually controlled) failover, we should run lsof to find and possibly close all files on the external disks, as well as shut down all applications residing there. This might not always work, for example if an application has crashed and is in a zombie state. In this case, a disk may not be cleanly released by the node leaving the cluster. For this reason, it is highly recommended to not run any applications off the external storage! Only the application's user data should be placed there.

You might consider mirroring the internal disks (and possibly the root partition as well) to make sure a machine doesn't fail when an internal disk fails. This is not trivial but will hopefully be documented in a mini-HOWTO. It was discussed on the linux-raid mailing list recently.

NFS Filesystems

It is sometimes useful to cross-mount filesystems via NFS, for example if data that is mounted on one cluster node must also be accessed on other cluster nodes. Commercial HA software usually uses normal NFS mounts but there is a downside to this. If the node which has the disks mounted locally fails, the takeover node will have to unmount the NFS mounts prior to locally mounting the disks (and taking over the IP address, thus taking over the former NFS server's functionality). Since the former NFS server is gone, umount fails, and the umount process will have to wait for the RPC timeout, causing long takeover times.

It is better to use the 4.4BSD automounter amd instead. amd has two nice features which makes it suit well into this scenario:

it unmounts filesystems cleanly if a NFS server fails, not causing any RPC timeout;
it can mount a filesystems from multiple different sources according to a priority list (try to mount via NFS first, else attempt to mount locally).

Other Filesystems

I was hinted towards the Coda Filesystem Project. It promises a fault-tolerant networked filesystem, and it looks as if it could be useful to replace NFS in a HA NFS server constallation.

7.8 Other internal storage devices

The only devices for which SCSI is needed internally today are high capacity tape devices such as DAT or Exabyte. If a client/server backup solution is available, there is possibly no need for local tape drives. Internal CDROM drives can be ATAPI, attached to one of the EIDE channels on the mainboard. Internal disks can be EIDE as well, and today there are EIDE disks with more than 6 GB capacity. Two such disks, mirrored with md will provide for high capacity and redundancy. The disks should be attached to different EIDE channels on the mainboard for availability and performance reasons.