
6. Overall Linux-HA Structure

6.1 Hardware Aspects

PCI Slot Issues

Mainboards used for Linux-HA enabled servers should have enough I/O slots for the number of network and disk adapters the desired redundancy requires. In detail, you will probably need at least 2 disk adapters and 2 network adapters. If all cards are PCI, this adds up to at least 4 PCI slots; more complex setups will probably need even more. For low-speed networks (10 MBit/s Ethernet, 16 MBit/s Token Ring), ISA cards can possibly be used in order to free PCI slots for the disk adapters. There are also multifunction cards around which can potentially save slots; keep in mind, though, that such a combined card is itself a SPOF.

Symmetric Multiprocessors

SMP machines are well suited for high performance computing -- at the cost of having to reboot if a CPU fails. Statistically, a 4-CPU machine will fail four times as often as a single-CPU one. Statistically...

Main Memory

The use of ECC capable mainboards and parity memory modules is recommended. Gabriel Paubert (paubert@iram.es) will start working on an ECC handler for the Linux kernel.

Video Cards

Since we are talking about server systems, no PCI video cards are needed -- cheap ISA cards with simple 14 inch B&W monitors will do the job. Serial terminals may also fit, but they need a serial port.

6.2 Software Considerations

Need For A Standard Distribution

Since we are talking about business critical systems, there is probably no way to let customers pick whatever Linux distribution they prefer. Distributions differ too much in stability and handling to allow for that. Major Linux distribution makers are invited to talk to us about what to do.

An interesting new development is the FreeLinux Project which might suit the needs of Linux-HA. Another interesting idea is the LINNET proposal.

VARs offering the system to end customers should only support a limited range of kernel releases. This also accommodates Pathlight's concern about not releasing source code (see section Pathlight Technology Inc.).

A different approach may be the following: major Linux distribution makers contribute changes to their distributions to make sure Linux-HA integrates smoothly. This requires some more coordination but makes life easier for users and system integrators. Alternatively, they could integrate Linux-HA into their distributions and offer configuration and installation services.

Software Structure

Remark: In the following sections, a 2-node cluster will mostly be used to illustrate the concepts. Please keep in mind that we want to be open to more nodes, which means the failover concepts must provide some more logic than in these simplified examples.

The main objective is to minimize application downtime and keep the overall cluster in a consistent state. Therefore, the Linux-HA software will consist of several modules which will run on all machines.

Starting Linux-HA

Commercial solutions allow for starting the HA daemons either on reboot (e.g. via an rc script or from /etc/inittab) or manually. In a production environment, you do not want to start HA on reboot. Consider the primary server crashing for some reason: the backup machine will take over, and the primary may reboot. If HA comes up automatically, the primary node reclaims the resource group, potentially crashes again, and so on. Plus, you want to investigate the crash before putting the crashed node into production again. So, starting HA automatically on reboot is nice for customer demonstrations but certainly not recommended for a production environment.

The X/curses and command line interfaces will allow the administrator to set the start mode for the next system start, i.e. insert an appropriate /etc/inittab line.
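
As an illustration only (the actual entry name and daemon path are not defined yet, so both are assumptions), such an inittab line could look like this:

# hypothetical /etc/inittab entry: start the cluster manager once in
# runlevels 2-5, without respawning it after a crash
ha:2345:once:/usr/sbin/cluster/clstart

The configuration tools would then simply add or remove this line, depending on the start mode chosen for the next boot.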

Linux-HA will use cluster ID numbers which are common to all nodes which belong to a cluster. Nodes belonging to different clusters need to have different cluster IDs. This is handled via SNMP. When a node starts, it will query the network for any other living node with the same cluster ID number. If a living node or a cluster with the same cluster ID exists, the new node will attempt to enter the cluster but will only be allowed to if all living nodes agree. If no living node answers, the new node will assume it is the first one and will acquire the resource groups for which it is the primary.
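
The decision logic of a starting node might look like the following sketch. The helper commands (cl_query_cluster, cl_request_join, cl_acquire_primary_groups) are purely hypothetical and only stand for the functionality described above:

#!/bin/sh
# hypothetical join logic of a starting node; all helper commands are
# assumptions, not existing tools
CLUSTER_ID=1

# ask the network whether any living node answers with our cluster ID
peers=`cl_query_cluster $CLUSTER_ID`

if [ -n "$peers" ]; then
    # a cluster with our ID exists: ask all living nodes for permission
    # to join, and give up if they do not agree
    cl_request_join $CLUSTER_ID || exit 1
else
    # nobody answered: assume we are the first node and acquire the
    # resource groups for which this node is the primary
    cl_acquire_primary_groups
fi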

Stopping Linux-HA

There will be three ways of stopping Linux-HA (a sketch of a possible command line interface follows the list):

Forced Halt

Linux-HA is forcibly halted, resources are not released

Graceful Stop

Linux-HA is gracefully stopped, resources are released but not taken over by another node

Graceful Stop With Takeover

Linux-HA is gracefully stopped, resources are released and taken over by another node
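
The stop command and its options below are pure assumptions for illustration; only the three modes themselves are defined so far:

# hypothetical command line interface for stopping Linux-HA
clstop -f     # forced halt: stop the HA daemons, do not release resources
clstop -g     # graceful stop: run the release scripts, no takeover
clstop -gr    # graceful stop with takeover: release resources so that
              # another node can acquire them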

Integrating Applications

Linux-HA will initially be completely generic, that is, no application will be specifically supported. Experience shows that application specific adaptations make such an HA solution inflexible. The only interface between Linux-HA and an application will be the names of a start and a stop script or executable, which will be executed by the start_server and stop_server events, respectively.

Since you never know which application needs which permissions, everything has to run as root (keep security aspects in mind!), which means the start and stop scripts can do whatever we like them to. In principle, we can start any application that does not need user interaction during startup. Everything that can safely be pushed into the background can thus be made highly available.

Stopping the application can be achieved by calling an application specific command, sending appropriate signals to processes etc.

This way, applications will not be supervised by the Linux-HA software itself, keeping it simple and flexible. The start script, however, can start a process which supervises the application and do what is needed to either attempt to restart it or initiate a controlled and graceful failover to a standby node which in turn restarts the application.

This means integrating an application boils down to writing start and stop scripts which can run in the background. The bottom line is: if you can run these scripts from cron, they will also work inside Linux-HA.
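
As an example, a start/stop script pair might look like the following sketch. The daemon name, the paths and the PID file are assumptions (the sketch also assumes the daemon stays in the foreground, so that $! is its process ID):

#!/bin/sh
# start_linuxtest: hypothetical start script, run as root by the
# start_server event; it must not require user interaction, so the
# daemon is pushed into the background and its PID is remembered
/usr/local/bin/linuxtestd &
echo $! > /var/run/linuxtest.pid

#!/bin/sh
# stop_linuxtest: hypothetical stop script, run as root by the
# stop_server event; it stops the daemon by sending it a signal
if [ -f /var/run/linuxtest.pid ]; then
    kill `cat /var/run/linuxtest.pid`
    rm -f /var/run/linuxtest.pid
fi

If this pair runs cleanly from cron, it can be entered as the start and stop attributes of an application server in the configuration file (see class server below).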

At a later time, a clustering API might become useful which can be used by applications to control the cluster in certain ways or communicate with the cluster manager daemon.

Syslog Issues

syslogd will be used for reporting and detecting errors. In a real-life setup, make sure syslog is configured on each host to log locally and to a remote loghost. Otherwise you may not be able to find out what went wrong if a failed machine won't reboot.
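
A minimal /etc/syslog.conf fragment for this could look as follows; the name loghost is an assumption, any reachable log server will do:

# keep a local copy of everything of priority info and above ...
*.info                                  /var/log/messages
# ... and send all messages to a remote loghost as well
*.*                                     @loghost

List loghost in /etc/hosts so that remote logging does not depend on a working name server.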

It would be better to have a generic error logging interface like the AIX error logger. This way, one could handle errors from the kernel more easily. Currently, every change of a device driver or kernel error message would require a Linux-HA update.

Logging Onto A Cluster Node

It is generally a bad idea for users to log in directly on a cluster node. If you log in via TCP/IP (e.g. telnet), the connection will be lost in case of a failover. If you log in via a serial port, the connection will be lost as well, because there is no way to take over serial ports, except if you use network attached terminal adapters, for which the first rule applies.

Network Configuration And Reporting

SNMP will be used to enable remote agents to monitor and/or control a cluster. The package used will be the CMU SNMP toolkit.

Time Synchronization

The NTP protocol will be used to ensure consistent time on all nodes in a cluster. Otherwise, debugging events and errors will be really hard. Ideally, multiple radio receivers will be attached to some cluster nodes, and the NTP configuration file will be set up to synchronize the time from multiple sources. That way, timekeeping will have no SPOF. Public NTP servers on the Intranet or the Internet can also be used.
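
A hedged /etc/ntp.conf sketch with several time sources could look like this; the server names are assumptions, and the reference clock driver (the 127.127.t.u pseudo addresses of xntpd) depends on the radio receiver actually used:

# a locally attached radio clock, if available (driver type 8 is the
# generic PARSE driver used by several radio receivers)
server 127.127.8.0
# two independent time servers on the Intranet or Internet
server ntp1.example.com
server ntp2.example.com
# use the local clock only as a last resort, at a high stratum
server 127.127.1.0
fudge  127.127.1.0 stratum 10

With at least two reachable sources per node, timekeeping itself has no SPOF.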

6.3 Configuration File Contents

Here is the skeleton of the config file format I made up. Some Linux specific parts are possibly missing, but this is about where I'll start off. What we need now is a library to read/write these objects. Maybe the KConfig C++ class of KDE (the Kool Desktop Environment) is suitable. Samba uses the same file format, but Samba isn't written in C++ (and I don't know very much about C++).

The configuration utility will write the ASCII file during configuration. Configuration will be done on one node only and then distributed (synchronized) to the other nodes. Each update will require another synchronization. This way, there will be no inconsistencies. During synchronization, the config tool will also convert the ASCII file (which may contain comments for readability) into GDBM, and all the HA tools will only use the GDBM version for speed.
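
The synchronization step itself could be as simple as the following sketch. The config file name, the node list and the use of rcp are assumptions; any remote copy mechanism will do:

#!/bin/sh
# hypothetical config synchronization: copy the ASCII config from the
# node it was edited on to all other cluster nodes
CONFIG=/etc/linux-ha.conf          # assumed location of the ASCII config
for node in linha; do              # every cluster node except the local one
    rcp $CONFIG $node:$CONFIG || echo "sync to $node failed" >&2
done

The conversion to GDBM would then be run on each node after the copy.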

Please note that Linux-HA will only parse the file during start. This way, the file will only hold static information which is needed when starting the cluster manager daemon. Runtime information will be kept in memory dynamically (e.g. node, adapter and network status), represented by SNMP variables.

Comments start with a hash.


# the first class is "adapter". The class stores information about all
# sorts of network adapters (ethernet, rs232, fddi, tokenring etc.)

[adapter]
# the type -- determines which network heartbeat module to use
type = "ether"
# the network it's attached to
network = "ether1"
# the node it belongs to
nodename = "seneca"
# the IP address/name it has (as listed in /etc/hosts)
ip_label = "seneca"
# its function (service, boot, standby)
function = "service"
# a MAC address if appropriate
haddr = "0xdeadbeef0123"

[adapter]
type = "ether"
network = "ether1"
nodename = "seneca"
ip_label = "seneca_stby"
function = "standby"
haddr = ""

[adapter]
type = "ether" 
network = "ether1"
nodename = "seneca"
ip_label = "seneca_boot"
function = "boot"
haddr = ""

# class cluster stores the information about this cluster

[cluster]
# the cluster ID - must be unique within a logical network or subnet. Part
# of the information in each heartbeat packet. 
id = 1
# cluster name - just a string
name = "linuxtest"
# this node
nodename = "seneca"
# highest node ID in use - node IDs start with "1"
highest_node_id = 2
# highest network ID in use - network IDs start with "0"
highest_network_id = 1


# this is a leftover from a real HACMP cluster. 
# I am not quite sure if we can use this. HACMP starts some daemons
# automatically: clinfo, the cluster info daemon (an SNMP client),
# the cluster SMUX peer daemon clsmuxpd, etc.

[daemons]
nodename = "hawww1"
daemon = "clinfo"
type = "start"
object = "time"
value = "true"

# class event holds all the events (see the event section in this HOWTO)

[event]
# the event name
name = "node_up_local"
# a description
desc = "Script run when it is the local node which is joining the cluster"
# some real HACMP data, dunno whether we need it. 
setno = 0
msgno = 0
# the message number from the NLS message catalog
catalog = ""
# the executable
cmd = "/usr/sbin/cluster/events/node_up_local"
# a notify script if appropriate
notify = ""
# a pre event script if appropriate
pre = ""
# a post event script if appropriate
post = ""

# the class group holds information about resource groups

[group]
# its name
group = "linuxtest"
# type cascading, rotating, (concurrent)
type = "cascading"
# participating nodes and their priority
nodes = "seneca linha"

# class network describes the networks
[network]
# its name
name = "rs232"
# type (serial, public, private) ("network attribute")
attr = "serial"
# the network number as known to the cluster software
network_id = 0

[network]
name = "ether"
attr = "public"
network_id = 1

# class nim describes network modules. These handle the heartbeat in a
# network specific manner (RS232 is different from IP, ATM is different
# from Ethernet because it cannot do Multicast etc.)

[nim]
name = "ether"
desc = "Ethernet Protocol"
addrtype = 0
path = "/usr/sbin/cluster/nims/nim_ether"
para = ""
grace = 30
# heartbeat rate in microseconds, well ... 
hbrate = 500000
# if 12 heartbeats are missing, an alert is created.
cycle = 12

# class node describes the individual nodes
[node]
# node name
name = "seneca"
# do logging in a verbose manner (e.g. "set -x" in the event scripts or not)
verbose_logging = "high"
# node number
node_id = 1

[node]
name = "linha"
verbose_logging = "high"
node_id = 2


# class resource describes what belongs to a resource group.

[resource]
# name
group = "linuxtest"
# which service IP label(s) are in the RG
service_label = "seneca"
# all FS in a line can make updates complicated...
# all FS to mount locally
filesystem= "/usr/local/ftp /usr/local/etc/httpd"
# all FS to export explicitly
export_filesystem = "/usr/local/ftp /usr/local/etc/httpd"
# which applications (separate class) to start/stop
applications = "linuxtest"
# whether to acquire the resource group on a standby node if the primary
# isn't there: no = false, yes = true.
inactive_takeover = "false"
# only for  concurrent access. disk fencing makes sure only active nodes
# can access the shared disks 
ssa_disk_fencing = "false"

# class server describes application servers
[server]
# the name (referenced e.g. in class resource)
name = "linuxtest"
# the name of a start and stop script. Will run as root, in the background. 
start = "/usr/local/cluster/start_linuxtest"
stop = "/usr/local/cluster/stop_linuxtest"

6.4 Event Structure and Sequence

Events And Their Meanings

I propose the following event structure:

acquire_service_addr

This script is called when the local node joins the cluster or a remote node leaves the cluster. Does a boot address -> service address swap. Called by the node_up_local, node_down_remote scripts.
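
As an illustration of what such a swap could look like on Linux, here is a hedged sketch using IP aliasing; the interface name, addresses and netmask are assumptions, and the boot address could just as well be replaced instead of aliased:

#!/bin/sh
# hypothetical fragment of acquire_service_addr: bring up the service
# address as an alias on the interface currently holding the boot address
SERVICE_IP=192.168.1.10          # assumed service address (ip_label "seneca")
NETMASK=255.255.255.0            # assumed netmask
IF=eth0                          # assumed interface of the boot address

ifconfig $IF:0 $SERVICE_IP netmask $NETMASK up

Clients with stale ARP entries will only see the new owner after a gratuitous ARP or a MAC address takeover (see cl_swap_HW_address below).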

acquire_takeover_addr

This script is called when a remote node leaves the cluster. Does a standby_address -> takeover_address swap if a standby_address is configured and up. Called by the node_down_remote, node_up_local scripts.

config_too_long

This script is periodically called as a timeout when the current event takes too long. Is primarily used to notify an operator. Called by the cluster manager daemon.

event_error

This script is called when a running event script returns an error code != 0. Called by the cluster manager daemon.

fail_standby

This event script is called when a standby adapter goes down. Called by the cluster manager daemon.

get_disk_fs

This script activates the disks and mounts filesystems. Called by the node_up_local, node_down_remote scripts.
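
A minimal sketch for the resource group used in the configuration example above; the device names are assumptions and would in reality come from the configuration, and shared storage additionally needs some kind of reservation or fencing to keep the other node off the disks:

#!/bin/sh
# hypothetical fragment of get_disk_fs for the "linuxtest" resource group:
# mount the filesystems listed in the resource class
mount /dev/sda1 /usr/local/ftp
mount /dev/sdb1 /usr/local/etc/httpd
# exporting the filesystems listed in export_filesystem would follow here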

join_standby

This event script is called when a standby adapter goes up. Called by the cluster manager daemon.

network_down

This event script is called when a network goes down (all of the network adapters on a physical network are down or unreachable). Called by the cluster manager daemon. Has an associated complete script.

network_up

This event script is called when a network goes up. Called by the cluster manager daemon. Has an associated complete script.

node_down

This script is called when a node leaves the cluster. Called by the cluster manager daemon. Calls the respective sub-event script for local or remote. Has an associated complete script.

node_down_local

This script is called when the local node leaves the cluster. Called by node_down. Has an associated complete script.

node_down_remote

This script is called when a remote node leaves the cluster. Called by node_down. Has an associated complete script.

node_up

This script is called when a node joins the cluster. Called by the cluster manager daemon. Calls the respective sub-event script for local or remote. Has an associated complete script.

node_up_local

This script is called when the local node joins the cluster. Called by node_up. Has an associated complete script.

node_up_remote

This script is called when a remote node joins the cluster. Called by node_up. Has an associated complete script.

release_service_addr

This script is called when the local node leaves the cluster. Called by node_down_local.

release_takeover_addr

This script is called if the local node has the remote node's service address on its standby adapter, and either the remote node re-joins the cluster or the local node leaves the cluster gracefully. Called by node_down_local, node_up_remote.

release_disk_fs

This script unmounts filesystems and releases the disks.

start_server

This script starts the server application. Called when the local node is completely up or a remote node has finished leaving the cluster. Called by node_up_local_complete, node_down_remote_complete. Args: server_name.

stop_server

This script is called to stop the application server(s) when a local node leaves the cluster or a remote node joins the cluster. Called by node_down_local, node_up_remote.

swap_adapter

This event script is called when the service address of a node goes down. The cluster manager then swaps the service adapter with the standby adapter. Called by the cluster manager daemon. Has an associated complete script.

Plus, there are several sub-event scripts; some of the most important are:

cl_swap_IP_address

This script is used during adapter swap and IP address takeover. Swaps either a single adapter's address (first form) or two adapters at a time (second form).

cl_swap_HW_address

This script is used to swap the MAC address of an adapter.
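
On Linux, the MAC address of an Ethernet adapter can be changed with ifconfig, provided the driver supports it. A hedged sketch follows; the interface and the locally administered address are assumptions, and most drivers require the interface to be down while the address is changed:

#!/bin/sh
# hypothetical fragment of cl_swap_HW_address
IF=eth1                          # assumed standby interface
NEW_MAC=02:de:ad:be:ef:01        # assumed locally administered MAC address

ifconfig $IF down
ifconfig $IF hw ether $NEW_MAC
ifconfig $IF up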

Event Sequence

This subsection will be filled later.

