
6. Overall Linux-HA Structure

6.1 Hardware Aspects

PCI Slot Issues

Mainboards used for Linux-HA enabled servers should have enough I/O slots for the number of network and disk adapters the desired redundancy requires. In detail, you will probably need at least 2 disk adapters and 2 network adapters. If all cards are PCI, this adds up to at least 4 PCI slots; more complex setups will probably need even more. For low-speed networks (10 MBit/s Ethernet, 16 MBit/s Token Ring), ISA cards can possibly be used in order to free PCI slots for the disk adapters. There are also multifunction cards around which can potentially save slots; keep in mind, though, that such a combined card is itself a SPOF.

Symmetric Multiprocessors

SMP machines are well suited for high performance computing -- at the cost of having to reboot if a CPU fails. Statistically, a 4-CPU machine will fail four times as often as a single-CPU one. Statistically...

Main Memory

The use of ECC capable mainboards and parity memory modules is recommended. Gabriel Paubert (paubert@iram.es) will start working on an ECC handler for the Linux kernel.

Video Cards

Since we are talking about server systems, no PCI video cards are needed -- cheap ISA cards with simple 14 inch B&W monitors will do the job. Serial terminals may also fit, but they need a serial port.

6.2 Software Considerations

Need For A Standard Distribution

Since we are talking about business critical systems, there is probably no way to let customers pick whatever Linux distribution they prefer. Distributions differ too much in stability and handling to allow for that. Major Linux distribution makers are invited to talk to us about what to do.

An interesting new development is the FreeLinux Project which might suit the needs of Linux-HA. Another interesting idea is the LINNET proposal.

VARs offering the system to end customers should only support a limited range of kernel releases. This also accommodates Pathlight's concern about not releasing source code (see section Pathlight Technology Inc.).

A different approach may be the following: major Linux distribution makers contribute changes to their distributions to make sure Linux-HA integrates smoothly. This requires some more coordination but makes life easier for users and system integrators. Alternatively, they could integrate Linux-HA into their distributions and offer configuration and installation services.

Software Structure

Remark: In the following sections, a 2-node cluster will mostly be used to illustrate the concepts. Please keep in mind that we want to be open to more nodes, which means the failover concepts must provide some more logic than in these simplified examples.

The main objective is to minimize application downtime and keep the overall cluster in a consistent state. Therefore, the Linux-HA software will consist of several modules which will run on all machines.

Starting Linux-HA

Commercial solutions allow for starting the HA daemons either on reboot (e.g. via an rc script or from /etc/inittab) or manually. In a production environment, you do not want to start HA on reboot. Consider the primary server crashing for some reason: the backup machine will take over, and the primary may reboot. If HA comes up automatically, the primary node reclaims the resource group, potentially crashes again, and so on. Plus, you want to investigate the crash before putting the crashed node into production again. So, starting HA automatically on reboot is nice for customer demonstrations but certainly not recommended for a production environment.

The X/curses and command line interfaces will allow the administrator to set the start mode for the next system start, i.e. insert an appropriate /etc/inittab line.
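
As an illustration only (the actual entry name and daemon path are not defined yet, so both are assumptions), such an inittab line could look like this:

# hypothetical /etc/inittab entry: start the cluster manager once in
# runlevels 2-5, without respawning it after a crash
ha:2345:once:/usr/sbin/cluster/clstart

The configuration tools would then simply add or remove this line, depending on the start mode chosen for the next boot.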

Linux-HA will use cluster ID numbers which are common to all nodes which belong to a cluster. Nodes belonging to different clusters need to have different cluster IDs. This is handled via SNMP. When a node starts, it will query the network for any other living node with the same cluster ID number. If a living node or a cluster with the same cluster ID exists, the new node will attempt to enter the cluster but will only be allowed to if all living nodes agree. If no living node answers, the new node will assume it is the first one and will acquire the resource groups for which it is the primary.
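
The decision logic of a starting node might look like the following sketch. The helper commands (cl_query_cluster, cl_request_join, cl_acquire_primary_groups) are purely hypothetical and only stand for the functionality described above:

#!/bin/sh
# hypothetical join logic of a starting node; all helper commands are
# assumptions, not existing tools
CLUSTER_ID=1

# ask the network whether any living node answers with our cluster ID
peers=`cl_query_cluster $CLUSTER_ID`

if [ -n "$peers" ]; then
    # a cluster with our ID exists: ask all living nodes for permission
    # to join, and give up if they do not agree
    cl_request_join $CLUSTER_ID || exit 1
else
    # nobody answered: assume we are the first node and acquire the
    # resource groups for which this node is the primary
    cl_acquire_primary_groups
fi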

Stopping Linux-HA

There will be three ways of stopping Linux-HA (a sketch of a possible command line interface follows the list):

Forced Halt

Linux-HA is forcibly halted, resources are not released

Graceful Stop

Linux-HA is gracefully stopped, resources are released but not taken over by another node

Graceful Stop With Takeover

Linux-HA is gracefully stopped, resources are released and taken over by another node
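
The stop command and its options below are pure assumptions for illustration; only the three modes themselves are defined so far:

# hypothetical command line interface for stopping Linux-HA
clstop -f     # forced halt: stop the HA daemons, do not release resources
clstop -g     # graceful stop: run the release scripts, no takeover
clstop -gr    # graceful stop with takeover: release resources so that
              # another node can acquire them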

Integrating Applications

Linux-HA will initially be completely generic, that is, no application will be specifically supported. Experience shows that application specific adaptations make such an HA solution inflexible. The only interface between Linux-HA and an application will be the names of a start and a stop script or executable, which will be executed by the start_server and stop_server events, respectively.

Since you never know which application needs which permissions, everything has to run as root (keep security aspects in mind!), which means the start and stop scripts can do whatever we like them to. In principle, we can start any application that does not need user interaction during startup. Everything that can safely be pushed into the background can thus be made highly available.

Stopping the application can be achieved by calling an application specific command, sending appropriate signals to processes etc.

This way, applications will not be supervised by the Linux-HA software itself, keeping it simple and flexible. The start script, however, can start a process which supervises the application and do what is needed to either attempt to restart it or initiate a controlled and graceful failover to a standby node which in turn restarts the application.

This means integrating an application boils down to writing start and stop scripts which can run in the background. The bottom line is: if you can run these scripts from cron, they will also work inside Linux-HA.
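
As an example, a start/stop script pair might look like the following sketch. The daemon name, the paths and the PID file are assumptions (the sketch also assumes the daemon stays in the foreground, so that $! is its process ID):

#!/bin/sh
# start_linuxtest: hypothetical start script, run as root by the
# start_server event; it must not require user interaction, so the
# daemon is pushed into the background and its PID is remembered
/usr/local/bin/linuxtestd &
echo $! > /var/run/linuxtest.pid

#!/bin/sh
# stop_linuxtest: hypothetical stop script, run as root by the
# stop_server event; it stops the daemon by sending it a signal
if [ -f /var/run/linuxtest.pid ]; then
    kill `cat /var/run/linuxtest.pid`
    rm -f /var/run/linuxtest.pid
fi

If this pair runs cleanly from cron, it can be entered as the start and stop attributes of an application server in the configuration file (see class server below).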

At a later time, a clustering API might become useful which can be used by applications to control the cluster in certain ways or communicate with the cluster manager daemon.

Syslog Issues

syslogd will be used for reporting and detecting errors. In a real-life setup, make sure syslog is configured on each host to log locally and to a remote loghost. Otherwise you may not be able to find out what went wrong if a failed machine won't reboot.
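
A minimal /etc/syslog.conf fragment for this could look as follows; the name loghost is an assumption, any reachable log server will do:

# keep a local copy of everything of priority info and above ...
*.info                                  /var/log/messages
# ... and send all messages to a remote loghost as well
*.*                                     @loghost

List loghost in /etc/hosts so that remote logging does not depend on a working name server.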

It would be better to have a generic error logging interface like the AIX error logger. This way, one could handle errors from the kernel more easily. Currently, every change of a device driver or kernel error message would require a Linux-HA update.

Logging Onto A Cluster Node

It is generally a bad idea for users to log in directly on a cluster node. If you log in via TCP/IP (e.g. telnet), the connection will be lost in case of a failover. If you log in via a serial port, the connection will be lost as well, because there is no way to take over serial ports, except if you use network attached terminal adapters, for which the first rule applies.

Network Configuration And Reporting

SNMP will be used to enable remote agents to monitor and/or control a cluster. The package used will be the CMU SNMP toolkit.

Time Synchronization

The NTP protocol will be used to ensure consistent time on all nodes in a cluster. Otherwise, debugging events and errors will be really hard. Ideally, multiple radio receivers will be attached to some cluster nodes, and the NTP configuration file will be set up to synchronize the time from multiple sources. That way, timekeeping will have no SPOF. Public NTP servers on the Intranet or the Internet can also be used.
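
A hedged /etc/ntp.conf sketch with several time sources could look like this; the server names are assumptions, and the reference clock driver (the 127.127.t.u pseudo addresses of xntpd) depends on the radio receiver actually used:

# a locally attached radio clock, if available (driver type 8 is the
# generic PARSE driver used by several radio receivers)
server 127.127.8.0
# two independent time servers on the Intranet or Internet
server ntp1.example.com
server ntp2.example.com
# use the local clock only as a last resort, at a high stratum
server 127.127.1.0
fudge  127.127.1.0 stratum 10

With at least two reachable sources per node, timekeeping itself has no SPOF.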

6.3 Configuration File Contents

Here is the skeleton of the config file format I made up. Some Linux specific parts are possibly missing, but this is about where I'll start off. What we need now is a library to read/write these objects. Maybe the KConfig C++ class of KDE (the Kool Desktop Environment) is suitable. Samba uses the same file format, but Samba isn't written in C++ (and I don't know very much about C++).

The configuration utility will write the ASCII file during configuration. Configuration will be done on one node only and then distributed (synchronized) to the other nodes. Each update will require another synchronization. This way, there will be no inconsistencies. During synchronization, the config tool will also convert the ASCII file (which may contain comments for readability) into GDBM, and all the HA tools will only use the GDBM version for speed.
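
The synchronization step itself could be as simple as the following sketch. The config file name, the node list and the use of rcp are assumptions; any remote copy mechanism will do:

#!/bin/sh
# hypothetical config synchronization: copy the ASCII config from the
# node it was edited on to all other cluster nodes
CONFIG=/etc/linux-ha.conf          # assumed location of the ASCII config
for node in linha; do              # every cluster node except the local one
    rcp $CONFIG $node:$CONFIG || echo "sync to $node failed" >&2
done

The conversion to GDBM would then be run on each node after the copy.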

Please note that Linux-HA will only parse the file during start. This way, the file will only hold static information which is needed when starting the cluster manager daemon. Runtime information will be kept in memory dynamically (e.g. node, adapter and network status), represented by SNMP variables.

Comments start with a hash.


# the first class is "adapter". The class stores information about all
# sorts of network adapters (ethernet, rs232, fddi, tokenring etc.)

[adapter]
# the type -- determines which network heartbeat module to use
type = "ether"
# the network it's attached to
network = "ether1"
# the node it belongs to
nodename = "seneca"
# the IP address/name it has (as listed in /etc/hosts)
ip_label = "seneca"
# its function (service, boot, standby)
function = "service"
# a MAC address if appropriate
haddr = "0xdeadbeef0123"

[adapter]
type = "ether"
network = "ether1"
nodename = "seneca"
ip_label = "seneca_stby"
function = "standby"
haddr = ""

[adapter]
type = "ether" 
network = "ether1"
nodename = "seneca"
ip_label = "seneca_boot"
function = "boot"
haddr = ""

# class cluster stores the information about this cluster

[cluster]
# the cluster ID - must be unique within a logical network or subnet. Part
# of the information in each heartbeat packet. 
id = 1
# cluster name - just a string
name = "linuxtest"
# this node
nodename = "seneca"
# highest node ID in use - node IDs start with "1"
highest_node_id = 2
# highest network ID in use - network IDs start with "0"
highest_network_id = 1


# this is a leftover from a real HACMP cluster. 
# I am not quite sure if we can use this. HACMP starts some daemons
# automatically: clinfo, the cluster info daemon (an SNMP client),
# the cluster SMUX peer daemon clsmuxpd, etc.

[daemons]
nodename = "hawww1"
daemon = "clinfo"
type = "start"
object = "time"
value = "true"

# class event holds all the events (see the event section in this HOWTO)

[event]
# the event name
name = "node_up_local"
# a description
desc = "Script run when it is the local node which is joining the cluster"
# some real HACMP data, dunno whether we need it. 
setno = 0
msgno = 0
# the message number from the NLS message catalog
catalog = ""
# the executable
cmd = "/usr/sbin/cluster/events/node_up_local"
# a notify script if appropriate
notify = ""
# a pre event script if appropriate
pre = ""
# a post event script if appropriate
post = ""

# the class group holds information about resource groups

[group]
# its name
group = "linuxtest"
# type cascading, rotating, (concurrent)
type = "cascading"
# participating nodes and their priority
nodes = "seneca linha"

# class network describes the networks
[network]
# its name
name = "rs232"
# type (serial, public, private) ("network attribute")
attr = "serial"
# the network number as known to the cluster software
network_id = 0

[network]
name = "ether"
attr = "public"
network_id = 1

# class nim describes network modules. These handle the heartbeat in a
# network specific manner (RS232 is different from IP, ATM is different
# from Ethernet because it cannot do Multicast etc.)

[nim]
name = "ether"
desc = "Ethernet Protocol"
addrtype = 0
path = "/usr/sbin/cluster/nims/nim_ether"
para = ""
grace = 30
# heartbeat rate in microseconds, well ... 
hbrate = 500000
# if 12 heartbeats are missing, an alert is created.
cycle = 12

# class node describes the individual nodes
[node]
# node name
name = "seneca"
# do logging in a verbose manner (e.g. "set -x" in the event scripts or not)
verbose_logging = "high"
# node number
node_id = 1

[node]
name = "linha"
verbose_logging = "high"
node_id = 2


# class resource describes what belongs to a resource group.

[resource]
# name
group = "linuxtest"
# which service IP label(s) are in the RG
service_label = "seneca"
# all FS in a line can make updates complicated...
# all FS to mount locally
filesystem= "/usr/local/ftp /usr/local/etc/httpd"
# all FS to export explicitly
export_filesystem = "/usr/local/ftp /usr/local/etc/httpd"
# which applications (separate class) to start/stop
applications = "linuxtest"
# whether to acquire the resource group on a standby node if the primary
# isn't there: no = false, yes = true.
inactive_takeover = "false"
# only for  concurrent access. disk fencing makes sure only active nodes
# can access the shared disks 
ssa_disk_fencing = "false"

# class server describes application servers
[server]
# the name (referenced e.g. in class resource)
name = "linuxtest"
# the name of a start and stop script. Will run as root, in the background. 
start = "/usr/local/cluster/start_linuxtest"
stop = "/usr/local/cluster/stop_linuxtest"

6.4 Event Structure and Sequence

Events And Their Meanings

I propose the following event structure:

acquire_service_addr

This script is called when the local node joins the cluster or a remote node leaves the cluster. Does a boot address -> service address swap. Called by the node_up_local, node_down_remote scripts.
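
As an illustration of what such a swap could look like on Linux, here is a hedged sketch using IP aliasing; the interface name, addresses and netmask are assumptions, and the boot address could just as well be replaced instead of aliased:

#!/bin/sh
# hypothetical fragment of acquire_service_addr: bring up the service
# address as an alias on the interface currently holding the boot address
SERVICE_IP=192.168.1.10          # assumed service address (ip_label "seneca")
NETMASK=255.255.255.0            # assumed netmask
IF=eth0                          # assumed interface of the boot address

ifconfig $IF:0 $SERVICE_IP netmask $NETMASK up

Clients with stale ARP entries will only see the new owner after a gratuitous ARP or a MAC address takeover (see cl_swap_HW_address below).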

acquire_takeover_addr

This script is called when a remote node leaves the cluster. Does a standby_address -> takeover_address swap if a standby_address is configured and up. Called by the node_down_remote, node_up_local scripts.

config_too_long

This script is periodically called as a timeout when the current event takes too long. Is primarily used to notify an operator. Called by the cluster manager daemon.

event_error

This script is called when a running event script returns an error code != 0. Called by the cluster manager daemon.

fail_standby

This event script is called when a standby adapter goes down. Called by the cluster manager daemon.

get_disk_fs

This script activates the disks and mounts filesystems. Called by the node_up_local, node_down_remote scripts.
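
A minimal sketch for the resource group used in the configuration example above; the device names are assumptions and would in reality come from the configuration, and shared storage additionally needs some kind of reservation or fencing to keep the other node off the disks:

#!/bin/sh
# hypothetical fragment of get_disk_fs for the "linuxtest" resource group:
# mount the filesystems listed in the resource class
mount /dev/sda1 /usr/local/ftp
mount /dev/sdb1 /usr/local/etc/httpd
# exporting the filesystems listed in export_filesystem would follow here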

join_standby

This event script is called when a standby adapter goes up. Called by the cluster manager daemon.

network_down

This event script is called when a network goes down (all of the network adapters on a physical network are down or unreachable). Called by the cluster manager daemon. Has an associated complete script.

network_up

This event script is called when a network goes up. Called by the cluster manager daemon. Has an associated complete script.

node_down

This script is called when a node leaves the cluster. Called by the cluster manager daemon. Calls the respective sub-event script for local or remote. Has an associated complete script.

node_down_local

This script is called when the local node leaves the cluster. Called by node_down. Has an associated complete script.

node_down_remote

This script is called when a remote node leaves the cluster. Called by node_down. Has an associated complete script.

node_up

This script is called when a node joins the cluster. Called by the cluster manager daemon. Calls the respective sub-event script for local or remote. Has an associated complete script.

node_up_local

This script is called when the local node joins the cluster. Called by node_up. Has an associated complete script.

node_up_remote

This script is called when a remote node joins the cluster. Called by node_up. Has an associated complete script.

release_service_addr

This script is called when the local node leaves the cluster. Called by node_down_local.

release_takeover_addr

This script is called if the local node has the remote node's service address on its standby adapter, and either the remote node re-joins the cluster or the local node leaves the cluster gracefully. Called by node_down_local, node_up_remote.

release_disk_fs

This script unmounts filesystems and releases the disks.

start_server

This script starts the server application. Called when the local node is completely up or a remote node has finished leaving the cluster. Called by node_up_local_complete, node_down_remote_complete. Args: server_name.

stop_server

This script is called to stop the application server(s) when a local node leaves the cluster or a remote node joins the cluster. Called by node_down_local, node_up_remote.

swap_adapter

This event script is called when the service address of a node goes down. The cluster manager then swaps the service adapter with the standby adapter. Called by the cluster manager daemon. Has an associated complete script.

Plus, there are several sub-event scripts; some of the most important are:

cl_swap_IP_address

This script is used during adapter swap and IP address takeover. Swaps either a single adapter's address (first form) or two adapters at a time (second form).

cl_swap_HW_address

This script is used to swap the MAC address of an adapter.
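
On Linux, the MAC address of an Ethernet adapter can be changed with ifconfig, provided the driver supports it. A hedged sketch follows; the interface and the locally administered address are assumptions, and most drivers require the interface to be down while the address is changed:

#!/bin/sh
# hypothetical fragment of cl_swap_HW_address
IF=eth1                          # assumed standby interface
NEW_MAC=02:de:ad:be:ef:01        # assumed locally administered MAC address

ifconfig $IF down
ifconfig $IF hw ether $NEW_MAC
ifconfig $IF up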

Event Sequence

This subsection will be filled later.

