= Read-Me-First =

== The Scope of this Document ==

Computer clusters can be used to provide highly available services or
resources. The redundancy of multiple machines is used to guard
against failures of many types.

This document will walk through the installation and setup of simple
clusters using the Fedora distribution, version 14.

The clusters described here will use Pacemaker and Corosync to provide
resource management and messaging. Required packages and modifications
to their configuration files are described along with the use of the
Pacemaker command line tool for generating the XML used for cluster
control.

Pacemaker is a central component and provides the resource management
required in these systems. This management includes detecting and
recovering from the failure of various nodes, resources and services
under its control.

When more in depth information is required and for real world usage,
please refer to the http://www.clusterlabs.org/doc/[Pacemaker Explained] manual.

== What Is Pacemaker? ==

Pacemaker is a cluster resource manager. It achieves maximum availability
for your cluster services (aka. resources) by detecting and recovering
from node and resource-level failures by making use of the messaging and
membership capabilities provided by your preferred cluster infrastructure
(either Corosync or Heartbeat).

Pacemaker's key features include:

 * Detection and recovery of node and service-level failures
 * Storage agnostic, no requirement for shared storage
 * Resource agnostic, anything that can be scripted can be clustered
 * Supports STONITH for ensuring data integrity
 * Supports large and small clusters
 * Supports both quorate and resource driven clusters
 * Supports practically any redundancy configuration
 * Automatically replicated configuration that can be updated from any node
 * Ability to specify cluster-wide service ordering, colocation and anti-colocation
 * Support for advanced service types
 ** Clones: for services which need to be active on multiple nodes
 ** Multi-state: for services with multiple modes (eg. master/slave, primary/secondary)
 * Unified, scriptable, cluster shell

== Pacemaker Architecture ==

At the highest level, the cluster is made up of three pieces:

 * Non-cluster aware components (illustrated in green). These pieces
   include the resources themselves, scripts that start, stop and
   monitor them, and also a local daemon that masks the differences
   between the different standards these scripts implement.

 * Resource management Pacemaker provides the brain (illustrated in
   blue) that processes and reacts to events regarding the cluster.
   These events include nodes joining or leaving the cluster; resource
   events caused by failures, maintenance, scheduled activities; and
   other administrative actions. Pacemaker will compute the ideal
   state of the cluster and plot a path to achieve it after any of
   these events. This may include moving resources, stopping nodes and
   even forcing them offline with remote power switches.

 * Low level infrastructure Corosync provides reliable messaging,
   membership and quorum information about the cluster (illustrated in
   red).

.Conceptual Stack Overview
image::images/pcmk-overview.png["Conceptual overview of the cluster stack",align="center"]

When combined with Corosync, Pacemaker also supports popular open
source cluster filesystems.
footnote:[Even though Pacemaker also supports Heartbeat, the
filesystems need to use the stack for messaging and membership and
Corosync seems to be what they're standardizing on. Technically it
would be possible for them to support Heartbeat as well, however there
seems little interest in this.]

Due to recent standardization within the cluster filesystem community,
they make use of a common distributed lock manager which makes use of
Corosync for its messaging capabilities and Pacemaker for its
membership (which nodes are up/down) and fencing services.  

.The Pacemaker Stack
image::images/pcmk-stack.png["The Pacemaker StackThe Pacemaker stack when running on Corosync",align="center"]

=== Internal Components ===

Pacemaker itself is composed of four key components (illustrated below in
the same color scheme as the previous diagram):

 * CIB (aka. Cluster Information Base)
 * CRMd (aka. Cluster Resource Management daemon)
 * PEngine (aka. PE or Policy Engine)
 * STONITHd

.Internal Components
image::images/pcmk-internals.png["Subsystems of a Pacemaker cluster running on Corosync",align="center"]

The CIB uses XML to represent both the cluster's configuration and
current state of all resources in the cluster. The contents of the CIB
are automatically kept in sync across the entire cluster and are used
by the PEngine to compute the ideal state of the cluster and how it
should be achieved. 

This list of instructions is then fed to the DC (Designated
Co-ordinator).  Pacemaker centralizes all cluster decision making by
electing one of the CRMd instances to act as a master. Should the
elected CRMd process, or the node it is on, fail... a new one is
quickly established.

The DC carries out the PEngine's instructions in the required order by
passing them to either the LRMd (Local Resource Management daemon) or
CRMd peers on other nodes via the cluster messaging infrastructure
(which in turn passes them on to their LRMd process). 

The peer nodes all report the results of their operations back to the
DC and based on the expected and actual results, will either execute
any actions that needed to wait for the previous one to complete, or
abort processing and ask the PEngine to recalculate the ideal cluster
state based on the unexpected results.

In some cases, it may be necessary to power off nodes in order to
protect shared data or complete resource recovery. For this Pacemaker
comes with STONITHd. STONITH is an acronym for
Shoot-The-Other-Node-In-The-Head and is usually implemented with a
remote power switch. In Pacemaker, STONITH devices are modeled as
resources (and configured in the CIB) to enable them to be easily
monitored for failure, however STONITHd takes care of understanding
the STONITH topology such that its clients simply request a node be
fenced and it does the rest.

== Types of Pacemaker Clusters ==

Pacemaker makes no assumptions about your environment, this allows it
to support practically any
http://en.wikipedia.org/wiki/High-availability_cluster#Node_configurations[redundancy
configuration] including Active/Active, Active/Passive, N+1, N+M,
N-to-1 and N-to-N.

In this document we will focus on the setup of a highly available
Apache web server with an Active/Passive cluster using DRBD and Ext4
to store data. Then, we will upgrade this cluster to Active/Active
using GFS2.

.Active/Passive Redundancy
image::images/pcmk-active-passive.png["Two-node Active/Passive clusters using Pacemaker and DRBD are a cost-effective solution for many High Availability situations",align="center"]


.N to N Redundancy
image::images/pcmk-active-active.png["When shared storage is available, every node can potentially be used for failover. Pacemaker can even run multiple copies of services to spread out the workload",align="center"]
