On node failures, fencing ensures that the erroneous node is guaranteed to be offline. This is required to make sure that no resource runs twice when it gets recovered on another node. This is a really important task, because without this, it would not be possible to recover a resource on another node.
If a node did not get fenced, it would be in an unknown state where it may have still access to shared resources. This is really dangerous! Imagine that every network but the storage one broke. Now, while not reachable from the public network, the VM still runs and writes to the shared storage.
If we then simply start up this VM on another node, we would get a dangerous race condition, because we write from both nodes. Such conditions can destroy all VM data and the whole VM could be rendered unusable. The recovery could also fail if the storage protects against multiple mounts.
There are different methods to fence a node, for example, fence devices which cut off the power from the node or disable their communication completely. Those are often quite expensive and bring additional critical components into a system, because if they fail you cannot recover any service.
We thus wanted to integrate a simpler fencing method, which does not require additional external hardware. This can be done using watchdog timers.
Possible Fencing Methods
Watchdog timers have been widely used in critical and dependable systems since the beginning of microcontrollers. They are often simple, independent integrated circuits which are used to detect and recover from computer malfunctions.
During normal operation, ha-manager
regularly resets the watchdog
timer to prevent it from elapsing. If, due to a hardware fault or
program error, the computer fails to reset the watchdog, the timer
will elapse and trigger a reset of the whole server (reboot).
Recent server motherboards often include such hardware watchdogs, but these need to be configured. If no watchdog is available or configured, we fall back to the Linux Kernel softdog. While still reliable, it is not independent of the servers hardware, and thus has a lower reliability than a hardware watchdog.
By default, all hardware watchdog modules are blocked for security reasons. They are like a loaded gun if not correctly initialized. To enable a hardware watchdog, you need to specify the module to load in /etc/default/pve-ha-manager, for example:
# select watchdog module (default is softdog) WATCHDOG_MODULE=iTCO_wdt
This configuration is read by the watchdog-mux service, which loads the specified module at startup.
After a node failed and its fencing was successful, the CRM tries to move services from the failed node to nodes which are still online.
The selection of nodes, on which those services gets recovered, is
influenced by the resource group
settings, the list of currently active
nodes, and their respective active service count.
The CRM first builds a set out of the intersection between user selected
nodes (from group
setting) and available nodes. It then choose the
subset of nodes with the highest priority, and finally select the node
with the lowest active service count. This minimizes the possibility
of an overloaded node.
On node failure, the CRM distributes services to the remaining nodes. This increases the service count on those nodes, and can lead to high load, especially on small clusters. Please design your cluster so that it can handle such worst case scenarios.