This section provides a detailed description of the Proxmox VE HA manager internals. It describes all involved daemons and how they work together. To provide HA, two daemons run on each node:
pve-ha-lrm
The local resource manager (LRM), which controls the services running on the local node.
pve-ha-crm
The cluster resource manager (CRM), which makes the cluster-wide decisions.
Locks are provided by our distributed configuration file system (pmxcfs). They are used to guarantee that each LRM is active only once and working. As an LRM only executes actions when it holds its lock, we can mark a failed node as fenced if we can acquire its lock. This then lets us recover any failed HA services securely, without any interference from the now unreachable failed node. All of this is supervised by the CRM which currently holds the manager master lock.
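To make this concrete, here is a minimal Python sketch of the lock-based fencing idea. It is not the actual implementation; the Cluster class, the lock TTL, and the helper names are illustrative stand-ins for the pmxcfs lock primitives.

import time

# Toy model of pmxcfs-style locks: a lock must be renewed regularly and
# expires if its holder (e.g. a node that lost quorum) stops renewing it.
LOCK_TTL = 120  # illustrative expiry, in seconds

class Cluster:
    def __init__(self):
        self.locks = {}  # lock name -> (holder, expiry timestamp)

    def acquire(self, name, holder):
        owner = self.locks.get(name)
        # A lock can be taken if nobody holds it, we hold it already,
        # or the previous holder stopped renewing it.
        if owner is None or owner[0] == holder or owner[1] < time.monotonic():
            self.locks[name] = (holder, time.monotonic() + LOCK_TTL)
            return True
        return False

def lrm_may_act(cluster, node):
    # An LRM executes actions only while it holds its own agent lock.
    return cluster.acquire(f"ha_agent_{node}_lock", holder=node)

def crm_fence(cluster, failed_node):
    # Once the failed node's lock has expired, the CRM can take it over.
    # That node then provably cannot touch its services anymore, so they
    # can be recovered on healthy nodes without interference.
    if cluster.acquire(f"ha_agent_{failed_node}_lock", holder="master"):
        print(f"{failed_node} fenced; its services can be recovered")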
The CRM uses a service state enumeration to record the current service state. This state is displayed on the GUI and can be queried using the ha-manager command line tool:
# ha-manager status
quorum OK
master elsa (active, Mon Nov 21 07:23:29 2016)
lrm elsa (active, Mon Nov 21 07:23:22 2016)
service ct:100 (elsa, stopped)
service ct:102 (elsa, started)
service vm:501 (elsa, started)
Here is the list of possible states:
stopped
request_stop
stopping
started
starting
fence
recovery
freeze
ignored
migrate
error
queued
disabled
The local resource manager (pve-ha-lrm) is started as a daemon on boot and waits until the HA cluster is quorate and thus cluster-wide locks are working.
It can be in three states:
wait for agent lock
The LRM waits for our exclusive lock. This is also used as the idle state if no service is configured.
active
The LRM holds its exclusive lock and has services configured.
lost agent lock
The LRM lost its lock; this means a failure happened and quorum was lost.
After the LRM enters the active state, it reads the manager status file in /etc/pve/ha/manager_status and determines the commands it has to execute for the services it owns.
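As a rough illustration, the following Python sketch shows that step. The manager status is stored as JSON in pmxcfs, but the service_status and node keys used here are assumptions about the schema, not the authoritative format.

import json

def read_manager_status(path="/etc/pve/ha/manager_status"):
    # The manager status file lives in pmxcfs and contains JSON.
    with open(path) as f:
        return json.load(f)

def services_owned_by(status, node):
    # Keep only the services placed on this node; the LRM derives its
    # work queue from these entries.
    return {sid: svc
            for sid, svc in status.get("service_status", {}).items()
            if svc.get("node") == node}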
For each command, a worker is started; these workers run in parallel and are limited to at most 4 by default. This default setting may be changed through the datacenter configuration key max_worker. When a worker finishes, its process is collected and its result saved for the CRM.
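A minimal sketch of such a bounded worker pool, written in Python rather than the daemon's actual forking code; run_command is a hypothetical stand-in for executing one CRM command.

from concurrent.futures import ProcessPoolExecutor

MAX_WORKERS = 4  # the default limit; tunable via max_worker

def run_command(uid, command):
    # Placeholder for starting/stopping/migrating one service.
    return 0

def execute_all(commands):
    # At most MAX_WORKERS commands run in parallel; each finished
    # worker is collected and its result kept for the CRM.
    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {uid: pool.submit(run_command, uid, cmd)
                   for uid, cmd in commands.items()}
        return {uid: fut.result() for uid, fut in futures.items()}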
The default value of at most 4 concurrent workers may be unsuited for a specific setup. For example, 4 live migrations may occur at the same time, which can lead to network congestion on slower networks and/or with big (memory-wise) services. Make sure that, in the worst case, congestion is kept to a minimum, even if this means lowering the max_worker value. Conversely, if you have a particularly powerful, high-end setup, you may also want to increase it.
Each command requested by the CRM is uniquely identifiable by a UID. When the worker finishes, its result is processed and written to the LRM status file /etc/pve/nodes/<nodename>/lrm_status. There the CRM may collect it and let its state machine, according to the command's output, act on it.
The actions on each service between CRM and LRM are normally always synchronized: the CRM requests a state change uniquely marked by a UID, the LRM then executes this action one time and writes back the result, which is also identifiable by the same UID. This is needed so that the LRM does not execute an outdated command.
The only exceptions to this behaviour are the stop and error commands; these two do not depend on the result produced and are executed always in the case of the stopped state, and once in the case of the error state.
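The following Python sketch captures this UID handshake, including the stop exception; the schema and helper names are assumptions, and the authoritative logic lives in the daemons themselves.

import json

def execute(command):
    return "done"  # placeholder for actually running the command

def lrm_handle(request, last_result):
    # Normal commands run exactly once per UID: if a result for this UID
    # was already written back, report it again instead of re-executing,
    # so an outdated command is never run twice.
    answered = last_result is not None and last_result["uid"] == request["uid"]
    if answered and request["command"] != "stop":
        return last_result
    # 'stop' is repeated for as long as the requested state stays
    # stopped; 'error' likewise ignores the produced result, but is
    # acted on only once.
    return {"uid": request["uid"], "result": execute(request["command"])}

def write_lrm_status(node, result):
    # The CRM collects this file and matches the result to its request
    # by the shared UID before letting its state machine act on it.
    with open(f"/etc/pve/nodes/{node}/lrm_status", "w") as f:
        json.dump(result, f)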
The HA stack logs every action it takes. This helps to understand what happened in the cluster, and also why. Here it is important to see what both daemons, the LRM and the CRM, did. You may use journalctl -u pve-ha-lrm on the node(s) where the service is, and the same command for pve-ha-crm on the node which is the current master.
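For example, on a node currently running the service, and on the current master node respectively:

# journalctl -u pve-ha-lrm
# journalctl -u pve-ha-crm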
The cluster resource manager (pve-ha-crm) starts on each node and waits there for the manager lock, which can only be held by one node at a time. The node which successfully acquires the manager lock gets promoted to the CRM master.
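Conceptually, the election is just a retry loop around that lock. In this Python sketch, try_acquire_manager_lock and manage_round are hypothetical stand-ins for the pmxcfs manager lock and the manager logic:

import time

def try_acquire_manager_lock(node):
    # Stand-in for taking the pmxcfs manager lock; only one node at a
    # time can hold it, everyone else gets False.
    return False

def manage_round(node):
    pass  # placeholder: read statuses and dispatch commands to the LRMs

def crm_loop(node):
    while True:
        if try_acquire_manager_lock(node):
            manage_round(node)  # this node currently is the CRM master
        # Non-masters simply keep retrying and take over if the current
        # master fails to renew the lock.
        time.sleep(10)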
It can be in three states:
wait for agent lock
The CRM waits for our exclusive lock. This is also used as the idle state if no service is configured.
active
The CRM holds its exclusive lock and has services configured.
lost agent lock
The CRM lost its lock; this means a failure happened and quorum was lost.
Its main task is to manage the services which are configured to be highly available, and to try to always enforce the requested state. For example, a service with the requested state started will be started if it is not already running. If it crashes, it will be automatically started again. Thus, the CRM dictates the actions the LRM needs to execute.
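A toy Python version of this reconciliation step; the state names started/stopped come from the state enumeration above, everything else is illustrative:

def next_command(requested, current):
    # Derive the LRM action from the gap between requested and current
    # state; a crashed 'started' service shows up as not running and is
    # therefore started again.
    if requested == "started" and current != "started":
        return "start"
    if requested == "stopped" and current != "stopped":
        return "stop"
    return None  # nothing to do, the requested state already holds

def plan(services):
    # e.g. plan({"vm:501": {"requested": "started", "current": "stopped"}})
    # returns {"vm:501": "start"}
    return {sid: cmd for sid, svc in services.items()
            if (cmd := next_command(svc["requested"], svc["current"]))}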
When a node leaves the cluster quorum, its state changes to unknown. If the current CRM can then secure the failed node’s lock, the services will be stolen and restarted on another node.
When a cluster member determines that it is no longer in the cluster quorum, the LRM waits for a new quorum to form. As long as there is no quorum, the node cannot reset the watchdog. This will trigger a reboot after the watchdog times out (after 60 seconds).
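This self-fencing behaviour can be pictured as the following loop; has_quorum and watchdog are hypothetical stand-ins (the real stack arms the watchdog through the watchdog-mux service):

import time

WATCHDOG_TIMEOUT = 60  # seconds until an unreset watchdog reboots the node

def watchdog_loop(watchdog, has_quorum):
    while True:
        if has_quorum():
            watchdog.reset()  # re-arm the watchdog, postponing the reboot
        # Without quorum the watchdog is deliberately not reset, so the
        # node reboots itself once WATCHDOG_TIMEOUT elapses.
        time.sleep(10)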