If a replication job encounters problems, it is placed in an error state. In this state, the configured replication intervals are suspended temporarily. The failed replication is then retried at 30-minute intervals. Once it succeeds, the original schedule is activated again.
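You can inspect your replication jobs, including whether one is in the error state, with the pvesr command-line tool, which lists all jobs together with their current status:

# pvesr status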
Depending on your setup, there can be many possible causes. You can always use the replication log to find out what is causing the problem.
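On a default installation, the per-job replication logs usually live under /var/log/pve/replicate/, one file per job ID (here the hypothetical job 100-0, i.e. the first replication job of VM 100):

# cat /var/log/pve/replicate/100-0

If that path differs on your system, the same log is also reachable through the replication panel in the web interface.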
In the case of a severe error, a virtual guest may get stuck on a failed node. You then need to move it manually to a working node again.
Let’s assume that you have two guests (VM 100 and CT 200) running on node A and replicating to node B. Node A has failed and cannot come back online. Now you have to migrate the guests to node B manually.
Check that the cluster is quorate:
# pvecm status
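Look for the Quorate field in the output; a healthy cluster reports something similar to the following (output abridged, exact fields may vary by version):

Quorum information
------------------
Quorate:          Yes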
If you have no quorum, we strongly advise fixing this first and making the node operable again. Only if this is not possible at the moment should you use the following command to enforce quorum on the current node:
# pvecm expected 1
While expected votes are set, avoid changes which affect the cluster (for example, adding or removing nodes, storages, or virtual guests) at all costs. Only use this to get vital guests up and running again or to resolve the quorum issue itself.
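Before touching any guest configuration, it is a good idea to confirm that the node now reports itself as quorate again:

# pvecm status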
Move both guest configuration files from the original node A to node B:
# mv /etc/pve/nodes/A/qemu-server/100.conf /etc/pve/nodes/B/qemu-server/100.conf
# mv /etc/pve/nodes/A/lxc/200.conf /etc/pve/nodes/B/lxc/200.conf
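Since /etc/pve is the cluster-wide configuration filesystem, moving the files is enough to re-register the guests on node B. You can verify that the configurations are now in place:

# ls /etc/pve/nodes/B/qemu-server/
# ls /etc/pve/nodes/B/lxc/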
Now you can start the guests again:
# qm start 100
# pct start 200
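You can then check that both guests are actually running:

# qm status 100
# pct status 200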
Remember to replace the VMIDs and node names with your respective values.