Our modern society depends heavily on information provided by computers over the network. Mobile devices have amplified that dependency, because people can access the network at any time from anywhere. If you provide such services, it is very important that they are available most of the time.
We can mathematically define availability as the ratio of (A), the total time a service is capable of being used during a given interval, to (B), the length of that interval. It is normally expressed as a percentage of uptime in a given year.
Table 15.1. Availability - Downtime per Year
Availability % | Downtime per year |
---|---|
99 | 3.65 days |
99.9 | 8.76 hours |
99.99 | 52.56 minutes |
99.999 | 5.26 minutes |
99.9999 | 31.5 seconds |
99.99999 | 3.15 seconds |
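The downtime figures in the table follow directly from the definition of availability. The following Python sketch reproduces them, assuming a 365-day year; the function names are illustrative only and are not part of Proxmox VE.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Yearly downtime (in seconds) permitted at the given availability."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


def format_duration(seconds: float) -> str:
    """Render a duration in days, hours, minutes, or seconds."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    return f"{seconds:.2f} seconds"


# Print one line per availability level listed in the table.
for percent in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    print(f"{percent}% -> {format_duration(downtime_seconds_per_year(percent))}")
```

Each additional nine cuts the permitted downtime by a factor of ten.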
There are several ways to increase availability. The most elegant solution is to rewrite your software so that you can run it on several hosts at the same time. The software itself then needs a way to detect errors and do failover. If you only want to serve read-only web pages, this is relatively simple. In general, however, it is complex and sometimes impossible, because you cannot modify the software yourself. The following solutions work without modifying the software:
Use reliable “server” components
Computer components with the same functionality can have varying reliability numbers, depending on the component quality. Most vendors sell components with higher reliability as “server” components, usually at a higher price.
Eliminate single points of failure (redundant components)
Reduce downtime: automatic error detection and automatic failover (both provided by ha-manager)
Virtualization environments like Proxmox VE make it much easier to reach high availability because they remove the “hardware” dependency. They also support the setup and use of redundant storage and network devices, so if one host fails, you can simply start those services on another host within your cluster.
Better still, Proxmox VE provides a software stack called ha-manager, which can do that automatically for you. It is able to automatically detect errors and do automatic failover.
Proxmox VE ha-manager works like an “automated” administrator. First, you configure which resources (VMs, containers, …) it should manage. Then ha-manager checks that they function correctly and handles service failover to another node in case of errors. ha-manager can also handle normal user requests, which may start, stop, relocate and migrate a service.
But high availability comes at a price. High-quality components are more expensive, and making them redundant at least doubles the cost. Additional spare parts increase costs further. So you should carefully calculate the benefits and compare them with those additional costs.
Increasing availability from 99% to 99.9% is relatively simple. But increasing availability from 99.9999% to 99.99999% is very hard and costly. ha-manager has typical error detection and failover times of about 2 minutes, so you can get no more than 99.999% availability.
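To see why a 2-minute failover caps the achievable availability, the following Python sketch computes the yearly availability for a given number of failovers. It assumes that failover is the only source of downtime and that each one takes the 120 seconds quoted above; the failover counts and the function name are hypothetical and chosen purely for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def availability_percent(failovers_per_year: int,
                         failover_seconds: float = 120.0) -> float:
    """Availability when failover time is the only downtime in the year."""
    downtime = failovers_per_year * failover_seconds
    return 100 * (1 - downtime / SECONDS_PER_YEAR)


# Hypothetical failure counts, for illustration only.
for count in (1, 2, 3):
    print(f"{count} failover(s) per year -> {availability_percent(count):.5f}%")
```

Under these assumptions, even a single failover per year consumes more than the 31.5 seconds allowed at 99.9999%, and a third failover in the same year exceeds the 5.26 minutes allowed at 99.999%.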