8.10. Ceph maintenance

One of the most common maintenance tasks in Ceph is to replace the disk of an OSD. If a disk is already in a failed state, then you can go ahead and run through the steps in Destroy OSDs Section 8.5.2, “Destroy OSDs”. Ceph will recreate those copies on the remaining OSDs if possible. This rebalancing will start as soon as an OSD failure is detected or an OSD was actively stopped.

Note

With the default size/min_size (3/2) of a pool, recovery only starts when ‘size + 1` nodes are available. The reason for this is that the Ceph object balancer CRUSH Section 8.7, “Ceph CRUSH & device classes” defaults to a full node as `failure domain’.

To replace a functioning disk from the GUI, go through the steps in Destroy OSDs Section 8.5.2, “Destroy OSDs”. The only addition is to wait until the cluster shows HEALTH_OK before stopping the OSD to destroy it.

On the command line, use the following commands:

ceph osd out osd.<id>

You can check with the command below if the OSD can be safely removed.

ceph osd safe-to-destroy osd.<id>

Once the above check tells you that it is safe to remove the OSD, you can continue with the following commands:

systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>

Replace the old disk with the new one and use the same procedure as described in Create OSDs Section 8.5.1, “Create OSDs”.

It is good practice to run fstrim (discard) regularly on VMs and containers. This releases data blocks that the filesystem isn’t using anymore. It reduces data usage and resource load. Most modern operating systems issue such discard commands to their disks regularly. You only need to ensure that the Virtual Machines enable the disk discard option the section called “Trim/Discard”.

Ceph ensures data integrity by scrubbing placement groups. Ceph checks every object in a PG for its health. There are two forms of Scrubbing, daily cheap metadata checks and weekly deep data checks. The weekly deep scrub reads the objects and uses checksums to ensure data integrity. If a running scrub interferes with business (performance) needs, you can adjust the time when scrubs [24] are executed.