To build a hyper-converged Proxmox + Ceph cluster, you must use at least three servers, preferably identical ones.
Also check the recommendations on Ceph’s website.
CPU. A high CPU core frequency reduces latency and should be preferred. As a simple rule of thumb, you should assign a CPU core (or thread) to each Ceph service to provide enough resources for stable and durable Ceph performance.
Memory. Especially in a hyper-converged setup, the memory consumption needs to be carefully monitored. In addition to the predicted memory usage of virtual machines and containers, you must also account for having enough memory available for Ceph to provide excellent and stable performance.
As a rule of thumb, an OSD will use roughly 1 GiB of memory for every 1 TiB of data, especially during recovery, re-balancing or backfilling.
The daemon itself will use additional memory. The Bluestore backend of the daemon requires by default 3-5 GiB of memory (adjustable). In contrast, the legacy Filestore backend uses the OS page cache and the memory consumption is generally related to PGs of an OSD daemon.
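The memory rule of thumb above can be turned into a rough per-node budget. The following is a minimal sketch, not an official sizing tool: the function name and its parameters are hypothetical, and the default daemon overhead of 4 GiB is simply an assumed midpoint of the 3-5 GiB range mentioned above.

```python
def estimate_ceph_memory_gib(data_per_osd_tib, daemon_overhead_gib=4.0, gib_per_tib=1.0):
    """Rough memory budget (GiB) for all OSDs on one node.

    data_per_osd_tib    -- stored data per OSD in TiB, one entry per OSD
    daemon_overhead_gib -- assumed per-daemon memory (3-5 GiB by default for Bluestore)
    gib_per_tib         -- rule-of-thumb memory per TiB of data
    """
    if daemon_overhead_gib < 0 or gib_per_tib < 0:
        raise ValueError("memory factors must not be negative")
    # Memory that scales with the amount of data held by each OSD.
    data_memory = sum(tib * gib_per_tib for tib in data_per_osd_tib)
    # Memory used by the OSD daemons themselves.
    daemon_memory = len(data_per_osd_tib) * daemon_overhead_gib
    return data_memory + daemon_memory


# Example: a node with four OSDs, each holding about 2 TiB of data.
node_osds = [2.0, 2.0, 2.0, 2.0]
print(f"Reserve roughly {estimate_ceph_memory_gib(node_osds):.0f} GiB for Ceph OSDs")
```

The result is only the Ceph OSD share. The memory planned for virtual machines, containers and other Ceph services has to be added on top.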
Network. We recommend a network bandwidth of at least 10 GbE, used exclusively for Ceph. A meshed network setup [13] is also an option if no 10 GbE switches are available.
The volume of traffic, especially during recovery, will interfere with other services on the same network and may even break the Proxmox VE cluster stack.
Furthermore, you should estimate your bandwidth needs. While one HDD might not saturate a 1 Gbps link, multiple HDD OSDs per node can, and modern NVMe SSDs will quickly saturate even 10 Gbps of bandwidth. Deploying a network capable of even more bandwidth will ensure that this isn’t your bottleneck and won’t be anytime soon. 25, 40 or even 100 Gbps are possible.
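A quick way to estimate bandwidth needs is to compare the combined sequential throughput of a node's OSD disks with the capacity of its network link. The sketch below assumes a hypothetical helper name and purely illustrative per-disk throughput figures; substitute measured values for your own hardware.

```python
def link_utilization(device_throughputs_mb_s, link_gbps):
    """Return the ratio of combined disk throughput to link capacity.

    A value above 1.0 means the disks could, in principle, saturate the link.
    Throughputs are in MB/s, the link speed in Gbps (decimal units).
    """
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    link_mb_s = link_gbps * 1000 / 8  # 1 Gbps = 125 MB/s
    return sum(device_throughputs_mb_s) / link_mb_s


# Illustrative, assumed throughput figures (not measurements):
single_hdd = [100]   # one HDD OSD
four_hdds = [100] * 4  # four HDD OSDs in one node
one_nvme = [3000]    # one NVMe SSD OSD

print(link_utilization(single_hdd, 1))   # below 1.0: link not saturated
print(link_utilization(four_hdds, 1))    # above 1.0: 1 Gbps link saturated
print(link_utilization(one_nvme, 10))    # above 1.0: 10 Gbps link saturated
```

This ignores replication traffic and protocol overhead, so real requirements will tend to be higher than the ratio suggests.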
Disks. When planning the size of your Ceph cluster, it is important to take the recovery time into consideration. Especially with small clusters, recovery might take long. It is recommended that you use SSDs instead of HDDs in small setups to reduce recovery time, minimizing the likelihood of a subsequent failure event during recovery.
In general, SSDs will provide more IOPS than spinning disks. With this in mind, in addition to the higher cost, it may make sense to implement a class-based separation of pools (see Section 8.7, “Ceph CRUSH & device classes”). Another way to speed up OSDs is to use a faster disk as a journal or DB/Write-Ahead-Log device (see Section 8.5, “Ceph OSDs”). If a faster disk is used for multiple OSDs, a proper balance between OSD and WAL / DB (or journal) disk must be selected; otherwise the faster disk becomes the bottleneck for all linked OSDs.
Aside from the disk type, Ceph performs best with evenly sized disks that are evenly distributed across the nodes. For example, 4 x 500 GB disks within each node is better than a mixed setup with a single 1 TB disk and three 250 GB disks.
You also need to balance OSD count and single OSD capacity. More capacity allows you to increase storage density, but it also means that a single OSD failure forces Ceph to recover more data at once.
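One simple way to compare disk layouts is to look at the share of a node's capacity that sits on its largest disk, since that is the most data a single OSD failure on that node can force Ceph to recover. The following sketch uses a hypothetical helper name and assumes that data is distributed in proportion to disk size.

```python
def largest_failure_share(disk_sizes_gb):
    """Fraction of a node's raw capacity held by its largest disk.

    Assumes data is spread proportionally to disk size, so this is the
    worst-case share of the node's data affected by one OSD failure.
    """
    if not disk_sizes_gb:
        raise ValueError("at least one disk is required")
    return max(disk_sizes_gb) / sum(disk_sizes_gb)


even_layout = [500, 500, 500, 500]     # 4 x 500 GB
mixed_layout = [1000, 250, 250, 250]   # 1 x 1 TB and 3 x 250 GB

print(f"even:  {largest_failure_share(even_layout):.0%}")
print(f"mixed: {largest_failure_share(mixed_layout):.0%}")
```

In the even layout, losing any disk affects a quarter of the node's capacity, while in the mixed layout the single large disk carries more than half of it.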
Avoid RAID. As Ceph handles data object redundancy and multiple parallel writes to disks (OSDs) on its own, using a RAID controller normally doesn’t improve performance or availability. On the contrary, Ceph is designed to handle whole disks on its own, without any abstraction in between. RAID controllers are not designed for the Ceph workload and may complicate things and sometimes even reduce performance, as their write and caching algorithms may interfere with the ones from Ceph.
Avoid RAID controllers. Use a host bus adapter (HBA) instead.
The above recommendations should be seen as rough guidance for choosing hardware. It is therefore still essential to adapt them to your specific needs. You should test your setup and monitor health and performance continuously.