Although erasure-coded pool support has been in Ceph for several releases now, until the arrival of BlueStore in the Luminous release, erasure-coded pools did not support partial writes. This limitation meant that they could not be used directly with RBD and CephFS workloads. The introduction of BlueStore in Luminous provided the groundwork for partial write support to be implemented. With partial writes, the range of I/O types that erasure-coded pools can handle almost matches that of replicated pools, enabling erasure-coded pools to be used directly with RBD and CephFS workloads. This dramatically lowers the cost of storage capacity for these use cases.
For full-stripe writes, which occur either for new objects or when an entire object is rewritten, the write penalty is greatly reduced. A client writing a 4 MB object to a 4+2 erasure-coded pool only has to write 6 MB of data: 4 MB of data chunks and 2 MB of erasure-coded chunks. This compares to 12 MB of data written in a 3x replicated pool. It should, however, be noted that each chunk of the erasure stripe is written to a different OSD. For smaller erasure profiles, such as 4+2, this tends to offer a large performance boost on both spinning disks and SSDs, as each OSD has to write less data. However, for larger erasure stripes, the overhead of writing to an ever-increasing number of OSDs starts to outweigh the benefit of reducing the amount of data written, particularly on spinning disks, whose latency does not scale linearly with I/O size.
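To make the arithmetic concrete, the following short Python sketch works through the same example; the 4+2 profile, 4 MB object size, and 3x replication are simply the figures used in the preceding paragraph:

```python
# A minimal sketch of the full-stripe write amplification described above.
# The 4+2 profile, 4 MB object, and three-way replication come from the
# example in the text, not from a live cluster.

def ec_full_stripe_bytes(object_size, k, m):
    """Bytes written for a full-stripe write to a k+m erasure-coded pool:
    the object is split into k data chunks and m coding chunks are added."""
    chunk = object_size / k
    return chunk * (k + m)

def replicated_bytes(object_size, copies):
    """Bytes written for the same object in a replicated pool."""
    return object_size * copies

object_size = 4 * 1024 * 1024  # 4 MB object

print(ec_full_stripe_bytes(object_size, k=4, m=2) / 1024 ** 2)  # 6.0 MB
print(replicated_bytes(object_size, copies=3) / 1024 ** 2)      # 12.0 MB
```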
Ceph's userspace clients, such as librbd and libcephfs, are clever enough to try to batch together smaller I/Os and submit a full-stripe write where possible; this can help when the application above them is submitting sequential I/O that is not aligned to the 4 MB object boundaries.
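Purely as an illustration of this batching idea (this is not librbd's actual implementation, and the class and callback names are hypothetical), a coalescer along the following lines buffers sequential writes until a complete stripe can be submitted in one go:

```python
STRIPE_SIZE = 4 * 1024 * 1024  # matches the 4 MB object size used above

class StripeCoalescer:
    """Buffers sequential writes and submits a full-stripe write once a whole
    stripe has accumulated; anything left over at flush time falls back to the
    more expensive partial-write path. For simplicity, the stream is assumed
    to start on a stripe boundary."""

    def __init__(self, submit_full_stripe, submit_partial_write):
        self.submit_full_stripe = submit_full_stripe      # cheap path
        self.submit_partial_write = submit_partial_write  # read-modify-write path
        self.offset = 0
        self.buffer = bytearray()

    def write(self, offset, data):
        if offset != self.offset + len(self.buffer):
            self.flush()              # non-sequential write: stop batching
            self.offset = offset
        self.buffer += data
        # Submit each complete stripe as a single full-stripe write.
        while len(self.buffer) >= STRIPE_SIZE:
            self.submit_full_stripe(self.offset, bytes(self.buffer[:STRIPE_SIZE]))
            del self.buffer[:STRIPE_SIZE]
            self.offset += STRIPE_SIZE

    def flush(self):
        # Whatever cannot form a full stripe goes down the partial-write path.
        if self.buffer:
            self.submit_partial_write(self.offset, bytes(self.buffer))
            self.offset += len(self.buffer)
            self.buffer.clear()
```

With something like this in place, four sequential 1 MB writes would go to the cluster as one full-stripe write rather than four separate read-modify-write cycles.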
Partial write support allows overwrites to be made to an object; this introduces a number of complexities because, when a partial write is done, the erasure chunks must also be updated to match the new object contents. This is very similar to the challenge faced by RAID 5 and 6, although having to coordinate the process across several OSDs in a consistent manner increases the complexity. When a partial write is performed, Ceph first reads the entire existing object off disk, then merges the new data in memory, calculates the new erasure-coded chunks, and writes everything back to disk. So, not only are both a read and a write operation involved, but each of these operations will likely touch several of the disks making up the erasure stripe. As you can see, a single I/O can end up carrying a write penalty several times higher than that of a replicated pool. For a 4+2 erasure-coded pool, a small 4 KB write could end up submitting 12 I/Os to the disks in the cluster, not taking into account any additional Ceph overheads.
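The resulting write penalty can be expressed as simple back-of-the-envelope arithmetic; the sketch below only counts the chunk reads and writes described above and, as noted, ignores any additional Ceph overheads:

```python
# Rough I/O count for the read-modify-write path described above: read every
# chunk of the stripe, merge the new data in memory, then write every chunk
# back. Additional Ceph overheads are deliberately ignored.

def ec_partial_write_ios(k, m):
    """Disk I/Os for a small sub-stripe overwrite on a k+m erasure-coded pool."""
    reads = k + m   # read the existing data and coding chunks
    writes = k + m  # write the updated data and coding chunks
    return reads + writes

def replicated_write_ios(copies):
    """Disk I/Os for the same small overwrite on a replicated pool."""
    return copies

print(ec_partial_write_ios(k=4, m=2))   # 12 I/Os for a single 4 KB write
print(replicated_write_ios(copies=3))   # 3 I/Os on a 3x replicated pool
```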