Ceph, an open-source distributed storage platform, provides a unified interface for object-, block-, and file-level storage [33, 80, 34, 81]. This chapter presents the block-level workload characteristics of a WD WASP/EPIC microserver-based Ceph cluster. The analysis techniques presented can help you understand the performance and drive characteristics of Ceph in production environments. I also identify whether SMR, hybrid, and SSD drives are suitable for the Ceph workload.
The basic architecture of Ceph was described in Chapter 1. Ceph’s core, RADOS, is a fully distributed, reliable, and autonomous object store using the CRUSH (Controlled Replication Under Scalable Hashing) algorithm. Ceph’s building blocks are called OSDs (object storage daemons). OSDs are responsible for storing objects on local file systems (e.g., EXT4 and XFS) and for cooperating to replicate data, detect and recover from failures, and migrate data when OSDs join or leave the cluster. Ceph’s design originated from the premise that failures are common in large-scale storage systems. Along these lines, Ceph aims to guarantee reliability and scalability by leveraging the intelligence of the OSDs. Each OSD uses a journal to accelerate write operations by coalescing small writes and flushing them asynchronously to the backing file system when the journal fills. The journal can be a separate file, or it can be located on another device or partition [82, 83, 84].
These tests are based on a “unique” platform: instead of traditional workstations, a microserver structure is used in the production environment. In this system, each microserver has its own OS and a single HDD. This is nearly the minimum granularity for an IO device, which essentially satisfies the original design intent of Sage Weil, the creator of Ceph [80, 33]. In fact, this architecture shrinks the failure domain to a single disk, instead of many disks becoming inaccessible at once as in a multi-disk server architecture. The storage cluster is scaled out by connecting microservers through a top-of-rack Ethernet switch.

Ceph cluster topology
Filestore IO Pattern
Three VMs are used as clients to send bench write requests to a replicated pool (named rep1, with one replica) for 250 seconds, while blktrace collects traces for 310 seconds; that is, rados bench -p rep1 250 write runs on each client and blktrace /dev/sdax -w 310 on each node. Due to a limitation of blktrace (it cannot collect traces for the individual partitions of the same drive in one run), the traces from sda2/sda4 and from the whole sda are collected separately. A bus analyzer is also used to verify the traces.
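As a rough sketch of how such traces can be reduced, the script below tallies read/write counts and total blocks from blkparse’s default text output, keeping only completion (‘C’) events. The field layout is the standard blkparse default; the function itself is an illustration, not the exact pipeline used in this chapter.

```python
# Minimal sketch: summarize read/write mix and request sizes from
# blkparse default text output:
#   maj,min cpu seq timestamp pid action RWBS sector + nblocks [process]
# Only completed requests ('C' action) that carry a "+ nblocks" size
# field are counted.
def summarize(blkparse_lines):
    stats = {"R": {"count": 0, "blocks": 0}, "W": {"count": 0, "blocks": 0}}
    for line in blkparse_lines:
        f = line.split()
        if len(f) < 10 or f[5] != "C" or f[8] != "+":
            continue  # skip queue/issue events and size-less records
        kind = "R" if "R" in f[6] else ("W" if "W" in f[6] else None)
        if kind:
            stats[kind]["count"] += 1
            stats[kind]["blocks"] += int(f[9])
    return stats
```

From these per-partition summaries, the read/write ratio and average request size discussed below follow directly.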
The common properties of the 12 nodes are listed in Table 9-1. You can observe that the basic properties are generally similar. One set of IO pattern curves is illustrated in Figures 9-2 and 9-3, where the three rows represent sda2, sda4, and sda, respectively. Note that all wasp nodes have the write cache enabled. The command ceph tell osd.* bench 41943040 4194304 reports around 100MBPS (cached). This means that the three clients, with 32 threads each, have almost fully utilized the disk bandwidth. The reason is explained later.
You can also see that the IO patterns still differ across nodes; for example, the read/write ratio is high in some nodes and low in others, and the idle time distribution varies. Based on the read/write ratio, the IO patterns can be roughly divided into two classes: read dominated and write dominated. When reads dominate, the average read size becomes smaller.
Common Properties for Ceph Nodes
Properties | Metadata | Data |
---|---|---|
R/W | Mixed read and write | No read requests |
Size | Relatively small requests (8-block requests dominate); write sizes vary widely; the R/W ratio varies widely. | 1024-block requests dominate, followed by small blocks |
Sequence | Mode = 8 (very small); relatively more random over a small range. Much higher near-sequential ratio than strict sequential ratio (small gaps exist for 50% of requests) | Mode = 0; high sequential ratio. Even higher near-sequential ratio (small gaps exist for 5% of write requests) |
Write update | High update ratio (>50% of write requests are updates) | Low update ratio (updated blocks <1%, mostly partial) |
Write stack distance | Relatively small distance reaches a high percentage of hits; small average overlap size (8 blocks); a write cache is beneficial. | Relatively large distance is needed to reach a high percentage of hits; small average overlap size; a write cache brings little benefit. |
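The write update ratio and write stack distance used in the table can be derived from a write trace roughly as follows. This sketch expands each request to single-block granularity for clarity (a real tool would use an interval structure), and the exact definitions are my assumptions: a write counts as an update if it touches any previously written block, and the stack distance is the number of writes since the last write to the same block.

```python
# Sketch: write update ratio and write stack distance over a write trace.
# Each write is (lba, nblocks), expanded to block granularity for clarity.
def write_stats(writes):
    last_pos = {}        # block -> index of the last write touching it
    seen = set()         # all blocks written so far
    updated = 0
    distances = []
    for i, (lba, n) in enumerate(writes):
        blocks = range(lba, lba + n)
        if any(b in seen for b in blocks):
            updated += 1                       # re-touches written data
        for b in blocks:
            if b in last_pos:
                distances.append(i - last_pos[b])  # distance in writes
            last_pos[b] = i
            seen.add(b)
    update_ratio = updated / len(writes) if writes else 0.0
    return update_ratio, distances
```

A small stack distance with many hits (the metadata column) indicates that even a modest write cache absorbs most rewrites; a large distance (the data column) means caching buys little.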

IO pattern in different partitions: LBA and size distribution

IO pattern in different partitions: Throughput and IOPS
Total Idle Time in the First 240 Seconds
CT | wasp4 | wasp5 | wasp6 | wasp7 | wasp8 | wasp9 | wasp10 | wasp11 | wasp12 | mean | std | std/mean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3–8 | 52.48 | 17.77 | 55.34 | 44.65 | 114.21 | 9.1 | 48.49 | 21.13 | 0.7 | 40.43 | 34.17 | 0.85 | 162.2 |
3–16 | 107.78 | 0.15 | 76.31 | 113.58 | 0.85 | 39.87 | 5.91 | 22.52 | 61.66 | 47.63 | 44.53 | 0.93 | 746.81 |
3–32 | 48.09 | 63.42 | 35.93 | 4.19 | 8.45 | 0.01 | 25.61 | 0 | 4.87 | 21.17 | 23.34 | 1.1 | 895803 |
3–64 | 69.69 | 37.38 | 79.85 | 33.21 | 32.67 | 30.81 | 29.85 | 31.31 | 30.2 | 41.66 | 19.07 | 0.46 | 2.68 |
1–8 | 150.49 | 15.65 | 130.66 | 0.31 | 7.65 | 5.34 | 127.56 | 1.01 | 5.18 | 49.32 | 65.63 | 1.33 | 488.16 |
1–16 | 21.65 | 5.19 | 48.3 | 151.41 | 39.05 | 2.13 | 19.54 | 37.68 | 17.54 | 38.05 | 45.22 | 1.19 | 71.03 |
1–32 | 8.13 | 5.59 | 95.29 | 32.94 | 6.36 | 9.03 | 78.46 | 102.52 | 42.2 | 42.28 | 39.91 | 0.94 | 18.34 |
1–64 | 17.03 | 114.01 | 58.21 | 1.23 | 32.77 | 17.55 | 9.72 | 6.1 | 36.54 | 32.57 | 35.32 | 1.08 | 92.77 |
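The idle times above can be derived from a trace by subtracting the union of the per-request busy intervals from the observation window. A minimal sketch (the interval format is an assumption, not the chapter’s tooling):

```python
# Sketch: total idle time in the window [0, horizon], given per-request
# busy intervals (start, end); overlapping intervals are merged first.
def total_idle(intervals, horizon):
    busy = 0.0
    cur_s = cur_e = None
    for s, e in sorted(intervals):
        s, e = max(0.0, s), min(horizon, e)   # clip to the window
        if e <= s:
            continue
        if cur_e is None or s > cur_e:        # gap: close the previous run
            if cur_e is not None:
                busy += cur_e - cur_s
            cur_s, cur_e = s, e
        else:                                  # overlap: extend current run
            cur_e = max(cur_e, e)
    if cur_e is not None:
        busy += cur_e - cur_s
    return horizon - busy
```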
Performance Consistency Verification
Comparison of Three Approaches
Metrics | Pro | Con |
---|---|---|
Hypothesis | Full view with relatively full information; consistency in a relatively strict sense. | Hardly satisfied |
Average only | Simple and relatively easily satisfied | Partial view with limited information on average only |
Range tolerance | Engineer’s view in practice; easy to check. | Partial view; usually experiment dependent. |
1. Check if all rounds of tests have a steady state.
2. Use the steady state of each round as a sample vector for an overall consistency test or a one-to-one (paired) test.
3. Select a proper hypothesis test for the different requirements/assumptions.
F-test: Requires that each sample vector follows a normal distribution. If the final p-value is smaller than the predefined significance level (0.05 by default), you reject the hypothesis that the samples have the same mean.
H-test: Requires only that each sample vector comes from a continuous distribution (a weaker condition). If the final p-value is smaller than the predefined significance level (0.05 by default), you reject the hypothesis that the samples have the same median.
T-test: Applicable to paired tests only. If the final p-value is smaller than the predefined significance level (0.05 by default), you reject the hypothesis that the two paired samples have the same mean.
This approach gives a result in a relatively strict sense; in most cases, however, you may want to allow some degree of difference.
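The three tests map naturally onto SciPy. The sketch below applies them to per-round steady-state sample vectors at the default 0.05 significance level; the function name and data layout are illustrative assumptions, not the chapter’s tooling.

```python
# Sketch: the three hypothesis tests via SciPy, on per-round
# steady-state sample vectors. Each entry is (p-value, consistent?).
from scipy import stats

def consistency_tests(rounds, alpha=0.05):
    results = {}
    # F-test (one-way ANOVA): assumes normality; H0 = equal means.
    _, p = stats.f_oneway(*rounds)
    results["F"] = (p, p >= alpha)
    # H-test (Kruskal-Wallis): needs only continuous data; H0 = equal medians.
    _, p = stats.kruskal(*rounds)
    results["H"] = (p, p >= alpha)
    # Paired t-test between two rounds; H0 = equal means.
    _, p = stats.ttest_rel(rounds[0], rounds[1])
    results["T"] = (p, p >= alpha)
    return results
```

With well-behaved rounds, all three tests fail to reject, and the runs are declared consistent in the strict sense.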
1. Get the average values of the metrics of interest from each test (possibly in steady state).
2. Form a sample vector with the average values from all rounds.
3. Test whether it follows a normal distribution (or another expected distribution, such as uniform) with an acceptable variance.
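A minimal sketch of this average-only check, using a Shapiro-Wilk normality test on the per-round means plus a coefficient-of-variation bound; the 0.15 bound is an arbitrary example threshold, not a recommendation.

```python
# Sketch of the "average only" check: take each round's mean, test the
# vector of means for normality, and bound its coefficient of variation.
from scipy import stats

def averages_consistent(per_round_samples, alpha=0.05, max_cv=0.15):
    means = [sum(s) / len(s) for s in per_round_samples]
    _, p = stats.shapiro(means)            # H0: means are normal
    grand = sum(means) / len(means)
    var = sum((m - grand) ** 2 for m in means) / len(means)
    cv = var ** 0.5 / grand                # population CV of the means
    return p >= alpha and cv <= max_cv
```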
1. Check if each run’s values are within a certain range of that run’s mean or the expected experimental value. There are two cases: one requires all data points to comply, such as latency; the other requires only that almost all points comply, such as throughput.
2. Check if the average value of each run is within a certain range of the mean of all runs.
This approach usually requires experts to set proper thresholds in order to construct a reasonable range.
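The per-run part of the range-tolerance check can be sketched as below; feeding it the rand_6 column of the rados bench bandwidth summary reproduces the diff ratios shown in the range-tolerance comparison table. The 0.2 default tolerance is an example threshold, not a recommendation.

```python
# Sketch of the range-tolerance check: flag runs whose mean deviates
# from the mean of all runs by more than a chosen tolerance.
def range_tolerance(run_means, tol=0.2):
    grand = sum(run_means) / len(run_means)
    diff_ratios = [(m - grand) / grand for m in run_means]
    return diff_ratios, all(abs(d) <= tol for d in diff_ratios)
```

With a tolerance of 0.2 the rand_6 runs pass; tightening it to 0.15 makes the first run (diff ratio −0.160) fail, which shows how experiment-dependent the threshold choice is.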
An Example of the Hypothesis Approach
Value | rand_6 | rand_4 | rand_2 | write_5 | write_3 | write_1 |
---|---|---|---|---|---|---|
Normal | 2 | 3 | 2 | 6 | 0 | 2 |
Non-normal | 5 | 3 | 5 | 1 | 7 | 5 |
Invalid | 0 | 1 | 0 | 0 | 0 | 0 |
F-value | 3.114 | 60.405 | 3.261 | 24.002 | 24.002 | 57.901 |
P-value | 0.079 | 0 | 0.073 | 0 | –1 | 0 |
Result | 1 | 0 | 1 | 0 | –1 | 0 |
Summary for Bandwidth of Rados Bench
Mean | rand_6 | rand_4 | rand_2 | write_5 | write_3 | write_1 |
---|---|---|---|---|---|---|
R0 | 679.72 | 1642.15 | 1613.56 | 327.86 | 1261.88 | 405.87 |
R1 | 704.20 | 1596.64 | 1446.77 | 345.57 | 1275.05 | 386.54 |
R2 | 814.66 | 1646.67 | 1503.92 | 397.04 | 1322.19 | 442.87 |
R3 | 907.08 | 1566.22 | 1529.88 | 401.05 | 1225.51 | 394.28 |
R4 | 891.53 | 1539.09 | 1399.01 | 409.12 | 1200.37 | 360.88 |
R5 | 902.05 | 1507.67 | 1194.60 | 416.98 | 1310.67 | 422.73 |
R6 | 762.12 | 1524.56 | 1164.23 | 361.48 | 1186.59 | 336.50 |
Mean | 808.77 | 1574.71 | 1407.42 | 379.87 | 1254.61 | 392.81 |
Std | 88.65 | 51.49 | 157.14 | 32.06 | 48.69 | 33.39 |
Std/Mean | 0.11 | 0.03 | 0.11 | 0.08 | 0.04 | 0.09 |
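The Mean, Std, and Std/Mean rows above can be reproduced directly; note that Std in this table matches the population standard deviation (dividing by n, not n−1). A short sketch, checked against the rand_6 column:

```python
# Sketch: reproduce the Mean / Std / Std-over-Mean summary rows for one
# column of the bandwidth table; Std is the population standard deviation.
def summarize_column(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return round(mean, 2), round(std, 2), round(std / mean, 2)

rand_6 = [679.72, 704.20, 814.66, 907.08, 891.53, 902.05, 762.12]
```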
Comparison via Range Tolerance Approach
Diff Ratio | rand_6 | rand_4 | rand_2 | write_5 | write_3 | write_1 |
---|---|---|---|---|---|---|
R0 | –0.160 | 0.043 | 0.146 | –0.137 | 0.006 | 0.033 |
R1 | –0.129 | 0.014 | 0.028 | –0.090 | 0.016 | –0.016 |
R2 | 0.007 | 0.046 | 0.069 | 0.045 | 0.054 | 0.127 |
R3 | 0.122 | –0.005 | 0.087 | 0.056 | –0.023 | 0.004 |
R4 | 0.102 | –0.023 | –0.006 | 0.077 | –0.043 | –0.081 |
R5 | 0.115 | –0.043 | –0.151 | 0.098 | 0.045 | 0.076 |
R6 | –0.058 | –0.032 | –0.173 | –0.048 | –0.054 | –0.143 |
Max | 0.122 | 0.046 | 0.146 | 0.098 | 0.054 | 0.127 |
Min | –0.129 | –0.043 | –0.173 | –0.137 | –0.054 | –0.143 |

Bandwidth from six tests in one round
Ratio of Values within a Given Range Around Mean
Ratio | rand_6 | rand_4 | rand_2 | write_5 | write_3 | write_1 |
---|---|---|---|---|---|---|
± 0.2 | 0.91 | 1 | 0.99 | 0.44 | 1 | 0.9 |
± 0.1 | 0.73 | 0.99 | 0.89 | 0.2 | 0.91 | 0.6 |
Bottleneck Identification
Impact of CPU, Memory, and Network
Variables | Options | Remarks |
---|---|---|
CPU | Core number, speed, structure, instruction set, etc. | A common recommendation is at least one (virtual) core per OSD. Faster CPU cores usually improve performance, although the CPU structure (e.g., Intel vs. ARM, internal architectures/versions) also affects the real perf/GB, perf/$, and so on. Turning off energy-saving mode helps. |
Memory | RAM per server, RAM per OSD, etc. | A common recommendation is at least 1GB per 1TB OSD, and better 2GB per OSD. The actual value is workload-dependent. |
BIOS | HT mode, energy-saving, NUMA, etc. | HT affects the number of virtual cores (enabling it doubles them). Consider the energy-saving tradeoff: lower power, but fewer computational resources available. |
Network switch/NIC | Bandwidth and latency; Ethernet, Fibre Channel, InfiniBand, etc. | Higher bandwidth gives higher throughput, up to a point; lower latency helps small IO. Try ms crc data = false and ms crc header = false on high-quality networks. For a node with no less than 20 spinners or 2 SSDs, consider upgrading to 25GbE or 40GbE. |
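For reference, the messenger CRC options mentioned above are set in ceph.conf; a minimal illustrative fragment (disabling CRC is only advisable on high-quality networks, as noted):

```ini
[global]
# Skip CRC checks on the messenger path (trusted, high-quality network only)
ms_crc_data = false
ms_crc_header = false
```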
Impact of Disk
Variables | Options | Remarks |
---|---|---|
Drive type | HDD, SSD, NVM, etc. | The balance between price and performance should be considered; SSDs usually act as cache and journal; an unbalanced structure may lead to performance loss; one bad drive can drag down the performance of the whole pool (check with ceph osd perf). |
Drive number | Drive per server, drive per OSD | More drives increase throughput per server but decrease throughput per OSD; one OSD per platter/drive. |
Drive controller | SAS/SATA/PCIe HBA, etc. | More/better HBAs increase throughput, and HW RAID may increase IOPS. The best performance is achieved with one HBA for every 6-8 SAS drives, but it is cheaper to use a SAS expander to let one HBA control 24 (or more) drives. In short, use more HBAs and fewer expanders for maximum throughput, or use SAS expanders to minimize cost when full drive throughput is not needed. |
RAID controller | Enable/disable; cache | More recent testing with Red Hat, Supermicro, and Seagate also showed that a good RAID controller with an onboard write-back cache can accelerate IOPS-oriented write performance. Although Ceph does not need RAID (it provides both simple replication and EC itself), the right RAID controller can still improve write performance via its onboard cache. |
Drive cache | Enable or disable write cache | The write cache has a large impact on small-write performance. |

Ceph software stack
Dedicated tools for Ceph deployment, monitoring, and management have been developed, such as CeTune (Intel), VSM (Intel), OpenATTIC, and InkScope.

Functionalities of a Ceph performance tool
As shown in the overall structure of Figure 9-6, several components are integrated into this tool: SaltStack for Ceph management, InfluxDB with Telegraf for performance data collection and storage, and Grafana for data visualization. All of these tools are open source under permissive licenses (InfluxDB/Telegraf under MIT, Grafana/SaltStack under Apache v2).
SaltStack (or Salt) is open-source configuration management software and a remote execution engine written in Python. It essentially has a server-client structure. You can use Salt to manage the Ceph nodes and distribute commands for execution. InfluxDB is a time-series database built from scratch to handle high write and query loads. Telegraf, written in Go, is a metric collection daemon that can gather metrics from a wide array of inputs and write them to a wide variety of outputs. It is plugin-driven for both the collection and the output of data, which makes it easy to extend. It is a compiled, standalone binary that can be executed on any system without external dependencies; no npm/pip/gem or other package management tools are required. Once the Telegraf daemon is running, the data is automatically saved to InfluxDB. Grafana provides a powerful and elegant way to create, explore, and share dashboards and information with your team. After the databases are set up, you can configure Grafana’s data sources to point at these InfluxDB instances.
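To make the data flow concrete, a minimal telegraf.conf might collect CPU, memory, and disk-IO metrics and write them to a local InfluxDB. The fragment below is illustrative; the plugin names are standard Telegraf plugins, but the URL and database name are placeholder assumptions:

```toml
# Illustrative telegraf.conf fragment: three standard input plugins
# feeding a local InfluxDB instance.
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.diskio]]

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]   # placeholder InfluxDB endpoint
  database = "telegraf"              # placeholder database name
```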
With this tool, you can easily obtain all the necessary information. Take the example provided in the section on the Filestore IO pattern. From Table 9-2, you can clearly observe that some drives are drained of IO bandwidth, while CPU, memory, and network usage all stay below 50% at the same time. Therefore, you can conclude that the drives are the performance bottleneck.