
9. Case Study: Ceph


Ceph, an open-source distributed storage platform, provides a unified interface for object-, block-, and file-level storage [33, 80, 34, 81]. This chapter presents the block-level workload characteristics of a WD WASP/EPIC microserver-based Ceph cluster. The analysis techniques presented can help you understand the performance and drive characteristics of Ceph in production environments. In addition, I identify whether SMR, hybrid disk, and SSD drives are suitable for the Ceph workload.

The basic architecture of Ceph was described in Chapter 1. Ceph’s core, RADOS, is a fully distributed, reliable, and autonomous object store using the CRUSH (Controlled Replication Under Scalable Hashing) algorithm. Ceph’s building blocks are called OSDs (object storage daemons). OSDs are responsible for storing objects on local file systems (e.g., EXT4 and XFS), and cooperating to replicate data, detect and recover from failures, or migrate data when OSDs join or leave the cluster. Ceph’s design originated from the premise that failures are common in large-scale storage systems. Along these lines, Ceph aims to guarantee reliability and scalability by leveraging the intelligence of the OSDs. Each OSD uses a journal to accelerate write operations by coalescing small writes and flushing them asynchronously to the backing file system when the journal is full. The journal can be a separate file or located on another device or partition [82, 83, 84].

These tests are based on a “unique” platform: instead of traditional workstations, a so-called microserver structure is used for the production environment. In this system, each microserver has an individual OS and an HDD. It is almost the minimum granularity for an IO device, which essentially satisfies the original design intention of Sage Weil, the father of Ceph [80, 33]. In fact, this architecture shrinks the failure domain to a single disk unit, instead of many disks becoming inaccessible when one server fails in a multi-disk architecture. The storage cluster is scaled out by connecting microservers via a top-of-rack Ethernet switch.

A microserver-based cluster with 12 nodes (named sm1-wasp1 to sm1-wasp12) is shown in Figure 9-1. Three virtual machines (VMs) act as clients to generate the IO requests (named Cag-blaster-ixgbe-02 to Cag-blaster-ixgbe-04). All tests are run on the Ceph Jewel release. In this microserver-based configuration for filestore, each node/drive is divided into four partitions: /dev/sda1 holds the operating system (Ubuntu), /dev/sda3 is reserved, /dev/sda2 is used for metadata, and /dev/sda4 is used for user data.
../images/468166_1_En_9_Chapter/468166_1_En_9_Fig1_HTML.png
Figure 9-1

Ceph cluster topology

Filestore IO Pattern

Three VMs are used as clients to send bench write requests to a replicated pool (named rep1, with one replica) for 250 seconds, while blktrace collects traces for 310 seconds; that is, rados bench -p rep1 250 write is run from each client and blktrace /dev/sdax -w 310 from each node. Due to a limitation of blktrace (it cannot collect the individual partitions of the same drive in one run), the traces from sda2/sda4 and from the whole sda are collected separately. A bus analyzer is also used to verify the traces.
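The blkparse text output of these traces can be post-processed directly. Here is a minimal sketch that computes the read/write counts and the dominant write sizes of one partition, assuming the default blkparse line format; the file name trace_sda4.txt is only illustrative:

# Rough per-partition IO statistics from blkparse text output (a sketch).
# Assumed default line format:
#   maj,min cpu seq timestamp pid action rwbs sector + blocks [process]
from collections import Counter

def summarize(path):
    sizes_r, sizes_w = Counter(), Counter()
    for line in open(path):
        fields = line.split()
        if len(fields) < 10 or fields[5] != "C":   # completed requests only
            continue
        rwbs, blocks = fields[6], int(fields[9])
        if "R" in rwbs:
            sizes_r[blocks] += 1
        elif "W" in rwbs:
            sizes_w[blocks] += 1
    total_r, total_w = sum(sizes_r.values()), sum(sizes_w.values())
    print("reads:", total_r, "writes:", total_w,
          "R/W ratio:", round(total_r / max(total_w, 1), 2))
    print("top write sizes (blocks):", sizes_w.most_common(5))

summarize("trace_sda4.txt")   # hypothetical blkparse output file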

The common properties of the 12 nodes are listed in Table 9-1; you can observe that the basic properties are generally similar. One set of IO pattern curves is illustrated in Figures 9-2 and 9-3, where the three rows represent sda2, sda4, and sda, respectively. Note that all wasp nodes have the write cache enabled. The command ceph tell osd.* bench 41943040 4194304 reports around 100 MBps (cached), which means that the three clients with 32 threads each almost fully utilize the disk bandwidth. The reason is explained later.

You can also see that differences in IO patterns still exist between nodes; for example, the read/write ratio is high in some nodes and low in others, and the idle time distribution varies. Based on the read/write ratio, the IO patterns can be roughly divided into two classes: read dominated and write dominated. When reads dominate, the average read size becomes smaller.

Table 9-1

Common Properties for Ceph Nodes

Properties | Metadata | Data
R/W | Mixed read and write | No read requests
Size | Relatively small requests (8-block requests dominate); the size of writes varies widely; the R/W ratio varies widely | 1024-block requests dominate, followed by small blocks
Sequence | Mode = 8 (very small); relatively more random over a small range; much higher near-sequential ratio than strict sequential ratio (small gaps exist for 50% of requests) | Mode = 0; high sequential ratio; higher near-sequential ratio (small gaps exist for 5% of write requests)
Write update | High update ratio (>50% of write requests updated) | Low update ratio (updated blocks <1%, more partial updates)
Write stack distance | Relatively small distance to achieve a high percentage of hits; small average overlap size (8 blocks); write cache is necessary | Relatively large distance to achieve a high percentage of hits; small average overlap size; write cache is unnecessary
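The write update ratio in Table 9-1 can be estimated directly from the (LBA, size) pairs of the write requests. A simple block-level sketch (the input list is illustrative; for long traces a more memory-efficient interval structure would be needed):

# Fraction of written blocks that are written more than once in the trace.
def write_update_ratio(writes):
    seen, updated = set(), set()
    for lba, blocks in writes:
        for b in range(lba, lba + blocks):
            if b in seen:
                updated.add(b)
            else:
                seen.add(b)
    return len(updated) / max(len(seen), 1)

print(write_update_ratio([(0, 8), (4, 8), (1024, 1024)]))   # toy example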

../images/468166_1_En_9_Chapter/468166_1_En_9_Fig2_HTML.png
Figure 9-2

IO pattern in different partitions: LBA and size distribution

../images/468166_1_En_9_Chapter/468166_1_En_9_Fig3_HTML.png
Figure 9-3

IO pattern in different partitions: Throughput and IOPS

Table 9-2 shows the total idle time for different nodes under different scenarios. Basically, the idle time is unevenly distributed, which means that the workload on each node is actually uneven. In other words, some nodes are very busy, such as wasp12 in the case of “3-8” (3 clients, 8 threads), while some are very “lazy,” such as wasp8 in the same case. This is partially due to the CRUSH algorithm, which is in charge of PG (placement group) allocations. Although a reasonable number of clients and threads may alleviate the uneven distribution, it does not essentially solve this problem. Thus, some improvement policies, such as asynchronous operation, active feedback, and adjustable PGs, should be implemented. (A short sketch of how the dispersion metrics in this table are computed follows the table.)
Table 9-2

Total Idle Time in the First 240 Seconds

CT | wasp4 | wasp5 | wasp6 | wasp7 | wasp8 | wasp9 | wasp10 | wasp11 | wasp12 | mean | std | std/mean | max/min
3–8 | 52.48 | 17.77 | 55.34 | 44.65 | 114.21 | 9.1 | 48.49 | 21.13 | 0.7 | 40.43 | 34.17 | 0.85 | 162.2
3–16 | 107.78 | 0.15 | 76.31 | 113.58 | 0.85 | 39.87 | 5.91 | 22.52 | 61.66 | 47.63 | 44.53 | 0.93 | 746.81
3–32 | 48.09 | 63.42 | 35.93 | 4.19 | 8.45 | 0.01 | 25.61 | 0 | 4.87 | 21.17 | 23.34 | 1.1 | 895803
3–64 | 69.69 | 37.38 | 79.85 | 33.21 | 32.67 | 30.81 | 29.85 | 31.31 | 30.2 | 41.66 | 19.07 | 0.46 | 2.68
1–8 | 150.49 | 15.65 | 130.66 | 0.31 | 7.65 | 5.34 | 127.56 | 1.01 | 5.18 | 49.32 | 65.63 | 1.33 | 488.16
1–16 | 21.65 | 5.19 | 48.3 | 151.41 | 39.05 | 2.13 | 19.54 | 37.68 | 17.54 | 38.05 | 45.22 | 1.19 | 71.03
1–32 | 8.13 | 5.59 | 95.29 | 32.94 | 6.36 | 9.03 | 78.46 | 102.52 | 42.2 | 42.28 | 39.91 | 0.94 | 18.34
1–64 | 17.03 | 114.01 | 58.21 | 1.23 | 32.77 | 17.55 | 9.72 | 6.1 | 36.54 | 32.57 | 35.32 | 1.08 | 92.77
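The dispersion columns of Table 9-2 (std/mean and max/min) can be reproduced once the idle time of each node has been summed from the gaps between consecutive requests. A sketch with illustrative values:

# Idle time per node: sum of inter-request gaps above a small threshold,
# then the dispersion metrics used in Table 9-2.
import statistics

def idle_time(timestamps, threshold=0.001):
    ts = sorted(timestamps)
    return sum(g for g in (b - a for a, b in zip(ts, ts[1:])) if g > threshold)

idle = {"wasp4": 52.48, "wasp8": 114.21, "wasp12": 0.7}   # illustrative totals
vals = list(idle.values())
mean, std = statistics.mean(vals), statistics.stdev(vals)
print("std/mean = %.2f, max/min = %.1f" % (std / mean, max(vals) / min(vals)))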

Performance Consistency Verification

Performance consistency is a basic requirement for enterprise storage systems, as it guarantees performance repeatability under the same conditions. There are several approaches to check it; Table 9-3 gives a summary.
Table 9-3

Comparison of Three Approaches

Approach | Pro | Con
Hypothesis | Full view with relatively full information; consistency in a relatively strict sense | Hardly satisfied
Average only | Simple and relatively easily satisfied | Partial view with limited information on the average only
Range tolerance | Engineer’s view in practice; easy to check | Partial view; usually experiment dependent

The first one is the hypothesis approach, which can be used to test whether two or more samples have the same mean (and variance), median, or distribution in a statistical sense. A simple procedure is as follows:
  1. Check if all rounds of tests have reached steady state.
  2. Use the steady-state data of each round as a sample vector for an overall consistency test or a one-to-one (paired) test.
  3. Select a proper hypothesis test for the given requirements/assumptions.
Some common hypothesis tests are used in different scenarios (a scipy-based sketch follows below):
  • F-test: Requires that each sample vector is normally distributed; if the final p-value is smaller than the predefined significance level (0.05 by default), you reject the hypothesis that these samples have the same mean.
  • H-test: Requires that each sample vector comes from a continuous distribution (a weaker condition); if the final p-value is smaller than the predefined significance level, you reject the hypothesis that these samples have the same median.
  • T-test: Applicable to paired (two-sample) tests only; if the final p-value is smaller than the predefined significance level, you reject the hypothesis that the two samples have the same mean.

This approach actually gives the result in a relatively strict sense. However, you may allow some differences in most cases.
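A sketch of these three tests with scipy.stats; the per-round sample vectors here are synthetic and only illustrate the calls:

# Consistency tests across rounds (a sketch with synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rounds = [rng.normal(loc=m, scale=15, size=60) for m in (805, 810, 795)]  # MBps samples

for i, r in enumerate(rounds):                       # normality needed for the F-test
    print("round %d normality p-value: %.3f" % (i, stats.normaltest(r).pvalue))

print("F-test p-value:", stats.f_oneway(*rounds).pvalue)    # same mean
print("H-test p-value:", stats.kruskal(*rounds).pvalue)     # same median
print("T-test p-value:", stats.ttest_rel(rounds[0], rounds[1]).pvalue)  # paired, two rounds
# Reject the corresponding null hypothesis when the p-value is below 0.05.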

The second one uses a simplified statistical method, which considers only the average value without the overall trend and is usually for rough estimation only (a sketch follows this list):
  1. Get the average values of the metrics of interest for each test (possibly in steady state).
  2. Form a sample vector with the average values from all rounds.
  3. Test whether it follows a normal distribution (or another experimental distribution, such as uniform) with an acceptable variance.
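A sketch of this average-only check, using the per-round means of the rand_6 column from Table 9-5 as the sample vector (Shapiro-Wilk is used here because it works for small samples):

# Average-only consistency check on the vector of per-round means.
import numpy as np
from scipy import stats

round_means = np.array([679.72, 704.20, 814.66, 907.08, 891.53, 902.05, 762.12])

print("std/mean = %.2f" % (round_means.std(ddof=1) / round_means.mean()))
stat, p = stats.shapiro(round_means)       # small-sample normality test
print("Shapiro-Wilk p-value = %.3f (p >= 0.05: no evidence against normality)" % p)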

     
The third one is the range tolerance approach, which checks whether the performance is within a certain region that you can tolerate experimentally (a sketch follows this description):
  1. Check if each run’s values are within a certain range of this run’s mean or expected experimental value. There are two cases: one requires it for all data points, such as latency, and the other only for almost all points, such as throughput.
  2. Check if the average value of each run is within a certain range of the mean of all runs.

This approach usually needs experts to set up proper thresholds in order to construct a reasonable range.
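Both checks can be expressed in a few lines; a sketch with illustrative run data and a 20% tolerance:

# Range-tolerance checks: (1) points of each run vs. that run's mean,
# (2) each run's mean vs. the overall mean of all runs.
import numpy as np

def in_range_fraction(samples, center, tol):
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(np.abs(samples - center) <= tol * center))

runs = [np.array([800.0, 812, 790, 705]), np.array([760.0, 802, 815, 820])]  # illustrative

for i, run in enumerate(runs):
    frac = in_range_fraction(run, run.mean(), tol=0.20)
    print("run %d: %.0f%% of points within +/-20%% of the run mean" % (i, 100 * frac))

overall = np.mean([run.mean() for run in runs])
for i, run in enumerate(runs):
    ok = abs(run.mean() - overall) <= 0.20 * overall
    print("run %d: mean within +/-20%% of the overall mean: %s" % (i, ok))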

Let’s take a look at an example with seven rounds of tests in the same environment, shown in Table 9-4. Each round contains three random read and three sequential write tests. Since an F-test requires normality, you begin with a normality test on each round. In some cases, if you cannot capture enough data, you may simply mark the round as invalid. In this example, only two rounds out of seven pass the normality test for the test named rand_6, and those two rounds likely have the same mean. Overall, the results indicate that the performance is not strictly consistent.
Table 9-4

An Example of the Hypothesis Approach

Value | rand_6 | rand_4 | rand_2 | write_5 | write_3 | write_1
Normal | 2 | 3 | 2 | 6 | 0 | 2
Non-normal | 5 | 3 | 5 | 1 | 7 | 5
Invalid | 0 | 1 | 0 | 0 | 0 | 0
F-value | 3.114 | 60.405 | 3.261 | 24.002 | 24.002 | 57.901
P-value | 0.079 | 0 | 0.073 | 0 | –1 | 0
Result | 1 | 0 | 1 | 0 | –1 | 0

If you switch to the average-only and range tolerance approaches, you may reach another conclusion in a relaxed sense, as shown in Tables 9-5 and 9-6. Table 9-5 shows the average throughput in MBps for each round, as well as the overall mean and standard deviation. Table 9-6 shows the difference ratio between each round and the overall average; you can see that most ratios are within 10%. If the customers can allow a 20% range, you may say that the system satisfies the performance consistency requirement. Note that the curve of each round shall also satisfy some range requirement in a “continuous” sense. Figure 9-4 shows one example of six tests. Table 9-7 gives the ratio of values that fall within ±20% or ±10% of the mean (a sketch of this computation follows Table 9-7). If a 20% range is set, you can see that only the test named write_5 does not satisfy the requirement.
Table 9-5

Summary for Bandwidth of Rados Bench

Mean | rand_6 | rand_4 | rand_2 | write_5 | write_3 | write_1
R0 | 679.72 | 1642.15 | 1613.56 | 327.86 | 1261.88 | 405.87
R1 | 704.20 | 1596.64 | 1446.77 | 345.57 | 1275.05 | 386.54
R2 | 814.66 | 1646.67 | 1503.92 | 397.04 | 1322.19 | 442.87
R3 | 907.08 | 1566.22 | 1529.88 | 401.05 | 1225.51 | 394.28
R4 | 891.53 | 1539.09 | 1399.01 | 409.12 | 1200.37 | 360.88
R5 | 902.05 | 1507.67 | 1194.60 | 416.98 | 1310.67 | 422.73
R6 | 762.12 | 1524.56 | 1164.23 | 361.48 | 1186.59 | 336.50
Mean | 808.77 | 1574.71 | 1407.42 | 379.87 | 1254.61 | 392.81
Std | 88.65 | 51.49 | 157.14 | 32.06 | 48.69 | 33.39
Std/Mean | 0.11 | 0.03 | 0.11 | 0.08 | 0.04 | 0.09

Table 9-6

Comparison via Range Tolerance Approach

Diff Ratio | rand_6 | rand_4 | rand_2 | write_5 | write_3 | write_1
R0 | –0.160 | 0.043 | 0.146 | –0.137 | 0.006 | 0.033
R1 | –0.129 | 0.014 | 0.028 | –0.090 | 0.016 | –0.016
R2 | 0.007 | 0.046 | 0.069 | 0.045 | 0.054 | 0.127
R3 | 0.122 | –0.005 | 0.087 | 0.056 | –0.023 | 0.004
R4 | 0.102 | –0.023 | –0.006 | 0.077 | –0.043 | –0.081
R5 | 0.115 | –0.043 | –0.151 | 0.098 | 0.045 | 0.076
R6 | –0.058 | –0.032 | –0.173 | –0.048 | –0.054 | –0.143
Max | 0.122 | 0.046 | 0.146 | 0.098 | 0.054 | 0.127
Min | –0.129 | –0.043 | –0.173 | –0.137 | –0.054 | –0.143

../images/468166_1_En_9_Chapter/468166_1_En_9_Fig4_HTML.jpg
Figure 9-4

Bandwidth from six tests in one round

Table 9-7

Ratio of Values within a Given Range Around Mean

Ratio | rand_6 | rand_4 | rand_2 | write_5 | write_3 | write_1
± 0.2 | 0.91 | 1 | 0.99 | 0.44 | 1 | 0.9
± 0.1 | 0.73 | 0.99 | 0.89 | 0.2 | 0.91 | 0.6
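The diff ratios in Table 9-6 (and therefore its Max/Min rows) follow directly from the per-round means in Table 9-5; for instance, for the rand_6 column:

# Diff ratio of each round's mean against the overall mean (rand_6 column).
means = [679.72, 704.20, 814.66, 907.08, 891.53, 902.05, 762.12]
overall = sum(means) / len(means)
ratios = [(m - overall) / overall for m in means]
for i, r in enumerate(ratios):
    print("R%d: %+.3f" % (i, r))
print("max %.3f, min %.3f" % (max(ratios), min(ratios)))
# The in-range ratios of Table 9-7 are computed the same way, but over the
# per-second bandwidth samples of each test rather than the round means.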

Bottleneck Identification

Ceph is a rather complex system whose performance is determined by both hardware and software [41]. From the hardware point of view, CPU, memory, disk, and network are the four major components; Tables 9-8 and 9-9 give some general views. From the software perspective, there are even more factors, such as the file system, Linux OS settings, memory allocator, and more. In addition, the Ceph system configuration provides hundreds of parameters, and many of them affect the overall performance. Therefore, it is generally difficult to identify the performance bottleneck of the overall system.
Table 9-8

Impact of CPU, Memory, and Network

Variables | Options | Remarks
CPU | Core number, speed, structure, instruction set, etc. | A common recommendation is at least one (virtual) core per OSD. Faster CPU cores usually help performance, although the CPU structure (e.g., Intel vs. ARM, internal architecture/versions) also matters for the real perf/GB, perf/$, and so on. Turning off energy-saving mode helps.
Memory | RAM per server, RAM per OSD, etc. | A common recommendation is at least 1GB per 1TB of OSD, and preferably 2GB per OSD. The actual value is workload dependent.
BIOS | HT mode, energy saving, NUMA, etc. | Enabling HT increases the number of virtual cores. Consider the tradeoff of energy saving: lower power but fewer computational resources allocated.
Network switch/NIC | Bandwidth and latency; Ethernet, Fiber, InfiniBand, etc. | Higher bandwidth gives higher throughput to an extent; lower latency helps small IO. Try ms crc data = false and ms crc header = false for high-quality networks. For a cluster with fewer than 20 spinners or 2 SSDs, consider upgrading to 25GbE or 40GbE.

Table 9-9

Impact of Disk

Variables | Options | Remarks
Drive type | HDD, SSD, NVM, etc. | The balance between price and performance shall be considered; usually an SSD acts as cache and journal; an unbalanced structure may lead to performance loss; one bad drive can affect the overall pool performance (ceph osd perf).
Drive number | Drives per server, drives per OSD | More drives increase throughput per server but decrease throughput per OSD; one OSD per platter/drive.
Drive controller | SAS/SATA/PCIe HBA, etc. | More/better HBAs increase throughput. HW RAID may increase IOPS. The best performance is achieved with one HBA for every 6-8 SAS drives, but it is cheaper to use a SAS expander to let one HBA control 24 (or more) drives. Use more HBAs and fewer expanders to achieve maximum throughput, or apply SAS expanders to minimize cost when full drive throughput is not needed.
RAID controller | Enable/disable; cache | More recent testing with Red Hat, Supermicro, and Seagate also showed that a good RAID controller with onboard write-back cache can accelerate IOPS-oriented write performance. While Ceph does not use RAID (since it supports both simple replication and EC), the right RAID controller cache can still improve write performance via the onboard cache.
Drive cache | Enable or disable write cache | The cache has a large impact on small-write performance.

In this sense, you shall monitor all necessary components of the system in order to draw a conclusion. Take the Ceph software stack in Figure 9-5 as an example. You may deploy the corresponding system monitoring tools at the stack points of interest to collect data, so that you can capture all possible places for software failures, errors, and performance degradation, plus all potential software performance tuning points. In fact, Ceph has built-in tracepoints for monitoring tools such as LTTng.
../images/468166_1_En_9_Chapter/468166_1_En_9_Fig5_HTML.jpg
Figure 9-5

Ceph software stack

Dedicated tools for Ceph deployment, monitoring, and management have been developed, such as CeTune (Intel), VSM (Intel), OpenATTIC, and InkScope.

With the aid of some integrated tools, such as SaltStack, you may design your own all-in-one tool. Figure 9-6 shows one possible design, which integrates the functionalities of configuration, deployment, benchmarking, and measurement analysis of Ceph systems. It can be developed using the “glue” programming language Python, together with some Bash scripts.
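For example, the command-distribution part of such a tool can sit directly on top of the Salt Python API; a minimal sketch, assuming a configured Salt master and minions on the Ceph nodes:

# Run a command on all Ceph nodes from the Salt master and collect the
# per-minion output (a sketch; requires salt-master/minion to be set up).
import salt.client

local = salt.client.LocalClient()
results = local.cmd("*", "cmd.run", ["ceph osd perf"])   # any shell command works here
for minion, output in sorted(results.items()):
    print(minion, ":", output)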
../images/468166_1_En_9_Chapter/468166_1_En_9_Fig6_HTML.jpg
Figure 9-6

Functionalities of a Ceph performance tool

As shown in the overall structure of Figure 9-6, several components are integrated into this tool, such as SaltStack for Ceph management, InfluxDB with Telegraf for performance data collection and storage, and Grafana for data visualization. All these tools are open source under free licenses (InfluxDB/Telegraf under MIT, Grafana/SaltStack under Apache v2).

The SaltStack platform, or Salt, is open-source configuration management software and a remote execution engine written in Python. It essentially has a server-client structure. You can use Salt to manage the Ceph nodes and distribute execution commands. InfluxDB is a time-series database built from scratch to handle high write and query loads. Telegraf, developed in Go, is a metric collection daemon that can gather metrics from a wide array of inputs and write them to a wide range of outputs. It is plugin-driven for both the collection and the output of data, for easy extension. It is a compiled, standalone binary that can be executed on any system without external dependencies; no npm/pip/gem or other package management tools are required. Once the Telegraf daemon is running, the data is automatically saved to InfluxDB. Grafana provides a powerful and elegant way to create, explore, and share dashboards and information with your team and the world. After the databases are set up, you can configure Grafana’s data sources to point to these InfluxDB databases.
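Once Telegraf is writing into InfluxDB, the same metrics can also be pulled programmatically; a sketch with the influxdb Python client, assuming Telegraf's default database name (telegraf) and its diskio measurement (verify both for your setup):

# Query per-drive IO counters collected by Telegraf from InfluxDB (a sketch).
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telegraf")
query = ('SELECT "host", "name", "write_bytes" FROM "diskio" '
         'WHERE time > now() - 5m LIMIT 10')      # write_bytes is a cumulative counter
for point in client.query(query).get_points():
    print(point["time"], point["host"], point["name"], point["write_bytes"])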

With this tool, you can easily obtain all the necessary information. Take the example provided in the section on the filestore IO pattern: from Table 9-2, you can clearly observe that some drives are drained of IO bandwidth, while the CPU, memory, and network usages are all lower than 50% at the same time. Therefore, you can conclude that the drives are the performance bottleneck.
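Drawing such a conclusion requires sampling all four hardware components together on each node; a per-node sketch with psutil (the 5-second window and the thresholds are illustrative):

# Sample CPU, memory, disk, and network utilization on one node (a sketch).
import psutil

disk0, net0 = psutil.disk_io_counters(), psutil.net_io_counters()
cpu = psutil.cpu_percent(interval=5)                 # % over a 5-second window
mem = psutil.virtual_memory().percent
disk1, net1 = psutil.disk_io_counters(), psutil.net_io_counters()

disk_mbps = (disk1.read_bytes + disk1.write_bytes
             - disk0.read_bytes - disk0.write_bytes) / 5 / 2**20
net_mbps = (net1.bytes_sent + net1.bytes_recv
            - net0.bytes_sent - net0.bytes_recv) / 5 / 2**20
print("cpu %.0f%%, mem %.0f%%, disk %.1f MB/s, net %.1f MB/s"
      % (cpu, mem, disk_mbps, net_mbps))
# If CPU, memory, and network stay well below saturation while the disks run
# near their measured bandwidth (around 100 MBps here), the drives are the
# performance bottleneck.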