
6. Case Study: Modern Disks


Modern disks implement many different features, such as media-based cache (e.g., using a portion of disk space to log some random write accesses), DRAM protection (e.g., using a small NVM to temporarily store data from the DRAM cache during a power loss so that the write cache can always be enabled), hybrid structure (e.g., migrating hot data to high-speed devices and cold data to low-speed devices so that the overall access time is reduced), etc. A hybrid disk (e.g., SSHD), one type of hybrid structure, has advantages in scenarios where data hotness is significant. Some emerging and future techniques like SMR, HAMR, and BPR favor sequential access in order to reduce garbage collection, lower energy consumption, and/or extend device life. This chapter shows how trace analysis can help to identify these mechanisms via workload property analysis, using two examples: SSHD and SMR drives.

SSHD

In this section, let’s explore the mystery behind SSHD’s performance enhancement in SPC-1C [53] under WCD (write cache disabled): the SSD/DRAM cache and the self-learning algorithm [56, 57, 16]. I collected data from the XGIG bus analyzer and monitored the response with a LeCroy scope, with the workload generated by the SPC-1C tool. Techniques such as pattern recognition, curve fitting, and queueing theory are applied in the analysis.

From Figure 6-1, you can see that under WCD the SSHD’s IOPS is more than double that of traditional HDDs: the SSHD reaches around 570 IOPS, while the traditional HDDs (two models: Savvio from Seagate and Sirius from WD) only reach around 200 IOPS when the response time is no more than 30ms. The task here is to find the reasons for the performance improvement of the hybrid structure via trace analysis. The basic idea is to compare several drives with a certain level of similarity: inject the same workloads into similar drives, isolate what they have in common, and compare the differences. For example, the similar CMR models in Table 6-1 are selected.
Figure 6-1. SSHD performance comparison with traditional HDDs

Table 6-1

Similar Models Chosen for Comparison

                      SSHD              CMR A       CMR B       CMR C
Capacity (GB)         600               900         600         900
RPM                   10.5K             10.5K       10K         10.5K
Bytes per sector      512/520/524/528   512         512         512
Discs                 2                 3           2           3
Average latency (ms)  2.9               2.9         3           2.9
DRAM cache            128MB             64MB        32MB        64MB
NAND                  16GB eMLC         None        None        None
Interface             6Gb/s SAS         6Gb/s SAS   6Gb/s SAS   6Gb/s SAS

You know from the previous chapter that random write accesses dominate the IO requests in SPC-1C, which means the write cache actually plays an important role. However, the write cache is supposed to be disabled under WCD. Is this true for this SSHD? To verify it, you can do a simple test: inject random write requests into the SSHD and calculate the CCT/qCCT/TtoD times. If the write cache is really disabled, all requests will be written to the media directly, which costs roughly 10ms per request. However, from the trace, you can observe many requests with response times of less than 1ms at the beginning. Therefore, the write cache is actually active even with the WCD setting. This benefits from the NAND-backed DRAM cache protection technique, so part of the cached data can be written to NAND right after system power is lost.
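
Here is a minimal sketch (in Python) of this check: count how many of the injected random writes complete far faster than a media-bound write would. The 1ms hit threshold and the plain list-of-response-times input are assumptions, not part of the original test setup.

```python
def cached_write_fraction(cct_ms_list, hit_threshold_ms=1.0):
    """Return the fraction of writes completing fast enough to be cache hits."""
    # A media-bound random write costs roughly 10 ms, so sub-millisecond
    # completions indicate the DRAM write cache is still being used.
    hits = sum(1 for t in cct_ms_list if t < hit_threshold_ms)
    return hits / len(cct_ms_list) if cct_ms_list else 0.0
```

If this fraction is clearly above zero at the beginning of the trace, the write cache is active despite the WCD setting.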

Now let’s analyze two essential problems: the cache size and access isolation.

Cache Size

We begin with the question, “How much DRAM is used as write cache under WCD?” First, let’s make sure that the test is repeatable (i.e., that the result is consistent). To verify this, perform the following procedure.
  1. Connect the SSHD to the XGIG bus analyzer and power the SSHD off and on.

  2. Send 100 random 8K write requests to the SSHD using IOMeter or another tool, and repeat the same requests 10 times.

  3. Repeat Steps 1-2 four times with the same requests.

  4. Compare the XGIG traces and find the access pattern via a trace analyzer tool.

  5. Repeat Steps 1-4 with the request number changed to 200 and 400.

  6. Do the same test on a different SSHD with the same IO pattern.
In Step 4, a similar access pattern (LBA vs. CCT) should be observed (a minimal comparison sketch follows these notes). Note that you are checking the DRAM write cache in this case, so only random write requests are used. For a full cache check, you may also try random read, mixed read/write, and mixed random/sequential patterns. If a similar pattern is observed, you may conclude that the result is consistent and can be used to identify some internal information. Otherwise, you should find out why. One reason is that the SSD/DRAM cache is not cleaned before a new test; in that case, you need some cache flush commands or disk initialization commands to force it empty. Also note that:
  • R/W DRAM cache may share the same space.

  • The SSD mapping table may share the same space with R/W DRAM cache (a good case: SSD mapping table uses a dedicated DRAM space).

  • The SSD reboot self-learning procedure may take DRAM space.
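
As a rough aid for Step 4, here is a minimal sketch (Python with pandas assumed) for comparing the repeated runs. The CSV layout and the column names (lba, cct_ms) are hypothetical placeholders for whatever your trace-analyzer export provides, and the 15% tolerance is an assumption.

```python
import pandas as pd

def load_cct(csv_path):
    """Load one run's exported trace and return its CCT sequence in ms."""
    df = pd.read_csv(csv_path)          # assumed columns: lba, cct_ms
    return df["cct_ms"].to_numpy()

def runs_are_consistent(paths, rel_tol=0.15):
    """Check whether repeated runs show a similar response-time pattern."""
    stats = []
    for p in paths:
        cct = load_cct(p)
        stats.append((p, cct.mean(), cct.std()))
        print(f"{p}: mean={cct.mean():.3f} ms, std={cct.std():.3f} ms")
    base_mean = stats[0][1]
    return all(abs(m - base_mean) <= rel_tol * base_mean for _, m, _ in stats)

# Example: four repeated runs of the 100 x 8K random-write test
# runs_are_consistent([f"run{i}.csv" for i in range(1, 5)])
```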

Second, implement the following procedure to make sure each test starts with a clean cache:
  1. Power the SSHD off and on (to make sure the DRAM write cache is cleared).

  2. Send 1000 random 8K write requests to the SSHD with queue depth = 1 using IOMeter.

  3. Repeat Steps 1-2 ten times, each time with a different request size, such as 16K, 32K, 64K, 128K, ..., 2048K.
Once you capture the traces, perform the following post-processing:
  1. Count the number of DRAM write hits in the first portion of the total accesses for each run by isolating DRAM accesses from the others (DRAM CCT/qCCT is generally much smaller than that of other accesses).

  2. Choose the maximum count from each run.

  3. Calculate the hit numbers and the corresponding actual cache size.

  4. Find the turning point, which provides a hint of the cluster size.

  5. Refine the turning point by narrowing the region. For example, if the turning point is within [256K, 512K], then some more points, such as 300K, 400K, and 500K, may be used.
Note that this model of SSHD has a read-cache-only SSD, so DRAM accesses will not be mixed with SSD write accesses, which simplifies the analysis in Figures 6-2 through 6-5. Figure 6-2 shows the traces from IOMeter random write tests (request sizes from 1K to 1M). Assume that the write cache is empty.1 Then the first portion of each run could be the DRAM write cache hits.
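
Here is a minimal sketch of the post-processing steps above, assuming each run has already been parsed into a list of response times (ms) in arrival order. The 1ms hit threshold, the 70% drop used to flag the turning point, and the dictionary layout are assumptions, not values from the original procedure.

```python
HIT_THRESHOLD_MS = 1.0   # completions well below media latency are treated as DRAM hits

def count_leading_hits(response_times_ms):
    """Step 1: count consecutive cache hits at the beginning of a run."""
    n = 0
    for t in response_times_ms:
        if t >= HIT_THRESHOLD_MS:
            break
        n += 1
    return n

def cache_table_and_turning_point(runs_by_size_kb):
    """Steps 2-4: runs_by_size_kb maps request size (KB) -> list of runs."""
    table = {}
    for size_kb, runs in sorted(runs_by_size_kb.items()):
        hits = max(count_leading_hits(r) for r in runs)        # Step 2
        table[size_kb] = (hits, hits * size_kb / 1024.0)       # Step 3: cache size in MB
    sizes = sorted(table)
    turning_point = None
    for prev, cur in zip(sizes, sizes[1:]):
        if table[cur][0] < 0.7 * table[prev][0]:               # Step 4: sharp drop in hit count
            turning_point = (prev, cur)                        # refine within this interval (Step 5)
            break
    return table, turning_point
```

The resulting table mirrors Tables 6-2 and 6-3: the counted hit number per request size and the implied cache size, with the turning point marking where the hit count collapses.
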
Figure 6-2. IOMeter traces for SSHD

Zoom in to find the hits. Figures 6-3, 6-4, and 6-5 give three examples where the request sizes are 1KB, 512KB, and 1MB, respectively. In Figures 6-3 and 6-4, you can observe obvious write cache hits, and the total hit number for 1KB is much larger than that for 512KB due to the limited cache size. However, when the size is increased to 1MB, no obvious write cache hit is observed, which means a threshold between 512K and 1M is set as the turning point for different request sizes. This also indicates that large requests go directly to the media. With the same steps, you can obtain the required values for WCE and WCD, as shown in Tables 6-2 and 6-3, respectively.
Figure 6-3. 1K request trace details

Figure 6-4. 512K request trace details

Figure 6-5. 1024K request trace details

Table 6-2

Comparison Under WCE

SSHD                                      CMR A
Size     Counted number   Cache size      Size     Counted number   Cache size
1K       98               0.1M            1K       98               0.1M
4K       98               0.4M            4K       98               0.4M
16K      97               1.5M            16K      99               1.5M
64K      102              6.4M            64K      102              6.4M
128K     103              12.9M           128K     104              12.9M
256K     111              27.8M           256K     110              27.8M
512K     115              57.5M           512K     66               33M
520K     -                -               880K     35               30M
900K     -                -               900K     42               36.9M
1024K    N.A.             -               1000K    36               36M

 
Table 6-3

Two Cases Under WCD of SSHD

SSHD (WCD)   test1                           test2
Size         Counted number   Cache size     Counted number   Cache size
1K           98               0.1M           101              101K
4K           100              0.4M           100              400K
16K          99               1.5M           100              1600K
64K          101              6.3M           101              6464K
128K         54               6.75M          54               6.75M
256K         26               6.5M           26               6.5M
512K         12               6.0M           13               6.5M
520K         -                -              12               6.1M

From the turning point, you may also guess the cache cluster/segment size. For example, the SSHD’s write cluster size is around 64K under WCD, while CMR A’s is around 256KB. The SSHD uses up to about 60MB of DRAM as write cache under WCE, while only around 8MB is used under WCD, with some 100 segments.

Access Isolation

You saw in the previous chapter that SPC-1C has a large portion of local accesses. This property brings the possibility that some data can be cached in DRAM or SSD and accessed quickly later. The second question is then “how many accesses are actually directed to DRAM or SSD?” This is generally a difficult task. However, as the access times of DRAM, SSD, and HDD are significantly different, you can roughly isolate which commands are served from each place. The basic idea is to observe the behaviors of the different accesses, apply data classification and pattern recognition methods to find the access pattern, and run repeated random read tests to find the turning points. Although the procedure is similar to the previous case, here you need to change the number of requests sent to the SSHD:
  1. Send 100/200/256/257/etc. 8K read requests to the SSHD, repeat each number 20-100 times, and refine the number of commands to be sent according to the access pattern.

  2. Suppose the turning point is X. Send X random read commands with sizes 16K, 32K, ..., 1024K to the SSHD and find the cluster size according to the turning point.
To verify whether the repeat number is high enough, check the steady state of the response time. Figure 6-6 provides an example where 100 rounds are run. You can see that from the third round onward, the mean and standard deviation of the response times are almost constant. Thus, 10 repeats should be enough in this case.
Figure 6-6. Steady state of response time
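
As a rough aid for this steady-state check, here is a minimal sketch, assuming response times (ms) are grouped into one list per repeated round; the 5% tolerance is an assumption, not a value from the text.

```python
import statistics

def rounds_to_steady_state(rounds, tol=0.05):
    """Return the first round index where mean/std of response time settle.

    rounds: list of lists of response times (ms), one inner list per round.
    """
    prev_mean = prev_std = None
    for i, times in enumerate(rounds, start=1):
        mean = statistics.mean(times)
        std = statistics.pstdev(times)
        if prev_mean is not None:
            if (abs(mean - prev_mean) <= tol * prev_mean and
                    abs(std - prev_std) <= tol * max(prev_std, 1e-9)):
                return i    # statistics have stopped changing noticeably
        prev_mean, prev_std = mean, std
    return None             # never settled within the given rounds
```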

Figure 6-7 shows a case where 100 random read requests of 8KB size were sent to the SSHD 20 times. In the first round, all reads went to the media. After several rounds, the read data becomes hot and is eventually all cached in DRAM. Slowly increase the number of requests to check how many requests the DRAM read cache can hold.
Figure 6-7. 100 8K random reads, repeated 20 times

Figure 6-8 illustrates the results when 250 requests are sent repeatedly. You can see that the DRAM cache can fully hold at least 250 segments. However, when you slightly increase the number to 257, destage starts. When it is further increased to 260, destage from DRAM to SSD clearly happens at a relatively high rate, as illustrated in Figure 6-9. The destage has certain adaptive steps: as the hit number (access frequency) of the data increases, destage becomes more frequent.
Figure 6-8. 250 8K random reads, repeated 100 times

Figure 6-9. 260 8K random reads, repeated 100 times

Thus, you may guess that 256 could be the maximum read segment number, as no destage happens as long as this number is not exceeded. Note that you can explore the destage pattern by varying the time intervals and request sizes.

I leave the issue of identifying which device serves each request as an exercise here. Take a look at Figure 6-10, where 260 8KB read requests were repeatedly sent to the SSHD 100 times. As this number exceeds the DRAM capacity, some requests go to the SSD. The response times of the 1st, 3rd, 14th, 20th, and 50th runs are shown. You can see that in the very first run, all requests went to the disk media. Starting from the second round, some went to DRAM and some to the media. By the 20th round, accesses to DRAM, SSD, and media all existed. However, by around the 50th round, most requests went to DRAM and SSD. In fact, you can see a clear gap in the response times of these read requests. Basically, you may say that those below 0.1ms are DRAM accesses, and those above 0.2ms are mostly from the SSD. Now you can get estimated statistical values for the response times of the SSD and DRAM, as shown in Table 6-4; a minimal classification sketch follows Figure 6-10.
Figure 6-10. Read response time pattern over repeated rounds
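
Here is a minimal sketch of this rough classification. The 0.1ms and 0.2ms boundaries come from the text; the cut-off separating the SSD from the rotating media is an assumption.

```python
import statistics

DRAM_MAX_MS = 0.1     # below this: treat as a DRAM hit
SSD_MIN_MS = 0.2      # above this (and below MEDIA_MIN_MS): treat as an SSD hit
MEDIA_MIN_MS = 2.0    # assumed boundary above which the request went to the media

def classify(cct_ms):
    if cct_ms < DRAM_MAX_MS:
        return "DRAM"
    if SSD_MIN_MS <= cct_ms < MEDIA_MIN_MS:
        return "SSD"
    if cct_ms >= MEDIA_MIN_MS:
        return "media"
    return "unknown"   # falls in the gap between the DRAM and SSD thresholds

def per_device_stats(cct_list_ms):
    """Group completion times by device class and report mean/std, as in Table 6-4."""
    groups = {}
    for t in cct_list_ms:
        groups.setdefault(classify(t), []).append(t)
    return {dev: (statistics.mean(ts), statistics.pstdev(ts))
            for dev, ts in groups.items()}
```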

Table 6-4

Statistics on Response Time in ms (Based on 40-90 Rounds)

260 reads   Statistic   CCT     qCCT    TtoD
Overall     Mean        0.227   0.227   0.212
            Std.        0.096   0.096   0.096
SSD         Mean        0.279   0.279   0.264
            Std.        0.031   0.031   0.032
DRAM        Mean        0.063   0.063   0.048
            Std.        0.001   0.001   0.001

SMR

This section discusses the main characteristics of SMR [4, 5] and the interaction between these characteristics and particular workloads. The industry generally takes two approaches to SMR:
  • The drive manages all data accesses. Data management is complex, similar to the FTL (flash translation layer) of an SSD, so metadata management, GC (garbage collection), over-provisioning, variable performance, etc., are all handled inside the drive. However, no host-side changes are required, so the drive is used as a normal one. Currently, all major SMR drives available in the market fall into this category.

  • The host manages most data-related accesses via an SMR-specific file system, similar to a flash file system. Data management is complicated but can leverage mature file systems that write sequentially. A few examples are SFS [58], HiSMRfs [59], and Shingledfs [5]. Although mixed drive-host management is also possible, it is rare.

Many particular design issues must be considered for SMR drives, such as data layout management (layout, data placement, defragmentation, GC, pointers to bands), mixed zones (combining a shingled and an unshingled part on the same disk), and SMR algorithms and structures for specific applications. Table 6-5 lists the main workload characteristics expected by SMR, so that applications matching these metrics can work well on SMR drives; a minimal sketch of computing two of these metrics from a block trace follows the table.
Table 6-5

SMR Characteristics vs. Workload Metrics

Sequential write
  SMR expectation: good for large sequential write requests.
  Workload metrics: average write request size and distribution; seek distance (LBA); sequential and near-sequential streams.
  SMR impact: the larger the request size, the better; the smaller the seek distance, the more sequential; the more streams, the more sequential.

Write once, read many
  SMR expectation: good for fewer updates and more reads.
  Workload metrics: read/write ratio; read-on-write (ROW) hit ratio; write update ratio.
  SMR impact: the higher the read/write ratio (ROW ratio), the better; the smaller the write update ratio, the better.

Garbage collection (GC)
  SMR expectation: smaller write amplification and less GC.
  Workload metrics: device utilization, device idle time distribution, queue length; IOPS, throughput; frequented/timed/stacked write update ratio (WUR); write-on-write (WOW) hit distribution and ratio.
  SMR impact: long and frequent idle times leave room for GC; a low write update ratio indicates that less GC is required.

Sequential read to random write
  SMR expectation: less read performance impact due to indirect mapping, such as sequential LBA read requests served from random physical addresses.
  Workload metrics: read-on-write (ROW) hit/size distribution and ratio.
  SMR impact: the higher the small (large) read to small (large) write ratio, the better.

In-place or out-of-place update
  SMR expectation: frequent and recent updates need a random access zone (RAZ), SSD, or large DRAM buffer to hold write data.
  Workload metrics: stacked write update ratio.
  SMR impact: the higher the ratio within a shorter stack, the more necessary an in-place update buffer.
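
To connect these metrics back to trace analysis, here is a minimal sketch, assuming a block trace is available as a list of (op, lba, length) tuples with op in {"R", "W"}, of two of the metrics above: the write update ratio and the read-on-write (ROW) hit ratio. The sector-granularity sets and the tuple layout are assumptions for illustration; a real tool would use interval structures for large traces.

```python
def smr_metrics(trace):
    """Compute write update ratio and ROW hit ratio from a simple block trace.

    trace: iterable of (op, lba, length) with op "R" or "W"; lba/length in sectors.
    """
    written = set()            # sectors written so far
    total_written = updated = 0
    total_read = row_hits = 0
    for op, lba, length in trace:
        for s in range(lba, lba + length):
            if op == "W":
                total_written += 1
                if s in written:
                    updated += 1        # overwrite of previously written data
                written.add(s)
            else:
                total_read += 1
                if s in written:
                    row_hits += 1       # read of data written earlier (ROW hit)
    return {
        "write_update_ratio": updated / total_written if total_written else 0.0,
        "row_hit_ratio": row_hits / total_read if total_read else 0.0,
    }

# Example with a tiny synthetic trace:
# print(smr_metrics([("W", 0, 8), ("W", 4, 8), ("R", 0, 8)]))
```

A low write update ratio and a high ROW hit ratio suggest a workload that fits SMR’s sequential-write, write-once-read-many preference.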