
6. Case Study: Modern Disks


Modern disks implement many different features, such as media-based cache (e.g., using a portion of disk space to log some random write accesses), DRAM protection (e.g., using a small NVM to temporarily store data from the DRAM cache during a power loss so that the write cache can always be enabled), hybrid structure (e.g., migrating hot data to high-speed devices and cold data to low-speed devices so that the overall access time is reduced), etc. A hybrid disk (e.g., SSHD), one type of hybrid structure, has advantages in scenarios where data hotness is significant. Some emerging and future techniques like SMR, HAMR, and BPR favor sequential access in order to reduce garbage collection, lower energy consumption, and/or extend device life. This chapter shows how trace analysis can help to identify these mechanisms via workload property analysis, using two examples: SSHD and SMR drives.

SSHD

In this section, let’s explore the mystery behind SSHD’s performance enhancement in SPC-1C [53] under WCD (write cache disabled): the SSD/DRAM cache and the self-learning algorithm [56, 57, 16]. I collected data from the XGIG bus analyzer and monitored the response with a LeCroy scope, with the workload generated by the SPC-1C tool. Techniques such as pattern recognition, curve fitting, and queueing theory are applied in the analysis.

From Figure 6-1, you can see that under WCD the SSHD’s IOPS is more than double that of traditional HDDs: the SSHD reaches around 570 IOPS, while the traditional HDDs (two models: Savvio from Seagate and Sirius from WD) only reach around 200 IOPS when the response time is no more than 30ms. The task here is to find the reasons for the performance improvement of the hybrid structure via trace analysis. The basic idea is to compare several drives with a certain level of similarity: inject the same workloads into similar drives, isolate what they have in common, and compare the differences. For example, the similar CMR models in Table 6-1 are selected.
Figure 6-1. SSHD performance comparison with traditional HDDs

Table 6-1

Similar Models Chosen for Comparison

                      SSHD              CMR A       CMR B       CMR C
Capacity (GB)         600               900         600         900
RPM                   10.5K             10.5K       10K         10.5K
Bytes per sector      512/520/524/528   512         512         512
Discs                 2                 3           2           3
Average latency (ms)  2.9               2.9         3           2.9
DRAM cache            128MB             64MB        32MB        64MB
NAND                  16GB eMLC         None        None        None
Interface             6Gb/s SAS         6Gb/s SAS   6Gb/s SAS   6Gb/s SAS

You know from the previous chapter that random write accesses dominate the IO requests in SPC-1C, which means the write cache actually plays an important role. However, the write cache is supposed to be disabled under WCD. Is this true for this SSHD? To verify it, you can do a simple test: inject random write requests into the SSHD and calculate the CCT/qCCT/TtoD times. If the write cache is really disabled, all requests will be written to the media directly, which costs roughly 10ms per request. However, from the trace, you can observe many requests with response times of less than 1ms at the beginning. Therefore, the write cache is actually active even with the WCD setting. This benefits from the NAND-backed DRAM cache protection technique, so part of the cached data can be written to NAND right after system power is lost.
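
Here is a minimal sketch (in Python) of this check: count how many of the injected random writes complete far faster than a media-bound write would. The 1ms hit threshold and the plain list-of-response-times input are assumptions, not part of the original test setup.

```python
def cached_write_fraction(cct_ms_list, hit_threshold_ms=1.0):
    """Return the fraction of writes completing fast enough to be cache hits."""
    # A media-bound random write costs roughly 10 ms, so sub-millisecond
    # completions indicate the DRAM write cache is still being used.
    hits = sum(1 for t in cct_ms_list if t < hit_threshold_ms)
    return hits / len(cct_ms_list) if cct_ms_list else 0.0
```

If this fraction is clearly above zero at the beginning of the trace, the write cache is active despite the WCD setting.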

Now let’s analyze two essential problems: the cache size and access isolation.

Cache Size

We begin with the question, “How much DRAM is used as write cache under WCD?” First, let’s make sure that the test is repeatable (i.e., that the result is consistent). To verify this, perform the following procedure.
  1. Connect the SSHD to the XGIG bus analyzer and power the SSHD off and on.

  2. Send 100 random 8K write requests to the SSHD using IOMeter or another tool, and repeat the same requests 10 times.

  3. Repeat Steps 1-2 four times with the same requests.

  4. Compare the XGIG traces and find the access pattern via a trace analyzer tool.

  5. Repeat Steps 1-4 with the request number changed to 200 and 400.

  6. Do the same test on a different SSHD with the same IO pattern.
In Step 4, a similar access pattern (LBA vs. CCT) should be observed (a minimal comparison sketch follows these notes). Note that you are checking the DRAM write cache in this case, so only random write requests are used. For a full cache check, you may also try random read, mixed read/write, and mixed random/sequential patterns. If a similar pattern is observed, you may conclude that the result is consistent and can be used to identify some internal information. Otherwise, you should find out why. One reason is that the SSD/DRAM cache is not cleaned before a new test; in that case, you need some cache flush commands or disk initialization commands to force it empty. Also note that:
  • R/W DRAM cache may share the same space.

  • The SSD mapping table may share the same space with R/W DRAM cache (a good case: SSD mapping table uses a dedicated DRAM space).

  • The SSD reboot self-learning procedure may take DRAM space.
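
As a rough aid for Step 4, here is a minimal sketch (Python with pandas assumed) for comparing the repeated runs. The CSV layout and the column names (lba, cct_ms) are hypothetical placeholders for whatever your trace-analyzer export provides, and the 15% tolerance is an assumption.

```python
import pandas as pd

def load_cct(csv_path):
    """Load one run's exported trace and return its CCT sequence in ms."""
    df = pd.read_csv(csv_path)          # assumed columns: lba, cct_ms
    return df["cct_ms"].to_numpy()

def runs_are_consistent(paths, rel_tol=0.15):
    """Check whether repeated runs show a similar response-time pattern."""
    stats = []
    for p in paths:
        cct = load_cct(p)
        stats.append((p, cct.mean(), cct.std()))
        print(f"{p}: mean={cct.mean():.3f} ms, std={cct.std():.3f} ms")
    base_mean = stats[0][1]
    return all(abs(m - base_mean) <= rel_tol * base_mean for _, m, _ in stats)

# Example: four repeated runs of the 100 x 8K random-write test
# runs_are_consistent([f"run{i}.csv" for i in range(1, 5)])
```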

Second, implement the following procedure to make sure each test starts with a clean cache:
  1. Power the SSHD off and on (to make sure the DRAM write cache is cleared).

  2. Send 1000 random 8K write requests to the SSHD with queue depth = 1 using IOMeter.

  3. Repeat Steps 1-2 ten times, each time with a different request size, such as 16K, 32K, 64K, 128K, ..., 2048K.
Once you capture the traces, perform the following post-processing:
  1. Count the number of DRAM write hits in the first portion of the total accesses for each run by isolating DRAM accesses from the others (DRAM CCT/qCCT is generally much smaller than that of other accesses).

  2. Choose the maximum count from each run.

  3. Calculate the hit numbers and the corresponding actual cache size.

  4. Find the turning point, which provides a hint of the cluster size.

  5. Refine the turning point by narrowing the region. For example, if the turning point is within [256K, 512K], then some more points, such as 300K, 400K, and 500K, may be used.
Note that this model of SSHD has a read-cache-only SSD, so DRAM accesses will not be mixed with SSD write accesses, which simplifies the analysis in Figures 6-2 through 6-5. Figure 6-2 shows the traces from IOMeter random write tests (request sizes from 1K to 1M). Assume that the write cache is empty.1 Then the first portion of each run could be the DRAM write cache hits.
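
Here is a minimal sketch of the post-processing steps above, assuming each run has already been parsed into a list of response times (ms) in arrival order. The 1ms hit threshold, the 70% drop used to flag the turning point, and the dictionary layout are assumptions, not values from the original procedure.

```python
HIT_THRESHOLD_MS = 1.0   # completions well below media latency are treated as DRAM hits

def count_leading_hits(response_times_ms):
    """Step 1: count consecutive cache hits at the beginning of a run."""
    n = 0
    for t in response_times_ms:
        if t >= HIT_THRESHOLD_MS:
            break
        n += 1
    return n

def cache_table_and_turning_point(runs_by_size_kb):
    """Steps 2-4: runs_by_size_kb maps request size (KB) -> list of runs."""
    table = {}
    for size_kb, runs in sorted(runs_by_size_kb.items()):
        hits = max(count_leading_hits(r) for r in runs)        # Step 2
        table[size_kb] = (hits, hits * size_kb / 1024.0)       # Step 3: cache size in MB
    sizes = sorted(table)
    turning_point = None
    for prev, cur in zip(sizes, sizes[1:]):
        if table[cur][0] < 0.7 * table[prev][0]:               # Step 4: sharp drop in hit count
            turning_point = (prev, cur)                        # refine within this interval (Step 5)
            break
    return table, turning_point
```

The resulting table mirrors Tables 6-2 and 6-3: the counted hit number per request size and the implied cache size, with the turning point marking where the hit count collapses.
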
Figure 6-2. IOMeter traces for SSHD

Zoom in to find the hits. Figures 6-3, 6-4, and 6-5 give three examples where the request sizes are 1KB, 512KB, and 1MB, respectively. In Figures 6-3 and 6-4, you can observe obvious write cache hits, and the total hit number for 1KB is much larger than that for 512KB due to the limited cache size. However, when the size is increased to 1MB, no obvious write cache hit is observed, which means a threshold between 512K and 1M is set as the turning point for different request sizes. This also indicates that large requests go directly to the media. With the same steps, you can obtain the required values for WCE and WCD, as shown in Tables 6-2 and 6-3, respectively.
Figure 6-3. 1K request trace details

Figure 6-4. 512K request trace details

Figure 6-5. 1024K request trace details

Table 6-2

Comparison Under WCE

SSHD                                      CMR A
Size     Counted number   Cache size      Size     Counted number   Cache size
1K       98               0.1M            1K       98               0.1M
4K       98               0.4M            4K       98               0.4M
16K      97               1.5M            16K      99               1.5M
64K      102              6.4M            64K      102              6.4M
128K     103              12.9M           128K     104              12.9M
256K     111              27.8M           256K     110              27.8M
512K     115              57.5M           512K     66               33M
520K     -                -               880K     35               30M
900K     -                -               900K     42               36.9M
1024K    N.A.             -               1000K    36               36M

 
Table 6-3

Two Cases Under WCD of SSHD

SSHD (WCD)   test1                           test2
Size         Counted number   Cache size     Counted number   Cache size
1K           98               0.1M           101              101K
4K           100              0.4M           100              400K
16K          99               1.5M           100              1600K
64K          101              6.3M           101              6464K
128K         54               6.75M          54               6.75M
256K         26               6.5M           26               6.5M
512K         12               6.0M           13               6.5M
520K         -                -              12               6.1M

From the turning point, you may also guess the cache cluster/segment size. For example, the SSHD’s write cluster size is around 64K under WCD, while CMR A’s is around 256KB. The SSHD uses up to about 60MB of DRAM as write cache under WCE, while only around 8MB is used under WCD, with some 100 segments.

Access Isolation

You saw in the previous chapter that SPC-1C has a large portion of local accesses. This property brings the possibility that some data can be cached in DRAM or SSD and accessed quickly later. The second question is then “how many accesses are actually directed to DRAM or SSD?” This is generally a difficult task. However, as the access times of DRAM, SSD, and HDD are significantly different, you can roughly isolate which commands are served from each place. The basic idea is to observe the behaviors of the different accesses, apply data classification and pattern recognition methods to find the access pattern, and run repeated random read tests to find the turning points. Although the procedure is similar to the previous case, here you need to change the number of requests sent to the SSHD:
  1. Send 100/200/256/257/etc. 8K read requests to the SSHD, repeat each number 20-100 times, and refine the number of commands to be sent according to the access pattern.

  2. Suppose the turning point is X. Send X random read commands with sizes 16K, 32K, ..., 1024K to the SSHD and find the cluster size according to the turning point.
To verify whether the repeat number is high enough, check the steady state of the response time. Figure 6-6 provides an example where 100 rounds are run. You can see that from the third round onward, the mean and standard deviation of the response times are almost constant. Thus, 10 repeats should be enough in this case.
Figure 6-6. Steady state of response time
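
As a rough aid for this steady-state check, here is a minimal sketch, assuming response times (ms) are grouped into one list per repeated round; the 5% tolerance is an assumption, not a value from the text.

```python
import statistics

def rounds_to_steady_state(rounds, tol=0.05):
    """Return the first round index where mean/std of response time settle.

    rounds: list of lists of response times (ms), one inner list per round.
    """
    prev_mean = prev_std = None
    for i, times in enumerate(rounds, start=1):
        mean = statistics.mean(times)
        std = statistics.pstdev(times)
        if prev_mean is not None:
            if (abs(mean - prev_mean) <= tol * prev_mean and
                    abs(std - prev_std) <= tol * max(prev_std, 1e-9)):
                return i    # statistics have stopped changing noticeably
        prev_mean, prev_std = mean, std
    return None             # never settled within the given rounds
```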

Figure 6-7 shows a case where 100 random read requests of 8KB size were sent to the SSHD 20 times. In the first round, all reads went to the media. After several rounds, the read data becomes hot and is eventually all cached in DRAM. Slowly increase the number of requests to check how many requests the DRAM read cache can hold.
Figure 6-7. 100 8K random reads, repeated 20 times

Figure 6-8 illustrates the results when 250 requests are sent repeatedly. You can see that the DRAM cache can fully hold at least 250 segments. However, when you slightly increase the number to 257, destage starts. When it is further increased to 260, destage from DRAM to SSD clearly happens at a relatively high rate, as illustrated in Figure 6-9. The destage has certain adaptive steps: as the hit number (access frequency) of the data increases, destage becomes more frequent.
Figure 6-8. 250 8K random reads, repeated 100 times

Figure 6-9. 260 8K random reads, repeated 100 times

Thus, you may guess that 256 could be the maximum read segment number, as no destage happens as long as this number is not exceeded. Note that you can explore the destage pattern by varying the time intervals and request sizes.

I leave the issue of identifying which device serves each request as an exercise here. Take a look at Figure 6-10, where 260 8KB read requests were repeatedly sent to the SSHD 100 times. As this number exceeds the DRAM capacity, some requests go to the SSD. The response times of the 1st, 3rd, 14th, 20th, and 50th runs are shown. You can see that in the very first run, all requests went to the disk media. Starting from the second round, some went to DRAM and some to the media. By the 20th round, accesses to DRAM, SSD, and media all existed. However, by around the 50th round, most requests went to DRAM and SSD. In fact, you can see a clear gap in the response times of these read requests. Basically, you may say that those below 0.1ms are DRAM accesses, and those above 0.2ms are mostly from the SSD. Now you can get estimated statistical values for the response times of the SSD and DRAM, as shown in Table 6-4; a minimal classification sketch follows Figure 6-10.
Figure 6-10. Read response time pattern over repeated rounds
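
Here is a minimal sketch of this rough classification. The 0.1ms and 0.2ms boundaries come from the text; the cut-off separating the SSD from the rotating media is an assumption.

```python
import statistics

DRAM_MAX_MS = 0.1     # below this: treat as a DRAM hit
SSD_MIN_MS = 0.2      # above this (and below MEDIA_MIN_MS): treat as an SSD hit
MEDIA_MIN_MS = 2.0    # assumed boundary above which the request went to the media

def classify(cct_ms):
    if cct_ms < DRAM_MAX_MS:
        return "DRAM"
    if SSD_MIN_MS <= cct_ms < MEDIA_MIN_MS:
        return "SSD"
    if cct_ms >= MEDIA_MIN_MS:
        return "media"
    return "unknown"   # falls in the gap between the DRAM and SSD thresholds

def per_device_stats(cct_list_ms):
    """Group completion times by device class and report mean/std, as in Table 6-4."""
    groups = {}
    for t in cct_list_ms:
        groups.setdefault(classify(t), []).append(t)
    return {dev: (statistics.mean(ts), statistics.pstdev(ts))
            for dev, ts in groups.items()}
```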

Table 6-4

Statistics on Response Time in ms (Based on 40-90 Rounds)

260 reads   Statistic   CCT     qCCT    TtoD
Overall     Mean        0.227   0.227   0.212
            Std.        0.096   0.096   0.096
SSD         Mean        0.279   0.279   0.264
            Std.        0.031   0.031   0.032
DRAM        Mean        0.063   0.063   0.048
            Std.        0.001   0.001   0.001

SMR

This section discusses the main characteristics of SMR [4, 5] and the interaction between these characteristics and particular workloads. The industry generally takes two approaches to SMR:
  • The drive manages all data accesses. Data management is complex, similar to the FTL (flash translation layer) of an SSD, so metadata management, GC (garbage collection), over-provisioning, variable performance, etc., are all handled inside the drive. However, no host-side changes are required, so the drive is used as a normal one. Currently, all major SMR drives available in the market fall into this category.

  • The host manages most data-related accesses via an SMR-specific file system, similar to a flash file system. Data management is complicated but can leverage mature file systems that write sequentially. A few examples are SFS [58], HiSMRfs [59], and Shingledfs [5]. Although mixed drive-host management is also possible, it is rare.

Many particular design issues must be considered for SMR drives, such as data layout management (layout, data placement, defragmentation, GC, pointers to bands), mixed zones (combining a shingled and an unshingled part on the same disk), and SMR algorithms and structures for specific applications. Table 6-5 lists the main workload characteristics expected by SMR, so that applications matching these metrics can work well on SMR drives; a minimal sketch of computing two of these metrics from a block trace follows the table.
Table 6-5

SMR Characteristics vs. Workload Metrics

Sequential write
  SMR expectation: good for large sequential write requests.
  Workload metrics: average write request size and distribution; seek distance (LBA); sequential and near-sequential streams.
  SMR impact: the larger the request size, the better; the smaller the seek distance, the more sequential; the more streams, the more sequential.

Write once, read many
  SMR expectation: good for fewer updates and more reads.
  Workload metrics: read/write ratio; read-on-write (ROW) hit ratio; write update ratio.
  SMR impact: the higher the read/write ratio (ROW ratio), the better; the smaller the write update ratio, the better.

Garbage collection (GC)
  SMR expectation: smaller write amplification and less GC.
  Workload metrics: device utilization, device idle time distribution, queue length; IOPS, throughput; frequented/timed/stacked write update ratio (WUR); write-on-write (WOW) hit distribution and ratio.
  SMR impact: long and frequent idle times leave room for GC; a low write update ratio indicates that less GC is required.

Sequential read to random write
  SMR expectation: less read performance impact due to indirect mapping, such as sequential LBA read requests served from random physical addresses.
  Workload metrics: read-on-write (ROW) hit/size distribution and ratio.
  SMR impact: the higher the small (large) read to small (large) write ratio, the better.

In-place or out-of-place update
  SMR expectation: frequent and recent updates need a random access zone (RAZ), SSD, or large DRAM buffer to hold write data.
  Workload metrics: stacked write update ratio.
  SMR impact: the higher the ratio within a shorter stack, the more necessary an in-place update buffer.
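
To connect these metrics back to trace analysis, here is a minimal sketch, assuming a block trace is available as a list of (op, lba, length) tuples with op in {"R", "W"}, of two of the metrics above: the write update ratio and the read-on-write (ROW) hit ratio. The sector-granularity sets and the tuple layout are assumptions for illustration; a real tool would use interval structures for large traces.

```python
def smr_metrics(trace):
    """Compute write update ratio and ROW hit ratio from a simple block trace.

    trace: iterable of (op, lba, length) with op "R" or "W"; lba/length in sectors.
    """
    written = set()            # sectors written so far
    total_written = updated = 0
    total_read = row_hits = 0
    for op, lba, length in trace:
        for s in range(lba, lba + length):
            if op == "W":
                total_written += 1
                if s in written:
                    updated += 1        # overwrite of previously written data
                written.add(s)
            else:
                total_read += 1
                if s in written:
                    row_hits += 1       # read of data written earlier (ROW hit)
    return {
        "write_update_ratio": updated / total_written if total_written else 0.0,
        "row_hit_ratio": row_hits / total_read if total_read else 0.0,
    }

# Example with a tiny synthetic trace:
# print(smr_metrics([("W", 0, 8), ("W", 4, 8), ("R", 0, 8)]))
```

A low write update ratio and a high ROW hit ratio suggest a workload that fits SMR’s sequential-write, write-once-read-many preference.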