
7. Case Study: RAID


RAID is one of the most widely applied data-protection strategies in the world [23, 60, 61, 62, 63]. Compared with single-disk access, it has unique features such as synchronization, recovery, etc., which lead to some unique IO patterns. This chapter analyzes two RAID 5 examples from two application scenarios; large differences are observed between the two traces. The chapter also analyzes whether the workloads are suitable for SMR drives, and some suggestions are provided to improve system performance.

The concept of RAID was introduced in 1987 to harness the potential of commodity hard drives, and Patterson et al. [64] officially established the RAID taxonomy in 1988. RAID overcomes the capacity limitations of commodity disks by exposing an array of such low-capacity disks as a virtual single large expensive disk (SLED).

RAID technology usually distributes data across a number of disks via data striping. A stripe represents the smallest unit of protection in an array; thus, any lost data within a stripe can be recovered using only the surviving data within that stripe. In the early days, since clients were connected to the RAID via a serial access channel, parallel access by multiple clients was not explicitly supported. However, with many advanced queuing schedulers now available, parallelism is widely applied in RAID systems in order to fully exploit the advantage of multiple disks.

There are some common performance issues within RAID systems [20, 62, 63], such as the small-write problem, the synchronization problem, performance loss in degraded mode (recovery and reconstruction), and more.

The small-write problem exists in many critical applications, such as online transaction processing (OLTP) systems, which usually contain many read-modify-write (RMW) accesses. This causes two issues for a RAID system. First, a small write in a striped array requires reading both the old data and the old parity block and computing the new parity before writing both the new data and the new parity, which is four times as many accesses as a single disk needs. Second, these small accesses only alter a few blocks within a specific stripe, yet the parity disk for the entire stripe is unavailable during the update. This dramatically degrades the performance of the array by reducing the possible parallelism.
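
To make the four-access pattern concrete, here is a minimal sketch of the RAID 5 small-write path, using the standard parity relation new_parity = old_parity XOR old_data XOR new_data. The in-memory "disks," block size, and helper names are illustrative assumptions, not part of any particular RAID implementation.

```python
# Minimal sketch of the RAID 5 small-write (read-modify-write) path.
# Block size and the in-memory "disks" are illustrative assumptions.

BLOCK_SIZE = 4096  # bytes per block (assumption)

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data_disk: bytearray, parity_disk: bytearray,
                offset: int, new_data: bytes) -> int:
    """Update one data block and its parity; return the number of disk I/Os."""
    io_count = 0

    # 1) Read the old data block.
    old_data = bytes(data_disk[offset:offset + BLOCK_SIZE]); io_count += 1
    # 2) Read the old parity block.
    old_parity = bytes(parity_disk[offset:offset + BLOCK_SIZE]); io_count += 1

    # Compute the new parity: P_new = P_old XOR D_old XOR D_new.
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)

    # 3) Write the new data block.
    data_disk[offset:offset + BLOCK_SIZE] = new_data; io_count += 1
    # 4) Write the new parity block.
    parity_disk[offset:offset + BLOCK_SIZE] = new_parity; io_count += 1

    return io_count  # 4 I/Os, versus 1 for a single non-redundant disk

# Example: one 4KB logical write turns into four physical accesses.
data = bytearray(BLOCK_SIZE * 8)
parity = bytearray(BLOCK_SIZE * 8)
print(small_write(data, parity, 0, bytes([0xAB]) * BLOCK_SIZE))  # -> 4
```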

The synchronization problem is due to data integrity requirements: the system returns a completion signal only when all drives in a stripe have completed their accesses. Since some disks may finish earlier than others, the faster disks have to wait for the slow ones. This requirement may be relaxed under some conditions, such as non-critical applications, protected DRAM, etc. During recovery, the foreground user requests may be heavily impacted by the background recovery accesses [65].
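
This stripe-level synchronization can be viewed as a fork-join delay: a stripe completes only when its slowest member disk finishes. The toy simulation below, with an assumed per-disk service-time model, shows how the stripe response time (the maximum over member disks) exceeds the average single-disk time.

```python
import random

# Fork-join view of a striped access: the stripe completes when the
# slowest member disk completes. Service times here are synthetic.
random.seed(1)

def disk_service_time_ms() -> float:
    # Assumed model: ~4.2ms average rotational latency plus seek/transfer jitter.
    return 4.2 + random.uniform(0.0, 8.0)

def stripe_response_time_ms(num_disks: int) -> float:
    times = [disk_service_time_ms() for _ in range(num_disks)]
    return max(times)  # all disks in the stripe must finish

avg_single = sum(disk_service_time_ms() for _ in range(10000)) / 10000
avg_stripe = sum(stripe_response_time_ms(10) for _ in range(10000)) / 10000
print(f"average single-disk time: {avg_single:.1f} ms")
print(f"average 10-disk stripe time (max of members): {avg_stripe:.1f} ms")
```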

Similar problems also apply to disk arrays using erasure codes (EC). In some cases, the problems may be even more critical due to the higher complexity of EC compared with traditional RAID.

Workload Analysis

You will study two RAID 5 examples from two different vendors under video surveillance applications. The system settings will be given first, followed by the analysis of two different traces: read-dominant and write-dominant cases.

System Settings

In the first example, there are 10 7200RPM HDDs of 4TB each. 24 write streams and 6 read streams are imposed on this system. The second example has 36 similar HDDs with 90 video channels. The trace lengths are 620 and 110 seconds, respectively. Some basic metrics are listed in Tables 7-1 and 7-2.
Table 7-1. RAID Trace 1: Read Dominated

Metric                   Combined     Read                Write
Number of commands       7821         5493 (70.2%)        2328 (29.8%)
Number of blocks         5284520      3231312 (61.1%)     2053208 (38.9%)
Average size (blocks)    675.7        588.3               882

r/s      w/s      rsec/s     wsec/s     rkB/s      wkB/s      IOPS      TP (MBps)
8.86     3.75     5211.8     3311.6     2605.9     1655.8     12.61     4.23

Table 7-2. RAID Trace 2: Write Dominated

Metric                   Combined     Read        Write
Number of commands       21449        8535        12914
Total blocks             2528514      357824      2170690
Average size (blocks)    210.012      41.924      168.088
Average IOPS             193.152      76.859      116.293
Average TP (MBps)        11.118       1.573       9.545

Read-Dominated Trace

The LBA distribution of requests in this trace is nearly sequential, as shown in Figure 7-1. For reads, there are two regions: one coincides with the current write region, and the other is close to the previous write region (i.e., playback). The read requests are mostly 512 or 1024 blocks in size. However, the share of 1024-block requests is lower for reads than for writes, as displayed in Figure 7-2.
Figure 7-1. LBA distribution of RAID trace 1

Figure 7-2. Size distribution of RAID trace 1

In general, this trace has a large portion of idle time, accumulating to 83.4% of the total time. The total idle time is almost evenly distributed over time, but the large idle intervals are not, as shown in Figure 7-3. The intervals longer than 200ms and 500ms account for only 8% and 1.7% of all idle intervals, respectively, but occupy 71.6% and 34% of the total idle time, respectively. In fact, 65% (94%) of the idle intervals are shorter than 10ms (1s), while only 2% (70%) of the idle time comes from intervals shorter than 10ms (1s), as illustrated in Figure 7-4. So you can conclude that the total idle time is long enough for small-IO-based background activities, but the individual long idle intervals may not be sufficient, which means GC accesses should be completed in small steps.
Figure 7-3. Idle time distribution of RAID trace 1

Figure 7-4. Idle time CDF of RAID trace 1
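
The idle-interval statistics behind Figures 7-3 and 7-4 can be reproduced by a single pass over the trace, treating each gap between the completion of one command and the arrival of the next as an idle interval. The sketch below assumes a simplified trace format of (start, end) timestamps per command; the threshold values mirror the 200ms/500ms cut-offs discussed above.

```python
# Sketch of idle-interval extraction from a block trace.
# Assumed trace format: one command per row with start and end times in seconds.

def idle_intervals(cmds):
    """cmds: list of (start_sec, end_sec) sorted by start time."""
    gaps = []
    busy_until = cmds[0][1]
    for start, end in cmds[1:]:
        if start > busy_until:
            gaps.append(start - busy_until)  # device was idle for this long
        busy_until = max(busy_until, end)
    return gaps

def idle_summary(gaps, thresholds=(0.2, 0.5)):
    total = sum(gaps)
    summary = {"total_idle_s": total, "count": len(gaps)}
    for t in thresholds:
        big = [g for g in gaps if g > t]
        summary[f">{int(t * 1000)}ms"] = {
            "freq_share": len(big) / len(gaps),   # e.g. 8% of intervals > 200ms
            "time_share": sum(big) / total,       # e.g. 71.6% of total idle time
        }
    return summary

# Usage with a toy trace of (start, end) pairs in seconds.
trace = [(0.00, 0.01), (0.02, 0.03), (0.50, 0.51), (1.60, 1.62)]
print(idle_summary(idle_intervals(trace)))
```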

Besides the large idle time, note that some requests have abnormally large response times (> 200ms). Even in the worst case, the access time of a 1024-block request should not exceed 60ms. Thus, the waiting time is too long for the two cases listed in Table 7-3: 1) CMD 530 is LBA-contiguous with CMD 529, yet this write access costs over 430ms; 2) CMD 1015 is close to CMD 1014, yet it costs about 390ms. This may be caused by 1) background disk activities such as log writes, metadata updates, zone switches, etc.; or 2) RAID synchronization events. A possible solution is to distribute tasks evenly and actively provide idle time for background tasks.
Table 7-3. A Segment of RAID Trace 1

Start (sec)    End (sec)     Start ID    End ID    Cmd    ICT (ms)     LBA          Length
27.51865       27.53302      529         529       W      0.109105     1.21e+08     1024
27.5331        27.96851      530         530       W      0.078425     1.21e+08     512
27.9686        27.98473      531         531       R      0.086485     80135168     1024
27.98483       27.987        532         532       R      0.10257      80351232     512
...
52.73206       52.7429       1014        1014      R      0.466475     80690688     512
53.46918       53.85766      1015        1015      R      726.2745     80875520     512
53.85778       53.86006      1016        1016      R      0.112894     80876032     512
53.86014       53.87545      1017        1017      R      0.083455     80881664     1024
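
The long-latency commands in Table 7-3 can be flagged automatically by comparing each response time against a worst-case service bound (60ms for a 1024-block access, per the discussion above). A minimal sketch, using a few rows from Table 7-3 and a simplified row layout as assumptions:

```python
# Flag commands whose response time exceeds a worst-case service bound.
# Rows follow a simplified Table 7-3 layout:
# (start_s, end_s, cmd_id, rw, ict_ms, lba, length_blocks).

WORST_CASE_MS = 60.0  # assumed upper bound for a 1024-block access

rows = [
    (27.51865, 27.53302, 529, "W", 0.109105, 121_000_000, 1024),
    (27.53310, 27.96851, 530, "W", 0.078425, 121_000_000, 512),
    (53.46918, 53.85766, 1015, "R", 726.2745, 80_875_520, 512),
]

for start, end, cmd_id, rw, ict_ms, lba, length in rows:
    resp_ms = (end - start) * 1000.0
    if resp_ms > WORST_CASE_MS:
        print(f"CMD {cmd_id} ({rw}, {length} blocks, LBA {lba}): "
              f"response {resp_ms:.0f} ms, inter-command time {ict_ms:.1f} ms")
# -> CMD 530 and CMD 1015 are flagged (~435 ms and ~388 ms respectively)
```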

For the frequented write update shown in Figure 7-5, you can see that 94.2% of the written blocks (possibly counted with repetition) are written only once, 5.8% of the blocks are written at least twice, and fewer than 0.1% of the blocks are written three times. This means a very low rewrite ratio. Thus you need to identify whether large or small requests are rewritten most often. The fact that a decreasing percentage of blocks is written multiple times means only a tiny portion of the blocks is hot.
Figure 7-5. Frequented update of RAID trace 1

For the timed write update shown in Figure 7-6, the written blocks account for 35% of the total accessed blocks (read and write), while the updated blocks (written at least twice) are only 1% (1/35 = 2.9% of the written blocks are rewritten). The write commands are 30% of the total commands, and the update commands are 1.5%. Note that the timed write update ratio is closely related to the frequented write update ratio; in other words, sum(hits × (update frequency − 1)) / total written blocks = updated blocks / total written blocks.
Figure 7-6. Timed update of RAID trace 1
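
The frequented (frequency-based) and timed write-update statistics in Figures 7-5 and 7-6 both come from counting how many times each written block appears in the trace. The sketch below shows one way to compute them and reflects the relation stated above; the per-block counting and the (lba, length) record format are simplifying assumptions.

```python
from collections import Counter

# Count per-block write frequency from (lba, length) write requests.
def block_write_counts(writes):
    counts = Counter()
    for lba, length in writes:
        for blk in range(lba, lba + length):
            counts[blk] += 1
    return counts

def update_ratios(writes):
    counts = block_write_counts(writes)
    total_written_blocks = sum(counts.values())           # counted with repetition
    distinct_blocks = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    updated_blocks = sum(c - 1 for c in counts.values())  # rewrites only
    return {
        "written_once_pct": 100.0 * once / distinct_blocks,  # ~94.2% in trace 1
        "frequented_update_pct": 100.0 * (distinct_blocks - once) / distinct_blocks,
        # timed update ratio = sum(hits * (frequency - 1)) / total written blocks
        "timed_update_pct": 100.0 * updated_blocks / total_written_blocks,
    }

# Toy usage: two 4-block writes, one of which rewrites a previous extent.
print(update_ratios([(1000, 4), (2000, 4), (1000, 4)]))
```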

Further considering the write stack distance in Figure 7-7, you can see that the hit ratio is low, so it is not necessary to have an inline write cache holding the write data for a long time. Based on the write IOPS, reaching a stack distance of 100 takes roughly 26.6 seconds. In this period, only 10% full write hits and 20% partial write hits are observed within the overall 5% hit ratio. The updated size is around 43MB on average. Thus it is not worthwhile to compensate for such a small hit ratio. Note that the write hits are distributed over the whole write range. A full hit occurs only for 512-block requests, while a partial hit occurs for 1024-block requests in this trace. See Figure 7-8 for details.
Figure 7-7. Stack update of RAID trace 1

Figure 7-8. Write hit distribution of RAID trace 1
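
The write stack distance in Figure 7-7 measures how much other write traffic arrives between two writes to the same data; dividing a target distance by the write rate (e.g., 100 / 3.75 wr/s ≈ 26.6 seconds here) gives the time a cache would have to retain the data. The sketch below computes a simplified block-level stack distance and assumes (lba, length) write records.

```python
# Simplified write stack distance: for each rewritten block, count how many
# distinct blocks were written since its previous write.

def write_stack_distances(writes):
    """writes: iterable of (lba, length); returns a list of stack distances."""
    stack = []       # most-recently-written blocks, most recent last
    distances = []
    for lba, length in writes:
        for blk in range(lba, lba + length):
            if blk in stack:
                pos = stack.index(blk)
                distances.append(len(stack) - 1 - pos)  # blocks written in between
                stack.pop(pos)
            stack.append(blk)
    return distances

# Usage: blocks 1000-1003 are rewritten after 4 other blocks were written.
dists = write_stack_distances([(1000, 4), (2000, 4), (1000, 4)])
print(dists)                      # -> [7, 7, 7, 7]
write_iops = 3.75                 # from Table 7-1
print(f"distance 100 ~= {100 / write_iops:.1f} s of cache residency")
```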

When you take SMR drives into consideration, as discussed in Chapter 6, the main characteristics are summarized in Table 7-4. You may read this table together with the SMR properties introduced in Chapter 6. Although this is a read-dominated trace, it has no WORM property.
Table 7-4. Main Characteristics of Trace 1 for SMR

SMR characteristic               Observation
Sequential write                 Large-size write requests (>= 512 blocks): > 99.9%
                                 Mode ratio: 50% for both read and write (Q=1)
                                 Sequential cmd ratio (M>=2 & S>=1024): write 85%, read 90% (Q>=50)
Write-once-read-many             R/W ratio: commands 70:30; blocks 61:39
                                 Stacked ROW ratio: < 1%
                                 Updated blocks are only 2.9% of the written blocks
Garbage collection (GC)          Frequent but small idle intervals; short queue length
                                 5.8% frequented write update ratio; the updated blocks (written at
                                 least twice) are only 1% of total accessed blocks and 2.9% of
                                 written blocks, so a very small write update ratio and a write
                                 amplification of 103% (considering the short trace duration)
Sequential read to random write  ROW ratio is only 1.2, so the written data is rarely read back
                                 immediately
In-place or out-of-place update  Very small update ratio; not necessary to apply a large SSD/DRAM/AZR
                                 cache for performance improvement (write updates in cache)

Write-Dominated Trace

This trace, from another video surveillance application, differs from the previous one in many aspects, such as the read/write ratio, LBA distribution, size distribution, write update ratio, etc. Therefore, for different vendors under different scenarios, the actual workloads may differ significantly, even with the same storage structure.

Figure 7-9 shows the LBA distribution, where a single main region spanning 30GB near the start of the LBA space serves both reads and writes (the trace was collected when the RAID was nearly empty). If you further consider the LBA vs. time view, you can see that the write requests are more sequential than the reads. Figure 7-10 illustrates that writes and reads have a similar size distribution, dominated by 8-block requests, with a similar shape over the 8-128 block range. Also, the size distribution range is much wider than in the previous trace.
Figure 7-9. LBA distribution of RAID trace 2

Figure 7-10. Size distribution of RAID trace 2

As this trace is write dominated, let's focus more on the write update. Figure 7-11 shows the stack distance for write requests. It confirms the timed write update, with small overlap sizes (possibly due to metadata blocks attached to the data blocks). From Figure 7-12, a stack distance of 250 corresponds to roughly 2.1 seconds based on the IOPS. In this period, there are nearly 60% full write hits and 60% partial write hits within the overall 52% write command hit ratio. This means that some portions keep being updated. Therefore, the disk or system may require a random access zone or NVM for this small portion of updated data.
Figure 7-11. Write stack distance of RAID trace 2

Figure 7-12. Stack update of RAID trace 2

In sum, tens of mixed streams lead to a not-very-sequential IO pattern, which indicates that a proper stream-detection algorithm with a long queue is required. The special metadata and parity structure leads to a relatively high LBA update size and a large update command ratio, which implies that a large DRAM/NVM/RAZ cache may be necessary to absorb the frequent updates. By contrast, as concluded above, the impact of a write cache is very limited in the previous read-dominated trace. Also note the frequent but small idle intervals and the few effective ones, which indicates that the GC policy may need to be adjusted to fit this situation.
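
A stream-detection algorithm of the kind suggested here typically keeps a table of recently seen streams and appends an incoming request to the stream whose tail it (nearly) continues; with tens of interleaved channels, a long detection queue is needed before per-stream sequentiality becomes visible. The sketch below is one simple LRU-based variant; the table size and the allowed gap are tunable assumptions.

```python
# Minimal sequential-stream detector: each incoming write is matched against
# the tails of recently seen streams; unmatched writes open a new stream.

MAX_STREAMS = 64      # detection queue length (assumption; tens of channels here)
MAX_GAP_BLOCKS = 256  # allowed gap to still count as the same stream (assumption)

class StreamDetector:
    def __init__(self):
        self.streams = []  # list of [next_expected_lba, request_count]

    def offer(self, lba, length):
        for s in self.streams:
            if 0 <= lba - s[0] <= MAX_GAP_BLOCKS:
                s[0] = lba + length     # extend the matched stream
                s[1] += 1
                self.streams.remove(s)
                self.streams.append(s)  # keep the most recently used at the back
                return True             # sequential w.r.t. an existing stream
        self.streams.append([lba + length, 1])
        if len(self.streams) > MAX_STREAMS:
            self.streams.pop(0)         # evict the least recently used stream
        return False

# Usage: two interleaved sequential streams are both recognized.
det = StreamDetector()
reqs = [(0, 8), (100000, 8), (8, 8), (100008, 8), (16, 8), (100016, 8)]
print([det.offer(lba, ln) for lba, ln in reqs])
# -> [False, False, True, True, True, True]
```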

The impact on SMR drives is summarized in Table 7-5. Essentially, the impact on normal write accesses is not trivial due to the relatively large update ratio. The high updated command ratio may cause relatively heavy defragmentation. The metadata management scheme of the surveillance system and/or the SMR drives may require changes. The drive may not have enough idle time for background GC, subject to the GC policy, as the useful effective idle intervals are marginal.
Table 7-5. Main Characteristics of Trace 2 for SMR

SMR characteristic               Observation
Sequential write                 Large-size write requests (>= 128 blocks): 35%
                                 Mode ratio: 18% (27%) for write when Q=1 (Q=128)
                                 Sequential cmd ratio (M>=2): write 35% at Q=1 and 60% at Q=256
Write-once-read-many             R/W ratio: commands 1:1.5; blocks 1:6.1
                                 High stacked ROW ratio
                                 Write blocks account for 85.9% of total accessed blocks
Garbage collection (GC)          Updated blocks (written at least twice) are 13.4% of the written
                                 blocks, so a relatively high write update ratio and a write
                                 amplification of 115.5% (considering the short trace duration)
                                 Updated command ratio > 50% with small overlaps, possibly due to
                                 the attached metadata
                                 Frequent but small host-side idle intervals, which makes
                                 background GC difficult
In-place or out-of-place update  Relatively high update ratio, so it is necessary to apply a large
                                 SSD/DRAM/AZR cache for performance improvement (write updates in
                                 cache)

This trace is much busier than the previous one. The total effective idle time (idle intervals > 0.1s) is 14.4 seconds, out of a total idle time of 98.6 seconds. The total effective idle frequency (idle intervals > 0.1s) is only 33, while the total idle frequency is 6244. Now the question is whether the (effective) idle time is enough for background activities. Due to the relatively consistent workload of video surveillance and its data/metadata structure, the garbage ratio of each SMR data zone is similar. Suppose a zone size of 1GB and 3MB per track. The total write workload of this trace is about 1GB. So is it possible to move 1GB to a new place within the effective idle time? Here is the analysis (a worked sketch of the arithmetic follows the list):
  • Completion of a sequential 1GB read and a sequential 887MB write (assuming a 13.4% garbage ratio) requires around 5.3 seconds at 7200RPM.

  • The average useful idle time for GC is 14.4/33 − 0.1 ≈ 0.34 seconds. Suppose the positioning time is 6ms for each read or write; then (0.34 − 2 × 0.006) seconds can handle up to 64.4MB of data in a GC zone.1 A total of 33 idle intervals can therefore handle around 2GB of data, which is larger than 1GB.

  • Additionally, the old video data is replaced by new data periodically, which will not change the garbage ratio much in general.
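
The idle-time budget above can be restated as a short calculation so the assumptions (3MB/track at 7200RPM, 6ms positioning per access, 13.4% garbage ratio, 0.1s detection threshold) are easy to vary; the exact per-interval figure depends on how the read and write shares are split, so the result below is close to, rather than identical with, the 64.4MB quoted above.

```python
# Back-of-the-envelope GC budget for trace 2, restating the assumptions above.

track_mb = 3.0                    # MB per track (assumption from the text)
rpm = 7200
seq_rate = track_mb * rpm / 60.0  # ~360 MB/s sequential transfer rate

garbage_ratio = 0.134
zone_mb = 1024.0                  # 1GB zone

# Moving one zone: read the whole zone, write back only the valid data.
move_zone_s = (zone_mb + zone_mb * (1 - garbage_ratio)) / seq_rate
print(f"move one 1GB zone: {move_zone_s:.1f} s")                   # ~5.3 s

# Effective idle budget: 14.4 s of idle time spread over 33 intervals,
# minus the 0.1 s detection threshold and 6 ms positioning per access.
avg_idle_s = 14.4 / 33 - 0.1
usable_s = avg_idle_s - 2 * 0.006
per_interval_mb = usable_s * seq_rate / (1 + (1 - garbage_ratio))
print(f"valid data moved per idle interval: {per_interval_mb:.0f} MB")
print(f"over 33 intervals: {per_interval_mb * 33 / 1024:.1f} GB")  # ~2 GB > 1 GB
```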

Ideally, the idle time seems large enough to handle GC activities, given that
  • The effective idle time is fully used and the GC size is adjusted dynamically.

  • The idle time detection algorithm works well even with a lower detection threshold, such as reducing it from 100ms to 50ms, to increase GC activity.

  • Other background activities do not take much time.

However, in reality, much larger idle time may be required. In particular, defragmentation may significantly increase the write amplification ratio.