
7. Case Study: RAID


RAID is one of the most widely applied data-protection strategies in the world [23, 60, 61, 62, 63]. Compared with single-disk access, it has unique features such as synchronization, recovery, etc., which lead to some unique IO patterns. This chapter analyzes two RAID 5 examples from two application scenarios; large differences are observed between the two traces. The chapter also analyzes whether the workloads are suitable for SMR drives, and some suggestions are provided to improve system performance.

The concept of RAID was introduced in 1987 to harness the potential of commodity hard drives, and Patterson et al. [64] officially established the RAID taxonomy in 1988. RAID overcomes the capacity limitations of commodity disks by exposing an array of such low-capacity disks as a virtual single large expensive disk (SLED).

RAID technology usually distributes data across a number of disks via data striping. A stripe represents the smallest unit of protection in an array; thus, any lost data within a stripe can be recovered using only the surviving data within that stripe. In the early days, since clients were connected to the RAID via a serial access channel, parallel access by multiple clients was not explicitly supported. However, with many advanced queuing schedulers now available, parallelism is widely applied in RAID systems in order to fully exploit the advantage of multiple disks.

There are some common performance issues within RAID systems [20, 62, 63], such as the small-write problem, the synchronization problem, performance loss in degraded mode (recovery and reconstruction), and more.

The small-write problem exists in many critical applications, such as online transaction processing (OLTP) systems, which usually contain many read-modify-write (RMW) accesses. This causes two issues for a RAID system. First, a small write in a striped array requires reading both the old data and the old parity block and computing the new parity before writing both the new data and the new parity, which is four times as many accesses as a single disk needs. Second, these small accesses only alter a few blocks within a specific stripe, yet the parity disk for the entire stripe is unavailable during the update. This dramatically degrades the performance of the array by reducing the possible parallelism.
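
To make the four-access pattern concrete, here is a minimal sketch of the RAID 5 small-write path, using the standard parity relation new_parity = old_parity XOR old_data XOR new_data. The in-memory "disks," block size, and helper names are illustrative assumptions, not part of any particular RAID implementation.

```python
# Minimal sketch of the RAID 5 small-write (read-modify-write) path.
# Block size and the in-memory "disks" are illustrative assumptions.

BLOCK_SIZE = 4096  # bytes per block (assumption)

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data_disk: bytearray, parity_disk: bytearray,
                offset: int, new_data: bytes) -> int:
    """Update one data block and its parity; return the number of disk I/Os."""
    io_count = 0

    # 1) Read the old data block.
    old_data = bytes(data_disk[offset:offset + BLOCK_SIZE]); io_count += 1
    # 2) Read the old parity block.
    old_parity = bytes(parity_disk[offset:offset + BLOCK_SIZE]); io_count += 1

    # Compute the new parity: P_new = P_old XOR D_old XOR D_new.
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)

    # 3) Write the new data block.
    data_disk[offset:offset + BLOCK_SIZE] = new_data; io_count += 1
    # 4) Write the new parity block.
    parity_disk[offset:offset + BLOCK_SIZE] = new_parity; io_count += 1

    return io_count  # 4 I/Os, versus 1 for a single non-redundant disk

# Example: one 4KB logical write turns into four physical accesses.
data = bytearray(BLOCK_SIZE * 8)
parity = bytearray(BLOCK_SIZE * 8)
print(small_write(data, parity, 0, bytes([0xAB]) * BLOCK_SIZE))  # -> 4
```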

The synchronization problem is due to data integrity requirements: the system returns a completion signal only when all drives in a stripe have completed their accesses. Since some disks may finish earlier than others, the faster disks have to wait for the slow ones. This requirement may be relaxed under some conditions, such as non-critical applications, protected DRAM, etc. During recovery, the foreground user requests may be heavily impacted by the background recovery accesses [65].
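
This stripe-level synchronization can be viewed as a fork-join delay: a stripe completes only when its slowest member disk finishes. The toy simulation below, with an assumed per-disk service-time model, shows how the stripe response time (the maximum over member disks) exceeds the average single-disk time.

```python
import random

# Fork-join view of a striped access: the stripe completes when the
# slowest member disk completes. Service times here are synthetic.
random.seed(1)

def disk_service_time_ms() -> float:
    # Assumed model: ~4.2ms average rotational latency plus seek/transfer jitter.
    return 4.2 + random.uniform(0.0, 8.0)

def stripe_response_time_ms(num_disks: int) -> float:
    times = [disk_service_time_ms() for _ in range(num_disks)]
    return max(times)  # all disks in the stripe must finish

avg_single = sum(disk_service_time_ms() for _ in range(10000)) / 10000
avg_stripe = sum(stripe_response_time_ms(10) for _ in range(10000)) / 10000
print(f"average single-disk time: {avg_single:.1f} ms")
print(f"average 10-disk stripe time (max of members): {avg_stripe:.1f} ms")
```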

Similar problems also apply to disk arrays using erasure codes (EC). In some cases, the problems may be even more critical due to the higher complexity of EC compared with traditional RAID.

Workload Analysis

You will study two RAID 5 examples from two different vendors under video surveillance applications. The system settings will be given first, followed by the analysis of two different traces: read-dominant and write-dominant cases.

System Settings

In the first example, there are 10 7200RPM HDDs of 4TB each. 24 write streams and 6 read streams are imposed on this system. The second example has 36 similar HDDs with 90 video channels. The trace lengths are 620 and 110 seconds, respectively. Some basic metrics are listed in Tables 7-1 and 7-2.
Table 7-1. RAID Trace 1: Read Dominated

Metric                   Combined     Read                Write
Number of commands       7821         5493 (70.2%)        2328 (29.8%)
Number of blocks         5284520      3231312 (61.1%)     2053208 (38.9%)
Average size (blocks)    675.7        588.3               882

r/s      w/s      rsec/s     wsec/s     rkB/s      wkB/s      IOPS      TP (MBps)
8.86     3.75     5211.8     3311.6     2605.9     1655.8     12.61     4.23

Table 7-2. RAID Trace 2: Write Dominated

Metric                   Combined     Read        Write
Number of commands       21449        8535        12914
Total blocks             2528514      357824      2170690
Average size (blocks)    210.012      41.924      168.088
Average IOPS             193.152      76.859      116.293
Average TP (MBps)        11.118       1.573       9.545

Read-Dominated Trace

The LBA distribution of requests in this trace is nearly sequential, as shown in Figure 7-1. For reads, there are two regions: one coincides with the current write region, and the other is close to the previous write region (i.e., playback). The read requests are mostly 512 or 1024 blocks in size. However, the share of 1024-block requests is lower for reads than for writes, as displayed in Figure 7-2.
Figure 7-1. LBA distribution of RAID trace 1

Figure 7-2. Size distribution of RAID trace 1

In general, this trace has a large portion of idle time, accumulating to 83.4% of the total time. The total idle time is almost evenly distributed over time, but the large idle intervals are not, as shown in Figure 7-3. The intervals longer than 200ms and 500ms account for only 8% and 1.7% of all idle intervals, respectively, but occupy 71.6% and 34% of the total idle time, respectively. In fact, 65% (94%) of the idle intervals are shorter than 10ms (1s), while only 2% (70%) of the idle time comes from intervals shorter than 10ms (1s), as illustrated in Figure 7-4. So you can conclude that the total idle time is long enough for small-IO-based background activities, but the individual long idle intervals may not be sufficient, which means GC accesses should be completed in small steps.
Figure 7-3. Idle time distribution of RAID trace 1

Figure 7-4. Idle time CDF of RAID trace 1
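
The idle-interval statistics behind Figures 7-3 and 7-4 can be reproduced by a single pass over the trace, treating each gap between the completion of one command and the arrival of the next as an idle interval. The sketch below assumes a simplified trace format of (start, end) timestamps per command; the threshold values mirror the 200ms/500ms cut-offs discussed above.

```python
# Sketch of idle-interval extraction from a block trace.
# Assumed trace format: one command per row with start and end times in seconds.

def idle_intervals(cmds):
    """cmds: list of (start_sec, end_sec) sorted by start time."""
    gaps = []
    busy_until = cmds[0][1]
    for start, end in cmds[1:]:
        if start > busy_until:
            gaps.append(start - busy_until)  # device was idle for this long
        busy_until = max(busy_until, end)
    return gaps

def idle_summary(gaps, thresholds=(0.2, 0.5)):
    total = sum(gaps)
    summary = {"total_idle_s": total, "count": len(gaps)}
    for t in thresholds:
        big = [g for g in gaps if g > t]
        summary[f">{int(t * 1000)}ms"] = {
            "freq_share": len(big) / len(gaps),   # e.g. 8% of intervals > 200ms
            "time_share": sum(big) / total,       # e.g. 71.6% of total idle time
        }
    return summary

# Usage with a toy trace of (start, end) pairs in seconds.
trace = [(0.00, 0.01), (0.02, 0.03), (0.50, 0.51), (1.60, 1.62)]
print(idle_summary(idle_intervals(trace)))
```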

Besides the large idle time, note that some requests have abnormally large response times (> 200ms). Even in the worst case, the access time of a 1024-block request should not exceed 60ms. Thus, the waiting time is too long for the two cases listed in Table 7-3: 1) CMD 530 is LBA-contiguous with CMD 529, yet this write access costs over 430ms; 2) CMD 1015 is close to CMD 1014, yet it costs about 390ms. This may be caused by 1) background disk activities such as log writes, metadata updates, zone switches, etc.; or 2) RAID synchronization events. A possible solution is to distribute tasks evenly and actively provide idle time for background tasks.
Table 7-3. A Segment of RAID Trace 1

Start (sec)    End (sec)     Start ID    End ID    Cmd    ICT (ms)     LBA          Length
27.51865       27.53302      529         529       W      0.109105     1.21e+08     1024
27.5331        27.96851      530         530       W      0.078425     1.21e+08     512
27.9686        27.98473      531         531       R      0.086485     80135168     1024
27.98483       27.987        532         532       R      0.10257      80351232     512
...
52.73206       52.7429       1014        1014      R      0.466475     80690688     512
53.46918       53.85766      1015        1015      R      726.2745     80875520     512
53.85778       53.86006      1016        1016      R      0.112894     80876032     512
53.86014       53.87545      1017        1017      R      0.083455     80881664     1024
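
The long-latency commands in Table 7-3 can be flagged automatically by comparing each response time against a worst-case service bound (60ms for a 1024-block access, per the discussion above). A minimal sketch, using a few rows from Table 7-3 and a simplified row layout as assumptions:

```python
# Flag commands whose response time exceeds a worst-case service bound.
# Rows follow a simplified Table 7-3 layout:
# (start_s, end_s, cmd_id, rw, ict_ms, lba, length_blocks).

WORST_CASE_MS = 60.0  # assumed upper bound for a 1024-block access

rows = [
    (27.51865, 27.53302, 529, "W", 0.109105, 121_000_000, 1024),
    (27.53310, 27.96851, 530, "W", 0.078425, 121_000_000, 512),
    (53.46918, 53.85766, 1015, "R", 726.2745, 80_875_520, 512),
]

for start, end, cmd_id, rw, ict_ms, lba, length in rows:
    resp_ms = (end - start) * 1000.0
    if resp_ms > WORST_CASE_MS:
        print(f"CMD {cmd_id} ({rw}, {length} blocks, LBA {lba}): "
              f"response {resp_ms:.0f} ms, inter-command time {ict_ms:.1f} ms")
# -> CMD 530 and CMD 1015 are flagged (~435 ms and ~388 ms respectively)
```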

For the frequented write update shown in Figure 7-5, you can see that 94.2% of the written blocks (possibly counted with repetition) are written only once, 5.8% of the blocks are written at least twice, and fewer than 0.1% of the blocks are written three times. This means a very low rewrite ratio. Thus you need to identify whether large or small requests are rewritten most often. The fact that a decreasing percentage of blocks is written multiple times means only a tiny portion of the blocks is hot.
Figure 7-5. Frequented update of RAID trace 1

For the timed write update shown in Figure 7-6, the written blocks account for 35% of the total accessed blocks (read and write), while the updated blocks (written at least twice) are only 1% (1/35 = 2.9% of the written blocks are rewritten). The write commands are 30% of the total commands, and the update commands are 1.5%. Note that the timed write update ratio is closely related to the frequented write update ratio; in other words, sum(hits × (update frequency − 1)) / total written blocks = updated blocks / total written blocks.
Figure 7-6. Timed update of RAID trace 1
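
The frequented (frequency-based) and timed write-update statistics in Figures 7-5 and 7-6 both come from counting how many times each written block appears in the trace. The sketch below shows one way to compute them and reflects the relation stated above; the per-block counting and the (lba, length) record format are simplifying assumptions.

```python
from collections import Counter

# Count per-block write frequency from (lba, length) write requests.
def block_write_counts(writes):
    counts = Counter()
    for lba, length in writes:
        for blk in range(lba, lba + length):
            counts[blk] += 1
    return counts

def update_ratios(writes):
    counts = block_write_counts(writes)
    total_written_blocks = sum(counts.values())           # counted with repetition
    distinct_blocks = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    updated_blocks = sum(c - 1 for c in counts.values())  # rewrites only
    return {
        "written_once_pct": 100.0 * once / distinct_blocks,  # ~94.2% in trace 1
        "frequented_update_pct": 100.0 * (distinct_blocks - once) / distinct_blocks,
        # timed update ratio = sum(hits * (frequency - 1)) / total written blocks
        "timed_update_pct": 100.0 * updated_blocks / total_written_blocks,
    }

# Toy usage: two 4-block writes, one of which rewrites a previous extent.
print(update_ratios([(1000, 4), (2000, 4), (1000, 4)]))
```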

Further considering the write stack distance in Figure 7-7, you can see that the hit ratio is low, so it is not necessary to have an inline write cache holding the write data for a long time. Based on the write IOPS, reaching a stack distance of 100 takes roughly 26.6 seconds. In this period, only 10% full write hits and 20% partial write hits are observed within the overall 5% hit ratio. The updated size is around 43MB on average. Thus it is not worthwhile to compensate for such a small hit ratio. Note that the write hits are distributed over the whole write range. A full hit occurs only for 512-block requests, while a partial hit occurs for 1024-block requests in this trace. See Figure 7-8 for details.
Figure 7-7. Stack update of RAID trace 1

Figure 7-8. Write hit distribution of RAID trace 1
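
The write stack distance in Figure 7-7 measures how much other write traffic arrives between two writes to the same data; dividing a target distance by the write rate (e.g., 100 / 3.75 wr/s ≈ 26.6 seconds here) gives the time a cache would have to retain the data. The sketch below computes a simplified block-level stack distance and assumes (lba, length) write records.

```python
# Simplified write stack distance: for each rewritten block, count how many
# distinct blocks were written since its previous write.

def write_stack_distances(writes):
    """writes: iterable of (lba, length); returns a list of stack distances."""
    stack = []       # most-recently-written blocks, most recent last
    distances = []
    for lba, length in writes:
        for blk in range(lba, lba + length):
            if blk in stack:
                pos = stack.index(blk)
                distances.append(len(stack) - 1 - pos)  # blocks written in between
                stack.pop(pos)
            stack.append(blk)
    return distances

# Usage: blocks 1000-1003 are rewritten after 4 other blocks were written.
dists = write_stack_distances([(1000, 4), (2000, 4), (1000, 4)])
print(dists)                      # -> [7, 7, 7, 7]
write_iops = 3.75                 # from Table 7-1
print(f"distance 100 ~= {100 / write_iops:.1f} s of cache residency")
```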

When you take SMR drives into consideration, as discussed in Chapter 6, the main characteristics are summarized in Table 7-4. You may read this table together with the SMR properties introduced in Chapter 6. Although this is a read-dominated trace, it has no WORM property.
Table 7-4. Main Characteristics of Trace 1 for SMR

SMR characteristic               Observation
Sequential write                 Large-size write requests (>= 512 blocks): > 99.9%
                                 Mode ratio: 50% for both read and write (Q=1)
                                 Sequential cmd ratio (M>=2 & S>=1024): write 85%, read 90% (Q>=50)
Write-once-read-many             R/W ratio: commands 70:30; blocks 61:39
                                 Stacked ROW ratio: < 1%
                                 Updated blocks are only 2.9% of the written blocks
Garbage collection (GC)          Frequent but small idle intervals; short queue length
                                 5.8% frequented write update ratio; the updated blocks (written at
                                 least twice) are only 1% of total accessed blocks and 2.9% of
                                 written blocks, so a very small write update ratio and a write
                                 amplification of 103% (considering the short trace duration)
Sequential read to random write  ROW ratio is only 1.2, so the written data is rarely read back
                                 immediately
In-place or out-of-place update  Very small update ratio; not necessary to apply a large SSD/DRAM/AZR
                                 cache for performance improvement (write updates in cache)

Write-Dominated Trace

This trace, from another video surveillance application, differs from the previous one in many aspects, such as the read/write ratio, LBA distribution, size distribution, write update ratio, etc. Therefore, for different vendors under different scenarios, the actual workloads may differ significantly, even with the same storage structure.

Figure 7-9 shows the LBA distribution, where a single main region spanning 30GB near the start of the LBA space serves both reads and writes (the trace was collected when the RAID was nearly empty). If you further consider the LBA vs. time view, you can see that the write requests are more sequential than the reads. Figure 7-10 illustrates that writes and reads have a similar size distribution, dominated by 8-block requests, with a similar shape over the 8-128 block range. Also, the size distribution range is much wider than in the previous trace.
Figure 7-9. LBA distribution of RAID trace 2

Figure 7-10. Size distribution of RAID trace 2

As this trace is write dominated, let's focus more on the write update. Figure 7-11 shows the stack distance for write requests. It confirms the timed write update, with small overlap sizes (possibly due to metadata blocks attached to the data blocks). From Figure 7-12, a stack distance of 250 corresponds to roughly 2.1 seconds based on the IOPS. In this period, there are nearly 60% full write hits and 60% partial write hits within the overall 52% write command hit ratio. This means that some portions keep being updated. Therefore, the disk or system may require a random access zone or NVM for this small portion of updated data.
Figure 7-11. Write stack distance of RAID trace 2

Figure 7-12. Stack update of RAID trace 2

In sum, tens of mixed streams lead to a not-very-sequential IO pattern, which indicates that a proper stream-detection algorithm with a long queue is required. The special metadata and parity structure leads to a relatively high LBA update size and a large update command ratio, which implies that a large DRAM/NVM/RAZ cache may be necessary to absorb the frequent updates. By contrast, as concluded above, the impact of a write cache is very limited in the previous read-dominated trace. Also note the frequent but small idle intervals and the few effective ones, which indicates that the GC policy may need to be adjusted to fit this situation.
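
A stream-detection algorithm of the kind suggested here typically keeps a table of recently seen streams and appends an incoming request to the stream whose tail it (nearly) continues; with tens of interleaved channels, a long detection queue is needed before per-stream sequentiality becomes visible. The sketch below is one simple LRU-based variant; the table size and the allowed gap are tunable assumptions.

```python
# Minimal sequential-stream detector: each incoming write is matched against
# the tails of recently seen streams; unmatched writes open a new stream.

MAX_STREAMS = 64      # detection queue length (assumption; tens of channels here)
MAX_GAP_BLOCKS = 256  # allowed gap to still count as the same stream (assumption)

class StreamDetector:
    def __init__(self):
        self.streams = []  # list of [next_expected_lba, request_count]

    def offer(self, lba, length):
        for s in self.streams:
            if 0 <= lba - s[0] <= MAX_GAP_BLOCKS:
                s[0] = lba + length     # extend the matched stream
                s[1] += 1
                self.streams.remove(s)
                self.streams.append(s)  # keep the most recently used at the back
                return True             # sequential w.r.t. an existing stream
        self.streams.append([lba + length, 1])
        if len(self.streams) > MAX_STREAMS:
            self.streams.pop(0)         # evict the least recently used stream
        return False

# Usage: two interleaved sequential streams are both recognized.
det = StreamDetector()
reqs = [(0, 8), (100000, 8), (8, 8), (100008, 8), (16, 8), (100016, 8)]
print([det.offer(lba, ln) for lba, ln in reqs])
# -> [False, False, True, True, True, True]
```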

The impact on SMR drives is summarized in Table 7-5. Essentially, the impact on normal write accesses is not trivial due to the relatively large update ratio. The high updated command ratio may cause relatively heavy defragmentation. The metadata management scheme of the surveillance system and/or the SMR drives may require changes. The drive may not have enough idle time for background GC, subject to the GC policy, as the useful effective idle intervals are marginal.
Table 7-5. Main Characteristics of Trace 2 for SMR

SMR characteristic               Observation
Sequential write                 Large-size write requests (>= 128 blocks): 35%
                                 Mode ratio: 18% (27%) for write when Q=1 (Q=128)
                                 Sequential cmd ratio (M>=2): write 35% at Q=1 and 60% at Q=256
Write-once-read-many             R/W ratio: commands 1:1.5; blocks 1:6.1
                                 High stacked ROW ratio
                                 Write blocks account for 85.9% of total accessed blocks
Garbage collection (GC)          Updated blocks (written at least twice) are 13.4% of the written
                                 blocks, so a relatively high write update ratio and a write
                                 amplification of 115.5% (considering the short trace duration)
                                 Updated command ratio > 50% with small overlaps, possibly due to
                                 the attached metadata
                                 Frequent but small host-side idle intervals, which makes
                                 background GC difficult
In-place or out-of-place update  Relatively high update ratio, so it is necessary to apply a large
                                 SSD/DRAM/AZR cache for performance improvement (write updates in
                                 cache)

This trace is much busier than the previous one. The total effective idle time (idle intervals > 0.1s) is 14.4 seconds, out of a total idle time of 98.6 seconds. The total effective idle frequency (idle intervals > 0.1s) is only 33, while the total idle frequency is 6244. Now the question is whether the (effective) idle time is enough for background activities. Due to the relatively consistent workload of video surveillance and its data/metadata structure, the garbage ratio of each SMR data zone is similar. Suppose a zone size of 1GB and 3MB per track. The total write workload of this trace is about 1GB. So is it possible to move 1GB to a new place within the effective idle time? Here is the analysis (a worked sketch of the arithmetic follows the list):
  • Completion of a sequential 1GB read and a sequential 887MB write (assuming a 13.4% garbage ratio) requires around 5.3 seconds at 7200RPM.

  • The average useful idle time for GC is 14.4/33 − 0.1 ≈ 0.34 seconds. Suppose the positioning time is 6ms for each read or write; then (0.34 − 2 × 0.006) seconds can handle up to 64.4MB of data in a GC zone.1 A total of 33 idle intervals can therefore handle around 2GB of data, which is larger than 1GB.

  • Additionally, the old video data is replaced by new data periodically, which will not change the garbage ratio much in general.
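
The idle-time budget above can be restated as a short calculation so the assumptions (3MB/track at 7200RPM, 6ms positioning per access, 13.4% garbage ratio, 0.1s detection threshold) are easy to vary; the exact per-interval figure depends on how the read and write shares are split, so the result below is close to, rather than identical with, the 64.4MB quoted above.

```python
# Back-of-the-envelope GC budget for trace 2, restating the assumptions above.

track_mb = 3.0                    # MB per track (assumption from the text)
rpm = 7200
seq_rate = track_mb * rpm / 60.0  # ~360 MB/s sequential transfer rate

garbage_ratio = 0.134
zone_mb = 1024.0                  # 1GB zone

# Moving one zone: read the whole zone, write back only the valid data.
move_zone_s = (zone_mb + zone_mb * (1 - garbage_ratio)) / seq_rate
print(f"move one 1GB zone: {move_zone_s:.1f} s")                   # ~5.3 s

# Effective idle budget: 14.4 s of idle time spread over 33 intervals,
# minus the 0.1 s detection threshold and 6 ms positioning per access.
avg_idle_s = 14.4 / 33 - 0.1
usable_s = avg_idle_s - 2 * 0.006
per_interval_mb = usable_s * seq_rate / (1 + (1 - garbage_ratio))
print(f"valid data moved per idle interval: {per_interval_mb:.0f} MB")
print(f"over 33 intervals: {per_interval_mb * 33 / 1024:.1f} GB")  # ~2 GB > 1 GB
```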

Ideally, the idle time seems large enough to handle GC activities, given that
  • The effective idle time is fully used and the GC size is adjusted dynamically.

  • The idle time detection algorithm works well even with a lower detection threshold, such as reducing it from 100ms to 50ms, to increase GC activity.

  • Other background activities do not take much time.

However, in reality, much larger idle time may be required. In particular, defragmentation may significantly increase the write amplification ratio.