RAID is one of the most widely applied data-protection strategies in the world [23, 60, 61, 62, 63]. Compared with single-disk access, it has unique features such as synchronization and recovery, which lead to some distinctive IO patterns. This chapter analyzes two RAID 5 examples from two application scenarios; large differences are observed between the two traces. It also examines whether these workloads are suitable for SMR drives and offers some suggestions for improving system performance.
The concept of RAID was introduced in 1987 to harness the potential of commodity hard drives, and Patterson et al. [64] formally established the RAID taxonomy in 1988. RAID overcomes the capacity limitation of commodity disks by exposing an array of such low-capacity disks as a virtual single large expensive disk (SLED).
RAID technology usually distributes data across a number of disks via data striping. A stripe is the smallest unit of protection in an array: any lost data within a stripe can be recovered using only the surviving data of that stripe. In the early days, clients were connected to the RAID via a serial access channel, so parallel access by multiple clients was not explicitly supported. With the development of advanced queuing schedulers, however, parallelism is now widely exploited in RAID systems to take full advantage of the multiple disks.
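To make the striping idea concrete, here is a minimal sketch that maps a logical block address to its disk, stripe, and offset. It assumes a left-symmetric RAID 5 layout with an arbitrary five-disk array and a 256-block stripe unit; these are illustrative assumptions, not the configuration of the arrays studied in this chapter.

```python
# Minimal sketch of RAID 5 address mapping (left-symmetric layout).
# The disk count and stripe-unit size are illustrative assumptions,
# not the configuration of the arrays analyzed in this chapter.

def raid5_map(lba, n_disks=5, stripe_unit=256):
    """Map a logical block address to (data_disk, parity_disk, stripe, offset)."""
    unit_idx = lba // stripe_unit                   # which stripe unit overall
    offset = lba % stripe_unit                      # offset inside that unit
    stripe = unit_idx // (n_disks - 1)              # n_disks - 1 data units per stripe
    parity_disk = (n_disks - 1 - stripe) % n_disks  # parity rotates across disks
    idx_in_stripe = unit_idx % (n_disks - 1)
    data_disk = (parity_disk + 1 + idx_in_stripe) % n_disks  # data follows parity
    return data_disk, parity_disk, stripe, offset

if __name__ == "__main__":
    for lba in (0, 512, 4096, 80135168):
        print(lba, raid5_map(lba))
```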
There are some common performance issues within RAID systems [20, 62, 63], such as the small write problem, the synchronization problem, performance loss in degraded mode (during recovery and reconstruction), and more.
The small write problem arises in many critical applications, such as online transaction processing (OLTP) systems, whose workloads contain many read-modify-write (RMW) accesses. This causes two issues for a RAID system. First, a small write to a striped array requires reading both the old data and the old parity block and computing the new parity before writing the new data and the new parity, i.e., four disk accesses instead of the single access needed on a plain disk. Second, such a small access alters only a few blocks within a specific stripe, yet the parity of the entire stripe is unavailable during the update, which dramatically degrades the performance of the array by reducing the possible parallelism.
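The four accesses of the small-write path follow directly from the parity arithmetic. The sketch below is a generic illustration of the RAID 5 read-modify-write sequence; the read_block/write_block callables are placeholders standing in for real disk I/O, not code from any particular array implementation.

```python
# Generic sketch of the RAID 5 small-write (read-modify-write) path.
# read_block / write_block are placeholder callables standing in for
# real disk I/O; the point is only to show the four accesses and the
# parity arithmetic: new_parity = old_parity XOR old_data XOR new_data.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(read_block, write_block, data_addr, parity_addr, new_data):
    old_data = read_block(data_addr)       # access 1: read old data
    old_parity = read_block(parity_addr)   # access 2: read old parity
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    write_block(data_addr, new_data)       # access 3: write new data
    write_block(parity_addr, new_parity)   # access 4: write new parity

if __name__ == "__main__":
    disk = {0: bytes(8), 1: bytes(8)}      # fake data block 0, parity block 1
    small_write(disk.__getitem__,
                lambda addr, blk: disk.__setitem__(addr, blk),
                data_addr=0, parity_addr=1, new_data=b"\x01" * 8)
    print(disk)
```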
The synchronization problem stems from data-integrity requirements: the system returns a completion signal only after all drives of a stripe have completed their accesses. Since some disks may finish earlier than others, the faster disks have to wait for the slow ones. This requirement can be relaxed under some conditions, such as non-critical applications or protected DRAM. During recovery, foreground user requests may also be heavily impacted by the background recovery accesses [65].
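Because the completion signal waits for the slowest member, the stripe response time is simply the maximum of the per-disk latencies; the toy numbers below are invented purely to illustrate that one slow disk dominates the stripe.

```python
# Toy illustration of the synchronization penalty: the stripe completes
# only when its slowest member disk does. The latencies are invented.
disk_latencies_ms = [4.1, 4.3, 3.9, 4.2, 11.7]           # one slow outlier

stripe_latency = max(disk_latencies_ms)                   # completion time
idle_wait = [stripe_latency - t for t in disk_latencies_ms]

print(f"stripe completes in {stripe_latency} ms; "
      f"fastest disk idle-waits {max(idle_wait):.1f} ms")
```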
Similar problems also apply to disk arrays using erasure codes (EC), and in some cases they can be even more critical because EC is more complex than traditional RAID.
Workload Analysis
You will study two RAID 5 examples from two different vendors, both in video surveillance applications. The system settings are given first, followed by the analysis of the two traces: a read-dominated case and a write-dominated case.
System Settings
RAID Trace 1: Read Dominated
Metrics | Combined | Read | Write |
---|---|---|---|
Number of commands | 7821 | 5493 (70.2%) | 2328 (29.8%) |
Number of blocks | 5284520 | 3231312 (61.1%) | 2053208 (38.9%) |
Average size (blocks) | 675.7 | 588.3 | 882 |

r/s | w/s | rsec/s | wsec/s | rkB/s | wkB/s | IOPS | TP (MBps) |
---|---|---|---|---|---|---|---|
8.86 | 3.75 | 5211.8 | 3311.6 | 2605.9 | 1655.8 | 12.61 | 4.23 |
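The per-second rates in the second table follow from the command and block counts once the trace duration is fixed. The sketch below recomputes them assuming a duration of roughly 620 seconds (implied by 7821 commands at 12.61 IOPS) and 512-byte blocks; small rounding differences from the table are expected.

```python
# Recompute the per-second rates of RAID Trace 1 from the raw counts.
# Assumptions: trace duration of ~620 s (implied by 7821 commands at
# 12.61 IOPS) and 512-byte blocks; minor rounding differences remain.

BLOCK_BYTES = 512
duration_s = 620.0                              # assumed duration

cmds_r, cmds_w = 5493, 2328
blks_r, blks_w = 3231312, 2053208

r_per_s = cmds_r / duration_s                   # ~8.9 r/s
w_per_s = cmds_w / duration_s                   # ~3.8 w/s
rsec_per_s = blks_r / duration_s                # ~5212 sectors/s
wsec_per_s = blks_w / duration_s                # ~3312 sectors/s
iops = (cmds_r + cmds_w) / duration_s           # ~12.6 IOPS
tp_mb_s = (blks_r + blks_w) * BLOCK_BYTES / duration_s / 2**20  # ~4.2 MB/s

print(r_per_s, w_per_s, rsec_per_s, wsec_per_s, iops, tp_mb_s)
```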
RAID Trace 2: Write Dominated
Metrics | Combined | Read | Write |
---|---|---|---|
Number of commands | 21449 | 8535 | 12914 |
Number of blocks | 2528514 | 357824 | 2170690 |
Average size (blocks) | 117.9 | 41.924 | 168.088 |
Average IOPS | 193.152 | 76.859 | 116.293 |
Average TP (MBps) | 11.118 | 1.573 | 9.545 |
Read-Dominated Trace

[Figure: LBA distribution of RAID Trace 1]
[Figure: Size distribution of RAID Trace 1]
[Figure: Idle time distribution of RAID Trace 1]
[Figure: Idle time CDF of RAID Trace 1]
A Segment of RAID Trace 1
Start (sec) | End (sec) | Start ID | End ID | Cmd | ICT (ms) | LBA | Length |
---|---|---|---|---|---|---|---|
27.51865 | 27.53302 | 529 | 529 | W | 0.109105 | 1.21e+08 | 1024 |
27.5331 | 27.96851 | 530 | 530 | W | 0.078425 | 1.21e+08 | 512 |
27.9686 | 27.98473 | 531 | 531 | R | 0.086485 | 80135168 | 1024 |
27.98483 | 27.987 | 532 | 532 | R | 0.10257 | 80351232 | 512 |
... | | | | | | | |
52.73206 | 52.7429 | 1014 | 1014 | R | 0.466475 | 80690688 | 512 |
53.46918 | 53.85766 | 1015 | 1015 | R | 726.2745 | 80875520 | 512 |
53.85778 | 53.86006 | 1016 | 1016 | R | 0.112894 | 80876032 | 512 |
53.86014 | 53.87545 | 1017 | 1017 | R | 0.083455 | 80881664 | 1024 |

[Figure: Frequented update of RAID Trace 1]
[Figure: Timed update of RAID Trace 1]
[Figure: Stack update of RAID Trace 1]
[Figure: Write hit distribution of RAID Trace 1]
Main Characteristics of Trace 1 for SMR
SMR characteristic | Observation |
---|---|
Sequential write | Large-size write requests (>= 512 blocks): > 99.9%. Mode ratio: 50% for both read and write (Q = 1). Sequential cmd ratio (M >= 2 & S >= 1024): 85% for write and 90% for read (Q >= 50). |
Write-once-read-many | R/W ratio: 70:30 by commands, 61:39 by blocks. Stacked ROW ratio: < 1%. Write blocks account for 38.9% of total accessed blocks (updated blocks for only 2.9% of write blocks). |
Garbage collection (GC) | Frequent but small idle intervals; short queue length. Frequented WUR of 5.8%; the updated blocks (written at least twice) are only 1% of total accessed blocks and 2.9% of write blocks, so the write update ratio is very small and the write amplification is about 103% (considering the short trace duration). A sketch of these update metrics follows the table. |
Sequential read to random write | ROW ratio is only about 1.2%, a very small read-on-write ratio, so written data is rarely read back soon after being written. |
In-place or out-of-place update | Very small update ratio; a large SSD/DRAM/AZR cache (absorbing write updates in cache) is not necessary for performance improvement. |
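As a companion to the update figures above, the following sketch shows one way to compute block-level write-update metrics (share of write blocks, share of blocks written at least twice) from a trace. The record format and the tiny example are assumptions made only for illustration.

```python
# Sketch: block-level write-update metrics from a trace. The record
# format (cmd, lba, length_in_blocks) and the tiny example below are
# assumptions made only for illustration.
from collections import Counter

def write_update_stats(records):
    write_counts = Counter()                  # per-block write counts
    read_blocks = 0
    for cmd, lba, length in records:
        if cmd == "W":
            for blk in range(lba, lba + length):
                write_counts[blk] += 1
        else:
            read_blocks += length

    total_writes = sum(write_counts.values())                  # written blocks, with repeats
    updated = sum(1 for c in write_counts.values() if c >= 2)  # blocks written at least twice

    return {
        "write_block_share": total_writes / (total_writes + read_blocks),
        "updated_share_of_writes": updated / len(write_counts),
        "updated_share_of_total": updated / (total_writes + read_blocks),
    }

# Tiny made-up example: block 100 is written twice.
print(write_update_stats([("W", 100, 2), ("R", 500, 4), ("W", 100, 1)]))
```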
Write-Dominated Trace
This trace, from a video surveillance application of a different vendor, differs from the previous one in many aspects, such as the read/write ratio, LBA distribution, size distribution, and write update ratio. Hence, for different vendors under different scenarios, the actual workloads may differ from each other significantly, even when the same storage structure is used.

[Figure: LBA distribution of RAID Trace 2]
[Figure: Size distribution of RAID Trace 2]
[Figure: Write stack distance of RAID Trace 2]
[Figure: Stack update of RAID Trace 2]
In sum, tens of mixed streams lead to a not-very-sequential IO pattern, which indicates that a proper stream-detection algorithm with a long queue is required. The special metadata and parity structure lead to a relatively large LBA update size and a large update-command ratio, which implies that a large DRAM/NVM/RAZ cache may be necessary to absorb the frequent updates. You can also conclude that the impact of a write cache is very limited in the previous read-dominated trace. Finally, note the frequent but small idle intervals, few of which are long enough to be effective, which indicates that the GC policy may need to be adjusted to fit this situation.
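One possible form of such stream detection is sketched below: it keeps a bounded table of recently seen streams and attaches each request to the stream whose expected next LBA it (nearly) matches. The table size and slack threshold are assumed parameters, not values prescribed by this chapter.

```python
# Rough sketch of sequential-stream detection over interleaved requests.
# The stream-table size and the "slack" gap threshold are assumed values.
from collections import OrderedDict

class StreamDetector:
    def __init__(self, max_streams=32, slack=128):
        self.next_lba = OrderedDict()        # stream id -> expected next LBA
        self.max_streams = max_streams
        self.slack = slack                   # tolerated gap (blocks) to stay "sequential"

    def classify(self, lba, length):
        """Return the id of the stream this request extends, or start a new one."""
        hit = None
        for sid, expected in self.next_lba.items():
            if 0 <= lba - expected <= self.slack:
                hit = sid
                break
        if hit is not None:
            self.next_lba[hit] = lba + length
            self.next_lba.move_to_end(hit)   # keep the table in LRU order
            return hit
        sid = lba                            # new stream, keyed by its first LBA
        self.next_lba[sid] = lba + length
        if len(self.next_lba) > self.max_streams:
            self.next_lba.popitem(last=False)  # evict the least recently used stream
        return sid

# Two interleaved sequential streams are separated correctly:
det = StreamDetector()
for lba in (1000, 5000, 1256, 5256, 1512, 5512):
    print(lba, "->", det.classify(lba, 256))
```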
Main Characteristics of Trace 2 for SMR
SMR characteristic | Observation |
---|---|
Sequential write | Large-size write requests (>= 128 blocks): 35%. Mode ratio for write: 18% at Q = 1 (27% at Q = 128). Sequential cmd ratio (M >= 2): 35% for write at Q = 1 and 60% at Q = 256. |
Write-once-read-many | R/W ratio: 1:1.5 by commands, 1:6.1 by blocks. High stacked ROW ratio. Write blocks account for 85.9% of total accessed blocks. |
Garbage collection (GC) | Updated blocks (written at least twice) are 13.4% of write blocks, a relatively high write update ratio, with write amplification of about 115.5% (considering the short trace duration). The updated-command ratio is over 50% with small overlaps, possibly due to attached metadata. Idle intervals on the host side are frequent but small, which makes background GC difficult. |
In-place or out-of-place update | Relatively high update ratio, so a large SSD/DRAM/AZR cache (absorbing write updates in cache) is necessary for performance improvement. |
Completion of a sequential 1 GB read and a sequential 887 MB write (assuming a 13.4% garbage ratio) requires around 5.3 seconds at 7200 RPM.
The average useful idle time for GC is 14.4/33 − 0.1 = 0.34 second per interval. Assuming a positioning time of 6 ms for each read and write, the remaining (0.34 − 0.006 × 2) second can handle up to 64.4 MB of data in the GC zone. A total of 33 idle intervals can therefore handle around 2 GB of data, which is larger than 1 GB.
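This budget can be rechecked in a few lines; the 200 MB/s sequential transfer rate used below is an assumed figure for a 7200 RPM drive, chosen only to reproduce the roughly 64 MB-per-interval estimate.

```python
# Back-of-the-envelope check of the GC idle-time budget above.
# The 200 MB/s sequential rate is an assumed figure for a 7200 RPM drive.
total_idle_s = 14.4           # total observed idle time (s)
intervals = 33                # number of idle intervals
detect_threshold_s = 0.1      # idle-detection threshold per interval
positioning_s = 0.006         # one positioning operation (seek + rotation)
seq_rate_mb_s = 200.0         # assumed sequential transfer rate

useful_s = total_idle_s / intervals - detect_threshold_s   # ~0.34 s per interval
transfer_s = useful_s - 2 * positioning_s                  # position once for read, once for write
per_interval_mb = transfer_s * seq_rate_mb_s               # ~64 MB per interval
total_mb = per_interval_mb * intervals                     # ~2.1 GB over all intervals

print(f"{useful_s:.2f} s useful, {per_interval_mb:.1f} MB/interval, {total_mb:.0f} MB total")
```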
Additionally, the old video data is replaced by new data periodically, which in general will not change the garbage ratio much.
The effective idle time should be fully used, and the GC size should be adjusted dynamically.
The idle-time algorithm works quite well with a lower idle-detection threshold, for example lowered from 100 ms to 50 ms to increase GC activity.
The other background activities may not take much time.
However, in reality you may require much larger idle times; in particular, defragmentation may significantly increase the write amplification ratio.