Trace analysis provides insights into workload properties and IO patterns, which are essential for storage system tuning and optimization. This chapter discusses how the workload interacts with system components, algorithms, structures, and applications.
Interactions with Components
Write cache hits can avoid some mechanical disk writes, since the dirty blocks in the DRAM cache are simply overwritten; this is a benefit of temporal locality.
A larger cache allows a longer write queue, so physically contiguous dirty blocks can be grouped into a single IO operation; this is a benefit of spatial locality.
An advanced replacement policy efficiently destages cold data to disk while keeping hot data in the cache by exploiting both spatial and temporal locality.
The cache can also temporarily absorb a write burst and spread the write load evenly over time to minimize the impact on concurrent IOs.
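To make the spatial-locality benefit concrete, the following is a minimal sketch (in Python, with purely illustrative LBAs and no particular controller in mind) of coalescing physically contiguous dirty blocks into extents so that each extent can be flushed with a single IO:

```python
# Illustrative only: coalesce dirty cache blocks into contiguous extents
# so that each extent can be flushed with a single IO instead of one IO
# per block (a benefit of spatial locality).

def coalesce_dirty_blocks(dirty_lbas):
    """Group sorted dirty LBAs into (start, length) extents."""
    extents = []
    for lba in sorted(set(dirty_lbas)):
        if extents and lba == extents[-1][0] + extents[-1][1]:
            extents[-1] = (extents[-1][0], extents[-1][1] + 1)  # extend the current run
        else:
            extents.append((lba, 1))                            # start a new run
    return extents

dirty = [100, 101, 102, 517, 518, 9000]      # hypothetical dirty blocks
print(coalesce_dirty_blocks(dirty))          # [(100, 3), (517, 2), (9000, 1)]
```

Three flush IOs instead of six in this toy example; the saving grows with the degree of spatial locality in the workload.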
For an HDD with a media-based cache (MBC), the write cache can absorb a write burst in a log-like manner and thus allow grouped IO accesses to be arranged more effectively. In this section, I mainly discuss the HDD and SSD factors that influence performance.
HDD Factors
For an HDD, performance varies with respect to (wrt) the drive's features (e.g., RPM, TPI/SPT, location [OD, MD, or ID], head quality, servo control mechanism, cache structure/algorithm, queue length, and so on) and the workload properties (e.g., sequentiality, request size, queue depth, and more).
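As a rough illustration of how these drive features combine, the sketch below estimates per-IO service time from an average seek, half a rotation, and the transfer time; all parameter values are hypothetical and not taken from any specific drive:

```python
# Rough first-order HDD service-time model (illustrative numbers only):
# t_service ~ average seek + half a rotation + transfer time.

def hdd_service_time_ms(avg_seek_ms, rpm, request_kib, media_rate_mib_s):
    rotational_ms = 0.5 * 60_000.0 / rpm                          # half a revolution
    transfer_ms = (request_kib / 1024.0) / media_rate_mib_s * 1000.0
    return avg_seek_ms + rotational_ms + transfer_ms

# Hypothetical 7200 RPM drive, 4 KiB random read near the outer diameter (OD):
t = hdd_service_time_ms(avg_seek_ms=8.5, rpm=7200, request_kib=4, media_rate_mib_s=200)
print(f"~{t:.2f} ms per IO -> ~{1000.0 / t:.0f} IOPS")
```

The transfer term is what changes with location (OD vs. ID) and request size, while the seek and rotational terms dominate for small random IOs.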

Throughput difference in different HDD locations

IOPS difference wrt queue depth and request size under WCE (write cache enabled) and WCD (write cache disabled) (random write via IOMeter)

Throughput difference wrt buffer size for WCE and WCD (sequential write via IOMeter)
SSD Factors
For an SSD, performance also varies wrt the drive's features (e.g., die count, block size, parallel access, flash management algorithms such as wear leveling, address mapping policy, trim condition, cache structure/algorithm, queue length, IO driver interface, and more) and the workload properties (e.g., fragmentation, sequentiality/randomness, write updates, read/write ratio, request size, queue depth, request intensity/throttling, etc.).

SSD performance states [12]
NAND SSDs generally use a virtual address mapping scheme, whereby LBAs are mapped to PBAs, for several reasons [12]. For instance, wear-leveling algorithms place updated data in new cell locations to spread wear evenly across the memory cells and thus improve cell life (endurance). As a result, the SSD must keep track of the LBA-PBA associations. Similar to an HDD, sequential operation may also be faster than random access when the data in the physical locations is less fragmented.
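A highly simplified sketch of such an LBA-to-PBA table is shown below; it only illustrates out-of-place updates and stale-page tracking, while real firmware also handles wear leveling across dies, garbage collection, and much more. All structures and sizes are hypothetical:

```python
# Highly simplified page-level FTL sketch: every write goes to a fresh
# physical page (out-of-place update), and the LBA -> PBA table tracks
# the latest location of each logical page.

class TinyFTL:
    def __init__(self, num_pages):
        self.l2p = {}                    # LBA -> PBA mapping table
        self.free = list(range(num_pages))
        self.invalid = set()             # stale pages awaiting erase by GC

    def write(self, lba):
        if lba in self.l2p:              # update: the old page becomes stale
            self.invalid.add(self.l2p[lba])
        pba = self.free.pop(0)           # allocate a fresh physical page
        self.l2p[lba] = pba
        return pba

    def read(self, lba):
        return self.l2p[lba]             # KeyError if the LBA was never written

ftl = TinyFTL(num_pages=8)
ftl.write(lba=5); ftl.write(lba=5)       # the second write relocates LBA 5
print(ftl.l2p, ftl.invalid)              # {5: 1} {0}
```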

SSD IOPS vs. queue depth

SSD sequential throughput vs. request size
The read/write ratio has a larger impact on SSDs than on CMR HDDs. First, a "new" write generally needs more time than a read, since the write operation involves more steps than the read operation. Second, an in-place update requires an erase first. The number of write steps therefore depends on how full the drive is and on whether the SSD controller must erase the target cell (or even relocate some data with a more time-costly RMW access) before writing the new data.
Although the sequential-to-random performance ratio is not as high as for HDDs, sequentiality is still important: grouping write requests by block minimizes erase operations, which improves both lifetime and IO performance, among other benefits.
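The effect of grouping writes by block can be illustrated with a simple count of how many erase blocks a write stream touches; the geometry and write counts below are hypothetical:

```python
# Illustrative count of how many erase blocks a stream of page updates
# touches: grouping writes by block (sequential) dirties far fewer blocks
# than scattering the same number of writes at random.

import random

PAGES_PER_BLOCK = 256                     # hypothetical flash geometry
NUM_WRITES = 1024

sequential = range(NUM_WRITES)            # pages written in order
scattered = random.sample(range(1_000_000), NUM_WRITES)

def blocks_touched(pages):
    return len({p // PAGES_PER_BLOCK for p in pages})

print("blocks touched, sequential:", blocks_touched(sequential))   # 4
print("blocks touched, scattered: ", blocks_touched(scattered))    # usually close to 1024
```

Fewer touched blocks means fewer eventual erase/garbage-collection cycles for the same amount of written data.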
Access location: It matters significantly for an HDD due to positioning time, whereas for NVM it generally does not determine latency, although the access order still matters.
Access size: For an HDD, large and sequential requests are significantly faster than small and random ones. Size has less impact on NVM; in fact, a larger IO may pay an additional cost due to internal structures, although sequential access is still generally faster than random access on NVM.
Access type: An HDD is usually accessed at the block or file level; object-based devices still rely on an internal mapping between blocks and objects. Some NVMs, in contrast, can be byte-addressable, and object-level mapping is more native to them than to HDDs. Read and write performance is also likely to differ in many NVMs.
Content: Some techniques, such as compression and deduplication, are content-dependent. For HDDs they are often not worthwhile: the additional computational and IO resource usage can degrade HDD performance substantially, which may outweigh the space savings. For NVM, however, these techniques can reduce cost and improve storage efficiency.
Timing: An HDD usually caps out at roughly 300 IOPS, while some NVM devices may be 100 to 10,000 times faster, so the cache and IO schedulers must be designed quite differently (see the sketch below).
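A back-of-the-envelope calculation makes the scheduler implication clear: the average time budget per IO shrinks from milliseconds to microseconds or less (the 300 IOPS baseline is from the discussion above; the multipliers are the quoted 100x and 10,000x):

```python
# Back-of-the-envelope time budget per IO: at HDD rates the scheduler has
# milliseconds to spend per request; at NVM rates it has microseconds or less.

for label, iops in [("HDD (~300 IOPS)", 300),
                    ("NVM, 100x", 300 * 100),
                    ("NVM, 10000x", 300 * 10_000)]:
    budget_us = 1_000_000 / iops          # average time per IO in microseconds
    print(f"{label:18s}: ~{budget_us:,.1f} us per IO")
```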
There are many testing and benchmarking tools with different measurement conditions, and the results from different tools can easily be confusing. SNIA therefore developed the SSSI Reference Test Platform (RTP) and the Performance Test Specification (PTS) to standardize SSD performance testing.2
Interactions with Algorithms
Two Most Important Factors
Items | Description | Typical Algorithms |
---|---|---|
Access frequency (R/W) (F) | The number of accesses within a given time period. Because read and write performance differ, read and write frequencies may also be considered separately. | LFU (least frequently used) [48], GDSF (Greedy-Dual Size Frequency) [49] |
Access interval (T) | The time interval between consecutive accesses (recency). Some argue that the least recently used data has a higher probability of being re-accessed in the near future; some deny it; a now widely accepted compromise is that it depends on the IO pattern/workload (a minimal LRU sketch follows the table). | LRU (least recently used), MRU (most recently used), LFUDA (LFU with dynamic aging) [48], LRU-K (least recently used K) [50], GDS [51] |
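As a concrete reference for one of the policies above, here is a minimal LRU cache sketch built on Python's OrderedDict; it is a teaching aid, not a production replacement policy:

```python
# Minimal LRU cache (one policy from the table above), built on OrderedDict.
# On a hit the entry moves to the MRU end; on overflow the LRU end is evicted.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()          # oldest (LRU) .. newest (MRU)

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # refresh recency on a hit
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1); cache.put("b", 2); cache.get("a"); cache.put("c", 3)
print(list(cache.entries))                    # ['a', 'c'] -- 'b' was evicted
```

LFU would instead keep per-entry counters and evict the smallest count; the frequency-versus-recency trade-off is exactly the workload dependence noted in the table.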
Other Performance Factors
Items | Description |
---|---|
Data size | Generally, only hot data of small size should be moved to a higher tier. How small depends on the read/write speed ratio of the SSD and HDD, the migration speed between them, etc. |
Cache total/remaining size; device total/remaining bandwidth | The cache size decides how much hot data can be stored in the cache, and hence determines the hotness threshold. The bandwidth determines whether a migration is appropriate at the current time. An approximate function may be built to predict the remaining bandwidth with respect to the R/W ratio, IO intensity, etc. (a simple sketch combining several of these factors follows the table). |
R/W ratio | Since read and write access times and patterns differ, this ratio leads to different performance (e.g., different write amplification). |
R/W granularity and IO intensity | The value represents the ratio of the data amount of an R/W IO to a fixed-size data block. The average R/W granularity is the average of this ratio over all IOs within a predefined time interval. Commonly, the larger the value, the more important the data is to users. |
Data correlation | One piece of data may be related to another: if the IO operations on one data block show certain characteristics within a predefined period and another block shows similar properties, the two are associated. This value can be used for IO prediction. |
IO range/amount/distribution | The IO distribution represents statistical access information, such as the accessed address range and the access frequency within a given period. |
Grain size | The minimum size for each page/block to be replaced/migrated |
Tier contrast/compensation (device value) | It quantifies the difference between two storage tiers/caches for direct data migration, including device status, access speed, etc. |
Others | Data loss/error, etc. |
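The factors in this table are typically combined into a migration decision. The sketch below is one hypothetical way to do so; the function name, weights, and thresholds are all illustrative rather than a real policy:

```python
# Hypothetical promotion check combining a few factors from the table:
# access frequency, data size, remaining SSD space, and remaining bandwidth.
# Thresholds are illustrative, not taken from any real tiering policy.

def should_promote(freq_per_min, size_mib, ssd_free_mib, free_bw_mib_s,
                   min_freq=10, max_size_mib=64, min_bw_mib_s=50):
    if size_mib > max_size_mib:           # too large to be worth promoting
        return False
    if ssd_free_mib < size_mib:           # no room left in the upper tier
        return False
    if free_bw_mib_s < min_bw_mib_s:      # migration would hurt foreground IO
        return False
    return freq_per_min >= min_freq       # hot enough to justify the move

print(should_promote(freq_per_min=25, size_mib=16, ssd_free_mib=512, free_bw_mib_s=120))  # True
```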
Interactions with Structure
The fundamental structure of the storage device or system also has a large impact on system performance. For example, RAID- and EC-based systems provide data protection; however, they increase the internal IO burden on the disks due to the additional parity data. In particular, during recovery from a critical disk failure, the internal rebuild workload consumes a large portion of the disk bandwidth, so the overall performance delivered to external users is significantly degraded. Chapter 7 will analyze the impact of the RAID structure on the IO pattern.
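The internal parity burden can be quantified with the classic RAID-5 small-write arithmetic (each user write costs two reads plus two writes); the per-disk IOPS figure below is illustrative only:

```python
# Classic RAID-5 small-write arithmetic: each user write triggers
# read old data + read old parity + write new data + write new parity,
# i.e., four internal disk IOs per user write.

DISKS = 8
DISK_IOPS = 150                        # hypothetical per-disk random IOPS

raw_iops = DISKS * DISK_IOPS
raid5_user_write_iops = raw_iops / 4   # write penalty of 4
print(f"raw: {raw_iops} IOPS, RAID-5 random-write ceiling: {raid5_user_write_iops:.0f} IOPS")
```

During a rebuild, a further slice of the raw IOPS is consumed by reconstruction reads and writes, which is why external performance drops so visibly.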
A hybrid storage system has the potential to improve the performance of hot data, but the internal data migration may also occupy additional resources, and an improper IO scheduler or data migration algorithm will lower the overall performance. In addition, cache structures and tiering structures can differ greatly in data allocation and IO scheduling, which leads to performance diversity across scenarios. Chapter 6 will use a small-scale hybrid device as an example. Furthermore, the interconnection structure, such as the bus and bridge, can also become the performance bottleneck in some cases.
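The core difference between the cache and tiering structures mentioned above is copy versus move, as the toy sketch below shows; the data structures and function names are hypothetical:

```python
# Toy contrast between caching and tiering placement: a cache keeps a copy
# (the HDD still holds the block), while a tier moves the block so that it
# lives on exactly one device.

def cache_promote(hdd, ssd, block):
    ssd.add(block)                 # copy: the block is now on both devices

def tier_promote(hdd, ssd, block):
    hdd.discard(block)             # move: the block leaves the lower tier
    ssd.add(block)

hdd, ssd = {1, 2, 3}, set()
cache_promote(hdd, ssd, 2); print(hdd, ssd)   # {1, 2, 3} {2}
hdd, ssd = {1, 2, 3}, set()
tier_promote(hdd, ssd, 2);  print(hdd, ssd)   # {1, 3} {2}
```

A cache therefore spends SSD space on duplicates but can drop data cheaply, while a tier uses the full SSD capacity but must migrate data back before evicting it.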
Interactions with Applications
Typical Requirements for Some Applications
 | HPC data storage | Cloud storage | HDFS/Mixed | Archive/Backup | HPC checkpointing data | Database |
---|---|---|---|---|---|---|
Attribute | Near-term data storage: sequential, high-throughput WORM operations | Traditionally batch IO, sequential read/write | Traditionally batch IO, sequential read/write | Write once, read infrequently | Checkpoint operations: bursty, high-throughput operations | Transactional: small-IO read, modify, write |
Latency | Similar demands to cloud storage; generally Ethernet, sometimes IB (InfiniBand) | 10-100 ms | 10-100 ms | High latency expected (>=10 s); Amazon Glacier at 3-5 hours | 1-2 ms; up to 45 us for >1 KB data transfers | The faster the better; 0-10 ms, but can be up to 90 ms before issues arise |
IOPS/Tput | Many different offerings; vendor-specific storage specs vary | 100s to 10K IOPS depending on the size of the instance | Disk-performance dependent, with little observable overhead to impede HW performance | LTO-4 tape is 120 MB/s with 22 s latency | 10K+ IOPS per 4U unit | Usually 1K-30K IOPS, up to 1-10M+; implementation is platform-specific |
Other | Usually for scientific data analysis, not supercomputer checkpoint storage | High availability (4/5 nines); high data durability (9/11 nines) | In-place data analysis capability | High data durability is expected with infrequent data access | Communicates over dedicated IB; 84.8 TB per shelf | Supports ACID semantics |