Trace Analysis

Trace analysis provides insights into workload properties and IO patterns, which are essential for storage system tuning and optimization. This chapter discusses how the workload interacts with system components, algorithms, structures, and applications.

Interactions with Components

As discussed in Chapter 1, different storage devices may have very different properties. In addition, their internal structures and algorithms also have a significant impact on the final performance. For example, the write cache of an HDD can benefit from data locality:

Write cache hits can avoid some disk mechanical writes; instead, the dirty blocks in DRAM cache are overwritten. It is a benefit of temporal locality.
Larger cache space means longer write queue: physically contiguous dirty blocks can be grouped into a single IO operation. It is a benefit of spatial locality.
An advanced replacement policy efficiently places cold data onto the disk while keeping hot data in the cache by exploiting both spatial and temporal locality.
The cache can temporarily absorb a write burst and distribute the write load evenly over time to minimize the impact on concurrent IOs.
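
The first two benefits can be illustrated with a minimal sketch. It is a toy model, not the cache logic of any real drive: the class `WriteCache`, its one-block-per-entry layout, and the sample LBA sequence are all assumptions made for illustration. Repeated writes to the same LBA are absorbed in DRAM (temporal locality), and contiguous dirty LBAs are merged into one disk write at flush time (spatial locality).

```python
class WriteCache:
    """Toy write-back cache keyed by LBA (one block per entry, hypothetical)."""

    def __init__(self):
        self.dirty = {}          # LBA -> latest data for that block
        self.overwrite_hits = 0  # writes absorbed in DRAM without a disk write

    def write(self, lba, data):
        if lba in self.dirty:
            # Temporal locality: the dirty block is overwritten in place.
            self.overwrite_hits += 1
        self.dirty[lba] = data

    def flush(self):
        """Group contiguous dirty LBAs into (start_lba, block_count) disk writes."""
        runs = []
        for lba in sorted(self.dirty):
            if runs and runs[-1][0] + runs[-1][1] == lba:
                runs[-1][1] += 1      # Spatial locality: extend the current run.
            else:
                runs.append([lba, 1])  # Start a new run.
        self.dirty.clear()
        return [tuple(run) for run in runs]


cache = WriteCache()
for lba in [10, 11, 12, 11, 50, 10, 13]:
    cache.write(lba, b"x")
hits = cache.overwrite_hits
disk_writes = cache.flush()
print(hits, disk_writes)
```

In this sample, seven host writes shrink to a small number of disk operations; a larger cache lets more dirty blocks accumulate before a flush, which gives the merge step more to work with.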

For HDDs with MBC, the write cache can provide log access for write bursts, and thus give a better arrangement for grouped IO access. In this section, I mainly discuss the HDD and SSD factors that influence the performance.

HDD Factors

For HDD, the performance varies with respect to (wrt) the disk drive’s features (e.g., RPM, TPI/SPT, location [OD, MD or ID], head quality, servo control mechanism, cache structure/algorithm, queue length, and so on) and workload properties (e.g., sequence, request size, queue depth, and more).

First, look at the drive’s features. For example, the throughput at the OD side of an HDD can be double that at the ID side, as shown in Figure 4-1. A 10K RPM enterprise drive may be over two times faster than a 5400 RPM desktop drive. Higher-RPM drives generally have a quicker response time than lower-RPM ones.

Figure 4-1. Throughput difference in different HDD locations
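
The RPM remark above can be made concrete with a small calculation. On average the head waits half a revolution for the target sector, so the average rotational latency is 0.5 × 60/RPM seconds. The sketch below also gives a rough random-IOPS estimate of 1/(seek time + rotational latency); the seek times passed in are hypothetical placeholders, and transfer time and caching are ignored.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def random_iops_estimate(rpm, avg_seek_ms):
    """Rough random IOPS: one IO per (seek + rotational latency)."""
    return 1000.0 / (avg_seek_ms + avg_rotational_latency_ms(rpm))


# Seek times below are assumed values for illustration only.
for label, rpm, seek_ms in [("5400 RPM desktop", 5400, 9.0),
                            ("10K RPM enterprise", 10000, 4.0)]:
    print(f"{label}: rotational latency {avg_rotational_latency_ms(rpm):.2f} ms, "
          f"about {random_iops_estimate(rpm, seek_ms):.0f} random IOPS")
```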

Second, consider the workload properties. Figure 4-2 shows how IOPS changes wrt request size (0.5, 1, 2, ..., 2048KB) and queue depth (1, 2, 4, 8, 16, 32) with write cache enabled (WCE) or disabled (WCD). You can see that without the cache/buffer, WCD gives similar performance for different queue depths under the same request size. However, when the write cache is enabled, the performance at queue depth 1 differs significantly from that at queue depth 16. Figures 4-2 and 4-3 illustrate the performance difference under WCE and WCD, the latter wrt buffer size.

Figure 4-2. IOPS difference wrt queue depth and request size under WCE and WCD (random write via IOMeter)

Figure 4-3. Throughput difference wrt buffer size for WCE and WCD (sequential write via IOMeter)

SSD Factors

For SSD, the performance also varies wrt the disk drive’s features (e.g., die number, block size, parallel access, flash management algorithm (wear-leveling), address mapping policy, trim condition, cache structure/algorithm, queue length, IO driver interface, and more) and the workload properties (e.g., fragmentation, sequence/randomness, write update, read/write ratio, request size, queue depth, request intensity/throttling, etc.).

Unlike an HDD, a consumer-class NAND SSD may temporarily show artificially and unsustainably high performance during initial measurements. It may also display unacceptable performance in bad conditions. Thus the SSD must be brought into a proper condition before measurement in order to demonstrate sustained solid-state performance. The well-known starting point is a completely new SSD or a low-level formatted SSD (to wipe the contents and restore it to its original state). Run random writes for a while, with the duration depending on the SSD capacity, to put the SSD into a “used” state. When the performance settles at a sustainable rate, we have the true performance value. Figure 4-4 illustrates this phenomenon, where D1-D6 are MLC and D7-D8 are SLC.¹ Note that this situation has been alleviated since 2017.

Figure 4-4. SSD performance states [12]
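
Deciding when the performance has "settled" can be automated. The following is a minimal sketch, assuming per-round IOPS measurements and a sliding window: the window counts as steady when its min-to-max range and the drift of its least-squares trend line both stay within a fraction of the window average. The function names, the five-round window, and the 20%/10% thresholds are illustrative defaults chosen here, not values taken from this chapter; consult the test specification you follow for its normative criteria.

```python
def is_steady_state(window, max_range_frac=0.20, max_slope_frac=0.10):
    """Return True if a window of per-round IOPS values looks settled."""
    n = len(window)
    if n < 2:
        return False
    avg = sum(window) / n
    if avg <= 0:
        return False
    # Criterion 1: spread of the raw values relative to the average.
    if max(window) - min(window) > max_range_frac * avg:
        return False
    # Criterion 2: total drift of the least-squares line across the window.
    x_mean = (n - 1) / 2
    slope = (sum((x - x_mean) * (y - avg) for x, y in enumerate(window))
             / sum((x - x_mean) ** 2 for x in range(n)))
    return abs(slope) * (n - 1) <= max_slope_frac * avg


def first_steady_round(iops_per_round, window_size=5):
    """Return the 1-based round at which the trailing window is first steady."""
    for end in range(window_size, len(iops_per_round) + 1):
        if is_steady_state(iops_per_round[end - window_size:end]):
            return end
    return None


# Hypothetical measurements: a fresh drive that drops into its "used" state.
rounds = [90000, 60000, 30000, 12000, 10500, 10200, 10100, 9900, 10000, 10050]
print(first_steady_round(rounds))
```

Only the rounds inside the first steady window (and later ones) should be reported as sustained performance; the early, fresh-out-of-box rounds are discarded.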

NAND SSDs generally use a virtual address mapping scheme, whereby LBAs are mapped to PBAs for several reasons [12]. For instance, wear-leveling algorithms allocate updated data to new cell locations to promote evenly distributed wear on the memory cells and thus improve the memory cell life or endurance. As a result, the SSD must keep track of the LBA-PBA associations. Similar to HDD, sequential operation may also be faster than random access when the data in the physical location is less fragmented.
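
A minimal sketch of such a mapping layer is shown below. It is a toy model under strong assumptions: one LBA per page, page-granular erase (real NAND erases whole blocks), and a simplistic "pick the least-worn free page" policy. The class `ToyFTL` and its methods are hypothetical names, not any vendor's flash translation layer.

```python
class ToyFTL:
    """Toy LBA-to-PBA mapping with out-of-place updates and naive wear leveling."""

    def __init__(self, num_pages):
        self.l2p = {}                            # LBA -> PBA currently holding it
        self.free = list(range(num_pages))       # erased pages ready for writing
        self.invalid = set()                     # stale pages awaiting erase
        self.writes_per_page = [0] * num_pages   # wear counter per page

    def write(self, lba):
        if not self.free:
            raise RuntimeError("no free pages; call reclaim() first")
        # Wear leveling: choose the free page that has been written least.
        pba = min(self.free, key=lambda p: self.writes_per_page[p])
        self.free.remove(pba)
        old = self.l2p.get(lba)
        if old is not None:
            self.invalid.add(old)                # the old copy becomes stale
        self.l2p[lba] = pba
        self.writes_per_page[pba] += 1
        return pba

    def read(self, lba):
        return self.l2p[lba]

    def reclaim(self):
        """Erase stale pages and return them to the free pool."""
        self.free.extend(sorted(self.invalid))
        self.invalid.clear()


ftl = ToyFTL(num_pages=4)
placements = [ftl.write(7) for _ in range(3)]   # three updates of the same LBA
ftl.reclaim()
placements.append(ftl.write(7))                 # lands on the least-worn page
print(placements, ftl.writes_per_page)
```

Even though the host keeps rewriting a single LBA, the physical writes are spread across pages, which is why the device must maintain the LBA-PBA table.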

Similar to HDD, the queue depth (i.e., the number of outstanding IOs) has a strong impact on IOPS performance. Figure 4-5 illustrates the IOPS trends for four different models of SSDs under two applications: database and file server. You can see that the resulting IOPS differ greatly. In addition, SSD2 performs better than SSD3 in the database workload, but worse in the file server workload. This indicates that the internal architecture and algorithms of an SSD are sensitive to the application.

Figure 4-5. SSD IOPS vs. queue depth
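
The link between queue depth and IOPS follows Little's law: the average number of outstanding IOs equals IOPS times mean latency. The sketch below solves this for IOPS. The (queue depth, latency) pairs are hypothetical values, chosen only to show that IOPS stops growing once latency rises in proportion to queue depth.

```python
def sustained_iops(queue_depth, mean_latency_s):
    """Little's law: outstanding IOs = IOPS x mean latency, solved for IOPS."""
    if mean_latency_s <= 0:
        raise ValueError("mean latency must be positive")
    return queue_depth / mean_latency_s


# Hypothetical (queue depth, mean latency in seconds) pairs.
measurements = [(1, 0.0001), (8, 0.0002), (32, 0.0008)]
for qd, latency_s in measurements:
    print(f"QD={qd:2d}  latency={latency_s * 1e3:.1f} ms  "
          f"IOPS~{sustained_iops(qd, latency_s):,.0f}")
```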

Block alignment is also a performance issue. When blocks are aligned with the NAND flash memory cell boundaries, they are stored more efficiently in an SSD. For instance, an 8KB block will fit precisely in an 8KB NAND page. All else being equal, more small-block IOs can be completed in a given period of time than large-block IOs, although the amount of data might be the same, such as 64 IOs of 8KB data transfer length vs. 4 IOs of 128KB data transfer length. In any case, the minimum granularity of access to NAND flash depends on the design of the underlying NAND flash. Figure 4-6 shows the throughput under sequential requests with different sizes. You can see that when the size is less than 32KB, the transfer speed is significantly influenced by the size. However, when the size is larger than 128KB, the throughput is relatively stable.

Figure 4-6. SSD sequential throughput vs. request size
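
The alignment effect can be checked with simple arithmetic. The sketch below counts how many NAND pages a request touches, assuming the 8KB page size used in the example above; the function names are hypothetical. A misaligned 8KB request straddles two pages, so the device must handle twice as many pages for the same payload.

```python
PAGE_BYTES = 8 * 1024  # assumed NAND page size, matching the 8KB example


def is_aligned(offset_bytes, length_bytes, page_bytes=PAGE_BYTES):
    """True if the request starts and ends on page boundaries."""
    return offset_bytes % page_bytes == 0 and length_bytes % page_bytes == 0


def pages_touched(offset_bytes, length_bytes, page_bytes=PAGE_BYTES):
    """Number of pages covered by the byte range [offset, offset + length)."""
    first = offset_bytes // page_bytes
    last = (offset_bytes + length_bytes - 1) // page_bytes
    return last - first + 1


for offset in (0, 4096):
    print(f"8KB request at offset {offset}: aligned={is_aligned(offset, 8192)}, "
          f"pages touched={pages_touched(offset, 8192)}")
```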

The read/write ratio has a larger impact on SSDs than on CMR HDDs. First, a “new” write generally takes more time than a read, because the write operation involves more steps than the read operation. Second, for an in-place update, an “erase” access is required. Therefore, the number of write steps depends on how full the drive is and whether the SSD controller must erase the target cell (or even relocate some data by performing a more time-costly RMW access) before writing the new data.

Although the performance ratio of sequential to random access is not as high as for HDDs, sequentiality is still important because it helps minimize erase operations by grouping write requests by block, which improves both lifetime and IO performance by reducing the number of erasures.

In sum, there are many major differences between SSD and HDD, besides those listed in the summary table in Chapter 1. Here, I further extend it to general NVM:

Access location: It matters significantly for HDD due to positioning time, whereas it generally does not determine latency in NVM, although the access order matters.
Access size: For HDD, large and sequential requests are significantly faster than small and random requests. Size has less impact on NVM. In fact, a larger IO may pay an additional cost due to internal structures, although sequential access is still generally faster than random access in NVM.
Access type: HDD is usually either block- or file-based. Some object-based devices still use internal mapping between block and object. However, some NVMs can be byte-level. The object-level mapping is also more native than that of HDD. Read and write performances are likely to be different in many NVMs.
Content: Some techniques, such as compression and deduplication, are content-dependent. They are usually not worthwhile for HDD, because the additional computational and IO resource usage may degrade HDD performance considerably and outweigh the space saving. For NVM, however, these techniques can reduce the cost and improve the storage efficiency.
Timing: HDD usually caps at 300 IOPS, while some NVM devices may be 100 to 10000 times faster. The cache scheduler design therefore differs considerably.

There are many testing and benchmarking tools with different measurement conditions, and the results from these different tools can be confusing. To standardize testing, SNIA developed the SSSI Reference Test Platform (RTP) and the Performance Test Specification (PTS).²

Interactions with Algorithms

The algorithms and policies used in hybrid storage systems actually determine the performance of the overall storage system when the hardware is fixed. In this section, the most important algorithms, such as data allocation, hot data identification, data migration, and scheduling algorithms, are surveyed. For ease of presentation, I list some main factors considered in these algorithms in Tables 4-1 and 4-2, where access frequency and interval are the two most important factors in hot data identification and data migration algorithms.

Table 4-1

Two Most Important Factors

Items	Description	Typical Algorithms
Access frequency (R/W) (F); Access interval (T)	The number of accesses within a given time period. Due to the different performance in R/W, we may also consider them separately. Some argue that the least recently used data may have a higher probability of being re-accessed in the near future; some deny it; an accepted compromise is that it depends on the IO pattern/workload.	LFU (least frequently used) [48], GDSF (Greedy-Dual Size Frequency) [49]; LRU (least recently used), MRU (most recently used), LFUDA (LFU with dynamic aging) [48], LRU-K (least recently used k) [50], GDS [51]
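
To see how the two factors lead to different decisions, the following sketch implements textbook LRU (interval-based) and LFU (frequency-based) eviction and replays the same short trace through both. It is a simplified illustration: the class names, the `hit_ratio` helper, and the trace are made up here, and the LFU variant breaks ties by insertion order and forgets counts on eviction.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the block whose last access is oldest (access interval)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # oldest access first, newest last

    def access(self, block):
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)       # refresh recency
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
            self.blocks[block] = True
        return hit


class LFUCache:
    """Evicts the block with the smallest access count (access frequency)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}              # block -> access count while cached

    def access(self, block):
        hit = block in self.counts
        if not hit and len(self.counts) >= self.capacity:
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]              # evict least frequently used
        self.counts[block] = self.counts.get(block, 0) + 1
        return hit


def hit_ratio(cache, trace):
    """Replay a trace of block ids and return the fraction of hits."""
    return sum(cache.access(block) for block in trace) / len(trace)


trace = [1, 1, 1, 2, 3, 1, 4, 1]   # block 1 is hot; 2, 3, 4 are one-off accesses
print("LRU:", hit_ratio(LRUCache(2), trace))
print("LFU:", hit_ratio(LFUCache(2), trace))
```

On this trace the frequency-based policy keeps the hot block through the run of one-off accesses while the interval-based policy evicts it once; other traces favor LRU instead, which is the workload dependence noted in the table.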

Table 4-2

Other Performance Factors

Items	Description
Data size	Generally, only hot data of small size needs to move to a higher tier. How small depends on the read/write speed ratio of SSD and HDD, the migration speed between them, etc.
Cache total/remaining size; Device total/remaining bandwidth	The cache size decides how much hot data can be stored in the cache. Hence it decides the threshold of hot degree. The bandwidth decides whether migration is appropriate at the current time. An approximated function may be built to predict the remaining bandwidth with respect to R/W ratio, IO intensity, etc.
R/W ratio	Since the R/W access time and pattern are different, this ratio gives different performance (e.g., the write amplification).
R/W granularity and IO intensity	The value represents the data amount ratio relating to an R/W IO to a fixed size data block. Average R/W granularity is the average ratio of all the IOs in a predefined time interval. Commonly, the larger the value, the more important the data is to users.
Data correlation	One piece of data may be related to another: the IO operations on one data block show certain characteristics in a predefined period of time, and another block may show similar properties, hence they are associated. This value can be used for IO prediction.
IO range/amount/distribution	IO distribution represents the statistical accessing information, such as the accessing address range and the accessing frequency in a given accessing period.
Grain size	The minimum size for each page/block to be replaced/migrated
Tier contrast/compensation (device value)	It values the difference between two different storage tiers/caches for direct data migration, including device status, accessing speed, etc.
Others	Data loss/error, etc.
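
Several of these factors can be combined into a single hotness score. The sketch below is one hypothetical way to do so, not an algorithm taken from this chapter or its references: each access adds 1 to a block's score, the score halves every `half_life_s` seconds (so both frequency and interval count), and requests larger than `max_request_bytes` are ignored on the assumption that large requests are served adequately by the lower tier. The class name, parameters, and default values are all assumptions.

```python
class HotnessTracker:
    """Toy hot-data identifier: exponentially decayed access count per block."""

    def __init__(self, half_life_s=60.0, max_request_bytes=64 * 1024):
        self.half_life_s = half_life_s
        self.max_request_bytes = max_request_bytes
        self.state = {}   # block id -> (score at last access, last access time)

    def score(self, block, now_s):
        """Current score: the stored score decayed by the elapsed interval."""
        if block not in self.state:
            return 0.0
        stored, last_s = self.state[block]
        return stored * 0.5 ** ((now_s - last_s) / self.half_life_s)

    def record(self, block, now_s, size_bytes):
        """Account for one access; large requests do not add hotness."""
        if size_bytes > self.max_request_bytes:
            return
        self.state[block] = (self.score(block, now_s) + 1.0, now_s)

    def hot_blocks(self, now_s, threshold):
        """Blocks whose decayed score is at or above the threshold."""
        return sorted(b for b in self.state if self.score(b, now_s) >= threshold)


tracker = HotnessTracker(half_life_s=10.0)
tracker.record("a", now_s=0, size_bytes=4096)
tracker.record("a", now_s=10, size_bytes=4096)
tracker.record("b", now_s=0, size_bytes=1 << 20)   # large request, ignored
print(tracker.hot_blocks(now_s=10, threshold=1.0))
```

In a real system the threshold would be tied to the remaining cache size and device bandwidth listed in the table, so that migration is triggered only when the upper tier can absorb it.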

Interactions with Structure

The fundamental structure of the storage device or system also has a large impact on system performance. For example, RAID- and EC-based systems provide data protection. However, they increase the internal IO burden on the disks due to the additional parity data. In particular, during system recovery from a critical disk failure, the internal workload consumes a large portion of the disk bandwidth, and therefore the overall system performance seen by external users is significantly degraded. Chapter 7 will analyze the impact of the RAID structure on the IO pattern.
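
The extra internal IO can be seen in a minimal sketch of single XOR parity, as used in RAID-5-style layouts (erasure codes in general use more elaborate arithmetic, which is not modeled here). The helper `xor_blocks` and the sample bytes are illustrative. Updating one data block requires reading the old data and old parity and writing the new data and new parity, and rebuilding a lost block requires reading every surviving block in the stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe with three data blocks (sample contents) and one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Small write to data[1]: 2 reads (old data, old parity) + 2 writes (new data, new parity).
new_block = b"\xaa\xbb"
new_parity = xor_blocks([parity, data[1], new_block])

# Recovery of a failed disk: read all surviving blocks plus parity and XOR them.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1], new_parity.hex())
```

The recovery step touches every remaining disk for each stripe, which is the internal workload that competes with user IO during a rebuild.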

For a hybrid storage system, although it has the potential to improve the performance of hot data, the internal data migration may also occupy some additional resources. Improper IO scheduling and data migration algorithms will lower the overall performance. In addition, the so-called cache structure and tiering structure may differ greatly in data allocation and IO scheduling, which leads to different performance under different scenarios. Chapter 6 will use a small-scale hybrid device as an example. Furthermore, the interconnection structure, such as bus and bridge, could also be the performance bottleneck in some cases.

Interactions with Applications

As discussed in Chapter 2, the metrics of different applications may differ greatly [35]. Table 4-3 provides a simple comparison of typical requirements among some common applications.³ Because the requirements vary significantly from one application to another, they impose different demands on the storage systems. Chapter 8 will illustrate the IO pattern of a Hadoop system with HDFS for big data applications, while Chapter 9 will discuss one of the most popular distributed storage systems, Ceph.

Table 4-3

Typical Requirements for Some Applications

	HPC data storage	Cloud Storage	HDFS/Mixed	Archive/ Backup	HPC check pointing data	Database
Attribute	Near term data storage: Seq., high-TP WORM operations	Traditionally batch IO seq. read/write	Traditionally batch IO seq. read/write	Write once, read infrequently	Checkpoint operations: Bursty, high-TP operations	Transactional: Small IO read, modify write
Latency	Similar demands of cloud storage. Generally Ethernet, sometimes IB.	Between 10 and 100ms	Between 10 and 100ms	High latency expected (≥10s); Amazon glacier at 3-5 hours	1-2 ms; up to 45us for >1kb data transfers	Faster the better, 0-10ms but can be 90ms before issues
IOPS/Tput	Many different offerings. Vendor-specific storage specs vary.	100s to 10k IOPS depending on size of instance	Disk perf. dependent, little observable overhead to impede HW perf.	LTO4 tape is 120MB/s with 22s latency	10K+ IOPS per 4U unit System performance usually;	Usually 1K-30K IOPS, up to 1-10M+ level; implementation is platform-specific;
Other	Usually for scientific data analysis, not super computer checkpoint storage.	High availability (four to five 9s); high data durability (nine to eleven 9s)	In-place data analysis capability	High data durability is expected with infrequent data access	Communicates over dedicated IB; 84.8TB per shelf	Supports ACID semantics