Benchmark tools provide "standard" performance indexes for storage systems with specific requirements. This chapter shows how to identify the access patterns behind benchmark results. The first tool is SPC-1C from the Storage Performance Council (SPC); after capturing its pattern, I developed a synthetic emulator to match the real traces. The second tool is PCMark from Futuremark; I illustrate how to use gain-loss analysis to improve cache algorithm efficiency.
- Synthetic trace.
The user can specify test scenarios by queue depth, request size, transfer rate, sequentiality, etc. This type is good for probing corner-case behavior.
Examples: IOMeter, VDBench, fio, IOzone, iorate, sqlio, diskspd
- Application-based trace.
The user can choose specific applications with predefined workload patterns. This type is good for illustrating real-world cases.
Examples: SysMark, PCMark, SPC (Storage Performance Council, e.g., SPC-1, SPC-2), TPC (Transaction Processing Performance Council, e.g., TPC-A/B/C/D/H/W), SPEC (Standard Performance Evaluation Corporation, e.g., HPC/SFS/JVM/SDM), Jetstress, COSbench
- Real trace.
The user can input the real-world trace directly. It is useful when the user attempts to test similar applications in different systems.
Example: AnandTech Storage Bench
A Comparison of Some Common Benchmark Tools
Tool | Block | File | POSIX | OS | S/C | Open | Latest
---|---|---|---|---|---|---|---
Iometer | Y | - | - | WLU | S/C | Y | 1.1.0/2014
IOzone | Y | Y | Y | WLU | S | Y | 2006
Bonnie++ | - | Y | - | LU | S | Y (GPL2) | 1.0.1
dbench | Y | Y | Y | LU | S/C | Y (GNU) | 2008
Filebench | Y | Y | Y | LU | S | Y | 2011
For large-scale systems, the traditional tools may be insufficient (e.g., lacking measurement metrics) or inconvenient (e.g., no integrated user interface). Therefore, some dedicated tools have been proposed, such as HiBench, Berkeley BDB, BigDataBench, and BigBench for big data benchmarks, in particular for Hadoop/Spark systems.
Five Steps for General Benchmarks
Steps | Remarks |
---|---|
Choose the proper configuration. | Select the proper type of benchmark, such as macro-benchmark (overall test of the full system), micro-benchmark (a few operations to check partial changes), or trace-based. |
Choose the correct environment. | Keep hardware and software settings consistent, and account for factors such as cache status, HDD zoned constant angular velocity, file system aging, nonessential processes, etc. |
Run the benchmark. | Identical runs for repeatability, multiple rounds for accuracy (small standard deviation or high confidence level), sufficient running time to reach steady state, automatic runs via suitable scripts. |
Present the results. | Statistical results with a small confidence interval and a near-normal distribution. |
Validate the results. | Reproduce/confirm the results or compare them with similar ones. |
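To make the "multiple rounds" and "small confidence interval" points concrete, here is a minimal Python sketch, not tied to any specific tool, that aggregates repeated benchmark runs; the sample values are hypothetical:

```python
import statistics

def summarize(samples: list[float], z: float = 1.96) -> dict:
    """Mean, standard deviation, and ~95% confidence interval of repeated runs."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)          # needs at least two rounds
    half = z * std / len(samples) ** 0.5
    return {"mean": mean, "std": std, "ci95": (mean - half, mean + half)}

# Hypothetical IOPS samples from five identical rounds of some benchmark:
print(summarize([118.2, 121.5, 119.8, 120.4, 122.1]))
# Rerun more rounds until the interval is small relative to the mean.
```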
SPC-1C
The Storage Performance Council (SPC) provides several benchmarking tools under different levels. SPC-1C [53] is designed to be vendor/platform independent and is applicable for a wide range of storage component products, such as disk drives, host bus adapters (HBAs), intelligent enclosures, and storage software such as logical volume managers.
Workload Properties
Two important concepts related to workload intensity are the BSU (business scaling unit) and the ASU (application storage unit). Each BSU is composed of three ASUs: ASU1, a data store with a weight of 45%; ASU2, a user store with 45%; and ASU3, a log with 10%, corresponding in total to five IOPS per ASU.
Decomposition of SPC-1C Workload
ASU | 1-1 | 1-2 | 1-3 | 1-4 | 2-1 | 2-2 | 2-3 | 3-1 |
---|---|---|---|---|---|---|---|---|
Intensity | 0.035 | 0.281 | 0.07 | 0.21 | 0.018 | 0.07 | 0.035 | 0.281 |
R/W (read ratio) | 0.5 | 0.5 | 1 | 0.5 | 0.3 | 0.3 | 1 | 0 |
Random | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Pattern | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
Seq. | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
Ratio | 1-1 | 1-2 | 1-3 | 1-4 | 2-1 | 2-2 | 2-3 | 3-1 |
R rand. | 0.0175 | 0 | 0 | 0 | 0.0054 | 0 | 0 | 0 |
R seq. | 0 | 0 | 0.07 | 0 | 0 | 0 | 0.035 | 0 |
R pattern | 0 | 0.1405 | 0 | 0.105 | 0 | 0.021 | 0 | 0 |
W rand. | 0.0175 | 0 | 0 | 0 | 0.0126 | 0 | 0 | 0 |
W seq. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.281 |
W pattern | 0 | 0.1405 | 0 | 0.105 | 0 | 0.049 | 0 | 0 |
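The "Ratio" rows are simply element-wise products of the intensity row, the read/write split, and the mode indicators; the following Python sketch reproduces them from the table values:

```python
# Per-stream parameters from the decomposition table (streams 1-1 ... 3-1).
intensity = [0.035, 0.281, 0.07, 0.21, 0.018, 0.07, 0.035, 0.281]
read_frac = [0.5, 0.5, 1, 0.5, 0.3, 0.3, 1, 0]   # "R/W" row = read fraction
random_m  = [1, 0, 0, 0, 1, 0, 0, 0]
pattern_m = [0, 1, 0, 1, 0, 1, 0, 0]
seq_m     = [0, 0, 1, 0, 0, 0, 1, 1]

def row(mode, read=True):
    """Intensity x read (or write) fraction x mode indicator, per stream."""
    return [round(i * (r if read else 1 - r) * m, 4)
            for i, r, m in zip(intensity, read_frac, mode)]

print("R rand.  ", row(random_m))               # -> 0.0175 at 1-1, 0.0054 at 2-1
print("R seq.   ", row(seq_m))                  # -> 0.07 at 1-3, 0.035 at 2-3
print("W pattern", row(pattern_m, read=False))  # -> 0.1405, 0.105, 0.049
```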
SPC-1C Workload Ratio
Workload ratio | ASU1 | ASU2 | ASU3 |
---|---|---|---|
ASU i/total | 0.596 | 0.123 | 0.281 |
Read rand. in ASU i | 0.02936 | 0.0439 | 0 |
Read seq. in ASU i | 0.11745 | 0.28455 | 0 |
Read pattern in ASU i | 0.41191 | 0.17073 | 0 |
Write rand. in ASU i | 0.02936 | 0.10244 | 0 |
Write seq. in ASU i | 0 | 0 | 1 |
Write pattern in ASU i | 0.41191 | 0.39837 | 0 |
Size Distribution
Size (blocks) | Probability | Size (blocks) | Probability
---|---|---|---
8 | 0.7684 | 64 | 0.03088
16 | 0.09264 | 128 | 0.03088
32 | 0.0772 | |
SPC-1C Random Walk Pattern
ASU | 1-1 | 1-2 | 1-3 | 1-4 | 2-1 | 2-2 | 2-3 | 3-1 | Sum |
---|---|---|---|---|---|---|---|---|---|
Read | 0 | 0.1405 | 0 | 0.105 | 0 | 0.021 | 0 | 0 | 0.2665 |
Write | 0 | 0.1405 | 0 | 0.105 | 0 | 0.049 | 0 | 0 | 0.2945 |
Basic IO Pattern Distribution
R/W | Ratio | Mode | Ratio
---|---|---|---
Read | 0.3944 | Random | 0.053
Write | 0.6056 | Sequential | 0.386
 | | Pattern | 0.561
Synthetic Trace
Figure 5-1. Inter-arrival time histogram
Figure 5-2. SPC-1C spatial distribution
Figure 5-3. SPC-1C trace temporal distribution
From Figure 5-1, you can see that the inter-arrival time approximately follows an exponential distribution. The workload is a mix of eight different types of IO streams from the three ASUs; some high-intensity streams are visible in the spatial distribution of Figure 5-2. Generally, the generated random R/W IO streams are consistent with the real workload, but the sequential ones are not very close under a light workload. One possible reason is that the real workload was not generated with exactly the same parameters as those in the random walk model. In fact, it is hard to align the temporal distribution well; however, the result in Figure 5-3 is fairly acceptable.
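To make the emulator idea concrete, here is a minimal Python sketch of such a generator. The IOPS rate, address-space size, and walk step are assumed values, and sequential streams are omitted for brevity; only the exponential inter-arrivals, the size distribution, and the random/pattern split follow the tables above:

```python
import random

SIZES = [8, 16, 32, 64, 128]                        # request sizes in blocks
PROBS = [0.7684, 0.09264, 0.0772, 0.03088, 0.03088]

def synth_trace(n, iops=500, space=2**20, walk_step=256, seed=1):
    """Yield (time_s, lba, size) tuples; iops, space, walk_step are assumed."""
    rng = random.Random(seed)
    t, walk = 0.0, space // 2
    for _ in range(n):
        t += rng.expovariate(iops)                  # exponential inter-arrivals
        size = rng.choices(SIZES, weights=PROBS)[0]
        if rng.random() < 0.561:                    # "pattern": bounded random walk
            walk = (walk + rng.choice([-1, 1]) * rng.randrange(1, walk_step)) % space
            lba = walk
        else:                                       # uniform random placement
            lba = rng.randrange(space)
        yield (t, lba, size)

for record in synth_trace(5):
    print(record)
```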
PCMark
Eight Applications in PCMark
Order | Eight Apps | Total CMD | Write | Read |
---|---|---|---|---|
1 | Windows Defender | 6755 | 300 | 6455 |
2 | Gaming | 11040 | 62 | 10978 |
3 | Importing Pictures | 2806 | 4 | 2802 |
4 | Vista Startup | 7418 | 1327 | 6091 |
5 | Video Editing | 8204 | 3711 | 4493 |
6 | Windows Media Center | 5011 | 3309 | 1702 |
7 | Adding Music | 3336 | 1506 | 1830 |
8 | Application Loading | 11155 | 2660 | 8495 |
Figure: Eight applications in PCMark
Figure: Cache prefetch
Post hole: The first LBA of the incoming command has a distance (> 0) within a threshold to the last LBA of the queued commands in the cache.
Pre hole: The last LBA of the incoming command has a distance (> 0) within a threshold to the first LBA of the queued commands in the cache.
Pre and post hole: Both post and pre holes exist.
Figure: Hole types
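A minimal sketch of the hole classification above, assuming each cached region is tracked as a (first_lba, last_lba) pair and `threshold` is the configured bound:

```python
def classify_hole(cmd_first, cmd_last, regions, threshold):
    """Return the hole types between an incoming command and cached regions."""
    holes = set()
    for first, last in regions:
        post_gap = cmd_first - last   # command starts just after a cached region
        pre_gap = first - cmd_last    # command ends just before a cached region
        if 0 < post_gap <= threshold:
            holes.add("post")
        if 0 < pre_gap <= threshold:
            holes.add("pre")
    return holes or {"none"}

# Command reading LBA 100-110 against a region ending at 96 and one starting at 118.
print(classify_hole(100, 110, [(80, 96), (118, 130)], threshold=16))
# -> {'post', 'pre'}  (a pre-and-post hole)
```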
Constant: If the gap between two nearby regions in the cache is smaller than a constant Hs blocks, this gap is viewed as a hole.
Adaptive: The hole size is related to the DRAM and SSD cache (most likely monotone increasing in the total/remaining cache size), for example in the linear form Hs = aT × ST + aR × SR, where ST and SR are the total and remaining cache size, respectively, and aT and aR are coefficients for ST and SR. An example for 64MB of DRAM cache fixes specific values of aT and aR; the actual aT and aR are decided by the trace, such as the median size of requests within a time window. More specifically, consider whether the hole is in the same track/cylinder.
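Under the linear form above, the adaptive bound is a one-line computation; the coefficient values below are assumptions for illustration only:

```python
def adaptive_hole_size(total_blocks, remaining_blocks, a_t=0.001, a_r=0.002):
    """Linear adaptive bound Hs = aT*ST + aR*SR; coefficients are assumed."""
    return a_t * total_blocks + a_r * remaining_blocks

# A 64MB DRAM cache is 131072 blocks of 512B; assume half is still free.
print(adaptive_hole_size(131072, 65536))   # -> 262.144 blocks
```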
A background task should be implemented to monitor the LBA regions in the cache and the hit density on these regions. This task also looks for gaps between regions of data with some level of hit density and generates self-induced read commands to fill these holes with data from the main store.
If two regions within a certain range and time frame have a high enough correlation, the gap between them is likely to be accessed.
Prefetch should not affect the overall performance much, so the benefit gained from the additional cache hit ratio should be larger than the additional cost of prefetching.
The upper bound of the hole size to be fetched is decided by multiple factors, such as the total/remaining cache size, workload size, cache type, access time, etc.
What is the benefit of a hole filling policy? The key is the increase in cache hit ratio (hit ratio with hole filling vs. hit ratio without).
What is the additional cost of a hole filling policy? The key is how many self-induced commands are generated to fill the holes. The timing of the user workload must also be considered.
Since hole filling and the prefetch policy are similar, is it possible to merge hole filling into the prefetch policy (integration)? The key is the overlap in cache hits between the two policies; if the overlap rate is high, prefetch may subsume hole filling.
When and where should the two policies (or the integrated policy) be applied with balanced benefit and cost? The key is to reduce the additional mechanical access cost and cache pollution; see the monitor sketch below.
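Tying these points together, here is a sketch of the background monitor, assuming cached data is tracked as sorted (first_lba, last_lba, hits) extents; it emits self-induced reads only for small gaps between sufficiently hot regions:

```python
def self_induced_reads(regions, max_hole, min_hits):
    """regions: sorted (first_lba, last_lba, hits). Yield (lba, size) fills."""
    for (f1, l1, h1), (f2, l2, h2) in zip(regions, regions[1:]):
        gap = f2 - l1 - 1
        # Fill only small gaps between two "hot" regions: both sides must show
        # enough hit density, so the gap itself is likely to be accessed soon.
        if 0 < gap <= max_hole and min(h1, h2) >= min_hits:
            yield (l1 + 1, gap)   # read command fetching the hole from main store

regions = [(0, 63, 5), (80, 200, 7), (1000, 1100, 1)]
print(list(self_induced_reads(regions, max_hole=32, min_hits=2)))
# -> [(64, 16)]  only the gap between the two hot regions is filled
```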
Workload Properties
Figure: PCMark think time distribution
Figure: PCMark completion time
PCMark: Think/Completion Time (ms) and IOPS
App | IOPS | Think-time mean (ms) | Std. | Completion-time mean (ms) | Std. |
---|---|---|---|---|---|
1 | 120.57 | 8.29 | 46.04 | 3.43 | 5.49 |
2 | 189.29 | 5.28 | 9.20 | 10.24 | 47.4 |
3 | 48.70 | 20.54 | 21.25 | 1.99 | 3.44 |
4 | 311.30 | 3.21 | 8.25 | 3.59 | 5.84 |
5 | 122.47 | 8.17 | 16.68 | 0.76 | 2.64 |
6 | 59.56 | 16.79 | 22.58 | 1.2 | 3.22 |
7 | 145.65 | 6.87 | 14.87 | 2.78 | 5.86 |
8 | 204.29 | 4.90 | 16.37 | 16.41 | 29.87 |
Figure: PCMark size distribution
PCMark: Size Distribution (Blocks)
App | Mean (all) | Std. | Mean (read) | Std. | Mean (write) | Std. |
---|---|---|---|---|---|---|
1 | 98.67 | 87.55 | 102.78 | 87.4 | 10.22 | 7.3 |
2 | 86.92 | 56.39 | 87.36 | 56.24 | 8 | 0 |
3 | 175.09 | 82.28 | 175.33 | 82.1 | 8 | 0 |
4 | 71.92 | 54.86 | 83.12 | 49.32 | 20.48 | 49.35 |
5 | 60.77 | 63.68 | 59.18 | 25.42 | 62.69 | 90.43 |
6 | 173.36 | 115.82 | 255.16 | 13.43 | 131.29 | 122.52 |
7 | 43.01 | 61.43 | 60.39 | 70.63 | 21.89 | 38.54 |
8 | 36.23 | 37.58 | 41.31 | 37.47 | 20.04 | 33.11 |
Based on sequential stream detection, you can observe some obvious sequential streams in general. Importing Pictures is the most sequential, while Application Loading is the least. Except for Application Loading (mixed streams), a queue length of 10 is good enough to detect most streams; Application Loading has relatively more interleaved sequential streams. Also note that for read-only cases, most streams have only two commands. In Apps 2, 3, and 4, the average stream length is longer than in the others, which provides more chances for the infinity read mode.
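A minimal sketch of such stream detection over a sliding queue of recent commands (queue length 10, per the observation above); a command extends a tracked stream when it starts exactly where a queued command ended:

```python
from collections import deque

def detect_streams(cmds, qlen=10):
    """cmds: (lba, size) in arrival order; returns lengths of detected streams."""
    recent = deque(maxlen=qlen)              # entries: [next_expected_lba, length]
    for lba, size in cmds:
        for entry in recent:
            if lba == entry[0]:              # command continues a tracked stream
                entry[0], entry[1] = lba + size, entry[1] + 1
                break
        else:
            recent.append([lba + size, 1])   # new candidate stream
    return [length for _, length in recent if length > 1]

cmds = [(0, 8), (500, 16), (8, 8), (16, 8), (516, 16), (900, 8)]
print(detect_streams(cmds))  # -> [3, 2]: one 3-command and one 2-command stream
```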
Gain-Loss Analysis
For these policies, the gain comes from the increased hit ratio due to the cache policy (e.g., hole filling, prefetch), while the loss comes from the additional disk accesses caused by the policy (e.g., pre-/post-reads attached to user commands, and self-induced non-attached commands).
Two methods can be used to analyze the gain and loss. One simple method considers frequency only, counting the hit frequency and the additional disk access frequency. The cost of attached commands is generally smaller than that of self-induced, non-attached commands, so the same hit increase is worth more when it comes from attached accesses. This method provides a rough estimation. The other method is a mixed simulation-and-modeling approach: it uses simulation to get the hit ratio and additional command frequency, and an analytical model to obtain the average response time, which is relatively complex but quantitative. Let's consider only the first one.
Gain: The additional cache hit ratio obtained (a2) = the cache hit ratio at hole length x (a3) - the cache hit ratio without hole filling (a0)
Loss: The ratio of additional self-induced command occurrences to total read commands (a1)
- Gain-loss ratio a = a2/a1: the higher, the better. To estimate a, there are a few basic considerations:
The queue depth (Qd) in the cache, which has a significant influence on the LRU algorithm. By observing the largest a vs. Qd, you can find the optimal Qd; the optimal Qd may vary across applications.
The time interval (dT) between commands: only if the interval is larger than a certain value does the system have a chance to handle self-induced commands, so find the ratio a for different dT.
The hit frequency of each hole: when using a correlation method to select the hole to fill, the hit frequency must be higher than 1. If the hit frequency is generally larger than 1, this method is meaningful; otherwise, it is meaningless.
1. Check a1, a2, and a for all read commands (write commands also fill the cache).
2. If a1 is much larger than a2 (e.g., a < 0.2), hole filling is not economic; check whether a2 is simply too small to be worthwhile. Otherwise, hole filling is useful; check the selective self-induced commands based on the workload.
3. If the time interval between two commands is large enough (e.g., > 15ms), the self-induced command may not cost additional resources. If so, any increased cache hit ratio is beneficial. Repeat Steps 1-2.
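In code, Steps 1-2 reduce to a few counters taken from a cache simulation; here is a sketch with made-up counter values, chosen so the result matches App 2 at Qd = 128, fs = 256 in the table below (a ≈ 0.76):

```python
def gain_loss(hits_with, hits_without, self_induced, total_reads, floor=0.2):
    """Compute a2, a1, and a = a2/a1 from simulation counters (Steps 1-2)."""
    a3 = hits_with / total_reads        # hit ratio with hole filling
    a0 = hits_without / total_reads     # hit ratio without hole filling
    a2 = a3 - a0                        # gain: additional hit ratio
    a1 = self_induced / total_reads     # loss: self-induced command ratio
    a = a2 / a1 if a1 else float("inf")
    return a2, a1, a, ("useful" if a >= floor else "not economic")

print(gain_loss(hits_with=344, hits_without=300, self_induced=58,
                total_reads=1000))
# -> (0.044, 0.058, ~0.76, 'useful')
```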
Gain-Loss Analysis for Hole Filling
Qd = 128; fs = 128

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a2 | 0.026 | 0.024 | 0.007 | 0.001 | 0.049 | 0.000 | 0.051 | 0.079 |
a1 | 0.328 | 0.060 | 0.041 | 0.183 | 0.099 | 0.020 | 0.293 | 0.192 |
a | 0.078 | 0.399 | 0.165 | 0.007 | 0.494 | 0.000 | 0.175 | 0.413 |
Qd = 128; fs = 256

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a2 | 0.045 | 0.044 | 0.017 | 0.003 | 0.061 | 0.000 | 0.109 | 0.118 |
a1 | 0.362 | 0.058 | 0.028 | 0.225 | 0.106 | 0.026 | 0.265 | 0.235 |
a | 0.124 | 0.760 | 0.620 | 0.015 | 0.580 | 0.000 | 0.412 | 0.505 |
Qd = 256; fs = 128

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a2 | 0.029 | 0.034 | 0.007 | 0.003 | 0.048 | 0.000 | 0.051 | 0.102 |
a1 | 0.338 | 0.065 | 0.042 | 0.183 | 0.099 | 0.031 | 0.295 | 0.206 |
a | 0.084 | 0.520 | 0.160 | 0.014 | 0.490 | 0.000 | 0.174 | 0.493 |
Qd = 256; fs = 256

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a2 | 0.048 | 0.057 | 0.017 | 0.006 | 0.062 | 0.000 | 0.110 | 0.154 |
a1 | 0.375 | 0.061 | 0.030 | 0.225 | 0.104 | 0.042 | 0.282 | 0.242 |
a | 0.128 | 0.930 | 0.590 | 0.027 | 0.593 | 0.000 | 0.390 | 0.637 |
Gain-Loss Analysis for Hole Filling (Constrained)
Qd = 128; fs = 128

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a2 | 0.015 | 0.023 | 0.006 | 0.001 | 0.023 | 0.000 | 0.031 | 0.058 |
a1 | 0.205 | 0.043 | 0.016 | 0.025 | 0.028 | 0.009 | 0.174 | 0.146 |
a | 0.073 | 0.548 | 0.364 | 0.026 | 0.811 | 0.000 | 0.176 | 0.396 |
Qd = 128; fs = 256

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a2 | 0.026 | 0.035 | 0.015 | 0.002 | 0.034 | 0.000 | 0.050 | 0.094 |
a1 | 0.252 | 0.045 | 0.006 | 0.041 | 0.037 | 0.014 | 0.196 | 0.184 |
a | 0.105 | 0.778 | 2.389 | 0.060 | 0.923 | 0.000 | 0.253 | 0.512 |
Qd = 256; fs = 128

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a2 | 0.018 | 0.033 | 0.006 | 0.001 | 0.023 | 0.000 | 0.031 | 0.080 |
a1 | 0.209 | 0.047 | 0.016 | 0.027 | 0.029 | 0.019 | 0.189 | 0.162 |
a | 0.088 | 0.697 | 0.364 | 0.025 | 0.773 | 0.000 | 0.162 | 0.495 |
Qd = 256; fs = 256

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a2 | 0.031 | 0.049 | 0.015 | 0.002 | 0.035 | 0.000 | 0.050 | 0.127 |
a1 | 0.259 | 0.048 | 0.006 | 0.042 | 0.039 | 0.025 | 0.211 | 0.196 |
a | 0.119 | 1.028 | 2.389 | 0.058 | 0.908 | 0.000 | 0.235 | 0.648 |
Prefetch: Includes post-read and pre-read. Post-read means an additional size is attached after the last LBA; pre-read means an additional size is attached before the first LBA. For example, assume the original command reads LBA 10-20 and the pre-read size is 8; the extended command then reads LBA 2-20.
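The pre-/post-read extension itself is a small range adjustment; a minimal sketch reproducing the LBA 10-20 example:

```python
def extend(first_lba, last_lba, pre=0, post=0, max_lba=None):
    """Attach pre-read before the first LBA and post-read after the last."""
    new_first = max(first_lba - pre, 0)       # clamp at the start of the device
    new_last = last_lba + post
    if max_lba is not None:
        new_last = min(new_last, max_lba)     # clamp at the end of the device
    return new_first, new_last

print(extend(10, 20, pre=8))    # -> (2, 20), as in the example above
print(extend(10, 20, post=8))   # -> (10, 28)
```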
Gain: The additional cache hit ratio obtained (b2) = the cache hit ratio at fetch length x (b3) - the cache hit ratio without prefetch (b0)
Loss: The ratio of prefetch command occurrences to total read commands (b1)
Gain-loss ratio b = b2/b1: the higher, the better
Gain-Loss Analysis for Prefetch Policy
Qd = 128; fs = 128

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
b2 | 0.179 | 0.140 | 0.063 | 0.119 | 0.462 | 0.000 | 0.220 | 0.224 |
b1 | 0.818 | 0.859 | 0.937 | 0.880 | 0.528 | 1.000 | 0.769 | 0.764 |
b | 0.219 | 0.163 | 0.067 | 0.136 | 0.875 | 0.000 | 0.286 | 0.294 |
Qd = 128; fs = 256

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
b2 | 0.307 | 0.417 | 0.197 | 0.383 | 0.607 | 0.000 | 0.398 | 0.321 |
b1 | 0.690 | 0.582 | 0.803 | 0.616 | 0.382 | 1.000 | 0.591 | 0.668 |
b | 0.444 | 0.717 | 0.245 | 0.621 | 1.589 | 0.000 | 0.674 | 0.480 |
Qd = 256; fs = 128

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
b2 | 0.183 | 0.147 | 0.063 | 0.120 | 0.461 | 0.000 | 0.220 | 0.239 |
b1 | 0.814 | 0.851 | 0.937 | 0.879 | 0.527 | 1.000 | 0.769 | 0.748 |
b | 0.225 | 0.173 | 0.067 | 0.137 | 0.875 | 0.000 | 0.286 | 0.319 |
Qd = 256; fs = 256

App | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
b2 | 0.310 | 0.427 | 0.197 | 0.384 | 0.607 | 0.000 | 0.398 | 0.342 |
b1 | 0.687 | 0.572 | 0.803 | 0.615 | 0.381 | 1.000 | 0.591 | 0.645 |
b | 0.452 | 0.746 | 0.245 | 0.625 | 1.593 | 0.000 | 0.674 | 0.530 |
You may also analyze the relationship between post-read and sequential streams by counting 1) the number of streams (c) and the ratio c1 = c / total read count; 2) each stream's hit frequency c2; and 3) the cache hit as b2 (considering only post-read; redefine b2). This test can help check whether the infinity sequential read takes effect; for example, only when a sequence is long enough does the cache enter infinity mode. In infinity mode, post-read is automatic. However, if the cache is not in infinity mode, you still need to consider the benefit of post-read after detecting a two-command sequential stream, such as the possibility of three or more commands attaching to the stream.
Gain: The additional cache hit ratio obtained (d2) = the cache hit ratio at fetch length x and hole length y (d3) - the cache hit ratio without prefetch and hole filling (d0)
Loss: The ratio of additional command occurrences to total read commands (d1)
Gain-loss ratio d = d2/d1: the higher, the better
Two queues, one for prefetch (Q1) and the other for hole filling (Q2), both under LRU. Consider the overlapped hit ratio, i.e., whether a command that hits in Q1 also hits in Q2 and vice versa.
Two queues, one for prefetch (Q1) and the other for hole filling + prefetch (Q3), both under LRU. Consider the additional hit ratio of Q3 over Q1.
Define x0 as the hit count without prefetch and hole filling, x̄1 = x1 + x0 as the hit count with prefetch, x̄2 = x2 + x0 as the hit count with hole filling, and x̄3 as the hit count with combined prefetch and hole filling. If x0 is much larger than x1 and x2, there is not much difference between prefetch and hole filling; otherwise, check the difference between x1 and x2. If x1 is much larger than x2, the benefit of prefetch is much larger than that of hole filling; otherwise, it is reasonable to do hole filling instead of prefetch. If x̄3 − x0 is much larger than x1, hole filling gains additional benefit on top of prefetch; otherwise, adding hole filling to prefetch is not beneficial.
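The same comparison logic in code form; the "much larger" test is taken here as a 5x factor, which is an assumption, not a value from the text:

```python
def compare_policies(x0, x1, x2, x3_bar, factor=5):
    """x0: hits without either policy; x1/x2: extra hits from prefetch / hole
    filling alone; x3_bar: total hits with the combined policy."""
    if x0 > factor * max(x1, x2):
        return "baseline dominates: neither policy changes much"
    if x1 > factor * x2:
        return "prefetch wins: hole filling adds little"
    if x3_bar - x0 > factor * x1:
        return "hole filling adds real benefit on top of prefetch"
    return "hole filling instead of prefetch is reasonable"

print(compare_policies(x0=300, x1=80, x2=5, x3_bar=390))
# -> "prefetch wins: hole filling adds little"
```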
Without post-read fetch, you can observe that x̄1 is generally higher than x̄2, and x1 and x0 are generally larger than x2. Except for Apps 5 and 6, x2 is generally smaller than 1% (Apps 4 and 8 are around 1% when queue length and fetch size are 256). This means that if post-read takes effect, the influence of hole filling is very small. The overlap between the two policies (post only) is much higher than x1 and x1 + x2; x0/(x1 + x2) is generally larger than 10 when the prefetch size is over 128 (varying prefetch size; fixed hole filling size of 256). Now the key question is whether it is worthwhile to do hole filling for the additional ~2% cache hit ratio (queue length = 128, fetch size = 256/128) at the cost of ~10% additional self-induced accesses.
Observe that b1 is generally larger than a1. However, the cost of each prefetch is generally smaller than that of a self-induced hole filling command, so this direct comparison may be unfair. An indirect method is to estimate the average service time for the prefetch policy and the hole filling policy (WCD): for prefetch, define the response time as write ratio × write access time + read ratio × (b3 × cache access time + (1 − b3) × read access time); for hole filling, define the response time as write ratio × write access time + read ratio × (a3 × cache access time + (1 − a3) × read access time). I omit the details here.
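A sketch of this indirect estimate; the access-time constants are assumed values (0.1 ms cache hit, 5 ms media read/write), not measured ones:

```python
def avg_response(write_ratio, hit_ratio, t_cache=0.1, t_read=5.0, t_write=5.0):
    """Average response time (ms) under a given read cache hit ratio."""
    read_ratio = 1 - write_ratio
    return write_ratio * t_write + read_ratio * (
        hit_ratio * t_cache + (1 - hit_ratio) * t_read)

# Compare prefetch (hit ratio b3) against hole filling (hit ratio a3).
print(avg_response(write_ratio=0.4, hit_ratio=0.45))  # prefetch, b3 = 0.45
print(avg_response(write_ratio=0.4, hit_ratio=0.30))  # hole filling, a3 = 0.30
```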
The benefit (increased hit frequency) to cost (additional self-induced commands) ratio of (post) hole filling is generally less than 1 when any hole within a certain distance is filled; the average is 0.47 (queue depth = 256; hole size = 256 blks).
The benefit-to-cost ratio of prefetch (post-read) is generally less than 1 when prefetching any non-hit command with a certain prefetch size; the average is 0.68 (queue depth = 256; prefetch size = 256 blks), which is better than the (post) hole filling policy, considering that some hole filling commands may need more resources than prefetch commands.
If the (post) hole filling policy is further applied on top of the prefetch policy, the additional benefit/cost ratio is generally less than 0.2 (queue depth = 256; prefetch size = 256 blks; hole size = 256 blks). Unless a well-designed hole filling policy (e.g., based on data correlation) is applied, it is not very useful for overall performance improvement.
Due to the similarity of hole filling and prefetch, the data-correlation methods may be applied to both hole filling and prefetch; thus the two policies might be merged.