
5. Case Study: Benchmarking Tools


Benchmark tools are useful for providing "standard" performance indexes for storage systems with specific requirements. This chapter shows how to identify the access patterns behind benchmark results. The first tool is SPC-1C from the Storage Performance Council (SPC); after capturing its pattern, I developed a synthetic emulator to match the real traces. The second tool is PCMark from Futuremark; I illustrate how to use gain-loss analysis to improve cache algorithm efficiency.

Storage performance benchmarks assess the relative performance of storage systems by running a number of standard tests and trials via a tool or a set of programs, with or without support from specific hardware equipment. Below are some benchmark tools that are often used for active trace collection:
  • Synthetic trace
    • The user can specify test scenarios for queue depth, request size, transfer rate, sequentiality, etc. It is good for determining corner-case behavior.

    • Examples: IOMeter, VDBench, fio, IOzone, iorate, sqlio, diskspd1.
  • Application-based trace
    • The user can choose specific applications with predefined workload patterns. It is good for illustrating real-world cases.

    • Examples: SysMark,2 PCMark, SPC (Storage Performance Council, e.g., SPC-1, SPC-2), TPC (Transaction Processing Performance Council, e.g., TPC-A/B/C/D/H/W),3 SPEC (Standard Performance Evaluation Corporation, e.g., HPC/SFS/JVM/SDM),4 Jetstress,5 COSbench6.
  • Real trace
    • The user can input a real-world trace directly. It is useful when the user attempts to test similar applications on different systems.

    • Example: AnandTech Storage Bench7.
Table 5-1 gives a simple comparison of some commonly used tools, where the letters W, L, and U in the OS column indicate Windows, Linux, and Unix, respectively.
Table 5-1

A Comparison of Some Common Benchmark Tools

 

Tool        Block   File   Posix   OS    S/C   Open       Latest
Iometer     Y       -      -       WLU   S/C   Y          1.1.0 / 2014
IOzone      Y       Y      Y       WLU   S     Y          2006
Bonnie++    Y       -      -       LU    S     Y (GPL2)   1.0.1
dbench      Y       Y      Y       LU    S/C   Y (GNU)    2008
Filebench   Y       Y      Y       LU    S     Y          2011

For large-scale systems, the traditional tools may be insufficient (e.g., lacking measurement metrics) or inconvenient (e.g., no integrated user interface). Therefore, some dedicated tools have been proposed, such as HiBench,8 Berkeley BDB,9 BigDataBench,10 and BigBench11 for big data benchmarks, in particular for Hadoop/Spark systems.

A general benchmark procedure is shown in Table 5-2 [52]. Note that other classification methods exist. For example, benchmarks can be divided into micro-benchmarks of low-level system operations (e.g., evaluating HDFS operations on a modern cluster), functional/component benchmarks of high-level functions (e.g., TeraSort, basic SQL), and application-level benchmarks (e.g., overall system performance for a given application scenario). For more details, refer to [41, 35].
Table 5-2

Five Steps for General Benchmarks

Step 1: Choose the proper configuration.
Remarks: Select the proper type of benchmark, such as macro-benchmark (an overall test of the full system), micro-benchmark (a few operations to check partial changes), or trace-based.

Step 2: Choose the correct environment.
Remarks: Ensure consistent hardware and software settings, and account for factors such as cache status, HDD zoned constant angular velocity, file system aging, nonessential processes, etc.

Step 3: Run the benchmark.
Remarks: Identical runs for repeatability, multiple rounds for accuracy with small standard deviations or a high confidence level, sufficient running time to reach steady state, and automatic runs via suitable scripts.

Step 4: Present the results.
Remarks: Statistical results with small confidence intervals and a near-normal distribution.

Step 5: Validate the results.
Remarks: Reproduce/confirm the results or compare them with other similar ones.
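For the "run the benchmark" and "present the results" steps, here is a minimal Python sketch (my own illustration, not part of any benchmark suite) of how repeated rounds can be summarized with a mean, standard deviation, and an approximate 95% confidence interval; the per-round IOPS values are hypothetical.

# A minimal sketch of the "run multiple rounds and report statistics" step:
# given per-round scores, report the mean, standard deviation, and an
# approximate 95% confidence interval (normal approximation).
import math
import statistics

def summarize_rounds(results):
    """results: list of per-round scores (e.g., IOPS). Returns (mean, std, ci)."""
    mean = statistics.mean(results)
    std = statistics.stdev(results)                      # sample standard deviation
    half_width = 1.96 * std / math.sqrt(len(results))    # normal approximation of 95% CI
    return mean, std, (mean - half_width, mean + half_width)

# Hypothetical example: five rounds of an IOPS benchmark.
rounds = [5120.0, 5085.0, 5210.0, 5150.0, 5098.0]
mean, std, ci = summarize_rounds(rounds)
print(f"mean={mean:.1f} IOPS, std={std:.1f}, 95% CI=({ci[0]:.1f}, {ci[1]:.1f})")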

SPC-1C

The Storage Performance Council (SPC) provides several benchmarking tools under different levels. SPC-1C [53] is designed to be vendor/platform independent and is applicable for a wide range of storage component products, such as disk drives, host bus adapters (HBAs), intelligent enclosures, and storage software such as logical volume managers.

Workload Properties

Two important concepts related to the workload intensity are the BSU (business scaling unit) and the ASU (application storage unit). Each BSU is composed of three ASUs: ASU1 for a data store with a weight of 45%, ASU2 for a user store with 45%, and ASU3 for a log with 10%; in total, this corresponds to five IOPS per ASU.

There are three types of access patterns: random (uniform distribution), sequential, and the random-walk access pattern (denoted "pattern"). The details can be found in Table 5-3, which is further summarized in Table 5-4. Table 5-5 indicates that small requests (8 and 16 blocks) account for over 85% of accesses. The random walk is an important type of Markov chain. Table 5-6 shows the main characteristics of this model.
Table 5-3

Decomposition of SPC-1C Workload

ASU        1-1     1-2     1-3    1-4    2-1     2-2    2-3    3-1
Intensity  0.035   0.281   0.07   0.21   0.018   0.07   0.035  0.281
R/W        0.5     0.5     1      0.5    0.3     0.3    1      0
Random     1       0       0      0      1       0      0      0
Pattern    0       1       0      1      0       1      0      0
Seq.       0       0       1      0      0       0      1      1

Ratio      1-1     1-2     1-3    1-4    2-1     2-2    2-3    3-1
R rand.    0.0175  0       0      0      0.0054  0      0      0
R seq.     0       0       0.07   0      0       0      0.035  0
R pattern  0       0.1405  0      0.105  0       0.021  0      0
W rand.    0.0175  0       0      0      0.0126  0      0      0
W seq.     0       0       0      0      0       0      0      0.281
W pattern  0       0.1405  0      0.105  0       0.049  0      0

Table 5-4

SPC-1C Workload Ratio

Workload ratio            ASU1     ASU2     ASU3
ASU i / total             0.596    0.123    0.281
Read rand. in ASU i       0.02936  0.0439   0
Read seq. in ASU i        0.11745  0.28455  0
Read pattern in ASU i     0.41191  0.17073  0
Write rand. in ASU i      0.02936  0.10244  0
Write seq. in ASU i       0        0        1
Write pattern in ASU i    0.41191  0.39837  0

Table 5-5

Size Distribution

Size (blocks)   Probability      Size (blocks)   Probability
8               0.7684           64              0.03088
16              0.09264          128             0.03088
32              0.0772
Table 5-6

SPC-1C Random Walk Pattern

ASU     1-1   1-2     1-3   1-4    2-1   2-2    2-3   3-1   Sum
Read    0     0.1405  0     0.105  0     0.021  0     0     0.2665
Write   0     0.1405  0     0.105  0     0.049  0     0     0.2945

You can also see from Table 5-7 that write accesses and pattern accesses dominate, which shows the importance of self-similarity and the write cache in the benchmark test.
Table 5-7

Basic IO Pattern Distribution

R/W     Ratio     Mode         Ratio
Read    0.3944    Random       0.053
Write   0.6056    Sequential   0.386
                  Pattern      0.561

Synthetic Trace

With all these parameters, you can actually write your own synthetic trace generator. This gives you more flexibility to change any parameters of interest, such as disk size, BSU count, simulation time, and even the configuration of SPC-1C itself, like the ASUs, IOPS per BSU, distribution patterns, etc. For example, you can separate an ASU from the BSU to see the influence of the individual applications (data store, user store, and log/sequential write) instead of the mixed workload, and change the distributions (e.g., inter-arrival time, request size) to fit more specific requirements. In addition, you may integrate the generator into another toolkit, such as DiskSim [54] or IOMeter, as a synthetic generator component. Figures 5-1, 5-2, and 5-3 compare a real trace captured by a bus analyzer with the synthetic trace generated by the MATLAB-based tool in Appendix A.
Figure 5-1. Inter-arrival time histogram

Figure 5-2. SPC-1C spatial distribution

Figure 5-3. SPC-1C trace temporal distribution

From Figure 5-1, you can see that the inter-arrival time approximately follows an exponential distribution. As you know, the workload is a mixture of eight different types of IO streams from three ASUs. Some high-intensity streams are visible in the spatial distribution of Figure 5-2. Generally, the generated random R/W IO streams are consistent with the real workload, but the sequential ones are not very close under a light workload. One possible reason is that the real workload was not generated with exactly the same parameters as those in the random-walk model. In fact, it is hard to align the temporal distribution well. However, the result in Figure 5-3 is fairly acceptable.
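For readers who prefer a script to the MATLAB tool in Appendix A, the following Python sketch outlines the core sampling loop of such a generator. It is only a simplified illustration, not the tool used to produce Figures 5-1 to 5-3: the stream intensities and request-size distribution are taken from Tables 5-3 and 5-5, the inter-arrival times are exponential, the address placement is uniform random for every stream (the sequential and random-walk behaviors are omitted), and the IOPS rate and ASU size are arbitrary assumptions.

# A minimal, illustrative SPC-1C-like trace sampler (assumptions noted above).
import random

# Per-stream intensities from Table 5-3 (ASU 1-1 ... 3-1).
STREAMS = [("1-1", 0.035), ("1-2", 0.281), ("1-3", 0.07), ("1-4", 0.21),
           ("2-1", 0.018), ("2-2", 0.07), ("2-3", 0.035), ("3-1", 0.281)]
# Request-size distribution in blocks from Table 5-5.
SIZES, SIZE_P = [8, 16, 32, 64, 128], [0.7684, 0.09264, 0.0772, 0.03088, 0.03088]

def generate(iops, duration_s, asu_blocks, seed=0):
    """Yield (time_s, stream, lba, size_blocks) tuples; uniform-random placement only.
    iops: target arrival rate; asu_blocks: assumed ASU size in blocks."""
    rng = random.Random(seed)
    t = 0.0
    while t < duration_s:
        t += rng.expovariate(iops)                      # exponential inter-arrival time
        stream = rng.choices([s for s, _ in STREAMS],
                             weights=[w for _, w in STREAMS])[0]
        size = rng.choices(SIZES, weights=SIZE_P)[0]
        lba = rng.randrange(0, asu_blocks - size)       # uniform random placement
        yield (t, stream, lba, size)

# Hypothetical usage: 500 IOPS for 10 seconds over a 2^26-block ASU.
for req in generate(iops=500, duration_s=10, asu_blocks=1 << 26):
    pass  # write to a trace file, feed a simulator such as DiskSim, etc.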

PCMark

PCMark Vantage [55] is a widely used benchmark tool that is not limited to disk performance. It can provide application-level traces in eight categories, as shown in Table 5-8, where each trace is decomposed into the numbers of write and read commands. Figure 5-4 further shows the traces in a plot of LBA vs. time.12
Table 5-8

Eight Applications in PCMark

Order   Application            Total CMD   Write   Read
1       Windows Defender       6755        300     6455
2       Gaming                 11040       62      10978
3       Importing Pictures     2806        4       2802
4       Vista Startup          7418        1327    6091
5       Video Editing          8204        3711    4493
6       Windows Media Center   5011        3309    1702
7       Adding Music           3336        1506    1830
8       Application Loading    11155       2660    8495

Figure 5-4. Eight applications in PCMark

Now you can use this trace as an example for read cache performance analysis. Some early research shows that a common differentiator between drives is the read cache hit rate, given reasonable pre- and/or post-read data. For ease of notation, I define a prefetch action to cover both pre-read and post-read requests, as shown in Figure 5-5. A common case is that, if the trace has strong locality, some later read requests hit the prefetched data in the cache. Three types of prefetch accesses are commonly used: prefetch always (PA), prefetch on a miss (PoM), and prefetch on a hit (PoH).
Figure 5-5. Cache prefetch
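To make the three prefetch modes concrete, here is a small Python sketch of a read cache simulator (my own illustration, not the tool used to produce the tables later in this chapter) that counts hits under PA, PoM, or PoH with a fixed post-read size over an LRU queue of cached blocks; the queue depth and prefetch size are arbitrary parameters.

# A minimal LRU read-cache simulator with a selectable prefetch mode (illustrative only).
from collections import OrderedDict

def simulate(reads, mode="PoM", qd=128, prefetch=64):
    """reads: iterable of (lba, size) read commands (blocks).
    mode: "PA" (always), "PoM" (on miss), "PoH" (on hit), or "none".
    Returns (hits, total, prefetched_blocks)."""
    cache = OrderedDict()          # block -> None, ordered by recency (LRU)
    hits = total = extra = 0

    def insert(blocks):
        for b in blocks:
            cache.pop(b, None)
            cache[b] = None
            if len(cache) > qd:
                cache.popitem(last=False)   # evict the least recently used block

    for lba, size in reads:
        total += 1
        wanted = range(lba, lba + size)
        hit = all(b in cache for b in wanted)
        hits += hit
        insert(wanted)                       # cache the demanded blocks
        if mode == "PA" or (mode == "PoM" and not hit) or (mode == "PoH" and hit):
            insert(range(lba + size, lba + size + prefetch))  # post-read only
            extra += prefetch
    return hits, total, extra

# Hypothetical usage: two nearby reads, the second hits thanks to PoM prefetch.
print(simulate([(1000, 16), (1016, 16)], mode="PoM", qd=128, prefetch=64))  # -> (1, 2, 64)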

As mentioned, a gap between requests is allowed for near-sequential streams. This gap can become a "hole" in the prefetched data, which is harmful to the overall performance. As shown in Figure 5-6, there are three types of holes:
  • Post hole: The first LBA of the incoming command has a distance (> 0) within a threshold to the last LBA of the queued commands in the cache.

  • Pre hole: The last LBA of the incoming command has a distance within a threshold to the first LBA of the queued commands in the cache.

  • Pre and post hole: Both post and pre holes exist.

Figure 5-6. Hole types

To formally define the hole, consider the following two cases:
  • Constant: If the range between two close regions in the cache is less than the constant Hs blocks, this range is viewed as a hole.

  • Adaptive: The hole size is related to the DRAM and SSD cache (most likely monotonically increasing in the total/remaining cache size):

    $$ {H}_S=\begin{cases}\underline{H}_S, & \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor <\underline{H}_S\\ {\overline{H}}_S, & \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor >{\overline{H}}_S\\ \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor, & \text{otherwise}\end{cases} $$

    where S_T and S_R are the total and remaining cache size, respectively, and a_T and a_R are coefficients for S_T and S_R. An example for 64MB of DRAM cache is defined as

    $$ {H}_S=\begin{cases}16, & \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor <\underline{H}_S\\ 256, & \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor >{\overline{H}}_S\\ \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor, & \text{otherwise}\end{cases} $$

    where $$ {a}_T=\frac{1}{2^{13}}, {a}_R=\frac{15}{2^{13}} $$. The actual $$ \underline{H}_S $$ and $$ {\overline{H}}_S $$ are decided by the trace, for example the median request size within a time window. More specifically, consider whether the hole lies in the same track/cylinder.
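As a small worked example of this adaptive threshold (using the 64 MB DRAM coefficients above; the remaining-size figure is made up), the following Python sketch evaluates H_S.

# Illustrative computation of the adaptive hole-size threshold H_S defined above.
def hole_size(total_blocks, remaining_blocks,
              a_t=1 / 2**13, a_r=15 / 2**13, h_min=16, h_max=256):
    """Clamp floor(a_T*S_T + a_R*S_R) into [h_min, h_max] (sizes in blocks)."""
    h = int(a_t * total_blocks + a_r * remaining_blocks)   # floor for positive values
    return max(h_min, min(h_max, h))

# Hypothetical example: a 64 MB cache (131072 blocks of 512 B) with 20000 blocks free.
print(hole_size(total_blocks=131072, remaining_blocks=20000))   # -> 52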

A background task should be implemented to monitor the LBA regions in the cache and the hit density of these regions. This process also looks for gaps in regions of data with some level of hit density and generates self-induced read commands to fill these holes with data from the main store.

To connect the cache algorithm to the trace properties, use the following hypothesis:
  • If two regions of a certain range within a time-frame have a high enough correlation, the gap between them is likely to be accessed.

  • The prefetch should not affect the overall performance much, so the benefit gained from the additional cache hit ratio should be larger than the additional cost due to prefetch.

  • The upper bound of the hole size to be fetched is decided by multiple factors, such as the total/remaining cache size, workload size, cache type, access time, etc.

Now we have some questions to answer.
  • What is the benefit of a hole filling policy? The key is to get the increased cache hit ratio (hit ratio with hole filling vs. hit ratio without hole filling).

  • What is the additional cost of a hole filling policy? The key is to find how many self-induced commands are generated to fill the holes. The timing of the user workload must be considered.

  • Since hole filling and the prefetch policy are similar, is it possible to merge hole filling into the prefetch policy (integration)? The key is to find the overlapping cache hits between the two policies; if the overlap rate is high, prefetch may include hole filling as part of it.

  • When and where to apply the two policies (or integrated policy) with balanced benefit and cost? The key is to reduce the additional mechanical access cost and cache pollution.

Workload Properties

Let’s look at the think time and completion time first, as shown in Figures 5-7 and 5-8, respectively. You can see the large difference in their distributions across the various applications. Table 5-9 further gives the mean and standard deviation values together with IOPS. Figure 5-9 and Table 5-10 provide the size distribution. You may also find the relation between size and completion time.
Figure 5-7. PCMark: Think time distribution

Figure 5-8. PCMark: Completion time

Table 5-9

PCMark: Think/Completion Time (ms) and IOPS

App   IOPS     Think-time mean   Std.    Completion-time mean (blk)   Std.
1     120.57   8.29              46.04   3.43                         5.49
2     189.29   5.28              9.20    10.24                        47.4
3     48.70    20.54             21.25   1.99                         3.44
4     311.30   3.21              8.25    3.59                         5.84
5     122.47   8.17              16.68   0.76                         2.64
6     59.56    16.79             22.58   1.2                          3.22
7     145.65   6.87              14.87   2.78                         5.86
8     204.29   4.90              16.37   16.41                        29.87

Figure 5-9. PCMark: Size distribution

Table 5-10

PCMark: Size Distribution (Blocks)

App   Mean (all)   Std.     Mean (read)   Std.    Mean (write)   Std.
1     98.67        87.55    102.78        87.4    10.22          7.3
2     86.92        56.39    87.36         56.24   8              0
3     175.09       82.28    175.33        82.1    8              0
4     71.92        54.86    83.12         49.32   20.48          49.35
5     60.77        63.68    59.18         25.42   62.69          90.43
6     173.36       115.82   255.16        13.43   131.29         122.52
7     43.01        61.43    60.39         70.63   21.89          38.54
8     36.23        37.58    41.31         37.47   20.04          33.11

In terms of sequential streams, you can observe some obvious sequences in general. Importing pictures is the most sequential, while application loading is the least. Except for application loading (mixed streams), a queue length of 10 is good enough to detect most streams; application loading has relatively more interleaved sequential streams. Also note that for read-only cases, most streams have only two commands. In Apps 2, 3, and 4, the average stream length is longer than in the others, which provides more chances for the infinity read mode.
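The stream counts above come from a simple detector; the exact tool is not shown here, but a minimal Python sketch along these lines (my own illustration, with an arbitrary gap threshold and the queue length of 10 mentioned above) would look like this.

# Illustrative detector of (near-)sequential read streams within a bounded queue.
def count_streams(cmds, queue_len=10, max_gap=0):
    """cmds: iterable of (lba, size) read commands in arrival order.
    A command extends an open stream if its start LBA is within max_gap blocks
    of that stream's end; otherwise it opens a new stream. Only the last
    queue_len open streams are kept, mirroring a bounded detection queue.
    Returns the lengths (in commands) of all detected streams."""
    open_streams = []      # list of [next_expected_lba, length], newest last
    finished = []
    for lba, size in cmds:
        for s in reversed(open_streams):
            if 0 <= lba - s[0] <= max_gap:     # continues this stream
                s[0], s[1] = lba + size, s[1] + 1
                break
        else:
            open_streams.append([lba + size, 1])
            if len(open_streams) > queue_len:  # oldest stream falls out of the queue
                finished.append(open_streams.pop(0)[1])
    return finished + [length for _, length in open_streams]

# Hypothetical example: two interleaved sequential streams of 3 and 2 commands.
trace = [(100, 8), (5000, 16), (108, 8), (5016, 16), (116, 8)]
print(count_streams(trace))    # -> [3, 2]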

Gain-Loss Analysis

For a cache policy such as hole filling or prefetch, the gain comes from the increased hit ratio, while the loss comes from the additional disk accesses caused by the policy (e.g., pre-/post-reads attached to user commands, and self-induced non-attached commands).

Two methods can be used to analyze the gain and loss. One simple method considers frequency only: it counts the hit frequency and the additional disk access frequency. The cost of an attached pre-/post-read is generally smaller than that of a non-attached, self-induced command. If the hit ratio increases thanks to the self-induced commands, the gain is large; otherwise, it is small. This method provides only a rough estimate. The other method is the so-called mixed simulation and modeling approach: it uses simulation to get the hit ratio and the additional command frequency, while using an analytical model to obtain the average response time, which is relatively complex but quantitative. Let's only consider the first one.

Let’s define gain and loss quantitatively:
  • Gain: The additional cache hit ratio obtained (a2) = the cache hit ratio at hole length x (a3) - the cache hit ratio without hole filling (a0)

  • Loss: The number of additional self-induced commands divided by the total number of read commands (a1)

  • Gain-loss ratio a = a2/a1: The higher, the better (a small computational sketch follows this list). To estimate a, there are a few basic considerations:
    • The queue depth (Qd) in the cache, which has a significant influence on the LRU algorithm. By observing the largest a vs. Qd, you can find the optimal Qd; the optimal Qd may differ between applications.

    • The time interval (dT) between commands: Only if the interval is larger than a certain value does the system have a chance to handle a self-induced command; thus, find the ratio a for different dT.

    • The hit frequency of each hole: When using a correlation method to select the holes to fill, the hit frequency must be higher than 1. If the hit frequency is generally larger than 1, this method is meaningful; otherwise, it is not.
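The sketch below (illustrative only; all counter values are hypothetical) shows how a2, a1, and a are obtained from simulation counters.

# Illustrative gain-loss computation from simulation counters.
def gain_loss(hits_with_filling, hits_without, self_induced_cmds, total_reads):
    """a2 = additional hit ratio, a1 = self-induced commands per read, a = a2/a1."""
    a3 = hits_with_filling / total_reads     # hit ratio with hole filling
    a0 = hits_without / total_reads          # hit ratio without hole filling
    a2 = a3 - a0
    a1 = self_induced_cmds / total_reads
    return a2, a1, (a2 / a1 if a1 else float("inf"))

# Hypothetical counters for one application trace:
print(gain_loss(hits_with_filling=620, hits_without=540,
                self_induced_cmds=650, total_reads=6455))
# -> a2 ~= 0.012, a1 ~= 0.101, a ~= 0.123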

Let’s start to verify whether a self-induced hole filling policy is beneficial. Take the following steps:
  1. Check a1, a2, and a for all read commands (write commands also fill the cache).

  2. If a1 is much larger than a2 (e.g., a < 0.2), hole filling is not economical; then check whether a2 is too small to be worthwhile. Otherwise, hole filling is useful; then check selective self-induced commands based on the workload.

  3. If the time interval between two commands is large enough (e.g., > 15ms), the self-induced command may not cost additional resources. If so, any increased cache hit ratio is beneficial. Repeat Steps 1-2.
Table 5-11 provides the results, where Qd = queue depth and fs = filling hole size. You can see that App 2 is the most significant for hole filling, while App 6 is the least significant. Hole filling is generally suitable for Apps 2, 3, 5, and 8. If you only consider the commands with a think time larger than 10ms, you have the results listed in Table 5-12. You can see that the general trend is similar to the case without time-interval constraints. App 5 is the most significant for hole filling for fs = 128 and App 3 for fs = 256, while App 6 is the least significant. In sum, a is generally smaller than 1 for a short queue depth. When the hole size is very small, a is very small (<< 1). When both the queue length and hole size are short, a hole filling policy is generally useless. Based on the value of a, you can say that this policy is generally useless for App 6 and may be useful for Apps 2, 5, and 8. A further conclusion should be made after comparing with the prefetch policy.
Table 5-11

Gain-Loss Analysis for Hole Filling

Qd = 128; fs = 128
      1      2      3      4      5      6      7      8
a2    0.026  0.024  0.007  0.001  0.049  0.000  0.051  0.079
a1    0.328  0.060  0.041  0.183  0.099  0.020  0.293  0.192
a     0.078  0.399  0.165  0.007  0.494  0.000  0.175  0.413

Qd = 128; fs = 256
      1      2      3      4      5      6      7      8
a2    0.045  0.044  0.017  0.003  0.061  0.000  0.109  0.118
a1    0.362  0.058  0.028  0.225  0.106  0.026  0.265  0.235
a     0.124  0.760  0.620  0.015  0.580  0.000  0.412  0.505

Qd = 256; fs = 128
      1      2      3      4      5      6      7      8
a2    0.029  0.034  0.007  0.003  0.048  0.000  0.051  0.102
a1    0.338  0.065  0.042  0.183  0.099  0.031  0.295  0.206
a     0.084  0.520  0.160  0.014  0.490  0.000  0.174  0.493

Qd = 256; fs = 256
      1      2      3      4      5      6      7      8
a2    0.048  0.057  0.017  0.006  0.062  0.000  0.110  0.154
a1    0.375  0.061  0.030  0.225  0.104  0.042  0.282  0.242
a     0.128  0.930  0.590  0.027  0.593  0.000  0.390  0.637

Table 5-12

Gain-Loss Analysis for Hole Filling (Constrained)

Qd = 128; fs = 128
      1      2      3      4      5      6      7      8
a2    0.015  0.023  0.006  0.001  0.023  0.000  0.031  0.058
a1    0.205  0.043  0.016  0.025  0.028  0.009  0.174  0.146
a     0.073  0.548  0.364  0.026  0.811  0.000  0.176  0.396

Qd = 128; fs = 256
      1      2      3      4      5      6      7      8
a2    0.026  0.035  0.015  0.002  0.034  0.000  0.050  0.094
a1    0.252  0.045  0.006  0.041  0.037  0.014  0.196  0.184
a     0.105  0.778  2.389  0.060  0.923  0.000  0.253  0.512

Qd = 256; fs = 128
      1      2      3      4      5      6      7      8
a2    0.018  0.033  0.006  0.001  0.023  0.000  0.031  0.080
a1    0.209  0.047  0.016  0.027  0.029  0.019  0.189  0.162
a     0.088  0.697  0.364  0.025  0.773  0.000  0.162  0.495

Qd = 256; fs = 256
      1      2      3      4      5      6      7      8
a2    0.031  0.049  0.015  0.002  0.035  0.000  0.050  0.127
a1    0.259  0.048  0.006  0.042  0.039  0.025  0.211  0.196
a     0.119  1.028  2.389  0.058  0.908  0.000  0.235  0.648

Next, let’s consider the gain and loss for the prefetch policy. Let’s define the terms:
  • Prefetch: Includes post-read and pre-read. Post-read means an additional size is attached after the last LBA; pre-read means an additional size is attached before the first LBA. For example, assume that the original command is to read LBA 10-20 and the pre-read size is 8. Then the extended command is to read LBA 2-20 (see the small sketch after this list).

  • Gain: The additional cache hit ratio obtained (b2) = the cache hit ratio at fetch length x (b3) - the cache hit ratio without prefetch (b0)

  • Loss: The number of prefetch commands divided by the total number of read commands (b1)

  • Gain-loss ratio b = b2/b1: The higher, the better
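A tiny Python sketch of the command extension just described (illustrative only; LBAs are treated as an inclusive block range, as in the LBA 10-20 example).

# Extend a read command with pre-read and/or post-read blocks (illustrative).
def extend(first_lba, last_lba, pre=0, post=0):
    """Return the (first, last) LBA range of the prefetch-extended command."""
    return max(0, first_lba - pre), last_lba + post

# The example from the text: read LBA 10-20 with a pre-read size of 8 -> LBA 2-20.
print(extend(10, 20, pre=8))            # -> (2, 20)
# A post-read of 64 blocks on the same command -> LBA 10-84.
print(extend(10, 20, post=64))          # -> (10, 84)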

For this case, let's also consider the number of sequential streams and each stream's length, besides the queue depth and the time interval. The procedure is similar to the previous case. Table 5-13 shows the result. You can see that App 5 is the most significant for post-read, while App 6 is the least significant. b is generally smaller than 1 for a short queue depth. When the prefetch size is very small (< 64), b is very small (<< 1). When the queue length and fetch length are long, a prefetch policy is generally useful. Based on the value of b, you may say that this policy is generally useless for App 6, most significant for App 5 with post-read, and possibly useful for Apps 2, 4, 7, and 8.
Table 5-13

Gain-Loss Analysis for Prefetch Policy

Qd = 128; fs = 128
      1      2      3      4      5      6      7      8
b2    0.179  0.140  0.063  0.119  0.462  0.000  0.220  0.224
b1    0.818  0.859  0.937  0.880  0.528  1.000  0.769  0.764
b     0.219  0.163  0.067  0.136  0.875  0.000  0.286  0.294

Qd = 128; fs = 256
      1      2      3      4      5      6      7      8
b2    0.307  0.417  0.197  0.383  0.607  0.000  0.398  0.321
b1    0.690  0.582  0.803  0.616  0.382  1.000  0.591  0.668
b     0.444  0.717  0.245  0.621  1.589  0.000  0.674  0.480

Qd = 256; fs = 128
      1      2      3      4      5      6      7      8
b2    0.183  0.147  0.063  0.120  0.461  0.000  0.220  0.239
b1    0.814  0.851  0.937  0.879  0.527  1.000  0.769  0.748
b     0.225  0.173  0.067  0.137  0.875  0.000  0.286  0.319

Qd = 256; fs = 256
      1      2      3      4      5      6      7      8
b2    0.310  0.427  0.197  0.384  0.607  0.000  0.398  0.342
b1    0.687  0.572  0.803  0.615  0.381  1.000  0.591  0.645
b     0.452  0.746  0.245  0.625  1.593  0.000  0.674  0.530

You may also analyze the relationship between post-read and sequential streams by counting 1) the number of streams (c) and the ratio c1 = c / total number of reads; 2) each stream's hit frequency c2; and 3) the cache hit ratio as b2 (considering post-read only, with b2 redefined accordingly). This test helps check whether the infinity sequential read takes effect; for example, only when the sequence length is long enough does the cache enter infinity mode. In infinity mode, post-read is automatic. However, if the cache is not in infinity mode, you still need to consider the benefit of post-read after detecting a two-command sequential stream, i.e., the possibility of three or more commands being attached to the stream.

You can further find the relation between prefetch and hole filling by defining
  • Gain: The additional cache hit ratio obtained (d2) = the cache hit ratio at fetch length x and hole length y (d3) - the cache hit ratio without prefetch and hole filling (d0)

  • Loss: The number of additional commands divided by the total number of read commands (d1)

  • Gain-loss ratio d = d2/d1: The higher, the better

Two comparison methods are used here:
  • Two queues, one for prefetch (Q1) and the other for hole filling (Q2), both under LRU. Consider the overlapping hit ratio, i.e., whether a command that hits in Q1 also hits in Q2, and vice versa.

  • Two queues, one for prefetch (Q1) and the other for hole filling + prefetch (Q3), both under LRU. Consider the additional hit ratio of Q3 over Q1.

Define x0 as the number of hits without the prefetch and hole filling policies, x̄1 = x1 + x0 as the number of hits with prefetch, x̄2 = x2 + x0 as the number of hits with hole filling, and x̄3 = x3 + x0 as the number of hits with combined prefetch and hole filling. Now, if x0 is much larger than x1 and x2, there is not much difference between prefetch and hole filling; otherwise, check the difference between x1 and x2. If x1 is much larger than x2, the benefit of prefetch is much larger than that of hole filling; otherwise, it is reasonable to do hole filling instead of prefetch. If x3 is much larger than x1, hole filling gains additional benefit on top of prefetch; otherwise, adding hole filling to prefetch is not beneficial.

Without post-read fetch, you can observe that x̄1 is generally higher than x̄2, and that x1 and x0 are generally larger than x2. Except for Apps 5 and 6, x2 is generally smaller than 1% (Apps 4 and 8 are around 1% when the queue length and fetch size are 256). This means that if post-read takes effect, the hole filling's influence is very small. The overlap between the two policies (post only) is much higher than x1 and x1 + x2; x0/(x1 + x2) is generally larger than 10 when the prefetch size is over 128 (various prefetch sizes; fixed hole filling size of 256). Now the key question is whether it is worth doing the hole filling for the additional 2% (queue length = 128, fetch size = 256/128) cache hit ratio at the cost of additional self-induced accesses (~10%).

Observe that b1 is generally larger than a1. However, the cost of each prefetch is generally smaller than that of a self-induced hole filling command. Note that this direct comparison may be unfair. An indirect method is to estimate the average service time for the prefetch and hole filling policies (WCD): for prefetch, define the response time as write ratio × write access time + read ratio × (b3 × cache access time + (1 - b3) × read access time); for hole filling, define the response time as write ratio × write access time + read ratio × (a3 × cache access time + (1 - a3) × read access time). However, I omit the details here.
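Written compactly (my own notation, not from the original text): let r_w and r_r be the write and read ratios and t_w, t_r, t_c the write, read, and cache access times; then

$$ T_{\text{prefetch}} = r_w t_w + r_r\left[b_3 t_c + (1-b_3) t_r\right],\qquad T_{\text{hole}} = r_w t_w + r_r\left[a_3 t_c + (1-a_3) t_r\right] $$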

In sum, you can see that the hole filling policy can improve the cache hit ratio by introducing self-induced background filling commands.
  • The benefit (increased hit frequency) to cost (additional self-induced commands) ratio of (post) hole filling is generally less than 1 if any hole within a certain distance is filled; the average is 0.47 (queue depth = 256; hole size = 256 blks).

  • The benefit-to-cost ratio of prefetch (post-read) is generally less than 1 if any non-hit command is prefetched with a certain prefetch size; the average is 0.68 (queue depth = 256; prefetch size = 256 blks), which is better than a (post) hole filling policy, considering that some hole filling commands may need more resources than prefetch commands.

  • If a (post) hole filling policy is further applied on top of the prefetch policy, the additional benefit/cost ratio is generally less than 0.2 (queue depth = 256; prefetch size = 256 blks; hole size = 256 blks). Unless a well-designed hole filling policy (e.g., based on data correlation) is applied, it is not very useful for overall performance improvement.

Due to the similarity of hole filling and prefetch, data-correlation methods may be applied to both; thus, the two policies might be merged.