
5. Case Study: Benchmarking Tools


Benchmark tools are useful for providing "standard" performance indexes for storage systems with specific requirements. This chapter shows how to identify the access patterns behind benchmark results. The first tool is SPC-1C from the Storage Performance Council (SPC); after capturing its pattern, I developed a synthetic emulator to match the real traces. The second tool is PCMark from Futuremark; I illustrate how to use gain-loss analysis to improve cache algorithm efficiency.

Storage performance benchmarks assess the relative performance of storage systems by running a number of standard tests and trials via a tool or a set of programs, with or without support from specific hardware equipment. Below are some benchmark tools that are often used for active trace collection:
  • Synthetic trace
    • The user can specify test scenarios for queue depth, request size, transfer rate, sequentiality, etc. It is good for determining corner-case behavior.

    • Examples: IOMeter, VDBench, fio, IOzone, iorate, sqlio, diskspd1.
  • Application-based trace
    • The user can choose specific applications with predefined workload patterns. It is good for illustrating real-world cases.

    • Examples: SysMark,2 PCMark, SPC (Storage Performance Council, e.g., SPC-1, SPC-2), TPC (Transaction Processing Performance Council, e.g., TPC-A/B/C/D/H/W),3 SPEC (Standard Performance Evaluation Corporation, e.g., HPC/SFS/JVM/SDM),4 Jetstress,5 COSbench6.
  • Real trace
    • The user can input a real-world trace directly. It is useful when the user attempts to test similar applications on different systems.

    • Example: AnandTech Storage Bench7.
Table 5-1 gives a simple comparison of some commonly used tools, where the letters W, L, and U in the OS column indicate Windows, Linux, and Unix, respectively.
Table 5-1

A Comparison of Some Common Benchmark Tools

 

Tool        Block   File   Posix   OS    S/C   Open       Latest
Iometer     Y       -      -       WLU   S/C   Y          1.1.0 / 2014
IOzone      Y       Y      Y       WLU   S     Y          2006
Bonnie++    Y       -      -       LU    S     Y (GPL2)   1.0.1
dbench      Y       Y      Y       LU    S/C   Y (GNU)    2008
Filebench   Y       Y      Y       LU    S     Y          2011

For large-scale systems, the traditional tools may be insufficient (e.g., lacking measurement metrics) or inconvenient (e.g., no integrated user interface). Therefore, some dedicated tools have been proposed, such as HiBench,8 Berkeley BDB,9 BigDataBench,10 and BigBench11 for big data benchmarks, in particular for Hadoop/Spark systems.

A general benchmark procedure is shown in Table 5-2 [52]. Note that other classification methods exist. For example, benchmarks can be divided into micro-benchmarks of low-level system operations (e.g., evaluating HDFS operations on a modern cluster), functional/component benchmarks of high-level functions (e.g., TeraSort, basic SQL), and application-level benchmarks (e.g., overall system performance for a given application scenario). For more details, refer to [41, 35].
Table 5-2

Five Steps for General Benchmarks

Step 1: Choose the proper configuration.
Remarks: Select the proper type of benchmark, such as macro-benchmark (an overall test of the full system), micro-benchmark (a few operations to check partial changes), or trace-based.

Step 2: Choose the correct environment.
Remarks: Ensure consistent hardware and software settings, and account for factors such as cache status, HDD zoned constant angular velocity, file system aging, nonessential processes, etc.

Step 3: Run the benchmark.
Remarks: Identical runs for repeatability, multiple rounds for accuracy with small standard deviations or a high confidence level, sufficient running time to reach steady state, and automatic runs via suitable scripts.

Step 4: Present the results.
Remarks: Statistical results with small confidence intervals and a near-normal distribution.

Step 5: Validate the results.
Remarks: Reproduce/confirm the results or compare them with other similar ones.
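For the "run the benchmark" and "present the results" steps, here is a minimal Python sketch (my own illustration, not part of any benchmark suite) of how repeated rounds can be summarized with a mean, standard deviation, and an approximate 95% confidence interval; the per-round IOPS values are hypothetical.

# A minimal sketch of the "run multiple rounds and report statistics" step:
# given per-round scores, report the mean, standard deviation, and an
# approximate 95% confidence interval (normal approximation).
import math
import statistics

def summarize_rounds(results):
    """results: list of per-round scores (e.g., IOPS). Returns (mean, std, ci)."""
    mean = statistics.mean(results)
    std = statistics.stdev(results)                      # sample standard deviation
    half_width = 1.96 * std / math.sqrt(len(results))    # normal approximation of 95% CI
    return mean, std, (mean - half_width, mean + half_width)

# Hypothetical example: five rounds of an IOPS benchmark.
rounds = [5120.0, 5085.0, 5210.0, 5150.0, 5098.0]
mean, std, ci = summarize_rounds(rounds)
print(f"mean={mean:.1f} IOPS, std={std:.1f}, 95% CI=({ci[0]:.1f}, {ci[1]:.1f})")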

SPC-1C

The Storage Performance Council (SPC) provides several benchmarking tools under different levels. SPC-1C [53] is designed to be vendor/platform independent and is applicable for a wide range of storage component products, such as disk drives, host bus adapters (HBAs), intelligent enclosures, and storage software such as logical volume managers.

Workload Properties

Two important concepts related to the workload intensity are the BSU (business scaling unit) and the ASU (application storage unit). Each BSU is composed of three ASUs: ASU1 for a data store with a weight of 45%, ASU2 for a user store with 45%, and ASU3 for a log with 10%; in total, this corresponds to five IOPS per ASU.

There are three types of access patterns: random (uniform distribution), sequential, and the random-walk access pattern (denoted "pattern"). The details can be found in Table 5-3, which is further summarized in Table 5-4. Table 5-5 indicates that small requests (8 and 16 blocks) account for over 85% of accesses. The random walk is an important type of Markov chain. Table 5-6 shows the main characteristics of this model.
Table 5-3

Decomposition of SPC-1C Workload

ASU        1-1     1-2     1-3    1-4    2-1     2-2    2-3    3-1
Intensity  0.035   0.281   0.07   0.21   0.018   0.07   0.035  0.281
R/W        0.5     0.5     1      0.5    0.3     0.3    1      0
Random     1       0       0      0      1       0      0      0
Pattern    0       1       0      1      0       1      0      0
Seq.       0       0       1      0      0       0      1      1

Ratio      1-1     1-2     1-3    1-4    2-1     2-2    2-3    3-1
R rand.    0.0175  0       0      0      0.0054  0      0      0
R seq.     0       0       0.07   0      0       0      0.035  0
R pattern  0       0.1405  0      0.105  0       0.021  0      0
W rand.    0.0175  0       0      0      0.0126  0      0      0
W seq.     0       0       0      0      0       0      0      0.281
W pattern  0       0.1405  0      0.105  0       0.049  0      0

Table 5-4

SPC-1C Workload Ratio

Workload ratio            ASU1     ASU2     ASU3
ASU i / total             0.596    0.123    0.281
Read rand. in ASU i       0.02936  0.0439   0
Read seq. in ASU i        0.11745  0.28455  0
Read pattern in ASU i     0.41191  0.17073  0
Write rand. in ASU i      0.02936  0.10244  0
Write seq. in ASU i       0        0        1
Write pattern in ASU i    0.41191  0.39837  0

Table 5-5

Size Distribution

Size (blocks)   Probability      Size (blocks)   Probability
8               0.7684           64              0.03088
16              0.09264          128             0.03088
32              0.0772
Table 5-6

SPC-1C Random Walk Pattern

ASU     1-1   1-2     1-3   1-4    2-1   2-2    2-3   3-1   Sum
Read    0     0.1405  0     0.105  0     0.021  0     0     0.2665
Write   0     0.1405  0     0.105  0     0.049  0     0     0.2945

You can also see from Table 5-7 that write accesses and pattern accesses dominate, which shows the importance of self-similarity and the write cache in the benchmark test.
Table 5-7

Basic IO Pattern Distribution

R/W     Ratio     Mode         Ratio
Read    0.3944    Random       0.053
Write   0.6056    Sequential   0.386
                  Pattern      0.561

Synthetic Trace

With all these parameters, you can actually write your own synthetic trace generator. This gives you more flexibility to change any parameters of interest, such as disk size, BSU count, simulation time, and even the configuration of SPC-1C itself, like the ASUs, IOPS per BSU, distribution patterns, etc. For example, you can separate an ASU from the BSU to see the influence of the individual applications (data store, user store, and log/sequential write) instead of the mixed workload, and change the distributions (e.g., inter-arrival time, request size) to fit more specific requirements. In addition, you may integrate the generator into another toolkit, such as DiskSim [54] or IOMeter, as a synthetic generator component. Figures 5-1, 5-2, and 5-3 compare a real trace captured by a bus analyzer with the synthetic trace generated by the MATLAB-based tool in Appendix A.
Figure 5-1. Inter-arrival time histogram

Figure 5-2. SPC-1C spatial distribution

Figure 5-3. SPC-1C trace temporal distribution

From Figure 5-1, you can see that the inter-arrival time approximately follows an exponential distribution. As you know, the workload is a mixture of eight different types of IO streams from three ASUs. Some high-intensity streams are visible in the spatial distribution of Figure 5-2. Generally, the generated random R/W IO streams are consistent with the real workload, but the sequential ones are not very close under a light workload. One possible reason is that the real workload was not generated with exactly the same parameters as those in the random-walk model. In fact, it is hard to align the temporal distribution well. However, the result in Figure 5-3 is fairly acceptable.
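For readers who prefer a script to the MATLAB tool in Appendix A, the following Python sketch outlines the core sampling loop of such a generator. It is only a simplified illustration, not the tool used to produce Figures 5-1 to 5-3: the stream intensities and request-size distribution are taken from Tables 5-3 and 5-5, the inter-arrival times are exponential, the address placement is uniform random for every stream (the sequential and random-walk behaviors are omitted), and the IOPS rate and ASU size are arbitrary assumptions.

# A minimal, illustrative SPC-1C-like trace sampler (assumptions noted above).
import random

# Per-stream intensities from Table 5-3 (ASU 1-1 ... 3-1).
STREAMS = [("1-1", 0.035), ("1-2", 0.281), ("1-3", 0.07), ("1-4", 0.21),
           ("2-1", 0.018), ("2-2", 0.07), ("2-3", 0.035), ("3-1", 0.281)]
# Request-size distribution in blocks from Table 5-5.
SIZES, SIZE_P = [8, 16, 32, 64, 128], [0.7684, 0.09264, 0.0772, 0.03088, 0.03088]

def generate(iops, duration_s, asu_blocks, seed=0):
    """Yield (time_s, stream, lba, size_blocks) tuples; uniform-random placement only.
    iops: target arrival rate; asu_blocks: assumed ASU size in blocks."""
    rng = random.Random(seed)
    t = 0.0
    while t < duration_s:
        t += rng.expovariate(iops)                      # exponential inter-arrival time
        stream = rng.choices([s for s, _ in STREAMS],
                             weights=[w for _, w in STREAMS])[0]
        size = rng.choices(SIZES, weights=SIZE_P)[0]
        lba = rng.randrange(0, asu_blocks - size)       # uniform random placement
        yield (t, stream, lba, size)

# Hypothetical usage: 500 IOPS for 10 seconds over a 2^26-block ASU.
for req in generate(iops=500, duration_s=10, asu_blocks=1 << 26):
    pass  # write to a trace file, feed a simulator such as DiskSim, etc.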

PCMark

PCMark Vantage [55] is a widely used benchmark tool that is not limited to disk performance. It can provide application-level traces in eight categories, as shown in Table 5-8, where each trace is decomposed into the numbers of write and read commands. Figure 5-4 further shows the traces in a plot of LBA vs. time.12
Table 5-8

Eight Applications in PCMark

Order   Application            Total CMD   Write   Read
1       Windows Defender       6755        300     6455
2       Gaming                 11040       62      10978
3       Importing Pictures     2806        4       2802
4       Vista Startup          7418        1327    6091
5       Video Editing          8204        3711    4493
6       Windows Media Center   5011        3309    1702
7       Adding Music           3336        1506    1830
8       Application Loading    11155       2660    8495

Figure 5-4. Eight applications in PCMark

Now you can use this trace as an example for read cache performance analysis. Some early research shows that a common differentiator between drives is the read cache hit rate, given reasonable pre- and/or post-read data. For ease of notation, I define a prefetch action to cover both pre-read and post-read requests, as shown in Figure 5-5. A common case is that, if the trace has strong locality, some later read requests hit the prefetched data in the cache. Three types of prefetch accesses are commonly used: prefetch always (PA), prefetch on a miss (PoM), and prefetch on a hit (PoH).
Figure 5-5. Cache prefetch
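To make the three prefetch modes concrete, here is a small Python sketch of a read cache simulator (my own illustration, not the tool used to produce the tables later in this chapter) that counts hits under PA, PoM, or PoH with a fixed post-read size over an LRU queue of cached blocks; the queue depth and prefetch size are arbitrary parameters.

# A minimal LRU read-cache simulator with a selectable prefetch mode (illustrative only).
from collections import OrderedDict

def simulate(reads, mode="PoM", qd=128, prefetch=64):
    """reads: iterable of (lba, size) read commands (blocks).
    mode: "PA" (always), "PoM" (on miss), "PoH" (on hit), or "none".
    Returns (hits, total, prefetched_blocks)."""
    cache = OrderedDict()          # block -> None, ordered by recency (LRU)
    hits = total = extra = 0

    def insert(blocks):
        for b in blocks:
            cache.pop(b, None)
            cache[b] = None
            if len(cache) > qd:
                cache.popitem(last=False)   # evict the least recently used block

    for lba, size in reads:
        total += 1
        wanted = range(lba, lba + size)
        hit = all(b in cache for b in wanted)
        hits += hit
        insert(wanted)                       # cache the demanded blocks
        if mode == "PA" or (mode == "PoM" and not hit) or (mode == "PoH" and hit):
            insert(range(lba + size, lba + size + prefetch))  # post-read only
            extra += prefetch
    return hits, total, extra

# Hypothetical usage: two nearby reads, the second hits thanks to PoM prefetch.
print(simulate([(1000, 16), (1016, 16)], mode="PoM", qd=128, prefetch=64))  # -> (1, 2, 64)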

As mentioned, a gap between requests is allowed for near-sequential streams. This gap can become a "hole" in the prefetched data, which is harmful to the overall performance. As shown in Figure 5-6, there are three types of holes:
  • Post hole: The first LBA of the incoming command has a distance (> 0) within a threshold to the last LBA of the queued commands in the cache.

  • Pre hole: The last LBA of the incoming command has a distance within a threshold to the first LBA of the queued commands in the cache.

  • Pre and post hole: Both post and pre holes exist.

Figure 5-6. Hole types

To formally define the hole, consider the following two cases:
  • Constant: If the range between two close regions in the cache is less than the constant Hs blocks, this range is viewed as a hole.

  • Adaptive: The hole size is related to the DRAM and SSD cache (most likely monotonically increasing in the total/remaining cache size):

    $$ {H}_S=\begin{cases}\underline{H}_S, & \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor <\underline{H}_S\\ {\overline{H}}_S, & \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor >{\overline{H}}_S\\ \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor, & \text{otherwise}\end{cases} $$

    where S_T and S_R are the total and remaining cache size, respectively, and a_T and a_R are coefficients for S_T and S_R. An example for 64MB of DRAM cache is defined as

    $$ {H}_S=\begin{cases}16, & \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor <\underline{H}_S\\ 256, & \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor >{\overline{H}}_S\\ \left\lfloor {a}_T{S}_T+{a}_R{S}_R\right\rfloor, & \text{otherwise}\end{cases} $$

    where $$ {a}_T=\frac{1}{2^{13}}, {a}_R=\frac{15}{2^{13}} $$. The actual $$ \underline{H}_S $$ and $$ {\overline{H}}_S $$ are decided by the trace, for example the median request size within a time window. More specifically, consider whether the hole lies in the same track/cylinder.
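As a small worked example of this adaptive threshold (using the 64 MB DRAM coefficients above; the remaining-size figure is made up), the following Python sketch evaluates H_S.

# Illustrative computation of the adaptive hole-size threshold H_S defined above.
def hole_size(total_blocks, remaining_blocks,
              a_t=1 / 2**13, a_r=15 / 2**13, h_min=16, h_max=256):
    """Clamp floor(a_T*S_T + a_R*S_R) into [h_min, h_max] (sizes in blocks)."""
    h = int(a_t * total_blocks + a_r * remaining_blocks)   # floor for positive values
    return max(h_min, min(h_max, h))

# Hypothetical example: a 64 MB cache (131072 blocks of 512 B) with 20000 blocks free.
print(hole_size(total_blocks=131072, remaining_blocks=20000))   # -> 52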

A background task should be implemented to monitor the LBA regions in the cache and the hit density of these regions. This process also looks for gaps in regions of data with some level of hit density and generates self-induced read commands to fill these holes with data from the main store.

To connect the cache algorithm to the trace properties, use the following hypothesis:
  • If two regions of a certain range within a time-frame have a high enough correlation, the gap between them is likely to be accessed.

  • The prefetch should not affect the overall performance much, so the benefit gained from the additional cache hit ratio should be larger than the additional cost due to prefetch.

  • The upper bound of the hole size to be fetched is decided by multiple factors, such as the total/remaining cache size, workload size, cache type, access time, etc.

Now we have some questions to answer.
  • What is the benefit of a hole filling policy? The key is to get the increased cache hit ratio (hit ratio with hole filling vs. hit ratio without hole filling).

  • What is the additional cost of a hole filling policy? The key is to find how many self-induced commands are generated to fill the holes. The timing of the user workload must be considered.

  • Since hole filling and the prefetch policy are similar, is it possible to merge hole filling into the prefetch policy (integration)? The key is to find the overlapping cache hits between the two policies; if the overlap rate is high, prefetch may include hole filling as part of it.

  • When and where to apply the two policies (or integrated policy) with balanced benefit and cost? The key is to reduce the additional mechanical access cost and cache pollution.

Workload Properties

Let’s look at the think time and completion time first, as shown in Figures 5-7 and 5-8, respectively. You can see the large difference in their distributions across the various applications. Table 5-9 further gives the mean and standard deviation values together with IOPS. Figure 5-9 and Table 5-10 provide the size distribution. You may also find the relation between size and completion time.
Figure 5-7. PCMark: Think time distribution

Figure 5-8. PCMark: Completion time

Table 5-9

PCMark: Think/Completion Time (ms) and IOPS

App   IOPS     Think-time mean   Std.    Completion-time mean (blk)   Std.
1     120.57   8.29              46.04   3.43                         5.49
2     189.29   5.28              9.20    10.24                        47.4
3     48.70    20.54             21.25   1.99                         3.44
4     311.30   3.21              8.25    3.59                         5.84
5     122.47   8.17              16.68   0.76                         2.64
6     59.56    16.79             22.58   1.2                          3.22
7     145.65   6.87              14.87   2.78                         5.86
8     204.29   4.90              16.37   16.41                        29.87

Figure 5-9. PCMark: Size distribution

Table 5-10

PCMark: Size Distribution (Blocks)

App   Mean (all)   Std.     Mean (read)   Std.    Mean (write)   Std.
1     98.67        87.55    102.78        87.4    10.22          7.3
2     86.92        56.39    87.36         56.24   8              0
3     175.09       82.28    175.33        82.1    8              0
4     71.92        54.86    83.12         49.32   20.48          49.35
5     60.77        63.68    59.18         25.42   62.69          90.43
6     173.36       115.82   255.16        13.43   131.29         122.52
7     43.01        61.43    60.39         70.63   21.89          38.54
8     36.23        37.58    41.31         37.47   20.04          33.11

In terms of sequential streams, you can observe some obvious sequences in general. Importing pictures is the most sequential, while application loading is the least. Except for application loading (mixed streams), a queue length of 10 is good enough to detect most streams; application loading has relatively more interleaved sequential streams. Also note that for read-only cases, most streams have only two commands. In Apps 2, 3, and 4, the average stream length is longer than in the others, which provides more chances for the infinity read mode.
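The stream counts above come from a simple detector; the exact tool is not shown here, but a minimal Python sketch along these lines (my own illustration, with an arbitrary gap threshold and the queue length of 10 mentioned above) would look like this.

# Illustrative detector of (near-)sequential read streams within a bounded queue.
def count_streams(cmds, queue_len=10, max_gap=0):
    """cmds: iterable of (lba, size) read commands in arrival order.
    A command extends an open stream if its start LBA is within max_gap blocks
    of that stream's end; otherwise it opens a new stream. Only the last
    queue_len open streams are kept, mirroring a bounded detection queue.
    Returns the lengths (in commands) of all detected streams."""
    open_streams = []      # list of [next_expected_lba, length], newest last
    finished = []
    for lba, size in cmds:
        for s in reversed(open_streams):
            if 0 <= lba - s[0] <= max_gap:     # continues this stream
                s[0], s[1] = lba + size, s[1] + 1
                break
        else:
            open_streams.append([lba + size, 1])
            if len(open_streams) > queue_len:  # oldest stream falls out of the queue
                finished.append(open_streams.pop(0)[1])
    return finished + [length for _, length in open_streams]

# Hypothetical example: two interleaved sequential streams of 3 and 2 commands.
trace = [(100, 8), (5000, 16), (108, 8), (5016, 16), (116, 8)]
print(count_streams(trace))    # -> [3, 2]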

Gain-Loss Analysis

For a cache policy such as hole filling or prefetch, the gain comes from the increased hit ratio, while the loss comes from the additional disk accesses caused by the policy (e.g., pre-/post-reads attached to user commands, and self-induced non-attached commands).

Two methods can be used to analyze the gain and loss. One simple method considers frequency only: it counts the hit frequency and the additional disk access frequency. The cost of an attached pre-/post-read is generally smaller than that of a non-attached, self-induced command. If the hit ratio increases thanks to the self-induced commands, the gain is large; otherwise, it is small. This method provides only a rough estimate. The other method is the so-called mixed simulation and modeling approach: it uses simulation to get the hit ratio and the additional command frequency, while using an analytical model to obtain the average response time, which is relatively complex but quantitative. Let's only consider the first one.

Let’s define gain and loss quantitatively:
  • Gain: The additional cache hit ratio obtained (a2) = the cache hit ratio at hole length x (a3) - the cache hit ratio without hole filling (a0)

  • Loss: The number of additional self-induced commands divided by the total number of read commands (a1)

  • Gain-loss ratio a = a2/a1: The higher, the better (a small computational sketch follows this list). To estimate a, there are a few basic considerations:
    • The queue depth (Qd) in the cache, which has a significant influence on the LRU algorithm. By observing the largest a vs. Qd, you can find the optimal Qd; the optimal Qd may differ between applications.

    • The time interval (dT) between commands: Only if the interval is larger than a certain value does the system have a chance to handle a self-induced command; thus, find the ratio a for different dT.

    • The hit frequency of each hole: When using a correlation method to select the holes to fill, the hit frequency must be higher than 1. If the hit frequency is generally larger than 1, this method is meaningful; otherwise, it is not.
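The sketch below (illustrative only; all counter values are hypothetical) shows how a2, a1, and a are obtained from simulation counters.

# Illustrative gain-loss computation from simulation counters.
def gain_loss(hits_with_filling, hits_without, self_induced_cmds, total_reads):
    """a2 = additional hit ratio, a1 = self-induced commands per read, a = a2/a1."""
    a3 = hits_with_filling / total_reads     # hit ratio with hole filling
    a0 = hits_without / total_reads          # hit ratio without hole filling
    a2 = a3 - a0
    a1 = self_induced_cmds / total_reads
    return a2, a1, (a2 / a1 if a1 else float("inf"))

# Hypothetical counters for one application trace:
print(gain_loss(hits_with_filling=620, hits_without=540,
                self_induced_cmds=650, total_reads=6455))
# -> a2 ~= 0.012, a1 ~= 0.101, a ~= 0.123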

Let’s start to verify whether a self-induced hole filling policy is beneficial. Take the following steps:
  1. Check a1, a2, and a for all read commands (write commands also fill the cache).

  2. If a1 is much larger than a2 (e.g., a < 0.2), hole filling is not economical; then check whether a2 is too small to be worthwhile. Otherwise, hole filling is useful; then check selective self-induced commands based on the workload.

  3. If the time interval between two commands is large enough (e.g., > 15ms), the self-induced command may not cost additional resources. If so, any increased cache hit ratio is beneficial. Repeat Steps 1-2.
Table 5-11 provides the results, where Qd = queue depth and fs = filling hole size. You can see that App 2 is the most significant for hole filling, while App 6 is the least significant. Hole filling is generally suitable for Apps 2, 3, 5, and 8. If you only consider the commands with a think time larger than 10ms, you have the results listed in Table 5-12. You can see that the general trend is similar to the case without time-interval constraints. App 5 is the most significant for hole filling for fs = 128 and App 3 for fs = 256, while App 6 is the least significant. In sum, a is generally smaller than 1 for a short queue depth. When the hole size is very small, a is very small (<< 1). When both the queue length and hole size are short, a hole filling policy is generally useless. Based on the value of a, you can say that this policy is generally useless for App 6 and may be useful for Apps 2, 5, and 8. A further conclusion should be made after comparing with the prefetch policy.
Table 5-11

Gain-Loss Analysis for Hole Filling

Qd = 128; fs = 128
      1      2      3      4      5      6      7      8
a2    0.026  0.024  0.007  0.001  0.049  0.000  0.051  0.079
a1    0.328  0.060  0.041  0.183  0.099  0.020  0.293  0.192
a     0.078  0.399  0.165  0.007  0.494  0.000  0.175  0.413

Qd = 128; fs = 256
      1      2      3      4      5      6      7      8
a2    0.045  0.044  0.017  0.003  0.061  0.000  0.109  0.118
a1    0.362  0.058  0.028  0.225  0.106  0.026  0.265  0.235
a     0.124  0.760  0.620  0.015  0.580  0.000  0.412  0.505

Qd = 256; fs = 128
      1      2      3      4      5      6      7      8
a2    0.029  0.034  0.007  0.003  0.048  0.000  0.051  0.102
a1    0.338  0.065  0.042  0.183  0.099  0.031  0.295  0.206
a     0.084  0.520  0.160  0.014  0.490  0.000  0.174  0.493

Qd = 256; fs = 256
      1      2      3      4      5      6      7      8
a2    0.048  0.057  0.017  0.006  0.062  0.000  0.110  0.154
a1    0.375  0.061  0.030  0.225  0.104  0.042  0.282  0.242
a     0.128  0.930  0.590  0.027  0.593  0.000  0.390  0.637

Table 5-12

Gain-Loss Analysis for Hole Filling (Constrained)

Qd = 128; fs = 128
      1      2      3      4      5      6      7      8
a2    0.015  0.023  0.006  0.001  0.023  0.000  0.031  0.058
a1    0.205  0.043  0.016  0.025  0.028  0.009  0.174  0.146
a     0.073  0.548  0.364  0.026  0.811  0.000  0.176  0.396

Qd = 128; fs = 256
      1      2      3      4      5      6      7      8
a2    0.026  0.035  0.015  0.002  0.034  0.000  0.050  0.094
a1    0.252  0.045  0.006  0.041  0.037  0.014  0.196  0.184
a     0.105  0.778  2.389  0.060  0.923  0.000  0.253  0.512

Qd = 256; fs = 128
      1      2      3      4      5      6      7      8
a2    0.018  0.033  0.006  0.001  0.023  0.000  0.031  0.080
a1    0.209  0.047  0.016  0.027  0.029  0.019  0.189  0.162
a     0.088  0.697  0.364  0.025  0.773  0.000  0.162  0.495

Qd = 256; fs = 256
      1      2      3      4      5      6      7      8
a2    0.031  0.049  0.015  0.002  0.035  0.000  0.050  0.127
a1    0.259  0.048  0.006  0.042  0.039  0.025  0.211  0.196
a     0.119  1.028  2.389  0.058  0.908  0.000  0.235  0.648

Next, let’s consider the gain and loss for the prefetch policy. Let’s define the terms:
  • Prefetch: Includes post-read and pre-read. Post-read means an additional size is attached after the last LBA; pre-read means an additional size is attached before the first LBA. For example, assume that the original command is to read LBA 10-20 and the pre-read size is 8. Then the extended command is to read LBA 2-20 (see the small sketch after this list).

  • Gain: The additional cache hit ratio obtained (b2) = the cache hit ratio at fetch length x (b3) - the cache hit ratio without prefetch (b0)

  • Loss: The number of prefetch commands divided by the total number of read commands (b1)

  • Gain-loss ratio b = b2/b1: The higher, the better
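A tiny Python sketch of the command extension just described (illustrative only; LBAs are treated as an inclusive block range, as in the LBA 10-20 example).

# Extend a read command with pre-read and/or post-read blocks (illustrative).
def extend(first_lba, last_lba, pre=0, post=0):
    """Return the (first, last) LBA range of the prefetch-extended command."""
    return max(0, first_lba - pre), last_lba + post

# The example from the text: read LBA 10-20 with a pre-read size of 8 -> LBA 2-20.
print(extend(10, 20, pre=8))            # -> (2, 20)
# A post-read of 64 blocks on the same command -> LBA 10-84.
print(extend(10, 20, post=64))          # -> (10, 84)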

For this case, let's also consider the number of sequential streams and each stream's length, besides the queue depth and the time interval. The procedure is similar to the previous case. Table 5-13 shows the result. You can see that App 5 is the most significant for post-read, while App 6 is the least significant. b is generally smaller than 1 for a short queue depth. When the prefetch size is very small (< 64), b is very small (<< 1). When the queue length and fetch length are long, a prefetch policy is generally useful. Based on the value of b, you may say that this policy is generally useless for App 6, most significant for App 5 with post-read, and possibly useful for Apps 2, 4, 7, and 8.
Table 5-13

Gain-Loss Analysis for Prefetch Policy

Qd = 128; fs = 128
      1      2      3      4      5      6      7      8
b2    0.179  0.140  0.063  0.119  0.462  0.000  0.220  0.224
b1    0.818  0.859  0.937  0.880  0.528  1.000  0.769  0.764
b     0.219  0.163  0.067  0.136  0.875  0.000  0.286  0.294

Qd = 128; fs = 256
      1      2      3      4      5      6      7      8
b2    0.307  0.417  0.197  0.383  0.607  0.000  0.398  0.321
b1    0.690  0.582  0.803  0.616  0.382  1.000  0.591  0.668
b     0.444  0.717  0.245  0.621  1.589  0.000  0.674  0.480

Qd = 256; fs = 128
      1      2      3      4      5      6      7      8
b2    0.183  0.147  0.063  0.120  0.461  0.000  0.220  0.239
b1    0.814  0.851  0.937  0.879  0.527  1.000  0.769  0.748
b     0.225  0.173  0.067  0.137  0.875  0.000  0.286  0.319

Qd = 256; fs = 256
      1      2      3      4      5      6      7      8
b2    0.310  0.427  0.197  0.384  0.607  0.000  0.398  0.342
b1    0.687  0.572  0.803  0.615  0.381  1.000  0.591  0.645
b     0.452  0.746  0.245  0.625  1.593  0.000  0.674  0.530

You may also analyze the relationship between post-read and sequential streams by counting 1) the number of streams (c) and the ratio c1 = c / total number of reads; 2) each stream's hit frequency c2; and 3) the cache hit ratio as b2 (considering post-read only, with b2 redefined accordingly). This test helps check whether the infinity sequential read takes effect; for example, only when the sequence length is long enough does the cache enter infinity mode. In infinity mode, post-read is automatic. However, if the cache is not in infinity mode, you still need to consider the benefit of post-read after detecting a two-command sequential stream, i.e., the possibility of three or more commands being attached to the stream.

You can further find the relation between prefetch and hole filling by defining
  • Gain: The additional cache hit ratio obtained (d2) = the cache hit ratio at fetch length x and hole length y (d3) - the cache hit ratio without prefetch and hole filling (d0)

  • Loss: The number of additional commands divided by the total number of read commands (d1)

  • Gain-loss ratio d = d2/d1: The higher, the better

Two comparison methods are used here:
  • Two queues, one for prefetch (Q1) and the other for hole filling (Q2), both under LRU. Consider the overlapping hit ratio, i.e., whether a command that hits in Q1 also hits in Q2, and vice versa.

  • Two queues, one for prefetch (Q1) and the other for hole filling + prefetch (Q3), both under LRU. Consider the additional hit ratio of Q3 over Q1.

Define x0 as the number of hits without the prefetch and hole filling policies, x̄1 = x1 + x0 as the number of hits with prefetch, x̄2 = x2 + x0 as the number of hits with hole filling, and x̄3 = x3 + x0 as the number of hits with combined prefetch and hole filling. Now, if x0 is much larger than x1 and x2, there is not much difference between prefetch and hole filling; otherwise, check the difference between x1 and x2. If x1 is much larger than x2, the benefit of prefetch is much larger than that of hole filling; otherwise, it is reasonable to do hole filling instead of prefetch. If x3 is much larger than x1, hole filling gains additional benefit on top of prefetch; otherwise, adding hole filling to prefetch is not beneficial.

Without post-read fetch, you can observe that x̄1 is generally higher than x̄2, and that x1 and x0 are generally larger than x2. Except for Apps 5 and 6, x2 is generally smaller than 1% (Apps 4 and 8 are around 1% when the queue length and fetch size are 256). This means that if post-read takes effect, the hole filling's influence is very small. The overlap between the two policies (post only) is much higher than x1 and x1 + x2; x0/(x1 + x2) is generally larger than 10 when the prefetch size is over 128 (various prefetch sizes; fixed hole filling size of 256). Now the key question is whether it is worth doing the hole filling for the additional 2% (queue length = 128, fetch size = 256/128) cache hit ratio at the cost of additional self-induced accesses (~10%).

Observe that b1 is generally larger than a1. However, the cost of each prefetch is generally smaller than that of a self-induced hole filling command. Note that this direct comparison may be unfair. An indirect method is to estimate the average service time for the prefetch and hole filling policies (WCD): for prefetch, define the response time as write ratio × write access time + read ratio × (b3 × cache access time + (1 - b3) × read access time); for hole filling, define the response time as write ratio × write access time + read ratio × (a3 × cache access time + (1 - a3) × read access time). However, I omit the details here.
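Written compactly (my own notation, not from the original text): let r_w and r_r be the write and read ratios and t_w, t_r, t_c the write, read, and cache access times; then

$$ T_{\text{prefetch}} = r_w t_w + r_r\left[b_3 t_c + (1-b_3) t_r\right],\qquad T_{\text{hole}} = r_w t_w + r_r\left[a_3 t_c + (1-a_3) t_r\right] $$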

In sum, you can see that the hole filling policy can improve the cache hit ratio by introducing self-induced background filling commands.
  • The benefit (increased hit frequency) to cost (additional self-induced commands) ratio of (post) hole filling is generally less than 1 if any hole within a certain distance is filled; the average is 0.47 (queue depth = 256; hole size = 256 blks).

  • The benefit-to-cost ratio of prefetch (post-read) is generally less than 1 if any non-hit command is prefetched with a certain prefetch size; the average is 0.68 (queue depth = 256; prefetch size = 256 blks), which is better than a (post) hole filling policy, considering that some hole filling commands may need more resources than prefetch commands.

  • If a (post) hole filling policy is further applied on top of the prefetch policy, the additional benefit/cost ratio is generally less than 0.2 (queue depth = 256; prefetch size = 256 blks; hole size = 256 blks). Unless a well-designed hole filling policy (e.g., based on data correlation) is applied, it is not very useful for overall performance improvement.

Due to the similarity of hole filling and prefetch, data-correlation methods may be applied to both; thus, the two policies might be merged.