© Jun Xu 2018
Jun Xu, Block Trace Analysis and Storage System Optimization, https://doi.org/10.1007/978-1-4842-3928-5_2

2. Trace Characteristics

Jun Xu1 
(1)
Singapore, Singapore
 

A trace is usually classified into three levels: block level, file level, and object level. They share many common metrics, although each has its own unique properties. In this chapter, I will discuss block-level trace metrics in detail, since the block-level trace provides more fundamental information on storage systems than the other two levels. For simplicity of presentation, I divide the metrics into two categories: the basic ones and the advanced ones. The meanings and performance impacts of these metrics are explained in detail.

Workload Properties

A workload can be collected and viewed at different abstraction levels. Usually, there are three, as shown in Figure 2-1 [37]. The functional view indicates users’ behaviors; it aims to facilitate comparison between, say, a relational database system and a MapReduce system that serves the equivalent functional aims of some enterprise data warehouse management workload. It enables a large range of equivalent systems to be compared, but it lacks tracing capabilities for large-scale, data-centric systems. The system view captures workload behavior at the highest level of abstraction that we can currently trace in large-scale, data-centric systems. In MapReduce, for example, this translates to the job stream and job characteristics; for enterprise network storage, it is the data access stream at the application, session, file, and directory levels. The physical view describes a workload in terms of hardware component behaviors, such as the CPU, memory, disk, and network activities during workload execution. It depends on hardware, software, and even configuration changes.
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig1_HTML.png
Figure 2-1

Typical workload abstraction level

The basic workload metrics are generally explicit: they are easily observed directly or calculated with very simple formulas. Some examples are listed in Table 2-1. The advanced ones, however, are implicit and formulated with relatively complex forms/rules. See Table 2-2 for a few advanced metrics. More details are described in the next two sections of this chapter.
Table 2-1

Examples of Basic Metrics

Read to write ratio: The ratio between read and write operations, in command count or total size.

Request size distribution: Request size, usually further counted separately for total, read, and write.

LBA range/randomness: In a virtual volume, the LBA can be used to represent the randomness in space. Otherwise, specify the device number.

Inter-arrival time: Request arrival rate, idle time, busy time.

With system-level information, queue depth, average response time, bus time, etc. can also be included.

Table 2-2

Examples of Advanced Metrics

Spatial locality: A small distance between two requests means that soon after referencing ri you find a reference to the nearby address rj.

Temporal locality: The same address is referenced again after d steps. A concept closely related to temporal locality is popularity.

Read/write dependency: Three categories: true (read on write), output (write on write), and anti (write on read).

Priority-related metrics: Request priority determines the execution sequence; this also includes the properties of synchronous vs. asynchronous.

In most cases, cache-related, cluster-related, and/or network-related metrics should also be included.

Note that the meaning of “block” differs between HDD and SSD. Here, one block is defined as one HDD sector. In addition, there are several sector formats, such as 512, 512e, 4k, 4kn, etc. Unless otherwise noted, assume that one sector equals 512 bytes for simplicity of representation.

A block-level trace usually contains some common fields, like arrival time, completion time, LBA (first or last LBA), request size, operational mode (read or write, sync or async), etc. Other fields, like bus time and merge time, depend on the trace collection tool. Table 2-3 gives an example of a trace. For the ith request ri, denote its arrival and completion times as Di and Ci, respectively. Usually, all N requests are arranged in sequence of Di, i.e., (r1, . . . , ri, . . . , rN).
Table 2-3

An Example of Block Trace

Order   Arrival time (s)   Completion time (s)   First LBA   Size   Mode
1       0.026216           0.027506              902553856   1024   0
2       0.026790           0.027719              197306368   8      0
3       0.036680           0.039502              902554880   1024   0
4       0.039618           0.044770              197306368   16     1
5       0.039654           15.079936             197306368   16     1
6       0.044542           0.046394              902555904   1024   0
7       0.044865           0.046513              197306376   8      0
8       0.054996           0.055265              2657576     8      0
9       0.059638           0.059905              197306376   16     0
10      0.081950           0.083162              902556928   1024   0
11      0.089740           0.089960              197306384   8      1
12      0.092741           0.093955              902558976   1024   0
13      0.093261           0.095268              902557952   1024   0
14      0.112958           0.114461              902560000   1024   0
15      0.113097           0.115717              902561024   1024   0
16      0.114820           0.115926              197306384   8      0
17      0.135434           0.136744              902562048   1024   0
18      0.136436           0.136963              197306384   16     1
19      0.150173           0.151625              902563072   1024   0
20      0.150260           0.152809              902564096   1024   0

Basic Metrics

Basic metrics are usually simple and easy to observe. But this does not mean they contain less information than advanced metrics; in some cases, basic metrics are good enough to interpret the workload properties.

LBA Distribution

From the curve of LBA distribution vs. time, you can easily observe the randomness/sequentiality of IO access. Figure 2-2 gives two examples of the LBA distribution of write-only requests. The left plot shows a sequential access from LBA 6.7 ∗ 10^8 to 6.8 ∗ 10^8. The right plot mixes sequential and random write accesses. Figure 2-3 shows an IO pattern that combines random read accesses with mixed write accesses. It is clear that the write access is more sequential than the read access, as the write access contains several sequential streams. Note that sometimes a small gap may exist between two consecutive requests, even though they look sequential in the plot. These near-sequential IO patterns can still be accessed sequentially in most cases.
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig2_HTML.jpg
Figure 2-2

LBA distribution from a Ceph node

../images/468166_1_En_2_Chapter/468166_1_En_2_Fig3_HTML.jpg
Figure 2-3

LBA distribution from a Hadoop node

Size Distribution

In general, a drive (whether HDD or SSD) favors sequential access, so sequential access has much higher throughput than random access, particularly for HDD. The request size also matters, especially when the disk queue is short, as clearly illustrated in Figures 4-2 and 4-3 in Chapter 4.

Figure 2-4 plots the request frequency and CDF (cumulative distribution function) vs. request size. For the distribution of write requests, you can see that the percentage of requests with size 1024 blocks is almost 50% in this case, which usually means the IO pattern is dominated by large requests. Note that due to OS/FS and disk drive limitations, the maximum request size is usually 1024 blocks, so even if the user requests a 1MB file, it will be divided into two IO requests internally (assuming 512 bytes per block).
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig4_HTML.jpg
Figure 2-4

Size distribution from a Hadoop node

If you further need to know the size distribution with respect to (wrt) the LBA range, you can create plots like Figure 2-5, from which you can learn more about the hot range together with size information. Since the transfer speed at different locations in an HDD (e.g., ID vs. MD vs. OD) differs, the LBA together with the request size can roughly indicate the relative speed.
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig5_HTML.jpg
Figure 2-5

Combined LBA and size distribution

Read/Write Distribution

Read and write accesses may perform differently on a drive. For HDD, the difference may be slight. However, for SSD, especially the consumer class, the gap can be large. In addition, read and write accesses may depend on each other. For example, when a new write has just completed, the data may still be in the cache, so an immediate read may fetch the data directly from the cache instead of triggering a media operation. Thus, visualizing the read/write distribution over time provides an intuitive view of the dependency. You can plot the data in a single figure or in two separate figures, as in Figure 2-3.
Table 2-4

Required Fields for Metrics Calculation

Metric                               Arrival time   Completion time   LBA   Request size   Operation mode
LBA distribution                     Y (or)         Y (or)            Y     -              -
Size distribution                    -              -                 -     Y              -
IOPS                                 Y              Y (or)            -     -              -
Throughput                           Y              Y (or)            -     Y              -
Queue length/depth                   Y (or)         Y (or)            -     -              -
Busy/idle time                       Y              Y                 -     -              -
Read/write distribution              Y (or)         Y (or)            -     -              Y
Inter-arrival time distribution      Y              -                 -     -              -
Inter-completion time distribution   -              Y                 -     -              -

Inter-Arrival and Inter-Completion Time

Inter-arrival time is defined as the time interval between two sequentially arriving requests: δti = Di − Di−1. Similarly, inter-completion time is defined as δt̄i = C̄i − C̄i−1, where C̄i is Ci reordered by completion time. δti is a direct indicator of the workload burstiness [38], together with the average completion time of requests. When the average δti is much smaller than the average δt̄i, it usually means the system is under a stressed workload that may be beyond the storage device’s capability. In a sense, these two indicators are closely related to IOPS.
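The two definitions above can be sketched in a few lines of Python (the function names and the use of plain lists are illustrative, not from any particular trace tool):

```python
def inter_arrival_times(arrivals):
    """delta_t_i = D_i - D_{i-1}, with requests already ordered by arrival time."""
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

def inter_completion_times(completions):
    """delta_tbar_i = Cbar_i - Cbar_{i-1}, where Cbar is C re-sorted by completion time."""
    ordered = sorted(completions)
    return [b - a for a, b in zip(ordered, ordered[1:])]

# First three requests of Table 2-3
print(inter_arrival_times([0.026216, 0.026790, 0.036680]))
print(inter_completion_times([0.027506, 0.027719, 0.039502]))
```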

IOPS and Throughput

IOPS is usually defined as the IO number ∆n per unit time ∆t seconds: $$ IOPS=\frac{\varDelta n}{\varDelta t} $$. The unit time is preset by users, perhaps 1 second. Similarly, throughput is the request size ∆s per unit time: $$ TP=\frac{\varDelta s}{\varDelta t} $$. Note that both are, in a sense, average values within a given unit time/time window. For different ∆t, IOPS and throughput may have different values; usually, a larger ∆t leads to a smoother curve. Figure 2-6 shows an example with data collected from an HDD node of a Ceph cluster. You can see that the IOPS ranges from 60 to 160 when ∆t=1 second, while it is 80–120 when ∆t=6 seconds. In particular, when the workload contains many bursts, the maximum and minimum IOPS values may vary greatly for different ∆t.
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig6_HTML.png
Figure 2-6

Average IOPS and throughput with different time window

The curves of these two metrics vs. time can be used to observe workload bursts visually. However, choose ∆t carefully: a too-large ∆t smooths the curve but may hide the IO bursts. A practical choice is related to the average inter-arrival time, so ∆t may be a few times the average inter-arrival time, but not too many.

Alternatively, you can fix ∆n. For example, you may set ∆n=20 and let ∆t vary.

A more complex setting lets both ∆t and ∆n vary. For example, take ∆n ≤ 10 and ∆t = 1 second as the constraint. If ∆n stays within the range during ∆t = 1 second, average over the window. Otherwise, if ∆n > 10, choose a time window in which ∆n = 10, and then average.

Another subtle issue when drawing the curve is the time reference: the average value can be assigned to the beginning, middle, or end of ∆t. For example, if you choose the end of ∆t, you may have the following formulations:
  • IO(tj) = the total number of IOs whose arrival time Di falls within the interval [tj−1, tj]

  • R(tj) = the total request size of the IOs in IO(tj)

  • Average IOPS at tj = IO(tj)/∆t

  • Average request size at tj = R(tj)/IO(tj)

  • Average throughput at tj = R(tj)/∆t

You can also apply moving-average techniques here, in which case two consecutive “average” windows overlap.
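The end-of-window formulation above can be sketched as follows (blocks are assumed as the size unit; the function and variable names are mine):

```python
import math

def windowed_stats(arrivals, sizes, dt):
    """Average IOPS and throughput per window of width dt, with each value
    assigned to the end of its window (the t_j convention above).
    arrivals: D_i in seconds; sizes: request sizes in blocks."""
    n_win = math.ceil(max(arrivals) / dt)
    io = [0] * n_win       # IO(t_j): number of arrivals in the window ending at t_j
    blocks = [0] * n_win   # R(t_j): total request size in the window
    for d, s in zip(arrivals, sizes):
        j = min(int(d / dt), n_win - 1)
        io[j] += 1
        blocks[j] += s
    iops = [c / dt for c in io]
    throughput = [b / dt for b in blocks]  # blocks per second
    return iops, throughput

iops, tp = windowed_stats([0.1, 0.2, 1.1], [8, 16, 8], dt=1.0)
print(iops, tp)  # [2.0, 1.0] [24.0, 8.0]
```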

Response Time

Response time generally combines the waiting (or queuing) time Tw and the access (or service) time Ta, i.e., Tres = Ta + Tw. The service time is the time spent in service, such as disk access time. The queuing time is the time a request waits in the queue before it is dispatched for actual execution. When the trace is collected from the external bus (IO driver), such as SATA/SAS, the response time of ri is calculated as Ci − Di.

Queue Length/Depth

There are at least two queues for an HDD. One is associated with the IO driver and is influenced by the OS and FS, which may reorder the IOs sent to the disk drive according to some scheduler. A trace collected externally only reflects the length of this IO driver queue. You can estimate the queue depth based on either the arrival time or the completion time of requests; the two estimates may differ slightly. Denote one queue’s arrival and completion times as Di and Ci in sequence.
  • IO driver queue depth just before command arrived (Qd1)
    • The queue depth just before Di (instant queue depth Q(Di)) = the number of requests whose C time >= Di and D time < Di

  • IO driver queue depth just after command completed (Qd2)
    • The queue depth just after Ci (instant queue depth Q(Ci)) = the number of requests whose C time >= Ci and D time < Ci

  • Average queue depth
    • Estimated average queue depth during a non-idle period: ∑i((Qd1+Qd2)/2 ∗ (Ci − Di))/∑i(Ci − Di)

    • Effective average queue depth in time interval: Sampling at ∆t seconds
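The Qd1 definition above can be sketched in a few lines; this minimal version checks each request against all others, which is fine for illustration (an interval tree would be used for large traces):

```python
def qd1(arrivals, completions):
    """For each request i, the instant queue depth just before D_i:
    the number of requests j with D_j < D_i and C_j >= D_i."""
    depths = []
    for di in arrivals:
        depths.append(sum(1 for d, c in zip(arrivals, completions)
                          if d < di and c >= di))
    return depths

# Requests 1-6 of Table 2-3
arr = [0.026216, 0.026790, 0.036680, 0.039618, 0.039654, 0.044542]
comp = [0.027506, 0.027719, 0.039502, 0.044770, 15.079936, 0.046394]
print(qd1(arr, comp))  # [0, 1, 0, 0, 1, 2], matching the Table 2-5 rows
```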

Figure 2-7 gives an illustration of Qd1 and Qd2, where the blue arrows show the IO requests in time order; it also marks a few Qd1 and Qd2 values for these requests. Table 2-5 further gives an example of how the queue depth is calculated based on Qd1. Figure 2-8 shows the estimated queue depth from a Hadoop node. The average queue depth looks quite high, which suggests a heavy workload. However, zoom into the plot to check the intervals between the queues; in this example, the scale of the x-axis is 100 instead of 12,000 to show more detail.
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig7_HTML.jpg
Figure 2-7

Illustration of queue length

Table 2-5

Estimated Queue Depth and Idle Time

Arrival time   Completion time   Queue depth   Idle time
0.00007        0.00036           0             -
0.01130        0.01157           0             1.09E-02
0.01134        0.01288           1             0
0.02622        0.02751           0             1.33E-02
0.02679        0.02772           1             0
0.03668        0.03950           0             8.96E-03
0.03962        0.04477           0             1.17E-04
0.03965        15.07994          1             0
0.04454        0.04639           2             0
0.04486        0.04651           2             0
0.05500        0.05526           1             0
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig8_HTML.jpg
Figure 2-8

Queue depth using arrival and completed time from a Hadoop node

The other queue is the HDD’s internal queue for actual execution with the mechanical parts. Scheduling algorithms such as SCAN and RPO (rotational positioning optimization) are applied here to minimize a given operation time. Vendor-specific commands (VSCs) can be used to collect the internal queue information. RPO tries to minimize the sum of the seek time and rotational latency of the requests in the queue. Theoretically, a longer queue gives the scheduler more candidates to select from, so the overall operation time can be reduced further (for pure random access). In practice, however, once the queue length reaches a certain level, performance no longer increases, as illustrated in Figures 4-2 and 4-3. Without VSCs, we can only use the external IO queue status to approximate the internal queue status. This is reasonable in a sense because an HDD can only handle one request at a time.1 However, for SSD, the situation is much more complex due to parallel processing.

Busy/Idle Time

When a disk is performing an IO operation (including both foreground and background activities), the disk is busy; otherwise, it is idle. These two metrics are useful when you intend to design background operations for disks, such as defragmentation, address remapping, sector scanning, etc. As some of these operations require over a hundred milliseconds to complete, the busy or idle time is estimated to show whether performance will be affected by background operations.

However, without a VSC to extract the drive’s internal trace log, you may not get the busy/idle time directly, so we can only approximate it from the external trace. The basic idea is to compare the completion times of the requests in the queue with the arrival times of new requests. If the completion time is later than the new arrival time, the drive is still busy at that arrival time; otherwise, the interval is counted as idle time (there may be some background activities inside the disk drive, but you can assume that user/external requests have higher priority than background activities). In other words, the calculation of idle time is closely related to the queue depth: only when the queue depth is zero is there a chance for an idle interval to exist.

Let’s consider the trace in Table 2-3. The completion time 0.027506s of the first request is later than the arrival time 0.026790s of the second request, so the disk is busy at 0.026790s. However, the completion time 0.027719s of the second request is earlier than the arrival time 0.036680s of the third request, so you may estimate that the drive has an idle interval of 0.036680 − 0.027719 ≈ 0.009s. Table 2-5 shows the result.
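The estimation just described can be sketched as follows; this is a simplification that only tracks the latest completion time seen so far (names are mine):

```python
def idle_intervals(arrivals, completions):
    """Estimate idle intervals from an external trace: the drive is idle between
    the latest completion seen so far and the next arrival, whenever that
    completion comes first. Requests are ordered by arrival time."""
    idles = []
    busy_until = completions[0]
    for d, c in zip(arrivals[1:], completions[1:]):
        if busy_until < d:                 # queue drained before this arrival
            idles.append(d - busy_until)
        busy_until = max(busy_until, c)
    return idles

# First three requests of Table 2-3: one idle gap of about 0.009 s
print(idle_intervals([0.026216, 0.026790, 0.036680],
                     [0.027506, 0.027719, 0.039502]))
```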

Figure 2-9 has the same data set as Figure 2-8. Although the queue depth is high in Figure 2-8, you can still observe many idle intervals for the disk.
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig9_HTML.jpg
Figure 2-9

Estimated idle time from a Hadoop node

Advanced Metrics

The workload from the real world is usually complex. Even for workloads claimed to be sequential, such as a Hadoop HDFS write or a Ceph BlueStore write, the actual block-level IO pattern mixes sequentiality and randomness, as well as reads and writes. The advanced metrics attempt to provide insight into these traces.

Sequence vs. Randomness

The sequentiality and randomness of a workload are somewhat subjective in some engineers’ view, as it is not easy to state clearly whether a mixed workload is more sequential or more random. In addition, the “feeling” of sequentiality differs from case to case. For example, requests with very small gaps between them may also be considered (near) sequential.

In my view, it can be quantified under different scenarios, given the scheduling algorithm, queue length, etc. That is, this property is objective once all the conditions are fixed. For example, consider a burst workload of write requests within a small LBA range. Some of the requests are actually contiguous in LBA, although they do not arrive in order. If a FIFO scheduling policy is applied, this workload is random, as the disk has to seek to a different location for each request. When a simple SCAN or RPO scheduling algorithm is applied, the requests are reordered and some become sequential, provided the queue is long enough. Assume that there are N requests, so there are up to N−1 seeks (in a strict definition, a request is considered random if it requires a seek in terms of LBA). Denote the number of random accesses as Nr and sequential accesses as Ns; you can then obtain a randomness-vs.-sequentiality ratio for a given workload under a fixed scenario.

Sequential and Near Sequential Streams

This metric directly indicates the degree of sequentiality of the workload. In general, a command is sequential when it comes after a similar command whose requested LBAs lie just prior to those of the new command. Hence, sequential reads follow read commands and sequential writes follow write commands, but there are some subtleties to this definition. A (strictly) sequential stream means that the current and previous commands are of the same type (r/w) and the new command’s starting LBA immediately follows the previous one’s last LBA. A near-sequential stream means that there must be an earlier command of the same type (read or write) whose LBA range ends before, and near, the start of the new command. Multiple sequential streams may interleave with each other. User settings affect these subtleties, allowing us to describe the variations of sequential command situations.

Queued sequence stream: In a time-ordered request stream, some requests are adjacent to each other (possibly within a time window), such that last_LBA(ri)+1 = first_LBA(rj), where i < j, Dj − Di < δt, and δt > 0 is an aging time.

In this way, j > i + 1 is possible because of interleaved sequence streams. In practice, however, a command queue with queue length N is required to identify the sequence, instead of only checking exactly adjacent requests. Once a new request enters the queue, the queue is searched for a sequence match. If a match is found, the request is merged into that entry; otherwise, it adds a new entry to the queue. If the queue is full, an entry is removed using a replacement rule (e.g., FIFO or LRU).

Generally, the larger N is, the larger the number of sequence requests detected (M2) in each stream, and the smaller the number of sequence streams (M1). The key is to find a large enough N such that the number of sequence streams detected is stable.
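The queue-based detection described above can be sketched as follows. This is a forward-scan simplification (it only extends a stream when a new request starts just after an existing entry, without re-sorting the queue), and all names are mine:

```python
def detect_streams(requests, qlen=8, gap=0):
    """Queue-based (near-)sequence stream detection sketch.
    requests: (first_lba, size) pairs in arrival order.
    qlen: stream-queue length N; gap: allowed LBA hole (0 = strictly sequential).
    Returns (request_count, total_blocks) per detected stream."""
    queue = []    # active entries: [last_lba, request_count, total_size]
    streams = []  # every stream ever opened (entries shared with queue)
    for first, size in requests:
        last = first + size - 1
        for entry in queue:
            # extend an existing stream if the new request starts just after it
            if entry[0] + 1 <= first <= entry[0] + 1 + gap:
                entry[0] = last
                entry[1] += 1
                entry[2] += size
                break
        else:
            entry = [last, 1, size]
            queue.append(entry)
            streams.append(entry)
            if len(queue) > qlen:
                queue.pop(0)  # FIFO replacement rule
    return [(s[1], s[2]) for s in streams]

# The eight requests of Table 2-6, as (first_lba, size)
reqs = [(128, 8), (0, 8), (64, 64), (1536, 512),
        (256, 256), (8, 56), (640, 8), (512, 768)]
print(detect_streams(reqs))
```

Because this sketch never merges backwards (request 3 cannot absorb the already-seen request 1), it finds fewer sequential pairs than the full LBA reordering discussed later in the Table 2-6 example.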

Queued sequence stream with constraint: In practice, a large block request S (e.g., S ≥ 1024) is also counted as a sequential request. Hence, an alternative method when considering stream i is to examine the total size Si of the stream together with M2: if Si ≥ S and M2 ≥ M, then i is considered a sequence stream, where S and M are thresholds.

Queued near-sequence stream: Finally, a small gap may exist between LBA requests, such that in a time-ordered request stream, some requests are near each other within a time window. Once a new request is considered near-sequential to the previous one, the stream’s last LBA is updated to the new request’s last LBA (the hole is filled for simplicity), so last_LBA(ri)+1 ≤ first_LBA(rj) and last_LBA(ri)+δd ≥ first_LBA(rj), where i < j and 0 < Dj − Di < δt. δd > 0 is a small integer in blocks, such as 64. A command queue is required to detect the interleaved streams, and the size constraints are also applicable to near-sequence streams. See Figure 2-10.
../images/468166_1_En_2_Chapter/468166_1_En_2_Fig10_HTML.jpg
Figure 2-10

An example of near sequence stream

Let’s look at the example in Table 2-6. Assume that these requests arrive within a t-second time window. It is obvious that all requests are random under FIFO, by comparing the last LBA of the previous request with the first LBA of the current request: there are seven seeks, Nr = 8, and Ns = 0. In this sense, the workload is random because there is no sequential stream. However, if you reorder the requests with the aid of a queue in terms of LBA, giving the request order <2 6 3 1 5 8(7) 4>, you have Nr = 3 and Ns = 4 in terms of (strictly) sequential streams. Note that request 7 is absorbed by request 8 during the reordering. So M2 = 2 (<2 6 3 1> and <5 8(7)>) for the (strictly) sequential streams. With the two constraints S = 1024 and M2 = 2, you have <5 8(7)> and <4>; the stream <2 6 3 1> is removed due to its small size.

When considering a near-sequential stream with ∆d = 64 and ∆d = 512, you have M2 = 3 (<2 6 3 1>, <5 8(7)>, and <4>) and M2 = 1 (<2 6 3 1 5 8(7) 4>), respectively, under the strictly sequential stream definition.
Table 2-6

An Example of LBA Sequence and Randomness

Order   First LBA   Last LBA
1       128         135
2       0           7
3       64          127
4       1536        2047
5       256         511
6       8           63
7       640         647
8       512         1279
From this example, define the following variables:
  • Ns1: The number of requests with sequential access

  • Ns2: The number of requests with near sequential access and LBA gap

  • Nr : The number of the remaining requests

These variables have a strong connection with the logical seek distance defined in the next section.

The practical gap size ∆d in a real application is usually determined by the system performance tolerance. Usually, it is up to a few track sizes. For example, if the acceptable tolerance is 5ms, then ∆d can be up to 1536 blocks for a 10K RPM drive model with a track/head switch time of 1ms and an average track size of 1.5MB. For random access, the average rotational latency is 3ms. Assume a 50% chance that the new request is in the same track and a 50% chance that it is in the next track (the actual probability is coupled with ∆d). Then you have 5 − (3+1) × 50% = 3ms left for further rotation, which corresponds to about half a track, so 1536 blocks.2
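The arithmetic above can be checked directly (all constants are the ones assumed in the example):

```python
rpm = 10_000
rev_ms = 60_000 / rpm                  # 6 ms per revolution
avg_rot_latency_ms = rev_ms / 2        # 3 ms average rotational latency
switch_ms = 1.0                        # track/head switch time
track_bytes = 1.5 * 2**20              # average track size, 1.5 MB
tolerance_ms = 5.0                     # acceptable performance tolerance

# 50% same-track / 50% next-track assumption from the text
budget_ms = tolerance_ms - (avg_rot_latency_ms + switch_ms) * 0.5  # 3 ms left
frac_of_track = budget_ms / rev_ms                                 # half a track
delta_d_blocks = int(frac_of_track * track_bytes / 512)
print(delta_d_blocks)  # 1536
```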

Sequential ratio: Because of the different views on sequentiality, the definition of the sequential ratio also varies. A few examples are listed below.
  • DEF1: The summation of sequential commands detected with S and M2 thresholds / total commands

  • DEF2: (The summation of sequential commands detected with S and M2 thresholds - total sequential streams) / total commands

  • DEF3: The summation of the request size of sequential commands detected with S and M2 thresholds / total request size of all commands

  • DEF4: (The summation of the request size of sequential commands detected with S and M2 thresholds - the summation of the size of the first command in sequential streams) / total request size of all commands

If you remove size constraint S, you get another four definitions.

Spatial Locality and Logical Seek Distance

Locality , as a special case of correlation of a variable with itself over short to medium ranges, can be visualized as a 2D probability surface p(s, d), showing the probability that the first time an address s bytes away will be referenced in exactly d cycles. There are generally two types of locality: spatial locality and temporal locality. For spatial locality, if the distance s = |xi − xj | is small, it implies that not long after referencing address xi you will discover a reference to the adjacent address xj . Temporal locality will be discussed in the next subsection.

Note that most traces provide only LBAs, not physical block addresses (PBAs). Because different HDDs may not have exactly the same data layout (such as serpentine), identical LBAs on different HDDs can map to different PBAs. However, the difference is usually very small if the data layouts are similar, so analyzing logical distance is still meaningful.

Below I discuss two distances to indicate spatial and temporal localities, respectively [35].

Logical Seek Distance

This metric is an indicator for spatial locality. It defines the logical block/LBA distance between “consecutively” device-received IOs.
  • Non-queued next: The simplest case, which calculates the distance between two exactly consecutive IOs.

  • Non-queued closest: The absolute closest distance = min(||last_LBA(rj−1) − first_LBA(rj)||, ||last_LBA(rj) − first_LBA(rj+1)||), where rj is the current IO and ||·|| indicates the absolute value. The closest distance is the signed value of the absolute closest distance.

  • Queued next: Simulates a queue with certain rules, such that the absolute closest distance = min(||last_LBA(ri) − first_LBA(rj)||), i = 1, . . . , ni ≤ N, where ni is the current queue length and ri is an IO in the queue.

  • Queued closest: Simulates a queue such that the absolute closest distance = min(||last_LBA(ri) – first_LBA(rj ) ||i=1,...,ni, ||last_LBA(rj ) – first_LBA(rj+1)||).

In general, ||queued closest|| ≤ ||non-queued closest|| ≤ ||non-queued next||. For the distance, you can further define three statistics: mean, median, and mode. The mode is the most frequent value; the higher the count at mode = 0, the more sequential the workload.
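A sketch of the simplest variant, the non-queued “next” distance (signed, so 0 means strictly sequential and the sign gives the seek direction; names are mine):

```python
def next_seek_distances(requests):
    """Non-queued 'next' logical seek distance between exactly consecutive IOs.
    requests: (first_lba, size) pairs in arrival order."""
    return [f1 - (f0 + s0)                 # gap after the previous request ends
            for (f0, s0), (f1, _) in zip(requests, requests[1:])]

def mode_zero_count(distances):
    """Count of zero distances; a high count indicates a sequential workload."""
    return sum(1 for d in distances if d == 0)

dists = next_seek_distances([(0, 8), (8, 8), (100, 8), (16, 8)])
print(dists, mode_zero_count(dists))  # [0, 84, -92] 1
```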

Temporal Locality and Logical Stack Distance

Temporal locality means that the same address is referenced again after d steps. Popularity is a term closely related to temporal locality.

Logical Stack Distance

If an LBA in ri is referred to again after some time by another request rj, the command distance between the two requests is defined (simplified) as the logical stack distance, an important temporal locality index. Take write requests as an example. If the distance is small enough (e.g., smaller than the HDD’s DRAM cache queue size), the write may hit the DRAM cache; otherwise, a disk update is required. If writes within a certain distance range are frequent, the update frequency is high.
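A minimal sketch of this simplified (non-queued) stack distance, keyed on the first LBA only:

```python
def stack_distances(first_lbas):
    """Simplified logical stack distance: for each re-reference, the number of
    commands since the previous request that touched the same first LBA."""
    last_seen = {}
    dists = []
    for i, lba in enumerate(first_lbas):
        if lba in last_seen:
            dists.append(i - last_seen[lba])
        last_seen[lba] = i
    return dists

# LBA 197306368 recurs in Table 2-3 at orders 2, 4, and 5
print(stack_distances([902553856, 197306368, 902554880, 197306368, 197306368]))
# → [2, 1]
```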

Unlike the case in [35], we are more interested in the queued stack distance with consideration of the cache. Therefore, let’s also look into read/write dependency. The details will be discussed later in this chapter.

Burstiness and Self-Similarity

I/O requests almost never occur singly; they tend to arrive in groups, because if there are long intervals without arrivals, there must also be intervals with far more arrivals than their even share. Burstiness is generally related to queue depth, request arrival rate, and inter-arrival time. This phenomenon is often called long-range dependence and is typically explained as (stochastic) self-similarity, because that is the only explanation that avoids non-stationarity. Self-similarity describes how a property of an object is preserved under scaling in space and/or time. In other words, in addition to the long-range dependence property, the scale-invariance property holds at any time scale, as in a fractal shape.

Statistical Properties Visualization and Evaluation

Besides the statistical properties mentioned in previous sections, there are several other properties of both the marginal and the joint probability distributions of the attributes of interest that may strongly influence the quantitative analysis of system behavior [35]:
  • Normality: A probability distribution can be accurately approximated by a normal distribution. Although a perfect normal distribution is rare in reality, it is often used as a reference model.

  • Nonstationarity: In a stationary process, the outputs (job size, inter-arrival time) vary, but the stochastic process that produces them does not.

  • Long-tailness and power-law: Some regions far from the mean or the median of the probability distribution, like the extreme values in the tails of the probability distribution, are assigned relatively high probabilities following a sub-exponential or polynomial law, contrary to what happens to the family of normal distributions, where the tails fall exponentially.

  • Heavy-tailness: The long tails of the distribution fall polynomially and the self-similarity property holds.

  • Cyclic behavior and seasonal variations: They are an indication of a non-stationary workload and must be treated separately.

  • Autocorrelation, cross-correlation, and short-range dependence: Autocorrelation is also known as serial correlation or short-term/range memory, where the autocorrelation at short time scales is significant and long-range dependence is also known as long-term memory, where the autocorrelation at long time scales is significant.

These properties can be possibly visualized via graphical plotting. The empirical cumulative distribution function (EDF) and the complementary EDF (CEDF) are often used to observe the sample distribution. In particular, the log-log EDF and log-log CEDF plots are usually applied to compare the body and the tails of the sample probability distribution, respectively. Similarly, the Q-Q plot (a plot of the quantiles of the first data set against the quantiles of the second data set) can be employed for evaluating possible differences, especially in the tails, between the empirical probability distribution and another reference probability distribution, either theoretical or empirical.

The mass-count disparity plot and the Lorenz curve can be used to look for evidence of the power-law property. The mass-count disparity plot displays the “mass” probability distribution (given by the probability that a unit of mass belongs to an item smaller than a predefined x) against the “count” probability distribution (given by the CDF) in order to show possible disparities between the two. The Lorenz curve is an alternative way to illustrate the relationship between the “count” distribution and the “mass” distribution; it is generated by pairing percentiles that correspond to the same value (i.e., a point (pc, pm) on the curve is such that $$ {p}_m={F}_m(x)={F}_m\left({F}_c^{-1}\left({p}_c\right)\right) $$, where Fm(·) and Fc(·) are the cumulative distribution functions of the “mass” and “count” distributions, respectively, and $$ {F}_c^{-1}\left(\cdot \right) $$ is the inverse of Fc(·)).

The run-sequence plot and the autocorrelation plot can be used to investigate the presence of both short-range and long-range dependence, as well as periodic patterns and trends. The run-sequence plot displays observed data in a time sequence; it is constructed by plotting values of the attribute of interest in the temporal order in which they appear. This plot is particularly useful for finding shifts in location and scale, for locating outliers and, in general, for gaining insight into the trend of observed data. The autocorrelation plot (also known as a correlogram) is a plot of the sample autocorrelation function (ACF), that is, of the sample autocorrelation at increasing time lags; it is used to check for randomness in a sequence of observations of the same attribute. If the sequence is random, the autocorrelations should be near zero for any and all time-lag separations; conversely, if it is non-random, one or more of the autocorrelations will be significantly different from zero.
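The sample ACF behind the correlogram can be sketched in a few lines (a simplified version; production code would use a statistics library):

```python
def acf(series, max_lag):
    """Sample autocorrelation at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    out = []
    for k in range(1, max_lag + 1):
        cov = sum((series[t] - mean) * (series[t + k] - mean)
                  for t in range(n - k))
        out.append(cov / var)
    return out

# A strictly alternating series: strong negative correlation at lag 1,
# strong positive correlation at lag 2
print(acf([1, -1, 1, -1, 1, -1, 1, -1], 2))  # → [-0.875, 0.75]
```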

Some hypothesis-testing techniques can be used for quantitative evaluations of these properties, such as the F-test, T-test, K-S (Kolmogorov-Smirnov) test, Mann-Whitney U-test, and H-test. These tests can compare the differences between two empirical distribution functions/samples. The Pearson's r and Spearman's ρ correlation coefficients are utilized to discover linear and generic correlations, respectively, among the attributes of interest. Both coefficients are within the range of [–1, +1], where a value near +1 means a strong positive correlation, a value near –1 means a strong negative correlation, and a value near 0 means no significant correlation. More details will be described in Chapter 9.
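As one concrete example, the two-sample K-S statistic mentioned above is simply the maximum vertical distance between two EDFs; a library-free sketch (a real test would also compute a p-value, e.g., via scipy.stats.ks_2samp):

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the empirical distribution functions of a and b."""
    def edf_at(xs, v):
        # Fraction of xs that are <= v
        return sum(1 for x in xs if x <= v) / len(xs)
    return max(abs(edf_at(a, v) - edf_at(b, v)) for v in list(a) + list(b))

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples → 0.0
print(ks_statistic([1, 2, 3, 4], [5, 6, 7, 8]))  # disjoint samples → 1.0
```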

Read/Write Dependency

Dependencies are generally classified into four categories [39]:
  • Read-on-write (ROW), or read after write (RAW), or true dependencies

  • Write-on-write (WOW), or write after write (WAW), or write update, or output dependencies

  • Write-on-read (WOR), or write after read (WAR), or anti-dependencies

  • Read-on-read (ROR), or read cache hit, or input dependencies

Within a certain time window, ROR and ROW are directly related to the cache replacement algorithm and read hit performance, while WOR and WOW are related to the cache update policy. A ROW dependency exists between two operations if the first operation writes a block that is later read by the other operation, with no intervening operation on the block. WOW and WOR are similarly defined. ROR is commonly used to check read cache efficiency. ROW can check whether the so-called "write once read many (WORM)" pattern is possible, which is an important property for SMR (the higher, the better).
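To make the four categories concrete, here is a hedged sketch (the trace format and names are my own) that counts ROW/WOW/WOR/ROR pairs by matching each access with the immediately preceding access to the same block:

```python
from collections import Counter

def classify_dependencies(trace):
    """Count ROW/WOW/WOR/ROR pairs in a trace of (op, block) tuples,
    op being 'R' or 'W'. A pair is formed by an access and the
    immediately preceding access to the same block, so there is no
    intervening operation on that block."""
    last = {}            # block -> last op seen on that block
    counts = Counter()
    for op, blk in trace:
        if blk in last:
            counts[op + "O" + last[blk]] += 1  # 'R' after 'W' -> 'ROW'
        last[blk] = op
    return counts

trace = [("W", 10), ("R", 10), ("R", 10), ("W", 10), ("W", 20), ("W", 20)]
print(classify_dependencies(trace))  # one each of ROW, ROR, WOR, WOW
```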

Write Update (Write on Write)

A high WOW ratio generally means a high block update ratio. Therefore, when replicated blocks exist and the WOW ratio is quite high within a short time window, it might be better for IO performance to update one of the copies and invalidate the remainder rather than update all the copies. By comparing ROW and WOW, you can estimate the likelihood of blocks being updated versus being read. SMR generally expects fewer write updates, resulting in smaller write amplification. If an out-of-place policy is applied for write updates, you can expect better spatial efficiency.

In order to have a better view of WOW, I define three different types of update ratios below. Their different purposes are shown in Table 2-7.
Table 2-7

Comparison of Write Update Ratios

                               Frequented   Timed/ordered   Stacked
Total updated blocks/commands  Yes          Yes             Yes
Timing issue                   No           Yes             No
Update frequency               Yes          No              No
Memory stack                   No           No              Yes

Frequented update ratio: During a time period, record the update frequency of each LBA. Any write hit is counted. If its frequency is larger than 1, the LBA is rewritten in the observation period. You can therefore provide a curve of x-axis vs. y-axis, where
  • x is the maximum update frequency (≤1: not rewritten; otherwise rewritten)

  • y is the percentage of blocks updated (updated blocks at a specified frequency value / total updated blocks of write commands) or its cumulative distribution function (CDF).
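A minimal sketch of this metric, treating each LBA as a single block (the function name is illustrative, not from the text):

```python
from collections import Counter

def frequented_update_ratio(writes):
    """Map update frequency -> fraction of updated LBAs written that
    many times during the window. Only LBAs with frequency > 1 count
    as updated (rewritten)."""
    freq = Counter(writes)
    updated = {lba: f for lba, f in freq.items() if f > 1}
    dist = Counter(updated.values())
    return {f: c / len(updated) for f, c in sorted(dist.items())}

# LBA 5 written 3 times, LBA 7 twice, LBA 9 once (not an update)
print(frequented_update_ratio([5, 7, 5, 9, 5, 7]))  # → {2: 0.5, 3: 0.5}
```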

This process gives a quick, coarse-grained analysis of how LBAs are updated in a workload on conventional PMR drives. However, it may not reflect the actual ratio for SMR drives. In fact, due to the indirect mapping and log nature of SMR, it misses the actual write update; for example, a rewrite with an out-of-place update is actually a “new” write to SMR drives.

Timed/ordered update ratio: During a time period, record the total blocks of updated write requests (repeated or not). An update is an operation that rewrites a block previously written during the observation period. You can provide a curve of x = time or command IDs vs. y = total size of updated blocks or percentage of blocks updated (updated blocks at time x / total blocks of commands), so the y value is the percentage of total blocks that were updated. Note that definitions similar to the frequented update ratio and the timed/ordered update ratio are given in [4, 40].
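The timed/ordered variant can be sketched as a running update fraction over the command sequence (each write is treated as one block here; a real trace would weight by request size):

```python
def timed_update_ratio(writes):
    """One (command_id, cumulative update ratio) point per write
    command, where an update is a rewrite of an LBA already written
    earlier in the observation window."""
    seen, updated, points = set(), 0, []
    for i, lba in enumerate(writes, 1):
        if lba in seen:
            updated += 1
        seen.add(lba)
        points.append((i, updated / i))
    return points

print(timed_update_ratio([5, 7, 5, 5, 9]))
# after command 5, 2 of 5 writes were rewrites, so the last point is (5, 0.4)
```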

Stacked update ratio: During a time period, record the update frequency of each write command (partial or full hit). Once a rewrite happens, it is counted as a new command (the update frequency is always ≤1). You can provide a curve of x = logical stack distance vs. y = percentage of updated write commands (updated write commands / total commands), or a curve of x = logical stack distance vs. y = percentage of updated write size (updated write size / total commands). It shows the actual update size/commands and tells whether an update in an SSD/DRAM cache is necessary. Note that the stack distance can also be replaced by a time interval.
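The logical stack distance used by the stacked ratio can be computed with an LRU-style stack, as in this sketch (per-LBA granularity; names are mine):

```python
def write_stack_distances(writes):
    """Logical stack distance of each rewrite: the number of distinct
    other LBAs written since the previous write to the same LBA.
    First-time writes produce no distance."""
    stack, dists = [], []
    for lba in writes:
        if lba in stack:
            pos = stack.index(lba)
            dists.append(len(stack) - 1 - pos)
            stack.pop(pos)
        stack.append(lba)   # move the LBA to the top of the stack
    return dists

# LBA 5 is rewritten after 2 distinct other LBAs, then immediately again
print(write_stack_distances([5, 7, 9, 5, 5]))  # → [2, 0]
```

A histogram of these distances against the fraction of rewrites gives the stacked update ratio curve described above.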

Read on Write (ROW)

This metric is used to check if “write once read many (WORM)” is possible. In general, the higher the ROW ratio, the better the WORM property of the workload.

Frequented ROW ratio: During a time period, record the read hit frequency of each LBA after a write command. You can then plot a curve of x = maximum hit number or frequency (<1: not hit; otherwise hit) vs. y = percentage of blocks or commands hit (hit blocks at frequency x / total number of HDD blocks).

Timed ROW ratio: During a time period, record the hit blocks of each read command since the last write. You can provide a curve of x = time or command ID vs. y = percentage of blocks or commands of read hits (blocks of hit read commands / blocks of total commands).

Stacked ROW ratio: During a time period, record the hit frequency of each read command (partial or full hit) since the last write. You can provide a curve of x = logical stack distance vs. y = percentage of read hit commands (hit read commands / total read commands) or y = percentage of blocks of read hit commands.
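As a sketch of the frequented ROW count (trace format and names are illustrative), one can count, per LBA, the reads that hit data previously written in the window; high per-LBA hit counts indicate WORM-friendly behavior:

```python
def row_hit_counts(trace):
    """For each LBA, count reads that occur after a write to that LBA
    within the window. trace is a sequence of ('R'|'W', lba) tuples."""
    written, hits = set(), {}
    for op, lba in trace:
        if op == "W":
            written.add(lba)
        elif lba in written:
            hits[lba] = hits.get(lba, 0) + 1
    return hits

trace = [("W", 1), ("R", 1), ("R", 1), ("R", 2), ("W", 3)]
print(row_hit_counts(trace))  # LBA 1 is read twice after its write → {1: 2}
```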

Beyond the material presented here, other qualities, such as self-similarity (for burst IO) and workload dependence among nodes (i.e., how the tasks are distributed among nodes) [35, 39], are interesting metrics for further study. In the spirit of brevity, I include a targeted presentation of workload metrics, omitting these analyses.

Priority-Related Metrics

Requests have priorities during execution. Usually, foreground activities have higher priority than background activities. In some cases, such as synchronization in RAID5/6 and EC operations, requests marked as synchronized may have higher priority due to time-out policies.

Other metrics may be related to clusters, multi-threading, networks, and so on [41]. For example, network factors (I/O bandwidth distribution, channel utilization, instructions, packet sizes, source or destination of packets, page reference pattern, various forms of delay like transfer time and queuing delays, etc.) will also influence the final device performance.

Modeling Issues

There are two regular approaches to evaluating a system design. One employs the traced workload directly to drive a simulation. The other builds a model from the trace and uses the model for either analysis or simulation. Although the numerical trace-based approach is straightforward, workload models may have some advantages over traces, such as adjustment with controlled modification, repetition, stationarity, generalization, noise avoidance, increased understanding, added features, efficiency, etc. Therefore, it is important to understand the metrics via mathematical models. In the previous section, I pointed out that the hypothesis tests are based on some assumptions of statistical models. Tables 2-8 and 2-9 list a few common details on the modeling issues of some basic and advanced metrics. For more details, refer to [35].
Table 2-8

Mathematical Models for Basic Metrics

Metric                     Modeling issues
Throughput                 Commonly used distributions: Pareto distribution
Queuing time               Commonly used distributions: exponential distribution, Erlang distribution
Service time               Commonly used distributions: exponential distribution, Erlang distribution
Response time              Commonly used distributions: exponential distribution, Erlang distribution
Disk seek time             May be modeled by data fitting using a piecewise function
Read to write ratio        No obvious distribution; depends on read/write dependency
Request size distribution  Commonly used distributions: logarithmic, truncated-normal, Pareto,
                           exponential, power-law, log-uniform
LBA range/randomness       See spatial locality
Inter-arrival time         Commonly used distributions: Poisson, exponential, lognormal

Table 2-9

Mathematical Models for Advanced Metrics

Metric                          Modeling issues
Locality                        Commonly used distributions: Zipf distribution (Independent
                                Reference Model). Other models: LRU stack model, Markovian
                                models, fractal model
Spatial locality                The simplest way is counting unique substrings of the
                                reference stream; formally measured by the size of the
                                working set
Temporal locality               The simplest measure is to look at the reference stream
                                through a pinhole or using a simulation of an LRU stack
Read/write dependency           Models: state machine, Markov chain, clustering
Priority-related metrics        Completely determined by the OS/FS rules; a model can be
                                built based on a Markov chain
Burstiness and self-similarity  Auto-covariance and covariance matrices are often used to
                                describe self-similarity

A general procedure to generate an analytical model is as follows:
  • First, decide the formulation: the characterization level and the workload basic component.

  • Second, collect the parameters of the workload to be modeled while it is executed.

  • Third, statistically analyze the measured data with some statistical distributions (e.g., logarithmic distribution, truncated-normal distribution, Pareto distribution, exponential distribution, power-law distribution, log-uniform distribution) and stochastic models (e.g., Markov models), including the sampling procedure and static/dynamic analysis.

An example using a Markov chain can be stated in the following steps.
  1. Assuming the workload is approximately periodic with s states, model the IO trace of each data storage unit in one cycle as a specific Markov chain. Otherwise, all the corresponding Markov chain parameters should be time-varying.

  2. Given a historic workload trace L, represented as a D × T matrix: $$ X=\left[\begin{array}{cccc}{x}_{1,1}& {x}_{1,2}& \dots & {x}_{1,T}\\ {}{x}_{2,1}& {x}_{2,2}& \dots & {x}_{2,T}\\ {}\cdots & \cdots & \cdots & \cdots \\ {}{x}_{D,1}& {x}_{D,2}& \cdots & {x}_{D,T}\end{array}\right] $$

     where D is the total number of data storage units and T is the total number of time intervals. xd,t is the IO intensity for any chosen data unit d at time t, and xd,t ∈ {1, 2, 3, ..., s}.

  3. Each Markov chain can be represented by its s × s state transition probability matrix: $$ P=\left[\begin{array}{cccc}{p}_{1,1}& {p}_{1,2}& \dots & {p}_{1,s}\\ {}{p}_{2,1}& {p}_{2,2}& \dots & {p}_{2,s}\\ {}\cdots & \cdots & \cdots & \cdots \\ {}{p}_{s,1}& {p}_{s,2}& \cdots & {p}_{s,s}\end{array}\right] $$

     where pi,j is the probability that the data unit's IO intensity state changes to j at the next time interval, given that its current IO intensity state is i, with i, j ∈ {1, 2, 3, ..., s}.

  4. To simplify data allocation in the tiered storage system, you also need to classify the thousands of data units into a dozen or so data clusters according to their dynamic IO intensities, for example, with a K-means algorithm using the state transition probability matrices as well as the residual time for each state s of every data unit. By combining the initial state of the cluster S, you can obtain the predicted IO intensity of all the clusters in a whole period, represented as the following N × T matrix: $$ S=\left[\begin{array}{cccc}{s}_{1,1}& {s}_{1,2}& \dots & {s}_{1,T}\\ {}{s}_{2,1}& {s}_{2,2}& \dots & {s}_{2,T}\\ {}\cdots & \cdots & \cdots & \cdots \\ {}{s}_{N,1}& {s}_{N,2}& \cdots & {s}_{N,T}\end{array}\right] $$, where sl,t is the IO intensity of the lth data cluster at the tth time interval.

Refer to [7, 8] to see the difference between the simulation and modeling approaches. For more complex cases, you may use so-called multiple-level process models; in these cases, the correlation must be considered.
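Step 3 of the Markov chain example above (estimating the transition matrix P of one data unit) can be sketched as follows; the function name and the toy state sequence are my own:

```python
def estimate_transition_matrix(states, s):
    """Estimate the s-by-s transition probability matrix P from one
    observed state sequence (states take values in 1..s): count the
    observed transitions and normalize each row."""
    counts = [[0] * s for _ in range(s)]
    for cur, nxt in zip(states, states[1:]):
        counts[cur - 1][nxt - 1] += 1
    P = []
    for row in counts:
        total = sum(row)
        P.append([c / total if total else 0.0 for c in row])
    return P

# IO-intensity states of one data unit over 8 time intervals, s = 2
print(estimate_transition_matrix([1, 1, 2, 2, 1, 1, 2, 1], 2))
```

Applying this per data unit yields the matrices that the clustering step (step 4) then groups by similarity.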

Typical Applications

In this section, you will take a look at several trace metrics of some typical applications in order to get a quick view. Table 2-10 lists some typical applications, where only block size, read/write percentage, and random/sequential percentage are provided with some rough values. The dominant factor, IOPS or MBPS (throughput), actually shows whether the workload is random-request dominant or sequential-request dominant.
Table 2-10

Typical Application IO Workload Profiles

Application                                    Size (Byte)     R/W      Rand./seq.  Dominant
Web file server                                4KB, 8KB, 64KB  95%/5%   75%/25%     IOPS
Database online transaction processing (OLTP)  8KB             70%/30%  100%/0%     IOPS
Exchange email                                 4KB             67%/33%  100%/0%     IOPS
OS drive                                       8KB             70%/30%  100%/0%     IOPS
Decision support systems (DSS)                 1MB             100%/0%  100%/0%     IOPS
File server                                    8KB             90%/10%  75%/25%     IOPS
Video on demand                                512KB           100%/0%  100%/0%     IOPS
Web server logging                             8KB             0%/100%  0%/100%     MBPS
SQL server logging                             64KB            0%/100%  0%/100%     MBPS
OS paging                                      64KB            90%/10%  0%/100%     MBPS
Media streaming                                64KB            98%/2%   0%/100%     MBPS

IO workload characteristics are generally application-dependent in nature. Many access patterns, such as read/write proportions and the handling of writes, differ by application. Nevertheless, the majority of characteristics vary only by environment (operating conditions). Environment-dependent characteristics include the length of idle intervals, request arrival rate, workload randomness and sequentiality, read and write performance, disk service time and request response time, request size, etc. More importantly, some characteristics of the overall IO workload do remain consistent across applications and environments. A particular note here is workload burstiness (i.e., long-range dependence). Block-level workloads, in particular request inter-arrival times and request seek distances, are long-range dependent in general. As a measure of temporal locality in a time series, long-range dependence has a variety of consequences, specifically with regard to predicting overall system and particular resource saturation. Consequently, burstiness should be taken into consideration when designing new storage systems and resource management policies at various layers of the IO path. As a result, there is no universally good configuration for all workloads, due to the large differences among applications [42].

For file system performance, keeping the file system block size close to the workload's I/O size can increase the efficiency of the system significantly [43].

Web traffic volume is increasing rapidly. Some researchers argue that there have been no dramatic changes in web server workload characteristics in the last 10 years [44]. These characteristics include one-time referencing behavior, heavy-tailed file size distributions, non-Poisson aggregate request streams, a high concentration of references, Poisson per-document request streams, and the dominance of remote requests.

A database usually has significant inter-transaction locality, showing that transactions in real workloads are generally dependent on each other. Another observation is that the significant use of sequential accesses allows a prefetch policy to be applied. Sequentiality is a consequence of long-running queries that examine a large number of records, such as a join operation.

OLTP (online transaction processing) workloads are characterized by a large memory footprint joined with a small critical working set, and by their reduced benefit from micro-architectural optimizations. In addition, index searching in OLTP workloads requires a different cache design.

Let’s look further at two workloads in more detail.3 Table 2-11 shows the metric values of the two traces. You can see that their properties differ greatly. Sometimes, only two or three major metrics are used in simple synthetic trace generators, although an application's actual trace is much more complex.
Table 2-11

Basic Metrics for Two Typical Workloads

Trace          Duration(s)  Traffic (GB)  Total requests  Avg. R/W   R/W traffic  Random read  Random write
                                          (×10⁶)          size (KB)  ratio        (×10⁶)       (×10⁶)
OLTP           43712        18.491        5.335           3.466      0.1820       0.955        2.99
Search engine  3151.3       16.369        1.055           15.509     8762.9       0.994        2.08e-4

Traces at the File and Object Levels

The other two types of traces generally share many common properties with block-level traces. However, they have unique features in some scenarios. Table 2-12 gives some metrics of file-level traces that differ from those of block-level traces.
Table 2-12

Some Metrics for File-Level Traces

Metric                   Explanation
File types               A file system is utilized in one of several ways: as long-term
                         storage for persistent data, as a repository for data too large
                         to fit in main memory, as the site of storage for executables,
                         and for storing logs of system usage for accounting and
                         monitoring. The metadata-to-user-data ratio is an important
                         index.
File age                 The age is defined as the time from its last reference.
File access duration     The time from open to close.
User behavior model      Users generate the references that constitute the workload.
Process and state model  Several aspects are included, such as how a process makes file
                         service requests during its existence; the conditional
                         activation table for access dependency; and the ratio of
                         different operations, such as open, close, write, read, seek,
                         etc. It may be modeled as a closed Markov chain.
Reference model          How the requests are distributed among the files in the system.

As mentioned, the metrics to be considered may differ by abstraction level. For example, at the system level, we usually consider the attributes listed in Table 2-13, most of which could also apply at the file or object level.
Table 2-13

Workload in System Level

Workload in         Load arrival pattern         Data access pattern            Computation pattern
system views
MapReduce           The arrival time sequence    HDFS input and output paths;   Input data size, shuffle data
                    of jobs                      the data size for the input,   size, output data size, job
                                                 shuffle, and output storage    duration, map task time,
                                                                                reduce task time
Enterprise network  The arrival time sequence    Read/write, sequential/        N.A.
storage             of application instances     random, single/repeated
                    or user sessions             access, file sizes, file
                                                 types

In this chapter, I presented some trace metrics for the performance evaluation and design of storage systems. Some typical applications were given to provide a quick impression of these metrics. In fact, due to the large differences among applications, there is no one-size-fits-all system design in general. I will discuss more details in later chapters.