Traces are usually classified into three levels: block level, file level, and object level. They share many common metrics, although each has its own unique properties. In this chapter, I discuss block-level trace metrics in detail, since the block-level trace provides more fundamental information about storage systems than the other two levels. For simplicity of presentation, I divide the metrics into two categories: basic and advanced. The meaning and performance impact of each metric is explained in detail.
Workload Properties

Typical workload abstraction level
Examples of Basic Metrics
Read to write ratio | Request size distribution | LBA range/randomness | Inter-arrival time |
---|---|---|---|
The ratio between read and write operations, in command count or in total size | Request size, usually counted separately for total, read, and write | In a virtual volume, the LBA can be used to represent the randomness in space; otherwise, specify the device number as well | Request arrival rate, idle time, busy time |

With system-level information, queue depth, average response time, bus time, etc. can also be included.
Examples of Advanced Metrics
Spatial locality | Temporal locality | Read/write dependency | Priority-related metrics |
---|---|---|---|
A small distance between two requests means that soon after referencing ri you find a reference to the nearby address rj | The same address is referenced again after d steps. A concept closely related to temporal locality is popularity. | Three categories: true (read on write), output (write on write), and anti (write on read) | Request priority determines the execution sequence; it also covers synchronous vs. asynchronous properties |

In most cases, cache-related, cluster-related, and/or network-related metrics should also be included.
Note that the meaning of “block” differs between HDD and SSD; here, one block is defined as one sector, as in HDD. In addition, sectors come in different formats, such as 512, 512e, 4K, 4Kn, etc. Unless otherwise noted, assume that one sector equals 512 bytes for simplicity of presentation.
An Example of Block Trace
Order | Arrival time (s) | Completion time (s) | First LBA | Size | Mode |
---|---|---|---|---|---|
1 | 0.026216 | 0.027506 | 902553856 | 1024 | 0 |
2 | 0.026790 | 0.027719 | 197306368 | 8 | 0 |
3 | 0.036680 | 0.039502 | 902554880 | 1024 | 0 |
4 | 0.039618 | 0.044770 | 197306368 | 16 | 1 |
5 | 0.039654 | 15.079936 | 197306368 | 16 | 1 |
6 | 0.044542 | 0.046394 | 902555904 | 1024 | 0 |
7 | 0.044865 | 0.046513 | 197306376 | 8 | 0 |
8 | 0.054996 | 0.055265 | 2657576 | 8 | 0 |
9 | 0.059638 | 0.059905 | 197306376 | 16 | 0 |
10 | 0.081950 | 0.083162 | 902556928 | 1024 | 0 |
11 | 0.089740 | 0.089960 | 197306384 | 8 | 1 |
12 | 0.092741 | 0.093955 | 902558976 | 1024 | 0 |
13 | 0.093261 | 0.095268 | 902557952 | 1024 | 0 |
14 | 0.112958 | 0.114461 | 902560000 | 1024 | 0 |
15 | 0.113097 | 0.115717 | 902561024 | 1024 | 0 |
16 | 0.114820 | 0.115926 | 197306384 | 8 | 0 |
17 | 0.135434 | 0.136744 | 902562048 | 1024 | 0 |
18 | 0.136436 | 0.136963 | 197306384 | 16 | 1 |
19 | 0.150173 | 0.151625 | 902563072 | 1024 | 0 |
20 | 0.150260 | 0.152809 | 902564096 | 1024 | 0 |
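To make the later calculations concrete, here is a minimal Python sketch that parses rows like those of this trace (Table 2-3) into records. The whitespace-separated format and the field names are assumptions; adapt them to whatever your trace collector actually emits.

```python
# A minimal sketch of loading a block-level trace such as Table 2-3.
# The row format "order arrival completion lba size mode" is an assumption.
from collections import namedtuple

Request = namedtuple("Request", "order arrival completion lba size mode")

def parse_trace(lines):
    """Parse whitespace-separated trace rows into Request tuples."""
    reqs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip headers or malformed rows
        reqs.append(Request(int(parts[0]), float(parts[1]), float(parts[2]),
                            int(parts[3]), int(parts[4]), int(parts[5])))
    return reqs

sample = [
    "1 0.026216 0.027506 902553856 1024 0",
    "2 0.026790 0.027719 197306368 8 0",
    "3 0.036680 0.039502 902554880 1024 0",
]
trace = parse_trace(sample)
print(trace[0].arrival, trace[0].size)   # 0.026216 1024
```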
Basic Metrics
Basic metrics are usually simple and easy to observe, but this does not mean they carry less information than advanced metrics. In some cases, basic metrics are sufficient to interpret the workload properties.
LBA Distribution

LBA distribution from a Ceph node

LBA distribution from a Hadoop node
Size Distribution
In general, a drive (whether HDD or SSD) favors sequential access, so sequential access achieves much higher throughput than random access, particularly for HDD. The request size also matters, especially when the disk queue is short, which is clearly illustrated in Figures 4-2 and 4-3 in Chapter 4.

Size distribution from a Hadoop node

Combined LBA and size distribution
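For reference, a small sketch of how the LBA and size distributions behind such plots can be tallied from a parsed trace. The tuple layout (arrival, completion, first LBA, size, mode) and the 1M-sector bin width are arbitrary assumptions.

```python
# A sketch of LBA and request-size histograms from a block trace.
from collections import Counter

trace = [
    (0.026216, 0.027506, 902553856, 1024, 0),
    (0.026790, 0.027719, 197306368, 8, 0),
    (0.036680, 0.039502, 902554880, 1024, 0),
]

# Size distribution: how often each request size occurs.
size_hist = Counter(size for _, _, _, size, _ in trace)

# LBA distribution: bucket first LBAs into coarse bins to reveal hot regions.
BIN = 1 << 20                      # 1M-sector bins (an arbitrary choice)
lba_hist = Counter(lba // BIN for _, _, lba, _, _ in trace)

print("size histogram:", dict(size_hist))
print("LBA bins hit:", dict(lba_hist))
```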
Read/Write Distribution
Required Fields for Metrics Calculation
Metric | Arrival time | Completion time | LBA | Request size | Mode (operation) |
---|---|---|---|---|---|
LBA distribution | Y (or) | Y (or) | Y | | |
Size distribution | | | | Y | |
IOPS | Y | Y (or) | | | |
Throughput | Y | Y (or) | | Y | |
Queue length/depth | Y (or) | Y (or) | | | |
Busy/idle time | Y | Y | | | |
Read/write distribution | Y (or) | Y (or) | | | Y |
Inter-arrival time distribution | Y | | | | |
Inter-completion time distribution | | Y | | | |
Inter-Arrival and Inter-Completion Time
Inter-arrival time is defined as the time interval between two sequentially arriving requests: δt_i = D_i − D_{i−1}. Similarly, inter-completion time is defined as δt̄_i = C̄_i − C̄_{i−1}, where C̄_i is C_i reordered into completion-time sequence. δt_i is a direct indicator of workload burstiness [38], together with the average completion time of requests. When the average δt_i is much smaller than the average δt̄_i, it usually means the system is under a stressed workload, which may be beyond the storage device’s capability. In a sense, these two indicators are closely related to IOPS.
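A minimal sketch of both metrics, assuming the arrival and completion times have already been extracted as lists (the sample values come from the first five requests of Table 2-3):

```python
# A sketch of inter-arrival and inter-completion times as defined above.
arrivals    = [0.026216, 0.026790, 0.036680, 0.039618, 0.039654]
completions = [0.027506, 0.027719, 0.039502, 0.044770, 15.079936]

inter_arrival = [a2 - a1 for a1, a2 in zip(arrivals, arrivals[1:])]

# Completions must be re-sorted into completion order before differencing.
sorted_completions = sorted(completions)
inter_completion = [c2 - c1 for c1, c2 in zip(sorted_completions, sorted_completions[1:])]

avg_ia = sum(inter_arrival) / len(inter_arrival)
avg_ic = sum(inter_completion) / len(inter_completion)
print(f"avg inter-arrival {avg_ia:.6f}s, avg inter-completion {avg_ic:.6f}s")
# avg_ia << avg_ic hints that the device is falling behind the offered load.
```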
IOPS and Throughput



Average IOPS and throughput with different time windows
The curves of these two metrics vs. time can be used to observe workload bursts visually. However, choose ∆t carefully: a ∆t that is too large smooths the curve but hides the IO bursts. A practical choice is related to the average inter-arrival time, so ∆t may be a few times the average inter-arrival time, but not too many.
Alternatively, you can fix ∆n. For example, you may set ∆n = 20 and let ∆t vary.
A more complex setting lets both ∆t and ∆n vary. For example, take ∆n ≤ 10 and ∆t = 1 second as the constraint: if no more than 10 requests arrive within ∆t = 1 second, average over that window; otherwise, shrink the window so that ∆n = 10 and average over the shrunk window.
IO(t_j) = the total number of IOs whose arrival times D_i fall within the interval [t_{j−1}, t_j]
R(t_j) = the sum of the request sizes of the IOs counted in IO(t_j)
Average IOPS at t_j = IO(t_j)/∆t
Average request size at t_j = R(t_j)/IO(t_j)
Average throughput at t_j = R(t_j)/∆t
You can also apply moving-average techniques here, in which case two consecutive “average” windows overlap.
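The following sketch computes average IOPS and throughput over fixed ∆t windows in the spirit of the definitions above; the 512-byte sector size and the sample values are assumptions.

```python
# A sketch of windowed average IOPS and throughput (no moving average).
SECTOR = 512  # bytes per block, per the 512-byte sector assumption above

def windowed_iops_throughput(arrivals, sizes, dt):
    """Return a list of (iops, throughput_bytes_per_s), one entry per window."""
    t_end = max(arrivals)
    n_windows = int(t_end // dt) + 1
    counts = [0] * n_windows
    nbytes = [0] * n_windows
    for a, s in zip(arrivals, sizes):
        w = int(a // dt)            # window index of this arrival
        counts[w] += 1
        nbytes[w] += s * SECTOR
    return [(c / dt, b / dt) for c, b in zip(counts, nbytes)]

arrivals = [0.0262, 0.0268, 0.0367, 0.0396, 0.0445, 0.0550, 0.0820, 0.0927]
sizes    = [1024, 8, 1024, 16, 1024, 8, 1024, 1024]      # in blocks
for j, (iops, thr) in enumerate(windowed_iops_throughput(arrivals, sizes, dt=0.05)):
    print(f"window {j}: {iops:.0f} IOPS, {thr / 1e6:.2f} MB/s")
```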
Response Time
Response time generally comprises the waiting (or queuing) time Tw and the access (or service) time Ta, i.e., T_res = T_a + T_w. The service time is the time spent in actual service, such as disk access time. The queuing time is the time a request waits in the queue before it is dispatched for execution. When the trace is collected from the external bus (IO driver), such as SATA/SAS, the response time of r_i is calculated as C_i − D_i.
Queue Length/Depth
- IO driver queue depth just before command arrived (Qd1)
The queue depth just before Di (instant queue depth Q(Di)) = the number of requests whose C time >= Di and D time < Di
- IO driver queue depth just after command completed (Qd2)
The queue depth just after Ci (instant queue depth Q(Ci)) = the number of requests whose C time >= Ci and D time < Ci
- Average queue depth
Estimated average queue depth during the non-idle period: ∑_i((Qd1_i + Qd2_i)/2 × (C_i − D_i)) / ∑_i(C_i − D_i)
Effective average queue depth in time interval: Sampling at ∆t seconds

Illustration of queue length
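A sketch of the Qd1 computation defined above (the queue depth just before each arrival), applied to the first rows of Table 2-5: it simply counts the earlier-arriving requests that have not yet completed.

```python
# A sketch of instantaneous queue depth just before each arrival (Qd1):
# count requests with D time < D_i and C time >= D_i.
def queue_depth_at_arrivals(arrivals, completions):
    depths = []
    for i, d_i in enumerate(arrivals):
        depth = sum(1 for j in range(len(arrivals))
                    if arrivals[j] < d_i and completions[j] >= d_i)
        depths.append(depth)
    return depths

arrivals    = [0.00007, 0.01130, 0.01134, 0.02622, 0.02679]
completions = [0.00036, 0.01157, 0.01288, 0.02751, 0.02772]
print(queue_depth_at_arrivals(arrivals, completions))   # [0, 0, 1, 0, 1]
```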
Estimated queue depth and idle time
Arrival time | Completion time | Queue depth | Idle time |
---|---|---|---|
0.00007 | 0.00036 | 0 | - |
0.01130 | 0.01157 | 0 | 1.09E–02 |
0.01134 | 0.01288 | 1 | 0 |
0.02622 | 0.02751 | 0 | 1.33E–02 |
0.02679 | 0.02772 | 1 | 0 |
0.03668 | 0.03950 | 0 | 8.96E–03 |
0.03962 | 0.04477 | 0 | 1.17E–04 |
0.03965 | 15.07994 | 1 | 0 |
0.04454 | 0.04639 | 2 | 0 |
0.04486 | 0.04651 | 2 | 0 |
0.05500 | 0.05526 | 1 | 0 |

Queue depth using arrival and completed time from a Hadoop node
The other is the HDD internal queue for actual execution with the mechanical parts. Some scheduling algorithms, such as SCAN and RPO (rotational positioning optimization), are applied here to minimize a given operation time. Vendor-specific commands (VSCs) can be used to collect the internal queue information. RPO tries to minimize the sum of the seek time and rotational latency of the requests in the queue. Theoretically, a longer queue means the scheduler has more candidates to select from, so the overall operation time can be reduced further (for purely random access). In practice, however, once the queue length reaches a certain level, performance no longer increases, as illustrated in Figures 4-2 and 4-3. Without VSCs, we can only use the external IO queue status to approximate the internal queue status. This is reasonable in a sense because an HDD can only handle one request at a time.1 For SSDs, however, the situation is much more complex due to parallel processing.
Busy/Idle Time
When a disk is performing an IO operation (including both foreground and background activities), the disk is busy; otherwise, it is idle. These two metrics are useful when you intend to design background operations for disks, such as defragmentation, address remapping, sector scanning, etc. As some of these operations require over a hundred milliseconds to complete, the estimated busy or idle time shows whether performance will be affected by background operations.
However, without VSCs to extract the drive’s internal trace log, you may not get the busy/idle time directly. Thus, we can only approximate it from the external trace when the internal trace is unavailable. The basic idea is to compare the completion times of the requests in the queue with the arrival time of a new request. If the completion time is later than the new arrival time, the drive is still busy at that arrival time; otherwise, the gap is counted as idle time (there may be some background activities inside the drive, but you can assume that user/external requests have higher priority than background activities). In other words, the calculation of idle time is closely tied to the queue depth: only when the queue depth is zero is there a chance for an idle interval to exist.
Let’s consider the trace in Table 2-3. The completion time 0.027506s of the first request is later than the arrival time 0.026790s of the second request, so the disk is busy at 0.026790s. However, the completion time 0.027719s of the second request is earlier than the arrival time 0.036680s of the third request, so you may estimate that the drive has an idle interval of 0.036680 − 0.027719 ≈ 0.009s. Table 2-5 shows the result.
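A sketch of this idle-time estimation from the external trace: track the latest completion time seen so far and report a gap whenever the next arrival comes later (the values are the first three requests of Table 2-3).

```python
# A sketch of idle-interval estimation; idle time can exist only when the
# queue drains (latest completion) before the next arrival.
def idle_intervals(arrivals, completions):
    """Return (start, length) of each estimated idle interval."""
    idles = []
    busy_until = 0.0
    for d, c in zip(arrivals, completions):
        if busy_until > 0.0 and d > busy_until:   # queue drained before this arrival
            idles.append((busy_until, d - busy_until))
        busy_until = max(busy_until, c)
    return idles

arrivals    = [0.026216, 0.026790, 0.036680]
completions = [0.027506, 0.027719, 0.039502]
for start, length in idle_intervals(arrivals, completions):
    print(f"idle from {start:.6f}s for {length:.6f}s")   # ~0.009s, as computed above
```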
Advanced Metrics
Real-world workloads are usually complex. Even for workloads claimed to be sequential, such as Hadoop HDFS writes or Ceph BlueStore writes, the actual block-level IO pattern is a mix of sequential and random accesses, as well as reads and writes. The advanced metrics attempt to provide insight into these traces.
Sequence vs. Randomness
The sequentiality and randomness of a workload are somewhat subjective in some engineers’ view, as it is not easy to state clearly whether a mixed workload is more sequential or random. In addition, the “feeling” of sequentiality also differs from case to case. For example, requests with very small gaps between them may also be considered (near) sequential.
In my view, it can be quantified under given conditions, such as the scheduling algorithm, queue length, etc. That means this property becomes objective once all the conditions are fixed. For example, consider a burst of write requests within a small LBA range. Some of the requests are actually contiguous in LBA, although they do not arrive in order. If a FIFO scheduling policy is applied, this workload is random, as the disk has to seek to a different location for each request. When a simple SCAN or RPO scheduling algorithm is applied and the queue is long enough, the requests are reordered and some become sequential. Assume that there are N requests; then there are up to N−1 seeks (in a strict definition, a request is considered random if it requires a seek in terms of LBA). Denote the number of random accesses as Nr and the number of sequential accesses as Ns; you can then obtain a ratio of randomness vs. sequentiality for a given workload under a fixed scenario.
Sequential and Near Sequential Streams
This metric directly indicates the degree of sequentiality of the workload. In general, a command is sequential when it follows a similar command whose requested LBAs immediately precede those of the new command. Hence, sequential reads follow read commands and sequential writes follow write commands; but there are some subtleties to this definition. A (strictly) sequential stream means that the current and previous commands are of the same type (read/write) and the new command’s starting LBA immediately follows the previous command’s last LBA. A near sequential stream means that there must be an earlier command of the same type whose LBA range ends before, and near, the start of the new command. Multiple sequential streams may interleave with each other. User settings affect these subtleties, allowing us to describe the variations of sequential command situations.
Queued sequence stream: In a time-ordered request stream, some requests are adjacent to each other (possibly within a time window), such that last_LBA(r_i) + 1 = first_LBA(r_j), where i < j, D_j − D_i < δt, and δt > 0 is an aging time.
In this way, j ≥ i + 1 is possible because sequence streams may interleave. In practice, however, a command queue of length N is required to identify the sequence, instead of only checking exactly adjacent requests. Once a new request enters, the queue is searched for a sequence match. If a match is found, the request is merged into the matching entry; otherwise, it adds a new entry to the queue. If the queue is full, an entry is evicted using a replacement rule (e.g., FIFO or LRU).
Generally, the larger N is, the larger the number of sequential requests detected in each stream (M2) and the smaller the number of sequential streams (M1). The key is to find an N large enough that the number of detected sequence streams is stable.
Queued sequence stream with constraint: In practice, a large request S (e.g., S ≥ 1024 blocks) is also counted as a sequential request. Hence an alternative method for considering stream i is to evaluate the total size S_i of the stream together with M2: if S_i ≥ S and M2 ≥ M, then stream i is considered a sequence stream, where S and M are thresholds.

An example of near sequence stream
Let’s look at the example in Table 2-6. Assume that these requests arrive within a t-second time window. It is obvious that all requests are random if you compare the last LBA of the previous request with the first LBA of the current request in FIFO order: there are seven seeks, so Nr = 8 and Ns = 0, and in this sense the workload is random because there is no sequential stream. However, if you reorder the requests by LBA with the aid of a queue, giving the request order <2 6 3 1 5 8(7) 4>, you have Nr = 3 and Ns = 4 in terms of (strictly) sequential streams. Note that request 7 is absorbed by request 8 during the reordering. So M2 = 2 (<2 6 3 1> and <5 8(7)>) for the (strictly) sequential streams. With the two constraints S = 1024 and M2 = 2, you have <5 8(7)> and <4>; the stream <2 6 3 1> is removed due to its small size.
An Example of LBA Sequence and Randomness
Order | First LBA | Last LBA | Order | First LBA | Last LBA |
---|---|---|---|---|---|
1 | 128 | 135 | 5 | 256 | 511 |
2 | 0 | 7 | 6 | 8 | 63 |
3 | 64 | 127 | 7 | 640 | 647 |
4 | 1536 | 2047 | 8 | 512 | 1279 |
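As a sketch, the following Python detects queued sequential streams over the requests of Table 2-6 with a bounded stream table and FIFO eviction. It is deliberately simplified: it does not merge streams that only become adjacent later, so it reports <2 6> and <5 8> as multi-request streams (request 7 stays on its own) rather than the fully merged <2 6 3 1> discussed above, and it does not apply the S and M thresholds.

```python
# A simplified sketch of queued sequential-stream detection (FIFO replacement).
from collections import deque

def detect_streams(requests, N=8):
    """requests: list of (first_lba, size_in_blocks). Returns detected streams."""
    table = deque()        # candidate streams, oldest first
    finished = []
    for lba, size in requests:
        for s in table:
            if s["next_lba"] == lba:            # strictly sequential continuation
                s["next_lba"] = lba + size
                s["count"] += 1
                s["blocks"] += size
                break
        else:
            if len(table) == N:                 # table full: evict the oldest stream
                finished.append(table.popleft())
            table.append({"start": lba, "next_lba": lba + size,
                          "count": 1, "blocks": size})
    return finished + list(table)

reqs = [(128, 8), (0, 8), (64, 64), (1536, 512),
        (256, 256), (8, 56), (640, 8), (512, 768)]     # Table 2-6
for s in detect_streams(reqs):
    print(s)
```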
Ns1: The number of requests with sequential access
Ns2: The number of requests with near sequential access, i.e., within a small LBA gap ∆d (defined below)
Nr : The number of the remaining requests
These variables have a strong connection with the logical seek distance defined in the next section.
The practical gap size ∆d in a real application is usually determined by the system’s performance tolerance; usually, it is up to a few track sizes. For example, if the acceptable tolerance is 5ms, then ∆d can be up to 1536 blocks for a 10K RPM drive model with a track/head switch time of 1ms and an average track size of 1.5MB. For random access, the average rotational latency is 3ms. Assume a 50% chance that the new request is in the same track and a 50% chance that it is in the next track (the actual probability is coupled with ∆d). Then you have 5 − (3 + 1) × 50% = 3ms left for further rotation, which corresponds to about half a track, so 1536 blocks.2
DEF1: (the total number of sequential commands detected with the S and M2 thresholds) / (total commands)
DEF2: (the total number of sequential commands detected with the S and M2 thresholds − the total number of sequential streams) / (total commands)
DEF3: (the total request size of sequential commands detected with the S and M2 thresholds) / (total request size of all commands)
DEF4: (the total request size of sequential commands detected with the S and M2 thresholds − the total size of the first command in each sequential stream) / (total request size of all commands)
If you remove the size constraint S, you get another four definitions.
Spatial Locality and Logical Seek Distance
Locality, as a special case of correlation of a variable with itself over short to medium ranges, can be visualized as a 2D probability surface p(s, d), giving the probability that an address s bytes away will be referenced for the first time exactly d cycles later. There are generally two types of locality: spatial locality and temporal locality. For spatial locality, if the distance s = |x_i − x_j| is small, it implies that not long after referencing address x_i you will find a reference to the nearby address x_j. Temporal locality is discussed in the next subsection.
Note that most traces provide only LBAs, not physical block addresses (PBAs). Because different HDDs may not have exactly the same data layout (such as the serpentine layout), the same LBA can map to different PBAs on different HDDs. However, the difference is usually very small if the data layouts are similar, so analyzing logical distance is still meaningful.
Below I discuss two distances to indicate spatial and temporal localities, respectively [35].
Logical Seek Distance
Non-queued next: The simplest case, which calculates the distance between two exactly consecutive IOs.
Non-queued closest: The absolute closest distance = min(||last_LBA(r_{j−1}) − first_LBA(r_j)||, ||last_LBA(r_j) − first_LBA(r_{j+1})||), where r_j is the current IO and ||·|| indicates the absolute value. The closest distance is the signed value of the absolute closest distance.
Queued next: Simulates a queue with certain rules such that the absolute closest distance = min(||last_LBA(r_i) − first_LBA(r_j)||), i = 1, …, n_i, where n_i ≤ N is the current queue length and r_i is an IO in the queue.
Queued closest: Simulates a queue such that the absolute closest distance = min(||last_LBA(r_i) − first_LBA(r_j)||, i = 1, …, n_i; ||last_LBA(r_j) − first_LBA(r_{j+1})||).
In general, ||Queued closest|| ≤ ||Non-queued closest|| ≤ ||Non-queued next||. For the distance, you can further compute three statistics: mean, median, and mode. The mode indicates the most frequent value; the larger the count for mode = 0, the more sequential the workload.
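A sketch of the simplest variant, the non-queued next distance, applied to the first four requests of Table 2-6; a signed distance of +1 indicates a strictly sequential follow-on.

```python
# A sketch of non-queued next logical seek distance: the signed gap from the
# last LBA of the previous request to the first LBA of the current request.
from statistics import mean, median

reqs = [(128, 8), (0, 8), (64, 64), (1536, 512)]   # (first_lba, size) from Table 2-6

def non_queued_next_distances(reqs):
    dists = []
    for (lba0, sz0), (lba1, _) in zip(reqs, reqs[1:]):
        last0 = lba0 + sz0 - 1
        dists.append(lba1 - last0)                 # +1 means strictly sequential
    return dists

d = non_queued_next_distances(reqs)
print(d)                                           # [-135, 57, 1409]
print("mean", mean(d), "median", median(d))
print("sequential transitions:", sum(1 for x in d if x == 1))
```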
Temporal Locality and Logical Stack Distance
Temporal locality means that the same address is referenced again after d steps. Popularity is a concept closely related to temporal locality.
Logical Stack Distance
If an LBA in r_i is referred to again after some time by another request r_j, the command distance between the two requests is defined (in a simplified form) as the logical stack distance, an important temporal locality index. Let’s take write requests as an example. If the distance is small enough (e.g., smaller than the HDD’s DRAM cache queue size), it might be a hit in the DRAM cache; otherwise, a disk update is required for the write. If the frequency of writes within a certain range of distances is high, it means the update frequency is high.
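A minimal sketch of this simplified logical stack distance at the granularity of starting LBAs, using an LRU stack; request sizes and partial overlaps are ignored for brevity, and the sample LBAs are taken from Table 2-3.

```python
# A sketch of (simplified) logical stack distance: for a re-referenced LBA,
# the number of distinct other LBAs touched since its previous reference.
def stack_distances(lbas):
    stack = []                 # LRU stack, most recently used at the end
    dists = []
    for lba in lbas:
        if lba in stack:
            pos = stack.index(lba)
            dists.append(len(stack) - 1 - pos)   # 0 == immediate re-reference
            stack.pop(pos)
        else:
            dists.append(None)                   # first touch: infinite distance
        stack.append(lba)
    return dists

lbas = [197306368, 902553856, 197306368, 902554880, 197306368]
print(stack_distances(lbas))   # [None, None, 1, None, 1]
```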
Unlike the case in [35], we are more interested in the queued stack distance with consideration of the cache. Therefore, let’s also look into read/write dependency; the details are discussed later in this chapter.
Burstiness and Self-Similarity
I/O requests almost never arrive singly; they tend to arrive in groups, because if there are long intervals with no arrivals, there must be other intervals with far more arrivals than their even share. Burstiness is generally related to queue depth, request arrival rate, and inter-arrival time. This phenomenon is often called long-range dependence and is typically explained as (stochastic) self-similarity, because that is the only explanation that avoids non-stationarity. Self-similarity describes how a property of an object is preserved while scaling in space and/or time. In other words, in addition to the long-range dependence property, the scale-invariance property holds at any time scale, as in a fractal shape.
Statistical Properties Visualization and Evaluation
Normality: The probability distribution can be accurately approximated by a normal distribution. Although a perfect normal distribution is rare in reality, it is often used as a reference model.
Nonstationarity: In a stationary process, the outputs (job size, inter-arrival time) vary, but the stochastic process that produces them does not change over time.
Long-tailedness and power law: Regions far from the mean or median of the probability distribution, such as the extreme values in the tails, are assigned relatively high probabilities following a sub-exponential or polynomial law, contrary to the family of normal distributions, whose tails fall exponentially.
Heavy-tailedness: The long tails of the distribution fall off polynomially and the self-similarity property holds.
Cyclic behavior and seasonal variations: These indicate a non-stationary workload and must be treated separately.
Autocorrelation, cross-correlation, and short-range dependence: Autocorrelation is also known as serial correlation or short-term/short-range memory, where the autocorrelation at short time scales is significant; long-range dependence is also known as long-term memory, where the autocorrelation at long time scales is significant.
These properties can often be visualized via graphical plotting. The empirical cumulative distribution function (EDF) and the complementary EDF (CEDF) are often used to observe the sample distribution. In particular, the log-log EDF and log-log CEDF plots are usually used to compare the body and the tails of the sample probability distribution, respectively. Similarly, the Q-Q plot (a plot of the quantiles of the first data set against the quantiles of the second data set) can be employed to evaluate possible differences, especially in the tails, between the empirical probability distribution and another reference probability distribution, either theoretical or empirical.
The mass-count disparity plot and the Lorenz curve can be used to look for evidence of the power-law property. The mass-count disparity plot displays the “mass” probability distribution (given by the probability that a unit of mass belongs to an item smaller than a predefined x) against the “count” probability distribution (given by the CDF) in order to show possible disparities between these two distributions. The Lorenz curve is an alternative way to illustrate the relationship between the “count” distribution and the “mass” distribution; it is generated by pairing percentiles that correspond to the same value, i.e., a point (p_c, p_m) on the curve satisfies p_m = F_m(F_c^{-1}(p_c)), where F_m(·) and F_c(·) are the cumulative distribution functions of the “mass” and “count” distributions, respectively, and F_c^{-1}(·) is the inverse of F_c(·).
The run-sequence plot and the autocorrelation plot can be used to investigate the presence of both short-range and long-range dependence, as well as periodic patterns and trends. The run-sequence plot displays observed data in time sequence; it is constructed by plotting the values of the (univariate) attribute of interest in the temporal order in which they appear. This plot is particularly useful for finding shifts in location and scale, for locating outliers, and, in general, for getting insights about the trend of the observed data. The autocorrelation plot (also known as a correlogram) is a plot of the sample autocorrelation function (ACF), i.e., the sample autocorrelation at increasing time lags; it is used to check for randomness in a sequence of observations of the same attribute. If the sequence is random, the autocorrelations should be near 0 for all time-lag separations; conversely, if it is non-random, one or more of the autocorrelations will be significantly different from 0.
Some hypothesis-testing techniques can be used for quantitative evaluations of these properties, such as the F-test, t-test, K-S (Kolmogorov-Smirnov) test, Mann-Whitney U-test, and H-test. These tests can compare the differences between two empirical distribution functions/samples. Pearson’s r and Spearman’s ρ correlation coefficients are used to discover linear and generic correlations, respectively, among the attributes of interest. Both coefficients lie within the range [−1, +1], where +1 means a strong positive correlation, −1 means a strong negative correlation, and 0 means no significant correlation. More details are described in Chapter 9.
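As a sketch, the sample autocorrelation function that a correlogram plots can be computed in a few lines of plain Python; the inter-arrival values below are illustrative only.

```python
# A sketch of the sample autocorrelation function (ACF) at increasing lags.
def acf(x, max_lag=5):
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x)
    out = []
    for lag in range(1, max_lag + 1):
        cov = sum((x[i] - mu) * (x[i + lag] - mu) for i in range(n - lag))
        out.append(cov / var)
    return out

inter_arrivals = [0.0006, 0.0099, 0.0029, 0.0001, 0.0049, 0.0003, 0.0101, 0.0046, 0.0223]
print([round(r, 3) for r in acf(inter_arrivals, max_lag=3)])
# Values near 0 at all lags suggest little short-range dependence in this tiny sample.
```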
Read/Write Dependency
Read-on-write (ROW), or read after write (RAW), or true dependencies
Write-on-write (WOW), or write after write (WAW), or write update, or output dependencies
Write-on-read (WOR), or write after read (WAR), or anti-dependencies
Read-on-read (ROR), or read cache hit, or input dependencies
Within a certain time window, ROR and ROW are directly related to the cache replacement algorithm and read-hit performance, while WOR and WOW are related to the cache update policy. A ROW dependency exists between two operations if the first operation writes a block that is later read by the other operation and there is no intervening operation on that block; WOW and WOR are defined similarly. ROR is a very common property for checking read cache efficiency. ROW can be used to check whether “write once read many” (WORM) behavior is present, which is an important property for SMR (the higher, the better).
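A sketch of counting the four dependency types at the granularity of starting LBAs, looking back only at the most recent access to each LBA; a real analysis would work on block ranges and restrict itself to a time window.

```python
# A sketch of read/write dependency counting (ROW/WOW/WOR/ROR).
def classify_dependencies(trace):
    """trace: list of (lba, mode) with mode 'R' or 'W'. Returns dependency counts."""
    last_mode = {}                                  # lba -> mode of most recent access
    counts = {"ROW": 0, "WOW": 0, "WOR": 0, "ROR": 0}
    for lba, mode in trace:
        prev = last_mode.get(lba)
        if prev is not None:
            counts[mode + "O" + prev] += 1          # e.g. read ('R') on write ('W') -> ROW
        last_mode[lba] = mode
    return counts

trace = [(100, "W"), (100, "R"), (100, "W"), (200, "R"), (200, "R"), (100, "W")]
print(classify_dependencies(trace))   # {'ROW': 1, 'WOW': 1, 'WOR': 1, 'ROR': 1}
```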
Write Update (Write on Write)
A high WOW ratio generally means a high block update ratio. Therefore, when replicated blocks exist and the WOW ratio is quite high within a short time window, it might be better for IO performance to update one of the copies and invalidate the rest, rather than update all the copies. By comparing ROW and WOW, you can estimate how likely blocks are to be updated vs. read. SMR generally prefers fewer write updates, which results in smaller write amplification. If an out-of-place policy is applied for write updates, you can expect better spatial efficiency.
Comparison of Write Update Ratios
Property | Frequented | Timed/ordered | Stacked |
---|---|---|---|
Total updated blocks/commands | Yes | Yes | Yes |
Timing issue | No | Yes | No |
Update frequency | Yes | No | No |
Memory stack | No | No | Yes |
x is the maximum update frequency (a value ≤ 1 means the block is not rewritten; otherwise it is rewritten)
y is the percentage of blocks updated (updated blocks at the specified frequency value / total updated blocks of write commands) or its cumulative distribution function (CDF).
This process gives a quick, coarse-grained analysis of how LBAs are updated in a workload on conventional PMR drives. However, it may not reflect the actual ratio for SMR drives: due to the indirection mapping and log-structured nature of SMR, it misses the actual write update; for example, a rewrite with an out-of-place update is actually a “new” write to an SMR drive.
Timed/ordered update ratio: During a time period, record the total blocks of updated write requests (repeated or not). An update is an operation that rewrites a block previously written during the observation period. You can provide a curve of x = time or command ID vs. y = total size of updated blocks, or the percentage of blocks updated (updated blocks at x / total blocks of commands), so the y-value is the percentage of total blocks that were updated. Note that definitions similar to the frequented update ratio and timed/ordered update ratio are given in [4, 40].
Stacked update ratio: During a time period, record the update frequency of each write command (partial or full hit). Once a rewrite happens, it is counted as a new command (so the update frequency is always ≤ 1). You can provide a curve of x = logical stack distance vs. y = percentage of updated write commands (updated write commands / total commands), or a curve of x = logical stack distance vs. y = percentage of updated write size (updated write size / total command size). This shows the actual update size/commands and tells whether an update in the SSD/DRAM cache is necessary. Note that the stack distance can also be replaced by a time interval.
Read on Write (ROW)
This metric is used to check if “write once read many (WORM)” is possible. In general, the higher the ROW ratio, the better the WORM property of the workload.
Frequented ROW ratio: During a time period, record the read hit frequency of each LBA after a write command. You can then plot a curve of x = maximum hit count or frequency (<1 means not hit; otherwise hit) vs. y = percentage of blocks or commands hit (hit blocks at frequency x / total number of HDD blocks).
Timed ROW ratio: During a time period, record the hit blocks of each read command since the last write. You can provide a curve of x = time or command ID vs. y = percentage of blocks or commands of read hits (blocks of hit read commands / blocks of total commands).
Stacked ROW ratio: During a time period, record the hit frequency of each read command (partial or full hit) since the last write. You can provide a curve of x = logical stack distance vs. y = percentage of read hit commands (hit read commands / total read commands), or y = percentage of blocks of read hit commands.
Beyond the material presented here, other qualities, such as self-similarity (for burst IO) and workload dependence among nodes (i.e., how tasks are distributed among nodes) [35, 39], are interesting metrics for further study. In the spirit of brevity, I give a targeted presentation of workload metrics and omit these analyses.
Priority-Related Metrics
Requests have priorities during execution. Usually, foreground activities have higher priority than background activities. In some cases, such as synchronization in RAID5/6 and EC operations, requests marked as synchronized may have higher priority due to time-out policies.
Other metrics may be related to clusters, multi-threading, networks, and so on [41]. For example, network factors (I/O bandwidth distribution, channel utilization, instructions, packet sizes, source or destination of packets, page reference patterns, various forms of delay such as transfer time and queuing delays, etc.) also influence the final device performance.
Modeling Issues
Mathematical Models for Basic Metrics
Metrics | Modeling issues |
---|---|
Throughput | Commonly used distributions: Pareto distribution |
Queuing time | Commonly used distributions: exponential distribution, Erlang distribution |
Service time | Commonly used distributions: exponential distribution, Erlang distribution |
Response time | Commonly used distributions: exponential distribution, Erlang distribution |
Disk seek time | May be modeled by data fitting using a piecewise function |
Read to write ratio | No obvious distribution, depends on read/write dependency |
Request size distribution | Commonly used distributions: logarithmic distribution, truncated normal distribution, Pareto distribution, exponential distribution, power-law distribution, log-uniform distribution |
LBA range/randomness | See spatial locality |
Inter-arrival time | Commonly used distributions: Poisson distribution, exponential distribution, lognormal distribution |
Mathematical Models for Advanced Metrics
Metrics | Modeling issues |
---|---|
Locality | Commonly used distributions: Zipf distribution (Independent Reference Model). Other models: LRU stack model, Markovian models, fractal model |
Spatial locality | The simplest way is by counting unique substrings of the reference stream. Formally measured by the size of the working set. |
Temporal locality | The simplest measure is to look at the reference stream through a pinhole or using a simulation of an LRU stack. |
Read/write dependency | Models: State machine, Markov chain, clustering |
Priority-related metrics | They are completely determined by the OS/FS rules. A model can be built based on a Markov chain. |
Burstiness and self-similarity | Auto-covariance and covariance matrices are often used to describe self-similarity. |
First, decide the formulation: the characterization level and the basic workload components.
Second, collect the parameters of the workload to be modeled while it is executed.
Third, statistically analyze the measured data with statistical distributions (e.g., logarithmic, truncated normal, Pareto, exponential, power-law, and log-uniform distributions) and stochastic models (e.g., Markov models), including the sampling procedure and static/dynamic analysis.
- 1.
Assuming the workload is approximately periodic with s states, model the IO trace of each data storage unit in one cycle as a specific Markov chain. Otherwise, all the corresponding Markov chain parameters would have to be time-varying.
- 2.
Given a historic workload trace L, represented as a D × T matrix L = (x_{d,t}) with d = 1, …, D and t = 1, …, T, where D is the total number of data storage units and T is the total number of time intervals. x_{d,t} ∈ {1, 2, 3, ..., s} is the IO intensity of data unit d at time t.
- 3.
Each Markov chain can be represented by its s × s state transition probability matrix P = (p_{i,j}), where p_{i,j} is the probability that the data unit’s IO intensity state changes to j at the next time interval given that its current IO intensity state is i, with i, j ∈ {1, 2, 3, ..., s}.
- 4.
To simplify data allocation in the tiered storage system, you also need to classify the thousands of data units into a dozen or so data clusters according to their dynamic IO intensities, for example with a K-means algorithm using the state transition probability matrices as well as the residence time in each state s of every data unit. By combining the initial states of the clusters, you can obtain the predicted IO intensity of all the clusters over a whole period, represented as an N × T matrix S = (s_{l,t}), where s_{l,t} is the IO intensity of the l-th data cluster at the t-th time interval.
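As an illustration of step 3, the following sketch estimates the s × s transition matrix of one data unit by counting state transitions in its IO-intensity sequence; the state sequence itself is hypothetical.

```python
# A sketch of estimating P = (p_ij) for one data unit from its state sequence.
def transition_matrix(states, s):
    """p_ij = Pr(next state = j | current state = i), estimated by counting."""
    counts = [[0] * s for _ in range(s)]
    for cur, nxt in zip(states, states[1:]):
        counts[cur - 1][nxt - 1] += 1
    P = []
    for row in counts:
        total = sum(row)
        P.append([c / total if total else 0.0 for c in row])
    return P

# Hypothetical IO-intensity states (1..s) of one data unit over 12 intervals, s = 3.
states = [1, 1, 2, 3, 3, 2, 1, 1, 2, 3, 2, 1]
for row in transition_matrix(states, s=3):
    print([round(p, 2) for p in row])
```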
Refer to [7, 8] for the difference between simulation and modeling approaches. For more complex cases, you may use so-called multiple-level process models; in these cases, the correlation must be considered.
Typical Applications
Typical Application IO Workload Profiles
Application | Size (Byte) | R/W | Rand./seq. | Dominant |
---|---|---|---|---|
Web file server | 4KB, 8KB, 64KB | 95%/5% | 75%/25% | IOPS |
Database online transaction processing (OLTP) | 8KB | 70%/30% | 100%/0% | IOPS |
Exchange email | 4KB | 67%/33% | 100%/0% | IOPS |
OS drive | 8KB | 70%/30% | 100%/0% | IOPS |
Decision support systems (DSS) | 1MB | 100%/0% | 100%/0% | IOPS |
File server | 8KB | 90%/10% | 75%/25% | IOPS |
Video on demand | 512KB | 100%/0% | 100%/0% | IOPS |
Web server logging | 8KB | 0%/100% | 0%/100% | MBPS |
SQL server logging | 64KB | 0%/100% | 0%/100% | MBPS |
OS paging | 64KB | 90%/10% | 0%/100% | MBPS |
Media streaming | 64KB | 98%/2% | 0%/100% | MBPS |
IO workload characteristics are generally application-dependent in nature. Many access patterns, such as read/write proportions and the handling of writes, differ by application. Nevertheless, the majority of characteristics vary only with the environment (operating conditions). Environment-dependent characteristics include the length of idle intervals, request arrival rate, workload randomness and sequentiality, read and write performance, disk service time and request response time, request size, etc. More importantly, some characteristics of the overall IO workload do remain consistent across applications and environments. A particular one is workload burstiness (i.e., long-range dependence). Block-level workloads, in particular request inter-arrival times and request seek distances, are generally long-range dependent. As a measure of temporal locality in a time series, long-range dependence has a variety of consequences, specifically with regard to predicting overall system and particular resource saturation. Consequently, burstiness should be taken into consideration when designing new storage systems and resource management policies at the various layers of the IO path. As a result, there is no universally good configuration for all workloads, due to the large differences among applications [42].
For file system performance, keeping the file system block size close to the workload’s I/O size can increase the efficiency of the system significantly [43].
Web traffic volume is increasing rapidly, yet some researchers argue that there have been no dramatic changes in web server workload characteristics in the last 10 years [44]. These characteristics include one-time referencing behavior, heavy-tailed file size distributions, non-Poisson aggregate request streams, a high concentration of references, Poisson per-document request streams, and the dominance of remote requests.
A database usually shows significant inter-transaction locality, indicating that real workload transactions are generally dependent on each other. Another observation is that the significant use of sequential accesses allows a prefetch policy to be applied. Sequentiality is a consequence of long-running queries that examine a large number of records, such as a join operation.
OLTP (online transaction processing) workloads are characterized by a large memory footprint combined with a small critical working set, and by their reduced benefit from micro-architectural optimizations. In addition, index searching in OLTP workloads requires a different cache design.
Basic Metrics for Two Typical Workloads
Trace | Duration (s) | Traffic (GB) | Total requests (×10^6) | Avg. R/W size (KB) | R/W traffic ratio | Random read (×10^6) | Random write (×10^6) |
---|---|---|---|---|---|---|---|
OLTP | 43712 | 18.491 | 5.335 | 3.466 | 0.1820 | 0.955 | 2.99 |
Search engine | 3151.3 | 16.369 | 1.055 | 15.509 | 8762.9 | 0.994 | 2.08e-4 |
Traces in File- and Object-Levels
Some Metrics for File-Level Traces
Metrics | Explanation |
---|---|
File types | A file system is used in one of several ways: as long-term storage for persistent data, as a repository for data too large to fit in main memory, as the site of storage for executables, and for storing logs of system usage for accounting and monitoring. The metadata-to-user-data ratio is an important index |
File age | The age is defined as the time since the file’s last reference. |
File access duration | The time from open to close |
User behavior model | Users generate the references that constitute the workload. |
Process and state model | Several aspects are included, such as how a process makes file service requests during its existence; the conditional activation table for access dependency; and the ratio of different operations, such as open, close, write, read, seek, etc. It may be modeled as a closed Markov chain. |
Reference model | How the requests are distributed among the files in the system |
Workload in System Level
Workload (system view) | Load arrival pattern | Data access pattern | Computation pattern |
---|---|---|---|
MapReduce | The arrival time sequence of jobs | HDFS input and output paths; the data sizes for the input, shuffle, and output stages | Input data size, shuffle data size, output data size, job duration, map task time, reduce task time |
Enterprise network storage | The arrival time sequence of application instances or user sessions | Read/write, sequential/random, single/repeated access, file sizes, file types | N.A. |
In this chapter, I presented trace metrics for the performance evaluation and design of storage systems. Some typical applications were given to provide a quick impression of these metrics. In fact, due to the large differences among applications, there is generally no one-size-fits-all system design. I will discuss more details in later chapters.