Hadoop is one of the most popular distributed big data platforms in the world. Besides computing power, the capability of its storage subsystem is a key factor in its overall performance; in particular, MapReduce generates many intermediate file exchanges. This chapter presents the block-level workload characteristics of a Hadoop cluster by considering some specific metrics. The analysis techniques presented can help you understand the performance and drive characteristics of Hadoop in production environments. In addition, this chapter identifies whether SMR drives are suitable for the Hadoop workload.
Users of large systems must deal with explosive data generation, often from a multitude of different sources and formats. Observing and extracting the potential value contained in this large, generally unstructured data leads to great challenges, but also to opportunities in data storage, management, and processing. From a data storage perspective, huge capacity (byte) growth is expected, with HDDs supplying most capacity workloads for the foreseeable future, although SSDs and NVMs are also widely used in time-sensitive scenarios (performance workloads). The interaction of HDDs with these capacity workloads must be well understood so that device algorithms and designs can be matched to them.
From a data management and processing perspective, big data has become a prominent technology, with Hadoop emerging as a leading solution. Originating from Google’s GFS and MapReduce framework, the open-source Hadoop has gained much popularity due to its availability, scalability, and good economy of scale. Hadoop’s performance has been demonstrated for batch MapReduce tasks in many cases [66, 67], though more exploration is ongoing for other applications within Hadoop’s umbrella of frameworks.
To best understand Hadoop’s performance, a common approach is workload analysis [66, 67, 68, 69, 70]. The workload can be collected and viewed at different abstraction levels. The references [37, 35] suggest three classifications: functional, system, and physical (see Figure 2-1 in Chapter 2). However, most workload analysis in this area takes a system view.
Kavulya et al. [71] analyzed 10 months of MapReduce logs from the Yahoo! M45 cluster, applied learning techniques to predict job completion times from historical data, and identified potential performance problems in their dataset. Abda et al. [72] analyzed six-month traces from two large Hadoop clusters at Yahoo! and characterized the file popularity, temporal locality, and arrival patterns of the workloads, while Ren et al. [66] provided MapReduce workload analysis of 2000+ nodes in Taobao’s e-commerce production environment. Wang et al. [67] evaluated Hadoop job schedulers, quantified the impact of shared storage on Hadoop system performance, and thereby synthesized realistic cloud workloads. Shafer et al. [70] investigated the root causes of performance bottlenecks in order to evaluate trade-offs between portability and performance in Hadoop via different workloads. Ren et al. [73] analyzed three different Hadoop clusters to explore new issues in application patterns and user behavior and to understand key performance challenges related to IO and load balance. Many other notable examples exist as well [69, 68, 37].
While predominantly system-focused, some works provide a functional view [69, 70, 73] where average traffic volume from historical web logs is discussed. Some simulators and synthetic workload generators have also been proposed, such as Ankus [66], MRPerf [74], and its enhancement [67].
However, to the best of my knowledge, there is no direct analysis of a Hadoop system at the block level. A block-level analysis is timely and more valuable now that device manufacturers are pressured to develop and improve products to meet capacity or performance demands, such as emerging hybrid or shingled magnetic recording (SMR) drives [5, 40, 4]. When considering SMR drives (and the coming energy/heat-assisted magnetic recording (EAMR/HAMR) drives), which have much higher data density than conventional drives, such an analysis is indispensable because of the features that distinguish them from conventional drives. For example, SMR drives introduce shingled tracks, which make the device far more amenable to sequential writes than random ones, as well as indirect block address mapping, garbage collection, and more, all of which change how these devices interact with user workloads. A big question at the device level is whether the block-level Hadoop workload is suitable for SMR drives.
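To make these features concrete, here is a minimal conceptual sketch (not any vendor's actual firmware) of how a drive-managed SMR device might turn host writes into append-only zone writes through an indirection map. The zone size and the write-amplification estimate are illustrative assumptions only.

```python
class SMRZoneSketch:
    """Toy model of a drive-managed SMR indirection layer (conceptual only).

    Host LBAs are never updated in place; every write is appended to the
    currently open zone and the LBA -> (zone, offset) map is updated.
    Overwritten data leaves stale entries that garbage collection must reclaim.
    """

    def __init__(self, zone_size_blocks=65536):
        self.zone_size = zone_size_blocks
        self.mapping = {}        # host LBA -> (zone id, offset in zone)
        self.zones = [[]]        # each zone is an append-only list of host LBAs
        self.stale_blocks = 0    # blocks invalidated by out-of-place updates

    def write(self, lba, num_blocks):
        for blk in range(lba, lba + num_blocks):
            if blk in self.mapping:
                self.stale_blocks += 1          # old copy becomes garbage
            zone = self.zones[-1]
            if len(zone) >= self.zone_size:     # open a new zone when full
                self.zones.append([])
                zone = self.zones[-1]
            zone.append(blk)
            self.mapping[blk] = (len(self.zones) - 1, len(zone) - 1)

    def write_amplification_estimate(self):
        """Very rough: physical blocks consumed per valid logical block."""
        valid = len(self.mapping)
        return (valid + self.stale_blocks) / max(valid, 1)


# Example: host-visible "in-place" updates become stale blocks awaiting GC.
drive = SMRZoneSketch()
drive.write(0, 1024)       # initial sequential write
drive.write(100, 64)       # update from the host's view -> out-of-place on SMR
print(drive.stale_blocks)  # 64 stale blocks now await garbage collection
```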
The main contributions of this chapter include:

- Defining new block-level workload metrics, such as stacked write update, stacked ROW, and queued seek distance, to capture particular requirements of disk features
- Identifying characteristics of a large organization's internal Hadoop cluster and relating them to findings from other published Hadoop clusters, noting similarities and differences
- Providing suggestions on how drive-level support can bolster Hadoop performance
- Analyzing the applicability of SMR drives to Hadoop workloads
This chapter first covers the overall background of SMR drives, the Hadoop cluster, and the trace collection procedure. It then presents the analysis results based on these metrics.
Hadoop Cluster
Numerous workload studies have been conducted at various levels, from the user perspective to system/framework-level analysis. Most of these analyses capture only certain characteristics of the system. Harter et al. go as far as to take traces from a system view and apply a simple simulator to provide a physical view of the workload [77]. However, leaving the physical view of a system to simulation can miss details that may be critical to understanding a workload. Therefore, it is generally more reasonable to analyze real workloads when considering performance.
The workflow of this Hadoop cluster is to import large, unstructured datasets from global manufacturing facilities into HDFS. Once imported, the data is crunched and organized into more structured data via MapReduce applications. Finally, this data is handed to HBase for real-time analysis of the previously unstructured data. While some of this structured data is kept in HDFS (or moved to another storage location), the unstructured data is deleted daily in preparation for new manufacturing data.
The Hadoop cluster is used to store manufacturing information, such as the data from the 200 million devices per year created by the company. For example, in phase I manufacturing (clean room assembly), as drive components are assembled, data is captured by various sensors at each construction step for every drive. From a manufacturing point of view, creating 50-60 million devices a quarter generates petabytes of information that must be collected, stored, and disseminated within the organization for different needs. Some user scenarios include queries against particular models, quality summaries of a production batch, the average media density of a model, and so on.
Pig/Hive is used to drive MapReduce indirectly: Pig/Hive converts SQL-like statements into Java code that runs as MapReduce jobs. MapReduce then processes the raw test data to generate drill-downs and dashboards for product engineering R&D and failure analysis (FA). The cluster is mainly used for analytics: MapReduce use cases (~80%), plus some HBase online search use cases (~20%).
WD-HDP1 Cluster Configuration
| Component | WD-HDP1: 100 Servers |
|---|---|
| CPU | Intel Xeon E3-1240v2, 4 cores |
| RAM | 32 GB DDR3 |
| OS HDD | WDC WD3000BLFS 10kRPM 300GB |
| Hadoop HDD | WDC WD2000FYYX 7.2kRPM 2-4TB |
| Hadoop Version | 1.2.x |

Data flow in Hadoop system

Trace collection using Ganglia and Blktrace
File System and Write Cache Settings
Node ID | FS | Write cache |
---|---|---|
147 | XFS | 16:21 disabled; 22:25 enabled |
148 | EXT4 | 16:21 disabled; 22:25 enabled |
149 | EXT4 | 16:21 enabled; 22:25 disabled |
150 | XFS | 16:21 enabled; 22:25 disabled |
Workload Metrics Evaluation
In this section, I discuss my observations of disk activity from a sample of nodes within the Hadoop cluster. I then explain how these observations relate to metrics discussed in the previous section. Finally, I conclude with observations from a system level and suggestions for addressing performance issues which could arise from my recorded observations.
Block-Level Analysis
There are several tools to parse and analyze raw block traces. For example, seekwatcher [78] generates graphs from blktrace runs to help visualize IO patterns and performance, and iowatcher [79] graphs the results of a blktrace run. However, those tools cannot capture the advanced metrics defined in Chapter 2, so the Matlab-based tool introduced earlier is applied for these advanced properties.
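As an illustration of the kind of parsing such a tool performs, the following Python sketch extracts per-request records from a blkparse text dump. It assumes the default blkparse output layout and a hypothetical trace file name, so the regular expression may need adapting to the actual format string used during collection.

```python
import re
from collections import namedtuple

Record = namedtuple("Record", "time op lba nblocks")

# Matches default blkparse text lines such as:
#   8,16   1   42   12.345678901  5123  Q   W 123456 + 256 [java]
# Assumed layout: dev cpu seq timestamp pid action rwbs sector + nblocks [process]
LINE = re.compile(
    r"^\s*\d+,\d+\s+\d+\s+\d+\s+(?P<time>\d+\.\d+)\s+\d+\s+"
    r"(?P<action>\w)\s+(?P<rwbs>\S+)\s+(?P<lba>\d+)\s+\+\s+(?P<n>\d+)"
)

def parse_blkparse(path, action="Q"):
    """Yield (time, op, lba, nblocks) for queued ('Q') requests in a blkparse dump."""
    with open(path) as fh:
        for line in fh:
            m = LINE.match(line)
            if not m or m.group("action") != action:
                continue
            op = "R" if "R" in m.group("rwbs") else "W"
            yield Record(float(m.group("time")), op,
                         int(m.group("lba")), int(m.group("n")))

# Example: basic per-operation statistics similar to the tables below.
if __name__ == "__main__":
    recs = list(parse_blkparse("148-21.blkparse.txt"))  # hypothetical trace file name
    for op in ("R", "W"):
        sizes = [r.nblocks for r in recs if r.op == op]
        if sizes:
            print(op, "requests:", len(sizes),
                  "avg size (blocks): %.1f" % (sum(sizes) / len(sizes)))
```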
General View

Average values of workloads for different file systems

Average size and IOPS

Write request size distribution in different ranges
Summary of Observations and Implications
| SMR characteristics | Hadoop observation |
|---|---|
| Write once read many | 40% read and 60% write; relatively low write update ratio; 35.4% stacked ROW ratio |
| Sequential read to random write | Relatively small random write ratio; ROW ratio ~60% and small size, so insignificant impact |
| Out-of-place update | Stacked WOW: 70% within first 10 minutes; usefulness of a large-size SSD/DRAM/AZR cache for performance improvement (write update absorbed in cache) |
| Sequential write | Large write requests (S ≥ 1024 blocks) > 64%; mode ratio: write 65% and read 70%; sequential ratio (S ≥ 1024): write 74% (66%) and read 82% (76%); near sequential ratio (S ≥ 1024): write 87% (65%) and read 95% (85%) |
| Garbage collection (GC) | Frequent small idle times and periodic large idle times; low device utilization; relatively short queue length; relatively low write update ratio; 116.3% frequented WOW (small write amplification); 8.5% write update cmds (small rewritten ratio); dominant partial WOW hits are mainly large-size requests, while full hits are at small-size requests |
Basic Statistics of Trace 148-21 (EXT4-WCD)
| | Combined | Read | Write |
|---|---|---|---|
| Number of requests | 917942 | 373424 | 544518 |
| Average size (blocks) | 435.7 | 216.5 | 586.0 |
| Read IOPS (r/s) | | 4.322 | |
| Write IOPS (w/s) | | | 6.302 |
| Blocks read per second | | 935.802 | |
| Blocks written per second | | | 3693.157 |
Size and LBA Distribution

LBA and size distribution
These findings provide a different view from the “common sense” assumption of sequential access in Hadoop systems [70]. It is true that sequential reads and writes are generated at the HDFS level for large files (the chunk size setting is 128 MB), so large files are split into 128 MB blocks before being stored in HDFS, and the typical access unit at that level is 128 MB. However, when these accesses pass through local file systems such as EXT4 and XFS, the situation becomes much more complex.
IOPS and Throughput

IOPS and throughput
Utilization and Queue Depth
Both device utilization (average 15%) and CPU utilization (average 20%) are low. The average queue depth of the HDD (average value < 0.3) further confirms the low device utilization of this workload. Because the overall workload is generally light and shows a clear periodic trough-and-peak pattern, the system can exploit idle time for potentially beneficial background work (such as garbage collection and defragmentation for space efficiency, or even block reorganization) to improve performance. However, most idle periods are not long enough for large background jobs; how to fully utilize them is an interesting topic for future exploration.
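As a rough illustration of how such idle windows could be identified from a parsed trace, the following sketch scans request arrival times for gaps. The one-second threshold and the synthetic timestamps are assumptions for demonstration only, and completion times are approximated by arrival times.

```python
def idle_periods(times, min_idle_s=1.0):
    """Return (start, length) of inter-arrival gaps longer than min_idle_s seconds.

    times: sorted request arrival timestamps (e.g., from the blkparse sketch above).
    Approximating completion time by arrival time slightly overestimates idleness.
    """
    gaps = []
    for prev, cur in zip(times, times[1:]):
        if cur - prev > min_idle_s:
            gaps.append((prev, cur - prev))
    return gaps

# Synthetic example: bursts of requests separated by pauses of a few seconds.
times = [0.0, 0.1, 0.2, 5.0, 5.1, 30.0, 30.2, 30.3]
for start, length in idle_periods(times):
    print("idle from %.1fs for %.1fs" % (start, length))
# Only windows long enough for a background task (e.g., GC) are actually usable.
```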
Request Sequence
Let’s now look at the IO sequential pattern for both read and write requests. Some relevant concepts were introduced in Chapter 2.
For the queued next seek distance, observe that the mode (most frequent value) is 0, and the mode ratio at N=64 (N=1) is 70.2% (61.6%) for reads and 65.4% (61.3%) for writes. This implies a highly sequential workload (a higher mode ratio at distance 0 indicates more sequentiality). The mean absolute value drops quickly as the queue length grows, which implies that there are many interleaved sequential streams. Therefore, the queue length used for sequence detection in a disk drive should be reasonably large in order to get better performance from the device.
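The following sketch shows one way the queued next seek distance could be computed from a parsed trace, based on my reading of the Chapter 2 definition (the gap from each request's end LBA to the nearest-starting request among the next N queued requests). The sample request list is synthetic.

```python
def queued_seek_distances(reqs, N=16):
    """Queued next seek distance sketch: for each request, the signed LBA gap to the
    closest-starting request among the next N queued requests.
    A zero distance means a perfectly sequential hit.

    reqs: list of (lba, nblocks) tuples in arrival order.
    """
    dists = []
    for i, (lba, n) in enumerate(reqs[:-1]):
        end = lba + n
        window = reqs[i + 1: i + 1 + N]
        # pick the queued request whose start is nearest to the current end
        dists.append(min((start - end for start, _ in window), key=abs))
    return dists

# Example: two interleaved sequential streams look random at N=1 but sequential at N=4.
reqs = [(0, 8), (1000, 8), (8, 8), (1008, 8), (16, 8), (1016, 8)]
for N in (1, 4):
    d = queued_seek_distances(reqs, N)
    print("N=%d mode-0 ratio: %.2f" % (N, d.count(0) / len(d)))
```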

Sequence stream with different N
Sequence Stream and Command Detection
| M2 ≥ 2 | Total streams (w/o) | Total streams (w/) | Total commands (w/o) | Total commands (w/) |
|---|---|---|---|---|
| Read (N=1) | 52827 | 19987 | 282911 | 210081 |
| Write (N=1) | 25633 | 12412 | 359668 | 324007 |
| Read (N=128) | 43682 | 3579 | 308149 | 217373 |
| Write (N=128) | 27303 | 9209 | 393962 | 354384 |
Average Size of Sequence Stream and Command
Op (N) | Avg. cmd size w/o (blocks) | Avg. cmd size w/ (blocks) | Avg. stream size w/ (blocks) |
---|---|---|---|
Read (1) | 259.5 | 308.9 | 3247.3 |
Write (1) | 952.0 | 910.3 | 24851.7 |
Read (128) | 245.4 | 307.9 | 18700.5 |
Write (128) | 797.7 | 882.2 | 31685.2 |

Sequence ratio with S=1024
Therefore, you can see that the total ratio of “sequential” reads and writes (detected sequential read/write commands divided by total read/write commands) is over 82% and 74% (N=256), respectively, without size constraints. The ratios drop to 76% and 65.5%, respectively, with the size constraint (S=1024), which is a more reasonable indicator of the sequential ratio.

Sequence ratio with S=4096 and 8192

Near sequence ratio with S=1024
All the (near-)sequence information discussed above provides a good reference for pre-fetch policy design in disk drives. For example, when designing a sequence detection algorithm, the gap is an important parameter. When designing a hot-data identification algorithm, the definition of hit frequency may need to change slightly for these near-sequence streams; for instance, LBA hits within a certain region can be counted toward marking a hot area for post-read action. The observations also show that the number of interleaved streams is not large, so a relatively small queue may be sufficient to detect sequences (e.g., N ≥ 16; compared with no queue, N=50 increases the detected sequential ratio by around 5%).
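As an illustration of how such a detector might work, here is a greedy (near-)sequential stream detection sketch with a queue depth N and a gap tolerance. The parameters and the sample request list are assumptions, not the exact algorithm used for the tables above.

```python
def detect_streams(reqs, N=16, gap=128):
    """Greedy (near-)sequential stream detection sketch.

    reqs : list of (lba, nblocks) in arrival order.
    N    : how many queued requests may be examined when extending a stream
           (models a small reorder/detection queue in the drive).
    gap  : maximum LBA gap (blocks) still counted as "near sequential";
           gap=0 detects strictly sequential streams.
    Returns a list of streams, each a list of request indexes.
    """
    remaining = list(range(len(reqs)))
    streams = []
    while remaining:
        head = remaining.pop(0)
        stream = [head]
        end = reqs[head][0] + reqs[head][1]
        while True:
            candidates = remaining[:N]      # look only at the next N pending requests
            nxt = next((i for i in candidates
                        if 0 <= reqs[i][0] - end <= gap), None)
            if nxt is None:
                break
            stream.append(nxt)
            end = reqs[nxt][0] + reqs[nxt][1]
            remaining.remove(nxt)
        streams.append(stream)
    return streams

# Example: with a gap tolerance, short skips join into one long stream,
# which is the kind of pattern a pre-fetch policy could exploit.
reqs = [(0, 64), (64, 64), (160, 64), (500, 8), (224, 64)]
strict = detect_streams(reqs, gap=0)
near = detect_streams(reqs, gap=128)
print(len(strict), "strict streams;", len(near), "near-sequential streams")
```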
Write Update
For the frequented write update metric, 86% of the accessed (written) blocks are written only once. A decreasing percentage of blocks are written multiple times, which means only a small portion of hot blocks are rewritten. Write amplification is roughly 116.3% if all rewritten data is placed in a new location.
For the timed write update metric, written blocks account for 80% of the total accessed blocks (read and write), and updated blocks (written at least twice) are only 6.8%. Write commands are 59% of total commands, and update commands are 22.5%. The average size of write commands is around 586 blocks, while the average size of the overlapped portion of update commands is 73.3 blocks.
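For reference, the following sketch computes per-block write-frequency statistics from a parsed trace in the spirit of the frequented write update metric. The write-amplification estimate shown is one plausible reading of “all rewritten data is put into a new place,” not necessarily the exact formula used in this chapter.

```python
from collections import Counter

def write_update_stats(reqs):
    """Per-block write-frequency statistics (frequented write update, sketch).

    reqs: iterable of (op, lba, nblocks); only 'W' requests are counted.
    Returns (fraction of blocks written exactly once,
             fraction of written blocks that are rewrites,
             total block writes / unique blocks written).
    The last value treats every rewrite as consuming fresh space.
    """
    freq = Counter()
    for op, lba, n in reqs:
        if op != "W":
            continue
        for blk in range(lba, lba + n):
            freq[blk] += 1
    unique = len(freq)
    total = sum(freq.values())
    once = sum(1 for c in freq.values() if c == 1)
    return once / unique, (total - unique) / total, total / unique

# Tiny example: one block is rewritten three times.
reqs = [("W", 0, 4), ("R", 0, 4), ("W", 2, 1), ("W", 2, 1), ("W", 2, 1)]
once_ratio, update_ratio, amplification = write_update_stats(reqs)
print("written once: %.0f%%  rewrites: %.0f%%  amplification: %.2fx"
      % (100 * once_ratio, 100 * update_ratio, amplification))
```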
Statistics for Logical Stack Distance (LSD)
| | LSD ≤ 1000 | LSD ≤ 2000 | LSD ≤ 4000 | Overall |
|---|---|---|---|---|
| Partial, Full | 2.8%, 14.9% | 3.5%, 17.6% | 4.4%, 19.9% | 7.5%, 27.3% |

Write update LBA and size distribution
Now let’s check the stacked write update ratio. Based on the write IOPS, a stack distance of 4000 corresponds to roughly 10 minutes. Within this window, you can see a 72.9% full write hit ratio and a 58.7% partial write hit ratio of commands. With knowledge of the cache size and structure, you can estimate the hit ratio. Since the stack distance is generally longer than the DRAM cache can cover, updates on disk cannot be avoided; therefore, a “caching/buffering” location beyond DRAM, such as a larger SSD cache, is necessary for performance improvement.
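A simple way to approximate the stacked write update ratio from a parsed trace is sketched below. It measures stack distance in commands since the last write to each block, which is my reading of the metric; the sample request list is synthetic.

```python
def stacked_write_hits(reqs, max_distance=4000):
    """Stacked write update sketch: count writes that overlap an earlier write
    issued no more than max_distance commands ago (logical stack distance).

    reqs: list of (op, lba, nblocks) in arrival order. The per-block history keeps
    only the command index of the most recent write. Full hits (all blocks recently
    written) and partial hits (only some blocks) are counted separately.
    Returns (full hit ratio, partial hit ratio) over all write commands.
    """
    last_write = {}                 # block -> index of command that last wrote it
    full = partial = writes = 0
    for idx, (op, lba, n) in enumerate(reqs):
        if op != "W":
            continue
        writes += 1
        hits = sum(1 for blk in range(lba, lba + n)
                   if idx - last_write.get(blk, -10**9) <= max_distance)
        if hits == n:
            full += 1
        elif hits > 0:
            partial += 1
        for blk in range(lba, lba + n):
            last_write[blk] = idx
    return full / writes, partial / writes

# Example: if most rewrites land within ~4000 commands (~10 minutes at the
# observed write IOPS), a cache sized for that window would absorb them.
reqs = [("W", 0, 8), ("R", 0, 8), ("W", 0, 8), ("W", 4, 8)]
print(stacked_write_hits(reqs, max_distance=4000))
```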

Frequented, timed, and stacked write update (from top to bottom)
Read on Write (ROW)

ROW ratio

ROW hit and size distribution
Write Cache Enabled vs. Disabled
Write cache is an important feature of HDDs. Many studies show that performance with write cache enabled (WCE) can be significantly higher than with write cache disabled (WCD) due to the write-back policy, particularly for small-size random write workloads. However, data reliability concerns (e.g., a power loss causes dirty data in the DRAM write cache to be lost) lead most data centers to disable this feature.
Basic IO metrics for WCD vs. WCE
| Average | IOPS (21) | IOPS (22) | Throughput KB/s (21) | Throughput KB/s (22) | Size blocks (21) | Size blocks (22) |
|---|---|---|---|---|---|---|
| 148 | 6.3 | 7.0 | 1846.6 | 2064.4 | 586.0 | 589.5 |
| 149 | 6.6 | 6.8 | 1782.2 | 1881.9 | 541.8 | 556.6 |
In heavy workload cases, some interleaved sequential streams will be treated as “random” rather than sequential, increasing the ratio of random writes, which is harmful to overall disk performance. Therefore, non-volatile memory or DRAM protection technologies may be applied in order to enable the write cache, which becomes necessary for heavy workloads. Additionally, for green data center environments where the bottleneck is not the HDDs (at least under normal workloads), another benefit of WCE is energy savings due to fewer mechanical accesses to the HDDs.
System-Level View
Basic Information Collected from HDFS Logs
| Name | Duration | Total IO requests ×10⁵ (Ave) | Total IO requests ×10⁵ (Max) | Total IO requests ×10⁵ (Min) | Avg IO size, Total (MB) | Avg IO size, Read (MB) | Avg IO size, Write (MB) |
|---|---|---|---|---|---|---|---|
| wdc-x1 | 68 d | 0.627 | 5.476 | 0.125 | 60.96 | 3.14 | 99.75 |
| wdc-x2 | 98 d | 0.396 | 5.661 | 0.107 | 47.83 | 3.09 | 93.2 |

| Name | R/W ratio (Avg) | R/W ratio (Max) | R/W ratio (Min) | IOPS (Total) | IOPS (Read) | IOPS (Write) | Throughput Total (MB/s) | Throughput Read (MB/s) | Throughput Write (MB/s) |
|---|---|---|---|---|---|---|---|---|---|
| wdc-x1 | 0.0348 | 0.125 | 0.0054 | 0.726 | 0.536 | 0.19 | 18.48 | 0.632 | 17.85 |
| wdc-x2 | 0.0383 | 0.33 | 0.007 | 0.458 | 0.325 | 0.133 | 12.11 | 0.442 | 11.67 |
Notice that all daemons within the Hadoop framework write runtime logs throughout the course of the day. Depending on the configuration, these can be written to HDFS or to the local FS. If logs are written to the local FS, they account for some random IO to the HDD. Daemons of other frameworks that sit on top of Hadoop's core framework can generate more random IO at the device level. HBase has similar logs from its region servers which must be written. Designed for real-time data processing, HBase spawns small MapReduce jobs which may access small amounts of data, causing random read IO in an HDFS instance. Frameworks like Hive and other KV stores that sit on top of Hadoop have similar logging structures which can potentially cause random IO amplification down at the device level.
Java-based (Apache) MapReduce must write temporary intermediate files to disk during MapReduce jobs, some with repeating process IDs (PIDs) and others with single-use or rarely reused PIDs. These intermediate files can either be dumped to locations in HDFS (which attempts to serialize the IO if possible) or to the local FS, where they become random IO events. This is configurable via the mapreduce.task.tmp.dir parameter.
After consulting the Hadoop XML configuration files, one thing I noticed was that the temporary/intermediate space for all HDFS and MapReduce workloads was configured to store its output data on the HDFS data drive. Because the Hadoop framework is written in Java, it must contend with the properties of the Java Virtual Machine (JVM), meaning that memory addresses have no meaning between JVMs. Hence, when a MapReduce task passes data to another MapReduce task, it first writes the data to a temporary file for the other JVM task to read. Additionally, when a MapReduce job launches, it must send the configuration parameters and executable JAR file to the TaskTrackers so that they can correctly spawn map and reduce tasks. This data, too, was being stored in that configured local temporary space.
Couple this IO with the observation that the cluster runs close to 5,500 MapReduce jobs per day, many of which are small task-count jobs (due to the 20% HBase analysis performed by users), and the amount of random IO generated on these HDDs becomes very large. However, this IO can be mitigated and was not solely responsible for the high level of random IO seen in the block-level traces; HDFS itself also contributes to the random IO seen in those traces.

Basic curves from HDFS logs
Some Further Discussions
In this chapter, I presented the block-level workload characteristics of the Hadoop cluster by considering some specific metrics. The analysis techniques presented can help others understand the performance and drive characteristics of Hadoop in their production environments. Using logs collected by blktrace, I conducted a comprehensive analysis that identified new workload patterns with some unexpected behaviors. I showed that, while sequential and near-sequential requests represent the majority of the IO workload, a non-trivial amount of random IO exists in the Hadoop workloads. Additionally, the write update ratio on the drives is not very high, which indicates that only a small write amplification would occur if an out-of-place write policy were applied. Also note that the ROW ratio is small, which means WORM does not generally hold for the cluster's workload. All these findings imply relatively high spatial locality and lower-than-expected temporal locality, which show that Hadoop is generally a suitable application for SMR drives. However, further improvements on both the Hadoop and drive sides are required.
Looking critically at the configuration of a Hadoop system, it is possible to fine-tune and minimize some, but not all, of the observed random IO. Factors that add to this random IO are several types of framework logging, intermediate files generated by MapReduce and HBase workloads, and the metadata files of HDFS chunks. The verbosity of the Hadoop daemon log files can be turned down to generate less data, and these logs, along with temporary MapReduce output, can be written to a storage location that does not impact HDFS chunk IO operations. Candidates include HDFS itself (rather than local storage), which will attempt to make the IO more sequential, or another physical/logical block device better suited to random block IO (while maintaining data locality). Some basic curves derived from the HDFS logs are shown in Figure 8-16.
However, the final piece of the observed random IO is a consequence of the HDFS write/update mechanism and cannot be easily mitigated, because it currently must reside with the committed HDFS chunks on the HDFS data drives. The small IO caused by chunk metadata must then be serviced by a capacity block storage device that can either transform this small random IO into larger sequential access patterns or is simply designed to handle random IO. Without a device-level view, this overhead might be dismissed as a problem elsewhere in the system rather than at the HDD device level, where some of these issues are quite simple to correct given the proper insight. For instance, a large DRAM buffer would be very useful for the random read accesses, and non-volatile memory (e.g., NAND or a conventional zone) for the random write accesses.
Hence, it is reasonable to study an integration of HDFS and the local file systems that takes device properties into account, such as a globally aware design in which local metadata does not disrupt the sequential writes of HDFS, and in which metadata and “non-critical” intermediate/temporary data are assigned to appropriate disk locations. HDFS could then take responsibility for file/block accesses on the DataNode, which may make drive operation more efficient. The metadata location on the device should also be carefully designed.
In addition, the drive-level cache and system-level cache could be unified with consideration of the Java mechanism, such that some temporary data is absorbed by the unified cache/buffer instead of causing mechanical disk accesses. This unification could be difficult because of HDFS's current simple cache design and the lack of a direct interface between the JVM and drives. However, it is possible for drive manufacturers to provide such an application-oriented interface for communication.
Furthermore, some intelligence in the drive to understand the nature of the access pattern (e.g., random or sequential, access dependency) could be useful, such that the drive can immediately switch to the optimal algorithm/behavior for better performance. An application-level hinting scheme with interaction between host and drive, or a self-learning algorithm inside the drive, could be helpful.
In conclusion, the device-level analysis of the in-house Hadoop cluster has provided new insights into how Hadoop interacts with the underlying file system and handles its lower-level IO. These insights motivate me to continue studying how the workload characteristics of big data frameworks and application tuning can help the performance of storage devices in the current data-driven climate. This study is also roughly applicable to Spark, an in-memory MapReduce-style system. For example, a detailed workload analysis can provide insights into applying SCM to Spark systems, which would benefit a cost-efficient design of hybrid SCM-DRAM structures.