Case Study: Hadoop

Hadoop is one of the most popular distributed big data platforms in the world. Besides computing power, its storage subsystem capability is also a key factor in its overall performance. In particular, there are many intermediate file exchanges for MapReduce. This chapter presents the block-level workload characteristics of a Hadoop cluster by considering some specific metrics. The analysis techniques presented can help you understand the performance and drive characteristics of Hadoop in production environments. In addition, this chapter also identifies whether SMR drives are suitable for the Hadoop workload.

Users of large systems must deal with explosive data generation, often from a multitude of different sources and formats. The observation and extraction of potential value contained in this large, generally unstructured data lead to great challenges, but also opportunities in data storage, management, and processing. From a data storage perspective, huge capacity (byte) growth is expected, with HDDs supplying most capacity workloads for the foreseeable future, although SSDs and NVMs are also widely used in time-sensitive scenarios (performance workload). The interaction of HDDs with these capacity workloads must be well understood so that these devices algorithms.

From data management and processing, big data arises as a trendy technology, while Hadoop emerges as a leading solution. Originating from Google’s GFS and MapReduce framework, the open-source Hadoop has gained much popularity due to its availability, scalability, and good economy of scale. Hadoop’s performance has been illustrated for batch MapReduce tasks in many cases [66, 67], though more exploration is ongoing for other applications within Hadoop’s umbrella of frameworks.

To best understand Hadoop’s performance, a common approach is workload analysis [66, 67, 68, 69, 70]. The workload can be collected and viewed in different abstract levels. The references [37, 35] suggest three classifications: functional, system, and physical (see Figure 2-1 in Chapter 2). However, most workload analysis works in this area are studied with a system view.

Kavulya et al. [71] analyzed 10 months of MapReduce logs from the Yahoo! M45 cluster, applied learning techniques to predict job completion times from historical data, and identified potential performance problems in their dataset. Abda et al. [72] analyzed six-month traces from two large Hadoop clusters at Yahoo! and characterized the file popularity, temporal locality, and arrival patterns of the workloads, while Ren et al. [66] provided MapReduce workload analysis of 2000+ nodes in Taobao’s e-commerce production environment. Wang et al. [67] evaluated Hadoop job schedulers and quantified the impact of shared storage on Hadoop system performance, and therefore synthesize realistic cloud workloads. Shafer et al. [70] investigated the root causes of performance bottlenecks in order to evaluate trade-offs between portability and performance in Hadoop via different workloads. Ren et al. [73] analyzed three different Hadoop clusters to explore new issues in application patterns and user behavior and to understand key performance challenges related to IO and load balance. And many other notable examples exist as well [69, 68, 37].

While predominately system-focused, some works provide a functional view [69, 70, 73] where average traffic volume from historical web logs are discussed. Some simulators and synthetic workload generators are also suggested, such as Ankus [66], MRPerf [74], and its enhancement [67].

However, to my best knowledge, there is no direct analysis work for a Hadoop system at the block level. A block-level analysis is timely and more valuable now that device manufacturers are pressured to develop/improve products to meet capacity or performance demands, such as emerging hybrid or shingled magnetic recording (SMR) drives [5, 40, 4]. When considering the SMR drives (and the coming energy/heat assistant magnetic recording (EAMR/HAMR) drives) which have much higher data density than conventional drives, such an analysis is indispensable, due to their distinguished features from the conventional drives. For example, SMR drives introduce characteristics such as shingled tracks, which make the device more amenable to sequential writes over random, as well as indirect block address mapping, garbage collection, and more, which all modify how these devices interact with user workloads. A big question at the device level is if the block-level Hadoop workload is suitable in SMR drives.

In this chapter, I analyze Hadoop workloads at a block device level and answer this big question. Aided by blktrace, I provide a clear view of Hadoop’s behavior in storage devices [75, 76]. The main contribution lies in the following aspects:

Defining some new block-level workload metrics, such as stacked write update, stacked ROW, and queued seek distance to fulfill particular requirement of disk features
Identifying some characteristics of a big organization’s internal Hadoop cluster, and relating them to findings of other published Hadoop clusters with similarity and difference
Providing some suggestions on Hadoop performance bolstered from drive-level support
Providing analysis for the applicability of SMR drives in Hadoop workloads

This chapter will first cover the overall background of SMR drives and the Hadoop cluster and trace collection procedure. Then it will cover the analysis results based on these metrics.

Hadoop Cluster

Numerous workload studies have been conducted at various levels, from the user perspective to system/framework-level analysis. Most of these types of analysis only capture certain characteristics of the system. Harter et al. go as far as to take traces at a system view and apply a simple simulator to provide a physical view of the workload [77]. However, leaving the physical view of a system to simulation can miss details that may be critical to understanding a workload. Therefore, it is generally more reasonable to analyze the real workloads when considering performance.

The workflow of a Hadoop cluster is to import large and unstructured datasets from global manufacturing facilities into the HDFS. Once imported, the data is crunched and then organized into more structured data via MapReduce applications. Finally, this data is given to HBase for real-time analysis of the once previously unstructured data. While some of this structured data is kept on the HDFS (or moved to another storage location), the unstructured data is deleted daily in preparation to receive new manufacturing data.

The Hadoop cluster is used to store the manufacturing information, such as the data from 200 million devices per year created by the company. For example, in phase I manufacturing (clean room assembly), while drive components are assembled, data is captured by various sensors at each construction step for every drive. From a manufacturing point of view, creating 50-60 million devices a quarter creates petabytes of information that must be collected, stored, and disseminated in the organization for different needs. Some user scenarios include the query to the particular models, the summarization of quality of one batch, the average media density of one model, etc.

PIG/Hive is used for MapReduce indirectly, actually PIG/Hive converts SQL-like code to Java code to run MapReduce. MapReduce then does SQL-like statements to process raw test data to generate drill-downs and dashboards for product engineering R&D and failure analysis (FA). However, the cluster is mainly for analytics: MapReduce use cases (∼80%) and also some Hbase online search use cases (∼20%).

The WD cluster configuration is shown in Table 8-1. A general data flow is shown in Figure 8-1, where HDFS has native configurable logging structure, while datanode needs the aid of blktrace. The read requests from client-obtained metadata information from the namenode and then namenode sends block ops to devices. Thus the client gets the corresponding data from datanotes. The write requests will change both data in datanodes and the metadata in namenodes.

Table 8-1

WD-HDP1 Cluster Configuration

WD-HDP1: 100 Servers
CPU	Intel Xeon E3-1240v2, 4 Core
RAM	32 GB DDR3
OS HDD	WDC WD3000BLFS 10kRPM 300GB
Hadoop HDD	WDC WD2000FYYX 7.2kRPM 2-4TB
Hadoop Version	1.2.x

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig1_HTML.jpg — Figure 8-1
Data flow in Hadoop system

The data collection structure is illustrated in Figure 8-2. The task tracker runs above the local file system (FS). The jobs will finally convert into the block-level accesses to the local devices in each datanode. The tool blktrace actually collects the data in the block IO layer from the local devices (see Appendix B for details). I ran blktrace (sometimes together with iostat and/or iotop) repeatedly on four datanodes with different file systems and write cache settings in tens of batches, where two nodes used XFS and another two used EXT4 as local file systems, with each run lasting for few hours to few days. I collected hundreds of GB of traces (100+ pieces) representing total 1500+ hours of cluster operation from May 2014 to January 2015.¹ I switched the write cache on or off to collect data from different situations. The workload to these four nodes was relatively stable based on Ganglia’s network in/out metrics except for a few pieces. In particular, I focus on the trace collected in January 2015 with batch ID from 16 to 25 in this book. They are all one-day duration traces with the settings as show in Table 8-2.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig2_HTML.jpg — Figure 8-2
Trace collection using Ganglia and Blktrace

Table 8-2

File System and Write Cache Settings

Node ID	FS	Write cache
147	XFS	16:21 disabled; 22:25 enabled
148	EXT4	16:21 disabled; 22:25 enabled
149	EXT4	16:21 enabled; 22:25 disabled
150	XFS	16:21 enabled; 22:25 disabled

Workload Metrics Evaluation

In this section, I discuss my observations of disk activity from a sample of nodes within the Hadoop cluster. I then explain how these observations relate to metrics discussed in the previous section. Finally, I conclude with observations from a system level and suggestions for addressing performance issues which could arise from my recorded observations.

Block-Level Analysis

There exist some tools to parse and analyze raw block traces. For example, seekwatcher [78] generates graphs from blktrace runs to help visualize IO patterns and performance. Iowatcher [79] graphs the results of a blktrace run. However, those tools cannot capture the advanced metrics defined in Chapter 2. Thus you can apply the Matlab-based tool introduced before for these advanced properties.

General View

Figure 8-3 shows some average values of request size, IOPS, and throughput. From the figure, you can observe that the write size and IOPS are more related to file system type, as the difference between EXT4 and XFS is obvious. However, the read size and IOPS seem to be more related to batch, as different batches may have different read sizes. Note that the overall throughput is similar for each batch, which means the workload to each node is nearly even.² By removing the maximum and minimum values, you can look at the average value and standard derivatives in Figure 8-4. The size pattern for write for different file system types is clearly illustrated. Figure 8-5 further shows that the write requests’ major size range is [1–127] blocks and 1024 blocks. The sum of ratios in the two ranges is almost equal to 1, which leads to the near symmetric curve around 0.475.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig3_HTML.jpg — Figure 8-3
Average values of workloads for different file systems

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig4_HTML.png — Figure 8-4
Average size and IOPS

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig5_HTML.jpg — Figure 8-5
Write request size distribution in different range

In order to get more insight views into the trace, let’s choose a typical piece of a trace (Node 148 and batch 21) with its basic IO properties close to the average value described earlier. Table 8-4 gives the basic information about the workload. Next, I discuss this trace deeply in several aspects, which are summarized in Table 8-3.

Table 8-3

Summary of Observations and Implications

SMR characteristics	Hadoop observation
Write once read many	40% read and 60% write Relatively low write update ratio 35.4% stacked ROW ratio
Sequential read to random write	Relatively small random write ratio ROW ratio ˜60% and small size, so insignificant impact
Out-of-place update	Stacked WOW: 70% within first 10 minutes Usefulness of large-size SSD/DRAM/AZR cache to performance improvement (write update in cache)
Sequential write	Large size write requests (S ≥1024 blocks) > 64% Mode ratio: write 65% and read 70% Sequential ratio (S ≥1024): write 74% (66%) and read 82% (7%); Near sequential ratio (S ≥1024): write 87% (65%) and read 95% (85%)
Garbage collection (GC)	Frequent small idle time and large idle time periodically Low device utilization Relatively short queue length Relatively low write update ratio 116.3% frequented WOW small write amplification 8.5% write update cmds small rewritten ratio The dominant partial WOW hits are mainly large-size requests, while full hits at small-size requests

Table 8-4

Basic Statistics of Trace 148-21 (EXT4-WCD)

	Combined	Read	Write
Number of blocks	917942	373424	544518
Average size (block)	435.7	216.5	586.0
Read IOPS (r/s)			4.322
Write IOPS (w/s)			6.302
Blocks read per second			935.802
Blocks written per second			3693.157

Size and LBA Distribution

The overall request LBA distribution is shown in Figure 8-6a. Figure 8-6b further illustrates the size distribution curve, from which you can see that minimum requests are 8 blocks while the maximum size is 1024 blocks. For write requests, the total ratio of 8-block and 1024-block requests is almost 90%. The ratio of large size write requests (≥1024 blocks) is greater than 55%, so write requests are more sequential. For reads, the ratio of requests with size ≤256 is around 50%, and the size distribution is more various. Thus, the read request is generally more random than the write requests. The LBA vs. size distribution figures further confirm this observation. In fact, for writes, you can see that most large size requests lie intensively in few ranges; middle size requests are very few, while for reads, the size distribution is more diverse.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig6_HTML.jpg — Figure 8-6
LBA and size distribution

These findings provide a different view from the “common sense” of sequential access for Hadoop system [70]. It is true that sequential reads and writes are generated at the HDFS level for large-size files (the settings for chunk size is 128MB), so large-size files are split into 128MB blocks and then stored into the HDFS, and the minimum access unit is therefore 128MB generally. However, when these accesses interact with local file systems such as EXT4 and XFS, the situation becomes much more complex.

IOPS and Throughput

The average value of these two metrics depends on the statistical time window/interval. As an example, burstiness is very commonly observed in this trace, which leads to relatively rigid curves and high maximum IOPS for small time intervals, and a relatively smooth curve with low maximum IOPS for a large time interval. The IOPS for reads are generally higher than that of writes; however, the throughput of reads is generally lower than that of writes. In comparing the 600-second interval average with the 6-second interval average shown in Figure 8-7, the average value has a large difference. Note that the read IOPS in the 6-second interval figure are higher than 400, which is not an “error.” The reason is due to the near sequential behavior described earlier. To verify the tool, I compared the parsed curve with the one collected by iostat and iotop, and obtained a consistent result.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig7_HTML.png — Figure 8-7
IOPS and throughput

Utilization and Queue Depth

Both device workload (average 15%) and CPU workload (average 20%) are low. The average queue depth of the HDD (average value <0.3) further shows the low device utilization of this workload. As the overall workload is generally low, and the “periodic” bottom-peak curve is illustrative, the system therefore can exploit idle time to get the most potential benefits (such as garbage collection and defragmentation for space efficiency, or even block reorganization) for performance improvements. However, most idle time is not long enough for large background jobs. How to fully utilize these idle times is an interesting topic for future exploration.

Request Sequence

Let’s now look at the IO sequential pattern for both read and write requests. Some relevant concepts were introduced in Chapter 2.

For queued next seek distance, observe that the value of the mode (most frequent value) is equal to 0, and the mode ratio at N=64 (N=1) for read and write is 70.2% (61.6%) and 65.4% (61.3%), respectively. This implies a highly sequential workload (where higher is better). The mean absolute value drops quickly with queue length, which implies that there are many interleaved sequence streams. Therefore, the queue length used for sequence detection in a disk drive should be reasonably large to see better performance from the device.

Figure 8-8 illustrates sequence streams with different N, starting from 2 (and doubling to 256). The specific values for N=1 and N=128 are given in Table 8-5 (“w/” and “w/o” denote the cases with or without size constraints). You can see that the streams with only two requests dominate, while the streams with larger request numbers are relatively few. When a size constraint is enforced (S = 1024 blocks), the dominate N for read is moved to the value ≥3.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig8_HTML.png — Figure 8-8
Sequence stream with different N

Table 8-5

Sequence Stream and Command Detection

M 2 ≥ 2	Total streams		Total commands
N=1	w/o	w/	w/o	w/
Read	52827	19987	282911	210081
Write	25633	12412	359668	324007
N=128	w/o	w/	w/o	w/
Read	43682	3579	308149	217373
Write	27303	9209	393962	354384

However, for the write the value of N is still 2. This shift indicates the request size of reads in sequential streams is generally smaller than that of writes. This is further confirmed by the average request size shown in Table 8-6, so the size of sequence stream/commands of writes is much larger than that of reads. Therefore, the difference for read/write between size-constraint (S = 1024 blocks) and non-size-constraint requests is shown to be significant. It is noted with increased queue length N, the total stream number is generally decreased while the average stream size is increased and average command size is decreased.

Table 8-6

Average Size of Sequence Stream and Command

Op (N)	Avg. cmd size w/o (blocks)	Avg. cmd size w/ (blocks)	Avg. stream size w/ (blocks)
Read (1)	259.5	308.9	3247.3
Write (1)	952.0	910.3	24851.7
Read (128)	245.4	307.9	18700.5
Write (128)	797.7	882.2	31685.2

The total sequence stream detected is illustrated in Figure 8-9. This figure shows that the write request is much more sequential than read (considering the ratio). Note that for this figure, “combined” is not a simple sum of “read” and “write;” it is detected in the FIFO rule with all commands. The value displayed in Table 8-5 is consistent to the mode counter of queued next seek distance. In fact, you can easily calculate that the frequency of mode is the total command number of streams minus the total number of streams.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig9_HTML.png — Figure 8-9
Sequence ratio with S=1024

Therefore, you can see that total ratio of “sequential” read/write is over 82% and 74% (N=256), respectively (detected read/write sequential commands/total write/read commands), without size constraints. The ratio is reduced to 76% and 65.5%, respectively, with the size constraint (S =1024), which is more reasonable to indicate the sequence ratio.

With increased S, the ratio of sequential read/write commands drops slightly, as shown in Figure 8-10. It shows that the sequence of write is rather strong, as the sequence streams are generally large, so the ratio of writes is reduced from 82% to 62% and 58% when S is changed from 1024 to 4096 and 8196, respectively.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig10_HTML.png — Figure 8-10
Sequence ratio with with S=4096 and 8192

As shown in Figure 8-11, near-sequential for read is very strong. The total ratio for writes and reads is over 87% and 95%, respectively (detected write or read sequential commands/total write or read commands), without considering size constraint. It reduces to 65% and 85%, respectively, with the size constraint, which is more reasonable to indicate the sequence ratio of reads is higher than that of writes. With increased S, the ratio of near-sequential read/write commands slightly drops, similar to that of sequence ratio. Note that the distance is generally larger than 8 blocks; when δd ≥ 16, the increment is significant.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig11_HTML.png — Figure 8-11
Near sequence ratio with S=1024

All the (near) sequence information discussed above provides a good reference for pre-fetch policy design for disk drives. For example, when considering a sequence detection algorithm, the gap is an important parameter. When designing a hot data identification algorithm, the definition of hit frequency may be changed slightly for these near sequence streams. For instance, the LBA hit within a certain region can be counted as a hot area to take post-read action. The observations also tell that the interleaved stream number is not large and a small queue may be good enough to detect the sequence (e.g., N≥16, compared with non-cache, N=50 can increase around 5% sequence).

Write Update

For frequent write updates, 86% of accessed blocks (maybe repeated) are only written once. A decreasing percentage of written blocks are written multiple times, which means only a small portion of hot blocks are rewritten. Write amplification is roughly 116.3% if all rewritten data is put into a new place.

For a timed write update, total write blocks occur 80% of total access blocks (read and write), and the updated blocks (at least write twice) are only 6.8%. Total write commands are 59% of the total commands and the update commands are 22.5%. The average size of write commands is around 586 blocks and the average size of overlapped blocks of update commands is 73.3.

Before you look at the stacked write update, check the stacked distance first. You will find that small requests have a high probability to be full hits as opposed to large sized request (since the hit size is much smaller than the average size). In fact, the average overlapped sizes for partial hits (only a part of blocks are the same) and full hits (two requests are the same) are 139.4 and 55.1 blocks, respectively. You can conclude that a partial hit is more likely to happen for large size requests by using the numbers in Tables 8-4 and 8-7.

Table 8-7

Statistics for Logical Stack Distance (LSD)

	LSD≤1000	LSD≤2000	LSD≤4000	Overall
Partial, Full	2.8%, 14.9%	3.5%, 17.6%	4.4%, 19.9%	7.5%, 27.3%

A further check can be obtained by considering the full and partial hits separately by referring to Figure 8-12 for the hit frequency vs. LBA and size. This confirms the dominant partial hit at large size, while full hit at small size. The hits of the requests with medium-size (64-1023 blocks) are much less.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig12_HTML.png — Figure 8-12
Write update LBA and size distribution

Now let’s check stacked write update ratio. Based on write IOPS, distance 4000 is roughly 10 minutes. In this period, you can see 72.9% full write hit and 58.7% partial write hit of commands. With the knowledge of cache size and structure, you can estimate the hit ratio. As the stack distance is generally longer than DRAM cache length, updates on disk cannot be avoided. Therefore, a “caching/buffering” location on the media is necessary, so a larger-size SSD cache is necessary for performance improvement.

The third plot in Figure 8-13 shows that over 65% of the overlapped request size happens in the first 1% of overall time for partial/full hits, which further confirms the necessity of large SSD cache. As conventional disk drives have not provided such a big cache, it may be beneficial to implement this cache via a hybrid drive (SSD+HDD) or at a higher level in the system, such as array controllers or aggregate controllers. For SMR drive, a convenient way is to allow conventional zones accompanying with shingled zones for random write access.³ It is also called random access zone (RAZ) in [5, 4].

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig13_HTML.png — Figure 8-13
Frequented, timed, and stacked write update (from top to bottom)

Read on Write (ROW)

ROW ratio is mainly used to check if “write once read many (WORM)” is possible. You will find that the total ROW ratio is around 35.4% only, which implied that the written data is less likely to be read multiple times (i.e., a ratio much larger than 1). You can further check if the hit is only for small size requests. Figures 8-14 and 8-15 show that the written data is less likely to be immediately read back for most cases for similar reasons as explained in previous chapters.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig14_HTML.jpg — Figure 8-14
ROW ratio

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig15_HTML.png — Figure 8-15
ROW hit and size distribution

Write Cache Enabled vs. Disabled

Write cache is an important feature of HDDs. Many studies show that the performance of WCE (write cache enabled) can be increased significantly over WCD (write cache disabled) due to write-back policy, in particular for some small-size random write workloads. However, the data reliability concern (e.g., power loss leads to dirty data loss in DRAM write cache) results in most data centers disabling this feature.

Here we compare four traces (148-21, 148-22, 149-21, and 149-22) as nodes 148 and 149 always use different write cache setting and there is a setting change between batch ID 21 and 22 (see Table 8-2). You can see that there is no essential difference among write IOPS, throughput, and average size as shown in Table 8-8. Meanwhile, the ratio of large write requests (e.g., 1024 blocks) almost remains the same. Nevertheless, you can observe that the (near) sequence ratio of WCD is slightly smaller than that of WCE for both reads and writes (1% of absolute value). These factors show that the HDFS does not change its behavior according to HDD’s write cache settings, even though the local file system may respond to it. However, as the workload is far away from the drive’s boundary capability, the response is not significant.

Table 8-8

Basic IO metrics for WCD vs. WCE

Average	IOPS		Throughput KBPS		Size blocks
	21	22	21	22	21	22
148	6.3	7.0	1846.6	2064.4	586.0	589.5
149	6.6	6.8	1782.2	1881.9	541.8	556.6

In the heavy workload cases, some interleaved sequential streams will be considered as “random” rather than sequential, causing an increased ratio of random writes, which is harmful to the overall disk performance. Therefore, some non-volatile memory or DRAM protection technologies may be applied in order to enable write cache, which becomes necessary for heavy workloads. Additionally, for green environments in data centers where the bottleneck is not the HDDs (at least under normal workloads), another benefit to WCE is energy savings due to less mechanical accesses to the HDDs.

System-Level View

In this section, I present a brief analysis of how the random IO, observed in the previous section, of the block level traces led to a better analysis of the Hadoop cluster’s IO patterns. First, let’s look at the workload characteristics collected from HDFS logs, as shown in Table 8-9. Recall the system level analysis mentioned earlier; I captured data generation and deletion rates, job creation rates and characteristics, etc. with the low level IO obfuscated from us. Arguably, this is how a high-level framework should behave, fully or mostly insulated from the hardware devices below. However, those who have to maintain the full operational stack must be aware of the entire system, not just the user-level framework. Below, I analyze the randomness from three aspects: the Hadoop framework, the MapReduce policy, and HDFS mechanism.⁴

Table 8-9

Basic Information Collected from HDFS Logs

Name	Duration	Total IO Requests			Average IO Size(MB)
		Ave	Max	Min	Total	Read	Write
wdc-x1	68 d	0.627	5.476	0.125	60.96	3.14	99.75
wdc-x2	98 d	0.396	5.661	0.107	47.83	3.09	93.2
R/W ratio			IOPS 10⁵			Throughput (MB/S)
Avg	Max	Min	Total	Read	Write	Total	Read	Write
0.0348	0.125	0.0054	0.726	0.536	0.19	18.48	0.632	17.85
0.0383	0.33	0.007	0.458	0.325	0.133	12.11	0.442	11.67

Notice that all daemons within the Hadoop framework write logs of their runtime throughout the course of the day. Depending on configuration, this can be written to the HDFS or to the local FS. If logs are written to the local FS, they account for some random IO to the HDD. Other daemons of frameworks which sit on top of Hadoops core framework can generate more random IOs at the device level. Hbase has similar logs from its region servers which must be written. Created for real-time data processing, Hbase will spawn small MapReduce jobs which access potentially small amounts of data, causing random read IOs in a HDFS instance. Frameworks like Hive and other KV stores, which sit on top of Hadoop, have similar logging structures which can potentially cause random IO amplification down at the device level.

Java-based (apache) MapReduce must write temporary intermediary files to disk during MapReduce jobs, some with repeating process IDs (PIDs) and others with single or minimal use PIDs. These intermediary files can either be dumped to locations in the HDFS (which attempts to serialize the IO if possible) or the local FS, which will be a random IO event. This is configurable via tuning the parameter mapreduce.task.tmp.dir.

After consulting the Hadoop XML configuration files, one thing that I noticed was that the temporary/intermediary space for all HDFS and MapReduce workloads were configured to store their output data to the HDFS data drive. Because the Hadoop framework is written in Java, it must contend with the properties of the Java Virtual Machine (JVM), meaning that memory addresses have no meaning between JVMs. Hence, when a MapReduce task passes data to another MapReduce task, it first writes data to a temporary file for the other JVM task to read. Additionally, when a MapReduce job launches, it must send the configuration parameters and executable jar file to the TaskTrackers so that they can correctly spawn map and reduce tasks. This data too was being stored to that configured local temporary space.

Couple this IO with the observation that the cluster runs close to 5500 MapReduce jobs per day, of which many are small task count jobs (due to the 20% HBase analysis performed by users), and the amount of random IO generated on these HDDs becomes very large. However, this IO is something that can be mitigated and was not completely responsible for the high level of random IO seen in the block-level traces. The HDFS itself also contributes to the amount of random IO seen in those traces.

From the system logs as shown in Figure 8-16, you can see that the number of chunks created and deleted daily is quite high for the system. Each time the HDFS commits a chunk of data to the filesystem, it also creates a metadata file. This metadata file is proportional to the size of the HDFS chunk committed to the localFS, so a 128MB chunk will have a corresponding ≈1.2MB metadata file created where smaller chunks will have smaller metadata files. Hence, for the observed workload, there is a lot of small random IO due to tens of thousands of new chunks being generated daily on the HDFS. However, unlike the observed random IO for the MapReduce framework, wherein the location to store the temporary files is configurable, the location where these HDFS metadata files are stored is not a configurable property, and therefore cannot be delegated to another class of storage or storage location. When considering newer HDD technologies, wherein random IO (especially random writes) can greatly impact performance, understanding workload characteristics like these are paramount. Without a device-level analysis of the workload, these characteristics would have not been so clearly identified.

../images/468166_1_En_8_Chapter/468166_1_En_8_Fig16_HTML.jpg — Figure 8-16
Basic curves from HDFS logs

Some Further Discussions

In this chapter, I presented the block-level workload characteristics of the Hadoop cluster by considering some specific metrics. The analysis techniques presented can help others understand the performance and drive characteristics of Hadoop in their production environments. Collected by blktrace, I conducted a comprehensive analysis of these logs which identified new workload patterns with some unexpected behaviors. I showed that, while sequential and near-sequential requests represent the majority of the IO workload, a non-trivial amount of random IO requests exist in the Hadoop workloads. Additionally, the write update ratio on drives is not very high, which indicates that a small write amplification can occur if an out-of-place write policy is applied. Also note that the ROW ratio is small, which means WORM does not generally hold for the cluster’s workload. All these findings imply a relatively high spatial locality and lower-than-expected temporal locality, which show that Hadoop is generally a suitable application for SMR drives. However, further improvements in both Hadoop and drive sides are required.

Looking critically at the configuration of a Hadoop system, it is possible to fine-tune and minimize some, but not all, of the observed random IO. Factors that add to this random IO are several types of framework logging, intermediary files generated by MapReduce and HBase workloads, and metadata files of HDFS chunks. The verbosity of Hadoop daemon log files can be turned down to generate less data, and they along with temporary MapReduce output can be written to a storage location which will not impact HDFS chunk IO operations. Among these can be the HDFS itself (rather than local storage), which will attempt to make the IO more sequential, or on another physical/logical block device more suited to random block IO (while maintaining data locality). Some basic curves derived from HDFS logs are shown in Figure 8-16.

However, the final piece of the observed random IO is a consequence of HDFS write/update mechanism and cannot be easily mitigated because it currently must reside with the committed HDFS chunks on the HDFS data drives. The small IO caused by chunk metadata must then be serviced by a capacity block storage device which can either understand how to transform these small random IO into larger sequential access patterns, or a device that is simply designed to handle random IO. Without a device-level view, it is possible that this overhead would be dismissed as a problem elsewhere in the system rather than at the HDD device level, where some of these issues are very simple to correct, given the proper insight. For instance, a large DRAM buffer will be very useful for these scenarios with random read accesses, and non-volatile memory (e.g., NAND and conventional zone) for these random write accesses.

Hence, it is reasonable to study an integration of HDFS and the local files systems with consideration of device properties, such as a design in a global view so that there is no “misunderstanding” of the local metadata to the sequential write in HDFS. And the metadata and the “non-critical” intermediate/temporary data are assigned to proper disk location. Therefore, HDFS could take the responsibility of file/block accesses in DataNode, which may make the drive operation more efficient. The metadata location in the device shall also be carefully designed.

In addition, the drive-level cache and system-level cache may be unified with the consideration of the mechanism of Java, such that some temporary data may be absorbed by the unified cache/buffer instead of disk mechanical accesses. This unification could be difficult due to the current HDFS’s simple cache design and lack of direct interface between JVM and drives. However, it is possible for drive manufacturers to provide such an application-oriented interface for communication.

Furthermore, a certain intelligence might be useful for the drive to understand the nature of the access (e.g., random or sequential, access dependency), such that the drive can immediately switch to the optimal algorithm/behavior for better performance. An application-level hinting scheme with interaction between host and drive or a self-learning algorithm inside drive can be helpful.

In conclusion, the device level analysis of the in-house Hadoop cluster has provided new insights into how Hadoop interacts with the underlying file system and handles its lower-level IO. These new insights motivate me to continue studying how workload characteristics of big data frameworks and application tuning could help the performance of storage devices in the current data driven climate which we live in. This study is also applicable to Spark, an in-memory MapReduce system roughly. For example, a detailed workload analysis can provide some insights of the SCM application for Spark systems, which will benefit a cost-efficient design of hybrid SCM-DRAM structures.