This chapter provides the background of data storage systems and general trace analysis. I will show that the wide application of block storage devices motivates the intensive study of various block-level workload properties. I will also list the objectives and contributions of this book.
Basics of Storage
- Volatile
Commonly used types, such as DRAM (dynamic RAM), SRAM (static RAM), etc.
Types under development, such as T-RAM (thyristor RAM), Z-RAM (zero-capacitor RAM), etc.
- Non-volatile
ROM (read-only memory), such as EPROM (Erasable Programmable ROM), EEPROM (Electrically E-PROM), MROM (Mask ROM), etc.
NVRAM, such as flash memory, PCM (phase change memory), ReRAM/RRAM (resistive RAM), MRAM (magnetoresistive RAM), FeRAM (ferroelectric RAM), etc.
Mechanical devices such as HDDs, magnetic tape, and optical disc drives
When selecting a storage device or system, many factors must be considered carefully, such as price, performance, capacity, power efficiency, reliability, data integrity, durability, form factor, operating temperature, and connection type, depending on the application scenario. However, device performance is the major topic of this book.
Storage Devices
In this section, I discuss several types of non-volatile storage devices; volatile devices such as DRAM are also relevant because they are used inside these non-volatile devices as cache.
HDD

Figure: Basic components of an HDD

Figure: Basic HDD electronics blocks
In particular, the servo is one of the most precise mechanical systems in the world. The disk head that reads and writes data to the medium flies only a few nanometers above the disc media; this is similar to a Boeing 737 flying a few meters above the ground. The HSA (head stack assembly) is moved by applying a current to the coil of wire at its back end. This coil forms an electromagnet. The amount of current is calculated by the servo electronics. By varying the current, very precise acceleration and deceleration can be programmed, improving IO performance and head-positioning accuracy.
HDDs can be divided into two categories: consumer and enterprise. Consumer HDDs are mostly used in desktop and mobile devices such as notebooks. Consumer electronics HDDs are often embedded in digital video recorders, smart TVs, and automotive systems. Enterprise HDDs usually have higher reliability than consumer HDDs, with higher quality requirements for the media and heads.
Disk drives have different spinning speeds (rotations per minute, RPM). For example, desktop HDDs usually come in the 3.5-inch form factor at 7200 RPM, while mobile HDDs come in the 2.5-inch form factor at 5400 RPM. Each disc surface is divided into concentric zones. Inner-diameter (ID) zones have less physical space and contain fewer sectors than outer-diameter (OD) zones. Since the spinning speed is the same, the data transfer speed at the OD is generally faster than at the ID. For a typical 3.5-inch desktop HDD, the sequential read speed at the OD can be 1.5 to 2 times that at the ID.
For a typical HDD, the following formula calculates average access time (Ta) [2, 3]:
Ta = Ts + Tl + Tt + To    (1.1)
Seek time (Ts): The time required to move the heads a desired distance. It is typically specified at 1/3 of the radius of the platter. Settle time is generally included in this part.
Rotational latency (Tl): The amount of time the drive must wait before the data is under the read/write head.
Transfer time (Tt): The amount of time required to transfer data to or from the host.
Controller overhead (To): How long it takes the drive to decode a command from the host.
Note that the read head is usually different from the write head, and the internal firmware paths for reads and writes also differ, so there is a slight variance between read and write seek times. Write access usually costs more because of the longer settle time for writes, which is caused by the position error signal (PES) requirement: write access requires a tighter PES condition than read access. By design, higher-RPM drives have faster average access times than lower-RPM drives due to shorter rotational latency and seek times.
The response time (Tres) is a different concept from the access time. Since a conventional disk drive can only process one request at a time, some incoming requests have to wait in a queue. For example, some write requests may be buffered in the DRAM write cache first and must wait for the previous request to complete. Note that although there are many arms and heads per drive, the arms generally must move together since there is only one VCM to drive them. Thus,
Tres = Ta + Tw    (1.2)
where Tw is the waiting or queueing time between when the request enters the queue and when it is actually executed.
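As a concrete illustration of Equations 1.1 and 1.2, the short Python sketch below estimates Ta and Tres for a hypothetical 7200 RPM drive; the seek time, transfer size, bandwidth, and controller overhead values are illustrative assumptions, not the specification of any real product.

```python
# Worked example for Equations 1.1 and 1.2 with illustrative numbers
# (seek time, transfer size, bandwidth, and overhead are assumptions,
# not the specification of any real drive).

def avg_access_time_ms(seek_ms, rpm, transfer_kb, bandwidth_mb_s, overhead_ms):
    """Ta = Ts + Tl + Tt + To (Equation 1.1)."""
    rotational_latency_ms = 0.5 * 60_000.0 / rpm        # on average, half a revolution
    transfer_ms = transfer_kb / 1024.0 / bandwidth_mb_s * 1000.0
    return seek_ms + rotational_latency_ms + transfer_ms + overhead_ms

def response_time_ms(access_ms, wait_ms):
    """Tres = Ta + Tw (Equation 1.2)."""
    return access_ms + wait_ms

if __name__ == "__main__":
    ta = avg_access_time_ms(seek_ms=8.5, rpm=7200, transfer_kb=64,
                            bandwidth_mb_s=180.0, overhead_ms=0.2)
    print(f"Ta   = {ta:.2f} ms")                         # roughly 13 ms for these assumptions
    print(f"Tres = {response_time_ms(ta, wait_ms=5.0):.2f} ms")
```

For numbers like these, seek time and rotational latency (about 4.2 ms at 7200 RPM) dominate, which is why a single HDD typically sustains only on the order of 100 random IOPS.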
Owing to the influence of the command queue, the cache has a large impact on both read and write performance. Thus, a large portion of the DRAM inside an HDD is used for cache. Read cache and write cache commonly share the same space, so some write cache segments may be converted into read cache segments when necessary. However, some HDDs have dedicated read or write caches for different purposes. Chapter 4 presents more details.
Conventional magnetic recording (CMR) is a relative concept. Longitudinal magnetic recording (LMR) was the "conventional" technology relative to perpendicular magnetic recording (PMR) in the early 2000s. Nowadays, PMR is the dominant structure and is still evolving. For example, SMR (shingled magnetic recording) is a new type of PMR already available in the market, while HAMR (heat-assisted magnetic recording) and MAMR (microwave-assisted magnetic recording) are emerging.
SMR HDD
SMR is an emerging technique deployed to increase areal density in HDDs without drastic changes to the HDD mechanics [4, 5, 6, 7, 8]. Due to its shingled nature, SMR favors large sequential writes over random ones. In this background section, I give a general introduction to SMR characteristics.

Figure: Schematic of SMR
Due to this log-structure-like sequential write feature (which is beneficial for write performance), the conventional LBA-to-PBA mapping (direct/static mapping) may not work well, since any change to a block results in a read-modify-write access to all the following blocks in the same zone, which causes a performance penalty. Therefore, indirect/dynamic mapping is usually applied. When an update happens, instead of an in-place rewrite, an out-of-place "new" write is carried out. This has side effects: the data in the previous location becomes garbage, and the new write claims additional space. In order to reuse those garbage blocks, a garbage collection (GC) procedure must be implemented.
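To make indirect mapping and GC more concrete, here is a minimal Python sketch of an SMR-style zone store with out-of-place updates and a naive greedy garbage collector; the zone geometry and the GC policy are arbitrary assumptions for illustration and are far simpler than what real drives or host-managed software actually do.

```python
# Minimal sketch of indirect (dynamic) LBA-to-PBA mapping with out-of-place
# updates and greedy garbage collection. Zone geometry and GC policy are
# illustrative assumptions, not the scheme of any real SMR drive.

class SmrZoneStore:
    def __init__(self, zones=4, blocks_per_zone=8):
        self.nzones = zones
        self.bpz = blocks_per_zone
        self.write_ptr = [0] * zones                  # append-only pointer per zone
        self.valid = [dict() for _ in range(zones)]   # offset -> lba (live data only)
        self.lba_map = {}                             # lba -> (zone, offset)
        self.open_zone = 0

    def write(self, lba):
        """Out-of-place update: invalidate any old copy, append to the open zone."""
        if lba in self.lba_map:
            z, off = self.lba_map[lba]
            self.valid[z].pop(off, None)              # old copy becomes garbage
        z, off = self._allocate()
        self.valid[z][off] = lba
        self.lba_map[lba] = (z, off)

    def _allocate(self):
        if self.write_ptr[self.open_zone] == self.bpz:
            self._open_new_zone()
        z = self.open_zone
        off = self.write_ptr[z]
        self.write_ptr[z] += 1
        return z, off

    def _open_new_zone(self):
        # Prefer an empty zone; otherwise reclaim the zone with the most garbage.
        for z in range(self.nzones):
            if self.write_ptr[z] == 0:
                self.open_zone = z
                return
        victim = max(range(self.nzones),
                     key=lambda z: self.write_ptr[z] - len(self.valid[z]))
        if self.write_ptr[victim] == len(self.valid[victim]):
            raise RuntimeError("no garbage to reclaim: store is full")
        survivors = list(self.valid[victim].values())
        self.write_ptr[victim] = 0                    # reset the shingled zone
        self.valid[victim] = {}
        self.open_zone = victim
        for lba in survivors:                         # sequentially rewrite live data
            self.write(lba)
```

Rewriting the same LBA repeatedly leaves stale copies behind until the collector reclaims a zone, which is exactly why idle time for background GC matters so much for SMR performance.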
Another concern with out-of-place updates is the potential harm to sequential read performance. If LBA-continuous requests are written into several different physical zones, a later LBA-continuous read request over the same LBA range cannot gain the actual benefit of a "logically sequential" read. The corresponding data management scheme can be implemented at three levels: the drive, middleware, or host side. Although a few academic works have introduced in-place updates via special data layout designs, the out-of-place policy remains the main approach.
In general, an SMR drive expects the workload to read and write sequentially, with infrequent updates. In addition, since garbage data will generally accumulate at some point (unless data is never deleted or modified), idle time should be long enough to allow GC to run periodically without impacting external/user IO performance. Hence, the write-once-read-many (WORM) archival workload is a natural fit for the characteristics of SMR drives. A few other recent suggestions on SMR optimization are available in [9], e.g., hybrid strategies, parallel access, and large form factors.
Other HDDs
The PMR technique has approached its theoretical areal density limit for conventional designs (about 1 Tb/in²) in recent years. The limiting factor is the onset of the super-paramagnetic limit as researchers strive toward smaller-grained recording media. This imposes a tradeoff among the signal-to-noise ratio (SNR), the thermal stability of small-grain media, and the writability of a narrow-track head, which restricts the ability to continue scaling CMR technology to higher areal densities [10].
New Techniques to Increase Areal Density
Approaches | Reduce grain size and make grains harder to switch | Reduce bit width and/or length | Increase areal density/size, add heads and disks |
---|---|---|---|
Solutions | HAMR, MAMR | SMR, HAMR, TDMR | Helium drives, advanced mechanical designs, form factor optimization |

Figure: Future options of HDD [10]
HAMR and MAMR are two implementations of energy-assisted magnetic recording (EAMR). HAMR is proposed to overcome head writability issues. The core components of the proposed HAMR and MAMR technologies are a laser and a spin-torque-driven microwave oscillator, respectively. In HAMR, the media have to be stable at much smaller grain sizes yet writable at suitably elevated temperatures. The integration of HAMR and BPMR (bit-patterned magnetic recording) enables an extension of both technologies, with projected areal densities up to about 100 Tb/in² based on the thermal stability of known magnetic materials [10]. Both WDC and Seagate have announced EAMR roadmaps. Seagate claimed that its HAMR-based HDDs would be due in late 2018,2 while WDC declared that its MAMR would store 40TB on a hard drive by 2025.3
TDMR (two-dimensional magnetic recording) still uses a relatively conventional perpendicular medium and head, while combining shingled write recording (SWR) and/or 2D read-back and signal processing to promise particularly large gains. Recording with energy assist on BPM or 2D signal processing will enable areal densities beyond around 5 Tb/in². However, there is no clear problem-free solution so far.
Note that for HAMR/MAMR HDDs, the sequential access properties may be similar to SMR/TDMR. As heater start-up and cool-down take time, sequential access that reduces these state changes is preferred.

Figure: Projected density of future techniques [10]
SSD
A solid-state drive/disk (SSD) is a storage device that uses integrated circuit assemblies as memory to store data persistently [12]. Electronic interfaces compatible with traditional HDDs, such as SAS and SATA, are primarily used in SSDs. Recently, PCIe, SATA Express, and M.2 have become more popular due to increased bandwidth requirements.
The internal memory chips of an SSD can be NOR flash, NAND flash, or some other emerging NVM (non-volatile memory). As of 2017, most SSDs had started to use 3D TLC (triple-level cell) NAND-based flash memory.
SSDs are changing the storage industry. While the maximum areal storage density for HDDs is only 1.5 Tbit/in², the maximum for the flash memory used in SSDs reached 2.8 Tbit/in² in laboratory demonstrations as of 2016. Moreover, the areal density of flash memory is increasing by over 40% per year, compared with 10-20% for HDDs, and the price of SSDs ($ per GB) is dropping faster than that of HDDs.
Comparison of HDD and SSD
Attribute | SSD | HDD |
---|---|---|
Start-up time | Almost no delay because there are no mechanical components to prepare (some μs to ms). Usually a few ms to switch from an automatic power-saving mode. | Up to several seconds for disk spin-up. Up to a few hundred milliseconds to wake up from idle mode. |
Random access time | Typically less than 0.1 ms; usually not a significant performance bottleneck. | Typically from 2.5 ms (high-end server/enterprise drives) to 12 ms (laptop/mobile drives), mainly owing to seek time and rotational latency. |
Read latency time | Usually low due to direct data access from any location. For applications constrained by the HDD's positioning time, SSDs allow much faster boot and application launch times (see Amdahl's Law). A clean Windows OS may take less than 6 seconds to boot. | Generally much higher than SSDs. The time differs for each seek, depending on the data location on the media and the current head position.5 A clean Windows OS may take more than 1 minute to boot. |
Data transfer rate | Relatively consistent IO speed for relatively sequential IO. Performance is reduced when the portion of small random blocks is large. Typical values range from about 300 MB/s to 2500 MB/s for consumer products (commonly around 500 MB/s as of 2017), and up to multiple GB/s for enterprise class. | Heavily depends on RPM, which usually ranges from 3,600 to 15,000 (although 20,000 RPM also exists). Typical transfer speed is about 200 MB/s for a 3.5-inch drive at 7200 RPM. Some high-end drives can be faster, up to 300 MB/s. TPI and SPT are also influential factors. |
Read performance | Generally independent of the data location in the SSD. In a few cases, sequential access may be affected by fragmentation. | Random seeks are expensive. Fragmented files place data in different areas of the platter; therefore, response times are increased by the multiple seeks needed to retrieve the fragments. |
Write performance | Write amplification may occur.6 Wear-leveling techniques are implemented to mitigate this effect. However, the drive may unavoidably degrade at an observable rate due to the nature of flash. | CMR has no issue with write amplification. However, SMR may, due to out-of-place updates; GC is also required. |
Impacts of file system fragmentation | Relatively limited benefit from reading data sequentially, making fragmentation less significant for SSDs. Defragmentation would cause wear with additional writes.7 | Many file systems get fragmented over time if frequently updated. Maintaining optimum performance requires periodic defragmentation, although this may not be a problem for modern file systems due to their design and background garbage collection. |
Noise (acoustic) and vibration | SSDs are basically silent without moving parts. Sometimes, the high voltage generator (for erasing blocks) may produce pitch noise. Generally insensitive to vibration. | The moving parts (e.g., heads, actuator, and spindle motor) make characteristic sounds of whirring and clicking. Noise levels differ widely among models, and may be large.8 Mobile disks are relatively quiet due to better mechanical design. Generally sensitive to vibration.9 |
Data tiering | Hot data may move from slow devices to fast devices. It usually works with HDDs, although in some implementations, fast and slow SSDs are mixed. | HDDs are usually used as a slow tier in a hybrid system. Some striped disk arrays may provide comparable sequential access performance to SSDs. |
Weight and size | Essentially small and lightweight due to the internal structure. They usually have the same form factors (e.g., 2.5-inch) as HDDs, but thinner, with plastic enclosures. The M.2 (Next Generation Form Factor) format makes them even smaller. | HDDs are usually heavier than SSDs, since their enclosures are made of metal in general. 2.5-inch drives typically weigh around 200 grams while 3.5-inch drives weigh over 600 grams (depending on the enclosure materials, motors, disc magnets/number, etc.). Some slim designs for mobile could be less than 6mm thin. |
Reliability and lifetime | No mechanical failures. However, the limited number of write cycles per block may lead to data loss.10 A controller failure can make the SSD unusable. Reliability differs considerably among manufacturers, processes, and models.11 | Potential mechanical failures from wear and tear. The storage medium itself (the magnetic platter) does not essentially degrade from read/write accesses.12 |
Cost per capacity13 | Consumer-class SSD NAND pricing has dropped rapidly: US$0.60 per GB in April 2013; US$0.45, $0.37, and $0.24 per GB in April 2014, February 2015, and September 2016, respectively. The decline has slowed since late 2016. Prices may change again after 3D NAND becomes common.14 | Consumer HDDs cost about US$0.032 and $0.042 per GB for 3.5-inch and 2.5-inch drives in May 2017. The price of enterprise HDDs is generally more than 1.5 times that of consumer drives. The relatively stable prices of 2017 may change after MAMR/HAMR drives are released. |
Storage capacity | Sizes up to 60TB by Seagate were available as of 2016. 120 to 512GB models were more common and less expensive. | HDDs of up to 10TB and 12TB were available in 2015 and 2016, respectively. |
Read/write performance symmetry | Write speeds of less costly SSDs are typically significantly lower than their read speeds (often ≤1/3). High-end SSDs have similar read and write speeds. | Most HDDs have slightly longer/worse seek times for writing than for reading due to the longer settle time. |
Free block availability and TRIM command | Write performance is significantly influenced by the availability of free, programmable blocks. The TRIM command can reclaim old data blocks that are no longer in use; however, even with TRIM, fewer free blocks cause performance degradation. | CMR HDDs do not benefit from TRIM because they are not affected by free blocks. However, SMR performance is restricted by the availability of free zones, and TRIM-like commands are sometimes needed for dirty zones. |
Power consumption | High performance flash-based SSDs generally require 1/2 to 1/3 of the power of HDDs. Emerging technologies like PCM/RRAM are more energy-efficient.15 | 2.5-inch drives consume 2 to 5 watts typically, while some highest-performance 3.5-inch drives may use around 12 watts on average, and up to about 20 watts. Some special designs for green data centers send the disk to idle/sleep when necessary. 1.8- inch format lower-power HDDs may use as little as 0.35 watts in idle mode.16 |
- Strength
A mature technology widely employed by industries
Large scale/density, applicable for 3D techniques
A single drain contact per device group is required compared with NOR.
Relatively cheaper than other emerging NVM types for dollar/GB
- Weakness
Asymmetric performance (slower write than read)
Program/erase cycle (block-based, no write-in-place)
Data retention (retention gets worse as flash scales down)
Endurance (limited write cycles compared with HDD and other emerging NVMs)
100-1000× slower than DRAM; 10-1000× slower than PCM and FeRAM
Usually, the higher the capacity, the lower the performance.
- Opportunity
Scaling focused solely on density; density is higher than magnetic HDD in general.
Decreased cost, which will be comparable with HDD in the near future
3D schemes exist despite their complexity
Durability is improved to a certain degree together with fine-tuned wear-leveling algorithms.
Replacement for HDD in data centers as a mainstream choice (in particular, an all-flash array), although hybrid infrastructures will remain for some years.
- Threat
The extra connections used in the NOR architecture provide some additional flexibility when compared to NAND configuration.
The active development of MRAM/ReRAM may shake NAND flash's dominant position.
The real question is the market share of the two technologies, and the answer depends on how you measure it. Measured by revenue, SSD spending will overtake HDD spending in the very near future; measured by bits shipped, HDDs will still dominate for some years.
There are some other storage devices using flash memory. Flash thumb drives are similar to SSDs but with much lower speed, and they are commonly used for mobile applications. Kingston Digital released 1TB drives with a USB 3.0 interface (data transfer speeds up to 240 MB/s read and 160 MB/s write) in early 2017 and 2TB drives (up to 300 MB/s read and 200 MB/s write) in late 2017, which is comparable to HDD speeds.
Small-form-factor memory cards are also widely used in electronic devices, such as smartphones, tablets, and cameras. Some common formats include CompactFlash, Memory Stick, SD/MicroSD/MiniSD, and xD. SanDisk introduced Extreme Pro SD cards of up to 1TB in September 2016 and MicroSD cards of up to 400GB in August 2017.
Hybrid Disk
A hybrid drive is a logical or physical storage device that integrates a fast storage medium, such as a NAND/NOR flash SSD, with a slower medium, such as an HDD [15]. The fast device in a hybrid drive can act either as a cache for the data stored on the HDD or as a tier peered with the HDD. In general, the purpose is to improve overall performance by keeping copies of the most frequently used data (hot data) on the faster component. Back in the mid-2000s, some hard drive manufacturers like Samsung and Seagate explored the performance boost of putting an SSD inside an HDD. In 2007, Seagate and Samsung introduced the first hybrid drives, the Seagate Momentus PSD and the Samsung SpinPoint MH80.
There are generally two types of hybrid disks. One is the dual-drive structure (the tiering structure), where the SSD is the fast tier and the HDD is the slow tier; the OS usually recognizes it as two sub-storage devices. Western Digital's Black2 products introduced in 2013 and TarDisk's TarDisk Pear in late 2015 are two examples of dual-drive devices. The other is an integrated structure (solid-state hybrid drive, SSHD), where the SSD acts as a cache [16]; users or OSs see only one storage device without special operations.
A hybrid disk drive can operate in either self-optimized (self-learning) mode or host-optimized mode. In self-optimized mode, the SSHD works independently from the host OS, and the drive itself determines all actions related to data identification and migration between the HDD and SSD. This mode lets the drive appear and operate to a host system exactly like a traditional drive; a typical example is Seagate's Mobile and Laptop SSHD. Host-optimized mode is also called host-hinted mode: the host makes the decisions about data allocation between the HDD and SSD via the SATA interface (since SATA version 3.2). This mode usually requires software/driver support from the OS. Microsoft started to support host-hinted operations in Windows 8.1 (a patch for version 8 is available), while patches for the Linux kernel have been developed since October 2014. Western Digital's first generation of SSHDs is in this category.
However, hybrid drives have some limitations:
Performance is usually heavily application/workload dependent, and the drive may not be smart enough, being constrained by its limited resources.
Block-level optimization is not necessarily better than file/object-level optimization because less information about the workload is available; thus, optimizing the workload purely at the drive level is not recommended.
Hybrid disks are not well suited to general-purpose data center infrastructure due to their relatively static configurations.
Comparison of Some NVMs
 | STT-MRAM | PCMS (3D XPoint) | ReRAM | NAND Flash |
---|---|---|---|---|
Read latency | < 10 ns | < 100 ns | < 10 ns | 10-100 μs |
Write latency | 5 ns | > 150 ns | 50 ns | > 100 μs |
Power consumption | Medium | Medium | Medium | High |
Price (2016) | $200-3000/Gb | ≤ $0.5/Gb | $100/Gb | ≤ $0.05/Gb |
Endurance (number of cycles) | 10^12 to unlimited | 10^8-10^9 | 10^5-10^10 | 10^5-10^6 |
Tape and Disc
Magnetic tape was first used to record computer data in 1951. It usually works only with specific tape drives. Despite its slow speed, it is still widely used for cold data archiving. IBM and FujiFilm demonstrated a prototype BaFe tape with 123 Gb/in² areal density and 220TB cartridge capacity in 2015. Sony and IBM further increased these numbers to 201 Gb/in² and 330TB in a tiny tape cartridge in 2017.17 Instead of magnetic material painted on the surface of conventional tape, Sony used a "sputtering" method to coat the tape with a multilayer magnetic metal film, which is thinner and has narrower grains, using vertical bits. Note that tape and HDD share many similarities in servo control, such as servo patterns and nanometer precision.
An optical disc is a flat, usually circular disc that encodes binary data (bits) in the form of pits. An early optical disc system can be traced back to 1935. Since then, there have been four generations (a CD of about 700MB in the first generation, a DVD of about 4.7GB in the second generation, a standard Blu-ray disc of about 25GB in the third generation, and a fourth generation disc with more than 1TB data).
Both magnetic tapes and optical discs are usually accessed sequentially only. Some recent developments use robotic arms to change the tape/disc automatically. It is expected that tape and optical discs may still be active in the market for some years. In particular, due to a much lower price per GB than other media, tape seems to have a large potential market for extremely cold storage.
Emerging NVMs
Phase-change memory (PCM), such as 3D X-point
Magnetoresistive RAM (MRAM), such as STTRAM and Racetrack memory
Resistive RAM (RRAM/ReRAM), such as Memristor, Conductive-bridging RAM (CBRAM), Oxy-ReRAM
Ferroelectric RAM (FeRAM), such as FeFET
Others, such as conductive metal oxide (CMOx), solid electrolyte, NRAM (nano RAM), ZRAM (zero-capacitor), quantum dot RAM, carbon nanotubes, polymer printed memory, etc.
- Strength
Relatively mature (large-scale demos and products) compared with other emerging NVMs
Industry consensus on materials, like GeSbTe or GST
Large resistance contrast, which leads to analog states for MLC
Much longer endurance than NAND Flash
High scalability (still works at ultra-small F) and back-end-of-the-line compatibility
Potential very high speed (depending on material and doping)
- Weakness
The RESET step to the high-resistance state requires melting → power-hungry, with possible thermal crosstalk
Keeping switching power down → requires sub-lithographic features and a high-current access device
Filling a small feature → requires atomic layer deposition or chemical vapor deposition → difficult for now to replace GST with a better material
MLC is strongly impacted by relaxation of the amorphous phase → resistance drift
10-year retention at elevated temperatures (resistance drifts with time) can be an issue → recrystallization
Device characteristics change over time due to elemental segregation → device failure
Variability in small features broadens resistance distributions
- Opportunity
An order of magnitude lead over FeRAM, MRAM, etc.
NOR-replacement products are now shipping → promising if yield learning succeeds and MLC works (3-4 bits per cell were successfully implemented in PCM technologies despite the R-drift phenomenon in 2016)
Good for embedded NVM for SoC, Neuromorphic
Drift mitigation and/or 3D access devices can offer high density (i.e., low cost), which opens the opportunity for NAND replacement; finally S-type, and then M-type, SCM may follow.
Projected to reach 1.5B USD with an impressive CAGR of almost 84% by 2021
- Threat
Attained speed in practice is much slower than the theoretical speed; slow NOR-like interfaces
Current PCM SSDs are only several times faster than SLC SSDs, which is far from the projection.
DRAM/SRAM replacement may be challenging due to fundamental endurance limitation.
PCM as a DRAM segment accounted for the major share and dominated the market during 2016, which means S-SCM still has a long way to go.
A key challenge is to reduce reset (write) current; contact dimension scaling will help, but will slow progress.
Engineering process
NAND techniques are also under active development, in particular 3D NAND. Compared with these emerging NVMs, NAND is relatively mature, dense, and cheap. However, it can be much slower than PCM and ReRAM. Meanwhile, its endurance is generally significantly lower than that of PCM, MRAM, and FeRAM.

Figure: Competitive outlook among emerging NVMs
According to Yole Development’s recent estimation,19 the emerging NVM market will reach USD 4.6 billion by 2021, exhibiting an impressive growth of +110% per year, although the market size in 2015 was USD 53 million only. SCM will be the clear go-to market for emerging NVM in 2021. Marketsandmarkets20 also predicts that the global non-volatile memory market is expected to reach USD 82.03 billion by 2022, at a CAGR of 9.50% between 2017 and 2022.
Storage Systems
This section discusses system-level storage infrastructure and implementations. RAID (redundant array of independent/inexpensive disks) and EC (erasure code) systems are mainly used for failure tolerance. Hybrid storage systems aim to achieve relatively high performance at low cost. Microservers and Ethernet drives have been employed in some object storage systems. Software-defined systems separate the data flow and the control flow. Some large-scale storage system implementations, like Hadoop/Spark, OpenStack, and Ceph, are also introduced.
Infrastructure: RAID and EC
RAID as a data storage virtualization technology combines multiple physical drive components into a single logical unit or pool for the purposes of data redundancy, performance improvement, or both [20]. The Storage Networking Industry Association (SNIA) standardized RAID levels and their associated data formats from RAID 0 to RAID 6: "RAID 0 consists of striping, without mirroring or parity. RAID 1 consists of data mirroring, without parity or striping. RAID 2 consists of bit-level striping with dedicated Hamming-code parity. RAID 3 consists of byte-level striping with dedicated parity. RAID 4 consists of block-level striping with dedicated parity. RAID 5 consists of block-level striping with distributed parity. RAID 6 consists of block-level striping with double distributed parity." RAID 2-4 are generally not used in practice. RAID levels can also be nested, as in hybrid RAID; for example, RAID 10 and RAID 50 combine RAID 1 and RAID 5, respectively, with RAID 0 striping.
RAID can be implemented by either hardware or software. Hardware RAID controllers are expensive and proprietary, and usually used in enterprise environments. Software-based implementations have gained more popularity recently. Some RAID software is provided by modern OSs and file systems, such as Linux, ZFS, GPFS, and Btrfs. Hardware-assisted RAID software implements RAID mechanisms in a standard drive controller chip with embedded proprietary firmware and drivers.
Nowadays, RAID systems are widely used in SMEs, and even in some data centers RAID is still used as a fundamental structure for data protection. However, RAID is limited in its data reliability: even RAID 6 can tolerate only up to two disk failures, which is not secure enough for some critical applications. Thus, the erasure coding (EC) scheme has emerged as an alternative to RAID. In EC, data is broken into fragments that are expanded and encoded with a configurable number of redundant pieces and stored across different locations, such as disks, storage nodes, or geographical locations. Theoretically, EC can tolerate any number of disk failures, although in practice up to four redundant fragments are typically used per group. EC may also encounter performance issues, particularly when the system operates in degraded or recovery mode.
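As a toy illustration of the erasure-coding idea only (production systems use Reed-Solomon or similar codes that tolerate several simultaneous failures), the Python sketch below splits data into k fragments plus one XOR parity fragment, so any single lost fragment can be rebuilt; the fragment count and zero-padding are assumptions of this example.

```python
# Toy erasure-coding sketch: k data fragments plus one XOR parity fragment,
# so any single lost fragment can be rebuilt. Real EC systems use stronger
# codes (e.g., Reed-Solomon) with several parity fragments.

def encode(data: bytes, k: int = 4):
    frag_len = -(-len(data) // k)                      # ceiling division
    padded = data.ljust(k * frag_len, b"\0")           # zero-pad to a multiple of k
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = bytearray(frag_len)
    for frag in frags:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return frags + [bytes(parity)]                     # k data + 1 parity fragment

def reconstruct(frags):
    """Rebuild one missing fragment (marked None) by XOR-ing the others."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can repair only one lost fragment")
    if missing:
        frag_len = len(next(f for f in frags if f is not None))
        rebuilt = bytearray(frag_len)
        for f in frags:
            if f is not None:
                for i, b in enumerate(f):
                    rebuilt[i] ^= b
        frags[missing[0]] = bytes(rebuilt)
    return frags

if __name__ == "__main__":
    pieces = encode(b"block-level workload trace", k=4)
    pieces[2] = None                                   # simulate one failed disk/node
    repaired = reconstruct(pieces)
    print(b"".join(repaired[:4]).rstrip(b"\0"))        # original data (padding stripped)
```

In practice, the original data length would be stored as metadata rather than relying on stripping the padding, and each fragment would be placed on a different disk, node, or site.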
Hybrid Systems
Although all-flash arrays are gaining popularity, hybrid structures remain mainstream in data centers due to the trade-offs among cost, reliability, and performance. In the early days, a hybrid storage system contained HDDs as the fast tier and tape as the backup tier [21] [22]. Later, fast HDDs (such as 15,000 and 10,000 RPM) acted as the performance tier, and slow HDDs (such as 7200 and 5400 RPM) acted as the capacity tier [23]. With the development of non-volatile memory (NVM) technologies, such as NAND flash [24], PCM [25], STT-MRAM [18], and RRAM [19], the performance-cost ratio of NVMs keeps improving. Table 1-3 lists the performance and price comparison of some well-known NVMs. These NVMs, with their fast access speeds, can be used as the performance tier [17] [26] or cache [27] [28] [29] [30] in a modern hybrid system. Nowadays, SSDs are the first choice for the performance tier, and high-capacity shingled magnetic recording (SMR) drives are often used as the backup tier [31].

Figure: General algorithms for hybrid storage systems
Data allocation: Data allocation is conducted by the host or device controller to place incoming data in the most suitable storage location, such as hot data on the SSD and cold data on the HDD. Besides the properties of the data, the status of the devices, such as queue length, capacity usage, and bandwidth, is also considered during the allocation process.
Address mapping: Address mapping is required in a hybrid storage system because the capacities of the faster and slower devices differ. Due to the different address ranges, the target location of the incoming data must be translated to the actual device address when the data is allocated to a different device. An address translation table is required to keep all these translation entries. If the address range is large, the memory consumption of the translation table is huge and the translation speed drops, which may affect system performance.
Data migration (promotion/demotion): Data promotion migrates data from the slower devices to the faster devices, and data demotion migrates data from the faster devices to the slower devices; together these are called data migration. Data migration is usually conducted when data on the slower devices is identified as hot or data on the faster devices is identified as cold. In some research, data migration is also used to balance IOPS between the faster and slower devices.
Hot data identification: Hot data identification is important for data migration in order to select suitable data to promote or demote. It uses the properties of historical accesses to classify incoming data as hot or cold. The classification is typically done by checking the access frequency and recency of the data: the most frequently and most recently accessed data is identified as hot.
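As a minimal sketch of this frequency-plus-recency idea (the decay factor, threshold, and the notion of an "extent" are illustrative assumptions, not the policy of any particular product), the following Python class keeps an exponentially decayed access count per extent and flags extents above a threshold as hot.

```python
# Minimal hot-data identifier combining access frequency and recency via an
# exponentially decayed access count. Decay rate, threshold, and the notion
# of an "extent" are illustrative assumptions.

import time
from collections import defaultdict

class HotDataIdentifier:
    def __init__(self, decay_per_second=0.99, hot_threshold=3.0):
        self.decay = decay_per_second        # how fast old accesses lose weight
        self.hot_threshold = hot_threshold   # decayed count above which data is "hot"
        self.score = defaultdict(float)      # extent id -> decayed access count
        self.last_seen = {}                  # extent id -> last access time (seconds)

    def record_access(self, extent, now=None):
        now = time.time() if now is None else now
        if extent in self.last_seen:         # age the old score before adding this hit
            self.score[extent] *= self.decay ** (now - self.last_seen[extent])
        self.score[extent] += 1.0
        self.last_seen[extent] = now

    def is_hot(self, extent, now=None):
        now = time.time() if now is None else now
        score = self.score.get(extent, 0.0)
        if extent in self.last_seen:         # apply decay since the last access
            score *= self.decay ** (now - self.last_seen[extent])
        return score >= self.hot_threshold
```

A tiering engine could promote an extent when is_hot() becomes true and demote it once its decayed score falls back below the threshold; a real system would also weigh migration cost and device load, but the frequency-plus-recency core stays the same.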

Figure: The overall categories of hybrid storage architectures
Microservers and Ethernet Drives
A microserver is a server-class computer based on a system-on-a-chip (SoC) architecture. The goal is to integrate most of the server motherboard functions onto a single microchip, except DRAM, boot flash, and power circuits. The Ethernet drive is one of its various forms.
In October 2013, Seagate Technology introduced its Kinetic Open Storage platform with claims that the technology would enable applications to talk directly to the storage device and eliminate the traditional storage server tier. The company shipped its first near-line Kinetic HDDs in late 2014. The Kinetic drive is described as a key-value server with dual Ethernet ports that support the basic put, get, and delete semantics of object storage, rather than read-write constructs of block storage. Clients access the drive through the Kinetic API that provides key-value access, third-party object access, and cluster, drive, and security management.
Introduced in May 2015, Toshiba's KVDrive uses the key-value API that Seagate open sourced rather than reinventing the wheel. Ceph or Gluster could run directly on Toshiba's KVDrive.
WDC/HGST's converged microserver based on its Open Ethernet architecture supports any Linux implementation; theoretically, any network operating system can run directly on such a microserver. Ceph and the OpenStack Object Storage system have been demonstrated together with Red Hat server software. For example, in early 2016, WDC demonstrated a large-scale Ceph distributed storage system with 504 drives and 4PB of storage.21
Software-Defined Storage
TechTarget:22 SDS is an approach to data storage in which the programming that controls storage-related tasks is decoupled from the physical storage hardware (which places the emphasis on storage-related services rather than storage hardware).
Webopedia:23 SDS is storage infrastructure that is managed and automated by intelligent software as opposed to the storage hardware itself. In this way, the pooled storage infrastructure resources in a SDS environment (which can provide functionality such as deduplication, replication, thin provisioning, snapshots, and other backup and restore capabilities across a wide range of server hardware components) can be automatically and efficiently used to match the application needs of an enterprise.
Wikipedia:24 SDS is computer data storage software to manage policy-based provisioning and management of data storage independent of hardware. Software-defined storage definitions typically include a form of storage virtualization to separate the storage hardware from the software that manages the storage infrastructure. The software enabling a software-defined storage environment may also provide policy management for feature options such as deduplication, replication, thin provisioning, snapshots, and backup.
VMware:25 SDS is the dynamic composition of storage services (such as snaps, clones, remote replication, deduplication, caching, tiering, encryption, archiving, compliance, searching, intelligent logics) aligned on application boundaries and driven by policy.
Common Features of SDS
Level | Steps | Consequence |
---|---|---|
Data plane, control plane | Abstraction (decoupling/standardization, pooling/virtualization), automation (policy-driven) | Faster, more efficient, simpler |

Figure: The overall features of SDS
SDS also leads to some other concepts, such as the software-defined data center (SDDC). Based on the report by IDC and IBM,26 an SDDC is a loosely coupled set of software components that seek to virtualize and federate datacenter-wide hardware resources such as compute, storage, and network resources. The objective of an SDDC is to make the data center available in the form of an integrated service. Note that an implementation of SDS or SDDC may not be possible without the support of another software-defined concept, software-defined networking (SDN), which fundamentally changes the network infrastructure.
Implementation
In this section, I focus on some of the most recent software implementations for large-scale systems with distributed storage components.
Hadoop
Apache Hadoop,27 an open-source implementation of MapReduce (which originated at Google), provides a software framework for distributed storage and processing of big data sets. It runs on computer clusters built from commodity hardware. All the modules in Hadoop are designed under the fundamental assumption that hardware failures occur commonly and should be handled automatically by the framework.
Hadoop Common has the fundamental libraries and utilities required by other Hadoop modules.
Hadoop Distributed File System (HDFS) is a distributed file-system written in Java that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN is a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
Hadoop MapReduce processes large scale data, as an implementation of the MapReduce programming model.
HDFS stores large files (typically gigabytes to terabytes) across multiple machines. It achieves reliability through replication, storing data across multiple hosts, and hence theoretically does not require RAID storage on the hosts (although some RAID configurations, like RAID 0, are still useful). With the default replication value of 3, data is stored on three nodes. Data nodes can communicate with each other to rebalance data, move copies around, and keep the replication level high. HDFS is not fully POSIX-compliant because the requirements for a POSIX file system differ from the target goals of a Hadoop application. The trade-off of not being fully compliant is increased data throughput and support for non-POSIX operations such as Append. Although HDFS is the default distributed file system, it can be replaced by other file systems, such as FTP file systems, Ceph, Amazon S3, Windows Azure storage blobs (WASB), and others.

Figure: Hadoop ecosystem
OpenStack

Figure: OpenStack architecture [32]
OpenStack contains a block storage component called Cinder and an object storage component called Swift. Cinder manages the creation, attaching, and detaching of block devices to servers. Block storage volumes are fully integrated into Nova and the Horizon dashboard, allowing cloud users to manage their own storage needs. Block storage is appropriate for performance-sensitive scenarios on both local server storage and storage platforms (e.g., Ceph, GlusterFS, GPFS, etc.). Swift is a scalable, redundant object storage system.
Ceph
Ceph,31 a free and open source distributed storage platform, provides unified interfaces for object-, block-, and file-level storage [33, 34]. Ceph was initially created by Sage Weil for his doctoral dissertation. In 2012, Weil founded Inktank Storage to provide professional services and support for Ceph.

Figure: Ceph architecture
System Performance Evaluation
Common Metrics for Storage Devices
Metrics | Unit |
---|---|
Capacity | GB |
Areal density (TPI, SPT) | Gb/in² |
Volumetric density | TB/liter |
Write/read endurance | Times/years |
Data retention time | Years |
Speed (latency of random IO access) | Milliseconds |
Speed (bandwidth of sequential IO access) | MB/second |
Power consumption | Watts |
Reliability (MTBF) | Hours |
Power on/off transition time | Seconds |
Shock and vibration | G-force |
Temperature resistance | °C |
Radiation resistance | Rad |
The system’s design
The system’s implementation
The system’s workload
These three factors influence and interact with each other. It is common for a system to perform well for one workload but not for another. For a given storage system, the hardware design is usually fixed; however, it may provide some tuning parameters. If the parameters are also fixed in one scenario, the performance is usually "predictable" for a particular application. By running enough experiments, it is possible to find patterns relating the parameters to the application's general workload properties, and then further tune the parameters. Sometimes, due to design limitations, the range of the tuning parameters may be too narrow; then you must redesign the system.
Throughput, also called bandwidth, is related to the data transfer rate: the amount of data transferred to or from the storage device within a unit of time. It is often measured in KB/sec, MB/sec, or GB/sec. For disk drives, it usually refers to sequential access performance.
IOPS is the IO operation rate of the device, i.e., the number of IO transactions that can occur within a unit of time. For disk drives, it usually refers to random access performance.
Response time, also called latency, is the time between when a host sends a command to the storage device and when the result returns to the host; that is, the round-trip time cost of an IO request. It is measured in milliseconds (ms) or microseconds (μs) and is often cited as an average (AVE) or maximum (MAX) response time. In an HDD specification, the average seek time and switch time are usually provided.
Block size, which is the data transfer length
Read/write ratio, which is the mix of read and write operations
Random/sequential ratio, which is the random or sequential nature of the data address requests
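To tie these metrics together, the Python sketch below computes IOPS, throughput, average response time, read ratio, and a simple sequentiality measure from a generic block-trace table; the CSV column names and the 512-byte block unit are assumptions of this example, not the format of any particular tracing tool.

```python
# Compute basic workload metrics from a generic block-level trace.
# Assumed CSV columns (an illustrative layout, not a standard format):
#   timestamp_s, op (R/W), lba, blocks, latency_ms
import csv

def summarize_trace(path, block_bytes=512):
    rows = []
    with open(path, newline="") as f:
        for r in csv.DictReader(f):
            rows.append((float(r["timestamp_s"]), r["op"].strip().upper(),
                         int(r["lba"]), int(r["blocks"]), float(r["latency_ms"])))
    if not rows:
        return {}
    rows.sort()                                        # order by arrival time
    span = max(rows[-1][0] - rows[0][0], 1e-9)         # observation window in seconds
    total_bytes = sum(blocks * block_bytes for _, _, _, blocks, _ in rows)
    reads = sum(1 for _, op, _, _, _ in rows if op == "R")
    sequential = sum(1 for prev, cur in zip(rows, rows[1:])
                     if cur[2] == prev[2] + prev[3])   # starts right after previous request
    return {
        "iops": len(rows) / span,
        "throughput_MBps": total_bytes / span / 1e6,
        "avg_response_ms": sum(lat for *_, lat in rows) / len(rows),
        "read_ratio": reads / len(rows),
        "sequential_ratio": sequential / max(len(rows) - 1, 1),
    }

# Example: print(summarize_trace("trace.csv"))
```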
In addition, when considering consumer/client versus enterprise devices/systems, the focus may be different. For example, in some client use cases, IOPS and bandwidth may be more critical than response time for HDD/SSD devices, as long as response times are not excessively slow, since a typical client user will not notice a single IO taking a long time (unless the OS or a software application is waiting for a single specific response). While client SSD use cases may mostly be interested in average response times, enterprise use cases are often more concerned with maximum response times and the frequency and distribution of those slow IOs [12].
Performance vs. Workload
Workloads can be categorized in several ways. From the domain point of view, a workload can be imposed on the CPU, memory, bus, network, etc. The level of detail required in workload characterization depends on the goal of the evaluation; it can be at the computer component level or at the system/application level. In terms of applications, workloads may be extracted from databases, email, web services, desktops, etc.
An important difference among workload types is their rate [35], which makes the workload either static or dynamic. A static workload has a certain amount of work; when it is done, the job is complete. Usually, the job is some combination of small sets of given applications. In a dynamic workload, on the other hand, work continues to arrive all the time; it is never done, and characterizing it requires identifying all possible jobs.
From a practical point of view, workloads can be divided into three categories: file-level, object-level, and block-level. In this book, I focus on the block level because most underlying storage devices are actually block devices, and the techniques applied to block-level analysis can also be used for file-level and object-level analysis.
Trace Collection and Analysis
Workload traces can be collected using both software and hardware tools, actively or passively. The inherent logging mechanisms of some systems, which usually run as background activities, are one passive trace source. Actively, you may use specific hardware (e.g., a data collection card, bus analyzer, etc.) and software (e.g., dtrace, iperf, blktrace, etc.) to collect traces purposely. These traces may be at different levels of precision and detail. Sometimes you may also need the aid of benchmark tools when the environments of real applications are not available or are inconvenient to reproduce. Chapter 5 discusses this in detail.
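As a small example of post-processing a collected trace, the sketch below extracts the write ratio and request volume from blkparse-style text output; the whitespace-separated field layout assumed here (device, CPU, sequence, timestamp, PID, action, RWBS flags, start sector, '+', block count) matches the commonly seen default output, but it is an assumption that should be adapted if a custom blkparse format string is used.

```python
# Rough parser for blkparse-style text output. The field layout assumed below
# (dev cpu seq timestamp pid action rwbs sector + nblocks [process]) is the
# commonly seen default; adapt the indices for custom format strings.

def parse_completions(lines):
    """Yield (timestamp_s, is_write, nblocks) for completed ('C') requests."""
    for line in lines:
        fields = line.split()
        if len(fields) < 10 or fields[5] != "C":
            continue
        try:
            ts = float(fields[3])
            nblocks = int(fields[9])
        except ValueError:
            continue                         # skip summary or malformed lines
        yield ts, "W" in fields[6], nblocks  # fields[6] holds the RWBS flags

def summarize_blktrace(lines, block_bytes=512):
    timestamps, writes, blocks = [], 0, 0
    for ts, is_write, n in parse_completions(lines):
        timestamps.append(ts)
        writes += is_write
        blocks += n
    if not timestamps:
        return {}
    span = max(max(timestamps) - min(timestamps), 1e-9)
    return {"iops": len(timestamps) / span,
            "throughput_MBps": blocks * block_bytes / span / 1e6,
            "write_ratio": writes / len(timestamps)}

# Example, after something like `blktrace -d /dev/sdX -o - | blkparse -i - > trace.txt`:
#   with open("trace.txt") as f:
#       print(summarize_blktrace(f))
```

Real analyses typically also look at inter-arrival times, queue depths, and the spatial distribution of LBAs.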
System Optimization
One of the main purposes of trace analysis is to identify the system performance bottlenecks at various levels (e.g., component vs. system, user vs. kernel vs. hardware, etc.) and then optimize the overall system [36].

Figure: IO stack
In this book, I will provide some practical examples, ranging from single devices to complex systems, to show how the workload analysis can be applied to system optimization and design.