This chapter provides the background of data storage systems and general trace analysis. I will show that the wide application of block storage devices motivates the intensive study of various block-level workload properties. I will also list the objectives and contributions of this book.
Basics of Storage
- Volatile
Commonly used types, such as DRAM (dynamic RAM), SRAM (static RAM), etc.
Types under development, such as T-RAM (thyristor RAM), Z-RAM (zero-capacitor RAM), etc.
- Non-volatile
ROM (read-only memory), such as EPROM (Erasable Programmable ROM), EEPROM (Electrically E-PROM), MROM (Mask ROM), etc.
NVRAM, such as flash memory, PCM (phase change memory), ReRAM/RRAM (resistive RAM), MRAM (magnetoresistive RAM), FeRAM (ferroelectric RAM), etc.
Mechanical devices such as HDDs, magnetic tape, and optical disc drives
When selecting a storage device or system, many factors must be considered carefully, such as price, performance, capacity, power efficiency, reliability, data integrity, durability, form factor, operating temperature, and connection type, depending on the application scenario. However, device performance is the major topic of this book.
Storage Devices
In this section, I discuss several types of non-volatile storage devices; volatile devices such as DRAM are also relevant because they are used inside these non-volatile devices as cache.
HDD

Figure: Basic components of an HDD

Figure: Basic HDD electronics blocks
In particular, the servo is one of the most precise mechanical systems in the world. The disk head that reads and writes data to the medium flies only a few nanometers above the disc media; this is similar to a Boeing 737 flying a few meters above the ground. The HSA (head stack assembly) is moved by applying a current to the coil of wire at its back end. This coil forms an electromagnet. The amount of current is calculated by the servo electronics. By varying the current, very precise acceleration and deceleration can be programmed, improving IO performance and head-positioning accuracy.
HDDs can be divided into two categories: consumer and enterprise. Consumer HDDs are mostly used in desktop and mobile devices such as notebooks. Consumer electronics HDDs are often embedded in digital video recorders, smart TVs, and automotive systems. Enterprise HDDs usually have higher reliability than consumer HDDs, with higher quality requirements for the media and heads.
Disk drives have different spinning speeds (rotations per minute, RPM). For example, desktop HDDs usually come in the 3.5-inch form factor at 7200 RPM, while mobile HDDs come in the 2.5-inch form factor at 5400 RPM. Each disc surface is divided into concentric zones. Inner-diameter (ID) zones have less physical space and contain fewer sectors than outer-diameter (OD) zones. Since the spinning speed is the same, the data transfer speed at the OD is generally faster than at the ID. For a typical 3.5-inch desktop HDD, the sequential read speed at the OD can be 1.5 to 2 times that at the ID.
For a typical HDD, the following formula calculates average access time (Ta) [2, 3]:
Ta = Ts + Tl + Tt + To    (1.1)
Seek time (Ts): The time required to move the heads a desired distance. It is typically specified at 1/3 of the radius of the platter. Settle time is generally included in this part.
Rotational latency (Tl): The amount of time the drive must wait before the data is under the read/write head.
Transfer time (Tt): The amount of time required to transfer data to or from the host.
Controller overhead (To): How long it takes the drive to decode a command from the host.
Note that the read head is usually different from the write head, and the internal firmware paths for reads and writes also differ, so there is a slight variance between read and write seek times. Write access usually costs more because of the longer settle time for writes, which is caused by the position error signal (PES) requirement: write access requires a tighter PES condition than read access. By design, higher-RPM drives have faster average access times than lower-RPM drives due to shorter rotational latency and seek times.
The response time (Tres) is a different concept from the access time. Since a conventional disk drive can only process one request at a time, some incoming requests have to wait in a queue. For example, some write requests may be buffered in the DRAM write cache first and must wait for the previous request to complete. Note that although there are many arms and heads per drive, the arms generally must move together since there is only one VCM to drive them. Thus,
Tres = Ta + Tw    (1.2)
where Tw is the waiting or queueing time between when the request enters the queue and when it is actually executed.
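As a concrete illustration of Equations 1.1 and 1.2, the short Python sketch below estimates Ta and Tres for a hypothetical 7200 RPM drive; the seek time, transfer size, bandwidth, and controller overhead values are illustrative assumptions, not the specification of any real product.

```python
# Worked example for Equations 1.1 and 1.2 with illustrative numbers
# (seek time, transfer size, bandwidth, and overhead are assumptions,
# not the specification of any real drive).

def avg_access_time_ms(seek_ms, rpm, transfer_kb, bandwidth_mb_s, overhead_ms):
    """Ta = Ts + Tl + Tt + To (Equation 1.1)."""
    rotational_latency_ms = 0.5 * 60_000.0 / rpm        # on average, half a revolution
    transfer_ms = transfer_kb / 1024.0 / bandwidth_mb_s * 1000.0
    return seek_ms + rotational_latency_ms + transfer_ms + overhead_ms

def response_time_ms(access_ms, wait_ms):
    """Tres = Ta + Tw (Equation 1.2)."""
    return access_ms + wait_ms

if __name__ == "__main__":
    ta = avg_access_time_ms(seek_ms=8.5, rpm=7200, transfer_kb=64,
                            bandwidth_mb_s=180.0, overhead_ms=0.2)
    print(f"Ta   = {ta:.2f} ms")                         # roughly 13 ms for these assumptions
    print(f"Tres = {response_time_ms(ta, wait_ms=5.0):.2f} ms")
```

For numbers like these, seek time and rotational latency (about 4.2 ms at 7200 RPM) dominate, which is why a single HDD typically sustains only on the order of 100 random IOPS.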
Owing to the influence of the command queue, the cache has a large impact on both read and write performance. Thus, a large portion of the DRAM inside an HDD is used for cache. Read cache and write cache commonly share the same space, so some write cache segments may be converted into read cache segments when necessary. However, some HDDs have dedicated read or write caches for different purposes. Chapter 4 presents more details.
Conventional magnetic recording (CMR) is a relative concept. Longitudinal magnetic recording (LMR) was the "conventional" technology relative to perpendicular magnetic recording (PMR) in the early 2000s. Nowadays, PMR is the dominant structure and is still evolving. For example, SMR (shingled magnetic recording) is a new type of PMR already available in the market, while HAMR (heat-assisted magnetic recording) and MAMR (microwave-assisted magnetic recording) are emerging.
SMR HDD
SMR is an emerging technique deployed to increase areal density in HDDs without drastic changes to the HDD mechanics [4, 5, 6, 7, 8]. Due to its shingled nature, SMR favors large sequential writes over random ones. In this background section, I give a general introduction to SMR characteristics.

Figure: Schematic of SMR
Due to this log-structure-like sequential write feature (which is beneficial for write performance), the conventional LBA-to-PBA mapping (direct/static mapping) may not work well, since any change to a block results in a read-modify-write access to all the following blocks in the same zone, which causes a performance penalty. Therefore, indirect/dynamic mapping is usually applied. When an update happens, instead of an in-place rewrite, an out-of-place "new" write is carried out. This has side effects: the data in the previous location becomes garbage, and the new write claims additional space. In order to reuse those garbage blocks, a garbage collection (GC) procedure must be implemented.
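To make indirect mapping and GC more concrete, here is a minimal Python sketch of an SMR-style zone store with out-of-place updates and a naive greedy garbage collector; the zone geometry and the GC policy are arbitrary assumptions for illustration and are far simpler than what real drives or host-managed software actually do.

```python
# Minimal sketch of indirect (dynamic) LBA-to-PBA mapping with out-of-place
# updates and greedy garbage collection. Zone geometry and GC policy are
# illustrative assumptions, not the scheme of any real SMR drive.

class SmrZoneStore:
    def __init__(self, zones=4, blocks_per_zone=8):
        self.nzones = zones
        self.bpz = blocks_per_zone
        self.write_ptr = [0] * zones                  # append-only pointer per zone
        self.valid = [dict() for _ in range(zones)]   # offset -> lba (live data only)
        self.lba_map = {}                             # lba -> (zone, offset)
        self.open_zone = 0

    def write(self, lba):
        """Out-of-place update: invalidate any old copy, append to the open zone."""
        if lba in self.lba_map:
            z, off = self.lba_map[lba]
            self.valid[z].pop(off, None)              # old copy becomes garbage
        z, off = self._allocate()
        self.valid[z][off] = lba
        self.lba_map[lba] = (z, off)

    def _allocate(self):
        if self.write_ptr[self.open_zone] == self.bpz:
            self._open_new_zone()
        z = self.open_zone
        off = self.write_ptr[z]
        self.write_ptr[z] += 1
        return z, off

    def _open_new_zone(self):
        # Prefer an empty zone; otherwise reclaim the zone with the most garbage.
        for z in range(self.nzones):
            if self.write_ptr[z] == 0:
                self.open_zone = z
                return
        victim = max(range(self.nzones),
                     key=lambda z: self.write_ptr[z] - len(self.valid[z]))
        if self.write_ptr[victim] == len(self.valid[victim]):
            raise RuntimeError("no garbage to reclaim: store is full")
        survivors = list(self.valid[victim].values())
        self.write_ptr[victim] = 0                    # reset the shingled zone
        self.valid[victim] = {}
        self.open_zone = victim
        for lba in survivors:                         # sequentially rewrite live data
            self.write(lba)
```

Rewriting the same LBA repeatedly leaves stale copies behind until the collector reclaims a zone, which is exactly why idle time for background GC matters so much for SMR performance.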
Another concern with out-of-place updates is the potential harm to sequential read performance. If LBA-continuous requests are written into several different physical zones, a later LBA-continuous read request over the same LBA range cannot gain the actual benefit of a "logically sequential" read. The corresponding data management scheme can be implemented at three levels: the drive, middleware, or host side. Although a few academic works have introduced in-place updates via special data layout designs, the out-of-place policy remains the main approach.
In general, an SMR drive expects the workload to read and write sequentially, with infrequent updates. In addition, since garbage data will generally accumulate at some point (unless data is never deleted or modified), idle time should be long enough to allow GC to run periodically without impacting external/user IO performance. Hence, the write-once-read-many (WORM) archival workload is a natural fit for the characteristics of SMR drives. A few other recent suggestions on SMR optimization are available in [9], e.g., hybrid strategies, parallel access, and large form factors.
Other HDDs
The PMR technique has approached its theoretical areal density limit for conventional designs (about 1 Tb/in²) in recent years. The limiting factor is the onset of the super-paramagnetic limit as researchers strive toward smaller-grained recording media. This imposes a tradeoff among the signal-to-noise ratio (SNR), the thermal stability of small-grain media, and the writability of a narrow-track head, which restricts the ability to continue scaling CMR technology to higher areal densities [10].
New Techniques to Increase Areal Density
Approaches | Reduce grain size and make grains harder to switch | Reduce bit width and/or length | Increase areal density/size, add heads and disks |
---|---|---|---|
Solutions | HAMR, MAMR | SMR, HAMR, TDMR | Helium drives, advanced mechanical designs, form factor optimization |

Figure: Future options of HDD [10]
HAMR and MAMR are two implementations of energy-assisted magnetic recording (EAMR). HAMR is proposed to overcome head writability issues. The core components of the proposed HAMR and MAMR technologies are a laser and a spin-torque-driven microwave oscillator, respectively. In HAMR, the media have to be stable at much smaller grain sizes yet writable at suitably elevated temperatures. The integration of HAMR and BPMR (bit-patterned magnetic recording) enables an extension of both technologies, with projected areal densities up to about 100 Tb/in² based on the thermal stability of known magnetic materials [10]. Both WDC and Seagate have announced EAMR roadmaps. Seagate claimed that its HAMR-based HDDs would be due in late 2018,2 while WDC declared that its MAMR would store 40TB on a hard drive by 2025.3
TDMR (two-dimensional magnetic recording) still uses a relatively conventional perpendicular medium and head, while combining shingled write recording (SWR) and/or 2D read-back and signal processing to promise particularly large gains. Recording with energy assist on BPM or 2D signal processing will enable areal densities beyond around 5 Tb/in². However, there is no clear problem-free solution so far.
Note that for HAMR/MAMR HDDs, the sequential access properties may be similar to SMR/TDMR. As heater start-up and cool-down take time, sequential access that reduces these state changes is preferred.

Figure: Projected density of future techniques [10]
SSD
A solid-state drive/disk (SSD) is a storage device that uses integrated circuit assemblies as memory to store data persistently [12]. Electronic interfaces compatible with traditional HDDs, such as SAS and SATA, are primarily used in SSDs. Recently, PCIe, SATA Express, and M.2 have become more popular due to increased bandwidth requirements.
The internal memory chips of an SSD can be NOR flash, NAND flash, or some other emerging NVM (non-volatile memory). As of 2017, most SSDs had started to use 3D TLC (triple-level cell) NAND-based flash memory.
SSDs are changing the storage industry. While the maximum areal storage density for HDDs is only 1.5 Tbit/in², the maximum for the flash memory used in SSDs reached 2.8 Tbit/in² in laboratory demonstrations as of 2016. Moreover, the areal density of flash memory is increasing by over 40% per year, compared with 10-20% for HDDs, and the price of SSDs ($ per GB) is dropping faster than that of HDDs.
Comparison of HDD and SSD
Attribute | SSD | HDD |
---|---|---|
Start-up time | Almost no delay because there are no mechanical components to prepare (some μs to ms). Usually a few ms to switch from an automatic power-saving mode. | Up to several seconds for disk spin-up. Up to a few hundred milliseconds to wake up from idle mode. |
Random access time | Typically less than 0.1 ms; usually not a significant performance bottleneck. | Typically from 2.5 ms (high-end server/enterprise drives) to 12 ms (laptop/mobile drives), mainly owing to seek time and rotational latency. |
Read latency time | Usually low due to direct data access from any location. For applications constrained by the HDD's positioning time, SSDs allow much faster boot and application launch times (see Amdahl's Law). A clean Windows OS may take less than 6 seconds to boot. | Generally much higher than SSDs. The time differs for each seek, depending on the data location on the media and the current head position.5 A clean Windows OS may take more than 1 minute to boot. |
Data transfer rate | Relatively consistent IO speed for relatively sequential IO. Performance is reduced when the portion of small random blocks is large. Typical values range from about 300 MB/s to 2500 MB/s for consumer products (commonly around 500 MB/s as of 2017), and up to multiple GB/s for enterprise class. | Heavily depends on RPM, which usually ranges from 3,600 to 15,000 (although 20,000 RPM also exists). Typical transfer speed is about 200 MB/s for a 3.5-inch drive at 7200 RPM. Some high-end drives can be faster, up to 300 MB/s. TPI and SPT are also influential factors. |
Read performance | Generally independent of the data location in the SSD. In a few cases, sequential access may be affected by fragmentation. | Random seeks are expensive. Fragmented files place data in different areas of the platter; therefore, response times are increased by the multiple seeks needed to retrieve the fragments. |
Write performance | Write amplification may occur.6 Wear-leveling techniques are implemented to mitigate this effect. However, the drive may unavoidably degrade at an observable rate due to the nature of flash. | CMR has no issue with write amplification. However, SMR may, due to out-of-place updates; GC is also required. |
Impacts of file system fragmentation | Relatively limited benefit from reading data sequentially, making fragmentation less significant for SSDs. Defragmentation would cause wear with additional writes.7 | Many file systems get fragmented over time if frequently updated. Maintaining optimum performance requires periodic defragmentation, although this may not be a problem for modern file systems due to their design and background garbage collection. |
Noise (acoustic) and vibration | SSDs are basically silent without moving parts. Sometimes, the high voltage generator (for erasing blocks) may produce pitch noise. Generally insensitive to vibration. | The moving parts (e.g., heads, actuator, and spindle motor) make characteristic sounds of whirring and clicking. Noise levels differ widely among models, and may be large.8 Mobile disks are relatively quiet due to better mechanical design. Generally sensitive to vibration.9 |
Data tiering | Hot data may move from slow devices to fast devices. It usually works with HDDs, although in some implementations, fast and slow SSDs are mixed. | HDDs are usually used as a slow tier in a hybrid system. Some striped disk arrays may provide comparable sequential access performance to SSDs. |
Weight and size | Essentially small and lightweight due to the internal structure. They usually have the same form factors (e.g., 2.5-inch) as HDDs, but thinner, with plastic enclosures. The M.2 (Next Generation Form Factor) format makes them even smaller. | HDDs are usually heavier than SSDs, since their enclosures are made of metal in general. 2.5-inch drives typically weigh around 200 grams while 3.5-inch drives weigh over 600 grams (depending on the enclosure materials, motors, disc magnets/number, etc.). Some slim designs for mobile could be less than 6mm thin. |
Reliability and lifetime | No mechanical failures. However, the limited number of write cycles per block may lead to data loss.10 A controller failure can make the SSD unusable. Reliability differs considerably among manufacturers, processes, and models.11 | Potential mechanical failures from wear and tear. The storage medium itself (the magnetic platter) does not essentially degrade from read/write accesses.12 |
Cost per capacity13 | Consumer-class SSD NAND pricing has dropped rapidly: US$0.60 per GB in April 2013; US$0.45, $0.37, and $0.24 per GB in April 2014, February 2015, and September 2016, respectively. The decline has slowed since late 2016. Prices may change again after 3D NAND becomes common.14 | Consumer HDDs cost about US$0.032 and $0.042 per GB for 3.5-inch and 2.5-inch drives in May 2017. The price of enterprise HDDs is generally more than 1.5 times that of consumer drives. The relatively stable prices of 2017 may change after MAMR/HAMR drives are released. |
Storage capacity | Sizes up to 60TB by Seagate were available as of 2016. 120 to 512GB models were more common and less expensive. | HDDs of up to 10TB and 12TB were available in 2015 and 2016, respectively. |
Read/write performance symmetry | Write speeds of less costly SSDs are typically significantly lower than their read speeds (often ≤1/3). High-end SSDs have similar read and write speeds. | Most HDDs have slightly longer/worse seek times for writing than for reading due to the longer settle time. |
Free block availability and TRIM command | Write performance is significantly influenced by the availability of free, programmable blocks. The TRIM command can reclaim old data blocks that are no longer in use; however, even with TRIM, fewer free blocks cause performance degradation. | CMR HDDs do not benefit from TRIM because they are not affected by free blocks. However, SMR performance is restricted by the availability of free zones, and TRIM-like commands are sometimes needed for dirty zones. |
Power consumption | High performance flash-based SSDs generally require 1/2 to 1/3 of the power of HDDs. Emerging technologies like PCM/RRAM are more energy-efficient.15 | 2.5-inch drives consume 2 to 5 watts typically, while some highest-performance 3.5-inch drives may use around 12 watts on average, and up to about 20 watts. Some special designs for green data centers send the disk to idle/sleep when necessary. 1.8- inch format lower-power HDDs may use as little as 0.35 watts in idle mode.16 |
- Strength
A mature technology widely employed by industries
Large scale/density, applicable for 3D techniques
A single drain contact per device group is required compared with NOR.
Relatively cheaper than other emerging NVM types for dollar/GB
- Weakness
Asymmetric performance (slower write than read)
Program/erase cycle (block-based, no write-in-place)
Data retention (retention gets worse as flash scales down)
Endurance (limited write cycles compared with HDD and other emerging NVMs)
100-1000× slower than DRAM; 10-1000× slower than PCM and FeRAM
Usually, the higher the capacity, the lower the performance.
- Opportunity
Scaling focused solely on density; density is higher than magnetic HDD in general.
Decreased cost, which will be comparable with HDD in the near future
3D schemes exist despite their complexity
Durability is improved to a certain degree together with fine-tuned wear-leveling algorithms.
Replacement for HDD in data centers as a mainstream choice (in particular, an all-flash array), although hybrid infrastructures will remain for some years.
- Threat
The extra connections used in the NOR architecture provide some additional flexibility when compared to NAND configuration.
The active development of MRAM/ReRAM may shake NAND flash's dominant position.
The real question is the market share of the two technologies, and the answer depends on how you measure it. Measured by revenue, SSD spending will overtake HDD spending in the very near future; measured by bits shipped, HDDs will still dominate for some years.
There are some other storage devices using flash memory. Flash thumb drives are similar to SSDs but with much lower speed, and they are commonly used for mobile applications. Kingston Digital released 1TB drives with a USB 3.0 interface (data transfer speeds up to 240 MB/s read and 160 MB/s write) in early 2017 and 2TB drives (up to 300 MB/s read and 200 MB/s write) in late 2017, which is comparable to HDD speeds.
Small-form-factor memory cards are also widely used in electronic devices, such as smartphones, tablets, and cameras. Some common formats include CompactFlash, Memory Stick, SD/MicroSD/MiniSD, and xD. SanDisk introduced Extreme Pro SD cards of up to 1TB in September 2016 and MicroSD cards of up to 400GB in August 2017.
Hybrid Disk
A hybrid drive is a logical or physical storage device that integrates a fast storage medium, such as a NAND/NOR flash SSD, with a slower medium, such as an HDD [15]. The fast device in a hybrid drive can act either as a cache for the data stored on the HDD or as a tier peered with the HDD. In general, the purpose is to improve overall performance by keeping copies of the most frequently used data (hot data) on the faster component. Back in the mid-2000s, some hard drive manufacturers like Samsung and Seagate explored the performance boost of putting an SSD inside an HDD. In 2007, Seagate and Samsung introduced the first hybrid drives, the Seagate Momentus PSD and the Samsung SpinPoint MH80.
There are generally two types of hybrid disks. One is the dual-drive structure (the tiering structure), where the SSD is the fast tier and the HDD is the slow tier; the OS usually recognizes it as two sub-storage devices. Western Digital's Black2 products introduced in 2013 and TarDisk's TarDisk Pear in late 2015 are two examples of dual-drive devices. The other is an integrated structure (solid-state hybrid drive, SSHD), where the SSD acts as a cache [16]; users or OSs see only one storage device without special operations.
A hybrid disk drive can operate in either self-optimized (self-learning) mode or host-optimized mode. In self-optimized mode, the SSHD works independently from the host OS, and the drive itself determines all actions related to data identification and migration between the HDD and SSD. This mode lets the drive appear and operate to a host system exactly like a traditional drive; a typical example is Seagate's Mobile and Laptop SSHD. Host-optimized mode is also called host-hinted mode: the host makes the decisions about data allocation between the HDD and SSD via the SATA interface (since SATA version 3.2). This mode usually requires software/driver support from the OS. Microsoft started to support host-hinted operations in Windows 8.1 (a patch for version 8 is available), while patches for the Linux kernel have been developed since October 2014. Western Digital's first generation of SSHDs is in this category.
However, hybrid drives have some limitations:
Performance is usually heavily application/workload dependent, and the drive may not be smart enough, being constrained by its limited resources.
Block-level optimization is not necessarily better than file/object-level optimization because less information about the workload is available; thus, optimizing the workload purely at the drive level is not recommended.
Hybrid disks are not well suited to general-purpose data center infrastructure due to their relatively static configurations.
Comparison of Some NVMs
 | STT-MRAM | PCMS (3D XPoint) | ReRAM | NAND Flash |
---|---|---|---|---|
Read latency | < 10 ns | < 100 ns | < 10 ns | 10-100 μs |
Write latency | 5 ns | > 150 ns | 50 ns | > 100 μs |
Power consumption | Medium | Medium | Medium | High |
Price (2016) | $200-3000/Gb | ≤ $0.5/Gb | $100/Gb | ≤ $0.05/Gb |
Endurance (number of cycles) | 10^12 to unlimited | 10^8-10^9 | 10^5-10^10 | 10^5-10^6 |
Tape and Disc
Magnetic tape was first used to record computer data in 1951. It usually works only with specific tape drives. Despite its slow speed, it is still widely used for cold data archiving. IBM and FujiFilm demonstrated a prototype BaFe tape with 123 Gb/in² areal density and 220TB cartridge capacity in 2015. Sony and IBM further increased these numbers to 201 Gb/in² and 330TB in a tiny tape cartridge in 2017.17 Instead of magnetic material painted on the surface of conventional tape, Sony used a "sputtering" method to coat the tape with a multilayer magnetic metal film, which is thinner and has narrower grains, using vertical bits. Note that tape and HDD share many similarities in servo control, such as servo patterns and nanometer precision.
An optical disc is a flat, usually circular disc that encodes binary data (bits) in the form of pits. An early optical disc system can be traced back to 1935. Since then, there have been four generations (a CD of about 700MB in the first generation, a DVD of about 4.7GB in the second generation, a standard Blu-ray disc of about 25GB in the third generation, and a fourth generation disc with more than 1TB data).
Both magnetic tapes and optical discs are usually accessed sequentially only. Some recent developments use robotic arms to change the tape/disc automatically. It is expected that tape and optical discs may still be active in the market for some years. In particular, due to a much lower price per GB than other media, tape seems to have a large potential market for extremely cold storage.
Emerging NVMs
Phase-change memory (PCM), such as 3D X-point
Magnetoresistive RAM (MRAM), such as STTRAM and Racetrack memory
Resistive RAM (RRAM/ReRAM), such as Memristor, Conductive-bridging RAM (CBRAM), Oxy-ReRAM
Ferroelectric RAM (FeRAM), such as FeFET
Others, such as conductive metal oxide (CMOx), solid electrolyte, NRAM (nano RAM), ZRAM (zero-capacitor), quantum dot RAM, carbon nanotubes, polymer printed memory, etc.
- Strength
Relatively mature (large-scale demos and products) compared with other emerging NVMs
Industry consensus on materials, like GeSbTe or GST
Large resistance contrast, which leads to analog states for MLC
Much longer endurance than NAND Flash
High scalability (still works at ultra-small F) and back-end-of-the-line compatibility
Potential very high speed (depending on material and doping)
- Weakness
The RESET step to the high-resistance state requires melting → power-hungry, with possible thermal crosstalk
Keeping switching power down → requires sub-lithographic features and a high-current access device
Filling a small feature → requires atomic layer deposition or chemical vapor deposition → difficult for now to replace GST with a better material
MLC is strongly impacted by relaxation of the amorphous phase → resistance drift
10-year retention at elevated temperatures (resistance drifts with time) can be an issue → recrystallization
Device characteristics change over time due to elemental segregation → device failure
Variability in small features broadens resistance distributions
- Opportunity
An order of magnitude lead over FeRAM, MRAM, etc.
NOR-replacement products are now shipping → promising if yield learning succeeds and MLC works (3-4 bits per cell were successfully implemented in PCM technologies despite the R-drift phenomenon in 2016)
Good for embedded NVM for SoC, Neuromorphic
Drift mitigation and/or 3D access devices can offer high density (i.e., low cost), which opens the opportunity for NAND replacement; finally S-type, and then M-type, SCM may follow.
Projected to reach 1.5B USD with an impressive CAGR of almost 84% by 2021
- Threat
Attained speed in practice is much slower than the theoretical speed; slow NOR-like interfaces
Current PCM SSDs are only several times faster than SLC SSDs, which is far from the projection.
DRAM/SRAM replacement may be challenging due to fundamental endurance limitation.
PCM as a DRAM segment accounted for the major share and dominated the market during 2016, which means S-SCM still has a long way to go.
A key challenge is to reduce reset (write) current; contact dimension scaling will help, but will slow progress.
Engineering process
NAND techniques are also under active development, in particular 3D NAND. Compared with these emerging NVMs, NAND is relatively mature, dense, and cheap. However, it can be much slower than PCM and ReRAM. Meanwhile, its endurance is generally significantly lower than that of PCM, MRAM, and FeRAM.

Figure: Competitive outlook among emerging NVMs
According to Yole Development’s recent estimation,19 the emerging NVM market will reach USD 4.6 billion by 2021, exhibiting an impressive growth of +110% per year, although the market size in 2015 was USD 53 million only. SCM will be the clear go-to market for emerging NVM in 2021. Marketsandmarkets20 also predicts that the global non-volatile memory market is expected to reach USD 82.03 billion by 2022, at a CAGR of 9.50% between 2017 and 2022.
Storage Systems
This section discusses system-level storage infrastructure and implementations. RAID (redundant array of independent/inexpensive disks) and EC (erasure code) systems are mainly used for failure tolerance. Hybrid storage systems aim to achieve relatively high performance at low cost. Microservers and Ethernet drives have been employed in some object storage systems. Software-defined systems separate the data flow and the control flow. Some large-scale storage system implementations, like Hadoop/Spark, OpenStack, and Ceph, are also introduced.
Infrastructure: RAID and EC
RAID as a data storage virtualization technology combines multiple physical drive components into a single logical unit or pool for the purposes of data redundancy, performance improvement, or both [20]. The Storage Networking Industry Association (SNIA) standardized RAID levels and their associated data formats from RAID 0 to RAID 6: "RAID 0 consists of striping, without mirroring or parity. RAID 1 consists of data mirroring, without parity or striping. RAID 2 consists of bit-level striping with dedicated Hamming-code parity. RAID 3 consists of byte-level striping with dedicated parity. RAID 4 consists of block-level striping with dedicated parity. RAID 5 consists of block-level striping with distributed parity. RAID 6 consists of block-level striping with double distributed parity." RAID 2-4 are generally not used in practice. RAID levels can also be nested, as in hybrid RAID; for example, RAID 10 and RAID 50 combine RAID 1 and RAID 5, respectively, with RAID 0 striping.
RAID can be implemented by either hardware or software. Hardware RAID controllers are expensive and proprietary, and usually used in enterprise environments. Software-based implementations have gained more popularity recently. Some RAID software is provided by modern OSs and file systems, such as Linux, ZFS, GPFS, and Btrfs. Hardware-assisted RAID software implements RAID mechanisms in a standard drive controller chip with embedded proprietary firmware and drivers.
Nowadays, RAID systems are widely used in SMEs, and even in some data centers RAID is still used as a fundamental structure for data protection. However, RAID is limited in its data reliability: even RAID 6 can tolerate only up to two disk failures, which is not secure enough for some critical applications. Thus, the erasure coding (EC) scheme has emerged as an alternative to RAID. In EC, data is broken into fragments that are expanded and encoded with a configurable number of redundant pieces and stored across different locations, such as disks, storage nodes, or geographical locations. Theoretically, EC can tolerate any number of disk failures, although in practice up to four redundant fragments are typically used per group. EC may also encounter performance issues, particularly when the system operates in degraded or recovery mode.
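As a toy illustration of the erasure-coding idea only (production systems use Reed-Solomon or similar codes that tolerate several simultaneous failures), the Python sketch below splits data into k fragments plus one XOR parity fragment, so any single lost fragment can be rebuilt; the fragment count and zero-padding are assumptions of this example.

```python
# Toy erasure-coding sketch: k data fragments plus one XOR parity fragment,
# so any single lost fragment can be rebuilt. Real EC systems use stronger
# codes (e.g., Reed-Solomon) with several parity fragments.

def encode(data: bytes, k: int = 4):
    frag_len = -(-len(data) // k)                      # ceiling division
    padded = data.ljust(k * frag_len, b"\0")           # zero-pad to a multiple of k
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = bytearray(frag_len)
    for frag in frags:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return frags + [bytes(parity)]                     # k data + 1 parity fragment

def reconstruct(frags):
    """Rebuild one missing fragment (marked None) by XOR-ing the others."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can repair only one lost fragment")
    if missing:
        frag_len = len(next(f for f in frags if f is not None))
        rebuilt = bytearray(frag_len)
        for f in frags:
            if f is not None:
                for i, b in enumerate(f):
                    rebuilt[i] ^= b
        frags[missing[0]] = bytes(rebuilt)
    return frags

if __name__ == "__main__":
    pieces = encode(b"block-level workload trace", k=4)
    pieces[2] = None                                   # simulate one failed disk/node
    repaired = reconstruct(pieces)
    print(b"".join(repaired[:4]).rstrip(b"\0"))        # original data (padding stripped)
```

In practice, the original data length would be stored as metadata rather than relying on stripping the padding, and each fragment would be placed on a different disk, node, or site.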
Hybrid Systems
Although all-flash arrays are gaining popularity, hybrid structures remain mainstream in data centers due to the trade-offs among cost, reliability, and performance. In the early days, a hybrid storage system contained HDDs as the fast tier and tape as the backup tier [21] [22]. Later, fast HDDs (such as 15,000 and 10,000 RPM) acted as the performance tier, and slow HDDs (such as 7200 and 5400 RPM) acted as the capacity tier [23]. With the development of non-volatile memory (NVM) technologies, such as NAND flash [24], PCM [25], STT-MRAM [18], and RRAM [19], the performance-cost ratio of NVMs keeps improving. Table 1-3 lists the performance and price comparison of some well-known NVMs. These NVMs, with their fast access speeds, can be used as the performance tier [17] [26] or cache [27] [28] [29] [30] in a modern hybrid system. Nowadays, SSDs are the first choice for the performance tier, and high-capacity shingled magnetic recording (SMR) drives are often used as the backup tier [31].

Figure: General algorithms for hybrid storage systems
Data allocation: Data allocation is conducted by the host or device controller to place incoming data in the most suitable storage location, such as hot data on the SSD and cold data on the HDD. Besides the properties of the data, the status of the devices, such as queue length, capacity usage, and bandwidth, is also considered during the allocation process.
Address mapping: Address mapping is required in a hybrid storage system because the capacities of the faster and slower devices differ. Due to the different address ranges, the target location of the incoming data must be translated to the actual device address when the data is allocated to a different device. An address translation table is required to keep all these translation entries. If the address range is large, the memory consumption of the translation table is huge and the translation speed drops, which may affect system performance.
Data migration (promotion/demotion): Data promotion migrates data from the slower devices to the faster devices, and data demotion migrates data from the faster devices to the slower devices; together these are called data migration. Data migration is usually conducted when data on the slower devices is identified as hot or data on the faster devices is identified as cold. In some research, data migration is also used to balance IOPS between the faster and slower devices.
Hot data identification: Hot data identification is important for data migration in order to select suitable data to promote or demote. It uses the properties of historical accesses to classify incoming data as hot or cold. The classification is typically done by checking the access frequency and recency of the data: the most frequently and most recently accessed data is identified as hot.
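As a minimal sketch of this frequency-plus-recency idea (the decay factor, threshold, and the notion of an "extent" are illustrative assumptions, not the policy of any particular product), the following Python class keeps an exponentially decayed access count per extent and flags extents above a threshold as hot.

```python
# Minimal hot-data identifier combining access frequency and recency via an
# exponentially decayed access count. Decay rate, threshold, and the notion
# of an "extent" are illustrative assumptions.

import time
from collections import defaultdict

class HotDataIdentifier:
    def __init__(self, decay_per_second=0.99, hot_threshold=3.0):
        self.decay = decay_per_second        # how fast old accesses lose weight
        self.hot_threshold = hot_threshold   # decayed count above which data is "hot"
        self.score = defaultdict(float)      # extent id -> decayed access count
        self.last_seen = {}                  # extent id -> last access time (seconds)

    def record_access(self, extent, now=None):
        now = time.time() if now is None else now
        if extent in self.last_seen:         # age the old score before adding this hit
            self.score[extent] *= self.decay ** (now - self.last_seen[extent])
        self.score[extent] += 1.0
        self.last_seen[extent] = now

    def is_hot(self, extent, now=None):
        now = time.time() if now is None else now
        score = self.score.get(extent, 0.0)
        if extent in self.last_seen:         # apply decay since the last access
            score *= self.decay ** (now - self.last_seen[extent])
        return score >= self.hot_threshold
```

A tiering engine could promote an extent when is_hot() becomes true and demote it once its decayed score falls back below the threshold; a real system would also weigh migration cost and device load, but the frequency-plus-recency core stays the same.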

Figure: The overall categories of hybrid storage architectures
Microservers and Ethernet Drives
A microserver is a server-class computer based on a system-on-a-chip (SoC) architecture. The goal is to integrate most of the server motherboard functions onto a single microchip, except DRAM, boot flash, and power circuits. The Ethernet drive is one of its various forms.
In October 2013, Seagate Technology introduced its Kinetic Open Storage platform with claims that the technology would enable applications to talk directly to the storage device and eliminate the traditional storage server tier. The company shipped its first near-line Kinetic HDDs in late 2014. The Kinetic drive is described as a key-value server with dual Ethernet ports that support the basic put, get, and delete semantics of object storage, rather than read-write constructs of block storage. Clients access the drive through the Kinetic API that provides key-value access, third-party object access, and cluster, drive, and security management.
Introduced in May 2015, Toshiba's KVDrive uses the key-value API that Seagate open sourced rather than reinventing the wheel. Ceph or Gluster could run directly on Toshiba's KVDrive.
WDC/HGST's converged microserver based on its Open Ethernet architecture supports any Linux implementation; theoretically, any network operating system can run directly on such a microserver. Ceph and the OpenStack Object Storage system have been demonstrated together with Red Hat server software. For example, in early 2016, WDC demonstrated a large-scale Ceph distributed storage system with 504 drives and 4PB of storage.21
Software-Defined Storage
TechTarget:22 SDS is an approach to data storage in which the programming that controls storage-related tasks is decoupled from the physical storage hardware (which places the emphasis on storage-related services rather than storage hardware).
Webopedia:23 SDS is storage infrastructure that is managed and automated by intelligent software as opposed to the storage hardware itself. In this way, the pooled storage infrastructure resources in a SDS environment (which can provide functionality such as deduplication, replication, thin provisioning, snapshots, and other backup and restore capabilities across a wide range of server hardware components) can be automatically and efficiently used to match the application needs of an enterprise.
Wikipedia:24 SDS is computer data storage software to manage policy-based provisioning and management of data storage independent of hardware. Software-defined storage definitions typically include a form of storage virtualization to separate the storage hardware from the software that manages the storage infrastructure. The software enabling a software-defined storage environment may also provide policy management for feature options such as deduplication, replication, thin provisioning, snapshots, and backup.
VMware:25 SDS is the dynamic composition of storage services (such as snaps, clones, remote replication, deduplication, caching, tiering, encryption, archiving, compliance, searching, intelligent logics) aligned on application boundaries and driven by policy.
Common Features of SDS
Level | Steps | Consequence |
---|---|---|
Data plane, control plane | Abstraction (decoupling/standardization, pooling/virtualization), automation (policy-driven) | Faster, more efficient, simpler |

Figure: The overall features of SDS
SDS also leads to some other concepts, such as the software-defined data center (SDDC). Based on the report by IDC and IBM,26 an SDDC is a loosely coupled set of software components that seek to virtualize and federate datacenter-wide hardware resources such as compute, storage, and network resources. The objective of an SDDC is to make the data center available in the form of an integrated service. Note that an implementation of SDS or SDDC may not be possible without the support of another software-defined concept, software-defined networking (SDN), which fundamentally changes the network infrastructure.
Implementation
In this section, I focus on some of the most recent software implementations for large-scale systems with distributed storage components.
Hadoop
Apache Hadoop,27 an open-source implementation of MapReduce (which originated at Google), provides a software framework for distributed storage and processing of big data sets. It runs on computer clusters built from commodity hardware. All the modules in Hadoop are designed under the fundamental assumption that hardware failures occur commonly and should be handled automatically by the framework.
Hadoop Common has the fundamental libraries and utilities required by other Hadoop modules.
Hadoop Distributed File System (HDFS) is a distributed file-system written in Java that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN is a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
Hadoop MapReduce processes large scale data, as an implementation of the MapReduce programming model.
HDFS stores large files (typically gigabytes to terabytes) across multiple machines. It achieves reliability through replication, storing data across multiple hosts, and hence theoretically does not require RAID storage on the hosts (although some RAID configurations, like RAID 0, are still useful). With the default replication value of 3, data is stored on three nodes. Data nodes can communicate with each other to rebalance data, move copies around, and keep the replication level high. HDFS is not fully POSIX-compliant because the requirements for a POSIX file system differ from the target goals of a Hadoop application. The trade-off of not being fully compliant is increased data throughput and support for non-POSIX operations such as Append. Although HDFS is the default distributed file system, it can be replaced by other file systems, such as FTP file systems, Ceph, Amazon S3, Windows Azure storage blobs (WASB), and others.

Figure: Hadoop ecosystem
OpenStack

Figure: OpenStack architecture [32]
OpenStack contains a block storage component called Cinder and an object storage component called Swift. Cinder manages the creation, attaching, and detaching of block devices to servers. Block storage volumes are fully integrated into Nova and the Horizon dashboard, allowing cloud users to manage their own storage needs. Block storage is appropriate for performance-sensitive scenarios on both local server storage and storage platforms (e.g., Ceph, GlusterFS, GPFS, etc.). Swift is a scalable, redundant object storage system.
Ceph
Ceph,31 a free and open source distributed storage platform, provides unified interfaces for object-, block-, and file-level storage [33, 34]. Ceph was initially created by Sage Weil for his doctoral dissertation. In 2012, Weil founded Inktank Storage to provide professional services and support for Ceph.

Figure: Ceph architecture
System Performance Evaluation
Common Metrics for Storage Devices
Metrics | Unit |
---|---|
Capacity | GB |
Areal density (TPI, SPT) | Gb/in² |
Volumetric density | TB/liter |
Write/read endurance | Times/years |
Data retention time | Years |
Speed (latency of random IO access) | Milliseconds |
Speed (bandwidth of sequential IO access) | MB/second |
Power consumption | Watts |
Reliability (MTBF) | Hours |
Power on/off transition time | Seconds |
Shock and vibration | G-force |
Temperature resistance | °C |
Radiation resistance | Rad |
The system’s design
The system’s implementation
The system’s workload
These three factors influence and interact with each other. It is common for a system to perform well for one workload but not for another. For a given storage system, the hardware design is usually fixed; however, it may provide some tuning parameters. If the parameters are also fixed in one scenario, the performance is usually "predictable" for a particular application. By running enough experiments, it is possible to find patterns relating the parameters to the application's general workload properties, and then further tune the parameters. Sometimes, due to design limitations, the range of the tuning parameters may be too narrow; then you must redesign the system.
Throughput, also called bandwidth, is related to the data transfer rate: the amount of data transferred to or from the storage device within a unit of time. It is often measured in KB/sec, MB/sec, or GB/sec. For disk drives, it usually refers to sequential access performance.
IOPS is the IO operation rate of the device, i.e., the number of IO transactions that can occur within a unit of time. For disk drives, it usually refers to random access performance.
Response time, also called latency, is the time between when a host sends a command to the storage device and when the result returns to the host; that is, the round-trip time cost of an IO request. It is measured in milliseconds (ms) or microseconds (μs) and is often cited as an average (AVE) or maximum (MAX) response time. In an HDD specification, the average seek time and switch time are usually provided.
Block size, which is the data transfer length
Read/write ratio, which is the mix of read and write operations
Random/sequential ratio, which is the random or sequential nature of the data address requests
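To tie these metrics together, the Python sketch below computes IOPS, throughput, average response time, read ratio, and a simple sequentiality measure from a generic block-trace table; the CSV column names and the 512-byte block unit are assumptions of this example, not the format of any particular tracing tool.

```python
# Compute basic workload metrics from a generic block-level trace.
# Assumed CSV columns (an illustrative layout, not a standard format):
#   timestamp_s, op (R/W), lba, blocks, latency_ms
import csv

def summarize_trace(path, block_bytes=512):
    rows = []
    with open(path, newline="") as f:
        for r in csv.DictReader(f):
            rows.append((float(r["timestamp_s"]), r["op"].strip().upper(),
                         int(r["lba"]), int(r["blocks"]), float(r["latency_ms"])))
    if not rows:
        return {}
    rows.sort()                                        # order by arrival time
    span = max(rows[-1][0] - rows[0][0], 1e-9)         # observation window in seconds
    total_bytes = sum(blocks * block_bytes for _, _, _, blocks, _ in rows)
    reads = sum(1 for _, op, _, _, _ in rows if op == "R")
    sequential = sum(1 for prev, cur in zip(rows, rows[1:])
                     if cur[2] == prev[2] + prev[3])   # starts right after previous request
    return {
        "iops": len(rows) / span,
        "throughput_MBps": total_bytes / span / 1e6,
        "avg_response_ms": sum(lat for *_, lat in rows) / len(rows),
        "read_ratio": reads / len(rows),
        "sequential_ratio": sequential / max(len(rows) - 1, 1),
    }

# Example: print(summarize_trace("trace.csv"))
```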
In addition, when considering consumer/client versus enterprise devices/systems, the focus may be different. For example, in some client use cases, IOPS and bandwidth may be more critical than response time for HDD/SSD devices, as long as response times are not excessively slow, since a typical client user will not notice a single IO taking a long time (unless the OS or a software application is waiting for a single specific response). While client SSD use cases may mostly be interested in average response times, enterprise use cases are often more concerned with maximum response times and the frequency and distribution of those slow IOs [12].
Performance vs. Workload
Workloads can be categorized in several ways. From the domain point of view, a workload can be imposed on the CPU, memory, bus, network, etc. The level of detail required in workload characterization depends on the goal of the evaluation; it can be at the computer component level or at the system/application level. In terms of applications, workloads may be extracted from databases, email, web services, desktops, etc.
An important difference among workload types is their rate [35], which makes the workload either static or dynamic. A static workload has a certain amount of work; when it is done, the job is complete. Usually, the job is some combination of small sets of given applications. In a dynamic workload, on the other hand, work continues to arrive all the time; it is never done, and characterizing it requires identifying all possible jobs.
From a practical point of view, workloads can be divided into three categories: file-level, object-level, and block-level. In this book, I focus on the block level because most underlying storage devices are actually block devices, and the techniques applied to block-level analysis can also be used for file-level and object-level analysis.
Trace Collection and Analysis
Workload traces can be collected using both software and hardware tools, actively or passively. The inherent logging mechanisms of some systems, which usually run as background activities, are one passive trace source. Actively, you may use specific hardware (e.g., a data collection card, bus analyzer, etc.) and software (e.g., dtrace, iperf, blktrace, etc.) to collect traces purposely. These traces may be at different levels of precision and detail. Sometimes you may also need the aid of benchmark tools when the environments of real applications are not available or are inconvenient to reproduce. Chapter 5 discusses this in detail.
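As a small example of post-processing a collected trace, the sketch below extracts the write ratio and request volume from blkparse-style text output; the whitespace-separated field layout assumed here (device, CPU, sequence, timestamp, PID, action, RWBS flags, start sector, '+', block count) matches the commonly seen default output, but it is an assumption that should be adapted if a custom blkparse format string is used.

```python
# Rough parser for blkparse-style text output. The field layout assumed below
# (dev cpu seq timestamp pid action rwbs sector + nblocks [process]) is the
# commonly seen default; adapt the indices for custom format strings.

def parse_completions(lines):
    """Yield (timestamp_s, is_write, nblocks) for completed ('C') requests."""
    for line in lines:
        fields = line.split()
        if len(fields) < 10 or fields[5] != "C":
            continue
        try:
            ts = float(fields[3])
            nblocks = int(fields[9])
        except ValueError:
            continue                         # skip summary or malformed lines
        yield ts, "W" in fields[6], nblocks  # fields[6] holds the RWBS flags

def summarize_blktrace(lines, block_bytes=512):
    timestamps, writes, blocks = [], 0, 0
    for ts, is_write, n in parse_completions(lines):
        timestamps.append(ts)
        writes += is_write
        blocks += n
    if not timestamps:
        return {}
    span = max(max(timestamps) - min(timestamps), 1e-9)
    return {"iops": len(timestamps) / span,
            "throughput_MBps": blocks * block_bytes / span / 1e6,
            "write_ratio": writes / len(timestamps)}

# Example, after something like `blktrace -d /dev/sdX -o - | blkparse -i - > trace.txt`:
#   with open("trace.txt") as f:
#       print(summarize_blktrace(f))
```

Real analyses typically also look at inter-arrival times, queue depths, and the spatial distribution of LBAs.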
System Optimization
One of the main purposes of trace analysis is to identify the system performance bottlenecks at various levels (e.g., component vs. system, user vs. kernel vs. hardware, etc.) and then optimize the overall system [36].

Figure: IO stack
In this book, I will provide some practical examples, ranging from single devices to complex systems, to show how the workload analysis can be applied to system optimization and design.