Now that we understand how a compute cluster works, let's move on to another application of clustering.
Instead of aggregating compute resources to decrease processing times, a storage cluster aggregates available space to maximize space utilization while, at the same time, providing some form of redundancy. With the increasing need to store large amounts of data comes the need to do so at a lower cost, while still maintaining high data availability. Storage clusters solve this problem by allowing individual monolithic storage nodes to work together as one large pool of available storage space. This allows storage solutions to reach the petabyte scale without the need to deploy specialized proprietary hardware.
For example, say that we have a single node with 500 TB of available space and we need to reach the 1-petabyte (PB) mark while providing redundancy. This individual node is a single point of failure because, if it goes down, there's no way the data can be accessed. Additionally, we've already installed the largest hard disk drives (HDDs) available. In other words, we can't scale vertically any further.
To solve this problem, we can add two more nodes with the same configuration as the existing one. Now, let's do some math here: 500 TB times 3 is 1.5 PB of raw space, correct? The answer is most definitely yes. However, since we need to provide high availability, the third node acts as a backup, leaving 1 PB of usable space and making the solution tolerate a single-node failure without interrupting the client's communication. This ability to survive node failures is all thanks to the power of SDS and storage clusters, such as GlusterFS, which we'll explore next.
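To make the capacity arithmetic concrete, the following is a minimal Python sketch. It assumes a simplified redundancy model in which one node's worth of capacity is reserved as a backup; the function name and parameters are illustrative assumptions for this example, not part of GlusterFS or any specific SDS product.

```python
# Illustrative sketch of raw vs. usable capacity in a storage cluster
# where some nodes' worth of capacity is reserved for redundancy.
# (Hypothetical helper; not a GlusterFS-specific formula.)

def cluster_capacity(node_size_tb, node_count, redundant_nodes=1):
    """Return (raw_tb, usable_tb) for a cluster of identical nodes.

    redundant_nodes is how many nodes' worth of capacity is set aside
    so the cluster can survive that many node failures.
    """
    raw_tb = node_size_tb * node_count
    usable_tb = node_size_tb * (node_count - redundant_nodes)
    return raw_tb, usable_tb


# The scenario from the text: three 500 TB nodes, one kept as a backup.
raw, usable = cluster_capacity(node_size_tb=500, node_count=3, redundant_nodes=1)
print(f"Raw capacity:    {raw / 1000:.1f} PB")     # 1.5 PB
print(f"Usable capacity: {usable / 1000:.1f} PB")  # 1.0 PB
```

Running the sketch prints 1.5 PB of raw space but only 1.0 PB of usable space, which is the trade-off described above: redundancy is paid for with capacity.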