Out-of-core distributed graph processing

Back in Chapter 8Graph-Based Data Processing, we designed and built our very own system for implementing graph-based algorithms based on the Bulk Synchronous Parallel (BSP) model. Admittedly, our final implementation was heavily influenced by the ideas from the Google paper describing Pregel [4], a system that was originally built by Google engineers to tackle graph-based computation at scale.

While the bspgraph package from Chapter 8Graph-Based Data Processing, can automatically distribute the graph computation load among a pool of workers, it is still limited to running on a single compute node. As our Links 'R' Us crawler augments our link index with more and more links, we will eventually reach a point where the PageRank computation will simply take too long. Updating the PageRank scores for the entire graphs might take a day or, worse, even days!

We can try to buy ourselves some time by scaling up, in other words, running our PageRank calculator service on the most powerful (CPU-wise) machine we can get our hands on from our cloud provider. That would give us some breathing room until the graph becomes too large to fit in memory! Once we reach this point, our only viable alternative is to scale out, or launch multiple compute nodes and assign a section of the, now massive, graph to each node.

In the following sections, we will be applying (quite literally!) everything that we have learned so far to build, from scratch, a distributed version of the bspgraph package, which will live in the Chapter12/dbspgraph folder, which you can browse at this book's GitHub repository.

As we did in the previous chapters, we will be once again applying the SOLID principles for our design to re-use as much code as possible. To this end, the new package will be nothing more than a sophisticated wrapper that transparently imbues any existing bspgraph.Graph instance with distributed computing superpowers!

This practically means that we can design and test our algorithms on a single machine using the bspgraph framework from Chapter 8Graph-Based Data Processing, and once satisfied with their output, switch to dsbpgraph for out-of-core processing.

As we all are aware, building distributed systems is a difficult task. In an attempt to minimize the complexity of the system we will be creating and make the code easier to follow, we will be splitting the implementation into a bunch of smaller, independent components and dedicate a section to the implementation of each one. Don't worry though—by the end of this chapter, you will have a clear understanding of how all of the bits and bobs fit together!