System overview – what are we going to be building?

Throughout the next chapters, we will be assembling, piece by piece, our very own search-engine. As with all projects, we need to come up with a cool-sounding name for it. Let me introduce you to Links 'R' Us!

So, what are the core functionalities of the Links 'R' Us project? The primary, and kind of obvious, functionality is being able to search for content. However, before we can make our search engine available to the public, we first need to seed it with content. To this end, we need to provide the means for users to submit URLs to our search engine. The search engine would then crawl those links, index their content, and add any newly encountered links to its database for further crawling.

Is this all we need for launching Links 'R' Us? The short answer is no! While user searches would return results containing the keywords from the users' search queries, we would lack the capability to order them in a meaningful way, especially if the results range in the thousands.

Consequently, we need to introduce some sort of a link or content quality metric to our system and order the returned results by it. Instead of re-inventing the wheel, we will be stepping on the shoulders of search-engine giants (that would be Google) and implementing a battle-tested algorithm called PageRank.

The PageRank algorithm was introduced by a nowadays very popular and heavily cited paper titled The PageRank Citation Ranking: Bringing Order to the Web. The original paper was authored back in 1998 by Larry Page, Sergey Brin, Rajeev Motwani, and Terry Winograd [9] and, over the years, has served as the basis for the search-engine implementation at Google.

Given a graph containing links between web-pages, the PageRank algorithm assigns an importance score to each link in the graph taking into account the number of links that lead to it and their relative importance scores.

While PageRank was initially introduced as a tool for organizing web content, its generalized form applies to any type of link graph. For the last few years, there has been on-going research into applying PageRank ideas in a multitude of fields ranging from biochemistry [5] to traffic optimization [10].

We will be exploring the PageRank algorithm in more detail in Chapter 8, Graph-Based Data Processing, and Chapter 12, Building Distributed Graph-Processing Systems, as part of a larger discussion centered around the various approaches we can employ to facilitate processing of large graphs on a single node or across a cluster of nodes (out-of-core graph processing).