The link graph

The link graph is responsible for keeping track not only of all links that the crawler has discovered so far but also of how they are connected. It exposes interfaces that let other components add links to the graph, remove links from it, and, of course, query it.

Several other system components depend on the interfaces exposed by the link graph component:

The link provider queries the link graph to decide which links should be crawled next.
The link extractor sub-component of the crawler adds newly discovered links to the graph.
The PageRank calculator component requires access to the entire graph's connectivity information so that it can calculate the PageRank score of each link.

Note that I am using the plural form, interfaces, rather than talking about a single interface. This is deliberate, as the link graph component is a prime candidate for implementing the Command Query Responsibility Segregation (CQRS) pattern.

The CQRS pattern belongs to the family of architectural patterns. The key idea behind CQRS is to separate the write and read models exposed by a particular component so they can be optimized in isolation. Commands refer to operations that mutate the state of the model, whereas queries retrieve and return the current model state.
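To make the command/query split more concrete, here is a minimal Go sketch. The names LinkCommands, LinkQueries, and InMemoryGraph are hypothetical and chosen purely for illustration; they are not the interfaces that the link graph component will end up exposing. The point is only that writers and readers can each depend on the narrow interface they actually need.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// LinkCommands is a hypothetical write model: operations that mutate the graph.
type LinkCommands interface {
	AddLink(url string)
	AddEdge(src, dst string)
	RemoveLink(url string)
}

// LinkQueries is a hypothetical read model: operations that only report state.
type LinkQueries interface {
	Links() []string
	OutDegree(url string) int
}

// InMemoryGraph is a toy type that happens to satisfy both interfaces.
type InMemoryGraph struct {
	mu    sync.RWMutex
	edges map[string]map[string]struct{} // source link -> set of destinations
}

func NewInMemoryGraph() *InMemoryGraph {
	return &InMemoryGraph{edges: make(map[string]map[string]struct{})}
}

// addLinkLocked registers url; the caller must hold the write lock.
func (g *InMemoryGraph) addLinkLocked(url string) {
	if _, ok := g.edges[url]; !ok {
		g.edges[url] = make(map[string]struct{})
	}
}

func (g *InMemoryGraph) AddLink(url string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.addLinkLocked(url)
}

func (g *InMemoryGraph) AddEdge(src, dst string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.addLinkLocked(src)
	g.addLinkLocked(dst)
	g.edges[src][dst] = struct{}{}
}

// RemoveLink drops the link together with any edges that point to it.
func (g *InMemoryGraph) RemoveLink(url string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.edges, url)
	for _, dsts := range g.edges {
		delete(dsts, url)
	}
}

// Links returns all known links in sorted order.
func (g *InMemoryGraph) Links() []string {
	g.mu.RLock()
	defer g.mu.RUnlock()
	links := make([]string, 0, len(g.edges))
	for url := range g.edges {
		links = append(links, url)
	}
	sort.Strings(links)
	return links
}

func (g *InMemoryGraph) OutDegree(url string) int {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return len(g.edges[url])
}

func main() {
	g := NewInMemoryGraph()
	var cmds LinkCommands = g   // what a writer such as the link extractor would see
	var queries LinkQueries = g // what a reader such as the link provider would see

	cmds.AddEdge("https://example.com", "https://example.com/about")
	fmt.Println(queries.Links(), queries.OutDegree("https://example.com"))
}
```

In this sketch, a single type backs both models. Nothing prevents the two interfaces from being served by entirely different implementations, which is where the pattern starts to pay off.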

This separation allows us to execute different business logic paths for reads and writes, and, in effect, enables us to implement complex access patterns. For example, writes could be a synchronous process whereas reads might be asynchronous and provide a limited view over the data.
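As a rough illustration of such diverging paths, the following Go sketch pairs a synchronous write with an asynchronous read that only exposes a limited view of the data. The Graph type and its AddLink and RecentLinks methods are assumptions made up for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// Graph is a hypothetical store used only for this illustration.
type Graph struct {
	mu    sync.RWMutex
	links []string
}

// AddLink is the synchronous write path: it returns once the link is stored.
func (g *Graph) AddLink(url string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.links = append(g.links, url)
}

// RecentLinks is the asynchronous read path: it returns immediately and
// delivers a copy of at most limit of the most recently added links.
func (g *Graph) RecentLinks(limit int) <-chan []string {
	if limit < 0 {
		limit = 0
	}
	out := make(chan []string, 1) // buffered so the goroutine never blocks
	go func() {
		g.mu.RLock()
		defer g.mu.RUnlock()
		start := len(g.links) - limit
		if start < 0 {
			start = 0
		}
		// Copy the tail so the caller cannot observe later mutations.
		out <- append([]string(nil), g.links[start:]...)
	}()
	return out
}

func main() {
	g := &Graph{}
	g.AddLink("https://example.com/a")
	g.AddLink("https://example.com/b")
	g.AddLink("https://example.com/c")

	pending := g.RecentLinks(2) // the caller is free to do other work here
	fmt.Println(<-pending)
}
```

The caller of RecentLinks never sees the full list of links, only the slice of the model that the read side chooses to expose.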

As another example, the component could utilize separate data stores for writes and reads. Writes would eventually trickle into the read store, and the read store data could also be augmented with external data obtained from other downstream components.
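The separate-store idea can be sketched in Go along the following lines. This is only one of many possible designs: the edgeAdded event, the WriteStore and ReadStore types, and the explicit Sync step are all hypothetical, and the sketch is deliberately single-threaded to keep the focus on how writes trickle into the read side.

```go
package main

import "fmt"

// edgeAdded is a hypothetical event recorded by the write side.
type edgeAdded struct{ Src, Dst string }

// WriteStore is an append-only log, which keeps writes cheap.
type WriteStore struct {
	log []edgeAdded
}

func (w *WriteStore) AddEdge(src, dst string) {
	w.log = append(w.log, edgeAdded{Src: src, Dst: dst})
}

// ReadStore keeps a denormalized view that is cheap to query.
// It is not safe for concurrent use; this is an illustration only.
type ReadStore struct {
	applied  int                 // number of log entries already applied
	outLinks map[string][]string // source link -> destination links
}

func NewReadStore() *ReadStore {
	return &ReadStore{outLinks: make(map[string][]string)}
}

// Sync applies the log entries that the read store has not seen yet and
// reports how many entries were applied.
func (r *ReadStore) Sync(w *WriteStore) int {
	pending := w.log[r.applied:]
	for _, ev := range pending {
		r.outLinks[ev.Src] = append(r.outLinks[ev.Src], ev.Dst)
	}
	r.applied += len(pending)
	return len(pending)
}

func (r *ReadStore) OutLinks(src string) []string {
	return r.outLinks[src]
}

func main() {
	w := &WriteStore{}
	r := NewReadStore()

	w.AddEdge("https://example.com", "https://example.com/about")
	fmt.Println("before sync:", len(r.OutLinks("https://example.com")))

	r.Sync(w)
	fmt.Println("after sync:", len(r.OutLinks("https://example.com")))
}
```

Until Sync runs, readers see a stale view. That is the price paid for letting each store be tuned for its own access pattern, and the Sync step is also a natural place to merge in data from other components.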