As the crawler backend system,
I need to be able to obtain a list of sanitized links from the link graph,
so as to fetch and index their contents while expanding the link graph with any newly discovered links.
The acceptance criteria for this user story are as follows:
- The crawler can query the link graph and receive a list of stale links that need to be crawled (a minimal query-interface sketch follows this list).
- Links received by the crawler are retrieved from their remote hosts unless the remote server reports, via an ETag or Last-Modified header value the crawler has already seen, that the content is unchanged (see the conditional GET sketch after this list).
- Retrieved content is scanned for outgoing links, and the link graph is updated with any newly discovered ones (see the link-extraction sketch after this list).
- Retrieved content is indexed and added to the search corpus.
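The first criterion implies some way for the crawler to ask the graph store for links that have not been visited recently. The names below (`Link`, `LinkGraph`, `StaleLinks`) are illustrative assumptions rather than part of the story itself; a minimal Go sketch might look like this:

```go
package crawler

import "time"

// Link is a hypothetical record tracking when a URL was last crawled.
type Link struct {
	URL         string
	LastCrawled time.Time
}

// LinkGraph is an assumed interface for the graph store. StaleLinks would
// return every link whose last crawl predates the supplied cutoff, i.e.
// exactly the set of links the crawler should visit next.
type LinkGraph interface {
	StaleLinks(cutoff time.Time) ([]*Link, error)
}
```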
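The second criterion is a standard HTTP conditional GET: the crawler replays the ETag and Last-Modified values it recorded for a link as `If-None-Match` and `If-Modified-Since` request headers, and a `304 Not Modified` response tells it the cached copy is still fresh and re-indexing can be skipped. A minimal sketch using Go's `net/http` package (the `fetchIfChanged` helper is a hypothetical name):

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchIfChanged issues a conditional GET for url, replaying the ETag and
// Last-Modified values recorded for a previous crawl (empty strings if the
// link has never been fetched). The boolean result reports whether the
// content changed and the body needs to be re-processed.
func fetchIfChanged(url, etag, lastModified string) (*http.Response, bool, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, false, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, false, err
	}
	if res.StatusCode == http.StatusNotModified {
		res.Body.Close()
		return res, false, nil // content unchanged; skip re-indexing
	}
	return res, true, nil
}

func main() {
	res, changed, err := fetchIfChanged("https://example.com", "", "")
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()
	fmt.Println("changed:", changed, "status:", res.Status)
}
```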
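For the third criterion, the retrieved document must be tokenized and its anchor tags harvested before the link graph can be updated. Below is a sketch using the `golang.org/x/net/html` tokenizer; `extractLinks` is an illustrative helper, and a real crawler would also resolve relative URLs against the page's base URL before feeding them back into the graph:

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// extractLinks scans an HTML document and returns the values of all href
// attributes found on anchor tags.
func extractLinks(doc string) []string {
	var links []string
	tok := html.NewTokenizer(strings.NewReader(doc))
	for {
		switch tok.Next() {
		case html.ErrorToken:
			return links // io.EOF or a parse error ends the scan
		case html.StartTagToken:
			if t := tok.Token(); t.Data == "a" {
				for _, attr := range t.Attr {
					if attr.Key == "href" {
						links = append(links, attr.Val)
					}
				}
			}
		}
	}
}

func main() {
	fmt.Println(extractLinks(`<a href="https://example.com">example</a>`))
}
```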