Being good netizens

While our end goal is to crawl and index the entire internet, the truth of the matter is that the links we retrieve and index point to content that belongs to someone else. Those third parties may well object to us crawling or indexing some, or even all, of the links to the domains under their control.

Fortunately, there is a standardized way for webmasters to notify crawlers not only about which links they are and are not allowed to crawl but also to dictate an acceptable crawl rate so as not to place a high load on the remote host. This is all achieved by authoring a robots.txt file and placing it at the root of each domain. The file contains a set of directives for controlling crawler access.
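
The exact contents vary from site to site, but a hypothetical robots.txt file might look like the following (the paths and the crawler name are purely illustrative):

```
# Illustrative example; the paths and agent name are hypothetical
User-agent: *            # rules that apply to all crawlers
Disallow: /private/      # do not crawl anything under /private/

User-agent: linkcrawler  # rules specific to our own crawler
Crawl-delay: 10          # wait at least 10 seconds between requests
```

The User-agent line selects which crawlers a group of rules applies to, Disallow (and its counterpart, Allow) lists path prefixes that must be skipped or may be visited, and Crawl-delay is a de-facto extension, honored by some but not all crawlers, that requests a minimum pause between consecutive requests.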

To be good netizens, we need to ensure that our crawler implementation respects the contents of any robots.txt file that it encounters. Last but not least, our crawler should properly handle the various status codes returned by remote hosts and dial down its crawl rate if it detects that a remote host is struggling or is actively throttling us.
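
A minimal sketch of what this could look like in Go is shown below. It assumes the third-party github.com/temoto/robotstxt package for parsing; the agent name linkcrawler, the one-second default delay, and the backoff policy are illustrative choices rather than requirements of our design.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"

	"github.com/temoto/robotstxt"
)

// fetchRobots downloads and parses the robots.txt file for a host.
// FromResponse also takes the HTTP status code into account when
// deciding whether crawling should be allowed at all.
func fetchRobots(host string) (*robotstxt.RobotsData, error) {
	resp, err := http.Get("https://" + host + "/robots.txt")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return robotstxt.FromResponse(resp)
}

// crawlDelayFor returns the delay between requests to a host: the value
// of an applicable Crawl-delay directive, or a conservative default.
func crawlDelayFor(robots *robotstxt.RobotsData, agent string) time.Duration {
	if group := robots.FindGroup(agent); group != nil && group.CrawlDelay > 0 {
		return group.CrawlDelay
	}
	return time.Second // illustrative default
}

// backoffOnThrottle dials down the crawl rate when a remote host signals
// trouble: on 429 or 503 it honours a numeric Retry-After header if one
// is present, and otherwise doubles the current delay.
func backoffOnThrottle(resp *http.Response, current time.Duration) time.Duration {
	if resp.StatusCode != http.StatusTooManyRequests &&
		resp.StatusCode != http.StatusServiceUnavailable {
		return current
	}
	if secs, err := strconv.Atoi(resp.Header.Get("Retry-After")); err == nil {
		return time.Duration(secs) * time.Second
	}
	return current * 2
}

func main() {
	const agent = "linkcrawler" // hypothetical user-agent token

	robots, err := fetchRobots("example.com")
	if err != nil {
		fmt.Println("could not fetch robots.txt:", err)
		return
	}

	for _, path := range []string{"/", "/private/report.html"} {
		if robots.TestAgent(path, agent) {
			fmt.Println("allowed to crawl:", path)
		} else {
			fmt.Println("must skip:", path)
		}
	}
	fmt.Println("delay between requests:", crawlDelayFor(robots, agent))
}
```

In a full crawler pipeline, the per-host delay would typically be shared by every worker that crawls that host and adjusted after each response, so that a throttling signal from the remote host slows down all of our requests to it, not just one.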