While our end goal is to crawl and index the entire internet, the content behind the links we retrieve and index belongs to someone else. Those third parties may well object to us indexing some or all of the links to the domains under their control.
Fortunately, there is a standardized way for webmasters to tell crawlers not only which links they may and may not crawl, but also what crawl rate is acceptable so as not to place a high load on the remote host. This is achieved by authoring a robots.txt file and placing it at the root of each domain. The file contains a set of directives like the following (an example file appears after the list):
- User-Agent: The name of the crawler (user agent string) to which the subsequent directives apply
- Disallow: A URL path prefix (some crawlers additionally support wildcards such as *) that excludes matching URLs from being crawled
- Crawl-Delay: The number of seconds the crawler should wait between successive requests to this domain
- Sitemap: A link to an XML file that lists the links within a domain and provides metadata, such as a last-update timestamp, that crawlers can use to optimize their link access patterns
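To make these directives concrete, here is a hypothetical robots.txt file for an imaginary host at example.com. It asks every crawler to pause five seconds between requests, blocks two sections of the site, and advertises a sitemap:

```
User-Agent: *
Crawl-Delay: 5
Disallow: /admin/
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
```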
To be good netizens, we need to ensure that our crawler implementation respects the contents of any robots.txt file that it encounters. Last but not least, our crawler should properly handle the various status codes returned by remote hosts and dial down its crawl speed if it detects an issue with a remote host or the remote host decides to throttle us.
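The following is a minimal sketch, in Go, of how such a policy could be enforced. It assumes the third-party github.com/temoto/robotstxt parser (the FromResponse, FindGroup, Test, and CrawlDelay calls follow that package's API as I understand it); the fetchPolicy, allowed, and backoffFor helpers are hypothetical names introduced purely for illustration.

```go
package crawler

import (
	"fmt"
	"net/http"
	"strconv"
	"time"

	"github.com/temoto/robotstxt" // third-party robots.txt parser (assumed API)
)

// domainPolicy captures the crawl rules we extracted for a single host.
type domainPolicy struct {
	group *robotstxt.Group // the directive group that matches our user agent
	delay time.Duration    // minimum pause between requests to this host
}

// fetchPolicy downloads and parses the robots.txt file for a host.
func fetchPolicy(host, agent string) (*domainPolicy, error) {
	res, err := http.Get(fmt.Sprintf("https://%s/robots.txt", host))
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()

	// The parser also takes the HTTP status code into account; by
	// convention, a missing robots.txt (404) means "everything is allowed".
	data, err := robotstxt.FromResponse(res)
	if err != nil {
		return nil, err
	}

	group := data.FindGroup(agent)
	delay := group.CrawlDelay
	if delay == 0 {
		delay = time.Second // no Crawl-Delay directive; fall back to a polite default
	}
	return &domainPolicy{group: group, delay: delay}, nil
}

// allowed reports whether the given URL path may be crawled.
func (p *domainPolicy) allowed(path string) bool {
	return p.group.Test(path)
}

// backoffFor inspects the response for a crawled link and returns how long
// to pause before hitting the same host again. A 429 (Too Many Requests) or
// 503 (Service Unavailable) is a strong hint that we should slow down,
// honoring the Retry-After header when the server supplies it in seconds.
func backoffFor(res *http.Response, base time.Duration) time.Duration {
	switch res.StatusCode {
	case http.StatusTooManyRequests, http.StatusServiceUnavailable:
		if secs, err := strconv.Atoi(res.Header.Get("Retry-After")); err == nil {
			return time.Duration(secs) * time.Second
		}
		return 2 * base // no hint from the server; double our current delay
	default:
		return base
	}
}
```

In practice, these per-host policies would typically be cached and refreshed periodically rather than re-fetched for every link, and the back-off delay would feed into whatever rate limiter sits in front of the fetcher.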