Most of the pages on a website are free to be accessed by web scrapers and bots. Sites allow this so that their pages can be indexed by search engines or discovered by content curators. Most websites, for example, are more than happy to give Googlebot access to their content. However, some sites may not want everything to show up in a Google search result. Imagine if you could google a person and instantly obtain all of their social media profiles, complete with contact information and address. This would be bad news for the person, and certainly not a good privacy practice for the company hosting the site. To control access to different parts of a website, site owners configure a robots.txt file.
The robots.txt file is typically hosted at the root of the website, at the /robots.txt resource. This file defines which clients may access which pages on the website. It does so by grouping rules under a User-Agent string that a bot matches against, and specifying which paths are allowed and disallowed for that agent. Wildcards are also supported in the Allow and Disallow statements. The following is an example robots.txt file from Twitter:
User-agent: *
Disallow: /
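Before your scraper requests anything else from a site, it can retrieve this file directly from the /robots.txt path and inspect the rules. The following is a minimal sketch using only Go's standard library; the twitter.com URL is just an illustration and can be swapped for any site you plan to scrape:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Request the robots.txt resource from the root of the site.
	resp, err := http.Get("https://twitter.com/robots.txt")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Read the rules so they can be printed or parsed.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```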
Twitter's is the most restrictive robots.txt file you will encounter. It states that no web scraper can access any part of twitter.com. Violating this will put your scraper at risk of being blacklisted by Twitter's servers. On the other hand, websites like Medium are a little more permissive. Here is their robots.txt file:
User-Agent: *
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/edit$
Disallow: /*/*/edit$
Allow: /_/
Allow: /_/api/users/*/meta
Allow: /_/api/users/*/profile/stream
Allow: /_/api/posts/*/responses
Allow: /_/api/posts/*/responsesStream
Allow: /_/api/posts/*/related
Sitemap: https://medium.com/sitemap/sitemap.xml
Looking at this file, you can see that editing profiles is disallowed by the following directives:
- Disallow: /*/edit$
- Disallow: /*/*/edit$
The pages that are related to logging in and signing up, which could be used for automated account creation, are also disallowed by Disallow: /m/.
If you value your scraper, do not access these pages. The Allow statements give explicit permission to paths under the /_/ route, as well as to some API-related resources. Outside of what is defined here, if there is no explicit Disallow statement, then your scraper is permitted to access the information. In the case of Medium, this includes all of the publicly available articles, as well as public information about the authors and publications. This robots.txt file also includes a sitemap, which is an XML-encoded file listing all of the pages available on the website. You can think of this as a giant index, which can come in very handy.
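Rather than interpreting these rules by hand, your scraper can check them programmatically. The sketch below is one way to do that, assuming the third-party github.com/temoto/robotstxt package; it tests a few hypothetical Medium paths against an excerpt of the rules shown above:

```go
package main

import (
	"fmt"

	"github.com/temoto/robotstxt"
)

func main() {
	// An excerpt of Medium's rules, embedded here for illustration only.
	rules := `User-Agent: *
Disallow: /m/
Disallow: /*/edit$
Allow: /_/api/posts/*/responses
`

	data, err := robotstxt.FromString(rules)
	if err != nil {
		panic(err)
	}

	// TestAgent reports whether a path is allowed for a given User-Agent.
	paths := []string{
		"/some-public-article-1234",   // no matching Disallow, so permitted
		"/@someuser/edit",             // matches Disallow: /*/edit$
		"/m/signin",                   // matches Disallow: /m/
		"/_/api/posts/1234/responses", // explicitly allowed
	}
	for _, p := range paths {
		fmt.Printf("%-30s allowed=%v\n", p, data.TestAgent(p, "my-scraper"))
	}
}
```

In practice you would load the live robots.txt file, as in the earlier snippet, rather than an embedded string.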
One more example of a robots.txt file shows how a site can define rules for specific User-Agent values. The following robots.txt file is from Adidas:
User-agent: *
Disallow: /*null*
Disallow: /*Cart-MiniAddProduct
Disallow: /jp/apps/shoplocator*
Disallow: /com/apps/claimfreedom*
Disallow: /us/help-topics-affiliates.html
Disallow: /on/Demandware.store/Sites-adidas-US-Site/en_US/
User-Agent: bingbot
Crawl-delay: 1
Sitemap: https://www.adidas.com/on/demandware.static/-/Sites-CustomerFileStore/default/adidas-US/en_US/sitemaps/adidas-US-sitemap.xml
Sitemap: https://www.adidas.com/on/demandware.static/-/Sites-CustomerFileStore/default/adidas-MLT/en_PT/sitemaps/adidas-MLT-sitemap.xml
This example explicitly disallows access to a few paths for all web scrapers and adds a special rule for bingbot. The bingbot must respect a Crawl-delay of 1 second, meaning it cannot request pages more often than once per second. Crawl-delays are very important to take note of, as they define how quickly you can make web requests. Violating the delay may generate more errors for your web scraper, or your scraper may be permanently blocked.
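Even when a Crawl-delay is not spelled out for your User-Agent, pacing your requests is good practice. A simple way to honor a one-second delay with Go's standard library is to gate each request on a ticker; the example.com URLs below are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Placeholder pages to visit; replace with real targets.
	pages := []string{
		"https://www.example.com/page-1",
		"https://www.example.com/page-2",
		"https://www.example.com/page-3",
	}

	// Allow at most one request per second, matching a Crawl-delay of 1.
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for _, page := range pages {
		<-ticker.C // wait for the next tick before each request
		resp, err := http.Get(page)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Println(page, resp.Status)
		resp.Body.Close()
	}
}
```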