Excluding the Bot

There are a number of reasons you might want to block bots from all, or part, of your site. For example, if your site is not complete, if you have broken links, or if you haven’t prepared your site for a search engine visit, you probably don’t want to be indexed yet. You may also want to protect parts of your site from being indexed if those parts contain sensitive information or pages that you know cannot be accurately traversed or parsed.

Note

Google requests that you block URLs that will give the bot hiccups, such as dynamic URLs that include calendar information and therefore have the potential for infinite expansion. You can block an individual URL by adding a rel="nofollow" attribute value to the anchor tag that links to it. For example:

<a rel="nofollow" href="botcantgohere">No follow me</a>

If you need to, you can request that part of your site not be indexed by any compliant search engine.

Note

Following the no-robots protocol is voluntary and based on the honor system, so all you can really be sure of is that a legitimate search engine that follows the protocol will not index the prohibited parts of your site when crawling from your site's root (pages reached through external links may still be traversed regardless of your exclusion file). Don't rely on search engine exclusion for security. Information that needs to be protected should be kept in password-protected locations and guarded by software hardened for security purposes.

To block bots from traversing your site, place a text file named robots.txt in your site’s web root directory (where the HTML files for your site are placed). The following syntax in the robots.txt file blocks all compliant bots from traversing your entire site:

User-agent: *
Disallow: /
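Conversely, a Disallow line with an empty value places nothing off-limits, so the following robots.txt permits compliant bots to traverse the entire site:

User-agent: *
Disallow: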

You can exercise more granular control over which bots you ban and which parts of your site are off-limits as follows:

  • The User-agent line specifies the bot that is to be banished.

  • The Disallow line specifies a path relative to your root directory that is banned territory.

For example, you would tell the Google search bot not to look in your cgi-bin directory (assuming the cgi-bin directory is right beneath your web root directory) by placing the following two lines in your robots.txt file:

User-agent: googlebot
Disallow: /cgi-bin/
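A single robots.txt file can also contain several groups of rules, each with its own User-agent line and one or more Disallow lines. As a sketch (the directory names here are only placeholders), the following bans the Googlebot from two directories while banning every other compliant bot from the whole site; the Googlebot obeys the group that names it, and other bots fall back to the * group:

User-agent: googlebot
Disallow: /cgi-bin/
Disallow: /images/

User-agent: *
Disallow: /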

For more information about working with the robots.txt file, see the Web Robots FAQ. You can also find tools for managing and generating custom robots.txt files and robot meta tags (explained later) at http://www.rietta.com/robogen/ (an evaluation version is available for free download).

The Googlebot and many other web robots can be instructed not to index specific pages (rather than entire directories), not to follow links on a specific page, and to index but not cache a specific page, all via an HTML meta tag placed inside the head tag.

The meta tag used to block a robot has two attributes: name and content. The name attribute is the name of the bot you are excluding. To exclude all robots, you’d include the attribute name="robots" in the meta tag.

To exclude a specific robot, the robot’s identifier is used. The Googlebot’s identifier is googlebot, and it is excluded by using the attribute name="googlebot". You can find the entire database of registered and excludable robots and their identifiers (currently about 300) at http://www.robotstxt.org/db.html.

The possible values of the content attribute are shown in Table 4-1. You can use multiple attribute values, separated by commas, but you should not use contradictory attribute values together (such as content="follow, nofollow").

For example, you can block Google from indexing a page, following links on a page, and caching the page using this meta tag:

<meta name="googlebot" content="noindex, nofollow, noarchive">

More generally, the following tag tells legitimate bots (including the Googlebot) not to index a page or follow any of the links on the page:

<meta name="robots" content="noindex, nofollow">
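In case it helps to see the tag in context, here is a minimal sketch of a page that uses it (the title and body content are placeholders); the meta tag belongs inside the head element:

<html>
<head>
<title>Private page</title>
<meta name="robots" content="noindex, nofollow">
</head>
<body>
...
</body>
</html>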

For more information about Google’s page-specific tags that exclude bots, and about the Googlebot in general, see http://www.google.com/bot.html.