There are open source tools available that parse robots.txt files and check website URLs against them, telling you whether or not you are allowed access. One project I would recommend is available on GitHub, called robotstxt, by the user temoto. To download this library, run the following command in your terminal:
go get github.com/temoto/robotstxt
This will install the library on your machine at $GOPATH/src/github.com/temoto/robotstxt. If you would like, you can read the code to see how it all works. For the sake of this book, we will just be using the library in our own project. Inside your $GOPATH/src folder, create a new folder called robotsexample, and create a main.go file inside the robotsexample folder. The following code for main.go shows you a simple example of how to use the temoto/robotstxt package:
package main

import (
	"net/http"

	"github.com/temoto/robotstxt"
)

func main() {
	// Get the contents of robots.txt from packtpub.com
	resp, err := http.Get("https://www.packtpub.com/robots.txt")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Process the response using temoto/robotstxt
	data, err := robotstxt.FromResponse(resp)
	if err != nil {
		panic(err)
	}

	// Look for the group in the robots.txt file that matches the default Go User-Agent string
	grp := data.FindGroup("Go-http-client/1.1")
	if grp != nil {
		testUrls := []string{
			// These paths are all permissible
			"/all",
			"/all?search=Go",
			"/bundles",

			// These paths are not
			"/contact/",
			"/search/",
			"/user/password/",
		}

		for _, url := range testUrls {
			print("checking " + url + "...")

			// Test the path against the User-Agent group
			if grp.Test(url) {
				println("OK")
			} else {
				println("X")
			}
		}
	}
}
This code checks six different paths against the robots.txt file for https://www.packtpub.com/, using the default User-Agent string for the Go HTTP client. If the User-Agent is allowed to access a page, then the Test() method returns true. If it returns false, then your scraper should not access that section of the website.
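In a real scraper, you will usually want to perform this check immediately before every request rather than in a separate test loop. The following is a minimal sketch of that pattern, reusing the same robotstxt calls shown above; the fetchIfAllowed helper and its error message are my own names for illustration, not part of the library:

package main

import (
	"errors"
	"io/ioutil"
	"net/http"

	"github.com/temoto/robotstxt"
)

// fetchIfAllowed downloads baseURL+path only if the robots.txt group permits the path.
func fetchIfAllowed(grp *robotstxt.Group, baseURL, path string) ([]byte, error) {
	// Refuse to fetch paths that robots.txt disallows for this User-Agent group
	if !grp.Test(path) {
		return nil, errors.New("robots.txt disallows " + path)
	}

	resp, err := http.Get(baseURL + path)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	return ioutil.ReadAll(resp.Body)
}

func main() {
	// Fetch and parse robots.txt, just as in the previous example
	resp, err := http.Get("https://www.packtpub.com/robots.txt")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	data, err := robotstxt.FromResponse(resp)
	if err != nil {
		panic(err)
	}
	grp := data.FindGroup("Go-http-client/1.1")

	// Only download the page if the default Go User-Agent is allowed to
	body, err := fetchIfAllowed(grp, "https://www.packtpub.com", "/all")
	if err != nil {
		println(err.Error())
		return
	}
	println("downloaded", len(body), "bytes")
}

By wrapping the Test() call and the HTTP request in one function, every page your scraper downloads is guaranteed to have passed the robots.txt check first.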