There are open source tools available that parse robots.txt files and check website URLs against them, telling you whether or not you are allowed access. One project I would recommend is available on GitHub, called robotstxt, by the user temoto. To download this library, run the following command in your terminal:
go get github.com/temoto/robotstxt
This will install the library on your machine at $GOPATH/src/github.com/temoto/robotstxt. If you would like, you can read the code to see how it all works. For the sake of this book, we will just be using the library in our own project. Inside your $GOPATH/src folder, create a new folder called robotsexample, and create a main.go file inside the robotsexample folder. The following code for main.go shows you a simple example of how to use the temoto/robotstxt package:
package main

import (
	"net/http"

	"github.com/temoto/robotstxt"
)

func main() {
	// Get the contents of robots.txt from packtpub.com
	resp, err := http.Get("https://www.packtpub.com/robots.txt")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Process the response using temoto/robotstxt
	data, err := robotstxt.FromResponse(resp)
	if err != nil {
		panic(err)
	}

	// Look for the group in the robots.txt file that matches the default Go User-Agent string
	grp := data.FindGroup("Go-http-client/1.1")
	if grp != nil {
		testUrls := []string{
			// These paths are all permissible
			"/all",
			"/all?search=Go",
			"/bundles",

			// These paths are not
			"/contact/",
			"/search/",
			"/user/password/",
		}

		for _, url := range testUrls {
			print("checking " + url + "...")

			// Test the path against the User-Agent group
			if grp.Test(url) {
				println("OK")
			} else {
				println("X")
			}
		}
	}
}
This code checks six different paths against the robots.txt file for https://www.packtpub.com/, using the default User-Agent string for the Go HTTP client. If the User-Agent is allowed to access a page, then the Test() method returns true. If it returns false, then your scraper should not access that section of the website.
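In a real scraper, you will usually want to perform this check immediately before every request rather than in a separate test loop. The following is a minimal sketch of that pattern, reusing the same robotstxt calls shown above; the fetchIfAllowed helper and its error message are my own names for illustration, not part of the library:

package main

import (
	"errors"
	"io/ioutil"
	"net/http"

	"github.com/temoto/robotstxt"
)

// fetchIfAllowed downloads baseURL+path only if the robots.txt group permits the path.
func fetchIfAllowed(grp *robotstxt.Group, baseURL, path string) ([]byte, error) {
	// Refuse to fetch paths that robots.txt disallows for this User-Agent group
	if !grp.Test(path) {
		return nil, errors.New("robots.txt disallows " + path)
	}

	resp, err := http.Get(baseURL + path)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	return ioutil.ReadAll(resp.Body)
}

func main() {
	// Fetch and parse robots.txt, just as in the previous example
	resp, err := http.Get("https://www.packtpub.com/robots.txt")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	data, err := robotstxt.FromResponse(resp)
	if err != nil {
		panic(err)
	}
	grp := data.FindGroup("Go-http-client/1.1")

	// Only download the page if the default Go User-Agent is allowed to
	body, err := fetchIfAllowed(grp, "https://www.packtpub.com", "/all")
	if err != nil {
		println(err.Error())
		return
	}
	println("downloaded", len(body), "bytes")
}

By wrapping the Test() call and the HTTP request in one function, every page your scraper downloads is guaranteed to have passed the robots.txt check first.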