Now that you know how to write a webbot that determines search rankings and how to perform an insertion parse, here are a few other things to think about.
Remember that search engines make their money from the advertising shown alongside search results, not from the results themselves, and a webbot that scrapes result pages never sees or clicks those ads. The search-ranking webbot is a concept study and not a suggestion for a product that you should develop and place in a production environment, where the public would use it. Also—and this is important—you should not violate any search website’s Terms of Service agreement when deploying a webbot like this one.
Experience has taught me that some search sites serve pages differently if they think they’re dealing with an automated web agent. If you leave the default setting for the agent name (in LIB_http) set to Test Webbot, your programs will definitely look like webbots instead of browsers.
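To give you an idea of what that setting controls, here is a minimal sketch of a fetch that sends a browser-like agent name using PHP’s cURL extension. The agent string and target URL are placeholders, and the exact variable LIB_http uses to hold the agent name may differ from what is shown here.

<?php
// Sketch: fetch a page while identifying as a browser rather than "Test Webbot".
// The agent string below is only an example; use whatever agent name is
// appropriate for your project and your target's Terms of Service.
$agent_name = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";   // browser-like agent name
$target     = "http://www.example.com";                       // placeholder target

$ch = curl_init($target);
curl_setopt($ch, CURLOPT_USERAGENT, $agent_name);     // send the agent name with the request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the page as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow redirects
$page = curl_exec($ch);
curl_close($ch);
?>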
It is not a good idea to spider Google or any other search engine. I once heard (at a hacking conference) that Google limits individual IP addresses to 250 page requests a day, but I have not verified this. Others have told me that if you make the page requests too quickly, Google will stop replying after sending three result pages. Again, this is unverified, but it won’t be an issue if you obey Google’s Terms of Service agreement.
What I can verify is that, in other circumstances, I have written spiders for clients whose target websites did limit daily page fetches from a single IP address to 250. After the 251st fetch within a 24-hour period, the service ignored all subsequent requests coming from that IP address. For one such project, I put a spider on my laptop and ran it in every Wi-Fi–enabled coffee house I could find in South Minneapolis. This tactic involved drinking a lot of coffee, but it also produced a good number of unique IP addresses for my spider, and I was able to complete the job more quickly than if I had run the spider (in a limited capacity) over a period of many days in my office.
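If you do need to make repeated requests to a target, a self-imposed throttle keeps your webbot well under whatever limit a site enforces. The following sketch assumes a daily budget of 250 fetches and a 30-second pause between requests; both values are illustrative, drawn from the unverified figures above rather than from any published limit.

<?php
// Sketch: a self-imposed fetch budget with a delay between requests.
// The cap (250) and the delay (30 seconds) are assumptions, not documented limits.
$urls = array(
    "http://www.example.com/page1",    // placeholder pages to fetch
    "http://www.example.com/page2"
);

$daily_fetch_limit = 250;   // assumed per-IP daily budget
$delay_seconds     = 30;    // pause between requests
$fetch_count       = 0;

foreach ($urls as $url) {
    if ($fetch_count >= $daily_fetch_limit) {
        echo "Daily fetch budget reached; stopping until tomorrow.\n";
        break;
    }
    $page = file_get_contents($url);   // simple fetch for illustration
    $fetch_count++;
    sleep($delay_seconds);             // space out the requests
}
?>

In a real project you would replace file_get_contents() with your LIB_http fetch routine and store the fetch count somewhere persistent so the budget survives restarts.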
Despite Google’s best attempts to thwart automated use of its search results, there are rumors indicating that MSN (Microsoft’s search engine before Bing) was spidering Google to collect records for its own search engine.[38]
If you’re interested in these issues, you should read Chapter 31, which describes how to respectfully treat your target websites.
If you are interested in pursuing projects that use Google’s data, you should investigate Google’s developer APIs (Application Programming Interfaces), which make it easier for developers to use Google services in noncommercial applications. At the time of this writing, Google provided information about its developer APIs at http://code.google.com/more.
[38] Jason Dowdell, “Microsoft Crawling Google Results For New Search Engine?” November 11, 2004, WebProNews (http://www.webpronews.com/insiderreports/searchinsider/wpn-49-20041111MicrosoftCrawlingGoogleResultsForNewSearchEngine.html).