Chapter 31. Keeping Webbots out of Trouble

By this point, you know how to access, download, parse, and process nearly any of the 386 million websites on the Internet.[88] Knowing how to do something, however, does not give you the right to do it. While I have scattered warnings throughout the book, I haven’t, until now, focused on the consequences of designing webbots or spiders that act selfishly and without regard to the rights of website owners or related infrastructure.[89]

Since many businesses depend on the performance of their websites, you should consider interfering with a corporate website equivalent to interfering with a physical store or factory. When deploying a webbot or spider, remember that someone else is paying for the hosting, bandwidth, and development of the websites you target. Writing webbots and spiders that consume irresponsible amounts of bandwidth, guess passwords, or capriciously reuse intellectual property may well violate someone’s rights and will eventually land you in trouble.

Back in the day, before the popularization of the Internet, programmers had to earn their stripes before they won the confidence of their peers and gained access to networks or sensitive information. At that time, people who had access to data networks were less likely to abuse them because they had an ownership stake in the security of data and the performance of networks. One outcome of the Internet’s free access to information, open infrastructure, and apparently anonymous browsing is that it is now easier than ever to act irresponsibly. A free Wi-Fi connection gives anyone and everyone access to (and the opportunity to compromise) servers all over the world. With worldwide access to data centers and the ability to download ready-made exploits, it’s easy for people without a technical background (or a vested interest in the integrity of the Internet) to access confidential information or launch attacks that render services useless to others.

The last thing I want to do is pave the way for people to wreak havoc on the Internet. The purpose of this book is to help Internet developers think beyond the limitations of the browser and to develop webbots that do new and useful things. Even after more than two decades of existence, webbot development is still virgin territory, and there are many new and creative things to do with the skills you’ve learned in this book. You simply lack imagination if you can’t develop webbots that do interesting things without violating someone’s rights.

Webbots (and their developers) generally get into trouble when they make unauthorized use of copyrighted information or use an excessive amount of a website’s infrastructure (bandwidth, servers, administration, etc.). This chapter addresses both of these areas. We’ll also explore the requests webmasters make to limit webbot use on their websites.

Note

This chapter introduces warnings that all webbot, screen scraper, and spider writers should understand and consider before embarking on projects. While I’m trying to help you, please remember that I’m not dispensing legal advice, so don’t even think of blaming me if you misbehave and are sued or find the FBI knocking at your door. This is my attempt to identify a few (but not all) issues related to developing webbots and spiders. Perhaps with this information, you will be able to at least ask an attorney intelligent questions. To reiterate, I am not a lawyer, and this is not legal advice. My responsibility is to tell you that, if misused, automated web agents can get you into deep trouble. In turn, you’re obligated to take responsibility for your own actions and to consult an attorney who is aware of local laws before doing anything that even remotely violates the rights of someone else. I urge you to think before you act.

Your career as a webbot developer will be short-lived if you don’t respect the rights of those who own, maintain, and rely upon the web servers your webbots and spiders target. Remember that websites are designed for people using browsers and that a website’s profit model often depends on human traffic patterns. In a matter of seconds, a single webbot can create as much web traffic as a thousand web surfers, without generating commerce or ad revenue or extending a brand. It’s helpful to think of webbots as “super browsers” with expanded abilities, but in order to walk among mere browsers, webbots and spiders need to comply with the norms and customs of the other web agents on the Internet.
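To make that concrete, here is a minimal sketch of one polite-crawling convention: pausing between page requests and identifying the webbot in its User-Agent header so the webmaster can see who is visiting and how to make contact. The target URLs, the bot name, and the five-second delay are all hypothetical placeholders; an appropriate delay depends on the site you’re targeting.

    <?php
    # A minimal sketch of a "polite" fetch loop. All URLs and the contact
    # address are hypothetical; substitute sites you have permission to access.
    $targets = array(
        "http://www.example.com/page1.html",
        "http://www.example.com/page2.html"
    );

    foreach ($targets as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  # return the page as a string
        # Identify the webbot honestly so the webmaster can contact its owner
        curl_setopt($ch, CURLOPT_USERAGENT, "ExampleBot/1.0 (contact: you@example.com)");
        $page = curl_exec($ch);
        curl_close($ch);

        # ... parse and process $page here ...

        sleep(5);  # pause between requests so the webbot never hogs the server
    }
    ?>

A fixed delay is the simplest approach; a more considerate webbot might also lengthen the pause whenever the server starts responding slowly, since that’s a sign it is under load.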

In Chapter 30 you read about website policies, robots.txt files, robots meta tags, and other tools server administrators use to regulate webbots and spiders. It’s important to remember, however, that obeying a webmaster’s webbot restrictions does not absolve webbot developers of responsibility. For example, even if a webbot doesn’t find any restrictions in the website’s Terms of Service agreement, robots.txt file, or meta tags, the webbot developer still doesn’t have permission to violate the website’s intellectual property rights or use inordinate amounts of the web server’s bandwidth.
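Honoring those published restrictions is still the necessary starting point, though. Below is a minimal sketch, not the library code used elsewhere in this book, of how a webbot might test a path against a site’s robots.txt file before fetching it. It honors only the Disallow rules in the User-agent: * record, and the host and path shown are hypothetical.

    <?php
    # A minimal sketch of a robots.txt check. It only honors Disallow rules
    # in the "User-agent: *" record; the host and path below are hypothetical.
    function allowed_by_robots_txt($host, $path) {
        $robots = @file_get_contents("http://" . $host . "/robots.txt");
        if ($robots === false)
            return true;                                    # no robots.txt found
        $applies = false;
        foreach (explode("\n", $robots) as $line) {
            $line = trim(preg_replace('/#.*/', '', $line)); # strip comments
            if (stripos($line, "User-agent:") === 0)
                $applies = (trim(substr($line, 11)) == "*");
            elseif ($applies && stripos($line, "Disallow:") === 0) {
                $rule = trim(substr($line, 9));
                if ($rule != "" && strpos($path, $rule) === 0)
                    return false;                           # path is disallowed
            }
        }
        return true;
    }

    if (allowed_by_robots_txt("www.example.com", "/private/data.html"))
        echo "Fetch permitted by robots.txt\n";
    else
        echo "Path is disallowed -- skip it\n";
    ?>

A production webbot should also match its own user-agent string against named records and cache the robots.txt file instead of downloading it before every request.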



[88] This estimate of the number of websites on the Internet as of June 2011 comes from http://news.netcraft.com/archives/2011/06/07/june-2011-web-server-survey.html.

[89] If you interfere with the operation of one site, you may also affect other, non-targeted websites if they are hosted on the same (virtual) server.