Your strongest defenses against webbots are techniques that detect webbot behavior. Webbots behave differently than people because they are machines and lack human reasoning ability. A webbot will therefore do things that a person won’t, and it lacks information that a person either knows or can deduce from the surrounding environment.
A spider trap is a technique that capitalizes on the behavior of a spider, forcing it to identify itself without interfering with normal human use. The spider trap in the following example exploits the spider behavior of indiscriminately following every hyperlink on a web page. If a link is invisible or unavailable to people using browsers, you’ll know that any agent following it is a spider. For example, consider the hyperlinks in Example 30-3.
Example 30-3. Two spider traps
<!-- A link with no anchor text: invisible in a browser, but a spider will follow it -->
<a href="spider_trap.php"></a>
<!-- A link anchored to a zero-size image: equally invisible to human visitors -->
<a href="spider_trap.php"><img src="spacer.gif" width="0" height="0"></a>
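Example 30-3 shows only the links; the page they point to does the actual detection. A minimal sketch of what spider_trap.php might contain appears below; the log file trap_log.txt and its tab-separated record format are assumptions made for illustration, not part of the original example.

```php
<?php
// spider_trap.php -- minimal sketch of a trap page.
// Assumes the webserver can append to trap_log.txt (hypothetical filename).
$ip    = $_SERVER['REMOTE_ADDR'];
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
$when  = date('Y-m-d H:i:s');

// Append one tab-separated record per visit; LOCK_EX guards against
// concurrent writes from simultaneous requests.
file_put_contents('trap_log.txt', "$when\t$ip\t$agent\n", FILE_APPEND | LOCK_EX);

// Return an empty page so the trap gives the spider nothing else to follow.
?>
```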
There are many ways to trap a spider. Other techniques include image maps with hot spots that don’t exist and hyperlinks located in invisible frames that lack width or height attributes; sketches of both variations follow.
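Below are two independent sketches of these traps, each intended for its own page. The image name banner.gif, the map coordinates, and real_content.html are illustrative placeholders rather than anything from the original text.

```html
<!-- Trap 1: an image map whose only hot spot is a zero-area rectangle
     leading to the trap. A person can't click it, but a spider parsing
     the markup will find the href. -->
<img src="banner.gif" usemap="#trap_map" width="400" height="50" alt="">
<map name="trap_map">
  <area shape="rect" coords="0,0,0,0" href="spider_trap.php" alt="">
</map>

<!-- Trap 2: a frameset in which the second frame has no visible size.
     Browsers render only the first frame, but a spider reading the
     markup will still request spider_trap.php. -->
<frameset rows="100%,*">
  <frame src="real_content.html">
  <frame src="spider_trap.php">
</frameset>
```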
Once unwanted guests are detected, you can treat them to a variety of services. Identifying a spider is the first step in dealing with it, and because webbots can spoof browser identities, a spider trap may be the only reliable way to tell automated traffic from human traffic. What you do once you detect a spider is up to you, but Table 30-1 should give you some ideas. Just remember to act within commonsense legal guidelines and your own website policies.
Table 30-1. Strategies for Responding When You Identify a Spider
| Strategy | Implementation |
|---|---|
| Banish | Record the IP addresses of spiders that reach the spider trap and configure the webserver to refuse future requests from those addresses (a sketch follows this table). |
| Limit access | Record the IP addresses of spiders caught in the trap and limit the pages they can access on their next visit. |
| Mislead | Depending on the situation, you could redirect known (unwanted) spiders to an alternate set of misleading web pages. As much as I love this tactic, you should consult with an attorney before implementing it. |
| Analyze | Analyze the IP address to find out where the spider comes from, who might own it, and what it is up to. A good resource for identifying IP addresses registered in the United States is http://www.arin.net. You could even create a special log that tracks all activity from known hostile spiders, or use this technique to learn whether a spider is part of a distributed attack. |
| Ignore | The default option is to simply ignore any automated activity on your website. |
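The Banish strategy is the easiest to prototype in PHP. The sketch below assumes the tab-separated trap_log.txt written by the earlier spider_trap.php example; in practice you would more likely push the banned addresses into the webserver configuration or a firewall, but the principle is the same.

```php
<?php
// banish.php -- minimal sketch of the "Banish" strategy.
// Include this at the top of each page to refuse service to any IP
// address that has previously reached spider_trap.php.
$banned = array();
if (file_exists('trap_log.txt')) {
    foreach (file('trap_log.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $fields = explode("\t", $line);   // timestamp, IP address, user agent
        if (isset($fields[1])) {
            $banned[$fields[1]] = true;
        }
    }
}
if (isset($banned[$_SERVER['REMOTE_ADDR']])) {
    header('HTTP/1.1 403 Forbidden');
    exit('Access denied.');
}
?>
```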