Final Thoughts

Years of experience have taught me a few tricks for emulating forms. While it’s not hard to write a webbot that submits a form, it is often difficult to do it right the first time. Moreover, as you read earlier, there are many reasons to submit a form correctly the first time. I highly suggest reading the later chapters on stealth, fault tolerance, and potential legal issues (Chapter 26, Chapter 28, and Chapter 31) before creating webbots that emulate forms. These chapters provide additional insight into potential problems and perils that you’re likely to encounter when writing webbots that submit data to webservers.

If you’re using a webbot to create a competitive advantage for a client, you don’t want that fact to be widely known, especially to the people who run the targeted site.

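There are two ways a webbot can blow its cover while submitting a form: by failing to emulate a browser convincingly, and by submitting a form that generates errors on the server.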

Emulating a browser is easy, but you should verify that you’re doing it correctly. Your webbot can look like any browser you desire if you properly declare the name of your web agent. If you’re using the LIB_http library, the constant WEBBOT_NAME defines how your webbot identifies itself and, consequently, how servers record your web agent’s name in their log files. In some cases, webservers verify that you are using a particular web browser (most commonly Internet Explorer) before allowing you to submit a form.

If you plan to emulate a browser as well as the form, you should verify that the name of your webbot is set to something that looks like a browser (as shown in Example 5-11). Obviously, if you don’t change the default value for your webbot’s name in the LIB_http library, you’ll tell everyone who looks at the server logs that you’re using a test webbot.

Webmasters often notice strange user agent names because they routinely analyze their logs to see which browsers people use to access their sites, which helps them avoid browser compatibility problems.
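The check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the agent string is an example value, the guard around define() assumes the constant has not already been declared elsewhere (for instance, inside LIB_http), and looks_like_browser() is a hypothetical helper with a deliberately rough pattern, not part of any library.

```php
<?php
// Assumption: the agent name is declared through the WEBBOT_NAME constant.
// The string below is only an example of an Internet Explorer-style name.
if (!defined('WEBBOT_NAME')) {
    define('WEBBOT_NAME', 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)');
}

// Hypothetical helper: a rough heuristic that accepts names shaped like
// "Mozilla/x.y (details)", which is how common browsers identify themselves.
function looks_like_browser(string $agent_name): bool
{
    return preg_match('#^Mozilla/\d+\.\d+ \(.+\)#', $agent_name) === 1;
}

// Warn before any form is submitted under a name that stands out in the logs.
if (!looks_like_browser(WEBBOT_NAME)) {
    trigger_error('WEBBOT_NAME does not resemble a browser agent name', E_USER_WARNING);
}
```

A name such as "Test Webbot" fails this check, which is the point: it is better to catch the default name in your own code than to have a webmaster catch it in the server logs.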

Even more serious than using the wrong agent name is submitting a form that couldn’t possibly have come from the form the webserver provides on its website. These mistakes are logged in the server’s error log and are subject to careful scrutiny. Situations that could cause server errors include using the wrong form method, using the wrong protocol, and submitting the form to a form handler that doesn’t exist.

Using the wrong method can have several undesirable outcomes. If your webbot sends too much data with a GET method when the form specifies a POST method, you risk losing some of your data. (Most webservers restrict the length of a GET request.[24]) Another danger of using the wrong form method is that many form handlers expect variables to be members of either a $_GET or $_POST array, which is a keyed name/value array similar to the $data_array used in LIB_http. If you’re sending the form a POST variable called 'name', and the server is expecting $_GET['name'], your webbot will generate an entry in the server’s error log because it didn’t send the variable the server was looking for.
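One way to guard against a method mismatch is to read the method from the form itself before submitting. The following is a minimal sketch, assuming the page's HTML has already been downloaded into a string. Both function names are hypothetical, and the regular expressions are a rough stand-in for a real HTML parser.

```php
<?php
// Hypothetical helper: returns the method declared by the first <form> tag,
// or null when the page contains no form at all.
function declared_form_method(string $form_html): ?string
{
    if (!preg_match('/<form\b[^>]*>/i', $form_html, $tag)) {
        return null;
    }
    if (preg_match('/\bmethod\s*=\s*["\']?\s*(get|post)\b/i', $tag[0], $m)) {
        return strtoupper($m[1]);
    }
    return 'GET'; // HTML forms default to GET when no method is declared
}

// Hypothetical helper: true when the webbot's planned method matches the form.
function method_is_consistent(string $form_html, string $planned_method): bool
{
    return declared_form_method($form_html) === strtoupper($planned_method);
}

// Sample form markup, invented for this illustration.
$sample_form = '<form action="/search" method="post"><input name="name"></form>';
if (!method_is_consistent($sample_form, 'GET')) {
    echo 'Form declares ' . declared_form_method($sample_form) . "; do not submit it with GET.\n";
}
```

Running a check like this before each submission costs little and catches the case where the handler reads $_POST['name'] while the webbot sends the value in the query string.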

Also, remember that the form method isn’t the only thing that has to match: the protocol matters too. If the form handler expects an SSL-encrypted https connection and you deliver the emulated form to an unencrypted http address, the form handler won’t understand you because you’ll be sending data to the wrong server port. In addition, you’re potentially sending sensitive data over an unencrypted connection.
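The protocol can be checked the same way before anything is sent. This sketch assumes the form's action has already been resolved to a fully qualified URL. The helper name and the example addresses are hypothetical.

```php
<?php
// Hypothetical helper: true when the target URL uses the expected scheme.
// parse_url() returns null when the URL has no scheme (for example, a
// relative address), so such URLs are rejected rather than guessed at.
function uses_expected_scheme(string $action_url, string $expected_scheme): bool
{
    $scheme = parse_url($action_url, PHP_URL_SCHEME);
    return is_string($scheme) && strtolower($scheme) === strtolower($expected_scheme);
}

// Example address, invented for this illustration.
$target = 'http://www.example.com/login.php';
if (!uses_expected_scheme($target, 'https')) {
    echo "Refusing to submit: the form handler expects https.\n";
}
```

Refusing to submit is the safer default here, since a mistaken submission both logs an error on the server and exposes the form data in transit.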

The final thing to verify is that you are sending your emulated form to a web page that exists on the target server. Sometimes mistakes like this are the result of sloppy programming, but this can also occur when a webmaster updates the site (and form handler). For this reason, a proactive webbot designer verifies that the form handler hasn’t changed since the webbot was written.
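One possible way to detect a changed form handler is to fingerprint the form when the webbot is written and compare that fingerprint each time the webbot runs. This is an approach chosen for illustration, not one the text prescribes. The function name is hypothetical, the sample pages are invented, and the regular expressions are a rough substitute for a real HTML parser.

```php
<?php
// Hypothetical helper: builds a fingerprint from the first form's opening tag
// (which carries the action and method) plus the sorted names of its fields.
// Returns null when the page contains no form.
function form_fingerprint(string $page_html): ?string
{
    if (!preg_match('/<form\b[^>]*>.*?<\/form>/is', $page_html, $form)) {
        return null;
    }
    preg_match('/<form\b[^>]*>/i', $form[0], $tag);
    preg_match_all(
        '/<(?:input|select|textarea)\b[^>]*\bname\s*=\s*["\']?([^"\'\s>]+)/i',
        $form[0],
        $fields
    );
    $names = $fields[1];
    sort($names); // field order on the page should not affect the result
    return md5($tag[0] . '|' . implode(',', $names));
}

// Sample pages, invented for this illustration.
$page_when_written = '<form action="/login.php" method="post">'
    . '<input name="user"><input name="pass" type="password"></form>';
$page_today = '<form action="/signin.php" method="post">'
    . '<input name="user"><input name="pass" type="password"></form>';

if (form_fingerprint($page_today) !== form_fingerprint($page_when_written)) {
    echo "The form has changed; review the webbot before submitting.\n";
}
```

Because the fingerprint covers the opening tag verbatim, even a cosmetic change to that tag triggers a review. That errs on the side of caution, which suits a webbot that must not generate server errors.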



[24] Servers routinely restrict the length of a GET request to help protect the server from extremely long requests, which are commonly used by hackers attempting to compromise servers with buffer overflow exploits.