Form Handlers, Data Fields, Methods, and Event Triggers

Web-based forms have four main parts, as shown in Figure 6-2: a form handler, data fields, a method, and an event trigger.

I’ll examine each of these parts in detail and then show how a webbot emulates a form.

Figure 6-2. Parts of a form

The action attribute in the <form> tag defines the web page that interprets the data entered into the form. We’ll refer to this page as the form handler. If there is no defined action, the form handler is the same as the page that contains the form. The examples in Table 6-1 compare the location of form handlers in a variety of conditions.
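As a rough illustration of how a webbot might locate a form handler, the sketch below resolves an action attribute against the address of the page that contains the form. The function name resolve_form_handler() and the example addresses are hypothetical and are not part of LIB_http. The sketch covers only empty, fully qualified, protocol-relative, root-relative, and simple relative actions; it does not handle ../ segments or actions that consist only of a query string.

```php
<?php
// Hypothetical helper (not part of LIB_http): find the form handler's address
// from the page address and the form's action attribute.
function resolve_form_handler($page_url, $action)
{
    // No action defined: the page that contains the form is the form handler
    if ($action === null || $action === "") {
        return $page_url;
    }

    // Fully qualified action: use it unchanged
    if (preg_match('#^https?://#i', $action)) {
        return $action;
    }

    $parts = parse_url($page_url);
    $root  = $parts['scheme'] . "://" . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ":" . $parts['port'];
    }

    // Protocol-relative action (//host/page): borrow the page's scheme
    if (substr($action, 0, 2) === "//") {
        return $parts['scheme'] . ":" . $action;
    }

    // Root-relative action (/page): append it to the site root
    if ($action[0] === "/") {
        return $root . $action;
    }

    // Relative action: same directory as the page that contains the form
    $path = isset($parts['path']) ? $parts['path'] : "/";
    $dir  = substr($path, 0, strrpos($path, "/") + 1);
    return $root . $dir . $action;
}

$page = "http://www.schrenk.com/forms/page.php";        // assumed page address
echo resolve_form_handler($page, "") . "\n";            // the page itself
echo resolve_form_handler($page, "search.php") . "\n";  // same directory
echo resolve_form_handler($page, "/search.php") . "\n"; // site root
```

Whatever the form's action looks like, the webbot only needs the final, fully qualified address to use as its target.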

Servers have no use for the form’s name, which is the variable that identifies the form. This variable is only used by JavaScript, which associates the form name with its form elements. Since servers don’t use the form’s name, webbots (and their designers) have no use for it either.

Form input tags define data fields and the name, value, and user interface used to input the value. The user interface (or widget) can be a text box, text area, select list, radio control, checkbox, or hidden element. Remember that while there are many types of interfaces, they are completely meaningless to the webbot that emulates the form and the server that handles the form. From a webbot’s perspective, there is no difference between data entered via a text box or a select list. The input tag’s name and its value are the only things that matter.

Every data field must have a name.[22] These names become form data variables, or containers for their data values. In Example 6-1, a variable called session_id is set to 0001, and the value for search is whatever was in the text box labeled Search when the user clicked the submit button. Again, from a webbot designer’s perspective, it doesn’t matter what type of data elements define the data fields (hidden, select, radio, text box, etc.). It is important that the data has the correct name and that the value is within a range expected by the form handler.
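To see why the widget type is irrelevant, here is a minimal sketch that reduces a form to the name/value pairs a webbot would submit. The HTML is a hypothetical stand-in for a downloaded page (it is not Example 6-1), and the sketch reads only input and select tags, using PHP's DOM extension.

```php
<?php
// Hypothetical form markup, standing in for a page the webbot has downloaded
$html = <<<HTML
<form name="search_form" action="search.php" method="POST">
    <input type="hidden" name="session_id" value="0001">
    <input type="text" name="search" value="">
    <select name="sort">
        <option value="up" selected>Ascending</option>
        <option value="down">Descending</option>
    </select>
    <input type="submit" value="Search">
</form>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);          // @ suppresses warnings about imperfect HTML

$data_array = array();

// Hidden fields, text boxes, and other <input> widgets all reduce to name => value
foreach ($dom->getElementsByTagName('input') as $input) {
    $name = $input->getAttribute('name');
    if ($name !== '') {          // unnamed elements are not data fields
        $data_array[$name] = $input->getAttribute('value');
    }
}

// A select list reduces to the value of its selected (or first) option
foreach ($dom->getElementsByTagName('select') as $select) {
    $name = $select->getAttribute('name');
    if ($name === '') {
        continue;
    }
    $value = null;
    foreach ($select->getElementsByTagName('option') as $option) {
        if ($value === null || $option->hasAttribute('selected')) {
            $value = $option->getAttribute('value');
        }
    }
    $data_array[$name] = ($value === null) ? '' : $value;
}

print_r($data_array);
```

The resulting array holds session_id, search, and sort with their default values. The unnamed submit button contributes nothing, and nothing in the array records which kind of widget each value came from.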

The form’s method describes the protocol used to send the form data to the form handler. The most common methods for form data transfers are GET and POST.

You are already familiar with the GET method, because it is identical to the protocol you used to request web pages in previous chapters. With the GET protocol, the URL of a web page is combined with data from form elements. The address of the page and the data are separated by a ? character, and individual data variables are separated by & characters, as shown in Example 6-2. The portion of the URL that follows the ? character is known as a query string.
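As a small sketch of how such a URL is assembled, the following uses PHP's built-in http_build_query() function to turn an array of form variables into a query string. The handler address and the field names are assumptions chosen for illustration.

```php
<?php
// Assumed example values; any form handler address and field names would do
$action     = "http://www.schrenk.com/search.php";
$data_array = array("term" => "hello world", "sort" => "up");

// http_build_query() URL-encodes each name and value and joins the pairs;
// the third argument sets the separator explicitly to &
$query_string = http_build_query($data_array, '', '&');

// The ? character separates the page address from the query string
$url = $action . "?" . $query_string;

echo $url . "\n";
```

Here the space in "hello world" is encoded as a + character, so the finished URL is http://www.schrenk.com/search.php?term=hello+world&sort=up.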

Since GET form variables are combined with the URL, the web page that accepts the form will not be able to tell the difference between the form submitted in Example 6-3 and the form emulation techniques shown in Example 6-4 and Example 6-5. In each case, the variables term and sort will be submitted to the web page http://www.schrenk.com/search.php with the GET protocol.[23]

Alternatively, you could use LIB_http to emulate the form, as in Example 6-4.

Example 6-4. Using LIB_http to emulate the form in Example 6-3 with data passed in an array

include("LIB_http.php");

$action = "http://www.schrenk.com/search.php";    // Address of form handler
$method="GET";                                    // GET method
$ref = "";                                        // Referer variable
$data_array['term'] = "hello";                    // Define term
$data_array['sort'] = "up";                       // Define sort
$response = http($target=$action, $ref, $method, $data_array, EXCL_HEAD);

Since the GET method places form information in the URL’s query string, you could also emulate the form with a script like Example 6-5.

Example 6-5. Emulating the form in Example 6-3 by combining the URL with the form data

include("LIB_http.php");

$action = "http://www.schrenk.com/search.php?term=hello&sort=up";
$method = "GET";
$ref = "" ;
$response = http($target=$action, $ref, $method, $data_array="", EXCL_HEAD);

The reason we might choose Example 6-4 over Example 6-5 is that the code is cleaner when form data is treated as array elements, especially when many form values are passed to the form handler. Passing form variables to the form’s handler with an array is also more symmetrical, meaning that the procedure is nearly identical to the one required to pass values to a form handler expecting the POST method.

While the GET method appends form data at the end of the URL, the POST method sends data in the body of the HTTP request, separate from the URL. The POST method has these advantages over the GET method:

Regardless of the advantages of POST over GET, you must match your method to the method of the form you are emulating. Keep in mind that methods may also be combined in the same form. For example, forms with POST methods may also use form handlers that contain query strings.

To submit a form using the POST method with LIB_http, simply specify the POST protocol, as shown in Example 6-6.

Regardless of the number of data elements, the process is the same. Some form handlers, however, access the form elements as an array, so it’s always a good idea to match the order in which the data elements are defined in the HTML form.
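For readers without LIB_http at hand, the following sketch shows roughly what a POST submission looks like in plain PHP/CURL. The function names build_post_options() and submit_post_form() are hypothetical, and the handler address and field names are assumptions borrowed from the earlier GET examples. Note that http_build_query() keeps the array's insertion order, and that any query string on the form handler's address stays in the URL while the form data travels in the request body.

```php
<?php
// Hypothetical helper: collect the PHP/CURL options for a POST submission
function build_post_options($action, array $data_array, $ref = "")
{
    return array(
        CURLOPT_URL            => $action,   // form handler (may carry a query string)
        CURLOPT_POST           => true,      // use the POST method
        CURLOPT_POSTFIELDS     => http_build_query($data_array, '', '&'),
        CURLOPT_REFERER        => $ref,      // referer, as in the LIB_http examples
        CURLOPT_RETURNTRANSFER => true,      // return the response as a string
    );
}

// Hypothetical helper: send the form and return the response (false on failure)
function submit_post_form($action, array $data_array, $ref = "")
{
    $ch = curl_init();
    curl_setopt_array($ch, build_post_options($action, $data_array, $ref));
    return curl_exec($ch);
}

// Define the fields in the same order as they appear in the HTML form
$data_array = array();
$data_array['term'] = "hello";
$data_array['sort'] = "up";

// Assumed form handler whose address also contains a query string
$options = build_post_options("http://www.schrenk.com/search.php?page=2", $data_array);
echo $options[CURLOPT_POSTFIELDS] . "\n";    // the body that would be sent
```

Calling submit_post_form() with the same arguments would perform the actual request. Keeping the option-building step separate makes it easy to inspect what will be sent before the webbot contacts the server.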

There is one more type of form method, which is actually an extension of the POST method: the POST method with multipart encoding. When it is used, as shown in Example 6-7, HTML forms are capable of transferring complete files from a user’s computer to a web server.

Forms that facilitate file uploads commonly allow people to upload images to social networking websites and similar sites. While you can upload any type of file with a form that accepts file submissions, it is important to recognize that, for security reasons, the form handler may only allow specific types of files that are appropriate for the situation. Servers also place restrictions on file size.

If you want your webbot to upload a file to a form that accepts file submissions, you may use a script like the one in Example 6-8.

Example 6-8. A script that could upload a file to a form like the one in Example 6-7

$post = array("uploadedfile"=>"@".$full_path_name_of_file);
// reference file to be uploaded in an array

$ch = curl_init();                                        // initialize PHP/CURL
curl_setopt($ch, CURLOPT_URL, $form_action_URL);     // point at the form handler
curl_setopt($ch, CURLOPT_POST, true);  // indicate that a POST method is required
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);       // add post array to operation
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the response as a string
$response = curl_exec($ch);

Notice that in Example 6-8 there is no direct reference to the multipart encoding type in the webbot script. The POST method is always used when uploading files to servers and the full path name of the file to upload is inserted into the array of POST variables.

It’s worth repeating that when using PHP/CURL to upload files to form handlers, the full path name of the file to be uploaded must be stated and preceded with an @ symbol.
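One caveat for current PHP versions: the @ prefix was deprecated in PHP 5.5 in favor of the CURLFile class, and it no longer works in PHP 7.0 and later. A minimal sketch of the same upload on newer releases might look like the following. The helper names build_upload_fields() and upload_file() are hypothetical, and the form field name uploadedfile is carried over from Example 6-8.

```php
<?php
// Hypothetical helper: build the POST array for a file upload on PHP 5.5 or later
function build_upload_fields($field_name, $full_path_name_of_file)
{
    // realpath() returns the absolute path, or false if the file does not exist
    $path = realpath($full_path_name_of_file);
    if ($path === false) {
        throw new InvalidArgumentException("File not found: " . $full_path_name_of_file);
    }

    // A CURLFile object takes the place of the "@" . $path string
    return array($field_name => new CURLFile($path));
}

// Hypothetical helper: send the upload and return the response (false on failure)
function upload_file($form_action_URL, array $post)
{
    $ch = curl_init();                                   // initialize PHP/CURL
    curl_setopt($ch, CURLOPT_URL, $form_action_URL);     // point at the form handler
    curl_setopt($ch, CURLOPT_POST, true);                // POST method is required
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);         // array => multipart encoding
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return response as a string
    return curl_exec($ch);
}
```

A webbot would call build_upload_fields("uploadedfile", $full_path_name_of_file) and hand the result to upload_file() along with the form handler's address. Because CURLOPT_POSTFIELDS receives an array rather than a string, PHP/CURL applies multipart encoding on its own, which is why the script never names the encoding type.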

A submit button typically acts as the event trigger, which causes the form data to be sent to the form handler using the defined form method. While the submit button is the most common event trigger, it is not the only way to submit a form. It is very common for web developers to employ JavaScript to verify the contents of the form before it is submitted to the server. In fact, any JavaScript event like onClick or onMouseOut can submit a form, as can any other type of human-generated JavaScript event. Sometimes, JavaScript may also change the value of a form variable before the form is submitted. The use of JavaScript as an event trigger causes many difficulties for webbot designers, but these issues are remedied by the use of special tools, as you’ll soon see.



[22] The HTML value of any form element is only its starting or default value. The user may change the final value with JavaScript or by editing the form before it is sent to the form handler.

[23] In forms where no form method is defined, like the form shown in Example 6-3, the default form method is GET.