Setting PHP/CURL Options

The PHP/CURL session is configured with the curl_setopt() function. Each individual configuration option is set with a separate call to this function. The script in Example A-1 is unusual in its brevity. Usually there are many calls to curl_setopt(). Over 90 separate configuration options are available within PHP/CURL, making the interface very versatile.^[102] The average PHP/CURL user, however, uses only a small subset of the available options. The following sections describe the PHP/CURL options you are most likely to use. While these options are listed here in order of relative importance, you may declare them in any order. If the session is left open, the configuration may be reused as often as needed within the same session.
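The point about reusing an open session can be illustrated with a minimal sketch. The function name fetch_all() is hypothetical and the 30-second time-out is an arbitrary choice; the only thing that changes between downloads is the target URL.

```php
<?php
// Hypothetical helper: fetch several URLs through one configured session.
function fetch_all(array $urls): array
{
    $s = curl_init();                                // Open one session
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);   // Return results as strings
    curl_setopt($s, CURLOPT_TIMEOUT, 30);            // Arbitrary 30-second limit

    $pages = array();
    foreach ($urls as $url) {
        curl_setopt($s, CURLOPT_URL, $url);          // Only the target changes
        $pages[$url] = curl_exec($s);                // String on success, FALSE on failure
    }

    curl_close($s);                                  // Close the session when done
    return $pages;
}
```

Calling fetch_all() with an array of two or three URLs returns an array keyed by URL, with every download sharing the options set before the loop.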

CURLOPT_URL

Use the CURLOPT_URL option to define the target URL for your PHP/CURL session, as shown in Example A-2.

Example A-2. Defining the target URL

curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com/index.php");

You should use a fully formed URL describing the protocol, domain, and file in every PHP/CURL file request.

CURLOPT_RETURNTRANSFER

The CURLOPT_RETURNTRANSFER option must be set to TRUE, as in Example A-3, if you want the result to be returned in a string. If you don’t set this option to TRUE, PHP/CURL sends the result directly to standard output (the terminal or, when the script runs under a web server, the browser).

Example A-3. Telling PHP/CURL that you want the result to be returned in a string

curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);          // Return in string
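With this option set, curl_exec() returns the downloaded data as a string, or FALSE if the transfer fails, so the result is worth checking before use. The following is a minimal sketch; download_page() is a hypothetical name, and returning NULL on failure is just one possible convention.

```php
<?php
// Hypothetical helper: download a page into a string, or return NULL on failure.
function download_page(string $url): ?string
{
    $s = curl_init();
    curl_setopt($s, CURLOPT_URL, $url);
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);   // Return in string

    $result = curl_exec($s);                         // FALSE if the transfer failed
    if ($result === FALSE) {
        error_log("Download failed: " . curl_error($s));
        $result = NULL;
    }

    curl_close($s);
    return $result;
}
```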

CURLOPT_REFERER

The CURLOPT_REFERER option allows your webbot to spoof a hyper-reference that was clicked to initiate the request for the target file. Example A-4 tells the target server that someone clicked a link on http://www.a_domain.com/index.php to request the target web page.

Example A-4. Spoofing a hyper-reference

curl_setopt($s, CURLOPT_REFERER, "http://www.a_domain.com/index.php");

CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS

The CURLOPT_FOLLOWLOCATION option tells PHP/CURL that you want it to follow every page redirection it finds. It’s important to understand that PHP/CURL only honors header redirections and not redirections set with a refresh meta tag or with JavaScript, as shown in Example A-5.

Example A-5. Redirects that PHP/CURL can and cannot follow

<?php
# Example of redirection that PHP/CURL will follow
header("Location: http://www.schrenk.com");
?>

<!-- Examples of redirections that PHP/CURL will not follow-->
<meta http-equiv="Refresh" content="0;url=http://www.schrenk.com">
<script>document.location="http://www.schrenk.com"</script>
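Because PHP/CURL ignores refresh meta tags, a webbot that needs to follow them has to find the target URL in the downloaded page itself. The sketch below shows one way to do that with a regular expression. The function name is hypothetical, and the pattern assumes the http-equiv attribute appears before the content attribute, as it does in Example A-5; real pages may need a more forgiving parser.

```php
<?php
// Hypothetical helper: return the URL named in a refresh meta tag, or NULL.
// Assumes http-equiv appears before content, as in Example A-5.
function find_meta_refresh_url(string $html): ?string
{
    $pattern = '/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
             . 'content\s*=\s*["\']?\s*\d+\s*;\s*url\s*=\s*([^"\'>\s]+)/i';

    if (preg_match($pattern, $html, $match)) {
        return $match[1];                 // The URL captured after "url="
    }
    return NULL;                          // No refresh meta tag found
}
```

The returned URL can then be passed to CURLOPT_URL for a second request.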

Anytime you use CURLOPT_FOLLOWLOCATION, set CURLOPT_MAXREDIRS to the maximum number of redirections you care to follow. Limiting the number of redirections keeps your webbot out of infinite loops, where redirections point repeatedly to the same URL. My introduction to CURLOPT_MAXREDIRS came while trying to solve a problem brought to my attention by a network administrator, who initially thought that someone (using a webbot I had written) had launched a DoS attack on his server. In reality, the server misinterpreted the webbot’s header request as a hacking exploit and redirected the webbot to an error page. Then a bug on the error page caused it to repeatedly redirect the webbot to the error page, causing an infinite loop (and near-infinite bandwidth usage). The addition of CURLOPT_MAXREDIRS solved the problem, as demonstrated in Example A-6.

Example A-6. Using the CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS options

curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE); // Follow header redirections
curl_setopt($s, CURLOPT_MAXREDIRS, 4);         // Limit redirections to 4
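To see whether the limit was actually reached, you can inspect the session after the transfer. The following sketch assumes a hypothetical helper named fetch_with_redirect_limit(); it reports how many redirections were followed and whether PHP/CURL stopped because the limit was exceeded.

```php
<?php
// Hypothetical helper: fetch a URL, following at most $max_redirects
// header redirections, and report what happened along the way.
function fetch_with_redirect_limit(string $url, int $max_redirects = 4): array
{
    $s = curl_init();
    curl_setopt($s, CURLOPT_URL, $url);
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);       // Return in string
    curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE);       // Follow header redirections
    curl_setopt($s, CURLOPT_MAXREDIRS, $max_redirects);  // Limit redirections

    $body = curl_exec($s);                               // FALSE on failure
    $report = array(
        'body'           => $body,
        'redirect_count' => curl_getinfo($s, CURLINFO_REDIRECT_COUNT),
        'final_url'      => curl_getinfo($s, CURLINFO_EFFECTIVE_URL),
        // TRUE when the transfer stopped because the limit was exceeded
        'limit_exceeded' => (curl_errno($s) === CURLE_TOO_MANY_REDIRECTS),
    );

    curl_close($s);
    return $report;
}
```

A webbot could log the final_url and redirect_count values whenever limit_exceeded is TRUE, which makes a redirection loop like the one described above easy to spot.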

CURLOPT_USERAGENT

Use this option to define the name of your user agent, as shown in Example A-7. The user agent name is recorded in server access log files and is available to server-side scripts in the $_SERVER['HTTP_USER_AGENT'] variable.

Example A-7. Setting the user agent name

$agent_name = "test_webbot";
curl_setopt($s, CURLOPT_USERAGENT, $agent_name);

Many web servers examine the user agent name to determine what content to send to specific browsers. For example, a single website may serve different content to standard browsers and mobile devices, depending on the user agent it sees in this parameter.

The list of applicable user agent names constantly changes. For an updated list, simply perform an Internet search with search terms like “user agent names.”

CURLOPT_NOBODY and CURLOPT_HEADER

These options tell PHP/CURL to return either the web page’s header or body. By default, PHP/CURL will always return the body but not the header. This explains why setting CURLOPT_NOBODY to TRUE excludes the body and setting CURLOPT_HEADER to TRUE includes the header, as shown in Example A-8.

Example A-8. Using the CURLOPT_HEADER and CURLOPT_NOBODY options

curl_setopt($s, CURLOPT_HEADER, TRUE);        // Include the header
curl_setopt($s, CURLOPT_NOBODY, TRUE);        // Exclude the body
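If you set CURLOPT_HEADER to TRUE but leave the body in place, the header and body arrive in a single string. The sketch below shows one way to separate them. The helper name split_response() is hypothetical, and the stand-in response is invented so the example runs without a network connection; in a real session the header size would come from curl_getinfo($s, CURLINFO_HEADER_SIZE) after curl_exec().

```php
<?php
// Hypothetical helper: separate the header from the body in a response
// that was fetched with CURLOPT_HEADER set to TRUE.
function split_response(string $response, int $header_size): array
{
    return array(
        'header' => substr($response, 0, $header_size),
        'body'   => substr($response, $header_size),
    );
}

// Stand-in response (invented for illustration, not real server output)
$header   = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$response = $header . "<html>Hello</html>";

$parts = split_response($response, strlen($header));
echo $parts['body'];
```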

CURLOPT_TIMEOUT

If you don’t limit how long PHP/CURL waits for a response from a server, it may wait forever—especially if the file you’re fetching is on a busy server or if you’re trying to connect to a nonexistent or inactive IP address. (The latter happens frequently when a spider follows dead links on a website.) Setting a time-out value, as shown in Example A-9, causes PHP/CURL to end the session if the download takes longer than the time-out value (in seconds).

Example A-9. Setting a socket time-out value

curl_setopt($s, CURLOPT_TIMEOUT, 30);    // Don't wait longer than 30 seconds

CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR

One of the slickest features of PHP/CURL is the ability to manage cookies sent to and received from a website. Use the CURLOPT_COOKIEFILE option to define the file where previously stored cookies exist. At the end of the session, PHP/CURL writes new cookies to the file indicated by CURLOPT_COOKIEJAR. Example A-10 shows both options in use; I have never seen an application where these two options don’t reference the same file.

Example A-10. Telling PHP/CURL where to read and write cookies

curl_setopt($s, CURLOPT_COOKIEFILE, 'c:\bots\cookies.txt'); // Read cookie file
curl_setopt($s, CURLOPT_COOKIEJAR,  'c:\bots\cookies.txt'); // Write cookie file

When specifying the location of a cookie file, always use the complete location of the file and do not use relative addresses. More information about managing cookies is available in Chapter 21.
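One way to guarantee a complete file location on any platform is to build it from a directory that PHP reports as an absolute path. The sketch below assumes the system temporary directory is an acceptable place for the cookie file; the filename webbot_cookies.txt is hypothetical.

```php
<?php
// Assumption: the cookie file lives in the system temporary directory.
// Any absolute path the script can write to works the same way.
$cookie_file = sys_get_temp_dir() . DIRECTORY_SEPARATOR . "webbot_cookies.txt";

$s = curl_init();
$read_ok  = curl_setopt($s, CURLOPT_COOKIEFILE, $cookie_file);  // Read cookie file
$write_ok = curl_setopt($s, CURLOPT_COOKIEJAR,  $cookie_file);  // Write cookie file
```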

CURLOPT_HTTPHEADER

The CURLOPT_HTTPHEADER configuration allows a PHP/CURL session to send an outgoing header message to the server. The script in Example A-11 uses this option to tell the target server the MIME type it accepts, the content type it expects, and that the user agent is capable of decompressing compressed web responses.

Note that CURLOPT_HTTPHEADER expects to receive data in an array.

Example A-11. Configuring an outgoing header

$header_array   = array();                       // Start with an empty list
$header_array[] = "Mime-Version: 1.0";
$header_array[] = "Content-type: text/html; charset=iso-8859-1";
$header_array[] = "Accept-Encoding: compress, gzip";
curl_setopt($s, CURLOPT_HTTPHEADER, $header_array);

CURLOPT_SSL_VERIFYPEER

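You need this option only if the target website uses SSL encryption and the protocol in CURLOPT_URL is https:. It controls whether PHP/CURL verifies the certificate presented by the target server. Example A-12 turns that verification off.

Example A-12. Configuring PHP/CURL not to verify the server’s certificate

curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);    // Don't verify certificate

Depending on the version of PHP/CURL you use and the certificate authority files installed with it, this option may be required; without it, transfers from servers whose certificates PHP/CURL cannot validate will fail. Because turning verification off also removes the check that you are talking to the genuine server, leave the option at its default whenever the target’s certificate validates properly.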

CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH

As shown in Example A-13, you may use the CURLOPT_USERPWD option with a valid username and password to access websites that use basic authentication. Unlike a browser, which remembers your credentials, a webbot has to submit the username and password to every page accessed within the basic authentication realm.

Example A-13. Configuring PHP/CURL for basic authentication schemes

curl_setopt($s, CURLOPT_USERPWD, "username:password");
curl_setopt($s, CURLOPT_UNRESTRICTED_AUTH, TRUE);

If you use this option in conjunction with CURLOPT_FOLLOWLOCATION, you may also need the CURLOPT_UNRESTRICTED_AUTH option, which tells PHP/CURL to keep sending the username and password to the pages you’re redirected to, even when a redirection leads to a different hostname.

Exercise caution with using CURLOPT_USERPWD, as you can inadvertently send username and password information to the wrong server, where it may appear in access log files.

CURLOPT_POST and CURLOPT_POSTFIELDS

The CURLOPT_POST and CURLOPT_POSTFIELDS options configure PHP/CURL to emulate forms with the POST method. Since the default method is GET, you must first tell PHP/CURL to use the POST method. Then you must specify the POST data that you want to be sent to the target web server. Example A-14 shows both steps.

Example A-14. Configuring POST method transfers

curl_setopt($s, CURLOPT_POST, TRUE);             // Use POST method
$post_data = "var1=1&var2=2&var3=3";             // Define POST data values
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);

Notice that the POST data looks like a standard query string sent in a GET method. Incidentally, to send form information with the GET method, simply attach the query string to the target URL.
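When field values contain spaces or other special characters, writing the query string by hand becomes error prone. The sketch below builds the same string from an array with PHP's http_build_query() function, which URL-encodes each name and value. The field names come from Example A-14; the target URL is reused from Example A-2 purely for illustration.

```php
<?php
// Form fields to submit; the names and values mirror Example A-14
$fields = array("var1" => 1, "var2" => 2, "var3" => 3);

// http_build_query() URL-encodes each name and value and joins them with "&"
$post_data = http_build_query($fields, "", "&");

$s = curl_init();
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com/index.php");
curl_setopt($s, CURLOPT_POST, TRUE);                 // Use POST method
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);     // Attach the encoded data

// For the GET method, the same string is appended to the target URL instead
$get_url = "http://www.schrenk.com/index.php?" . $post_data;
```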

CURLOPT_VERBOSE

The CURLOPT_VERBOSE option controls the quantity of status messages created during a file transfer. You may find this helpful during debugging, but it is best to turn off this option during the production phase, because it produces many entries in your server log file. A typical succession of log messages for a single file download looks like Example A-15.

Example A-15. Typical messages from a verbose PHP/CURL session

* About to connect() to www.schrenk.com port 80
* Connected to www.schrenk.com (66.179.150.101) port 80
* Connection #0 left intact
* Closing connection #0

If you’re in verbose mode on a busy server, you’ll create very large log files. Example A-16 shows how to turn off verbose mode.

Example A-16. Turning off verbose mode to reduce the size of server log files

curl_setopt($s, CURLOPT_VERBOSE, FALSE);        // Minimal logs

CURLOPT_PORT

By default, PHP/CURL uses port 80 for all HTTP sessions, unless you are connecting to an SSL-encrypted server, in which case port 443 is used.^[103] These are the standard port numbers for HTTP and HTTPS protocols, respectively. If you’re connecting to a custom protocol or wish to connect to a non-web protocol, use CURLOPT_PORT to set the desired port number, as shown in Example A-17.

Example A-17. Using nonstandard communication ports

curl_setopt($s, CURLOPT_PORT, 234);            // Use port number 234

Note

Configuration settings must be capitalized, as shown in the previous examples, because the option names are predefined PHP constants. Your code will fail if you specify an option as curlopt_port instead of CURLOPT_PORT.
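A quick way to confirm the case sensitivity described in this note is to ask PHP whether each spelling is a defined constant. This sketch assumes the cURL extension is loaded.

```php
<?php
// Constant names are case sensitive, so only the capitalized form exists
var_dump(defined("CURLOPT_PORT"));   // TRUE when the cURL extension is loaded
var_dump(defined("curlopt_port"));   // FALSE: no constant has this name
```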

^[102] You can find a complete set of PHP/CURL options at http://www.php.net/manual/en/function.curl-setopt.php.

^[103] Well-known and standard port numbers are defined at http://www.iana.org/assignments/port-numbers.