Setting PHP/CURL Options

The PHP/CURL session is configured with the curl_setopt() function. Each individual configuration option is set with a separate call to this function. The script in Example A-1 is unusual in its brevity. Usually there are many calls to curl_setopt(). Over 90 separate configuration options are available within PHP/CURL, making the interface very versatile.[102] The average PHP/CURL user, however, uses only a small subset of the available options. The following sections describe the PHP/CURL options you are most likely to use. While these options are listed here in order of relative importance, you may declare them in any order. If the session is left open, the configuration may be reused as often as needed within the same session.

Use the CURLOPT_URL option to define the target URL for your PHP/CURL session, as shown in Example A-2.

You should use a fully formed URL describing the protocol, domain, and file in every PHP/CURL file request.
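
A minimal sketch of setting the target URL follows. The variable name $session and the example.com address are placeholders chosen for illustration, not values from the text.

```php
<?php
// Open a PHP/CURL session ($session is an illustrative variable name).
$session = curl_init();

// Hypothetical target: protocol, domain, and file are all spelled out.
$target = "http://www.example.com/index.php";
$url_was_set = curl_setopt($session, CURLOPT_URL, $target);
```

curl_setopt() returns TRUE when the option was accepted, so the return value can be checked before the request is made.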

The CURLOPT_RETURNTRANSFER option must be set to TRUE, as in Example A-3, if you want the result to be returned in a string. If you don’t set this option to TRUE, PHP/CURL echoes the result directly to standard output (the terminal or, for a web script, the browser).
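
One way to use this option is to wrap the download in a small function that hands the page back to the caller. The helper name fetch_page() is hypothetical and exists only for this sketch.

```php
<?php
// Hypothetical helper: download $url and return it as a string.
function fetch_page(string $url)
{
    $session = curl_init();
    curl_setopt($session, CURLOPT_URL, $url);
    // Return the download from curl_exec() instead of echoing it.
    curl_setopt($session, CURLOPT_RETURNTRANSFER, true);

    // curl_exec() now yields the page as a string, or false on failure.
    return curl_exec($session);
}
```

A call such as $page = fetch_page("http://www.example.com/index.php"); would then leave the downloaded text in $page for parsing.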

The CURLOPT_REFERER option allows your webbot to spoof the referer, the page containing the hyper-reference that was clicked to initiate the request for the target file. Example A-4 tells the target server that someone clicked a link on http://www.a_domain.com/index.php to request the target web page.
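
As a sketch, the referer described above could be declared like this. The target address is a placeholder; the referer address is the one used in the text.

```php
<?php
$session = curl_init();
curl_setopt($session, CURLOPT_URL, "http://www.example.com/target.php"); // hypothetical target

// Claim that the request began with a click on this page.
$referer_was_set = curl_setopt(
    $session,
    CURLOPT_REFERER,
    "http://www.a_domain.com/index.php"
);
```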

The CURLOPT_FOLLOWLOCATION option, shown in Example A-5, tells PHP/CURL that you want it to follow every page redirection it finds. It’s important to understand that PHP/CURL honors only header redirections, not redirections set with a refresh meta tag or with JavaScript.

Anytime you use CURLOPT_FOLLOWLOCATION, set CURLOPT_MAXREDIRS to the maximum number of redirections you care to follow. Limiting the number of redirections keeps your webbot out of infinite loops, where redirections point repeatedly to the same URL. My introduction to CURLOPT_MAXREDIRS came while trying to solve a problem brought to my attention by a network administrator, who initially thought that someone (using a webbot I had written) had launched a DoS attack on his server. In reality, the server misinterpreted the webbot’s header request as a hacking exploit and redirected the webbot to an error page. Then a bug on the error page caused it to repeatedly redirect the webbot to the error page, causing an infinite loop (and near-infinite bandwidth usage). The addition of CURLOPT_MAXREDIRS solved the problem, as demonstrated in Example A-6.
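
The two redirection options are usually set together. In this sketch, the limit of four redirections and the target address are arbitrary illustrative choices.

```php
<?php
$session = curl_init();
curl_setopt($session, CURLOPT_URL, "http://www.example.com/"); // hypothetical target

// Follow header redirections...
$follow_was_set = curl_setopt($session, CURLOPT_FOLLOWLOCATION, true);
// ...but give up after four of them (an arbitrary limit for this sketch).
$limit_was_set = curl_setopt($session, CURLOPT_MAXREDIRS, 4);
```

When the limit is exceeded, the transfer fails rather than continuing to chase the loop.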

Use the CURLOPT_USERAGENT option to define the name of your user agent, as shown in Example A-7. The user agent name is recorded in server access log files and is available to server-side scripts in the $_SERVER['HTTP_USER_AGENT'] variable.

Many web servers examine the user agent name to determine what content to send to specific browsers. For example, a single website may serve different content to standard browsers and mobile devices, depending on the user agent it sees in this parameter.

The list of applicable user agent names constantly changes. For an updated list, simply perform an Internet search with search terms like “user agent names.”
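
A sketch of declaring a user agent name follows. The name "test_webbot" is made up for illustration; a real webbot would use whatever name suits its purpose.

```php
<?php
$session = curl_init();

// "test_webbot" is a made-up agent name used only in this sketch.
$agent_was_set = curl_setopt($session, CURLOPT_USERAGENT, "test_webbot");
```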

The CURLOPT_HEADER and CURLOPT_NOBODY options tell PHP/CURL whether to return the web page’s header, its body, or both. By default, PHP/CURL returns the body but not the header. This explains why setting CURLOPT_NOBODY to TRUE excludes the body and setting CURLOPT_HEADER to TRUE includes the header, as shown in Example A-8.
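
For instance, a webbot that only wants to inspect a page’s header might combine the two settings as sketched below (variable names are illustrative).

```php
<?php
$session = curl_init();

// Include the header in the returned data...
$header_was_set = curl_setopt($session, CURLOPT_HEADER, true);
// ...and leave the body out.
$nobody_was_set = curl_setopt($session, CURLOPT_NOBODY, true);
```

With CURLOPT_NOBODY set to TRUE, PHP/CURL issues a HEAD request, so the server does not send the body at all.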

If you don’t limit how long PHP/CURL waits for a response from a server, it may wait forever, especially if the file you’re fetching is on a busy server or if you’re trying to connect to a nonexistent or inactive IP address. (The latter happens frequently when a spider follows dead links on a website.) Setting a time-out value with CURLOPT_TIMEOUT, as shown in Example A-9, causes PHP/CURL to end the session if the download takes longer than the time-out value (in seconds).
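
A sketch of a time-out setting, assuming the option in question is CURLOPT_TIMEOUT; the 25-second value is an arbitrary choice for illustration.

```php
<?php
$session = curl_init();

// Abandon the transfer if it takes longer than 25 seconds (arbitrary value).
$timeout_was_set = curl_setopt($session, CURLOPT_TIMEOUT, 25);
```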

One of the slickest features of PHP/CURL is the ability to manage cookies sent to and received from a website. Use the CURLOPT_COOKIEFILE option to define the file where previously stored cookies exist. At the end of the session, PHP/CURL writes new cookies to the file indicated by CURLOPT_COOKIEJAR. Example A-10 shows both options in use; I have never seen an application where these two options don’t reference the same file.

When specifying the location of a cookie file, always use the complete location of the file and do not use relative addresses. More information about managing cookies is available in Chapter 21.
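
The sketch below points both cookie options at the same file, using a complete path as advised above. The file name webbot_cookies.txt and its location in the system temporary directory are assumptions made for illustration.

```php
<?php
$session = curl_init();

// Build a complete (non-relative) path; the file name is hypothetical.
$cookie_file = sys_get_temp_dir() . DIRECTORY_SEPARATOR . "webbot_cookies.txt";

// Read previously stored cookies from this file...
$read_was_set = curl_setopt($session, CURLOPT_COOKIEFILE, $cookie_file);
// ...and write new cookies back to it when the session closes.
$write_was_set = curl_setopt($session, CURLOPT_COOKIEJAR, $cookie_file);
```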

The CURLOPT_HTTPHEADER configuration allows a PHP/CURL session to send an outgoing header message to the server. The script in Example A-11 uses this option to tell the target server the MIME type it accepts, the content type it expects, and that the user agent is capable of decompressing compressed web responses.

Note that CURLOPT_HTTPHEADER expects to receive data in an array.
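
As a sketch, outgoing headers can be collected in an array with one complete header line per element. The particular headers and values here are illustrative, not a reproduction of the example the text refers to.

```php
<?php
$session = curl_init();

// One "Name: value" line per array element (illustrative values).
$header_array = array(
    "Accept: text/html",
    "Accept-Encoding: gzip",
);
$headers_were_set = curl_setopt($session, CURLOPT_HTTPHEADER, $header_array);
```

Sending Accept-Encoding by hand only advertises that compressed responses are welcome. To have PHP/CURL decompress them automatically, the CURLOPT_ENCODING option is the usual route.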

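You need to use the CURLOPT_SSL_VERIFYPEER option only if the target website uses SSL encryption and the protocol in CURLOPT_URL is https:. An example is shown in Example A-12.

Depending on the version of PHP/CURL you use, you may need to set this option to FALSE; if you don’t, PHP/CURL attempts to verify the server’s certificate against its list of trusted certificate authorities, and the request fails when that check cannot be completed. Keep in mind that setting the option to FALSE also stops PHP/CURL from detecting an impostor server.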
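
The following sketch assumes the option under discussion is CURLOPT_SSL_VERIFYPEER and that it is being switched off. The https address is a placeholder.

```php
<?php
$session = curl_init();
curl_setopt($session, CURLOPT_URL, "https://www.example.com/"); // hypothetical target

// Assumption: skip verification of the server's certificate.
// Leave this at its default (verification on) unless the request fails without it.
$verify_was_set = curl_setopt($session, CURLOPT_SSL_VERIFYPEER, false);
```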

As shown in Example A-13, you may use the CURLOPT_USERPWD option with a valid username and password to access websites that use basic authentication. Unlike a browser, which resends the credentials for you, PHP/CURL requires you to submit the username and password with every page request within the basic authentication realm.

If you use this option in conjunction with CURLOPT_FOLLOWLOCATION, you may also need the CURLOPT_UNRESTRICTED_AUTH option, which keeps sending the username and password to the pages you’re redirected to, even when a redirection changes the hostname.

Exercise caution when using CURLOPT_USERPWD, as you can inadvertently send username and password information to the wrong server, where it may appear in access log files.
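
A sketch of basic authentication combined with redirection follows. The credentials and address are placeholders; CURLOPT_USERPWD expects the username and password joined by a colon.

```php
<?php
$session = curl_init();
curl_setopt($session, CURLOPT_URL, "http://www.example.com/private/"); // hypothetical target

// Placeholder credentials in "username:password" form.
$auth_was_set = curl_setopt($session, CURLOPT_USERPWD, "username:password");

// Follow redirections and keep sending the credentials while doing so.
curl_setopt($session, CURLOPT_FOLLOWLOCATION, true);
$unrestricted_was_set = curl_setopt($session, CURLOPT_UNRESTRICTED_AUTH, true);
```

Given the caution above, it is worth enabling CURLOPT_UNRESTRICTED_AUTH only when you trust every server a redirection might lead to.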

The CURLOPT_POST and CURLOPT_POSTFIELDS options configure PHP/CURL to emulate forms with the POST method. Since the default method is GET, you must first tell PHP/CURL to use the POST method. Then you must specify the POST data that you want to be sent to the target web server. An example is shown in Example A-14.

Notice that the POST data looks like a standard query string sent in a GET method. Incidentally, to send form information with the GET method, simply attach the query string to the target URL.
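
The sketch below builds the POST data with PHP’s http_build_query() function, which produces the query-string format just described. The field names, values, and target address are invented for illustration.

```php
<?php
// Hypothetical form fields.
$fields = array("user" => "webbot", "query" => "price list");

// Encode the fields as a query string: "user=webbot&query=price+list"
$post_data = http_build_query($fields, '', '&');

$session = curl_init();
curl_setopt($session, CURLOPT_URL, "http://www.example.com/search.php"); // hypothetical target

// Switch from the default GET method to POST, then supply the data.
$post_was_set = curl_setopt($session, CURLOPT_POST, true);
$fields_were_set = curl_setopt($session, CURLOPT_POSTFIELDS, $post_data);
```

To send the same fields with the GET method instead, the string in $post_data could be appended to the target URL after a question mark.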

The CURLOPT_VERBOSE option controls the quantity of status messages created during a file transfer. You may find this helpful during debugging, but it is best to turn off this option during the production phase, because it produces many entries in your server log file. A typical succession of log messages for a single file download looks like Example A-15.

If you’re in verbose mode on a busy server, you’ll create very large log files. Example A-16 shows how to turn off verbose mode.
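
One possible arrangement is to drive verbose mode from a single debugging flag, so that it is easy to switch off before deployment. The $debugging variable is a hypothetical name for this sketch.

```php
<?php
$session = curl_init();

// Hypothetical flag: set to true while debugging, false in production.
$debugging = false;
$verbose_was_set = curl_setopt($session, CURLOPT_VERBOSE, $debugging);
```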

By default, PHP/CURL uses port 80 for all HTTP sessions, unless you are connecting to an SSL-encrypted server, in which case port 443 is used.[103] These are the standard port numbers for HTTP and HTTPS protocols, respectively. If the server listens on a nonstandard port, or if you wish to connect to a non-web protocol, use CURLOPT_PORT to set the desired port number, as shown in Example A-17.
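
A sketch of selecting a nonstandard port; port 8080 and the target address are arbitrary examples.

```php
<?php
$session = curl_init();
curl_setopt($session, CURLOPT_URL, "http://www.example.com/"); // hypothetical target

// Connect on port 8080 instead of the default port 80 (arbitrary example).
$port_was_set = curl_setopt($session, CURLOPT_PORT, 8080);
```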



[102] You can find a complete set of PHP/CURL options at http://www.php.net/manual/en/function.curl-setopt.php.

[103] Well-known and standard port numbers are defined at http://www.iana.org/assignments/port-numbers.