The PHP/CURL session is configured with the curl_setopt() function. Each individual configuration option is set with a separate call to this function. The script in Example A-1 is unusually brief; most scripts make many calls to curl_setopt(). Over 90 separate configuration options are available within PHP/CURL, making the interface very versatile.[102] The average PHP/CURL user, however, uses only a small subset of the available options. The following sections describe the PHP/CURL options you are most likely to use. While these options are listed here in order of relative importance, you may declare them in any order. If the session is left open, the configuration may be reused as often as needed within the same session.
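As a rough illustration of a session with several curl_setopt() calls, here is a minimal sketch. It assumes the session handle is named $s, as in the examples in this appendix; the 30-second time-out and the second page name are hypothetical. Nothing is downloaded until curl_exec($s) is called.

```php
<?php
// Open a PHP/CURL session; $s is the handle every option is attached to.
$s = curl_init();

// Each option gets its own curl_setopt() call; the order does not matter.
$ok_return  = curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$ok_timeout = curl_setopt($s, CURLOPT_TIMEOUT, 30);   // Hypothetical value

// While the session stays open, the same configuration can be reused
// by changing only the target URL. The second page name is hypothetical.
$targets = array(
    "http://www.schrenk.com/index.php",
    "http://www.schrenk.com/another_page.php",
);
$ok_urls = array();
foreach ($targets as $target) {
    $ok_urls[] = curl_setopt($s, CURLOPT_URL, $target);
    // $page = curl_exec($s);   // Uncomment to download $target
}
```

curl_setopt() returns TRUE when an option is accepted, so the $ok_ variables can be checked while debugging.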
Use the CURLOPT_URL option to define the target URL for your PHP/CURL session, as shown in Example A-2.
Example A-2. Defining the target URL
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com/index.php");
You should use a fully formed URL describing the protocol, domain, and file in every PHP/CURL file request.
The CURLOPT_RETURNTRANSFER option must be set to TRUE, as in Example A-3, if you want the result to be returned in a string. If you don’t set this option to TRUE, PHP/CURL echoes the result to the terminal.
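A minimal sketch of this setting, assuming a session handle named $s created with curl_init() and the target URL used earlier in this appendix:

```php
<?php
$s = curl_init();
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com/index.php");

// Ask PHP/CURL to hand the result back as a string instead of echoing it.
$ok = curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);

// $page = curl_exec($s);   // $page would hold the file contents, or FALSE on failure
```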
The CURLOPT_REFERER option allows your webbot to spoof the hyper-reference that was clicked to initiate the request for the target file. Example A-4 tells the target server that someone clicked a link on http://www.a_domain.com/index.php to request the target web page.
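A minimal sketch of setting the referer described above, assuming a session handle named $s:

```php
<?php
$s = curl_init();

// Report this page as the one containing the link that was "clicked".
$ok = curl_setopt($s, CURLOPT_REFERER, "http://www.a_domain.com/index.php");
```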
The CURLOPT_FOLLOWLOCATION option tells PHP/CURL that you want it to follow every page redirection it finds. It’s important to understand that PHP/CURL only honors header redirections and not redirections set with a refresh meta tag or with JavaScript, as shown in Example A-5.
Example A-5. Redirects that PHP/CURL can and cannot follow
<?php
# Example of redirection that PHP/CURL will follow
header("Location: http://www.schrenk.com");
?>
<!-- Examples of redirections that PHP/CURL will not follow-->
<meta http-equiv="Refresh" content="0;url=http://www.schrenk.com">
<script>document.location="http://www.schrenk.com"</script>
Anytime you use CURLOPT_FOLLOWLOCATION, set CURLOPT_MAXREDIRS to the maximum number of redirections you care to follow. Limiting the number of redirections keeps your webbot out of infinite loops, where redirections point repeatedly to the same URL. My introduction to CURLOPT_MAXREDIRS came while trying to solve a problem brought to my attention by a network administrator, who initially thought that someone (using a webbot I had written) had launched a DoS attack on his server. In reality, the server misinterpreted the webbot’s header request as a hacking exploit and redirected the webbot to an error page. Then a bug on the error page caused it to repeatedly redirect the webbot to the error page, causing an infinite loop (and near-infinite bandwidth usage). The addition of CURLOPT_MAXREDIRS solved the problem, as demonstrated in Example A-6.
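A minimal sketch of pairing the two options, assuming a session handle named $s; the limit of four redirections is an arbitrary illustrative value, not a recommendation from the text:

```php
<?php
$s = curl_init();

// Follow header ("Location:") redirections...
$ok_follow = curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE);

// ...but give up after four of them (hypothetical limit) to avoid redirect loops.
$ok_max = curl_setopt($s, CURLOPT_MAXREDIRS, 4);
```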
Use the CURLOPT_USERAGENT option to define the name of your user agent, as shown in Example A-7. The user agent name is recorded in server access log files and is available to server-side scripts in the $_SERVER['HTTP_USER_AGENT'] variable.
Example A-7. Setting the user agent name
$agent_name = "test_webbot";
curl_setopt($s, CURLOPT_USERAGENT, $agent_name);
Many web servers examine the user agent name to determine what content to send to specific browsers. For example, a single website may serve different content to standard browsers and mobile devices, depending on the user agent it sees in this parameter.
The list of applicable user agent names constantly changes. For an updated list, simply perform an Internet search with search terms like “user agent names.”
The CURLOPT_HEADER and CURLOPT_NOBODY options tell PHP/CURL to return either the web page’s header or body. By default, PHP/CURL will always return the body but not the header. This explains why setting CURLOPT_NOBODY to TRUE excludes the body and setting CURLOPT_HEADER to TRUE includes the header, as shown in Example A-8.
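A minimal sketch of requesting the header without the body, assuming a session handle named $s:

```php
<?php
$s = curl_init();

$ok_header = curl_setopt($s, CURLOPT_HEADER, TRUE);   // Include the header in the result
$ok_nobody = curl_setopt($s, CURLOPT_NOBODY, TRUE);   // Exclude the body from the result
```

Setting only CURLOPT_HEADER to TRUE, and leaving CURLOPT_NOBODY at its default, returns the header followed by the body.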
If you don’t limit how long PHP/CURL waits for a response from a server, it may wait forever, especially if the file you’re fetching is on a busy server or if you’re trying to connect to a nonexistent or inactive IP address. (The latter happens frequently when a spider follows dead links on a website.) Setting a time-out value with CURLOPT_TIMEOUT, as shown in Example A-9, causes PHP/CURL to end the session if the download takes longer than the time-out value (in seconds).
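A minimal sketch of setting a time-out with the CURLOPT_TIMEOUT option, assuming a session handle named $s; the 30-second value is an arbitrary illustrative choice:

```php
<?php
$s = curl_init();

// Abandon the transfer if it takes longer than 30 seconds (hypothetical value).
$ok = curl_setopt($s, CURLOPT_TIMEOUT, 30);
```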
One of the slickest features of PHP/CURL is the ability to manage cookies sent to and received from a website. Use the CURLOPT_COOKIEFILE option to define the file where previously stored cookies exist. At the end of the session, PHP/CURL writes new cookies to the file indicated by CURLOPT_COOKIEJAR. Example A-10 shows both options; I have never seen an application where these two options don’t reference the same file.
Example A-10. Telling PHP/CURL where to read and write cookies
curl_setopt($s, CURLOPT_COOKIEFILE, 'c:\bots\cookies.txt'); // Read cookie file
curl_setopt($s, CURLOPT_COOKIEJAR, 'c:\bots\cookies.txt'); // Write cookie file
When specifying the location of a cookie file, always use the complete location of the file and do not use relative addresses. More information about managing cookies is available in Chapter 21.
The CURLOPT_HTTPHEADER configuration allows a PHP/CURL session to send an outgoing header message to the server. The script in Example A-11 uses this option to tell the target server the MIME type it accepts, the content type it expects, and that the user agent is capable of decompressing compressed web responses.
Note that CURLOPT_HTTPHEADER expects to receive data in an array.
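A minimal sketch of passing outgoing headers as an array, assuming a session handle named $s. The variable name $header_array and the specific header lines are illustrative assumptions, not taken from the text:

```php
<?php
$s = curl_init();

// One array element per outgoing header line (values are illustrative).
$header_array = array(
    "Accept: text/html",                        // MIME type the webbot accepts
    "Content-Type: text/html; charset=utf-8",   // Content type
    "Accept-Encoding: gzip",                    // Webbot can handle compressed responses
);
$ok = curl_setopt($s, CURLOPT_HTTPHEADER, $header_array);
```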
You need to use the CURLOPT_SSL_VERIFYPEER option only if the target website uses SSL encryption and the protocol in CURLOPT_URL is https:. An example is shown in Example A-12.
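Example A-12. Configuring PHP/CURL not to verify the server’s certificate
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE); // Do not verify the server's certificate
Depending on the version of PHP/CURL you use, this option may be required. Setting it to FALSE tells PHP/CURL not to verify the target server’s SSL certificate, which lets the session proceed when your installation has no means of verifying certificates. Because an unverified server could be an impostor, leave verification on whenever your installation supports it.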
As shown in Example A-13, you may use the CURLOPT_USERPWD option with a valid username and password to access websites that use basic authentication. In contrast to when using a browser, you will have to submit the username and password to every page accessed within the basic authentication realm.
Example A-13. Configuring PHP/CURL for basic authentication schemes
curl_setopt($s, CURLOPT_USERPWD, "username:password");
curl_setopt($s, CURLOPT_UNRESTRICTED_AUTH, TRUE);
If you use this option in conjunction with CURLOPT_FOLLOWLOCATION, you should also use the CURLOPT_UNRESTRICTED_AUTH option, which will ensure that the username and password are sent to all pages you’re redirected to, provided they are part of the same realm.
Exercise caution when using CURLOPT_USERPWD, as you can inadvertently send username and password information to the wrong server, where it may appear in access log files.
The CURLOPT_POST and CURLOPT_POSTFIELDS options configure PHP/CURL to emulate forms with the POST method. Since the default method is GET, you must first tell PHP/CURL to use the POST method. Then you must specify the POST data that you want to be sent to the target web server. An example is shown in Example A-14.
Example A-14. Configuring POST method transfers
curl_setopt($s, CURLOPT_POST, TRUE); // Use POST method
$post_data = "var1=1&var2=2&var3=3"; // Define POST data values
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);
Notice that the POST data looks like a standard query string sent in a GET method. Incidentally, to send form information with the GET method, simply attach the query string to the target URL.
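A minimal sketch of the GET approach, assuming a session handle named $s and the same three form values. It uses PHP's http_build_query() function to assemble the query string, which is one convenient way to do it rather than something the text prescribes:

```php
<?php
$s = curl_init();

// Form values to send; http_build_query() joins and URL-encodes them.
$form_data = array("var1" => 1, "var2" => 2, "var3" => 3);
$query_string = http_build_query($form_data, "", "&");

// Attach the query string to the target URL; GET is the default method.
$target = "http://www.schrenk.com/index.php?" . $query_string;
$ok = curl_setopt($s, CURLOPT_URL, $target);
```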
The CURLOPT_VERBOSE option controls the quantity of status messages created during a file transfer. You may find this helpful during debugging, but it is best to turn off this option during the production phase, because it produces many entries in your server log file. A typical succession of log messages for a single file download looks like Example A-15.
Example A-15. Typical messages from a verbose PHP/CURL session
* About to connect() to www.schrenk.com port 80
* Connected to www.schrenk.com (66.179.150.101) port 80
* Connection #0 left intact
* Closing connection #0
If you’re in verbose mode on a busy server, you’ll create very large log files. Example A-16 shows how to turn off verbose mode.
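A minimal sketch of turning verbose mode off, assuming a session handle named $s:

```php
<?php
$s = curl_init();

// FALSE suppresses the status messages; use TRUE while debugging.
$ok = curl_setopt($s, CURLOPT_VERBOSE, FALSE);
```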
By default, PHP/CURL uses port 80 for all HTTP sessions, unless you are connecting to an SSL-encrypted server, in which case port 443 is used.[103] These are the standard port numbers for the HTTP and HTTPS protocols, respectively. If you’re connecting to a server on a nonstandard port or wish to connect to a non-web protocol, use CURLOPT_PORT to set the desired port number, as shown in Example A-17.
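A minimal sketch of setting a port number, assuming a session handle named $s and a hypothetical server that listens on port 8080:

```php
<?php
$s = curl_init();
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com/index.php");

// Connect on port 8080 (hypothetical) instead of the default port 80.
$ok = curl_setopt($s, CURLOPT_PORT, 8080);
```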
[102] You can find a complete set of PHP/CURL options at http://www.php.net/manual/en/function.curl-setopt.php.
[103] Well-known and standard port numbers are defined at http://www.iana.org/assignments/port-numbers.