While PHP is capable when it comes to simple file downloads, most real-life applications require additional functionality to handle advanced issues such as form submission, authentication, redirection, and so on. These tasks are difficult to accomplish with PHP’s built-in functions alone. Fortunately, every PHP install should include a library called PHP/CURL, which takes care of these advanced topics. Most of this book’s examples rely on PHP/CURL to download files.
The open source cURL project is the product of Swedish developer Daniel Stenberg and a team of developers. The cURL library is available for use with nearly any computer language you can think of. When cURL is used with PHP, it’s known as PHP/CURL.
The name cURL is either a blend of the words client and URL or an acronym for the words client URL Request Library—you decide. cURL does everything that PHP’s built-in networking functions do and a lot more. Appendix A expands on PHP/CURL’s features, but here’s a quick overview of the things PHP/CURL can do for you, a webbot developer.
Unlike the built-in PHP network functions, PHP/CURL supports multiple transfer protocols, including FTP, FTPS, HTTP, HTTPS, Gopher, Telnet, and LDAP. Of these protocols, the most important is probably HTTPS, which allows webbots to download from encrypted websites that employ the Secure Sockets Layer (SSL) protocol.
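As a rough illustration of an HTTPS download, the sketch below collects the PHP/CURL options a webbot might use. The function name https_download_options() and the URL are assumptions made for this example; they are not part of PHP/CURL or of this book's libraries.

```php
<?php
// Hypothetical helper (not part of PHP/CURL): build options for an HTTPS download.
function https_download_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the page as a string instead of printing it
        CURLOPT_SSL_VERIFYPEER => true, // verify the server's certificate
        CURLOPT_SSL_VERIFYHOST => 2,    // verify that the certificate matches the host name
    ];
}

// Typical use (performs a real network request, so it is left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, https_download_options('https://www.example.com/'));
// $page = curl_exec($ch);   // string on success, false on failure
```

Apart from the https:// scheme in the URL, the request looks the same as a plain HTTP download; the encryption is negotiated by the library.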
PHP/CURL provides easy ways for a webbot to emulate browser form submission to a server. PHP/CURL supports all of the standard methods, or form submission protocols, as you’ll learn in Chapter 6.
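To make form emulation more concrete, here is a minimal sketch of a POST submission. The helper name form_post_options(), the form address, and the field names are hypothetical; Chapter 6 covers the subject properly.

```php
<?php
// Hypothetical helper: build PHP/CURL options that submit form fields with POST.
function form_post_options(string $action_url, array $fields): array
{
    return [
        CURLOPT_URL            => $action_url,
        CURLOPT_POST           => true,                       // use the POST method
        CURLOPT_POSTFIELDS     => http_build_query($fields),  // URL-encode the name/value pairs
        CURLOPT_RETURNTRANSFER => true,                       // return the response as a string
    ];
}

// Typical use (network request, left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, form_post_options(
//     'http://www.example.com/search.php',      // assumed form handler
//     ['term' => 'web bots', 'page' => 2]       // assumed field names
// ));
// $response = curl_exec($ch);
```

Passing the fields through http_build_query() sends them URL-encoded, which is how a browser submits a simple form by default.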
PHP/CURL makes it easy to write webbots that enter and use password-protected websites that rely on basic authentication. You’ve encountered basic authentication if you’ve seen the familiar gray box, shown in Figure 3-4, asking for your username and password.
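A minimal sketch of basic authentication might look like the following. The helper name basic_auth_options(), the URL, and the credentials are placeholders invented for this example.

```php
<?php
// Hypothetical helper: build PHP/CURL options for a page protected by basic authentication.
function basic_auth_options(string $url, string $username, string $password): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPAUTH       => CURLAUTH_BASIC,             // use the basic authentication scheme
        CURLOPT_USERPWD        => $username . ':' . $password, // credentials in "user:password" form
        CURLOPT_RETURNTRANSFER => true,                       // return the page as a string
    ];
}

// Typical use (network request, left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, basic_auth_options('http://www.example.com/private/', 'webbot', 'secret'));
// $page = curl_exec($ch);
```

Keep in mind that basic authentication sends the credentials with every request, so it is best combined with HTTPS.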
Without PHP/CURL, it is difficult for webbots to read and write cookies, those small bits of data that websites use to create session variables that track your movement. Websites also use cookies to manage shopping carts and authenticate users. PHP/CURL makes it easy for your webbot to interpret the cookies that webservers send it; it also simplifies the process of showing webservers all the cookies your webbot has written. Chapter 20 and Chapter 21 have much more to say on the subject of webbots and cookies.
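As a rough sketch of cookie handling, the options below tell PHP/CURL to keep cookies in a file between requests. The helper name cookie_options() and the cookie file path are assumptions made for illustration.

```php
<?php
// Hypothetical helper: build PHP/CURL options that read and write cookies in one file.
function cookie_options(string $url, string $cookie_file): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_COOKIEFILE     => $cookie_file, // send cookies previously stored in this file
        CURLOPT_COOKIEJAR      => $cookie_file, // store received cookies here when the handle is closed
        CURLOPT_RETURNTRANSFER => true,         // return the page as a string
    ];
}

// Typical use (network request, left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, cookie_options('http://www.example.com/cart.php', '/tmp/webbot_cookies.txt'));
// $page = curl_exec($ch);
```

Using the same file for both options lets a webbot carry a session from one page request to the next, much as a browser does.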
Redirection occurs when a web browser looks for a file in one place, but the server tells it that the file has moved and that it should download it from another location. For example, the website www.company.com may use redirection to force browsers to go to www.company.com/spring_sale when a seasonal promotion is in place. Browsers handle redirections automatically, and PHP/CURL allows webbots to have the same functionality.
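The redirection behavior described above could be enabled with options along these lines. The helper name redirect_options() and the default limit of five redirects are arbitrary choices for this sketch.

```php
<?php
// Hypothetical helper: build PHP/CURL options that follow server redirects.
function redirect_options(string $url, int $max_redirects = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_FOLLOWLOCATION => true,           // follow Location: headers sent by the server
        CURLOPT_MAXREDIRS      => $max_redirects, // stop after this many redirects to avoid loops
        CURLOPT_RETURNTRANSFER => true,           // return the final page as a string
    ];
}

// Typical use (network request, left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, redirect_options('http://www.company.com/'));
// $page      = curl_exec($ch);
// $final_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // where the webbot actually ended up
```

Limiting the number of redirects protects the webbot from servers that redirect in circles.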
Every time a webserver receives a file request, it records the request in a log file called an access log. This log stores the time of access, the IP address of the requester, and the agent name, which identifies the type of program that requested the file. Generally, agent names identify the browser that the web surfer was using to view the website.
Some agent names that a server log file may record are shown in Example 3-4. The first three names are browsers; the last is the Google spider.
Example 3-4. Agent names as seen in a file access log[12]
Mozilla/5.0 (Windows NT 6.1;) Gecko/20100921 Firefox/4.0b7pre
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1;)
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.25 Chrome/12.0.706.0
Googlebot/2.1 (+http://www.google.com/bot.html)
A webbot using PHP/CURL can assume any appropriate (or inappropriate) agent name. For example, sometimes it is advantageous to identify your webbots, as Google does. Other times, it is better to make your webbot look like a browser. If you write webbots that use the LIB_http library (described later), your webbot’s agent name will be Test Webbot. If you download a file from a webserver with PHP’s fopen() or file() functions, your agent name will be the version of PHP installed on your computer.
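Setting an agent name takes a single PHP/CURL option, as this minimal sketch suggests. The helper name agent_options() is hypothetical, and the agent strings in the usage comments are only examples.

```php
<?php
// Hypothetical helper: build PHP/CURL options that send a chosen agent name.
function agent_options(string $url, string $agent_name): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $agent_name, // value of the User-Agent request header
        CURLOPT_RETURNTRANSFER => true,        // return the page as a string
    ];
}

// Typical use (network request, left commented out):
// $ch = curl_init();
// // Identify the webbot openly ...
// curl_setopt_array($ch, agent_options('http://www.example.com/', 'Test Webbot'));
// // ... or present it as a browser:
// // curl_setopt_array($ch, agent_options('http://www.example.com/',
// //     'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1;)'));
// $page = curl_exec($ch);
```

Whatever string you supply is what the webserver writes to its access log for that request.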
PHP/CURL allows webbot developers to change the referer, which tells the server which page contained the link the web surfer clicked. Sometimes webservers use the referer to verify that file requests are coming from the correct place. For example, a website might enforce a rule that prevents downloading of images unless the referring web page is also on the same webserver. This rule prevents bandwidth stealing, the practice of building web pages that display images hosted on someone else’s server. PHP/CURL allows a webbot to set the referer to an arbitrary value.
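The following sketch shows how a webbot might declare a referer when requesting an image. The helper name referer_options() and both URLs are invented for this example.

```php
<?php
// Hypothetical helper: build PHP/CURL options that send a chosen referer.
function referer_options(string $url, string $referer): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_REFERER        => $referer, // value of the Referer request header
        CURLOPT_RETURNTRANSFER => true,     // return the file contents as a string
    ];
}

// Typical use (network request, left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, referer_options(
//     'http://www.example.com/images/logo.gif', // assumed image address
//     'http://www.example.com/index.html'       // page the request claims to come from
// ));
// $image = curl_exec($ch);
```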
PHP/CURL also gives webbots the ability to recognize when a webserver isn’t going to respond to a file request. This ability is vital because, without it, your webbot might hang (forever) waiting for a server response that will never happen. With PHP/CURL, you can specify how long a webbot will wait for a response from a server before it gives up and moves on.
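A minimal sketch of these time limits appears below. The helper name timeout_options() and the default values of 10 and 30 seconds are assumptions; suitable limits depend on the servers your webbot visits.

```php
<?php
// Hypothetical helper: build PHP/CURL options that limit how long a webbot waits.
function timeout_options(string $url, int $connect_seconds = 10, int $total_seconds = 30): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_CONNECTTIMEOUT => $connect_seconds, // give up if no connection is made in this time
        CURLOPT_TIMEOUT        => $total_seconds,   // give up if the whole transfer takes longer
        CURLOPT_RETURNTRANSFER => true,             // return the page as a string
    ];
}

// Typical use (network request, left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, timeout_options('http://www.example.com/'));
// $page = curl_exec($ch);
// if ($page === false) {
//     echo 'Download failed: ' . curl_error($ch); // a timeout is reported here
// }
```

Checking for a false return value lets the webbot log the failure and move on to its next task instead of waiting indefinitely.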
[12] A more complete list of known user agent names is found at http://www.useragentstring.com/pages/useragentstring.php.