Before you can appreciate PHP/CURL, you’ll need to familiarize yourself with PHP’s built-in functions for downloading files from the Internet.
PHP includes two simple built-in functions for downloading files from a network: fopen() and fgets(). The fopen() function does two things. First, it creates a network socket, which represents the link between your webbot and the network resource you want to retrieve. Second, it implements the HTTP protocol, which defines how data is transferred. With those tasks completed, fgets() leverages the networking ability of your computer’s operating system to pull the file from the Internet.
Let’s use PHP’s built-in functions to create your first webbot, which downloads a “Hello, world!” web page from this book’s companion website. The short script is shown in Example 3-1.
Example 3-1. Downloading a file from the Web with fopen() and fgets()
# Define the file you want to download
$target = "http://www.WebbotsSpidersScreenScrapers.com/hello_world.html";
$file_handle = fopen($target, "r");

# Fetch the file
while (!feof($file_handle))
    echo fgets($file_handle, 4096);
fclose($file_handle);
As shown in Example 3-1, fopen() establishes a network connection to the target, or file you want to download. It references this connection with a file handle, or network link, called $file_handle. The script then uses fgets() to fetch and echo the file one line at a time, reading at most 4,095 bytes per call (fgets() reads up to one byte less than the length it is given), until it has downloaded and displayed the entire file. Finally, the script executes an fclose() to tell PHP that it’s finished with the network handle.
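Example 3-1 assumes the connection always succeeds. As a minimal sketch of the same fopen()/fgets() loop with basic error checking, the listing below wraps the steps in a function. The name fetch_with_fgets() is a hypothetical helper introduced here for illustration and is not part of the book's code. It also assumes that PHP's allow_url_fopen setting is enabled, which fopen() needs in order to open URLs.

```php
<?php
// Sketch only: fetch_with_fgets() is a hypothetical helper name.
// Returns the downloaded contents as a string, or false on failure.
function fetch_with_fgets(string $target, int $max_line_length = 4096)
{
    // @ suppresses the PHP warning so the caller can test the return value
    $file_handle = @fopen($target, "r");
    if ($file_handle === false) {
        return false;   // the target could not be opened
    }

    $contents = "";
    // fgets() returns false at end of file (or on a read error)
    while (($line = fgets($file_handle, $max_line_length)) !== false) {
        $contents .= $line;
    }

    fclose($file_handle);
    return $contents;
}

// Usage, with the target from Example 3-1:
// $page = fetch_with_fgets("http://www.WebbotsSpidersScreenScrapers.com/hello_world.html");
// if ($page === false) { echo "Download failed\n"; } else { echo $page; }
```

Returning the contents instead of echoing them lets the calling script decide what to do when a download fails, which matters for a webbot that runs unattended.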
Before we can run the script in Example 3-1, we need to examine the two ways to execute a webbot: you can run a webbot either in a browser or in a command shell.[11]
If you have a choice, it is usually better to execute webbots from a shell or command line. Webbots generally don’t care about web page formatting, so they will display exactly what is returned from a webserver. Browsers, in contrast, will interpret HTML tags as instructions for rendering the web page. For example, Example 3-2 shows what Example 3-1 looks like when executed in a shell.
To run a webbot script in a browser, simply load the script on a webserver and execute it by loading its URL into the browser’s location bar as you would any other web page. Contrast Example 3-2 with Figure 3-3, where the same script is run within a browser. The HTML tags are gone, as well as all of the structure of the returned file; the only things displayed are two lines of text. Running a webbot in a browser only shows a partial picture and often hides important information that a webbot needs.
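If you do have to run a webbot in a browser, one possible workaround is to escape the downloaded text so the browser shows the tags instead of interpreting them. The sketch below is an assumption about how that could be done, not something taken from the examples above; format_for_display() is a hypothetical helper name. It relies on PHP's PHP_SAPI constant, which is "cli" when a script runs from a command shell.

```php
<?php
// Sketch only: format_for_display() is a hypothetical helper name.
// In a shell, the raw text is returned untouched. Anywhere else, the
// text is escaped so a browser displays the HTML tags literally.
function format_for_display(string $raw, string $sapi = PHP_SAPI): string
{
    if ($sapi === "cli") {
        return $raw;
    }
    return "<pre>" . htmlspecialchars($raw, ENT_QUOTES, "UTF-8") . "</pre>";
}

// Usage:
// echo format_for_display($downloaded_text);
```

The second parameter exists only so the behavior can be exercised without a webserver; in normal use it is left at its default.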
Browser buffering is another complication you might run into if you try to execute a webbot in a browser. Buffering is useful when you’re viewing web pages because it allows a browser to wait until it has collected enough of a web page before it starts rendering or displaying the web page. However, browser buffering is troublesome for webbots because they frequently run for extended periods of time—much longer than it would take to download a typical web page. During prolonged webbot execution, status messages written by the webbot may not be displayed by the browser while it is buffering the display.
I have one webbot that runs continuously; in fact, it once ran for seven months before stopping during a power outage. This webbot could never run effectively in a browser because browsers are designed to render web pages from files of finite length. Browsers assume short download periods and may buffer an entire web page before displaying anything, so the output of a webbot that never finishes may never be displayed.
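For a long-running webbot executed from a shell, status messages can be written and flushed as they occur. The following is a minimal sketch under that assumption; write_status() is a hypothetical helper name, and the timestamp format is an arbitrary choice.

```php
<?php
// Sketch only: write_status() is a hypothetical helper name.
// Writes one timestamped status line to an open stream and flushes it
// immediately so the message is not held in a buffer.
function write_status($stream, string $message): void
{
    fwrite($stream, "[" . date("Y-m-d H:i:s") . "] " . $message . PHP_EOL);
    fflush($stream);
}

// Usage from a command shell, where PHP defines the STDERR stream:
// write_status(STDERR, "Fetched page 1");
```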
An alternative to fopen() and fgets() is the function file(), which downloads formatted files and places them into an array. This function differs from fopen() in two important ways. First, unlike fopen(), it does not require you to create a file handle, because it makes all the network preparations for you. Second, it returns the downloaded file as an array, with each line of the downloaded file in a separate array element. The script in Example 3-3 downloads the same web page used in Example 3-1, but it uses the file() function.
Example 3-3. Downloading files with file()
<?php
// Download the target file
$target = "http://www.WebbotsSpidersScreenScrapers.com/hello_world.html";
$downloaded_page_array = file($target);

// Echo contents of file
for ($xx = 0; $xx < count($downloaded_page_array); $xx++)
    echo $downloaded_page_array[$xx];
?>
The file() function is particularly useful for downloading comma-separated value (CSV) files, in which each line of text represents a row of data with columnar formatting (as in an Excel spreadsheet). Loading files line by line into an array is far less useful for HTML files, because the data in a web page is not organized into rows or columns the way a CSV file is.
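To illustrate the CSV case, here is a minimal sketch that combines file() with PHP's str_getcsv() function to turn each downloaded line into an array of column values. The helper name load_csv_rows() and the URL in the usage comment are hypothetical. The sketch assumes that no quoted field contains a line break, since file() splits strictly on line endings.

```php
<?php
// Sketch only: load_csv_rows() is a hypothetical helper name.
// Returns an array of rows (each row an array of column values),
// or false if the target could not be read.
function load_csv_rows(string $target)
{
    // Drop the trailing newline from each element and skip blank lines
    $lines = @file($target, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false;
    }

    $rows = [];
    foreach ($lines as $line) {
        // Separator, enclosure, and escape are passed explicitly
        $rows[] = str_getcsv($line, ",", "\"", "\\");
    }
    return $rows;
}

// Usage (the URL is a placeholder, not a real file):
// $rows = load_csv_rows("http://www.example.com/prices.csv");
// if ($rows !== false) { echo $rows[0][0]; }  // first column of the first row
```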