Downloading Files with PHP’s Built-in Functions

Before you can appreciate PHP/CURL, you’ll need to familiarize yourself with PHP’s built-in functions for downloading files from the Internet.

PHP includes two simple built-in functions for downloading files from a network: fopen() and fgets(). The fopen() function does two things. First, it creates a network socket, which represents the link between your webbot and the network resource you want to retrieve. Second, it implements HTTP, the protocol that defines how the data is transferred. With those tasks completed, fgets() relies on the networking ability of your computer’s operating system to pull the file from the Internet.

Let’s use PHP’s built-in functions to create your first webbot, which downloads a “Hello, world!” web page from this book’s companion website. The short script is shown in Example 3-1.

As shown in Example 3-1, fopen() establishes a network connection to the target, which is the file you want to download. It references this connection with a file handle (a network link) called $file_handle. The script then uses fgets() to fetch and echo the file in 4,096-byte chunks until it has downloaded and displayed the entire file. Finally, the script calls fclose() to tell PHP that it’s finished with the network handle.
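The open, read, and close sequence described above can be sketched roughly as follows. This is a minimal illustration, not the book’s Example 3-1: the function name download_in_chunks() and its parameters are hypothetical, and the sketch collects the chunks into a string rather than echoing them. Note that fgets() stops at the end of a line, so a chunk may be shorter than the requested size.

```php
<?php
// Hypothetical helper (an assumption for illustration, not the book's listing).
// $target may be a URL (this requires allow_url_fopen to be enabled) or a local path.
function download_in_chunks(string $target, int $chunk_size = 4096): string
{
    // fopen() opens the connection and returns a handle, or false on failure.
    $file_handle = @fopen($target, 'r');
    if ($file_handle === false) {
        throw new RuntimeException("Could not open $target");
    }

    $downloaded = '';
    // fgets() returns at most $chunk_size - 1 bytes and stops early at a newline.
    while (!feof($file_handle)) {
        $chunk = fgets($file_handle, $chunk_size);
        if ($chunk === false) {
            break; // end of file or read error
        }
        $downloaded .= $chunk;
    }

    // Tell PHP we are finished with the handle.
    fclose($file_handle);

    return $downloaded;
}
```

A call such as echo download_in_chunks('http://www.example.com/'); would then display the page, where the URL is only a placeholder for the target you want to fetch.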

Before we run the script in Example 3-1, we need to examine the two ways to execute a webbot: in a browser or in a command shell.[11]

To run a webbot script in a browser, upload the script to a web server and execute it by loading its URL into the browser’s location bar, as you would any other web page. Contrast Example 3-2 with Figure 3-3, where the same script is run within a browser. Because the browser renders the HTML instead of displaying it, the HTML tags are gone, as well as all of the structure of the returned file; the only things displayed are two lines of text. Running a webbot in a browser shows only a partial picture and often hides important information that a webbot needs.

Browser buffering is another complication you might run into if you try to execute a webbot in a browser. Buffering is useful when you’re viewing web pages because it lets a browser wait until it has collected enough of a page before it starts rendering it. However, browser buffering is troublesome for webbots because they frequently run for extended periods of time, much longer than it takes to download a typical web page. During prolonged webbot execution, the browser may not display status messages written by the webbot while it is still buffering the output.

I have one webbot that runs continuously; in fact, it once ran for seven months before stopping during a power outage. This webbot could never run effectively in a browser because browsers are designed to render web pages built from files of finite length. Browsers assume short download periods and may buffer an entire web page before displaying anything, so the output of a webbot that never finishes may never be displayed.
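As an illustration of the kind of status message a long-running webbot might write, here is a minimal sketch. The helper names status_line() and report_status() are hypothetical, and the timestamp format is an arbitrary choice. Run from a command shell, each line typically appears as soon as it is written; in a browser, the same lines may sit in the browser’s buffer as described above.

```php
<?php
// Hypothetical helpers (assumptions for illustration, not from the book).

// Build one timestamped status line. gmdate() formats the time in UTC.
function status_line(string $message, ?int $timestamp = null): string
{
    $timestamp = $timestamp ?? time();
    return '[' . gmdate('Y-m-d H:i:s', $timestamp) . '] ' . $message . PHP_EOL;
}

// Write a status line and ask PHP to push it out right away.
function report_status(string $message): void
{
    echo status_line($message);
    // flush() asks PHP to send pending output to the client;
    // it cannot force a browser to render what it has received.
    flush();
}
```

A webbot could call report_status('Download started') before each major step to leave a readable trail in the shell.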

An alternative to fopen() and fgets() is the function file(), which downloads a file and places its contents into an array. This function differs from fopen() in two important ways. First, unlike fopen(), it does not require you to create a file handle, because it makes all the network preparations for you. Second, it returns the downloaded file as an array, with each line of the downloaded file in a separate array element. The script in Example 3-3 downloads the same web page used in Example 3-1, but it uses the file() function.
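The difference can be sketched in a few lines. This is an illustration rather than the book’s Example 3-3, and the function name download_as_lines() is hypothetical.

```php
<?php
// Hypothetical helper (an assumption for illustration, not the book's listing).
// $target may be a URL (this requires allow_url_fopen to be enabled) or a local path.
function download_as_lines(string $target): array
{
    // file() opens, reads, and closes the target in one call; no handle is needed.
    $lines = @file($target);
    if ($lines === false) {
        throw new RuntimeException("Could not read $target");
    }

    // Each array element holds one line, including its trailing newline.
    return $lines;
}
```

Looping over the result with foreach ($lines as $line) { echo $line; } reproduces the file, one array element at a time.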

The file() function is particularly useful for downloading comma-separated value (CSV) files, in which each line of text represents a row of data with columnar formatting (as in an Excel spreadsheet). In a CSV file, rows and columns have specific meaning, so an array with one line per element maps directly onto the data. Loading a file line by line is far less useful for HTML because the data in a web page is not defined by rows or columns.
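To make the CSV case concrete, here is a minimal sketch that combines file() with PHP’s str_getcsv() to split each line into columns. The function name download_csv_rows() is hypothetical, and the sketch assumes a simple comma-delimited file in which no quoted field contains a line break.

```php
<?php
// Hypothetical helper (an assumption for illustration, not from the book).
// Assumes simple CSV: comma-delimited, no line breaks inside quoted fields.
function download_csv_rows(string $target): array
{
    // Drop trailing newlines and skip blank lines while loading.
    $lines = @file($target, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException("Could not read $target");
    }

    $rows = [];
    foreach ($lines as $line) {
        // str_getcsv() splits one line of text into an array of column values.
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }

    return $rows;
}
```

With this shape, $rows[1][0] refers to the first column of the second line, which is the row-and-column meaning that a web page lacks.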



[11] See Chapter 22 for more information on executing webbots as scheduled events.