LIB_http

Since PHP/CURL is very flexible and has many configuration options, it is often handy to use it through wrapper functions, which reduce the complexity of a code library to something easier to understand. For your convenience, this book uses a library called LIB_http, which provides wrapper functions for the PHP/CURL features you’ll use most. The remainder of this chapter describes the basic functions of the LIB_http library.

LIB_http is a collection of PHP/CURL routines whose defaults and abstractions simplify downloading files, managing cookies, and completing online forms. The library is named for HTTP, the protocol it uses. Some of the reasons for using this library will not be evident until we cover its more advanced features. Even simple file downloads, however, are made easier and more robust with LIB_http because of PHP/CURL. The most recent version of LIB_http is available at this book’s website.

To simplify its use, LIB_http sets a series of default conditions for you, as described below:

These defaults are set at the beginning of the file. Feel free to change any of these settings to meet your specific needs.

The LIB_http library provides a set of wrapper functions that simplify complicated PHP/CURL interfaces. Each of these interfaces calls a common routine, http(), which performs the specified task, using the values passed to it by the wrapper interfaces. All functions in LIB_http share a similar format: A target and referring URL are passed, and an array is returned, containing the contents of the requested file, transfer status, and error conditions.
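To make the wrapper pattern concrete, here is a minimal sketch of how several thin wrappers can share one common routine. It is not the LIB_http source: the function names (sketch_http(), sketch_get(), sketch_get_withheader()), the two constants, and the particular PHP/CURL options chosen are all assumptions made for illustration. Only the shape of the returned array (FILE, STATUS, ERROR) comes from the description above. The sketch needs PHP's curl extension to run.

```php
<?php
// Hypothetical defaults, standing in for the ones a library would set
// at the beginning of its file. The values are illustrative only.
const SKETCH_AGENT   = 'Sketch webbot';
const SKETCH_TIMEOUT = 25; // seconds

// Common routine: every wrapper below funnels into this one function.
function sketch_http(string $target, string $referer, bool $include_header): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, SKETCH_AGENT);
    curl_setopt($ch, CURLOPT_TIMEOUT, SKETCH_TIMEOUT);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirections
    curl_setopt($ch, CURLOPT_HEADER, $include_header);  // prepend the HTTP header?
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return data, don't print it

    $file = curl_exec($ch);

    // Same three-part array the text describes.
    return [
        'FILE'   => $file === false ? '' : $file,
        'STATUS' => curl_getinfo($ch),
        'ERROR'  => curl_error($ch),
    ];
}

// Wrapper: fetch the file only.
function sketch_get(string $target, string $referer): array
{
    return sketch_http($target, $referer, false);
}

// Wrapper: fetch the file with its HTTP header included.
function sketch_get_withheader(string $target, string $referer): array
{
    return sketch_http($target, $referer, true);
}
```

Keeping the PHP/CURL options in a single routine means that a changed default takes effect in every wrapper at once.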

While LIB_http has many functions, we’ll restrict our discussion to simply fetching files from the Internet using HTTP. The remaining features are described as needed throughout the book.

The function http_get() downloads files with the GET method; it has many advantages over PHP’s built-in functions for downloading files from the Internet. Not only is the interface simple, but this function offers all the previously described advantages of using PHP/CURL. The script in Example 3-5 shows how files are downloaded with http_get().

These are the inputs for the script in Example 3-5:

  • target_url is the fully formed URL of the desired file.

  • referring_url is the fully formed URL of the referer.

These are the outputs for the script in Example 3-5:

  • $array['FILE'] contains the contents of the requested file.

  • $array['STATUS'] contains status information regarding the file transfer.

  • $array['ERROR'] contains a textual description of any errors.
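Once the array comes back, a script will usually check the ERROR element before trusting the FILE element. The sketch below assumes only the three keys listed above. The helper name describe_fetch() is hypothetical, and so is the assumption that ERROR is an empty string when nothing went wrong. A hand-built sample array is used so the snippet runs without the library or a network connection.

```php
<?php
// Summarize an array shaped like the one the text describes.
// Assumption: 'ERROR' is an empty string when the transfer succeeded.
function describe_fetch(array $result): string
{
    if ($result['ERROR'] !== '') {
        return 'Fetch failed: ' . $result['ERROR'];
    }
    return 'Fetched ' . strlen($result['FILE']) . ' bytes';
}

// Hand-built sample; with LIB_http included, the array would come from
// a call along the lines of http_get($target_url, $referring_url).
$sample = ['FILE' => '<html></html>', 'STATUS' => [], 'ERROR' => ''];
echo describe_fetch($sample), "\n";
```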

When a web agent requests a file from the Web, the server returns the file contents, as discussed in the previous section, along with the HTTP header, which describes various properties related to a web page. Browsers and webbots rely on the HTTP header to determine what to do with the contents of the downloaded file.

The data included in the HTTP header varies from application to application, but it may define cookies, the size of the downloaded file, redirections, encryption details, or authentication directives. Since the information in the HTTP header is critical to properly using a network file, LIB_http configures PHP/CURL to automatically handle the more common header directives. When you need to examine the header yourself, the function http_get_withheader() returns it along with the file; Example 3-6 shows how this function is used.

These are the inputs for the script in Example 3-6:

  • target_url is the fully formed URL of the desired file.

  • referring_url is the fully formed URL of the referer.

These are the outputs for the script in Example 3-6:

  • $array['FILE'] contains the contents of the requested file, including the HTTP header.

  • $array['STATUS'] contains status information about the file transfer.

  • $array['ERROR'] contains a textual description of any errors.
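When the FILE element carries the header as well as the page, a script often needs to pull the two apart. In an HTTP response, the header ends at the first blank line, so a split on the first "\r\n\r\n" sequence is a reasonable starting point. The helper name split_header_and_body() is hypothetical, and the sketch assumes a single header block; a fetch that followed redirections may contain one header block per hop.

```php
<?php
// Split a raw HTTP response into its header and body.
// Assumption: exactly one header block precedes the body.
function split_header_and_body(string $file): array
{
    $parts = explode("\r\n\r\n", $file, 2);
    return [
        'header' => $parts[0],
        'body'   => $parts[1] ?? '', // empty when only a header was returned
    ];
}

// Hand-built sample response for illustration.
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hello</html>";
$split  = split_header_and_body($sample);
echo $split['header'], "\n---\n", $split['body'], "\n";
```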

Example 3-7 uses the http_get_withheader() function to download a page and display the contents of the returned array: the requested page, any errors, and a variety of status information related to the fetch and download.

Example 3-8 shows what is produced when the script in Example 3-7 is executed, with the array that includes the page header, error conditions, and status. Notice that the contents of the returned file are limited to only the HTTP header, because we requested only the header and not the entire page. Also, notice that the first line in an HTTP header contains the HTTP code, which indicates the status of the request. An HTTP code of 200 tells us that the request was successful. The HTTP code also appears in the status array element.[13]
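Because the HTTP code sits on the first line of the header, it can be read with a single pattern match. This is a small sketch rather than part of LIB_http, and the function name http_code_from_header() is hypothetical.

```php
<?php
// Read the three-digit HTTP code from the first line of a header.
// Returns null when the text does not start with an HTTP status line.
function http_code_from_header(string $header): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $header, $match)) {
        return (int) $match[1];
    }
    return null;
}

// Hand-built sample header for illustration.
$code = http_code_from_header("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n");
echo $code === 200 ? "Request succeeded\n" : "Request did not succeed\n";
```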

The information returned in $array['STATUS'] is extraordinarily useful for learning how the fetch was conducted. Included in this array are values for download speed, access times, and file sizes—all valuable when writing diagnostic webbots that monitor the performance of a website.
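A diagnostic webbot might boil the status information down to one line per fetch. The sketch below assumes that the STATUS element uses the key names of the array returned by PHP's curl_getinfo() (http_code, size_download, total_time, and speed_download). The text does not list the keys, so treat them as an assumption. The function name summarize_status() is hypothetical, and the sample values are invented for illustration.

```php
<?php
// Build a one-line performance summary from a status array.
// Assumption: keys follow the names used by curl_getinfo().
function summarize_status(array $status): string
{
    return sprintf(
        'HTTP %d, %d bytes in %.2f s (%.0f bytes/s)',
        $status['http_code'] ?? 0,
        $status['size_download'] ?? 0,
        $status['total_time'] ?? 0.0,
        $status['speed_download'] ?? 0.0
    );
}

// Invented sample values, not measurements from a real fetch.
$sample = [
    'http_code'      => 200,
    'size_download'  => 5000,
    'total_time'     => 0.5,
    'speed_download' => 10000,
];
echo summarize_status($sample), "\n";
```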

When a Content-Type line appears in an HTTP header, it defines the MIME type, or media type, of the file sent from the server. The MIME type tells the web agent what to do with the file. For example, the Content-Type in the previous example was text/html, which indicates that the file is a web page. Knowing whether a downloaded file is an image or an HTML file tells a browser whether to render an image or display the file as text. For example, the HTTP header information for a JPEG image is shown in Example 3-9.
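A webbot can make the same decision a browser does by reading the Content-Type line itself. The following sketch pulls the MIME type out of a header and tests whether it names an image. Both function names are hypothetical, and the sample headers are hand-built rather than copied from a real server.

```php
<?php
// Return the MIME type from a header's Content-Type line, without any
// parameters such as "; charset=UTF-8". Returns null if the line is absent.
function content_type_from_header(string $header): ?string
{
    if (preg_match('/^Content-Type:[ \t]*([^;\r\n]+)/mi', $header, $match)) {
        return strtolower(trim($match[1]));
    }
    return null;
}

// True when the MIME type belongs to the image/ family.
function is_image_type(?string $mime): bool
{
    return $mime !== null && strncmp($mime, 'image/', 6) === 0;
}

// Hand-built sample header for illustration.
$header = "HTTP/1.1 200 OK\r\nContent-Type: image/jpeg\r\nContent-Length: 10\r\n";
$mime   = content_type_from_header($header);
echo is_image_type($mime) ? "Treat as an image\n" : "Treat as text\n";
```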

Most webbots in this book will use the library LIB_http to download pages from the Internet. If you plan to explore any of the webbot examples that appear later in this book, you should obtain a copy of this library; the latest version is available for download at this book’s website. We’ll explore some of the defaults and functions of LIB_http here.



[13] A complete list of HTTP codes can be found in Appendix B.