LIB_http

Since PHP/CURL is very flexible and has many configuration options, it is often handy to call it through wrapper functions, which reduce the complexity of the underlying library to something easier to understand. For your convenience, this book uses a library called LIB_http, which provides wrapper functions for the PHP/CURL features you’ll use most. The remainder of this chapter describes the basic functions of the LIB_http library.

LIB_http is a collection of PHP/CURL routines whose defaults and abstractions simplify downloading files, managing cookies, and completing online forms. The library is named for HTTP, the protocol it uses. Some of the reasons for using this library will not be evident until we cover its more advanced features. Even simple file downloads, however, are made easier and more robust with LIB_http because of PHP/CURL. The most recent version of LIB_http is available at this book’s website.

Familiarizing Yourself with the Default Values

To simplify its use, LIB_http sets a series of default conditions for you, as described below:

Your webbot’s agent name is Test Webbot.
Your webbot will time out if a file transfer doesn’t complete within 25 seconds.
Your webbot will store cookies in the file c:\cookie.txt.
Your webbot will automatically follow a maximum of four redirections, as directed by servers in HTTP headers.
Your webbot will, if asked, tell the remote server that you do not have a local authentication certificate. (This is only important if you access a website employing SSL encryption, which is used to protect confidential information on e-commerce websites.)

These defaults are set at the beginning of the file. Feel free to change any of these settings to meet your specific needs.

Using LIB_http

The LIB_http library provides a set of wrapper functions that simplify complicated PHP/CURL interfaces. Each of these interfaces calls a common routine, http(), which performs the specified task, using the values passed to it by the wrapper interfaces. All functions in LIB_http share a similar format: A target and referring URL are passed, and an array is returned, containing the contents of the requested file, transfer status, and error conditions.
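
The delegation described above can be pictured with a minimal sketch. The names common_request(), sketch_get(), sketch_get_withheader(), and sketch_header() are hypothetical and are not part of LIB_http. The common routine here performs no network transfer and only records what it was asked to do, so the sketch shows the shape of the pattern rather than the library’s actual code.

```php
<?php
// Hypothetical stand-in for the role a common routine such as http() plays.
// It does NOT contact the network; it only echoes back what was requested.
function common_request(string $target, string $ref, string $method, bool $include_header): array
{
    return [
        'FILE'   => '',
        'STATUS' => [
            'url'            => $target,
            'referer'        => $ref,
            'method'         => $method,
            'include_header' => $include_header,
        ],
        'ERROR'  => '',
    ];
}

// Each hypothetical wrapper fixes the method and header flag for the caller.
function sketch_get(string $target, string $ref): array
{
    return common_request($target, $ref, 'GET', false);
}

function sketch_get_withheader(string $target, string $ref): array
{
    return common_request($target, $ref, 'GET', true);
}

function sketch_header(string $target, string $ref): array
{
    return common_request($target, $ref, 'HEAD', true);
}

$result = sketch_get_withheader("http://www.schrenk.com/publications.php", "http://www.schrenk.com");
var_dump($result['STATUS']['method'], $result['STATUS']['include_header']);
```

Because every wrapper returns the same three-element array, calling code can treat the results uniformly no matter which wrapper produced them.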

While LIB_http has many functions, we’ll restrict our discussion to simply fetching files from the Internet using HTTP. The remaining features are described as needed throughout the book.

http_get()

The function http_get() downloads files with the GET method; it has many advantages over PHP’s built-in functions for downloading files from the Internet. Not only is the interface simple, but this function offers all the previously described advantages of using PHP/CURL. The script in Example 3-5 shows how files are downloaded with http_get().

Example 3-5. Using http_get()

# Usage: http_get()
array http_get (string target_url, string referring_url)

These are the inputs for the script in Example 3-5:

target_url is the fully formed URL of the desired file.
referring_url is the fully formed URL of the referer.

These are the outputs for the script in Example 3-5:

$array['FILE'] contains the contents of the requested file.
$array['STATUS'] contains status information regarding the file transfer.
$array['ERROR'] contains a textual description of any errors.
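
A script will usually want to inspect all three elements before trusting the download. The following is a minimal sketch of such a check. The helper summarize_result() is hypothetical, the sample array is built by hand rather than returned by http_get(), and the sketch assumes that the status element carries an http_code entry like the one shown later in Example 3-8.

```php
<?php
// Hypothetical helper: turn a LIB_http-style result array into a one-line summary.
function summarize_result(array $result): string
{
    // A non-empty error string means the transfer itself failed.
    if ($result['ERROR'] !== '') {
        return 'Transfer failed: ' . $result['ERROR'];
    }

    // Assumption: STATUS holds an 'http_code' entry, as in Example 3-8.
    $code = $result['STATUS']['http_code'] ?? 0;
    if ($code < 200 || $code >= 300) {
        return "Server answered with HTTP code $code";
    }

    return 'Downloaded ' . strlen($result['FILE']) . ' bytes';
}

// Hand-built array standing in for a real http_get() result.
$sample = [
    'FILE'   => '<html></html>',
    'STATUS' => ['http_code' => 200],
    'ERROR'  => '',
];
echo summarize_result($sample), "\n";
```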

http_get_withheader()

When a web agent requests a file from the Web, the server returns the file contents, as discussed in the previous section, along with the HTTP header, which describes various properties related to a web page. Browsers and webbots rely on the HTTP header to determine what to do with the contents of the downloaded file.

The data that is included in the HTTP header varies from application to application, but it may define cookies, the size of the downloaded file, redirections, encryption details, or authentication directives. Since the information in the HTTP header is critical to properly using a network file, LIB_http configures PHP/CURL to automatically handle the more common header directives. When you need to see the header yourself, http_get_withheader() returns it along with the file; Example 3-6 shows how this function is used.

Example 3-6. Using http_get_withheader()

# Usage: http_get_withheader()
array http_get_withheader (string target_url, string referring_url)

These are the inputs for the script in Example 3-6:

target_url is the fully formed URL of the desired file.
referring_url is the fully formed URL of the referer.

These are the outputs for the script in Example 3-6:

$array['FILE'] contains the contents of the requested file, including the HTTP header.
$array['STATUS'] contains status information about the file transfer.
$array['ERROR'] contains a textual description of any errors.
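
Since the file element holds the header and the page together, a script often needs to separate the two. One way to do so, sketched below, relies on the header_size entry that appears in the status array in Example 3-8. The helper split_header_and_body() is hypothetical, and the sample array is built by hand so that the sketch runs without a network connection.

```php
<?php
// Hypothetical helper: separate the HTTP header from the page that follows it.
// $header_size is the byte count of the header, including its closing blank line.
function split_header_and_body(string $file, int $header_size): array
{
    return [
        'header' => (string) substr($file, 0, $header_size),
        'body'   => (string) substr($file, $header_size),
    ];
}

// Hand-built stand-in for a result that includes the header.
$header = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$sample = [
    'FILE'   => $header . "<html></html>",
    'STATUS' => ['header_size' => strlen($header)],
    'ERROR'  => '',
];

$parts = split_header_and_body($sample['FILE'], $sample['STATUS']['header_size']);
echo $parts['body'], "\n";
```

Splitting by byte count avoids searching the page for a blank line, which could also occur inside the page itself.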

Example 3-7 uses the related http_header() function, which requests only the HTTP header, to query a page and display the contents of the returned array.

Example 3-7. Using http_header()

# Include http library
include("LIB_http.php");

# Define the target and referer web pages
$target = "http://www.schrenk.com/publications.php";
$ref    = "http://www.schrenk.com";

# Request the header
$return_array = http_header($target, $ref);

# Display the header
echo "FILE CONTENTS \n";
var_dump($return_array['FILE']);
echo "ERRORS \n";
var_dump($return_array['ERROR']);

echo "STATUS \n";
var_dump($return_array['STATUS']);

The script in Example 3-7 sends the request and displays the returned file contents, any errors, and a variety of status information related to the fetch.

Example 3-8 shows the output produced when the script in Example 3-7 is executed: the page header, error conditions, and status. Notice that the contents of the returned file are limited to the HTTP header, because we requested only the header and not the entire page. Also, notice that the first line in an HTTP header contains the HTTP code, which indicates the status of the request. An HTTP code of 200 tells us that the request was successful. The HTTP code also appears in the status array element.^[13]

Example 3-8. File contents, errors, and the download status array returned by LIB_http

FILE CONTENTS
string(215) "HTTP/1.1 200 OK
Date: Sat, 08 Oct 2011 16:38:51 GMT
Server: Apache/2.0.53 (FreeBSD) mod_ssl/2.0.53 OpenSSL/0.9.7g PHP/5
X-Powered-By: PHP/5
Content-Type: text/html; charset=ISO-8859-1

"
ERRORS
string(0) ""

STATUS
array(20) {
  ["url"]=>
  string(39) "http://www.schrenk.com/publications.php"
  ["content_type"]=>
  string(29) "text/html; charset=ISO-8859-1"
  ["http_code"]=>
  int(200)
  ["header_size"]=>
  int(215)
  ["request_size"]=>
  int(200)
  ["filetime"]=>
  int(-1)
  ["ssl_verify_result"]=>
  int(0)
  ["redirect_count"]=>
  int(0)
  ["total_time"]=>
  float(0.683)
  ["namelookup_time"]=>
  float(0.005)
  ["connect_time"]=>
  float(0.101)
  ["pretransfer_time"]=>
  float(0.101)
  ["size_upload"]=>
  float(0)
  ["size_download"]=>
  float(0)
  ["speed_download"]=>
  float(0)
  ["speed_upload"]=>
  float(0)
  ["download_content_length"]=>
  float(0)
  ["upload_content_length"]=>
  float(0)
  ["starttransfer_time"]=>
  float(0.683)
  ["redirect_time"]=>
  float(0)
}
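
The HTTP code in the first header line can also be read directly from the header text. The sketch below is one way to do that, using a hypothetical helper named status_code_from_header(). It assumes the header begins with a status line such as the one in Example 3-8 and looks only at that first line.

```php
<?php
// Hypothetical helper: pull the three-digit HTTP code out of a header's first line.
// Returns null when the text does not start with an HTTP status line.
function status_code_from_header(string $header): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $header, $matches) === 1) {
        return (int) $matches[1];
    }
    return null;
}

// First two header lines from Example 3-8.
$header = "HTTP/1.1 200 OK\r\nDate: Sat, 08 Oct 2011 16:38:51 GMT\r\n\r\n";
var_dump(status_code_from_header($header));
```

In practice the http_code entry of the status array is the simpler source; parsing the header is mainly useful for cross-checking or when only the raw header text is at hand.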

The information returned in $array['STATUS'] is extraordinarily useful for learning how the fetch was conducted. Included in this array are values for download speed, access times, and file sizes—all valuable when writing diagnostic webbots that monitor the performance of a website.
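
As a rough illustration of the diagnostic use mentioned above, the following sketch turns the timing entries of the status array into per-phase durations. The helper timing_breakdown() and its phase names are hypothetical; the sketch assumes that each timing entry is measured in seconds from the start of the request, so the phases are obtained by subtraction.

```php
<?php
// Hypothetical helper: convert cumulative timing entries into per-phase durations.
function timing_breakdown(array $status): array
{
    return [
        'dns_lookup'    => round($status['namelookup_time'], 3),
        'tcp_connect'   => round($status['connect_time'] - $status['namelookup_time'], 3),
        'server_wait'   => round($status['starttransfer_time'] - $status['pretransfer_time'], 3),
        'data_transfer' => round($status['total_time'] - $status['starttransfer_time'], 3),
    ];
}

// Timing values copied from Example 3-8.
$status = [
    'total_time'         => 0.683,
    'namelookup_time'    => 0.005,
    'connect_time'       => 0.101,
    'pretransfer_time'   => 0.101,
    'starttransfer_time' => 0.683,
];

foreach (timing_breakdown($status) as $phase => $seconds) {
    printf("%-14s %.3f s\n", $phase, $seconds);
}
```

For the values in Example 3-8, nearly all of the elapsed time falls in the wait for the server’s first byte, which is the kind of observation a monitoring webbot might log.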

Learning More About HTTP Headers

When a Content-Type line appears in an HTTP header, it defines the MIME type, or media type, of the file sent from the server. The MIME type tells the web agent what to do with the file. For example, the Content-Type in the previous example was text/html, which indicates that the file is a web page. Knowing whether a downloaded file is an image or an HTML file tells a browser whether to render it as an image or display it as text. By contrast, the HTTP header information for a JPEG image is shown in Example 3-9.

Example 3-9. An HTTP header for an image file request

HTTP/1.1 200 OK
Date: Wed, 23 Mar 2011 00:06:13 GMT
Server: Apache/1.3.12 (Unix) mod_throttle/3.1.2 tomcat/1.0 PHP/5
Last-Modified: Wed, 23 Jul 2008 18:03:29 GMT
ETag: "74db-9063-3d3eebf1"
Accept-Ranges: bytes
Content-Length: 36963
Content-Type: image/jpeg
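
A webbot can make the same decision a browser does by reading the Content-Type line. The sketch below is one possible approach and is not taken from LIB_http. The helpers parse_http_header(), media_type(), and handling_for() are hypothetical, and the header text is copied from Example 3-9.

```php
<?php
// Hypothetical helper: turn header text into an array keyed by lowercase field name.
function parse_http_header(string $header): array
{
    $fields = [];
    // Accept CRLF or bare LF line endings; the first line is the status line, not a field.
    $lines = preg_split('/\r\n|\n/', trim($header));
    foreach (array_slice($lines, 1) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $fields[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $fields;
}

// Hypothetical helper: drop parameters such as "; charset=ISO-8859-1".
function media_type(string $content_type): string
{
    return strtolower(trim(explode(';', $content_type)[0]));
}

// Hypothetical helper: decide how a webbot might handle the file.
function handling_for(string $content_type): string
{
    $type = media_type($content_type);
    if ($type === 'text/html') {
        return 'parse as a web page';
    }
    if (strpos($type, 'image/') === 0) {
        return 'treat as an image';
    }
    return 'treat as an opaque download';
}

// Header text from Example 3-9.
$header = implode("\r\n", [
    'HTTP/1.1 200 OK',
    'Date: Wed, 23 Mar 2011 00:06:13 GMT',
    'Server: Apache/1.3.12 (Unix) mod_throttle/3.1.2 tomcat/1.0 PHP/5',
    'Last-Modified: Wed, 23 Jul 2008 18:03:29 GMT',
    'ETag: "74db-9063-3d3eebf1"',
    'Accept-Ranges: bytes',
    'Content-Length: 36963',
    'Content-Type: image/jpeg',
]) . "\r\n\r\n";

$fields = parse_http_header($header);
echo handling_for($fields['content-type']), "\n";
```

Field names are lowercased before lookup because HTTP header names are case-insensitive.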

Examining LIB_http’s Source Code

Most webbots in this book will use the library LIB_http to download pages from the Internet. If you plan to explore any of the webbot examples that appear later in this book, you should obtain a copy of this library; the latest version is available for download at this book’s website. We’ll explore some of the defaults and functions of LIB_http here.

LIB_http Defaults

At the very beginning of the library is a set of defaults, as shown in Example 3-10.

Example 3-10. LIB_http defaults

define("WEBBOT_NAME", "Test Webbot");       # How your webbot will appear in server logs
define("CURL_TIMEOUT", 25);                 # Time (seconds) to wait for network response
define("COOKIE_FILE", "c:\cookie.txt");     # Location of cookie file
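
One plausible way such defaults could be applied to a PHP/CURL session is sketched below. The function default_curl_options() and the MAX_REDIRECTS constant are assumptions made for illustration (the limit of four redirections comes from the defaults listed earlier in this chapter), and the mapping to specific CURLOPT_ options is a guess at a reasonable configuration, not a copy of the library’s source. The SSL certificate setting is left out of the sketch.

```php
<?php
// Constants repeated from Example 3-10 so that this sketch is self-contained.
define("WEBBOT_NAME", "Test Webbot");
define("CURL_TIMEOUT", 25);
define("COOKIE_FILE", 'c:\cookie.txt');
// Hypothetical constant: the redirect limit described in the list of defaults.
define("MAX_REDIRECTS", 4);

// Hypothetical helper: express the defaults as a PHP/CURL option array.
// Requires the curl extension, which defines the CURLOPT_ constants.
function default_curl_options(): array
{
    return [
        CURLOPT_USERAGENT      => WEBBOT_NAME,   // Name shown in server logs
        CURLOPT_TIMEOUT        => CURL_TIMEOUT,  // Give up after this many seconds
        CURLOPT_COOKIEJAR      => COOKIE_FILE,   // Where received cookies are written
        CURLOPT_COOKIEFILE     => COOKIE_FILE,   // Where stored cookies are read from
        CURLOPT_FOLLOWLOCATION => true,          // Follow redirections sent in headers
        CURLOPT_MAXREDIRS      => MAX_REDIRECTS, // Stop after this many redirections
        CURLOPT_RETURNTRANSFER => true,          // Return the file as a string
    ];
}
```

An array built this way could be handed to curl_setopt_array() for a handle created with curl_init(), which keeps every default in one place where it is easy to adjust.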

LIB_http Functions

The functions shown in Example 3-11 are available within LIB_http. All of these functions return the array defined earlier, containing downloaded files, error messages, and the status of the file transfer.

Example 3-11. LIB_http functions

http_get($target, $ref)                              # Simple get request (w/o header)
http_get_withheader($target, $ref)                   # Simple get request (w/ header)
http_get_form($target, $ref, $data_array)            # Form (method ="GET", w/o header)
http_get_form_withheader($target, $ref, $data_array) # Form (method ="GET", w/ header)
http_post_form($target, $ref, $data_array)           # Form (method ="POST", w/o header)
http_post_withheader($target, $ref, $data_array)     # Form (method ="POST", w/ header)
http_header($target, $ref)                           # Only returns header

^[13] A complete list of HTTP codes can be found in Appendix B.