Creating the Image-Capturing Webbot

This example webbot relies on a library called LIB_download_images, which is available from this book’s website. The library contains the functions described in this section: download_images_for_page(), download_binary_file(), and a routine that re-creates directory structures on the local drive.

For clarity, I will break down this library into highlights and accompanying explanations.

Figure 9-1. Re-creating a file structure for stored images

The first script (Example 9-2) shows the main webbot used in Example 9-1 and Figure 9-1.

Example 9-2. Executing the image-capturing webbot

include("LIB_download_images.php");                              # Include the image-download library
$target="http://www.nasa.gov/mission_pages/viking/index.html";   # Define the target web page
download_images_for_page($target);                               # Download the images on that page

This short webbot script loads the LIB_download_images library, defines a target web page, and calls the download_images_for_page() function, which gets the images and stores them in a complementary directory structure on the local drive.

Note

Please be aware that the scripts in this chapter, which are available at http://www.WebbotsSpidersScreenScrapers.com, are created for demonstration purposes only. Although they should work in most cases, they aren’t production ready. You may find long or complicated directory structures, odd filenames, or unusual file formats that will cause these scripts to crash.

Our image-grabbing webbot uses the function download_binary_file(), which is designed to download binary files, like images. Other binary files you may encounter include executable files, compressed files, and system files. Up to this point, the only file downloads discussed have been ASCII (text) files, like web pages. The distinction between downloading binary and ASCII files is important because the two formats differ: for example, random byte combinations in a binary file may be misinterpreted as end-of-file markers by a download routine that expects ASCII. If you download a binary file with a script designed for ASCII files, you stand a good chance of ending up with a partial or corrupt file.
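To see the difference in practice, consider copying a downloaded image with PHP’s ordinary file functions. The fragment below is only an illustration (the filenames are placeholders, and it is not part of LIB_download_images): opening both files in binary mode with the "b" flag keeps platforms that distinguish text from binary files, such as Windows, from translating bytes or treating a stray byte as an end-of-file marker.

$in  = fopen("logo.jpg", "rb");               # "b" opens the file in binary-safe mode
$out = fopen("logo_copy.jpg", "wb");
while(!feof($in))
    fwrite($out, fread($in, 8192));           # Copy raw 8KB chunks with no text translation
fclose($in);
fclose($out);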

Even though PHP has its own built-in binary-safe download functions, this webbot uses a custom download script that leverages PHP/CURL to download images from SSL sites (when the protocol is HTTPS), follow HTTP file redirections, and send referer information to the server.

Sending proper referer information is crucial because many websites will stop other websites from “borrowing” images. Borrowing images from other websites (without hosting the images on your server) is bad etiquette and is commonly called hijacking. If your webbot doesn’t include a proper referer value, its activity could be confused with a website that is hijacking images. Example 9-3 shows the file download script used by this webbot.
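Because Example 9-3 isn’t reproduced here, the short sketch below suggests how a PHP/CURL routine with those properties might look. This is an illustration only, not the library’s actual download_binary_file(): the function name and parameters are my own assumptions, and the real script may set different options.

function sketch_download_binary_file($url, $referer)
    {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);   # Return the file instead of printing it
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, TRUE);   # Treat the payload as raw binary data
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);   # Follow HTTP file redirections
    curl_setopt($ch, CURLOPT_REFERER, $referer);      # Send referer information to the server
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);  # Allow downloads from SSL (HTTPS) sites
    $data = curl_exec($ch);                           # Fetch the file
    curl_close($ch);
    return $data;                                     # Raw image bytes, or FALSE on failure
    }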

The script that creates the directory structure shown in Figure 9-1 is derived from a user-contributed routine found on the PHP website (http://www.php.net). Users commonly submit scripts like this one when they find something they want to share with the PHP community. In this case, it’s a function that expands on mkdir() by creating complete directory structures with multiple directories at once. I modified the function slightly for our purposes. This function, shown in Example 9-4, creates any file path that doesn’t already exist on the hard drive and, if needed, creates multiple directories for a single file path. For example, if the image’s file path is images/templates/November, this function creates all three directories (images, templates, and November) to satisfy the entire file path.

The script in Example 9-4 places all the directories in the path into an array and then attempts to re-create that array, one directory at a time, on the local filesystem. Only nonexistent directories are created.
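If you want to experiment with the technique before downloading the library, the sketch below implements the same idea under my own assumptions; the function name is invented, and the book’s actual Example 9-4 may be organized differently. (Recent versions of PHP can also create nested directories in a single call with mkdir($path, 0777, true).)

function sketch_mkpath($path)
    {
    $directories = explode("/", $path);             # Break the path into individual directories
    $path_so_far = "";
    foreach($directories as $directory)
        {
        $path_so_far = $path_so_far . $directory . "/";   # Rebuild the path one directory at a time
        if(!is_dir($path_so_far))                          # Create only directories that don't exist yet
            mkdir($path_so_far);
        }
    }

# For example, sketch_mkpath("images/templates/November") creates images,
# images/templates, and images/templates/November, as needed.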

The main function for this webbot, download_images_for_page(), is broken down into highlights and explained below. As mentioned earlier, this function—and the entire LIB_download_images library—is available at this book’s website.
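Before working through those highlights, it may help to see the function’s overall shape. The sketch below is only my rough approximation, based on the behavior described above; the real download_images_for_page() relies on the library’s own PHP/CURL and parsing routines and handles details (relative image references, unusual filenames) that this sketch deliberately skips.

function sketch_download_images_for_page($target)
    {
    $page = file_get_contents($target);                           # Fetch the target web page
    preg_match_all('/<img[^>]+src\s*=\s*["\']([^"\']+)["\']/i',   # Collect the src of every <img> tag
                   $page, $matches);

    foreach($matches[1] as $src)
        {
        if(!preg_match('#^https?://#i', $src))                    # Sketch handles absolute URLs only;
            continue;                                             # the library resolves relative ones

        $path = parse_url($src, PHP_URL_PATH);                    # e.g. /images/viking/lander.jpg
        if($path == "" || substr($path, -1) == "/")
            continue;                                             # Skip URLs without a usable filename

        $local_dir = "." . dirname($path);                        # Mirror the remote directory locally
        if(!is_dir($local_dir))
            mkdir($local_dir, 0777, true);                        # Create missing directories

        $image = file_get_contents($src);                         # Fetch the image (binary-safe)
        if($image !== false)
            file_put_contents("." . $path, $image);               # Store it in the re-created structure
        }
    }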