Our example spider will reuse the image harvester (described in Chapter 9) that downloads images for an entire website. The image harvester is this spider’s payload—the task that it will perform on every web page it visits. While this spider performs a useful task, its primary purpose is to demonstrate how spiders work, so I’ve made design compromises that limit its scalability on larger tasks. After we explore this example spider, I’ll conclude with recommendations for making a scalable spider suitable for larger projects.
Example 17-1 and Example 17-2 are the main scripts for the example spider. Initially, the spider is limited to collecting links. Since the payload adds complexity, we’ll include it after you’ve had an opportunity to understand how the basic spider works.
Example 17-1. Main spider script, initialization
# Initialization
include("LIB_http.php");                  // http library
include("LIB_parse.php");                 // parse library
include("LIB_resolve_addresses.php");     // Address resolution library
include("LIB_exclusion_list.php");        // List of excluded keywords
include("LIB_simple_spider.php");         // Spider routines used by this app

set_time_limit(3600);                     // Don't let PHP time out

$SEED_URL        = "http://www.YourSiteHere.com";
$MAX_PENETRATION = 1;                     // Set spider penetration depth
$FETCH_DELAY     = 1;                     // Wait 1 second between page fetches
$ALLOW_OFFSITE   = false;                 // Don't let spider roam from seed domain

$spider_array = array();                  // Initialize the array that holds links
The script in Example 17-1 loads the required libraries and initializes settings that tell the spider how to operate. This project introduces two new libraries: an exclusion list (LIB_exclusion_list.php) and the spider library used for this exercise (LIB_simple_spider.php). We’ll explain both of these new libraries as we use them.
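To give you a feel for what an exclusion list contains, here is a rough sketch of the sort of keyword array LIB_exclusion_list.php might define. The array name $exclusion_array and the specific entries are illustrative assumptions, not the library’s actual contents:

# Hypothetical exclusion list (illustration only; the real
# LIB_exclusion_list.php may use different names and entries)
$exclusion_array = array(
    "mailto:",        // Email links aren't web pages
    "javascript:",    // Script pseudo-links can't be downloaded
    "#",              // In-page anchors point within pages already seen
    ".jpg",           // Skip images and other non-HTML files
    ".gif",
    ".pdf"
    );

Any link containing one of these keywords would be rejected before it is archived, which keeps the spider from wasting fetches on addresses it can’t usefully download.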
In any PHP spider design, the default script time-out of 30 seconds needs to be extended, because a crawl may run for minutes or even hours. The script in Example 17-1 sets the PHP script time-out to one hour (3,600 seconds) with the set_time_limit(3600) command.
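If you expect a crawl to run even longer than that, PHP also accepts a value of zero, which disables the time-out entirely:

set_time_limit(0);    // Disable PHP's script time-out (use with care)

Use this option cautiously, since a runaway spider with no time-out will happily run forever.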
The example spider is configured to collect enough information to demonstrate how spiders work, but not so much that the sheer volume of data distracts from the demonstration. You can adjust these settings once you understand the effect each has on the operation of your spider. For now, the maximum penetration level is set to 1. This means that the spider will harvest links from the seed URL and the pages that the links on the seed URL reference, but it will not download any pages that are more than one link away from the seed URL. Even when you tie the spider’s hands—as we’ve done here—it still collects a ridiculously large amount of data. When limited to one penetration level, the spider still harvested 583 links when pointed at http://www.schrenk.com. This number excludes redundant links, which would otherwise raise the number of harvested links to 1,930. For demonstration purposes, the spider also rejects links that are not on the parent domain.
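To make that filtering concrete, the following sketch shows one way a spider can reject redundant and offsite links using PHP’s built-in parse_url(). The function exclude_link() is a hypothetical helper written for illustration; LIB_simple_spider may implement these checks differently:

# Hypothetical link filter (illustration only; not the library's actual code)
function exclude_link($seed_url, $link, $link_array, $allow_offsite)
    {
    # Reject links we've already collected (redundant links)
    if(in_array($link, $link_array))
        return true;

    # Unless offsite roaming is allowed, reject links that
    # leave the seed URL's domain
    if(!$allow_offsite)
        {
        $seed_host = parse_url($seed_url, PHP_URL_HOST);
        $link_host = parse_url($link, PHP_URL_HOST);
        if(strcasecmp($seed_host, $link_host) != 0)
            return true;
        }

    return false;    // Keep this link
    }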
The main spider script, shown in Example 17-2, is quite simple. Much of this simplicity, however, comes at the cost of storing links in an array instead of a more scalable (and more complicated) database. As you can see, the functions in the libraries make it easy to download web pages, harvest links, exclude unwanted links, and fully resolve addresses.
Example 17-2. Main spider script, harvesting links
# Get links from $SEED_URL
echo "Harvesting Seed URL\n";
$temp_link_array = harvest_links($SEED_URL);
$spider_array = archive_links($spider_array, 0, $temp_link_array);

# Spider links from remaining penetration levels
for($penetration_level=1; $penetration_level<=$MAX_PENETRATION; $penetration_level++)
    {
    $previous_level = $penetration_level - 1;
    for($xx=0; $xx<count($spider_array[$previous_level]); $xx++)
        {
        unset($temp_link_array);
        $temp_link_array = harvest_links($spider_array[$previous_level][$xx]);
        echo "Level=$penetration_level, xx=$xx of ".count($spider_array[$previous_level])." \n";
        $spider_array = archive_links($spider_array, $penetration_level, $temp_link_array);
        }
    }
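We haven’t looked inside LIB_simple_spider yet, but you can already infer the shape of $spider_array from this script: it is a two-dimensional array indexed first by penetration level and then by link number. The following sketch shows archive_links() as the main script appears to use it; treat it as a guess at the interface, not the library’s actual implementation:

# Sketch of archive_links() as the main script appears to use it;
# the real implementation in LIB_simple_spider.php may differ
function archive_links($spider_array, $penetration_level, $temp_link_array)
    {
    foreach($temp_link_array as $link)
        {
        // Store each link under its penetration level, e.g.,
        // $spider_array[0][0] = "http://www.YourSiteHere.com/about"
        $spider_array[$penetration_level][] = $link;
        }
    return $spider_array;
    }

Keying the array by penetration level is what lets the main loop walk level by level: each pass reads the links archived at the previous level and archives whatever it harvests at the current one.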
When the spider uses www.schrenk.com as a seed URL, it harvests and rejects links, as shown in Example 17-3.
Now that you’ve seen the main spider script, an exploration of the routines in LIB_simple_spider will provide insight into how it really works.