Special spider functions are found in the LIB_simple_spider
library. This library provides functions that parse links from a web page when given a URL, archive harvested links in an array, identify the root domain for a URL, and identify links that should be excluded from the archive.
This library, as well as the other scripts featured in this chapter, is available for download at this book’s website.
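Taken together, these functions might compose into a spider loop along the following lines. This is only a hedged sketch, not the book's actual Example 17-1/17-2 listing: the loop structure and the $MAX_PENETRATION name are assumptions, and trivial stubs stand in for the real library functions so the sketch is self-contained.

```php
<?php
// Hedged sketch of how the LIB_simple_spider functions might combine.
// The stubs below stand in for the real library functions.
function harvest_links($url) {
    // Stub: the real function downloads $url and parses its links
    return array($url . "/webbots.php", $url . "/contact.php");
}
function archive_links($spider_array, $penetration_level, $temp_link_array) {
    // Stub: the real function also filters links with excluded_link()
    foreach ($temp_link_array as $link)
        $spider_array[$penetration_level][] = $link;
    return $spider_array;
}

$SEED_URL = "http://www.schrenk.com";
$MAX_PENETRATION = 1;  // assumed configuration name

// Level 0: harvest and archive links found on the seed page
$spider_array = archive_links(array(), 0, harvest_links($SEED_URL));

// Deeper levels: follow each archived link up to $MAX_PENETRATION
for ($level = 1; $level <= $MAX_PENETRATION; $level++)
    foreach ($spider_array[$level - 1] as $url)
        $spider_array = archive_links($spider_array, $level,
                                      harvest_links($url));

print_r(array_keys($spider_array));
```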
Example 17-3. Running the simple spider from Example 17-1 and Example 17-2
Harvested: http://video.google.com/videoplay?docid=4221457095668033104&hl=en
Harvested: http://www.apogeonline.com/libri/88-503-2658-0/scheda
Harvested: http://www.schrenk.com/index.php
Harvested: http://www.schrenk.com/strategies.php
Harvested: http://www.schrenk.com/webbots.php
Harvested: http://www.schrenk.com/publications.php
Harvested: http://www.schrenk.com/profile.php
Harvested: http://www.schrenk.com/contact.php
Harvested: http://www.schrenk.com/recommended_reading/recommended_reading.php?webbots_spiders_and_screen_scrapers
Harvested: http://www.amazon.com/gp/product/1593271204/?tag=schrenkcom-20
Harvested: http://www.schrenk.com/contact.php
Harvested: http://www.schrenk.com/strategies.php
Harvested: http://www.schrenk.com/webbots.php
Harvested: http://www.schrenk.com/contact.php
Level=1, xx=0 of 9
Ignored offsite link: http://www.tcij.org/training/courses/nov-7-and-8
Ignored offsite link: http://www.defcon.org/html/defcon-17/dc-17-speakers.html#Schrenk
Ignored offsite link: http://www.tcij.org/
Ignored offsite link: http://schrenk.com/nostarch/webbots
Ignored offsite link: http://www.gotop.com.tw/
Ignored offsite link: http://www.vvoj.eu/
Ignored offsite link: http://www.fondspascaldecroos.org/index.php?page=394&detail=1810
Ignored offsite link: http://www.defcon.org/
Ignored offsite link: http://extra.volkskrant.nl/verpleeghuizen/
Ignored offsite link: http://schrenk.com/nostarch/webbots
Ignored offsite link: http://www.hotelworldexpo.com/
Ignored offsite link: http://cesweb.org
Ignored offsite link: http://www.phparch.com
Ignored offsite link: http://video.google.com/videoplay?docid=4221457095668033104&hl=en
Ignored offsite link: http://www.apogeonline.com/libri/88-503-2658-0/scheda
Ignored redundant link: http://www.schrenk.com/strategies.php
Ignored redundant link: http://www.schrenk.com/webbots.php
Ignored redundant link: http://www.schrenk.com/publications.php
Ignored redundant link: http://www.schrenk.com/contact.php
The harvest_links() function downloads the specified web page and returns all of its links in an array. This function, shown in Example 17-4, uses the $DELAY setting to keep the spider from sending too many requests to the server over too short a period.[55]
Example 17-4. Harvesting links from a web page with the harvest_links() function
function harvest_links($url)
    {
    # Initialize
    global $DELAY;
    $link_array = array();

    # Get page base for $url (used to create fully resolved URLs for the links)
    $page_base = get_base_page_address($url);

    # $DELAY creates a random delay period between 1 second and full delay period
    $random_delay = rand(1, rand(1, $DELAY));

    # Download webpage
    sleep($random_delay);
    $downloaded_page = http_get($url, "");

    # Parse links
    $anchor_tags = parse_array($downloaded_page['FILE'], "<a", "</a>", EXCL);

    # Get http attributes for each tag into an array
    for($xx=0; $xx<count($anchor_tags); $xx++)
        {
        $href = get_attribute($anchor_tags[$xx], "href");
        $resolved_address = resolve_address($href, $page_base);
        $link_array[] = $resolved_address;
        echo "Harvested: ".$resolved_address." \n";
        }
    return $link_array;
    }
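The library helpers this function relies on (http_get(), parse_array(), get_attribute(), resolve_address()) aren't reproduced here, so as a self-contained illustration of just the link-extraction step, the following sketch substitutes PHP's built-in preg_match_all() for the library's parse_array()/get_attribute() pair:

```php
<?php
// Self-contained sketch of the parsing step only: extract href values
// from anchor tags with preg_match_all() instead of the library's
// parse_array()/get_attribute() helpers.
function harvest_links_demo($html) {
    $link_array = array();
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);
    foreach ($matches[1] as $href)
        $link_array[] = $href;   // note: hrefs are not resolved here
    return $link_array;
}

$html = '<a href="http://www.schrenk.com/webbots.php">Webbots</a> '
      . '<a href="contact.php">Contact</a>';
print_r(harvest_links_demo($html));
```

Unlike the real harvest_links(), this sketch leaves relative links such as contact.php unresolved; in the library that job belongs to resolve_address() and get_base_page_address().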
The script in Example 17-5 uses the link array collected by the previous function to create an archival array. The first element of the archival array identifies the penetration level where the link was found, while the second contains the actual link.
Example 17-5. Archiving links in $spider_array
function archive_links($spider_array, $penetration_level, $temp_link_array)
    {
    for($xx=0; $xx<count($temp_link_array); $xx++)
        {
        # Don't add excluded links to $spider_array
        if(!excluded_link($spider_array, $temp_link_array[$xx]))
            {
            $spider_array[$penetration_level][] = $temp_link_array[$xx];
            }
        }
    return $spider_array;
    }
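To see the archival structure this produces, here is a minimal self-contained sketch in which a stub stands in for the library's excluded_link() (the stub rejects only JavaScript links, for illustration; the real function applies all the tests shown in Example 17-7):

```php
<?php
// Stub standing in for the library's excluded_link(): reject only
// links containing "javascript", for illustration
function excluded_link($spider_array, $link) {
    return stristr($link, "javascript") != false;
}

function archive_links($spider_array, $penetration_level, $temp_link_array) {
    for ($xx = 0; $xx < count($temp_link_array); $xx++) {
        # Don't add excluded links to $spider_array
        if (!excluded_link($spider_array, $temp_link_array[$xx]))
            $spider_array[$penetration_level][] = $temp_link_array[$xx];
    }
    return $spider_array;
}

$links = array("http://www.schrenk.com/webbots.php", "javascript:void(0)");
$spider_array = archive_links(array(), 1, $links);
print_r($spider_array);
// Only the non-JavaScript link is archived, under penetration level 1
```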
The function get_domain() parses the domain from the target URL. For example, given a target URL like https://www.schrenk.com/store/product_list.php, get_domain() returns www.schrenk.com. The spider compares the domain of each harvested link to the domain of the seed URL to determine whether the link points to a URL outside the seed URL's domain, as shown in Example 17-6.
Example 17-6. Parsing the root domain from a fully resolved URL
function get_domain($url)
    {
    // Remove protocol from $url
    $url = str_replace("http://", "", $url);
    $url = str_replace("https://", "", $url);

    // Remove page and directory references
    if(stristr($url, "/"))
        $url = substr($url, 0, strpos($url, "/"));

    return $url;
    }
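Applied to the URLs from this chapter, the function behaves as follows (get_domain() is repeated here verbatim so the example is self-contained):

```php
<?php
// get_domain() as defined in Example 17-6
function get_domain($url) {
    // Remove protocol from $url
    $url = str_replace("http://", "", $url);
    $url = str_replace("https://", "", $url);
    // Remove page and directory references
    if (stristr($url, "/"))
        $url = substr($url, 0, strpos($url, "/"));
    return $url;
}

echo get_domain("https://www.schrenk.com/store/product_list.php") . "\n";
// www.schrenk.com
echo get_domain("http://video.google.com/videoplay?docid=123") . "\n";
// video.google.com
```

Note that the function returns the full hostname, including any www. prefix, so http://schrenk.com and http://www.schrenk.com compare as different domains; keep that in mind when choosing the seed URL.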
This function is used only when the $ALLOW_OFFSITE configuration is set to false.
The excluded_link() function examines each link and determines whether it should be included in the archive of harvested links. Reasons for excluding a link include the following:
The link is contained within JavaScript.
The link already appears in the archive.
The link contains keywords listed in the exclusion array.
The link is to a different domain.
Example 17-7. Excluding unwanted links
function excluded_link($spider_array, $link)
    {
    # Initialization
    global $exclusion_array, $ALLOW_OFFSITE, $SEED_URL;
    $exclude = false;

    // Exclude links that are JavaScript commands
    if(stristr($link, "javascript"))
        {
        echo "Ignored JavaScript function: $link\n";
        $exclude = true;
        }

    // Exclude redundant links
    for($xx=0; $xx<count($spider_array); $xx++)
        {
        $saved_link = "";
        while(isset($saved_link))
            {
            $saved_link = array_pop($spider_array[$xx]);
            if($link == $saved_link)
                {
                echo "Ignored redundant link: $link\n";
                $exclude = true;
                break;
                }
            }
        }

    // Exclude links found in $exclusion_array
    for($xx=0; $xx<count($exclusion_array); $xx++)
        {
        if(stristr($link, $exclusion_array[$xx]))
            {
            echo "Ignored excluded link: $link\n";
            $exclude = true;
            break;
            }
        }

    // Exclude offsite links if requested
    if($ALLOW_OFFSITE == false)
        {
        if(get_domain($link) != get_domain($SEED_URL))
            {
            echo "Ignored offsite link: $link\n";
            $exclude = true;
            }
        }

    return $exclude;
    }
There are several reasons to exclude links. For example, it’s best to ignore any links referenced within JavaScript because—without a proper JavaScript interpreter—those links may yield unpredictable results. Removing redundant links makes the spider run faster and reduces the amount of data the spider needs to manage. The exclusion list allows the spider to ignore undesirable links to places like Google AdSense, banner ads, or other places you don’t want the spider to go.
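The exclusion list mentioned above might be configured like this; the specific keywords are hypothetical examples, not the book's configuration, and any link containing one of these substrings is rejected by excluded_link():

```php
<?php
// Hypothetical exclusion keywords: a link containing any of these
// substrings is ignored by excluded_link()
$exclusion_array = array();
$exclusion_array[] = "javascript";             // JavaScript pseudo-links
$exclusion_array[] = "mailto:";                // email links
$exclusion_array[] = "googlesyndication.com";  // ad servers

// stristr() performs the same case-insensitive substring test
// that excluded_link() uses
var_dump(stristr("mailto:info@example.com", $exclusion_array[1]) != false);
```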