Special spider functions are found in the LIB_simple_spider
library. This library provides functions that parse links from a web page when given a URL, archive harvested links in an array, identify the root domain for a URL, and identify links that should be excluded from the archive.
This library, as well as the other scripts featured in this chapter, is available for download at this book’s website.
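Taken together, these functions might compose into a spider loop along the following lines. This is only a hedged sketch, not the book's actual Example 17-1/17-2 listing: the loop structure and the $MAX_PENETRATION name are assumptions, and trivial stubs stand in for the real library functions so the sketch is self-contained.

```php
<?php
// Hedged sketch of how the LIB_simple_spider functions might combine.
// The stubs below stand in for the real library functions.
function harvest_links($url) {
    // Stub: the real function downloads $url and parses its links
    return array($url . "/webbots.php", $url . "/contact.php");
}
function archive_links($spider_array, $penetration_level, $temp_link_array) {
    // Stub: the real function also filters links with excluded_link()
    foreach ($temp_link_array as $link)
        $spider_array[$penetration_level][] = $link;
    return $spider_array;
}

$SEED_URL = "http://www.schrenk.com";
$MAX_PENETRATION = 1;  // assumed configuration name

// Level 0: harvest and archive links found on the seed page
$spider_array = archive_links(array(), 0, harvest_links($SEED_URL));

// Deeper levels: follow each archived link up to $MAX_PENETRATION
for ($level = 1; $level <= $MAX_PENETRATION; $level++)
    foreach ($spider_array[$level - 1] as $url)
        $spider_array = archive_links($spider_array, $level,
                                      harvest_links($url));

print_r(array_keys($spider_array));
```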
Example 17-3. Running the simple spider from Example 17-1 and Example 17-2
Harvested: http://video.google.com/videoplay?docid=4221457095668033104&hl=en
Harvested: http://www.apogeonline.com/libri/88-503-2658-0/scheda
Harvested: http://www.schrenk.com/index.php
Harvested: http://www.schrenk.com/strategies.php
Harvested: http://www.schrenk.com/webbots.php
Harvested: http://www.schrenk.com/publications.php
Harvested: http://www.schrenk.com/profile.php
Harvested: http://www.schrenk.com/contact.php
Harvested: http://www.schrenk.com/recommended_reading/recommended_reading.php?webbots_spiders_and_screen_scrapers
Harvested: http://www.amazon.com/gp/product/1593271204/?tag=schrenkcom-20
Harvested: http://www.schrenk.com/contact.php
Harvested: http://www.schrenk.com/strategies.php
Harvested: http://www.schrenk.com/webbots.php
Harvested: http://www.schrenk.com/contact.php
Level=1, xx=0 of 9
Ignored offsite link: http://www.tcij.org/training/courses/nov-7-and-8
Ignored offsite link: http://www.defcon.org/html/defcon-17/dc-17-speakers.html#Schrenk
Ignored offsite link: http://www.tcij.org/
Ignored offsite link: http://schrenk.com/nostarch/webbots
Ignored offsite link: http://www.gotop.com.tw/
Ignored offsite link: http://www.vvoj.eu/
Ignored offsite link: http://www.fondspascaldecroos.org/index.php?page=394&detail=1810
Ignored offsite link: http://www.defcon.org/
Ignored offsite link: http://extra.volkskrant.nl/verpleeghuizen/
Ignored offsite link: http://schrenk.com/nostarch/webbots
Ignored offsite link: http://www.hotelworldexpo.com/
Ignored offsite link: http://cesweb.org
Ignored offsite link: http://www.phparch.com
Ignored offsite link: http://video.google.com/videoplay?docid=4221457095668033104&hl=en
Ignored offsite link: http://www.apogeonline.com/libri/88-503-2658-0/scheda
Ignored redundant link: http://www.schrenk.com/strategies.php
Ignored redundant link: http://www.schrenk.com/webbots.php
Ignored redundant link: http://www.schrenk.com/publications.php
Ignored redundant link: http://www.schrenk.com/contact.php
The harvest_links() function downloads the specified web page and returns all of its links in an array. This function, shown in Example 17-4, uses the $DELAY setting to keep the spider from sending too many requests to the server over too short a period.[55]
Example 17-4. Harvesting links from a web page with the harvest_links() function
function harvest_links($url)
    {
    # Initialize
    global $DELAY;
    $link_array = array();

    # Get page base for $url (used to create fully resolved URLs for the links)
    $page_base = get_base_page_address($url);

    # $DELAY creates a random delay period between 1 second and full delay period
    $random_delay = rand(1, rand(1, $DELAY));

    # Download webpage
    sleep($random_delay);
    $downloaded_page = http_get($url, "");

    # Parse links
    $anchor_tags = parse_array($downloaded_page['FILE'], "<a", "</a>", EXCL);

    # Get http attributes for each tag into an array
    for($xx=0; $xx<count($anchor_tags); $xx++)
        {
        $href = get_attribute($anchor_tags[$xx], "href");
        $resolved_address = resolve_address($href, $page_base);
        $link_array[] = $resolved_address;
        echo "Harvested: ".$resolved_address." \n";
        }
    return $link_array;
    }
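The library helpers this function relies on (http_get(), parse_array(), get_attribute(), resolve_address()) aren't reproduced here, so as a self-contained illustration of just the link-extraction step, the following sketch substitutes PHP's built-in preg_match_all() for the library's parse_array()/get_attribute() pair:

```php
<?php
// Self-contained sketch of the parsing step only: extract href values
// from anchor tags with preg_match_all() instead of the library's
// parse_array()/get_attribute() helpers.
function harvest_links_demo($html) {
    $link_array = array();
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);
    foreach ($matches[1] as $href)
        $link_array[] = $href;   // note: hrefs are not resolved here
    return $link_array;
}

$html = '<a href="http://www.schrenk.com/webbots.php">Webbots</a> '
      . '<a href="contact.php">Contact</a>';
print_r(harvest_links_demo($html));
```

Unlike the real harvest_links(), this sketch leaves relative links such as contact.php unresolved; in the library that job belongs to resolve_address() and get_base_page_address().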
The script in Example 17-5 uses the link array collected by the previous function to create an archival array. The first element of the archival array identifies the penetration level where the link was found, while the second contains the actual link.
Example 17-5. Archiving links in $spider_array
function archive_links($spider_array, $penetration_level, $temp_link_array)
    {
    for($xx=0; $xx<count($temp_link_array); $xx++)
        {
        # Don't add excluded links to $spider_array
        if(!excluded_link($spider_array, $temp_link_array[$xx]))
            {
            $spider_array[$penetration_level][] = $temp_link_array[$xx];
            }
        }
    return $spider_array;
    }
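To see the archival structure this produces, here is a minimal self-contained sketch in which a stub stands in for the library's excluded_link() (the stub rejects only JavaScript links, for illustration; the real function applies all the tests shown in Example 17-7):

```php
<?php
// Stub standing in for the library's excluded_link(): reject only
// links containing "javascript", for illustration
function excluded_link($spider_array, $link) {
    return stristr($link, "javascript") != false;
}

function archive_links($spider_array, $penetration_level, $temp_link_array) {
    for ($xx = 0; $xx < count($temp_link_array); $xx++) {
        # Don't add excluded links to $spider_array
        if (!excluded_link($spider_array, $temp_link_array[$xx]))
            $spider_array[$penetration_level][] = $temp_link_array[$xx];
    }
    return $spider_array;
}

$links = array("http://www.schrenk.com/webbots.php", "javascript:void(0)");
$spider_array = archive_links(array(), 1, $links);
print_r($spider_array);
// Only the non-JavaScript link is archived, under penetration level 1
```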
The function get_domain() parses the domain from the target URL. For example, given a target URL like https://www.schrenk.com/store/product_list.php, get_domain() returns www.schrenk.com. The spider compares the domain of each harvested link to the domain of the seed URL to determine whether the link points to a URL outside the seed URL's domain, as shown in Example 17-6.
Example 17-6. Parsing the root domain from a fully resolved URL
function get_domain($url)
    {
    // Remove protocol from $url
    $url = str_replace("http://", "", $url);
    $url = str_replace("https://", "", $url);

    // Remove page and directory references
    if(stristr($url, "/"))
        $url = substr($url, 0, strpos($url, "/"));

    return $url;
    }
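Applied to the URLs from this chapter, the function behaves as follows (get_domain() is repeated here verbatim so the example is self-contained):

```php
<?php
// get_domain() as defined in Example 17-6
function get_domain($url) {
    // Remove protocol from $url
    $url = str_replace("http://", "", $url);
    $url = str_replace("https://", "", $url);
    // Remove page and directory references
    if (stristr($url, "/"))
        $url = substr($url, 0, strpos($url, "/"));
    return $url;
}

echo get_domain("https://www.schrenk.com/store/product_list.php") . "\n";
// www.schrenk.com
echo get_domain("http://video.google.com/videoplay?docid=123") . "\n";
// video.google.com
```

Note that the function returns the full hostname, including any www. prefix, so http://schrenk.com and http://www.schrenk.com compare as different domains; keep that in mind when choosing the seed URL.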
This function is used only when the $ALLOW_OFFSITE configuration is set to false.
The excluded_link() function examines each link and determines whether it should be included in the archive of harvested links. Reasons for excluding a link include the following:
The link is contained within JavaScript.
The link already appears in the archive.
The link contains keywords listed in the exclusion array.
The link is to a different domain.
Example 17-7. Excluding unwanted links
function excluded_link($spider_array, $link)
    {
    # Initialization
    global $exclusion_array, $ALLOW_OFFSITE, $SEED_URL;
    $exclude = false;

    // Exclude links that are JavaScript commands
    if(stristr($link, "javascript"))
        {
        echo "Ignored JavaScript function: $link\n";
        $exclude = true;
        }

    // Exclude redundant links
    for($xx=0; $xx<count($spider_array); $xx++)
        {
        $saved_link = "";
        while(isset($saved_link))
            {
            $saved_link = array_pop($spider_array[$xx]);
            if($link == $saved_link)
                {
                echo "Ignored redundant link: $link\n";
                $exclude = true;
                break;
                }
            }
        }

    // Exclude links found in $exclusion_array
    for($xx=0; $xx<count($exclusion_array); $xx++)
        {
        if(stristr($link, $exclusion_array[$xx]))
            {
            echo "Ignored excluded link: $link\n";
            $exclude = true;
            break;
            }
        }

    // Exclude offsite links if requested
    if($ALLOW_OFFSITE == false)
        {
        if(get_domain($link) != get_domain($SEED_URL))
            {
            echo "Ignored offsite link: $link\n";
            $exclude = true;
            }
        }

    return $exclude;
    }
There are several reasons to exclude links. For example, it’s best to ignore any links referenced within JavaScript because—without a proper JavaScript interpreter—those links may yield unpredictable results. Removing redundant links makes the spider run faster and reduces the amount of data the spider needs to manage. The exclusion list allows the spider to ignore undesirable links to places like Google AdSense, banner ads, or other places you don’t want the spider to go.
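The exclusion list mentioned above might be configured like this; the specific keywords are hypothetical examples, not the book's configuration, and any link containing one of these substrings is rejected by excluded_link():

```php
<?php
// Hypothetical exclusion keywords: a link containing any of these
// substrings is ignored by excluded_link()
$exclusion_array = array();
$exclusion_array[] = "javascript";             // JavaScript pseudo-links
$exclusion_array[] = "mailto:";                // email links
$exclusion_array[] = "googlesyndication.com";  // ad servers

// stristr() performs the same case-insensitive substring test
// that excluded_link() uses
var_dump(stristr("mailto:info@example.com", $exclusion_array[1]) != false);
```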