Chapter 9. Image-Capturing Webbots

In this chapter, I’ll describe a webbot that identifies and downloads all of the images on a web page. This webbot also stores images in a directory structure similar to the directory structure on the target website. This project will show how a seemingly simple webbot can be made more complex by addressing these common problems:

Finding the page base, the address against which all relative addresses on the page are resolved
Dealing with changes to the page base, caused by page redirection
Converting relative addresses into fully resolved URLs
Replicating complex directory structures
Properly downloading image files with binary formats
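
The first three problems in this list can be illustrated with a short sketch. It is written in Python with only the standard library, as an illustration and not as this chapter's own code. The function names page_base and resolve_address and the example.com addresses are hypothetical. The sketch assumes the caller passes in the address the page was finally served from (after any redirection) and, if the page has one, the value of its <base href> tag.

```python
from urllib.parse import urljoin

def page_base(fetched_url, base_href=None):
    """Return the address that relative links on the page resolve against.

    fetched_url is assumed to be the final address after any redirection.
    base_href is the value of a <base href="..."> tag, if the page has one.
    """
    if base_href:
        # A <base> tag overrides the page's own address.
        return urljoin(fetched_url, base_href)
    return fetched_url

def resolve_address(base, address):
    """Convert a relative (or already absolute) address into a full URL."""
    return urljoin(base, address)

# Hypothetical page address used only for illustration.
base = page_base("http://www.example.com/missions/viking/index.html")

print(resolve_address(base, "images/lander.jpg"))
# http://www.example.com/missions/viking/images/lander.jpg
print(resolve_address(base, "/images/orbiter.jpg"))
# http://www.example.com/images/orbiter.jpg
print(resolve_address(base, "../logo.gif"))
# http://www.example.com/missions/logo.gif
```

If a redirect moves the page to a different directory or host, using the originally requested address as the base would resolve every relative image reference to the wrong location, which is why the final address matters.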

In Chapter 17, you’ll expand on these concepts to develop a spider that downloads images from an entire website, not just one page.

Example Image-Capturing Webbot

Our image-capturing webbot downloads a target web page (in this case, the Viking Mission web page on the NASA website) and parses all references to images on the page. The webbot downloads each image, echoes the image’s name and size to the console, and stores the file on the local hard drive. Example 9-1 shows what the webbot’s output looks like when executed from a shell.
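
The parsing step described above might be sketched as follows. This is a minimal illustration using Python's standard html.parser module, not the webbot's actual implementation. The ImageCollector class, the find_image_sources function, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def find_image_sources(html):
    """Return the image addresses referenced in html, in document order."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources

# Hypothetical markup; a real webbot would parse the downloaded page.
sample = ('<html><body><img src="/images/a.jpg">'
          '<IMG SRC="b.gif" alt="b"><img alt="no source"></body></html>')

for src in find_image_sources(sample):
    print("image:", src)
```

The addresses collected this way are still exactly as written in the page, so relative ones must be resolved against the page base before they can be downloaded.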

Example 9-1. The image-capturing bot, when executed from a shell

target = http://www.nasa.gov/mission_pages/viking/index.html
image: /templateimages/redesign/modules/overlay/site_error.gif size: 181
image: /images/content/479931main_pia09942-390.jpg size: 16078
image: /images/content/142889main_Viking_Lander_2.jpg size: 27486
image: /images/content/152358main_pia01522-64.jpg size: 1309
image: /images/content/150824main_viking_64.jpg size: 1507
image: /images/content/143164main_viking_orbiter2.jpg size: 1724
image: /images/content/142840main_viking1_lander.jpg size: 1909
image: /images/content/141692main_frontpage_moonshadows.jpg size: 1750
image: /images/content/152611main_pia00572-th.jpg size: 2108
image: /images/content/193841main_viking_30_vid_226.jpg size: 14415
image: /images/content/193840main_trailblazer_226.jpg size: 14429
image: /images/content/193839main_mars_as_art_226.jpg size: 14613
image: /images/content/193837main_30_yrs_226.jpg size: 5527
image: /images/content/193842main_viking_image_archive_226.jpg size: 12164
image: /images/content/104463main_worldbook_mars_100.jpg size: 2068

On this website, as on many others, several different images share the same filename but have different file paths. For example, the image /templates/logo.gif may be a different graphic than /templates/affiliate/logo.gif. If the webbot saved every image into a single directory, files with the same name would overwrite one another. To avoid this, the webbot re-creates a local copy of the directory structure that exists on the target website. Figure 9-1 shows the directory structure the webbot created when it saved the images it downloaded from the NASA example.
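
One way to mirror the remote directory structure is sketched below in Python. It is an assumption-laden illustration, not the webbot's own code: the functions local_path_for and save_image, the example.com addresses, and the placeholder image bytes are all hypothetical. The file is opened in binary mode so that image data is written without any text translation.

```python
import os
import tempfile
from urllib.parse import urlparse, unquote

def local_path_for(image_url, save_root):
    """Map an image URL to a path under save_root that mirrors the URL path."""
    path = unquote(urlparse(image_url).path)
    # Drop empty, "." and ".." segments so the result stays inside save_root.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    if not parts:
        raise ValueError("URL has no file path: " + image_url)
    return os.path.join(save_root, *parts)

def save_image(image_bytes, image_url, save_root):
    """Write image_bytes to the mirrored location and return that path."""
    destination = local_path_for(image_url, save_root)
    # Create any missing parent directories before writing.
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    with open(destination, "wb") as handle:  # "wb": binary, no translation
        handle.write(image_bytes)
    return destination

# Hypothetical data: two images with the same filename, different paths.
root = tempfile.mkdtemp()
first = save_image(b"GIF89a-one",
                   "http://www.example.com/templates/logo.gif", root)
second = save_image(b"GIF89a-two",
                    "http://www.example.com/templates/affiliate/logo.gif", root)

print(os.path.relpath(first, root))
print(os.path.relpath(second, root))
```

Because each file lands in a directory that matches its path on the server, the two logo.gif files in this example are stored separately instead of one replacing the other.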