Manually saving individual pages from a browser works fine when you are only looking at a few. At some point you will want to automate the process, especially when you want to archive an entire site. wget is the perfect tool for automating these downloads, so I will spend a few pages describing how it can be used.
wget is a Unix command-line tool for the non-interactive download of web pages. You can download it from http://www.gnu.org/software/wget/ if your system does not already have it installed. A binary for Microsoft Windows is also available. It is a very flexible tool with a host of options listed in its manual page.
Capturing a single web page with wget is straightforward. Give it a URL, with no other options, and it will download the page into the current working directory with the same filename as that on the web site:
% wget http://www.oreilly.com/index.html
--08:52:06-- http://www.oreilly.com/index.html
=> `index.html'
Resolving www.oreilly.com... 208.201.239.36, 208.201.239.37
Connecting to www.oreilly.com[208.201.239.36]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54,774 [text/html]
100%[=====================================================>] 54,774       135.31K/s
08:52:07 (134.96 KB/s) - `index.html' saved [54774/54774]
Using the -nv option (non-verbose) suppresses most of these status messages, and the -q option silences it completely.
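For example, using the same O'Reilly URL, the quieter forms would look something like this:
% wget -nv http://www.oreilly.com/index.html
% wget -q http://www.oreilly.com/index.html
The first prints a single summary line per downloaded file; the second prints nothing at all, which is handy when you call wget from a script.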
Saving the file with the same name might be a problem if you are downloading index.html pages from multiple sites. wget handles this by adding numeric suffixes to later files with the same name (index.html.1, index.html.2, etc.). But this can get confusing, so the -O option lets you specify your own output filename, or you can use -O with a - in place of the filename to direct the page to standard output. For example:
% wget -O - http://www.oreilly.com/index.html
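One handy use of this form is to feed the page directly into other command-line tools. As an illustrative example (the grep pattern here is just a placeholder; any filter will do), this pipeline pulls out the page title:
% wget -q -O - http://www.oreilly.com/index.html | grep -i '<title>'
Combining -q with -O - keeps wget's own status messages out of the pipeline.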
In its basic mode, wget will only download the specific page that you ask it to. But many pages require stylesheets and images in order to be displayed correctly. Trying to view the local copy of the HTML page in your browser may produce unpredictable results. To address that, you can use the -p option, which instructs wget to download all prerequisite files along with the specific target page:
% wget -p http://www.oreilly.com/index.html
This invocation of the command will create subdirectories as needed, mirroring the structure on the original web site, and will store images and so on into the appropriate locations. This collection of files should allow you to open the local copy of index.html in your browser and have it appear identical to the original.
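To give a rough idea, the resulting layout on disk might look something like the following, although the exact file and directory names depend entirely on the site and on your version of wget:
www.oreilly.com/
    index.html
    images/header.gif
    style/main.css
The paths are purely illustrative, but they show how the local directory tree mirrors the URLs of the prerequisite files.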
Scam-related web sites tend to be short-lived. They are typically set up at the same time the spam emails are sent out and then either shut down by the ISP or web-hosting company as soon as it is informed about the scam, or taken down by the operator after a few days to prevent people like us from investigating them. So when you see a site that you want to look into, you need to act quickly. But oftentimes it is just not convenient to drop everything else and focus on a new scam.
The solution is to make a copy of the entire target site on your local machine. That gives you a permanent record of the site and allows you to study it at your convenience. wget is perfect for this job. A simple one-line command will mirror an entire site. The logging output of the program can be voluminous, but it helps you follow exactly what is being downloaded.
% wget -m http://www.craic.com
--14:05:48-- http://www.craic.com/
=> `www.craic.com/index.html'
Resolving www.craic.com... 208.12.16.5
Connecting to www.craic.com[208.12.16.5]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15,477 [text/html]
0K .......... ..... 100% 64.17 MB/s
14:05:48 (64.17 MB/s) - `www.craic.com/index.html' saved [15477/15477]
--14:05:48-- http://www.craic.com/rss/rss.xml
=> `www.craic.com/rss/rss.xml'
Connecting to www.craic.com[208.12.16.5]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6,251 [text/xml]
0K ...... 100% 165.59 MB/s
14:05:48 (165.59 MB/s) - `www.craic.com/rss/rss.xml' saved [6251/6251]
--14:05:48-- http://www.craic.com/craic.css
=> `www.craic.com/craic.css'
Connecting to www.craic.com[208.12.16.5]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 754 [text/css]
0K 100% 7.19 MB/s
14:05:48 (7.19 MB/s) - `www.craic.com/craic.css' saved [754/754]
[...]
FINISHED --14:05:49--
Downloaded: 3,979,383 bytes in 101 files
The default behavior will create a directory called www.craic.com in the current working directory and place all downloaded files into subdirectories corresponding to the structure of the target web site.
By default, wget will only download pages that are on the target web site. It will not follow any links to other sites. This can be overridden using the -H option, but think long and hard before you use this. Depending on the sites that are linked to, you might be setting yourself up to download millions of pages.
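If you do decide to follow off-site links, wget has options that can keep the crawl within bounds. As a sketch (the domain list and depth here are arbitrary placeholders), something along these lines restricts the download to a couple of named domains and a shallow link depth:
% wget -r -l 2 -H -D craic.com,oreilly.com http://www.craic.com
The -D option takes a comma-separated list of domains that wget is allowed to visit, and -l limits how many links deep the recursion will go.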
You can visit the downloaded content with a web browser, using the Open File menu item and loading the file for the home page. All being well, everything will look fine and you will be able to navigate through your copy of the site. But two factors can upset this. First, the links to other pages may have been written as absolute URLs in some of the web pages, such as /foo/bar/image.jpg, rather than relative links, such as ../bar/image.jpg. The program will convert such links for you if you use the -k option along with -m:
% wget -m -k http://www.craic.com
This is very convenient but means that some of the links on these pages are no longer identical to the original site. This can lead to confusion when you compare these pages to the originals or to similar pages from other sites. To avoid this, you might want to download the site twice, once as the untouched original version and a second time with the updated links.
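One way to keep both copies is to give each run its own output directory with the -P option (the directory names here are arbitrary):
% wget -m -P original http://www.craic.com
% wget -m -k -P converted http://www.craic.com
The first tree preserves the links exactly as the server delivered them; the second is the browsable copy with rewritten links.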
wget will handle directory listings and will download all listed files when mirroring the target site. But it does exhibit an odd behavior when doing this. Rather than download the listing as a single file, which is called index.html by default, it downloads nine variants of the same file:
index.html      index.html?D=A  index.html?D=D
index.html?M=A  index.html?M=D  index.html?N=A
index.html?N=D  index.html?S=A  index.html?S=D
This is a little disconcerting until you realize that these represent the same data sorted in different ways. The column headings in a directory listing page are links that will return the data ranked by that column. The eight versions with suffixes are ranked by name (N), last-modified date (M), size (S), and description (D), in ascending (A) or descending (D) order. These variants can be ignored in favor of the basic index.html file.
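If the clutter bothers you, the simplest fix is to delete the query-string variants after the mirror completes, for example:
% rm 'www.craic.com/index.html?'*
The quoting matters, because ? and * are shell metacharacters. Some versions of wget can also reject such URLs during the crawl with the -R option, but check your manual page for the exact pattern syntax it supports.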
An important issue that you will encounter with certain sites is that not all the pages on the site will be downloaded. In fact, in some cases you may find that no pages are downloaded at all. This is a side effect of wget being well behaved. By that, I mean it follows the Robot Exclusion Standard that allows a web site to restrict the pages that web spiders can copy. These restrictions are stated in a robots.txt file in the top level of a web site hierarchy or within individual pages using a META tag of the form:
<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
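For reference, the site-wide robots.txt file uses a very simple format. This hypothetical example bars all robots from one directory and lets them roam everywhere else; a bare Disallow: / would shut out well-behaved spiders entirely:
User-agent: *
Disallow: /private/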
They are typically used to prevent web crawlers (also known as spiders or robots) such as googlebot from consuming too much of a server’s available bandwidth or to prevent certain parts of a site from being included in the indexes of search engines such as Google. This process works on the honor system. The operator of a site defines how they want spiders to treat their site, and a well-behaved program, such as wget, respects those wishes. If the site does not want any files downloaded, then our attempt to mirror it will produce nothing.
This makes sense as a way to control large-scale spiders, but when I download a single site all I am really doing is using wget to save me the effort of downloading the pages one by one in a browser. That activity is not restricted by the standard. I’m not consuming any more of the web server’s bandwidth and I’m not accessing different files. So in this scenario, I could argue that the Robot Exclusion Standard does not apply.
In the world of Internet scams, this is not usually a problem. I have yet to see such a site with a robots.txt file, and they could hardly complain about me having stolen copyrighted material, seeing as most of them copy the pages and images from the companies that they impersonate. Unfortunately, wget will not ignore these restrictions by default (some versions can be told to with the -e robots=off option), so if you want to get around them you may need to look for an alternative spider or write your own using Perl and the LWP module.
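A minimal LWP-based fetcher might look something like the following sketch (the user-agent string and usage are just placeholders); it pays no attention to robots.txt simply because, unlike wget, it never asks for it:
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

# Fetch a single page and print its content to standard output
my $url = shift or die "usage: $0 <url>\n";
my $ua  = LWP::UserAgent->new(agent => 'forensic-fetcher/0.1');
my $response = $ua->get($url);
die "Failed to fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;
print $response->content;
Turning this into a real spider means extracting links (HTML::LinkExtor works well) and recursing over them. Note that LWP only honors robots.txt if you explicitly use LWP::RobotUA instead of LWP::UserAgent.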
Another very important feature of wget is its ability to save the HTTP headers that are sent by a web server immediately before it sends the content of the requested page. I discuss this in detail in Chapter 6, which focuses on web servers.
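If you want to peek ahead, two options are worth knowing: -S prints the server's response headers to the screen as the page is fetched, and --save-headers writes them into the saved file ahead of the page content. For example:
% wget -S http://www.oreilly.com/index.html
% wget --save-headers http://www.oreilly.com/index.html
Chapter 6 goes into what those headers can tell you about the server.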
I cannot leave the topic of archiving web sites without mentioning the Internet Archive and the Wayback Machine. The Internet Archive (http://www.archive.org) is a non-profit group, based in San Francisco. Since 1996, they have been archiving copies of web sites onto their large cluster of Linux nodes. Unlike search engines such as Google, they do not simply keep track of the current content on each site. Instead they revisit sites every few weeks or months and archive a new version if the content has changed. The intent is to capture and archive content that would otherwise be lost whenever a site is changed or closed down. Their grand vision is to archive the entire Internet. Today they have around 40 billion web pages from all kinds of sites.
The primary interface to their collection is via the Wayback Machine. You type in the URL of a site that you are interested in and it returns a listing of all versions of the site that it has available. The O’Reilly web site makes a good example as the archive contains many versions. Using their original domain name (http://ora.com) pulls up even more versions, as you can see with this URL: http://web.archive.org/web/*/http://www.ora.com. These are shown in Figure 5-2.
Browsing through these results shows you how the O’Reilly site has evolved over the last eight or nine years. You can follow the introduction of new technologies and see which ones lived up to their promise and which fell by the wayside. Not all the links work, not all the images are displayed, and CGI scripts do not function, but in general the archived versions offer an experience very similar to the original site. Looking back at the web sites of news organizations or companies where you used to work can become quite addictive.
The archive is an especially valuable resource when the site that you are interested in is no longer available. This can happen when a company goes out of business, when a project gets closed down, or when a government acts to silence dissenting voices. The latter has become an important issue in the past couple of years as countries such as China and Iran have closed down blogs and web sites that they deemed subversive. The archive can play an important role in fighting censorship on the Internet.
In a similar way, the archive can prove useful in a forensics investigation. Most sites that are involved in scams don’t stick around long enough for the archive to copy them, but some of the sites associated with spam do show up.
One example concerns the Send-Safe package, which is used for sending out spam. It is one of the most sophisticated of these products and has been marketed commercially for several years. An important selling point is the database of proxy mail servers that their program can use to conceal the origin of the messages. How they set up these proxy servers is unclear, and there has been speculation that many of these were the result of infections by computer viruses such as Sobig.
This speculation has brought unwanted attention to companies that sell bulk emailers. Perhaps because of this, the Send-Safe web site (http://www.send-safe.com) was taken offline early in 2005. Simply shutting down the web server has the effect of making that company disappear from the Internet. This would be very frustrating for anyone wishing to look into their activities, except for the fact that the site had been archived and is still available via the Wayback Machine.
The example highlights a sobering aspect of the Internet Archive: anything that you make available on the Internet now stands a chance of being available forever. For many types of information this is wonderful, but we all make mistakes, and something that seemed like a good idea at the time may follow you around indefinitely. Anything that you post on the Web, from the political opinions that you express in your blog to those pictures of you dressed up for Halloween at the frat party, could come back to haunt you years from now when you apply for a job or run for president.
The good news is that the web-crawling software used by the Internet Archive respects the Robot Exclusion Standard, so you can prevent your content from being archived by adding a suitable robots.txt file to your site. These two forces of global archiving and personal privacy will provide a very interesting debate as the years go by, the depth of the archive increases, and examples of its use and abuse become more widely known.
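At the time of writing, the archive's crawler could be turned away with a robots.txt entry along these lines (ia_archiver is the user-agent name the Internet Archive has historically documented, so verify it against their current instructions):
User-agent: ia_archiver
Disallow: /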