Manually saving individual pages from a browser works fine when you are only looking at a few. At some point you will want to automate the process, especially when you want to archive an entire site. wget is the perfect tool for automating these downloads, so I will spend a few pages describing how it can be used.
wget is a Unix command-line tool for the non-interactive download of web pages. You can download it from http://www.gnu.org/software/wget/ if your system does not already have it installed. A binary for Microsoft Windows is also available. It is a very flexible tool with a host of options listed in its manual page.
Capturing a single web page with wget is straightforward. Give it a URL, with no other options, and it will download the page into the current working directory with the same filename as that on the web site:
% wget http://www.oreilly.com/index.html
--08:52:06-- http://www.oreilly.com/index.html
=> `index.html'
Resolving www.oreilly.com... 208.201.239.36, 208.201.239.37
Connecting to www.oreilly.com[208.201.239.36]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54,774 [text/html]
100%[=====================================================>] 54,774       135.31K/s
08:52:07 (134.96 KB/s) - `index.html' saved [54774/54774]
Using the -nv option (non-verbose) suppresses most of these status messages, and the -q option silences it completely.
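For example, using the same O'Reilly URL, the quieter forms would look something like this:
% wget -nv http://www.oreilly.com/index.html
% wget -q http://www.oreilly.com/index.html
The first prints a single summary line per downloaded file; the second prints nothing at all, which is handy when you call wget from a script.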
Saving the file with the same name might be a problem if you are downloading index.html pages from multiple sites. wget handles this by adding numeric suffixes to later files with the same name (index.html.1, index.html.2, etc.). But this can get confusing, so the -O option lets you specify your own output filename, or you can use -O with a - in place of the filename to direct the page to standard output. For example:
% wget -O - http://www.oreilly.com/index.html
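One handy use of this form is to feed the page directly into other command-line tools. As an illustrative example (the grep pattern here is just a placeholder; any filter will do), this pipeline pulls out the page title:
% wget -q -O - http://www.oreilly.com/index.html | grep -i '<title>'
Combining -q with -O - keeps wget's own status messages out of the pipeline.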
In its basic mode, wget will only download the specific page that you ask it to. But many pages require stylesheets and images in order to be displayed correctly. Trying to view the local copy of the HTML page in your browser may produce unpredictable results. To address that, you can use the -p option, which instructs wget to download all prerequisite files along with the specific target page:
% wget -p http://www.oreilly.com/index.html
This invocation of the command will create subdirectories as needed, mirroring the structure on the original web site, and will store images and so on into the appropriate locations. This collection of files should allow you to open the local copy of index.html in your browser and have it appear identical to the original.
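To give a rough idea, the resulting layout on disk might look something like the following, although the exact file and directory names depend entirely on the site and on your version of wget:
www.oreilly.com/
    index.html
    images/header.gif
    style/main.css
The paths are purely illustrative, but they show how the local directory tree mirrors the URLs of the prerequisite files.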
Scam-related web sites tend to be short-lived. They are typically set up at the same time the spam emails are sent out and then either shut down by the ISP or web-hosting company as soon as it is informed about the scam, or taken down by the operator after a few days to prevent people like us from investigating them. So when you see a site that you want to look into, you need to act quickly. But oftentimes it is just not convenient to drop everything else and focus on a new scam.
The solution is to make a copy of the entire target site on your local machine. That gives you a permanent record of the site and allows you to study it at your convenience. wget is perfect for this job. A simple one-line command will mirror an entire site. The logging output of the program can be voluminous, but it helps you follow exactly what is being downloaded.
% wget -m http://www.craic.com
--14:05:48-- http://www.craic.com/
=> `www.craic.com/index.html'
Resolving www.craic.com... 208.12.16.5
Connecting to www.craic.com[208.12.16.5]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15,477 [text/html]
0K .......... ..... 100% 64.17 MB/s
14:05:48 (64.17 MB/s) - `www.craic.com/index.html' saved [15477/15477]
--14:05:48-- http://www.craic.com/rss/rss.xml
=> `www.craic.com/rss/rss.xml'
Connecting to www.craic.com[208.12.16.5]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6,251 [text/xml]
0K ...... 100% 165.59 MB/s
14:05:48 (165.59 MB/s) - `www.craic.com/rss/rss.xml' saved [6251/6251]
--14:05:48-- http://www.craic.com/craic.css
=> `www.craic.com/craic.css'
Connecting to www.craic.com[208.12.16.5]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 754 [text/css]
0K 100% 7.19 MB/s
14:05:48 (7.19 MB/s) - `www.craic.com/craic.css' saved [754/754]
[...]
FINISHED --14:05:49--
Downloaded: 3,979,383 bytes in 101 files
The default behavior will create a directory called www.craic.com in the current working directory and place all downloaded files into subdirectories corresponding to the structure of the target web site.
By default, wget will only download pages that are on the target web site. It will not follow any links to other sites. This can be overridden using the -H option, but think long and hard before you use this. Depending on the sites that are linked to, you might be setting yourself up to download millions of pages.
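If you do decide to follow off-site links, wget has options that can keep the crawl within bounds. As a sketch (the domain list and depth here are arbitrary placeholders), something along these lines restricts the download to a couple of named domains and a shallow link depth:
% wget -r -l 2 -H -D craic.com,oreilly.com http://www.craic.com
The -D option takes a comma-separated list of domains that wget is allowed to visit, and -l limits how many links deep the recursion will go.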
You can visit the downloaded content with a web browser, using the Open File menu item and loading the file for the home page. All being well, everything will look fine and you will be able to navigate through your copy of the site. But two factors can upset this. First, the links to other pages may have been written as absolute URLs in some of the web pages, such as /foo/bar/image.jpg, rather than relative links, such as ../bar/image.jpg. The program will convert such links for you if you use the -k option along with -m:
% wget -m -k http://www.craic.com
This is very convenient but means that some of the links on these pages are no longer identical to the original site. This can lead to confusion when you compare these pages to the originals or to similar pages from other sites. To avoid this, you might want to download the site twice, once as the untouched original version and a second time with the updated links.
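One way to keep both copies is to give each run its own output directory with the -P option (the directory names here are arbitrary):
% wget -m -P original http://www.craic.com
% wget -m -k -P converted http://www.craic.com
The first tree preserves the links exactly as the server delivered them; the second is the browsable copy with rewritten links.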
wget will handle directory listings and will download all listed files when mirroring the target site. But it does exhibit an odd behavior when doing this. Rather than download the listing as a single file, which is called index.html by default, it downloads nine variants of the same file:
index.html      index.html?D=A  index.html?D=D
index.html?M=A  index.html?M=D  index.html?N=A
index.html?N=D  index.html?S=A  index.html?S=D
This is a little disconcerting until you realize that these represent the same data sorted in different ways. The column headings in a directory listing page are links that will return the data ranked by that column. The eight versions with suffixes are ranked by name (N), last-modified date (M), size (S), and description (D), in ascending (A) or descending (D) order. These variants can be ignored in favor of the basic index.html file.
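If the clutter bothers you, the simplest fix is to delete the query-string variants after the mirror completes, for example:
% rm 'www.craic.com/index.html?'*
The quoting matters, because ? and * are shell metacharacters. Some versions of wget can also reject such URLs during the crawl with the -R option, but check your manual page for the exact pattern syntax it supports.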
An important issue that you will encounter with certain sites is that not all the pages on the site will be downloaded. In fact, in some cases you may find that no pages are downloaded at all. This is a side effect of wget being well behaved. By that, I mean it follows the Robot Exclusion Standard that allows a web site to restrict the pages that web spiders can copy. These restrictions are stated in a robots.txt file in the top level of a web site hierarchy or within individual pages using a META tag of the form:
<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
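For reference, the site-wide robots.txt file uses a very simple format. This hypothetical example bars all robots from one directory and lets them roam everywhere else; a bare Disallow: / would shut out well-behaved spiders entirely:
User-agent: *
Disallow: /private/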
They are typically used to prevent web crawlers (also known as spiders or robots) such as googlebot from consuming too much of a server’s available bandwidth or to prevent certain parts of a site from being included in the indexes of search engines such as Google. This process works on the honor system. The operator of a site defines how they want spiders to treat their site, and a well-behaved program, such as wget, respects those wishes. If the site does not want any files downloaded, then our attempt to mirror it will produce nothing.
This makes sense as a way to control large-scale spiders, but when I download a single site all I am really doing is using wget to save me the effort of downloading the pages one by one in a browser. That activity is not restricted by the standard. I’m not consuming any more of the web server’s bandwidth and I’m not accessing different files. So in this scenario, I could argue that the Robot Exclusion Standard does not apply.
In the world of Internet scams, this is not usually a problem. I have yet to see such a site with a robots.txt file, and they could hardly complain about me having stolen copyrighted material, seeing as most of them copy the pages and images from the companies that they impersonate. Unfortunately, wget will not ignore these restrictions by default (some versions can be told to with the -e robots=off option), so if you want to get around them you may need to look for an alternative spider or write your own using Perl and the LWP module.
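A minimal LWP-based fetcher might look something like the following sketch (the user-agent string and usage are just placeholders); it pays no attention to robots.txt simply because, unlike wget, it never asks for it:
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

# Fetch a single page and print its content to standard output
my $url = shift or die "usage: $0 <url>\n";
my $ua  = LWP::UserAgent->new(agent => 'forensic-fetcher/0.1');
my $response = $ua->get($url);
die "Failed to fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;
print $response->content;
Turning this into a real spider means extracting links (HTML::LinkExtor works well) and recursing over them. Note that LWP only honors robots.txt if you explicitly use LWP::RobotUA instead of LWP::UserAgent.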
Another very important feature of wget is its ability to save the HTTP headers that are sent by a web server immediately before it sends the content of the requested page. I discuss this in detail in Chapter 6, which focuses on web servers.
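If you want to peek ahead, two options are worth knowing: -S prints the server's response headers to the screen as the page is fetched, and --save-headers writes them into the saved file ahead of the page content. For example:
% wget -S http://www.oreilly.com/index.html
% wget --save-headers http://www.oreilly.com/index.html
Chapter 6 goes into what those headers can tell you about the server.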
I cannot leave the topic of archiving web sites without mentioning the Internet Archive and the Wayback Machine. The Internet Archive (http://www.archive.org) is a non-profit group, based in San Francisco. Since 1996, they have been archiving copies of web sites onto their large cluster of Linux nodes. Unlike search engines such as Google, they do not simply keep track of the current content on each site. Instead they revisit sites every few weeks or months and archive a new version if the content has changed. The intent is to capture and archive content that would otherwise be lost whenever a site is changed or closed down. Their grand vision is to archive the entire Internet. Today they have around 40 billion web pages from all kinds of sites.
The primary interface to their collection is via the Wayback Machine. You type in the URL of a site that you are interested in and it returns a listing of all versions of the site that it has available. The O’Reilly web site makes a good example as the archive contains many versions. Using their original domain name (http://ora.com) pulls up even more versions, as you can see with this URL: http://web.archive.org/web/*/http://www.ora.com. These are shown in Figure 5-2.
Browsing through these results shows you how the O’Reilly site has evolved over the last eight or nine years. You can follow the introduction of new technologies and see which ones lived up to their promise and which fell by the wayside. Not all the links work, not all the images are displayed, and CGI scripts do not function, but in general the archived versions offer an experience very similar to the original site. Looking back at the web sites of news organizations or companies where you used to work can become quite addictive.
The archive is an especially valuable resource when the site that you are interested in is no longer available. This can happen when a company goes out of business, when a project gets closed down, or when a government acts to silence dissenting voices. The latter has become an important issue in the past couple of years as countries such as China and Iran have closed down blogs and web sites that they deemed subversive. The archive can play an important role in fighting censorship on the Internet.
In a similar way, the archive can prove useful in a forensics investigation. Most sites that are involved in scams don’t stick around long enough for the archive to copy them, but some of the sites associated with spam do show up.
One example concerns the Send-Safe package, which is used for sending out spam. It is one of the most sophisticated of these products and has been marketed commercially for several years. An important selling point is the database of proxy mail servers that their program can use to conceal the origin of the messages. How they set up these proxy servers is unclear, and there has been speculation that many of these were the result of infections by computer viruses such as Sobig.
This speculation has brought unwanted attention to companies that sell bulk emailers. Perhaps because of this, the Send-Safe web site (http://www.send-safe.com) was taken offline early in 2005. Simply shutting down the web server has the effect of making that company disappear from the Internet. This would be very frustrating for anyone wishing to look into their activities, except for the fact that the site had been archived and is still available via the Wayback Machine.
The example highlights a sobering aspect of the Internet Archive: anything that you make available on the Internet now stands a chance of being available forever. For many types of information this is wonderful, but we all make mistakes, and something that seemed like a good idea at the time may follow you around indefinitely. Anything that you post on the Web, from the political opinions that you express in your blog to those pictures of you dressed up for Halloween at the frat party, could come back to haunt you years from now when you apply for a job or run for president.
The good news is that the web-crawling software used by the Internet Archive respects the Robot Exclusion Standard, so you can prevent your content from being archived by adding a suitable robots.txt file to your site. These two forces of global archiving and personal privacy will provide a very interesting debate as the years go by, the depth of the archive increases, and examples of its use and abuse become more widely known.
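At the time of writing, the archive's crawler could be turned away with a robots.txt entry along these lines (ia_archiver is the user-agent name the Internet Archive has historically documented, so verify it against their current instructions):
User-agent: ia_archiver
Disallow: /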