Almost every scam on the Internet today involves a web site, especially those engaged in identity theft. Dissecting the structure of a site is therefore an essential part of Internet forensics. This chapter shows you how to find hidden clues in the HTML code of a single web page and in the architecture of the entire site. First, I cover the basics of looking at the source of web pages using your browser, and then I show how you can use other tools to automate the process of archiving entire web sites. Many of the pages that you will encounter are generated by server-side scripts, and I describe approaches that may reveal some of the inner workings of these, even when you cannot access their source code.
Some clues contribute minor details to our knowledge about the scam. Some enable us to link one scam to another and build a much larger picture. On occasion we get lucky and uncover a mass of detailed information about the operation.
First, consider individual web pages: the HTML source of a single page can reveal a surprising amount about its creator, and the links contained therein help you map out the structure of the entire site. All web browsers allow you to view the source for a page and to save that to a file on your local computer. While these fundamental operations may seem trivial, there are a couple of important issues of which you need to be aware.
The first is that many of today’s web pages include other files, without which they cannot be properly displayed. Images are the most obvious example, but stylesheets and JavaScript files have become increasingly common. In most cases, the links to those files are relative, not absolute, meaning they will not be available if the saved web page is opened in a browser. Either the links have to be updated in the downloaded web page or the supporting files must also be saved.
The second problem is that most web pages do not record the URL from which they were downloaded. You have to save that URL string in a separate file or insert it as a comment in the saved web page, and doing either manually is an inconvenience.
Some browser developers have addressed these problems. Mozilla Firefox will save any associated files when a web page is saved as Web Page, complete, as shown in Figure 5-1.
Those files are saved to a directory that is created in the same location as the saved web page. So, for example, if I save index.html to a directory, then I will find a subdirectory called index_files that contains any images, stylesheets, and so forth that were referenced by the original file. Furthermore, most links to those files will have been updated to point to the saved copies. I use the term “most” because Firefox is not able to update links that are included as parameters to JavaScript functions, such as image rollover functions. With those exceptions, the saved page and its ancillary files can be opened from a browser on that machine and the page should look the same as the original.
Although this is convenient, it does mean that the saved web page is no longer identical to the original. In fact, Firefox makes a number of changes to the HTML it saves. I presume these are intended to ensure that saved pages contain valid HTML, but the effect is to make comparing saved pages with their originals very difficult. Consider these few lines of HTML from my home page:
<table width="90%" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
Firefox rearranges the attributes in the <table> tag so that they lie in alphabetical order. It also adds a new <tbody> tag ahead of the first <tr> tag:
<table align="center" border="0" cellpadding="0" cellspacing="0" width="90%">
<tbody><tr>
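Because only the attribute order differs between the two versions of the tag, one way to compare a Firefox-saved page against its original is to normalize both copies before running a diff. Here is a minimal Python sketch (the class and function names are my own) that re-emits start tags with their attributes sorted alphabetically; note that it does not undo other changes, such as the inserted <tbody> tag:

```python
from html.parser import HTMLParser

class Normalizer(HTMLParser):
    """Re-emit HTML with attributes in alphabetical order,
    so a browser-rewritten copy can be diffed against the original."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name; keep valueless attributes bare
        parts = [k if v is None else f'{k}="{v}"'
                 for k, v in sorted(attrs, key=lambda a: a[0])]
        self.out.append("<" + " ".join([tag] + parts) + ">")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_comment(self, data):
        # Preserve comments -- they may hold forensic clues
        self.out.append(f"<!--{data}-->")

def normalize(html):
    p = Normalizer()
    p.feed(html)
    return "".join(p.out)
```

With both files passed through normalize(), attribute reordering disappears from the diff and only substantive changes remain.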
This type of unseen modification of files can be the source of much confusion when you want to compare files. To avoid it, you can either download files individually in Firefox, saving them as Web Page, HTML Only, or use the non-interactive download tool wget.
Internet Explorer can also save all the files associated with a page, and it solves the second problem of associating the saved web page with the original URL. It inserts a comment line at the top of the page, before the <html> tag, which records the original URL. This example shows the comment from a downloaded copy of my home page:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0021)http://www.craic.com/ -->
<HTML lang=en>
The number in parentheses right before the URL represents the number of characters in that URL string.
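A quick check with Python's re module confirms the convention; the pattern and variable names here are just for illustration:

```python
import re

comment = '<!-- saved from url=(0021)http://www.craic.com/ -->'

# Capture the declared length and the URL that follows it
m = re.search(r"saved from url=\((\d+)\)(\S+)", comment)
declared, url = int(m.group(1)), m.group(2)

# The declared count should equal the actual length of the URL string
print(url, declared, len(url) == declared)
```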
Comments like this are a useful way of recording where a page came from. They are especially interesting when they are found in the pages of phishing web sites. Here is an example from a fake U.S. Bank site that shows exactly where the original page is located:
<!-- saved from url=(0105)http://www.updates-usbank.com/internetBanking/RequestRouterRequestCmdId=DisplayLoginPage/login_faild.html -->
In some cases, a page may be downloaded from an intermediary web site, rather than the original. A comment line may be the only way to track this information. On occasion you come across a page with more than one comment, like this:
<!-- saved from url=(0044)http://iqnet.ro/poser/eb/signOutConfirm.html -->
<!-- saved from url=(0041)http://pages.ebay.com/signOutConfirm.html -->
This is particularly informative as it defines the steps that this page has taken in its evolution from the original version. It has been downloaded from http://ebay.com, uploaded to iqnet.ro (in Romania), downloaded from there, and finally uploaded to the site http://ebay.arribada-updates.com (located in Mexico), which is where I found it. Although these comment lines are not present in all HTML files, they are well worth looking for.
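When a page carries several of these comments, a small script can pull them all out and reconstruct the page's travels. This sketch assumes, as in the example above, that each new save prepends its comment, so the most recent hop appears first:

```python
import re

# Sample page header with two provenance comments, newest first
page = (
    '<!-- saved from url=(0044)http://iqnet.ro/poser/eb/signOutConfirm.html -->\n'
    '<!-- saved from url=(0041)http://pages.ebay.com/signOutConfirm.html -->\n'
)

# Collect every URL recorded in a "saved from" comment
chain = re.findall(r"saved from url=\(\d+\)(\S+)", page)

# Reverse the list to print the page's history oldest-hop first
for url in reversed(chain):
    print(url)
```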