Web pages are designed so they look good when rendered in a browser. Making the HTML source easy to read is rarely a priority. The result is that complex web pages are often represented by HTML source code that is virtually undecipherable.
Displaying HTML tags, attributes, and so on in different colors can be of great help in resolving this issue. The Mozilla Firefox browser provides this type of display in its View → Page Source menu item. The editor emacs can provide this feature if you use Global Font Lock Mode. Placing the following two lines in your .emacs file will enable this:
(global-font-lock-mode t)
(setq font-lock-maximum-decoration t)
Users of the editor vim will find a similar syntax-coloring feature enabled by default in their application.
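If syntax coloring happens to be turned off in your copy of vim, adding the following line to your .vimrc file should enable it:
syntax on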
A problem with many web pages is that newline characters have been omitted between tags, resulting in extremely long lines that force you to scroll left and right in order to see the entire text. Take a look at the HTML source for http://www.cnn.com or http://www.microsoft.com for examples of this.
One way to resolve that problem is to use the program HTML Tidy (http://www.w3.org/People/Raggett/tidy/). Its primary function is to identify and correct errors in HTML, such as missing tags or quotes, but it can also be used to improve the formatting of correct HTML by indenting tags and changing the case of certain elements. The program is freely available for all major platforms from http://tidy.sourceforge.net/. This command reformats the file original.html, adding newlines and indenting pairs of tags as appropriate; the output is sent to the file improved.html and any errors that are encountered are reported to the terminal:
% tidy -i original.html > improved.html
Unfortunately, tidy is sometimes too good at its job, reporting so many errors and warnings that it refuses to process the page. The Microsoft and CNN pages also serve as examples of this.
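Depending on the version of tidy installed, the force-output option may provide a workaround; it asks tidy to write out its best attempt at the page even when the number of errors would normally prevent any output:
% tidy -i --force-output yes original.html > improved.html
Treat the result with some caution, since tidy is guessing at the intended structure of a page it considers broken.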
The Perl script readable_html.pl, included here as Example 5-1, offers a simple alternative that adds a newline after every closing tag in an HTML page.
Example 5-1. readable_html.pl
#!/usr/bin/perl -w

die "Usage: $0 <html file>\n" unless @ARGV < 2;

$ARGV[0] = '-' if @ARGV == 0;

open INPUT, "< $ARGV[0]" or die "$0: Unable to open html file $ARGV[0]\n";

while(<INPUT>) {
    # add a newline after every closing tag
    s/(\<\/.*?\>)/$1\n/g;
    print $_;
}

close INPUT;
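For example, to reformat a downloaded page (the filenames here are arbitrary):
% ./readable_html.pl original.html > readable.html
Because the script falls back to standard input when no filename is given, it can also be used at the end of a pipeline.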
Extracting the references to other pages, images, and scripts contained within a web page is an essential step in mapping out the structure of a web site.
Reading through the HTML source, looking for these links, quickly becomes extremely tedious. The Firefox browser has a useful feature that will extract and display all the links for you. Go to the Page Info item in the Tools menu and select the Links tab to see links to anchors, stylesheets, and forms. Go to the Media tab to uncover links to the images.
Even with this aid, the process is laborious. Example 5-2 shows a Perl script that will retrieve the HTML source from a URL, extract all the links, and then output them. The script uses the LWP::Simple and HTML::LinkExtor modules, which can be downloaded from CPAN if your system does not already have them installed.
Example 5-2. extract_links.pl
#!/usr/bin/perl -w

use HTML::LinkExtor;
use LWP::Simple;

die "Usage: $0 <url>\n" unless @ARGV == 1;

my $doc = get($ARGV[0]) or die "$0: Unable to get url: $ARGV[0]\n";

# parse the page, resolving relative links against the original URL
my $parser = HTML::LinkExtor->new(undef, $ARGV[0]);
$parser->parse($doc)->eof;

# record the tag type for each unique link
my %hash = ();
foreach my $linkarray ($parser->links) {
    $hash{$$linkarray[2]} = $$linkarray[0];
}

# output the links, sorted by tag type and then by URL
foreach my $key (sort { $hash{$a} cmp $hash{$b} or $a cmp $b } keys %hash) {
    printf qq[%-6s %s\n], $hash{$key}, $key;
}
Extracting links from the original URL, as opposed to an archived version, is important because they reflect the structure of the original web site, rather than that of a local archive of the page and its images that may have been generated by a browser. The output of the script is an ordered, non-redundant list of the links, preceded by the type of tag that each is associated with. For example:
% ./extract_links.pl http://www.craic.com
a http://www.craic.com/about_us.html
a http://www.craic.com/contact.html
a http://www.craic.com/index.html
img http://www.craic.com/images/banner_title.gif
img http://www.craic.com/images/logo.jpg
link http://www.craic.com/craic.css
[...]
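Because the tag type appears at the start of each line, the output is easy to filter further with standard command-line tools. For example, to list only the images:
% ./extract_links.pl http://www.craic.com | grep '^img'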
One clue about the origin of a web page is the type of software that was used to create it. Many pages are hand coded or generated by PHP or Perl scripts, but many more are coded using page design software, such as Microsoft FrontPage, Microsoft Word, Adobe GoLive, or Macromedia Dreamweaver. Every software package leaves behind a signature in the HTML that it generates. Sometimes this takes the form of certain styles of code, such as the way lines are indented, and other times the signature is more explicit.
Adobe GoLive identifies itself using a meta tag with the name generator:
<meta name="generator" content="Adobe GoLive 4">
Microsoft FrontPage does the same and adds another with the name ProgId:
<META NAME="GENERATOR" CONTENT="Microsoft FrontPage 5.0"> <META NAME="ProgId" CONTENT="FrontPage.Editor.Document">
Macromedia Dreamweaver can be identified by the MM prefix it uses for the JavaScript functions that it often includes in the HTML it produces:
function MM_preloadImages() { //v3.0
function MM_swapImgRestore() { //v3.0
function MM_findObj(n, d) { //v4.01
function MM_swapImage() { //v3.0
Microsoft Word can generate web pages by converting Word documents into HTML. These can be identified by the meta tags it introduces:
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
Even if these have been removed by editing, a page generated by Word can be identified by the extensive use of styles that have the prefix mso:
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    [...]
    mso-pagination:widow-orphan;
    [...]
    mso-header-margin:.5in;
    mso-footer-margin:.5in;
    mso-paper-source:0;}
It is possible for a web page to contain more than one of these signatures, which indicates that the page has been modified from its original form. In some cases it may be possible to infer the order in which the software tools were applied.
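If you find yourself checking pages for these signatures often, a short script can automate the search. The following is only a minimal sketch: the script name is arbitrary, it expects a locally saved HTML file, and the patterns cover nothing beyond the signatures described above.
#!/usr/bin/perl -w
# A minimal sketch: report which of the signatures described above
# appear in a locally saved HTML file. The patterns are illustrative
# and make no attempt to be exhaustive.
use strict;

die "Usage: $0 <html file>\n" unless @ARGV == 1;

my %patterns = (
    'Adobe GoLive'           => qr/name="?generator"?\s+content="?Adobe GoLive/i,
    'Microsoft FrontPage'    => qr/content="?Microsoft FrontPage/i,
    'Macromedia Dreamweaver' => qr/function\s+MM_\w+\s*\(/,
    'Microsoft Word'         => qr/content="?Microsoft Word|mso-[\w-]+:/i,
);

open my $fh, '<', $ARGV[0] or die "$0: Unable to open html file $ARGV[0]\n";
my $html = do { local $/; <$fh> };    # slurp the entire page into one string
close $fh;

foreach my $tool (sort keys %patterns) {
    print "$tool\n" if $html =~ $patterns{$tool};
}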
Unfortunately, not all the information about a web page is contained within the page. Additional information is supplied by the web server in the form of HTTP header lines that precede the page itself during a web transaction. Information such as the date and time when the page was last modified and the web cookies that might be associated with the page are typically only available from these headers. I discuss this information in Chapter 6, which is devoted to the information that a web server reveals about itself.
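As a brief preview of that discussion, the headers for a page can be retrieved without fetching the page itself. The following sketch uses the LWP::UserAgent module to send a HEAD request; the script name and output format are arbitrary:
#!/usr/bin/perl -w
# A minimal sketch: send a HEAD request for a URL and print the HTTP
# headers that the web server returns.
use strict;
use LWP::UserAgent;

die "Usage: $0 <url>\n" unless @ARGV == 1;

my $ua = LWP::UserAgent->new;
my $response = $ua->head($ARGV[0]);
die "$0: Unable to get headers for url: $ARGV[0]\n" unless $response->is_success;

print $response->headers->as_string;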