Web pages are designed so they look good when rendered in a browser. Making the HTML source easy to read is rarely a priority. The result is that complex web pages are often represented by HTML source code that is virtually undecipherable.
Displaying HTML tags, attributes, and so on in different colors can be of great help in resolving this issue. The Mozilla Firefox browser provides this type of display in its View → Page Source menu item. The editor emacs can provide this feature if you use Global Font Lock Mode. Placing the following two lines in your .emacs file will enable this:
(global-font-lock-mode t)
(setq font-lock-maximum-decoration t)
Users of the editor vim will find a similar syntax-coloring feature enabled by default in their application.
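If syntax coloring happens to be turned off in your copy of vim, adding the following line to your .vimrc file should enable it:
syntax on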
A problem with many web pages is that newline characters have been omitted between tags, resulting in extremely long lines that force you to scroll left and right in order to see the entire text. Take a look at the HTML source for http://www.cnn.com or http://www.microsoft.com for examples of this.
One way to resolve that problem is to use the program HTML Tidy (http://www.w3.org/People/Raggett/tidy/). Its primary function is to identify and correct errors in HTML, such as missing tags or quotes, but it can also be used to improve the formatting of correct HTML by indenting tags and changing the case of certain elements. The program is freely available for all major platforms from http://tidy.sourceforge.net/. This command reformats the file original.html, adding newlines and indenting pairs of tags as appropriate; the output is sent to the file improved.html and any errors that are encountered are reported to the terminal:
% tidy -i original.html > improved.html
Unfortunately, tidy is sometimes too good at its job, reporting so many errors and warnings that it refuses to process the page. The Microsoft and CNN pages also serve as examples of this.
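Depending on the version of tidy installed, the force-output option may provide a workaround; it asks tidy to write out its best attempt at the page even when the number of errors would normally prevent any output:
% tidy -i --force-output yes original.html > improved.html
Treat the result with some caution, since tidy is guessing at the intended structure of a page it considers broken.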
The Perl script readable_html.pl, included here as Example 5-1, offers a simple alternative that adds a newline after every closing tag in an HTML page.
Example 5-1. readable_html.pl
#!/usr/bin/perl -w

die "Usage: $0 <html file>\n" unless @ARGV < 2;

$ARGV[0] = '-' if @ARGV == 0;

open INPUT, "< $ARGV[0]" or die "$0: Unable to open html file $ARGV[0]\n";

while(<INPUT>) {
    # add a newline after every closing tag
    s/(\<\/.*?\>)/$1\n/g;
    print $_;
}

close INPUT;
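For example, to reformat a downloaded page (the filenames here are arbitrary):
% ./readable_html.pl original.html > readable.html
Because the script falls back to standard input when no filename is given, it can also be used at the end of a pipeline.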
Extracting the references to other pages, images, and scripts contained within a web page is an essential step in mapping out the structure of a web site.
Reading through the HTML source, looking for these links, quickly becomes extremely tedious. The Firefox browser has a useful feature that will extract and display all the links for you. Go to the Page Info item in the Tools menu and select the Links tab to see links to anchors, stylesheets, and forms. Go to the Media tab to uncover links to the images.
Even with this aid, the process is laborious. Example 5-2 shows a Perl script that will retrieve the HTML source from a URL, extract all the links, and then output them. The script uses the LWP::Simple and HTML::LinkExtor modules, which can be downloaded from CPAN if your system does not already have them installed.
Example 5-2. extract_links.pl
#!/usr/bin/perl -w

use HTML::LinkExtor;
use LWP::Simple;

die "Usage: $0 <url>\n" unless @ARGV == 1;

my $doc = get($ARGV[0]) or die "$0: Unable to get url: $ARGV[0]\n";

# parse the page, resolving relative links against the original URL
my $parser = HTML::LinkExtor->new(undef, $ARGV[0]);
$parser->parse($doc)->eof;

# record the tag type for each unique link
my %hash = ();
foreach my $linkarray ($parser->links) {
    $hash{$$linkarray[2]} = $$linkarray[0];
}

# output the links, sorted by tag type and then by URL
foreach my $key (sort { $hash{$a} cmp $hash{$b} or $a cmp $b } keys %hash) {
    printf qq[%-6s %s\n], $hash{$key}, $key;
}
Extracting links from the original URL, as opposed to an archived version, is important because they reflect the structure of the original web site, rather than that of a local archive of the page and its images that may have been generated by a browser. The output of the script is an ordered, non-redundant list of the links, preceded by the type of tag that each is associated with. For example:
% ./extract_links.pl http://www.craic.com
a http://www.craic.com/about_us.html
a http://www.craic.com/contact.html
a http://www.craic.com/index.html
img http://www.craic.com/images/banner_title.gif
img http://www.craic.com/images/logo.jpg
link http://www.craic.com/craic.css
[...]
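Because the tag type appears at the start of each line, the output is easy to filter further with standard command-line tools. For example, to list only the images:
% ./extract_links.pl http://www.craic.com | grep '^img'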
One clue about the origin of a web page is the type of software that was used to create it. Many pages are hand coded or generated by PHP or Perl scripts, but many more are coded using page design software, such as Microsoft FrontPage, Microsoft Word, Adobe GoLive, or Macromedia Dreamweaver. Every software package leaves behind a signature in the HTML that it generates. Sometimes this takes the form of certain styles of code, such as the way lines are indented, and other times the signature is more explicit.
Adobe GoLive identifies itself using a meta tag with the name generator:
<meta name="generator" content="Adobe GoLive 4">
Microsoft FrontPage does the same and adds another with the name ProgId:
<META NAME="GENERATOR" CONTENT="Microsoft FrontPage 5.0"> <META NAME="ProgId" CONTENT="FrontPage.Editor.Document">
Macromedia Dreamweaver can be identified by the MM prefix it uses for the JavaScript functions that it often includes in the HTML it produces:
function MM_preloadImages() { //v3.0
function MM_swapImgRestore() { //v3.0
function MM_findObj(n, d) { //v4.01
function MM_swapImage() { //v3.0
Microsoft Word can generate web pages by converting Word documents into HTML. These can be identified by the meta tags it introduces:
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
Even if these have been removed by editing, a page generated by Word can be identified by the extensive use of styles that have the prefix mso:
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    [...]
    mso-pagination:widow-orphan;
    [...]
    mso-header-margin:.5in;
    mso-footer-margin:.5in;
    mso-paper-source:0;}
It is possible for a web page to contain more than one of these signatures, which indicates that the page has been modified from its original form. In some cases it may be possible to infer the order in which the software tools were applied.
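If you find yourself checking pages for these signatures often, a short script can automate the search. The following is only a minimal sketch: the script name is arbitrary, it expects a locally saved HTML file, and the patterns cover nothing beyond the signatures described above.
#!/usr/bin/perl -w
# A minimal sketch: report which of the signatures described above
# appear in a locally saved HTML file. The patterns are illustrative
# and make no attempt to be exhaustive.
use strict;

die "Usage: $0 <html file>\n" unless @ARGV == 1;

my %patterns = (
    'Adobe GoLive'           => qr/name="?generator"?\s+content="?Adobe GoLive/i,
    'Microsoft FrontPage'    => qr/content="?Microsoft FrontPage/i,
    'Macromedia Dreamweaver' => qr/function\s+MM_\w+\s*\(/,
    'Microsoft Word'         => qr/content="?Microsoft Word|mso-[\w-]+:/i,
);

open my $fh, '<', $ARGV[0] or die "$0: Unable to open html file $ARGV[0]\n";
my $html = do { local $/; <$fh> };    # slurp the entire page into one string
close $fh;

foreach my $tool (sort keys %patterns) {
    print "$tool\n" if $html =~ $patterns{$tool};
}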
Unfortunately, not all the information about a web page is contained within the page. Additional information is supplied by the web server in the form of HTTP header lines that precede the page itself during a web transaction. Information such as the date and time when the page was last modified and the web cookies that might be associated with the page are typically only available from these headers. I discuss this information in Chapter 6, which is devoted to the information that a web server reveals about itself.
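As a brief preview of that discussion, the headers for a page can be retrieved without fetching the page itself. The following sketch uses the LWP::UserAgent module to send a HEAD request; the script name and output format are arbitrary:
#!/usr/bin/perl -w
# A minimal sketch: send a HEAD request for a URL and print the HTTP
# headers that the web server returns.
use strict;
use LWP::UserAgent;

die "Usage: $0 <url>\n" unless @ARGV == 1;

my $ua = LWP::UserAgent->new;
my $response = $ua->head($ARGV[0]);
die "$0: Unable to get headers for url: $ARGV[0]\n" unless $response->is_success;

print $response->headers->as_string;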