Chapters 5 and 6 covered what can be learned about a web site and the server that hosts it. This chapter takes a look at things from the other side: what the server can learn about us.
A web server needs to know certain things about a browser to return the requested page successfully. First and foremost is the IP address of the machine that is sending the request. Without that, the server doesn’t know where to send the data. Next are the capabilities of the browser. Not all browsers can handle all types of content, and all common browsers will tell the server what they can and can’t accept.
A basic HTTP transaction, fetching a simple web page, starts out with the browser sending a request to the server. The request contains the name of the document to be returned, along with the version of the HTTP protocol and the method that should be used to service the request. Also included are a number of headers that convey ancillary information that can help the server tailor its response to the request. Table 7-1 shows a set of these headers that accompanied an example request.
Table 7-1. An example of the header lines in a simple HTTP transaction
Header | Value |
---|---|
Remote Host | 208.12.16.2 |
Referer | |
Request Method | GET |
Accept | text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5 |
Accept-Charset | ISO-8859-1,utf-8;q=0.7,*;q=0.7 |
Accept-Encoding | gzip,deflate |
Accept-Language | en-us,en;q=0.5 |
Connection | keep-alive |
Host | www.craic.com |
Keep-Alive | 300 |
User-Agent | Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0 |
These are only some of the possible headers. Additional background can be found in this document: http://www.w3.org/Protocols/HTTP/HTRQ_Headers.html.
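To make the shape of a request concrete, the following is a minimal sketch in Python that splits raw request text into its request line and headers. The function name `parse_request`, the sample request, and the `/index.html` path are assumptions made for illustration; the header values are borrowed from Table 7-1.

```python
def parse_request(raw: str):
    """Split raw HTTP request text into (method, target, version, headers)."""
    head = raw.split("\r\n\r\n", 1)[0]           # ignore any message body
    lines = head.split("\r\n")
    # The request line carries the method, the document name, and the protocol version.
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers


# Hypothetical request assembled from the values in Table 7-1.
SAMPLE_REQUEST = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: www.craic.com\r\n"
    "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) "
    "Gecko/20041107 Firefox/1.0\r\n"
    "Accept-Language: en-us,en;q=0.5\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)

if __name__ == "__main__":
    method, target, version, headers = parse_request(SAMPLE_REQUEST)
    print(method, target, version)
    for name, value in sorted(headers.items()):
        print(f"{name}: {value}")
```

Note that the IP address of the client does not appear anywhere in this text; the server obtains it from the network connection itself.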
Implicit in a transaction, and so not needing its own header, is the IP address of the requesting browser. The type of browser that is making the request is specified in the User-Agent string, declaring it to be Mozilla Firefox running under Linux, for example.
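As a rough illustration of how a server might pull the browser and platform out of a User-Agent string, here is a small Python sketch. The function name `split_user_agent` is hypothetical, and the approach assumes the common layout of product/version tokens plus a parenthesized comment; real User-Agent strings are free-form and are set by the client, so treat the result as a hint rather than a fact.

```python
import re


def split_user_agent(ua: str):
    """Return (products, comments) extracted from a User-Agent string."""
    # Parenthesized comments usually carry platform details such as the OS.
    comments = []
    for group in re.findall(r"\(([^)]*)\)", ua):
        comments.extend(part.strip() for part in group.split(";") if part.strip())

    # With the comments removed, what remains is a list of name/version tokens.
    remainder = re.sub(r"\([^)]*\)", " ", ua)
    products = {}
    for token in remainder.split():
        if "/" in token:
            name, version = token.split("/", 1)
            products[name] = version
    return products, comments


# The User-Agent value from the example request in this chapter.
UA = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0"

if __name__ == "__main__":
    products, comments = split_user_agent(UA)
    print(products)
    print(comments)
```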
The browser also has to inform the server what types of content it can accept. Most browsers will in fact accept anything the server chooses to send, but there is a difference between accepting the content and knowing what to do with it. If the browser can’t display video, for example, you will typically get a pop-up asking if you want to save the file. Most browsers use the Accept header to let the server know what type of content they prefer, given the choice, which lets the server choose one form of content over another. These days, the major browsers can handle all the common formats, so the header matters less than it once did. The exception comes from mobile phone browsers. These are highly constrained by small screens and limited bandwidth, so a server that delivers content to these devices will make good use of the Accept header and return, perhaps, a WML page rather than standard HTML, or an error message if a certain type of phone requests a large MPEG movie.
Alongside the Accept header are optional headers that tell the server which language the content should be sent in (Accept-Language), which character sets are acceptable (Accept-Charset), and which compression schemes the browser can handle (Accept-Encoding) if the server can send compressed data to conserve bandwidth. These headers are often ignored but can be very useful if your site has versions in multiple languages, for example. In some of the headers that list alternatives, you will often see a semicolon followed by q= and a value between 0 and 1. For example:
ACCEPT: text/html;q=0.9,text/plain;q=0.8,*/*;q=0.5
These are called quality, or sometimes degradation, values, and they are used to help the server decide which alternative form of content should be returned. You can think of them as quantifying the client browser’s preference, given a choice. So in this example the browser would prefer to receive HTML text rather than plain text, but in a pinch it will accept anything. The gory details can be found in this document: http://www.w3.org/Protocols/HTTP/Negotiation.html.
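The way a server might act on quality values can be sketched in a few lines of Python. The function names `parse_quality_list` and `choose` are hypothetical, and the sketch is deliberately simplified: an item with no q= is given the default of 1, `*/*` is treated as "anything", and partial wildcards such as `text/*` are not handled.

```python
def parse_quality_list(header_value: str):
    """Return [(item, q)] ordered from most to least preferred."""
    prefs = []
    for part in header_value.split(","):
        fields = [f.strip() for f in part.split(";")]
        item, q = fields[0], 1.0                 # q defaults to 1 when omitted
        if not item:
            continue
        for param in fields[1:]:
            key, _, val = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0                      # treat a malformed q as "not wanted"
        prefs.append((item, q))
    # sorted() is stable, so ties keep the order in which the browser listed them.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose(available, header_value):
    """Pick the available content type that the browser rates highest."""
    for item, q in parse_quality_list(header_value):
        if q <= 0:
            continue
        if item == "*/*":
            return available[0] if available else None
        if item in available:
            return item
    return None


ACCEPT = "text/html;q=0.9,text/plain;q=0.8,*/*;q=0.5"

if __name__ == "__main__":
    print(parse_quality_list(ACCEPT))
    print(choose(["text/plain", "text/html"], ACCEPT))
```

With the example header above, a server that has both HTML and plain-text versions would return the HTML one, and a server that has only some other format would still be allowed to send it because of the `*/*` entry.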
The Host header is an extremely important piece of information. This is the hostname that the browser is trying to connect to. You might think that this is inherent in the URL used to initiate the transaction, but the browser resolves that hostname to an IP address before it connects, and servers often host multiple web sites on a single address. This header lets the server direct the request to the correct virtual host.
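A minimal Python sketch of name-based virtual hosting might look like the following. The hostnames other than www.craic.com, the directory paths, and the function name `document_root` are all assumptions made for illustration, and the port handling does not cope with IPv6 literals.

```python
# Hypothetical mapping of hostnames to document roots on a single server.
VIRTUAL_HOSTS = {
    "www.craic.com": "/var/www/craic",
    "www.example.org": "/var/www/example",
}
DEFAULT_ROOT = "/var/www/default"


def document_root(host_header: str) -> str:
    """Choose a document root from the value of the Host header."""
    # Hostnames are case-insensitive and may carry an optional :port suffix.
    hostname = host_header.strip().lower().partition(":")[0]
    return VIRTUAL_HOSTS.get(hostname, DEFAULT_ROOT)


if __name__ == "__main__":
    for host in ("www.craic.com", "WWW.Example.org:8080", "unknown.test"):
        print(host, "->", document_root(host))
```

All three requests in the usage example could arrive at the same IP address; only the Host header distinguishes them.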
The headers also include a Connection line and perhaps a Keep-Alive line. These tell the server to keep the connection between it and the browser open for a period of time once the requested page has been sent. Users often look at several pages on any given site, and keeping the connection open allows subsequent requests to be serviced more efficiently.
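How a server might interpret these two lines can be sketched as follows. The function name `keep_alive_timeout` and the fallback of 15 seconds are hypothetical, and the headers are assumed to arrive in a dictionary with lower-case names. The sketch keeps a connection open only when the browser asks for it explicitly, which is the HTTP/1.0 style; HTTP/1.1 connections are persistent by default.

```python
def keep_alive_timeout(headers: dict, default: int = 15):
    """Return the number of seconds to hold the connection open, or None to close it."""
    tokens = [t.strip() for t in headers.get("connection", "").lower().split(",")]
    if "close" in tokens or "keep-alive" not in tokens:
        return None
    # Older browsers sent a bare number of seconds, as in "Keep-Alive: 300".
    value = headers.get("keep-alive", "").strip()
    return int(value) if value.isdigit() else default


if __name__ == "__main__":
    print(keep_alive_timeout({"connection": "keep-alive", "keep-alive": "300"}))
    print(keep_alive_timeout({"connection": "close"}))
```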
If the request was initiated by clicking on a link on a web page, as opposed to typing a URL into the browser directly, then a Referer header will be included that tells the server the URL of the page that you came from. This is invaluable to commerce sites that want to track where their customers found out about their services.
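The kind of tracking described here can be sketched in Python by reducing each Referer value to its hostname and counting the results. The function names and the sample URLs are hypothetical; a missing header is counted as a direct visit.

```python
from collections import Counter
from urllib.parse import urlsplit


def referring_site(referer):
    """Reduce a Referer value to the hostname of the referring page."""
    if not referer:
        return "(direct)"            # typed URL, bookmark, or header suppressed
    return urlsplit(referer).hostname or "(unknown)"


def tally_referrers(referers):
    """Count how many requests arrived from each referring site."""
    return Counter(referring_site(r) for r in referers)


# Hypothetical Referer values as they might appear in a server log.
SAMPLE_REFERERS = [
    "http://search.example.com/results?q=forensics",
    "http://search.example.com/results?q=http+headers",
    "http://blog.example.net/2005/links.html",
    "",
]

if __name__ == "__main__":
    for site, count in tally_referrers(SAMPLE_REFERERS).most_common():
        print(site, count)
```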
Throughout this chapter, you will see the term Referer, used as an HTTP header to identify the URL of the page that contained a link to the current page. The correct spelling is Referrer, but somewhere along the line an R was dropped. This error managed to sneak into the official HTTP specification and now lives forever in every browser and web server on the Net.
To see what your browser is telling the world about your system you need to visit a site that reflects that information back to you. There are many of these out there on the Net. Two that are available at the time of writing are http://ats.nist.gov/cgi-bin/cgi.tcl/echo.cgi and http://www.ugcs.caltech.edu/~presto/echo.cgi. Alternatively you can set up the Perl script shown in Example 7-1 on your own server.
Example 7-1. browser.cgi
    #!/usr/bin/perl -w
    # Echo the environment variables that are sent from the browser
    use strict;
    use CGI;

    my $cgi = CGI->new;

    # Print one labelled value, HTML-escaped because request headers
    # are supplied by the client and cannot be trusted
    sub show {
        my ($label, $value) = @_;
        $value = '' unless defined $value;
        printf "%s: %s<br>\n", $label, $cgi->escapeHTML($value);
    }

    print "Content-type: text/html\n\n";
    print "<html>\n<head>\n";
    print "<title>Browser Information</title>\n";
    print "</head>\n<body>\n";
    print "Information sent by your browser:<br>\n";
    show('Remote Host',    $cgi->remote_host());
    show('Referring Page', $cgi->referer());
    show('Request Method', $cgi->request_method());
    # http() with no argument lists the HTTP_* variables;
    # with a name it returns that variable's value
    foreach my $type (sort $cgi->http()) {
        show($type, $cgi->http($type));
    }
    print "</body>\n</html>\n";

Go to that URL from your browser and you should see output similar to this:

    Information sent by your browser:
    Remote Host: 208.12.16.2
    Referring Page:
    Request Method: GET
    HTTP_ACCEPT: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    HTTP_ACCEPT_ENCODING: gzip,deflate
    HTTP_ACCEPT_LANGUAGE: en-us,en;q=0.5
    HTTP_CACHE_CONTROL: max-age=0
    HTTP_CONNECTION: keep-alive
    HTTP_HOST: www.craic.com
    HTTP_KEEP_ALIVE: 300
    HTTP_USER_AGENT: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
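The names in this output, such as HTTP_ACCEPT_LANGUAGE, are CGI environment variables rather than the header names the browser actually sent: the CGI interface upper-cases each header name, replaces hyphens with underscores, and adds an HTTP_ prefix. The following Python sketch reverses that mapping. The function names are hypothetical, and the capitalization it produces is only the conventional one, since header names are case-insensitive.

```python
def header_name(env_name: str) -> str:
    """Turn a CGI variable such as HTTP_ACCEPT_LANGUAGE into Accept-Language."""
    prefix = "HTTP_"
    if not env_name.startswith(prefix):
        raise ValueError(f"{env_name!r} is not a request-header variable")
    words = env_name[len(prefix):].split("_")
    return "-".join(word.capitalize() for word in words)


def request_headers(environ: dict) -> dict:
    """Collect the request headers from a CGI-style environment."""
    return {
        header_name(name): value
        for name, value in environ.items()
        if name.startswith("HTTP_")
    }


# A few of the variables from the sample output, plus one non-header variable.
SAMPLE_ENVIRON = {
    "REQUEST_METHOD": "GET",
    "HTTP_HOST": "www.craic.com",
    "HTTP_KEEP_ALIVE": "300",
    "HTTP_ACCEPT_LANGUAGE": "en-us,en;q=0.5",
}

if __name__ == "__main__":
    for name, value in sorted(request_headers(SAMPLE_ENVIRON).items()):
        print(f"{name}: {value}")
```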