Chapter 7. Web Browsers

Chapters 5 and 6 covered what can be learned about a web site and the server that hosts it. This chapter takes a look at things from the other side: what the server can learn about us.

A web server needs to know certain things about a browser to return the requested page successfully. First and foremost is the IP address of the machine that is sending the request. Without that, the server doesn’t know where to send the data. Next are the capabilities of the browser. Not all browsers can handle all types of content, and all common browsers will tell the server what they can and can’t accept.

A basic HTTP transaction, fetching a simple web page, starts out with the browser sending a request to the server. The request contains the name of the document to be returned, along with the version of the HTTP protocol and the method that should be used to service the request. Also included are a number of headers that convey ancillary information that can help the server tailor its response. Table 7-1 shows a set of these headers that accompanied an example request.

These are only some of the possible headers. Additional background can be found in this document: http://www.w3.org/Protocols/HTTP/HTRQ_Headers.html.
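To make the layout of a request concrete, here is a minimal Python sketch that assembles the text a browser might send. The function name build_request, the host www.example.com, and the header values are illustrative assumptions and are not taken from Table 7-1.

```python
def build_request(path, host, headers=None, method="GET", version="HTTP/1.1"):
    """Assemble the text of a minimal HTTP request (illustrative only)."""
    # The request line carries the method, the document name and the
    # protocol version, in that order.
    lines = [f"{method} {path} {version}", f"Host: {host}"]
    # Each additional header is a "Name: value" line.
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Lines end in CRLF, and a blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


# Hypothetical values, chosen only to show the shape of a request.
request = build_request(
    "/index.html",
    "www.example.com",
    {"User-Agent": "ExampleBrowser/1.0", "Accept": "text/html;q=0.9,*/*;q=0.5"},
)
print(request)
```

A real browser sends more headers than this, but the structure is the same: one request line, a series of header lines, then a blank line.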

Implicit in a transaction, and so not needing its own header, is the IP address of the requesting browser. The type of browser that is making the request is specified in the User-Agent header, which might declare it to be Mozilla Firefox running under Linux, for example.

The browser also has to inform the server what types of content it can accept, which it does with the Accept header. Most browsers will in fact accept anything the server chooses to send, but there is a difference between accepting the content and knowing what to do with it. If the browser can’t display video, for example, you will typically get a pop-up asking if you want to save the content to a file. Most browsers therefore use the Accept header to let the server know what types of content they prefer, which lets the server choose one form of content over another. These days, the major browsers can handle all the common formats, so the header is less important than it once was. The exception comes from mobile phone browsers. These are highly constrained by small screen size and limited bandwidth, so a server that delivers content to these devices will make good use of the Accept header. It might return a WML page rather than standard HTML, for example, or an error message if a certain type of phone requests a large MPEG movie.

Alongside the Accept header are optional headers that tell the server what language the content should be sent in (Accept-Language), which character sets are acceptable (Accept-Charset), and what compression schemes can be handled if the server can send compressed data to conserve bandwidth (Accept-Encoding). These headers are often ignored but can be very useful if your site has versions in multiple languages, for example. In the headers that list alternatives, you will often see a semicolon followed by q= and a value between 0 and 1. For example:

    Accept: text/html;q=0.9,text/plain;q=0.8,*/*;q=0.5

These are called quality, or sometimes degradation, values, and they are used to help the server decide which alternative form of content should be returned. You can think of them as quantifying the client browser’s preference, given a choice. So in this example the browser would prefer to receive HTML text rather than plain text, but in a pinch it will accept anything. The gory details can be found in this document: http://www.w3.org/Protocols/HTTP/Negotiation.html.
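The way a server might use these values can be sketched in a few lines of Python. This is a simplified illustration rather than a full implementation of content negotiation: the function names parse_accept and choose are hypothetical, and the sketch ranks entries by q value alone, ignoring the finer precedence rules that a production server would apply.

```python
def parse_accept(header):
    """Return (media_type, q) pairs, most preferred first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0]
        if not media_type:
            continue
        q = 1.0  # an entry with no q value gets full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not wanted"
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their original order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose(available, header):
    """Pick the available type that the browser ranks highest, or None."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        for candidate in available:
            if media_type in (candidate, "*/*"):
                return candidate
            # Handle ranges such as text/* by comparing the "text/" prefix.
            if media_type.endswith("/*") and candidate.startswith(media_type[:-1]):
                return candidate
    return None


header = "text/html;q=0.9,text/plain;q=0.8,*/*;q=0.5"
print(parse_accept(header))
print(choose(["text/plain", "text/html"], header))
```

Given the header from the example above, a server that has both plain text and HTML versions of a page would select the HTML version, and a server that has only a PNG image would still be able to send it because of the */* entry.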

The Host header is an extremely important piece of information. This is the hostname that the browser is trying to connect to. You might think that this is inherent in the URL used to initiate the transaction, but servers often host multiple web sites. This header lets the server direct the request to the correct virtual host.
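As a rough illustration of how a server might route requests by Host header, the Python sketch below maps hostnames to document directories. The hostnames, the directory paths, and the function name document_root are all hypothetical; real servers such as Apache handle this through their own configuration.

```python
# Hypothetical mapping of virtual hosts to document directories.
VIRTUAL_HOSTS = {
    "www.example.com": "/var/www/example",
    "blog.example.com": "/var/www/blog",
}
DEFAULT_ROOT = "/var/www/default"


def document_root(host_header):
    """Return the directory that should serve a request for this Host value."""
    # The header may carry a port ("www.example.com:8080"), and hostnames
    # are case-insensitive. Assumption: no IPv6 literals, which also
    # contain colons and would need separate handling.
    hostname = host_header.split(":")[0].strip().lower()
    return VIRTUAL_HOSTS.get(hostname, DEFAULT_ROOT)


print(document_root("WWW.Example.com:8080"))
print(document_root("unknown.example.org"))
```

Falling back to a default directory for unrecognized hostnames is one possible design choice; a server could equally well reject such requests.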

The headers also include a Connection line and perhaps a Keep-Alive line. These tell the server to keep the connection between it and the browser open for a period of time once the requested page has been sent. Users often look at several pages on any given site, and keeping the connection open allows subsequent requests to be serviced more efficiently.

If the request was initiated by clicking on a link on a web page, as opposed to typing a URL into the browser directly, then a Referer header will be included that tells the server the URL of the page that you came from. This is invaluable to commerce sites that want to track where their customers found out about their services.

To see what your browser is telling the world about your system, you need to visit a site that reflects that information back to you. There are many of these out there on the Net. Two that are available at the time of writing are http://ats.nist.gov/cgi-bin/cgi.tcl/echo.cgi and http://www.ugcs.caltech.edu/~presto/echo.cgi. Alternatively, you can set up the Perl script shown in Example 7-1 on your own server.

Go to that URL from your browser and you should see output similar to this:

    Information available to this server from your browser:
    Remote Host: 208.12.16.2
    Refering Page:
    Request Method: GET
    HTTP_ACCEPT: text/xml,application/xml,application/xhtml+xml,
    text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    HTTP_ACCEPT_ENCODING: gzip,deflate
    HTTP_ACCEPT_LANGUAGE: en-us,en;q=0.5
    HTTP_CACHE_CONTROL: max-age=0
    HTTP_CONNECTION: keep-alive
    HTTP_HOST: www.craic.com
    HTTP_KEEP_ALIVE: 300
    HTTP_USER_AGENT: Mozilla/5.0 (X11;U;Linux i686;en-US;rv:1.7.5)
    Gecko/20041107 Firefox/1.0
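If you would rather experiment locally, output of this general kind can be produced with a few lines of Python using the standard library's wsgiref module. This is a sketch written for illustration and is not the script from Example 7-1: the names echo_app and serve, the port number, and the exact wording of the output are assumptions.

```python
from wsgiref.simple_server import make_server


def echo_app(environ, start_response):
    """WSGI application that reflects request details back to the browser."""
    lines = [
        "Information available to this server from your browser:",
        f"Remote Host: {environ.get('REMOTE_ADDR', '')}",
        f"Request Method: {environ.get('REQUEST_METHOD', '')}",
    ]
    # Request headers reach the application as HTTP_* variables,
    # for example User-Agent becomes HTTP_USER_AGENT.
    for key in sorted(environ):
        if key.startswith("HTTP_"):
            lines.append(f"{key}: {environ[key]}")
    body = "\n".join(lines).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]


def serve(port=8000):
    """Run the echo application on the local machine until interrupted."""
    with make_server("127.0.0.1", port, echo_app) as server:
        server.serve_forever()
```

Calling serve() and then visiting http://127.0.0.1:8000/ in a browser should list the headers that the browser sent. Because the server listens only on the loopback address, the Remote Host line will show your own machine rather than a public IP address.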