Protecting Your Privacy

Now that you’ve seen how much information a web server can record about its visitors, you might be feeling a little uneasy. Let’s turn the tables and discuss how you can control the information that your browser gives to the servers to which it connects.

There are many reasons why you might not want a server to know anything about you. Seeing as you are reading this book, you might be investigating a dodgy web site and be concerned that the bad guys could identify you. You might be visiting sites that your government views as subversive and be worried about surveillance. Or you might be doing something illegal and not want to get caught.

The technology of the Internet, through its speed, ubiquity, and complete disdain for traditional national boundaries, has raised many complex issues involving civil liberties, censorship, law enforcement, and property laws. The technologies to protect or disguise your identity that are described here are at the heart of several of these debates. I encourage you to think about their ethical and political implications. The Electronic Frontier Foundation (EFF) (http://www.eff.org) is a vigorous champion of freedom on the Internet, and their site is an excellent resource.

If you want to disguise or hide your identity, then you have several choices, ranging from simple browser settings to sophisticated encryption and networking software.

The easiest approach is to modify the User-Agent string that your browser sends to the server. With some browsers, this is trivial. Konqueror, for example, can be set up to impersonate specific browsers on specific sites, or to send no User-Agent string at all. If you write your own Perl script to fetch web pages using the LWP module, you can have it masquerade as anything you want. In normal use, you should give such a script a unique name so that a server can identify it and decide whether or not to allow it access.

This sort of disguise can conceal the browser and operating system that you use, but that’s about it. In fact, it may work against you because some sites deliver browser-specific content. If you pretend to be using Internet Explorer when you are really using Safari, you may receive content that cannot be properly displayed.
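To make the scripting case concrete, here is a minimal sketch in Python rather than Perl, using the standard urllib module in place of LWP. The agent name ExampleFetcher/0.1 and the function names are made up for illustration; substitute a name that identifies your own script.

```python
import urllib.request

# Hypothetical name for illustration; pick one that identifies your script.
SCRIPT_AGENT = "ExampleFetcher/0.1"

def build_request(url, user_agent=SCRIPT_AGENT):
    """Return a request object that carries the chosen User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url, user_agent=SCRIPT_AGENT):
    """Fetch a page, presenting user_agent to the server (needs network access)."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

Passing a browser's User-Agent string as the second argument to fetch would make the script masquerade as that browser, with the same caveat about browser-specific content.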

The next step is to use a proxy that sits between your browser and the server you want to visit. A proxy is an intermediate server that takes your request, forwards it to the target server, accepts the content from that server, and passes that back to you. It has the potential to modify both the request you send and the content it receives. Proxies come in many forms. Some are used to cache frequently requested pages rather than fetch them from the original site every time. Some companies funnel requests from internal users through a proxy to block visits to objectionable web sites. There are two types that are particularly relevant to our interests. The first is a local proxy that provides some of the privacy features that are lacking from most browsers. The second is an external proxy through which we send our requests and that can mask our IP address.

Privoxy is an example of a local proxy that provides a wide range of filtering capabilities. It can process the outgoing requests sent from your browser to modify User-Agent and other headers. It can also modify incoming content to block cookies, pop ups, and ads.

The software is open source and is available from http://www.privoxy.org. You install it on your client computer, rather than on a server, and then configure your browser to send all HTTP and SSL requests to port 8118 on localhost. Figure 7-1 shows the proxy configuration dialog box for Firefox running on Mac OS X. Other browsers have a similar interface.
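A script can be pointed at the same local proxy. The following is a minimal sketch in Python using the standard urllib module; the port number comes from the text above, localhost is written as 127.0.0.1, and the function name is my own invention.

```python
import urllib.request

# Port 8118 on localhost, as used for the browser configuration.
PRIVOXY_ADDRESS = "127.0.0.1:8118"

def make_proxied_opener(proxy_address=PRIVOXY_ADDRESS):
    """Build an opener that sends HTTP and SSL requests via the local proxy."""
    proxy = urllib.request.ProxyHandler({
        "http": f"http://{proxy_address}",
        "https": f"http://{proxy_address}",
    })
    return urllib.request.build_opener(proxy)
```

Calling make_proxied_opener().open(url) would then route the request through the proxy, provided it is actually running on that port.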

The software then applies a series of filters to the request according to the actions that you have defined. You set these up by going to the URL http://config.privoxy.org, which is actually served by Privoxy running on your machine. Configuring the software is quite daunting due to the large number of options. I’ll limit my description to just a few of the more important ones.

To change the configuration, go to http://config.privoxy.org/show-status and click on the Edit button next to the default.action filename in the first panel of that page. This pulls up a confusing page that lists a great many actions, most of which apply to incoming content and can be safely ignored. Click on the first Edit button in the section entitled “Editing Actions File default.action”. This brings up a page of actions, each with radio buttons that can enable or disable that filter. You are strongly advised not to mess with any filters that you do not understand.

Perhaps the most useful of these is the hide-referrer action, which is enabled by default. Normally your browser would forward the URL of the page that contained the link to the current page. With this filter you can remove this header completely, set it to a fixed arbitrary URL, or set it to the root page for the target site. The last of these is the preferred option, as some sites will only serve images if the request was referred from a page on their site. Earlier in this chapter, I mentioned how query strings from Google searches can be included in the referrer header and can then be logged by the target site. Using this Privoxy filter allows you to hide this information.

The hide-user-agent action can be used to disguise the identity of the browser. Click on the enable button next to this item. Below it will appear an entry box that contains the string Privoxy/3.0 (Anonymous). You don’t want to use this because it tells the server that you are disguising your identity. Instead, take the default User-Agent string from your browser and strip out the text that identifies the version of either the browser or the operating system. For example, if the original string was this:

    Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.5)
    Gecko/20041107 Firefox/1.0

You would replace it with this abbreviated form:

    Mozilla/5.0 (Macintosh) Firefox

This allows the server to figure out what type of browser is being used and deliver appropriate content, while not revealing information that might be useful to an attacker. Figure 7-2 shows the relevant section of the configuration page.
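If you want to abbreviate strings like this in a script rather than by hand, the following Python sketch shows one way to do it. It is my own simplification, not something Privoxy provides: it assumes the string has the shape "product (platform; details) more products", and keeps only the first product, the first platform item, and the name of the last product without its version.

```python
import re

def abbreviate_user_agent(user_agent):
    """Strip version and operating system detail from a User-Agent string."""
    text = " ".join(user_agent.split())  # join wrapped lines, tidy spacing
    match = re.match(r"(\S+)\s*\(([^)]*)\)\s*(.*)", text)
    if not match:
        return text  # unexpected shape: leave the string untouched
    product, comment, rest = match.groups()
    platform = comment.split(";")[0].strip()
    tokens = rest.split()
    # Keep the browser name from the last product token, minus its version.
    browser = tokens[-1].split("/")[0] if tokens else ""
    return f"{product} ({platform}) {browser}".strip()

original = ("Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.5) "
            "Gecko/20041107 Firefox/1.0")
print(abbreviate_user_agent(original))  # Mozilla/5.0 (Macintosh) Firefox
```

Strings from other browsers may not follow this layout, so check the result before pasting it into the configuration page.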

You can check what Privoxy is actually doing to your requests by going to http://config.privoxy.org/show-request, which shows the headers before and after it has modified them.
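The preferred referrer setting described earlier, the root page of the target site, is easy to picture in code. This Python sketch shows the idea only; it is not how Privoxy itself is implemented, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

def root_referrer(target_url):
    """Return the root page of the site that target_url belongs to."""
    parts = urlsplit(target_url)
    return f"{parts.scheme}://{parts.netloc}/"

print(root_referrer("http://www.example.com/images/logo.gif?size=large"))
# http://www.example.com/
```

The server sees a referrer from its own site, which satisfies image checks, while the page you really came from, and any query string it carried, stays hidden.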

Neither of these approaches does anything to hide the IP address of your computer. To do that, you need an external proxy that will forward your request to the target server and return the content to your browser. There are many sites on the Internet that have been set up to provide this service. Typically you go to their home page and type in the URL you want to view. With a basic proxy, the IP address of the proxy site, rather than your own, will appear in the log of the target server. Sites vary in their level of sophistication. Some will redirect requests among their own set of servers so that no one address is used all the time. Others maintain a list of active proxies elsewhere on the Net and redirect through these, adding further steps between yourself and the target server. A Google search will turn up many examples; these are a few that are active at the time of writing:

Sites like these are set up for various reasons. Some people believe strongly in Internet freedom and want to provide a service to the community. Others are set up to help people who want to view pornography or other questionable, but legal, material, perhaps making some money in the process by serving up ads to their users. Undoubtedly there are some, lurking in the back alleys of the Net, that cater for those interested in illegal material such as child pornography.

Proxies are a dual-use technology. They can just as well protect a whistle-blower or dissident as they can protect a pedophile downloading child pornography. That poses a serious liability for people who operate proxy sites. If their server is involved in illegal activity, whether they know it or not, it will be their door that the FBI will be knocking on. Many proxies have been set up with the best of intentions only to find their service abused. Some have been shut down by the authorities, some have shut themselves down, and, without wanting to sound too paranoid, you can bet that some of them are honeypots, set up by the authorities, that exist solely to intercept and trace illegal traffic.

Proxy servers can protect the identity of an individual who accesses a specific server. But they do nothing to protect someone from a government that is able to monitor and trace traffic passing through the network, either by packet sniffing or through the use of compromised proxies. Truly anonymous browsing needs to use technology at a whole other level of sophistication that combines proxies with encryption. That technology, albeit in its infancy, is already available to us. One of the front-runners in this field is Tor, a project started by the Free Haven Project and the U.S. Naval Research Lab that was recently brought under the wing of the EFF (http://tor.eff.org). Tor uses a network of servers, or nodes, dispersed across the Internet to implement what is called an onion routing network. This paper provides a detailed technical background to the project: http://tor.eff.org/cvs/tor/doc/design-paper/tor-design.pdf.

It works by redirecting an HTTP request through multiple Tor nodes until finally sending it to the target web server. All communication between nodes is encrypted in such a way that no single node has enough information to decode the messages. Each node is a proxy, but not in the simple sense that we’ve been talking about thus far.

A Tor transaction starts with a regular web browser making a request for a page on a remote web server. The Tor client consults a directory of available nodes and picks one at random as the first hop towards the target server. It then extends the path from that node to a second one, and so on until there are deemed to be enough to ensure anonymity. The final node in the path is called the exit node. It will send the unencrypted request to the target web server and pass the content back along the same path to the client. All data sent between nodes on the network is encrypted and each node has a separate set of encryption keys generated for it by the client. The upshot is that any given node in the system, other than the client, only knows about the node it received data from and the one it sent data to. The use of separate encryption keys prevents any node from eavesdropping on the data it passes down the chain. This idea of building a path incrementally through the network is conceptually like peeling away the layers of an onion, hence the name onion routing.
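The layering idea can be shown with a toy example in Python. This is emphatically not Tor's protocol, which is described in the design paper: the cipher here is a throwaway XOR scheme that offers no real security, the node names are invented, and the whole message is wrapped in one step rather than the path being extended node by node. It only illustrates how each node can remove its own layer and learn nothing beyond the next hop.

```python
import hashlib
import os

def toy_cipher(key, data):
    """XOR data with a SHA-256-derived keystream. Illustration only, NOT secure.
    Applying it twice with the same key restores the original bytes."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))

def build_onion(request, path, target):
    """Client side: wrap the request in one layer per node.

    path is a list of (node_name, key) pairs ordered from first hop to exit
    node. Each layer holds the name of the next hop plus an opaque inner blob.
    """
    next_hop, blob = target, request
    for name, key in reversed(path):
        blob = toy_cipher(key, next_hop.encode() + b"\n" + blob)
        next_hop = name
    return blob

def peel_layer(key, blob):
    """Node side: remove this node's layer, revealing only the next hop."""
    plain = toy_cipher(key, blob)
    next_hop, _, inner = plain.partition(b"\n")
    return next_hop.decode(), inner

# Three hypothetical nodes, each with its own key chosen by the client.
path = [(name, os.urandom(16)) for name in ("node-a", "node-b", "exit-node")]
onion = build_onion(b"GET /index.html HTTP/1.1", path, "www.example.com")

blob = onion
for name, key in path:
    next_hop, blob = peel_layer(key, blob)
    print(f"{name} forwards to {next_hop}")
# Only after the exit node's layer is removed is blob the plain request.
```

Each node in the loop sees the name of the next hop and a blob it cannot read, which mirrors the property that a node knows only its immediate neighbors in the path.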

The path selection and encryption prevent anyone from observing the traffic passing through the network. The target web server sees only the IP address of the exit node, and it is impossible to trace a path back to the client. Furthermore, the lifespan of a path through the network is short, typically less than a minute, so that consecutive requests for pages from a single client will most likely come from different exit nodes.

Tor is available for Windows, Mac OS X, and Unix. Installation as a client is straightforward. Installing Privoxy is recommended alongside Tor, and happens automatically with the Mac OS X installation. To use the network you need to set your browser to use a proxy. That configuration is identical to the one described earlier for Privoxy.

Once you have it configured, the software works quietly in the background. It does slow things down, sometimes significantly. This is a function of the number of server nodes and the traffic going through them at any one time. The Tor project team encourages users of the system to contribute to its success by setting up server nodes. The more servers there are, the better the performance and the more secure the system.

Here is an example of some edited Apache log entries for a regular browser following a series of links from one page to another:

    208.12.16.2  "GET /index.html HTTP/1.1"
    208.12.16.2  "GET /mobile/ora/index.html HTTP/1.1"
    208.12.16.2  "GET /mobile/ora/wurfl_cgi_listing.html HTTP/1.1"

The owner of the web server can see a single machine and the path it takes through the site. Now look at the same path when run through Tor:

    64.246.50.101  "GET /index.html HTTP/1.1"
    24.207.210.2   "GET /mobile/ora/index.html HTTP/1.1"
    67.19.27.123   "GET /mobile/ora/wurfl_cgi_listing.html HTTP/1.1"

Each page appears to have been retrieved from a separate browser, none of which is the true source of the request.
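A short Python sketch makes the difference plain by grouping the requested pages by client address. It assumes log lines in the abbreviated form shown above (address, then the quoted request); real Apache logs carry more fields and would need a fuller parser.

```python
from collections import defaultdict

def paths_by_client(log_lines):
    """Map each client IP address to the list of pages it requested, in order."""
    paths = defaultdict(list)
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ip_address, page = fields[0], fields[2]
        paths[ip_address].append(page)
    return dict(paths)

direct_log = [
    '208.12.16.2  "GET /index.html HTTP/1.1"',
    '208.12.16.2  "GET /mobile/ora/index.html HTTP/1.1"',
    '208.12.16.2  "GET /mobile/ora/wurfl_cgi_listing.html HTTP/1.1"',
]
tor_log = [
    '64.246.50.101  "GET /index.html HTTP/1.1"',
    '24.207.210.2   "GET /mobile/ora/index.html HTTP/1.1"',
    '67.19.27.123   "GET /mobile/ora/wurfl_cgi_listing.html HTTP/1.1"',
]

print(len(paths_by_client(direct_log)))  # 1 client with a three-page path
print(len(paths_by_client(tor_log)))     # 3 apparent clients, one page each
```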

As it stands, Tor is a great way to protect your communications from attempts at eavesdropping, and it effectively shields your IP address from any site that you visit. Of course, no system is perfect. Even though a site cannot determine your IP address, its owners can still detect that someone is visiting by way of the Tor network, which might suggest to them that they are under investigation.

We can download the list of all the current active Tor nodes (http://belegost.seul.org/), and then look for their IP addresses in our logs. At the time of this writing, there are only 134 of these, so this is not difficult. Sets of log records with these IP addresses, close together in time, would suggest that a site is being accessed via the Tor network. Looking at the collection of pages that were visited and, if possible, the referring pages, could allow us to piece together the path taken by that visitor. For this reason, it is especially important that you set up Privoxy in conjunction with Tor and have it hide your referring page.
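Here is a rough Python sketch of that check from the site owner's side. It assumes the node list has already been reduced to a set of IP address strings, since the format of the directory itself is not covered here, and it reads log lines in the abbreviated form used in the examples above. The addresses in the sample set are simply the ones from the earlier log excerpt, used as stand-ins.

```python
def flag_tor_requests(log_lines, tor_node_ips):
    """Return (ip_address, page) pairs for requests that came from known nodes."""
    hits = []
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] in tor_node_ips:
            hits.append((fields[0], fields[2]))
    return hits

# Stand-in node list for illustration; a real one would come from the directory.
known_nodes = {"64.246.50.101", "24.207.210.2", "67.19.27.123"}

log = [
    '208.12.16.2    "GET /about.html HTTP/1.1"',
    '64.246.50.101  "GET /index.html HTTP/1.1"',
    '24.207.210.2   "GET /mobile/ora/index.html HTTP/1.1"',
    '67.19.27.123   "GET /mobile/ora/wurfl_cgi_listing.html HTTP/1.1"',
]

for ip_address, page in flag_tor_requests(log, known_nodes):
    print(ip_address, page)
```

The flagged pages, read in order, trace the same three-page path as before, even though each request arrived from a different address.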

Tor is a work in progress. The technology behind it is sophisticated, well thought out, and well implemented. It addresses most of the technical issues that face any scheme for anonymous communication. While the network is still small, it is growing and has solid backing from the EFF and others. How it will deal with the inevitable problem of abuse remains to be seen. Finding a technical solution to this social problem is probably impossible.

As a practical matter, if you are going to be poking around web sites that are involved in phishing or other shady business, then it makes sense to hide your identity from them using Tor. It’s a simple precaution that guards against the outside possibility that someone will get upset with you and flood you with spam or try to break into your machine.

On a lighter note, I do have to warn you about certain side effects when you use Tor for regular browsing. Some sites, such as Google, look at the IP address that your request is coming from and deliver content tailored to that part of the world. With Tor, you cannot predict which exit node your request will finally emerge from. It had me scratching my head for quite a while the first time my Google search returned its results in Japanese!