As mentioned earlier, the term proxy server can refer to many different things. And even within the definition of proxies that are of particular interest to webbot developers, you’ll find wide diversity. This section describes some of the proxy servers available to you, as well as their advantages and disadvantages.
For a variety of reasons, which will be described later, thousands of proxy servers are available on the Internet for you to use freely. These proxies are known as open proxies. Just as with the proxy servers we discussed earlier, when you connect your webbot or browser to an open proxy, you assume that proxy’s IP address—and, by default, its physical location.
To experience an open proxy for yourself, do an Internet search on the term “open proxy.” Within the search results, you will find links to services that list hundreds, if not thousands, of open proxies. Figure 27-6 shows a representative list (from http://www.xroxy.com[77]).
In addition to the proxy’s IP address and port number, most of these lists provide other information about the proxy, such as the proxy type (this will probably be HTTP, SOCKS, or SOCKS5), the country of origin, whether the proxy server supports SSL encryption, the latency (the amount of time it takes to get a response from the proxy), and some type of reliability rating.
Some lists also include parameters that define how much, or how little, the proxy discloses about its user. The common categories are as follows.
Transparent Proxies
While these proxies function like any other, the originating (your) IP address is forwarded in the HTTP_X_FORWARDED_FOR variable, which is exposed to the web server and may be recorded in the website’s access log file.
Anonymous Proxies
The originating IP address is not passed to the web server, but it may still be possible to detect that the traffic was directed through a proxy.
Spoofing Proxies
These proxies fool the destination server into believing that the traffic originated from a totally different location.
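The disclosure levels described above can be illustrated from the web server’s point of view. The following is a minimal sketch in Python, not a definitive detection method: the function name `classify_proxy`, the returned labels, and the rule that a `Via` header alone indicates an anonymous proxy are all assumptions made for illustration. The addresses used in any call should be your own test values.

```python
def classify_proxy(server_vars, real_ip):
    """Guess how much a proxy disclosed, given the variables a web server sees.

    server_vars: dict of CGI-style variables (for example, HTTP_X_FORWARDED_FOR).
    real_ip: the client's true IP address, known only in a test setup.
    The mapping below is a heuristic assumption, not an authoritative rule.
    """
    forwarded = server_vars.get("HTTP_X_FORWARDED_FOR", "")
    # The forwarded variable may hold a comma-separated chain of addresses.
    forwarded_ips = [ip.strip() for ip in forwarded.split(",") if ip.strip()]

    if real_ip in forwarded_ips:
        # The originating address was passed along to the web server.
        return "transparent"
    if forwarded_ips:
        # An address was forwarded, but it is not the real one.
        return "spoofing"
    if "HTTP_VIA" in server_vars:
        # No originating address, but the proxy still announced itself.
        return "anonymous"
    return "no proxy evidence"
```

Running a test page that prints these variables, once directly and once through a proxy, is one way to see which category a given open proxy falls into.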
I prefaced the first paragraph in this section with the words “for a variety of reasons,” because like many things you’ll find online, not everything is as it appears to be. Before you use an open proxy, you should ask yourself why anyone would open up their network and allow strangers to consume their resources. The truth is that there are very few legitimate reasons for anyone to do so. So why are these open proxies made available?
Many open proxies are actually misconfigured servers that inadvertently relay connections for anyone. This can happen for many reasons, including when a system administrator installs a mail server and never bothers to change the default settings.
It is also strongly suspected that law enforcement agencies, governments, and cyber voyeurs use proxy servers either to detect or conceal criminal activity or to uncover covert political movements. Still other open proxies are run unknowingly by ordinary people who inadvertently installed them along with malware or viruses.
Open proxies are good for learning, but I would not recommend them for production use. Since you don’t control the open proxy’s environment, and since the service isn’t guaranteed, there is no way to predict whether the proxy’s performance will hold up or whether the proxy will even be there when you really need it. The other problem with open proxies is that, as mentioned earlier, you don’t know who is operating the service, so never use an open proxy when you are transmitting confidential information like usernames or passwords.
Since open proxies are often available only by accident, the availability of open proxy servers changes constantly. To solve this problem, a number of online businesses sell a service that maintains a constantly updated database of available open proxies and their performance. This information is sold to developers, who use it to pick the most appropriate proxy for their needs. Some of these services will even make available their own proxy, with a consistent interface, which automatically picks the best open proxy for your needs. These services come and go, but a quick online search should turn up several of them.
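Picking “the most appropriate proxy” from such a list usually comes down to filtering on the listed attributes and then ranking what remains. The sketch below, in Python, shows one way this might look. The `ProxyEntry` fields, the `pick_proxy` function, and the choice to rank by lowest latency are hypothetical; real list services define their own formats and scoring.

```python
from dataclasses import dataclass


@dataclass
class ProxyEntry:
    # Hypothetical record layout modeled on the attributes proxy lists show.
    address: str
    port: int
    proxy_type: str      # for example "HTTP" or "SOCKS5"
    country: str
    supports_ssl: bool
    latency_ms: float
    reliability: float   # assumed scale: 0.0 (worst) to 1.0 (best)


def pick_proxy(entries, proxy_type="HTTP", country=None,
               need_ssl=False, min_reliability=0.0):
    """Return the lowest-latency entry that meets the filters, or None."""
    candidates = [
        e for e in entries
        if e.proxy_type == proxy_type
        and (country is None or e.country == country)
        and (e.supports_ssl or not need_ssl)
        and e.reliability >= min_reliability
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.latency_ms)


# Made-up sample data using reserved documentation addresses.
SAMPLE = [
    ProxyEntry("192.0.2.10", 8080, "HTTP", "US", False, 850.0, 0.40),
    ProxyEntry("198.51.100.20", 3128, "HTTP", "DE", True, 320.0, 0.90),
    ProxyEntry("203.0.113.30", 1080, "SOCKS5", "US", True, 120.0, 0.95),
]

best = pick_proxy(SAMPLE, proxy_type="HTTP", need_ssl=True, min_reliability=0.5)
```

Because open proxies appear and vanish without notice, a selection like this would need to be repeated against fresh data rather than cached.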
Tor is an anonymous proxy service that is based on US naval technology. While the military is believed to still use this technology, it is now an open source project maintained by the nonprofit Tor Project (http://www.TorProject.org).
Unlike open proxies, Tor is a voluntary community of proxies that relay traffic through a varying route of community servers until finally exiting at a Tor endpoint. This technique makes tracing traffic back to its origin very difficult. Tor also encrypts all traffic, so there is reduced danger of being identified by network sniffers. Because of Tor’s availability (it’s free) and success, it has been embraced by journalists, military personnel, law enforcement, political dissenters, webbot developers, and people like you and me.
A lot could be written about Tor, but you will only find the basics here. You are strongly encouraged to visit the Tor Project website to learn more and possibly even contribute to the project.
To use Tor, you need to install Polipo, which is part of the Tor distribution package. Polipo is a proxy that runs locally on your computer. It communicates with the Tor network, and between Polipo and the Tor network, a path for your data is selected from the community of Tor relay (proxy) servers (see Figure 27-7). The websites you access when using Tor only get to see the IP address of the final Tor endpoint. And as mentioned earlier, all network traffic within the Tor network is encrypted.
As one continues to use Tor, the path through the Tor relays to the Tor endpoint continually changes, as depicted in Figure 27-8.
The combination of encryption, constantly changing routing paths, and mixing of your network activity with that of many other people makes Tor a fairly good anonymous environment.
You connect to Tor through Polipo, which runs as a local server at IP address 127.0.0.1 on port 8118. Use this IP address and port combination, or the address and port recommended in the documentation, and configure PHP/CURL or a browser as you would with any other proxy service.
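As a rough illustration of that configuration step, here is a minimal sketch using Python’s standard library instead of PHP/CURL (a substitution made purely for illustration). The function name `build_proxy_opener` is hypothetical, and the defaults simply repeat the address and port given above; check your own installation’s documentation for the values it actually uses.

```python
import urllib.request


def build_proxy_opener(host="127.0.0.1", port=8118):
    """Build a URL opener that routes HTTP and HTTPS requests via a local proxy.

    Building the opener sends no network traffic; requests go through the
    proxy only when opener.open(...) is called later.
    """
    proxy_url = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({
        "http": proxy_url,
        "https": proxy_url,
    })
    return urllib.request.build_opener(handler)


opener = build_proxy_opener()
```

With the local proxy running, a call such as `opener.open("http://example.com/")` would be routed through it. If nothing is listening on that port, the same call fails with a connection error, which is a quick way to confirm that traffic is not silently bypassing the proxy.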
As mentioned earlier, Tor creates a “fairly good” anonymous environment, but it is important to remember that it is not necessarily completely anonymous. It is possible for websites to bypass Tor with Java, Flash, or other browser plug-ins. More importantly, successful use of Tor requires that you maintain safe browsing habits. Tor won’t help you if you’re someone who readily enters personal information into website forms.
The other disadvantage of Tor is that it will be slower than browsing directly. Sometimes Tor can be annoyingly slow. For that reason, Tor is best suited for lightweight webbot applications that don’t rely on a lot of media. For webbot applications that only download HTML web pages without graphics, I find Tor’s performance completely acceptable. Tor, however, is not suitable for any file sharing or streamed media applications. You should also be aware that Tor is maintained and hosted by volunteers and that it would be bad form to create webbots that selfishly burn Tor’s limited bandwidth or the CPU cycles donated by the Tor community.
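One simple way to keep a webbot from monopolizing volunteer bandwidth is to enforce a minimum pause between requests. The following Python sketch shows the idea; the class name `PoliteThrottle` and the five-second default are arbitrary assumptions, not recommendations from the Tor Project.

```python
import time


class PoliteThrottle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval_s=5.0, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested without waiting.
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was allowed

    def wait(self):
        """Pause if the last request was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval_s - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

A webbot would call `wait()` immediately before each page fetch, so that bursts of downloads are spread out no matter how quickly the rest of the script runs.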
In addition to open proxies and Tor, a variety of commercial proxy products are available for purchase. The quality, features, and price vary from provider to provider, but most let you restrict the IP addresses you use to those originating in a specific country.
It is not the intent of this book to endorse any specific proxy service providers. However, two of the bigger players in this segment are Anonymizer (http://www.Anonymizer.com) and HideMyIP (http://HideMyIP.com). You can find a wide selection of similar proxy services by performing an online search with the term “anonymous browsing.” Available commercial proxy services range from marginal to downright amazing. Some of the more compelling proxy services utilize many thousands of IP addresses, have low network latency, and change IP addresses every few seconds. Less desirable proxy services are slow, change IP addresses infrequently, and put a small pool of available addresses at your disposal. Pricing also varies widely. Some proxy services are available for a few dollars a month, but the more advanced proxies, which provide the most anonymity, are priced per HTTP GET and can become quite expensive in commercial webbot environments.
One thing that many commercial, or consumer-oriented, proxy services have in common is that they are not configured like traditional proxies. Instead of having you set a proxy’s IP address and port in a browser or PHP/CURL configuration, these programs work with the browser to intercept web traffic and automatically route it through their network. While this “configure-less” environment makes it easier for consumers to set up and use the proxy, it makes it next to impossible for a webbot employing PHP/CURL to use such services. These configure-less proxy services are, however, ideal for the browser macro applications discussed in Chapter 23 and Chapter 24.