Chapter 5. Web Clients

This chapter covers the HTTP client side of Twisted Web, starting with quick web resource retrieval for one-off applications and ending with the Agent API for developing flexible web clients.

Basic HTTP Resource Retrieval

Twisted has several high-level convenience functions for quick one-off resource retrieval.

twisted.web.client.getPage asynchronously retrieves a resource at a given URL. It returns a Deferred, which fires its callback with the resource as a string. Example 5-1 demonstrates the use of getPage; it retrieves and prints the resource at the user-supplied URL.

We can test this script with:

python print_resource.py http://www.google.com

which will print the contents of Google’s home page to the screen.

An invalid URL will produce something like the following:

$ python print_resource.py http://notvalid.foo
[Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.DNSLookupError'>:
DNS lookup failed: address 'notvalid.foo' not found:
[Errno 8] nodename nor servname provided, or not known.
]

Despite its name, getPage can make any HTTP request type. To make an HTTP POST request with getPage, supply the method and postdata keyword arguments: for example, getPage(sys.argv[1], method='POST', postdata="My test data").

getPage also supports using cookies, following redirects, and changing the User-Agent for the request.

getPage and downloadPage (its counterpart that saves the resource to a file) are useful for getting small jobs done, but the main Twisted HTTP client API, which supports a broad range of RFC-compliant behaviors in a flexible and extensible way, is the Agent.

Example 5-3 implements the same functionality as print_resource.py from Example 5-1 using the Agent API.

The agent version requires a bit more work but is much more general-purpose. Let’s break down the steps involved:

  1. Initialize an instance of twisted.web.client.Agent. Because the agent handles connection setup, it must be initialized with a reactor.

  2. Make an HTTP request with the agent’s request method. It takes at minimum the HTTP method and URL. On success, agent.request returns a Deferred that fires with a Response object encapsulating the response to the request.

  3. Register a callback with the Deferred returned by agent.request to handle the Response body as it becomes available through response.deliverBody. Because the response is coming across the network in chunks, we need a Protocol that will process the data as it is received and notify us when the body has been completely delivered.

    To accomplish this, we create a Protocol subclass called ResourcePrinter, much as we did when constructing basic TCP servers and clients in Chapter 2. The big difference is that we want to continue processing the event outside of ResourcePrinter. The link to the outside world is a Deferred that is passed to the ResourcePrinter instance on initialization and fired when the connection is terminated. printResource creates and returns that Deferred so that more callbacks can be registered for additional processing. As chunks of the response body arrive, the reactor invokes dataReceived, and we print the data to the screen. When the reactor invokes connectionLost, we fire the Deferred.

  4. Once the connection has been terminated, stop the reactor. To do this, we register a stop function with addBoth. Because printResource returns the Deferred that connectionLost fires, stop runs only after the body has been fully delivered. Recall that addBoth registers the same function with both the callback and errback chains, so the reactor will be stopped whether or not the download was successful.

  5. Finally, run the reactor, which will kick off the HTTP request.

Running this example with python agent_print_resource.py http://www.google.com produces the same output as Example 5-1.

Retrieving Response Metadata

Agent supports all HTTP methods and arbitrary HTTP headers. Example 5-4 demonstrates this functionality with an HTTP HEAD request.

The Response object with which the Deferred returned by agent.request fires carries useful HTTP response metadata, including the HTTP status code, HTTP version, and headers. Example 5-4 also demonstrates extracting this information.

Testing this script with a URL like:

python print_metadata.py http://www.google.com/

produces the following output:

HTTP version: ('HTTP', 1, 1)
Status code: 200
Status phrase: OK
Response headers:
X-Xss-Protection ['1; mode=block']
Set-Cookie ['PREF=ID=b1401ec53122a4e5:FF=0:TM=1340750440...
Expires ['-1']
Server ['gws']
Cache-Control ['private, max-age=0']
Date ['Tue, 26 Jun 2012 22:40:40 GMT']
P3p ['CP="This is not a P3P policy! See http://www.google.com/support/...
Content-Type ['text/html; charset=ISO-8859-1']
X-Frame-Options ['SAMEORIGIN']

To POST HTTP data with Agent, we need to construct a producer that provides the IBodyProducer interface and produces the POST data when the Agent needs it.

To provide the IBodyProducer interface, which is enforced by Twisted’s use of zope.interface.implements, a class must implement the startProducing, pauseProducing, resumeProducing, and stopProducing methods, as well as a length attribute tracking the length of the data the producer will eventually produce.

For this example, we can construct a simple StringProducer that just writes out the POST data to the waiting consumer when startProducing is invoked. StringProducer is passed as the bodyProducer argument to agent.request.

Example 5-5 shows a complete POSTing client. Beyond the StringProducer, the code is almost identical to the resource-requesting client in Example 5-3.

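import sys

from twisted.internet import reactor
from twisted.internet.defer import Deferred, succeed
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent
from twisted.web.iweb import IBodyProducer
from zope.interface import implements

class StringProducer(object):
    implements(IBodyProducer)

    def __init__(self, body):
        self.body = body
        self.length = len(body)

    def startProducing(self, consumer):
        # Write the whole body at once, then report that production is done.
        consumer.write(self.body)
        return succeed(None)

    def pauseProducing(self):
        pass

    def resumeProducing(self):
        pass

    def stopProducing(self):
        pass

class ResourcePrinter(Protocol):
    def __init__(self, finished):
        self.finished = finished

    def dataReceived(self, data):
        print data

    def connectionLost(self, reason):
        self.finished.callback(None)

def printResource(response):
    finished = Deferred()
    response.deliverBody(ResourcePrinter(finished))
    return finished

def printError(failure):
    print >>sys.stderr, failure

def stop(result):
    reactor.stop()

if len(sys.argv) != 3:
    print >>sys.stderr, "Usage: python post_data.py URL 'POST DATA'"
    sys.exit(1)

agent = Agent(reactor)
body = StringProducer(sys.argv[2])
d = agent.request('POST', sys.argv[1], bodyProducer=body)
d.addCallbacks(printResource, printError)
d.addBoth(stop)

reactor.run()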

To test this example, we need a URL that accepts POST requests. http://www.google.com is not such a URL, as it turns out. This:

python post_data.py http://www.google.com 'Hello World'

prints:

The request method POST is inappropriate for the URL /. That’s all we know.

This is an occasion where being able to spin up a basic web server easily for testing would be useful. Fortunately, we covered Twisted web servers in the previous chapter!

Example 5-6 is a simple web server that echoes the body of a POST, only reversed.

Running python test_server.py starts the web server listening on port 8000. With that server running, we can then test our client with:

$ python post_data.py http://127.0.0.1:8000 'Hello World'
dlroW olleH

This chapter introduced Twisted HTTP clients. High-level helpers getPage and downloadPage make quick resource retrieval easy. The Agent is a flexible and comprehensive API for writing web clients.

The Twisted Web Client HOWTO discusses the Agent API in detail, including handling proxies and cookies.

The Twisted Web examples directory has a variety of HTTP client examples.