Chapter 5. Web Clients

This chapter covers the HTTP client side of Twisted Web, starting with quick web resource retrieval for one-off applications and ending with the Agent API for developing flexible web clients.

Basic HTTP Resource Retrieval

Twisted has several high-level convenience functions for quick one-off resource retrieval.

Printing a Web Resource

twisted.web.client.getPage asynchronously retrieves a resource at a given URL. It returns a Deferred, which fires its callback with the resource as a string. Example 5-1 demonstrates the use of getPage; it retrieves and prints the resource at the user-supplied URL.

Example 5-1. print_resource.py

from twisted.internet import reactor
from twisted.web.client import getPage
import sys

def printPage(result):
    print result

def printError(failure):
    print >>sys.stderr, failure

def stop(result):
    reactor.stop()

if len(sys.argv) != 2:
    print >>sys.stderr, "Usage: python print_resource.py <URL>"
    exit(1)

d = getPage(sys.argv[1])
d.addCallbacks(printPage, printError)
d.addBoth(stop)

reactor.run()

We can test this script with:

python print_resource.py http://www.google.com

which will print the contents of Google’s home page to the screen.

An invalid URL will produce something like the following:

$ python print_resource.py http://notvalid.foo
[Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.DNSLookupError'>:
DNS lookup failed: address 'notvalid.foo' not found:
[Errno 8] nodename nor servname provided, or not known.
]

Despite its name, getPage can make any HTTP request type. To make an HTTP POST request with getPage, supply the method and postdata keyword arguments: for example, getPage(sys.argv[1], method='POST', postdata="My test data").
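
The postdata argument is sent as the request body exactly as given, so form fields have to be encoded before the call. The following is a minimal sketch of building such a body with the standard library, written in Python 3 syntax. The field names and the form-encoded content type are assumptions for illustration, not something getPage requires.

```python
from urllib.parse import urlencode

# Hypothetical form fields; any mapping of names to values works.
fields = {"q": "twisted web", "lang": "en"}

# urlencode produces text; encode it because an HTTP body is a byte string.
postdata = urlencode(fields).encode("ascii")

# A server parsing form data would typically expect this header with the body.
headers = {"Content-Type": "application/x-www-form-urlencoded"}

print(postdata)
```

The resulting postdata and headers values could then be passed to getPage through its postdata and headers keyword arguments.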

getPage also supports using cookies, following redirects, and changing the User-Agent for the request.

Downloading a Web Resource

twisted.web.client.downloadPage asynchronously downloads a resource at a given URL to the specified file. Example 5-2 demonstrates the use of downloadPage.

Example 5-2. download_resource.py

from twisted.internet import reactor
from twisted.web.client import downloadPage
import sys

def printError(failure):
    print >>sys.stderr, failure

def stop(result):
    reactor.stop()

if len(sys.argv) != 3:
    print >>sys.stderr, "Usage: python download_resource.py <URL> <output file>"
    exit(1)

d = downloadPage(sys.argv[1], sys.argv[2])
d.addErrback(printError)
d.addBoth(stop)

reactor.run()

We can test this script with:

python download_resource.py http://www.google.com google.html

which will save the contents of Google’s home page to the file google.html.

Agent

getPage and downloadPage are useful for getting small jobs done, but the main Twisted HTTP client API, which supports a broad range of RFC-compliant behaviors in a flexible and extensible way, is the Agent.

Requesting Resources with Agent

Example 5-3 implements the same functionality as print_resource.py from Example 5-1 using the Agent API.

Example 5-3. agent_print_resource.py

import sys

from twisted.internet import reactor
from twisted.internet.defer import Deferred
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent

class ResourcePrinter(Protocol):
    def __init__(self, finished):
        self.finished = finished

    def dataReceived(self, data):
        print data

    def connectionLost(self, reason):
        self.finished.callback(None)

def printResource(response):
    finished = Deferred()
    response.deliverBody(ResourcePrinter(finished))
    return finished

def printError(failure):
    print >>sys.stderr, failure

def stop(result):
    reactor.stop()

if len(sys.argv) != 2:
    print >>sys.stderr, "Usage: python agent_print_resource.py URL"
    exit(1)

agent = Agent(reactor)
d = agent.request('GET', sys.argv[1])
d.addCallbacks(printResource, printError)
d.addBoth(stop)

reactor.run()

The agent version requires a bit more work but is much more general-purpose. Let’s break down the steps involved:

Initialize an instance of twisted.web.client.Agent. Because the agent handles connection setup, it must be initialized with a reactor.
Make an HTTP request with the agent’s request method. It takes at minimum the HTTP method and URL. agent.request returns a Deferred that, on success, fires with a Response object encapsulating the response to the request.
Register a callback with the Deferred returned by agent.request to handle the Response body as it becomes available through response.deliverBody. Because the response is coming across the network in chunks, we need a Protocol that will process the data as it is received and notify us when the body has been completely delivered.
To accomplish this, we create a Protocol subclass called ResourcePrinter, much as we did when constructing basic TCP servers and clients in Chapter 2. The big difference is that we want to be able to continue processing the event outside of ResourcePrinter. That link to the outside world will be a Deferred that is passed to a ResourcePrinter instance on initialization and is fired when the connection has been terminated. That Deferred is created and returned by printResource so more callbacks can be registered for additional processing. As chunks of the response body arrive, the reactor invokes dataReceived, and we print the data to the screen. When the reactor invokes connectionLost, we trigger the Deferred.
Once the connection has been terminated, stop the reactor. To do this, we register a stop function with addBoth on the Deferred returned by agent.request. Because printResource returns the Deferred that connectionLost fires, the chain pauses until the body has been delivered, and only then does stop run. Recall that addBoth registers the same function with both the callback and errback chains, so the reactor will be stopped whether or not the download was successful.
Finally, run the reactor, which will kick off the HTTP request.
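
The hand-off between the body protocol and the rest of the program can be sketched without Twisted. The stand-in below is plain Python 3, not a real Protocol: BodyCollector and on_done are hypothetical names, and a plain callback takes the place of the Deferred. It buffers chunks instead of printing each one, which avoids the extra newline that print adds after every chunk.

```python
class BodyCollector(object):
    """Plain-Python stand-in for a body protocol (hypothetical, not a Twisted class)."""

    def __init__(self, on_done):
        self.on_done = on_done      # plays the role of the finished Deferred
        self.chunks = []

    def dataReceived(self, data):
        # Called once per chunk; chunk boundaries are arbitrary.
        self.chunks.append(data)

    def connectionLost(self, reason=None):
        # Called once when the body is complete; hand over the whole body.
        self.on_done(b"".join(self.chunks))


bodies = []
collector = BodyCollector(bodies.append)

# Simulate the calls the reactor would make as a response body arrives.
for chunk in (b"<html>", b"<body>hi</body>", b"</html>"):
    collector.dataReceived(chunk)
collector.connectionLost()

print(bodies[0])
```

In a real client the same two methods would live on a Protocol subclass passed to response.deliverBody, and connectionLost would fire the Deferred with the joined body.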

Running this example with python agent_print_resource.py http://www.google.com produces the same output as Example 5-1.

Retrieving Response Metadata

Agent supports all HTTP methods and arbitrary HTTP headers. Example 5-4 demonstrates this functionality with an HTTP HEAD request.

The Response object that the Deferred returned by agent.request fires with carries useful HTTP response metadata, including the HTTP status code, HTTP version, and headers. Example 5-4 also demonstrates extracting this information.

Example 5-4. print_metadata.py

import sys

from twisted.internet import reactor
from twisted.web.client import Agent
from twisted.web.http_headers import Headers

def printHeaders(response):
    print 'HTTP version:', response.version
    print 'Status code:', response.code
    print 'Status phrase:', response.phrase
    print 'Response headers:'
    for header, value in response.headers.getAllRawHeaders():
        print header, value

def printError(failure):
    print >>sys.stderr, failure

def stop(result):
    reactor.stop()

if len(sys.argv) != 2:
    print >>sys.stderr, "Usage: python print_metadata.py URL"
    exit(1)

agent = Agent(reactor)
headers = Headers({'User-Agent': ['Twisted WebBot'],
                   'Content-Type': ['text/x-greeting']})

d = agent.request('HEAD', sys.argv[1], headers=headers)
d.addCallbacks(printHeaders, printError)
d.addBoth(stop)

reactor.run()

Testing this script with a URL like:

python print_metadata.py http://www.google.com/

produces the following output:

HTTP version: ('HTTP', 1, 1)
Status code: 200
Status phrase: OK
Response headers:
X-Xss-Protection ['1; mode=block']
Set-Cookie ['PREF=ID=b1401ec53122a4e5:FF=0:TM=1340750440...
Expires ['-1']
Server ['gws']
Cache-Control ['private, max-age=0']
Date ['Tue, 26 Jun 2012 22:40:40 GMT']
P3p ['CP="This is not a P3P policy! See http://www.google.com/support/...
Content-Type ['text/html; charset=ISO-8859-1']
X-Frame-Options ['SAMEORIGIN']
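
As the output shows, getAllRawHeaders yields each header name paired with a list of values. A minimal sketch of pulling one detail out of such pairs follows. It uses Python 3 and the standard library only, the sample pairs are copied from the output above, and charset_of is a hypothetical helper name.

```python
from email.message import Message

# Sample (name, values) pairs shaped like the output of getAllRawHeaders.
raw_headers = [
    ("Server", ["gws"]),
    ("Content-Type", ["text/html; charset=ISO-8859-1"]),
]

def charset_of(pairs, default="utf-8"):
    """Return the charset named in Content-Type, or default if there is none."""
    for name, values in pairs:
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "content-type" and values:
            msg = Message()
            msg["Content-Type"] = values[-1]
            # get_content_charset lower-cases the charset it finds.
            return msg.get_content_charset(default)
    return default

print(charset_of(raw_headers))
```

A charset recovered this way could be used to decode a response body delivered through deliverBody.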

POSTing Data with Agent

To POST HTTP data with Agent, we need to construct a producer, providing the IBodyProducer interface, which will produce the POST data when the Agent needs it.

Tip

The producer/consumer design pattern facilitates streaming potentially large amounts of data in a way that is memory- and CPU-efficient even if processes are producing and consuming at different rates.

You can also read more about Twisted’s producer/consumer APIs.
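
The memory side of that pattern can be illustrated with plain Python 3: the sketch below reads a body in fixed-size chunks rather than all at once. iter_chunks and the chunk size are hypothetical, and the sketch leaves out the flow control that pauseProducing and resumeProducing provide.

```python
import io

def iter_chunks(stream, chunk_size=4):
    """Yield successive chunks so the whole body is never held in memory at once."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

source = io.BytesIO(b"Hello World")   # stands in for a large file or socket
received = []
for chunk in iter_chunks(source):
    # A real consumer would write each chunk to the transport here.
    received.append(chunk)

print(received)
```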

To provide the IBodyProducer interface, which a class declares with zope.interface.implements, the class must implement the following methods, as well as a length attribute giving the number of bytes the producer will eventually produce:

startProducing
stopProducing
pauseProducing
resumeProducing

For this example, we can construct a simple StringProducer that just writes out the POST data to the waiting consumer when startProducing is invoked. StringProducer is passed as the bodyProducer argument to agent.request.
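
One detail worth checking in such a producer is that length counts bytes, since it describes what goes on the wire. A small Python 3 sketch, assuming the POST data starts out as text and is sent as UTF-8 (both assumptions for illustration):

```python
text = u"caf\u00e9"              # four characters, one of them non-ASCII
body = text.encode("utf-8")      # the bytes a producer would write

# The character count and the byte count differ once non-ASCII text appears.
print(len(text), len(body))
```

Computing length from the encoded body, as len(body) does here, keeps it consistent with what startProducing writes.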

Example 5-5 shows a complete POSTing client. Beyond the StringProducer, the code is almost identical to the resource-requesting client in Example 5-3.

Example 5-5. post_data.py

import sys
from twisted.internet import reactor
from twisted.internet.defer import Deferred, succeed
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent
from twisted.web.iweb import IBodyProducer

from zope.interface import implements

class StringProducer(object):
    implements(IBodyProducer)

    def __init__(self, body):
        self.body = body
        self.length = len(body)

    def startProducing(self, consumer):
        consumer.write(self.body)
        return succeed(None)

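    # The whole body is written in startProducing, so there is
    # nothing to pause, resume, or stop.
    def pauseProducing(self):
        pass

    def resumeProducing(self):
        pass

    def stopProducing(self):
        pass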

class ResourcePrinter(Protocol):
    def __init__(self, finished):
        self.finished = finished

    def dataReceived(self, data):
        print data

    def connectionLost(self, reason):
        self.finished.callback(None)

def printResource(response):
    finished = Deferred()
    response.deliverBody(ResourcePrinter(finished))
    return finished

def printError(failure):
    print >>sys.stderr, failure

def stop(result):
    reactor.stop()

if len(sys.argv) != 3:
    print >>sys.stderr, "Usage: python post_data.py URL 'POST DATA'"
    exit(1)

agent = Agent(reactor)
body = StringProducer(sys.argv[2])
d = agent.request('POST', sys.argv[1], bodyProducer=body)
d.addCallbacks(printResource, printError)
d.addBoth(stop)

reactor.run()

To test this example, we need a URL that accepts POST requests. http://www.google.com is not such a URL, as it turns out. This:

python post_data.py http://www.google.com 'Hello World'

prints:

The request method POST is inappropriate for the URL /. That’s all we know.

This is an occasion where being able to spin up a basic web server easily for testing would be useful. Fortunately, we covered Twisted web servers in the previous chapter!

Example 5-6 is a simple web server that echoes the body of a POST, only reversed.

Example 5-6. test_server.py

from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site

class TestPage(Resource):
    isLeaf = True
    def render_POST(self, request):
        return request.content.read()[::-1]

resource = TestPage()
factory = Site(resource)
reactor.listenTCP(8000, factory)
reactor.run()

Running python test_server.py starts the web server listening on port 8000. With that server running, we can then test our client with:

$ python post_data.py http://127.0.0.1:8000 'Hello World'
dlroW olleH

More Practice and Next Steps

This chapter introduced Twisted HTTP clients. High-level helpers getPage and downloadPage make quick resource retrieval easy. The Agent is a flexible and comprehensive API for writing web clients.

The Twisted Web Client HOWTO discusses the Agent API in detail, including handling proxies and cookies.

The Twisted Web examples directory has a variety of HTTP client examples.