This chapter covers the HTTP client side of Twisted Web, starting with quick web resource retrieval for one-off applications and ending with the Agent API for developing flexible web clients.
Twisted has several high-level convenience functions for quick one-off resource retrieval.
twisted.web.client.getPage asynchronously retrieves a resource at a given URL. It returns a Deferred, which fires its callback with the resource as a string. Example 5-1 demonstrates the use of getPage; it retrieves and prints the resource at the user-supplied URL.
    from twisted.internet import reactor
    from twisted.web.client import getPage
    import sys

    def printPage(result):
        print result

    def printError(failure):
        print >>sys.stderr, failure

    def stop(result):
        reactor.stop()

    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: python print_resource.py <URL>"
        exit(1)

    d = getPage(sys.argv[1])
    d.addCallbacks(printPage, printError)
    d.addBoth(stop)

    reactor.run()
We can test this script with:
python print_resource.py http://www.google.com
which will print the contents of Google’s home page to the screen.
An invalid URL will produce something like the following:
$ python print_resource.py http://notvalid.foo
[Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.DNSLookupError'>:
DNS lookup failed: address 'notvalid.foo' not found:
[Errno 8] nodename nor servname provided, or not known.
]
Despite its name, getPage can make any HTTP request type. To make an HTTP POST request with getPage, supply the method and postdata keyword arguments: for example, getPage(sys.argv[1], method='POST', postdata="My test data").
getPage also supports using cookies, following redirects, and changing the User-Agent for the request.
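The postdata value is sent as the request body, so form fields need to be encoded before they are handed over. A minimal sketch using only the standard library, assuming Python 3; encode_form and the field names are hypothetical, and the resulting bytes are what one might pass as postdata.

```python
from urllib.parse import urlencode


def encode_form(fields):
    """Encode a dict of form fields as a URL-encoded request body."""
    # urlencode escapes reserved characters and joins pairs with '&'.
    return urlencode(fields).encode("ascii")


body = encode_form({"greeting": "Hello World", "lang": "en"})
print(body)
```

A server that parses form data will typically also expect a Content-Type header of application/x-www-form-urlencoded alongside such a body.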
twisted.web.client.downloadPage asynchronously downloads a resource at a given URL to the specified file. Example 5-2 demonstrates the use of downloadPage.
    from twisted.internet import reactor
    from twisted.web.client import downloadPage
    import sys

    def printError(failure):
        print >>sys.stderr, failure

    def stop(result):
        reactor.stop()

    if len(sys.argv) != 3:
        print >>sys.stderr, "Usage: python download_resource.py <URL> <output file>"
        exit(1)

    d = downloadPage(sys.argv[1], sys.argv[2])
    d.addErrback(printError)
    d.addBoth(stop)

    reactor.run()
We can test this script with:
python download_resource.py http://www.google.com google.html
which will save the contents of Google’s home page to the file google.html.
getPage and downloadPage are useful for getting small jobs done, but the main Twisted HTTP client API, which supports a broad range of RFC-compliant behaviors in a flexible and extensible way, is the Agent.
Example 5-3 implements the same functionality as print_resource.py from Example 5-1 using the Agent API.
    import sys

    from twisted.internet import reactor
    from twisted.internet.defer import Deferred
    from twisted.internet.protocol import Protocol
    from twisted.web.client import Agent

    class ResourcePrinter(Protocol):
        def __init__(self, finished):
            self.finished = finished

        def dataReceived(self, data):
            print data

        def connectionLost(self, reason):
            self.finished.callback(None)

    def printResource(response):
        finished = Deferred()
        response.deliverBody(ResourcePrinter(finished))
        return finished

    def printError(failure):
        print >>sys.stderr, failure

    def stop(result):
        reactor.stop()

    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: python agent_print_resource.py URL"
        exit(1)

    agent = Agent(reactor)
    d = agent.request('GET', sys.argv[1])
    d.addCallbacks(printResource, printError)
    d.addBoth(stop)

    reactor.run()
The agent version requires a bit more work but is much more general-purpose. Let’s break down the steps involved:
1. Initialize an instance of twisted.web.client.Agent. Because the agent handles connection setup, it must be initialized with a reactor.
2. Make an HTTP request with the agent’s request method. It takes at minimum the HTTP method and URL. agent.request returns a Deferred that, on success, fires with a Response object encapsulating the response to the request.
3. Register a callback with the Deferred returned by agent.request to handle the Response body as it becomes available through response.deliverBody. Because the response is coming across the network in chunks, we need a Protocol that will process the data as it is received and notify us when the body has been completely delivered.
   To accomplish this, we create a Protocol subclass called ResourcePrinter, much as we did when constructing basic TCP servers and clients in Chapter 2. The big difference is that we want to be able to continue processing the event outside of ResourcePrinter. That link to the outside world is a Deferred that is passed to a ResourcePrinter instance on initialization and is fired when the connection has been terminated. That Deferred is created and returned by printResource so more callbacks can be registered for additional processing. As chunks of the response body arrive, the reactor invokes dataReceived, and we print the data to the screen. When the reactor invokes connectionLost, we trigger the Deferred.
4. Once the connection has been terminated, stop the reactor. To do this, we register callbacks to a stop function with the Deferred triggered by connectionLost and returned by printResource. Recall that addBoth registers the same function with both the callback and errback chains, so the reactor will be stopped whether or not the download was successful.
5. Finally, run the reactor, which will kick off the HTTP request.
Running this example with python agent_print_resource.py http://www.google.com produces the same output as Example 5-1.
Agent supports all HTTP methods and arbitrary HTTP headers. Example 5-4 demonstrates this functionality with an HTTP HEAD request.
The Response object delivered by the Deferred that agent.request returns contains lots of useful HTTP response metadata, including the HTTP status code, HTTP version, and headers. Example 5-4 also demonstrates extracting this information.
    import sys

    from twisted.internet import reactor
    from twisted.web.client import Agent
    from twisted.web.http_headers import Headers

    def printHeaders(response):
        print 'HTTP version:', response.version
        print 'Status code:', response.code
        print 'Status phrase:', response.phrase
        print 'Response headers:'
        for header, value in response.headers.getAllRawHeaders():
            print header, value

    def printError(failure):
        print >>sys.stderr, failure

    def stop(result):
        reactor.stop()

    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: python print_metadata.py URL"
        exit(1)

    agent = Agent(reactor)
    headers = Headers({'User-Agent': ['Twisted WebBot'],
                       'Content-Type': ['text/x-greeting']})

    d = agent.request('HEAD', sys.argv[1], headers=headers)
    d.addCallbacks(printHeaders, printError)
    d.addBoth(stop)

    reactor.run()
Testing this script with a URL like:
python print_metadata.py http://www.google.com/
produces the following output:
    HTTP version: ('HTTP', 1, 1)
    Status code: 200
    Status phrase: OK
    Response headers:
    X-Xss-Protection ['1; mode=block']
    Set-Cookie ['PREF=ID=b1401ec53122a4e5:FF=0:TM=1340750440...
    Expires ['-1']
    Server ['gws']
    Cache-Control ['private, max-age=0']
    Date ['Tue, 26 Jun 2012 22:40:40 GMT']
    P3p ['CP="This is not a P3P policy! See http://www.google.com/support/...
    Content-Type ['text/html; charset=ISO-8859-1']
    X-Frame-Options ['SAMEORIGIN']
To POST HTTP data with Agent, we need to construct a producer, providing the IBodyProducer interface, which will produce the POST data when the Agent needs it.
The producer/consumer design pattern facilitates streaming potentially large amounts of data in a way that is memory- and CPU-efficient even if processes are producing and consuming at different rates.
You can also read more about Twisted’s producer/consumer APIs.
To provide the IBodyProducer interface, which a class declares with zope.interface.implements, the class must implement the following methods, as well as a length attribute tracking the length of the data the producer will eventually produce:
startProducing
stopProducing
pauseProducing
resumeProducing
For this example, we can construct a simple StringProducer that just writes out the POST data to the waiting consumer when startProducing is invoked. A StringProducer instance is passed as the bodyProducer argument to agent.request.
Example 5-5 shows a complete POSTing client. Beyond the StringProducer, the code is almost identical to the resource-requesting client in Example 5-3.
    import sys

    from twisted.internet import reactor
    from twisted.internet.defer import Deferred, succeed
    from twisted.internet.protocol import Protocol
    from twisted.web.client import Agent
    from twisted.web.iweb import IBodyProducer

    from zope.interface import implements

    class StringProducer(object):
        implements(IBodyProducer)

        def __init__(self, body):
            self.body = body
            self.length = len(body)

        def startProducing(self, consumer):
            # The whole body is in memory, so write it out in one call.
            consumer.write(self.body)
            return succeed(None)

        def pauseProducing(self):
            pass

        def resumeProducing(self):
            pass

        def stopProducing(self):
            pass

    class ResourcePrinter(Protocol):
        def __init__(self, finished):
            self.finished = finished

        def dataReceived(self, data):
            print data

        def connectionLost(self, reason):
            self.finished.callback(None)

    def printResource(response):
        finished = Deferred()
        response.deliverBody(ResourcePrinter(finished))
        return finished

    def printError(failure):
        print >>sys.stderr, failure

    def stop(result):
        reactor.stop()

    if len(sys.argv) != 3:
        print >>sys.stderr, "Usage: python post_data.py URL 'POST DATA'"
        exit(1)

    agent = Agent(reactor)
    body = StringProducer(sys.argv[2])
    d = agent.request('POST', sys.argv[1], bodyProducer=body)
    d.addCallbacks(printResource, printError)
    d.addBoth(stop)

    reactor.run()
To test this example, we need a URL that accepts POST requests. http://www.google.com is not such a URL, as it turns out. This:
python post_data.py http://www.google.com 'Hello World'
prints:
The request method POST is inappropriate for the URL /. That’s all we know.
This is an occasion where being able to spin up a basic web server easily for testing would be useful. Fortunately, we covered Twisted web servers in the previous chapter!
Example 5-6 is a simple web server that echoes the body of a POST, only reversed.
    from twisted.internet import reactor
    from twisted.web.resource import Resource
    from twisted.web.server import Site

    class TestPage(Resource):
        isLeaf = True

        def render_POST(self, request):
            return request.content.read()[::-1]

    resource = TestPage()
    factory = Site(resource)
    reactor.listenTCP(8000, factory)
    reactor.run()
python test_server.py will start the web server listening on port 8000. With that server running, we can then test our client with:
$ python post_data.py http://127.0.0.1:8000 'Hello World'
dlroW olleH
This chapter introduced Twisted HTTP clients. High-level helpers getPage and downloadPage make quick resource retrieval easy. The Agent is a flexible and comprehensive API for writing web clients.
The Twisted Web Client HOWTO discusses the Agent API in detail, including handling proxies and cookies.
The Twisted Web examples directory has a variety of HTTP client examples.