These days, of course, the file containing the data we want could be on a machine half a world away. Provided the file is accessible over the Internet, though, we can read it just as we do a local file. For example, the Hopedale data not only exists on our computers, but it’s also on a web page. At the time of writing, the URL for the file is https://robjhyndman.com/tsdldata/ecology1/hopedale.dat (you can look at it online!).
(Note that the examples in this section will work only if your computer is actually connected to the Internet.)
Module urllib.request contains a function called urlopen that opens a web page for reading. urlopen returns a file-like object that you can use much as if you were reading a local file.
There’s a hitch: because there are many kinds of files (images, music, videos, text, and more), the file-like object’s read and readline methods both return a type you haven’t yet encountered: bytes.
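To see what this looks like without a network connection, here is a minimal sketch that uses io.BytesIO as a stand-in for the file-like object urlopen returns. The stand-in and its two lines of sample data are assumptions made purely for illustration.

```python
import io

# io.BytesIO stands in for the object urlopen returns, so this runs offline.
# The contents are made-up sample data, not the Hopedale file.
fake_webpage = io.BytesIO(b'first line\nsecond line\n')

first_line = fake_webpage.readline()   # a bytes object, newline included
rest = fake_webpage.read()             # everything that remains, also bytes

print(type(first_line))   # <class 'bytes'>
print(first_line)         # b'first line\n'
```

Notice the b prefix when a bytes object is printed: it marks the value as raw bytes rather than text.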
When dealing with type bytes, such as the data returned by the read method of the object that urllib.request.urlopen gives us, we need to decode it. In order to decode it, we need to know how it was encoded.
Common encoding schemes are described in the online Python documentation here: http://docs.python.org/3/library/codecs.html#standard-encodings. One of the most common encodings is UTF-8, an encoding created to represent Unicode: https://docs.python.org/3/howto/unicode.html.
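The choice of encoding matters, as this small sketch suggests. The word used here is arbitrary sample text; the point is only that decoding with the encoding that was used to encode recovers the text, while a different encoding either garbles it or fails.

```python
# Encode a short piece of text so we have a bytes object to work with.
raw = 'café'.encode('utf-8')

text = raw.decode('utf-8')       # same encoding: the original text comes back
wrong = raw.decode('latin-1')    # different encoding: no error, but garbled text

print(text)    # café
print(wrong)   # cafÃ©

# Some encodings cannot represent these bytes at all and raise an error.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

The UTF-8 form of this word is one byte longer than the text, because é takes two bytes in UTF-8.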
The Hopedale data on the web is encoded using UTF-8. This program reads that web page and uses bytes method decode in order to turn each bytes object into a string:
    import urllib.request

    url = 'https://robjhyndman.com/tsdldata/ecology1/hopedale.dat'
    with urllib.request.urlopen(url) as webpage:
        for line in webpage:
            line = line.strip()            # remove surrounding whitespace (still bytes)
            line = line.decode('utf-8')    # convert the bytes to a str
            print(line)
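If the same read-and-decode steps are needed in more than one place, they could be wrapped in a function. The following is a minimal sketch: the name read_remote_lines and its encoding parameter are hypothetical choices for this example, not part of urllib.

```python
import urllib.request


def read_remote_lines(url, encoding='utf-8'):
    """Return the lines of the resource at url as a list of str,
    with surrounding whitespace removed from each line.

    This helper is illustrative only; its name and parameters are
    assumptions, not a standard library function.
    """
    lines = []
    with urllib.request.urlopen(url) as webpage:
        for line in webpage:
            # Each line arrives as bytes, so decode it before stripping.
            lines.append(line.decode(encoding).strip())
    return lines
```

Calling read_remote_lines with the Hopedale URL would return the lines as a list of strings rather than printing them. Like the other examples in this section, that call needs an Internet connection.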